
Good afternoon, and welcome to this introductory session on Amazon Redshift. My name is Pavan Pothukuchi; I lead the product team for Redshift, and I'm glad to have with me Nam from RetailMeNot, who is going to talk about their journey with Redshift. In today's session we're going to talk about what Redshift is and why you would want to consider using it, some of the key features of Redshift and their benefits, and how customers are using Amazon Redshift. Nam will talk about how Redshift fits into the analytics platform at RetailMeNot, and we'll have some time for questions.

This is the ever-changing snapshot of the AWS big data portfolio. You can bring large amounts of data into AWS using Direct Connect, Import/Export, and, as of this morning, Snowball. We have Kinesis, through which you can stream data from hundreds of thousands of data producers and process it across multiple services within AWS, and of course another service got added there today: Kinesis Firehose. On the storage side, you can store just about anything in S3 regardless of data format; if the data has relational semantics you can use Amazon Aurora, and for key-value data we have DynamoDB. For lifecycle policies that move data from Amazon S3 to lower-cost storage, you can use Glacier, and for text search and indexing, CloudSearch. Once you store all of this data, we have a set of services through which you can analyze it. Amazon EMR can be used to process unstructured data very quickly through Hadoop, and once you process that data you can bring it into Amazon Redshift for quick SQL access. We have Amazon Machine Learning, through which you can run predictive analytics on data sitting in Amazon Redshift, and a data orchestration layer called Data Pipeline that can help you move data across these various services and schedule jobs. Again, this picture got a little bit bigger as of this morning: on the analytics side we have QuickSight, and on the orchestration side we have the Database Migration Service, which can help you move data across database engines. One of the occupational hazards of working for AWS is that our slides tend to get stale very quickly.

So what is Redshift? In simple terms, Redshift is data warehousing made fast, simple, and very cost effective. Redshift is a SQL-based relational data warehouse. It's built on a massively parallel processing architecture and uses columnar storage, and it's an order of magnitude faster than a lot of comparable row stores out there. It's very easy to use: you can provision hundreds of gigabytes or hundreds of terabytes in minutes, and you can add or remove capacity through a few clicks on the management console. It's a managed service, in that we manage backups, make recovery very easy, handle patching, and take care of a bunch of other aspects around disaster recovery, making it very easy to set up. And it's very cost effective: like I described earlier, it's an order of magnitude cheaper than a lot of other data warehousing solutions available in the market.

The traditional view of data warehousing, and if you have been to Andy's keynote this reflects some of the points he made there, has largely centered around multi-year deployments, multi-year deals, multi-million-dollar deals, and multi-year commitments. All of this means that data warehousing has largely been in the realm of the top one to two thousand global enterprises, and it has largely been sold to central IT teams within those enterprises. We believe that this is a very narrow view.

As a lot of you in this audience can vouch, small companies have big data too, especially in the last five years or so with the proliferation of mobile, gaming, social, and IoT, and this is increasingly the case even within enterprises. Long deployment cycles, high operational complexity, and a lot of costs involved in provisioning, managing, and procuring warehouse infrastructure mean that departmental groups within enterprises are not able to move as fast as they want to on projects. This leads to what we believe is a lot of dark data; an analysis from one of the analyst firms shows more than an exabyte of dark data sitting in enterprise data stores.

Our view of data warehousing is more inclusive. We believe Redshift is appealing to enterprises because of the cost advantage. It's also appealing because it's extremely easy to provision, so central IT teams can stand up data warehouse infrastructure very quickly and scale it very easily; increasingly we are also seeing central IT set guidelines and let departmental groups build their own infrastructure as they see fit. Redshift also enables high DBA productivity, given the managed aspects of the service. We're seeing a lot of big data companies leveraging Redshift very efficiently as well, and the key selling point for them is that Redshift is an order of magnitude faster than SQL on Hadoop. Analysts within these companies don't need to learn scripting or programming; it's a standard SQL interface they can go and execute queries against. Redshift also ties pretty well into the other technologies that companies with big data use. It ties very well with Hadoop: you can run your ETL or MapReduce jobs on Hadoop and bring the data into Redshift through a COPY command available directly on the database, load data in parallel from services like EMR, and layer data science technologies like machine learning on top of Redshift. We are also increasingly seeing a lot of SaaS companies leveraging Redshift. The key selling point for them is that they want to provide more analytical horsepower to their customers and don't want to wait until all the data is processed; they want to surface analytical capabilities right within the process flows of their applications. They also like the fact that they can pay on demand and grow on demand easily with Redshift, and the managed nature of the service, with availability and DR handled for them, is also appealing.

This view is leading to pretty broad growth for Redshift across a variety of verticals and customer segments. We have companies like Nasdaq, a very security-focused, traditional enterprise, using Redshift to load about five to eight billion rows of data every trading day and analyze orders, quotes, and trades. You have Yelp using Redshift to analyze ad campaign performance as well as to understand how end users are using their mobile app features. You have companies like Desk.com, the SaaS company owned by salesforce.com, leveraging Redshift to provide more analytical capabilities to their end customers, and so on. For the last two years, as Andy pointed out in the keynote today, Redshift has been the fastest-growing AWS service, until Aurora took that title a little while ago, and with all your help I'm hoping we can grab it back by next year.

Taking a step back, I'm going to go over a little bit of detail around Redshift's architecture. We talked about Redshift being an MPP-based engine.

You have a leader node, through which you access the cluster, if you will. The leader node stores the metadata for all your database objects, generates an optimized query plan, and pushes code to the compute nodes so that queries can be executed in parallel. The compute nodes themselves store the data locally in columnar format and process a variety of operations in a distributed, parallelized manner for fast performance. You can start with Redshift at 25 cents an hour for a single 160-gigabyte node and scale up to 2 petabytes of compressed data, and we have customers running multiple petabytes on Redshift today.

Now, some of the key features and their benefits. Redshift is fast for a variety of reasons. It uses columnar storage, which means you perform dramatically less I/O than row stores, which fetch a lot of data and then discard a significant percentage of it that is not required. Redshift does automatic data compression, which again means faster I/O but also lower storage costs; we typically see customers getting an average of four times compression, so if you're bringing in a hundred-terabyte data warehouse, it could well be 25 to 30 terabytes in Redshift. We use direct-attached storage, and the block size Redshift uses is one megabyte, which is large compared to transactional systems and enables very fast scan rates: you can get up to four gigabytes per second per node of scan rate, and given that data warehousing is very read intensive, scan performance is pretty important. Redshift also uses zone maps for fast performance. The way zone maps work is that each column within a table is stored as a series of contiguous blocks, each block, as we discussed, being a megabyte in size, and we keep the min and max value of each block in memory, so that for a sorted data set we fetch the minimal number of blocks possible when you execute a query and give you answers very quickly.

Here is an example of how zone maps and sort keys play a significant role in Redshift; as you think about getting started, understanding sort keys and what they mean in your environment is very important. Take a table with unsorted data on a date field: each block has a min and max value, but there's a significant amount of overlap because the data is not sorted. If you run a query selecting count(*) from this table for a specific date, it will hit three of the four blocks, because June 9th falls in the range of the first, second, and fourth blocks, so we have to read three of the four blocks to process the query. If the data set is sorted, there's no overlap between the ranges in any of these blocks, and it will be just one block. This is a simplistic example, but the difference in execution time can be thought of in terms of complexity: the first data model has O(n) complexity and the second has O(1), so with a million blocks the second query returns results significantly faster than the first one. We also have an intermediate option, what we call interleaved sort keys, that gives you performance somewhere in between; there are some trade-offs with interleaved sort keys, and we have an advanced Redshift session tomorrow that I encourage you to attend if you want to learn more about it.
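To make the sort key discussion concrete, here is a minimal SQL sketch that is not from the talk; the table and column names are hypothetical, and the exact blocks Redshift skips depend on how the data actually lands on disk.

```sql
-- Hypothetical clickstream table sorted on the event date.
-- With a sort key on event_date, Redshift's zone maps (per-block
-- min/max values) let it skip every 1 MB block whose date range
-- cannot contain the filtered value.
CREATE TABLE page_views (
    event_date DATE NOT NULL,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
SORTKEY (event_date);

-- On sorted data this touches roughly one block per matching
-- range instead of scanning most of the table.
SELECT COUNT(*)
FROM page_views
WHERE event_date = '2013-06-09';
```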
Redshift is also fast because of the way the various operations needed for processing queries and maintaining the cluster are carried out: all of them happen in a parallel and distributed manner. Queries run in parallel and are distributed, and loads, unloads, backups, restores, and resizes are all parallel operations as well, which means the performance of your cluster across these operations grows linearly as you add more nodes.

Distribution keys are another concept you want to understand before getting started with Redshift. Here is a simple example with a table and a set of rows. Typically you want to load data into Redshift through S3; there are other mechanisms, you can load from DynamoDB and other services, but by far S3 tends to be the most common source for customers loading data into Redshift. When you load data through the COPY command, each core within a compute node picks up a file in parallel and starts loading it. If you have three compute nodes in this example, and let's say each compute node has 30 cores, you have 90 cores available for processing loads in parallel, and each core will go to S3 and pick up a file. So we encourage you to split your data into a multiple of 90 files in this particular case, or in general a multiple of the number of cores, so that the load happens very quickly. Once the data gets loaded, each row needs to find a home based on how you are distributing the data. One way to distribute the data is evenly, which is just a round-robin distribution. There are two other ways. You can have all of the data sit on every compute node instead of distributing it, so you have the exact same copy of the data on all the nodes; this is typically the option for smaller dimension tables against which you perform lookups, because for any join you want to aim for having the join executed locally so there's no chattiness across the nodes. The other way is to distribute through a key: if you have a large fact table and a large dimension table that you're joining, you want to distribute the data using that join key so that all of the joins between those tables get served within the context of each compute node and there's not much interaction between the nodes. There will be more discussion around this in the advanced session; my idea was just to give you a taste of the things you need to keep in mind before you get started with Redshift. Sort and distribution keys are very important, is the bottom line.
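As a rough illustration of the distribution styles and the parallel COPY load described above, here is a hedged SQL sketch; the table names, the S3 prefix, and the IAM role ARN are hypothetical placeholders, not anything shown in the session.

```sql
-- Large fact and dimension tables distributed on the join key,
-- so the join can be resolved locally on each compute node.
CREATE TABLE sales (
    customer_id BIGINT,
    amount      DECIMAL(12,2),
    sold_at     TIMESTAMP
)
DISTKEY (customer_id)
SORTKEY (sold_at);

CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(256)
)
DISTKEY (customer_id);

-- Small lookup table copied in full to every node (DISTSTYLE ALL).
CREATE TABLE regions (
    region_id SMALLINT,
    name      VARCHAR(64)
)
DISTSTYLE ALL;

-- Parallel load: each core picks up a file, so the prefix should
-- contain a multiple of the cluster's core count in roughly
-- equal-sized, compressed files.
COPY sales
FROM 's3://my-bucket/sales/part-'   -- hypothetical prefix
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```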
Redshift is also fast because we run on optimized hardware. The hardware is optimized for I/O-intensive workloads, which is where the 4 GB per second per node scan rate we talked about comes from, and we have enhanced networking that enables a million packets per second. You have the choice between SSD nodes and magnetic storage: SSD offers 10 to 15 times more performance than magnetic storage but is about 5 times more expensive, so depending on your cost and performance requirements you would pick one or the other. We also make it very easy for customers to migrate from one generation of hardware to another. One recent example is that we launched support for the DS2 instance type, a next-generation hard-disk-based platform that provides 2 times more memory, 2 times more compute capacity, and 1.5 times more bandwidth than the previous generation of the instance family at the exact same cost. As we get more efficiencies in the hardware we pass them along to customers, across all our services, as you may be familiar with from the AWS pricing philosophy.

That takes us to the cost aspect. Redshift, as we talked about, is very inexpensive compared to a lot of other options in the market: you can process data at less than a thousand dollars per terabyte per year, when most of the other options in the market cost anywhere between 10 and 25 times, and in some cases 50 times, more. We don't charge for the leader node, and pricing is pretty simple: there are two pricing types, on-demand and reserved instances, if you're familiar with how RIs work in the EC2 world.

We also talked a little bit about the managed nature of Redshift.

One of the things customers find very appealing is that Redshift does continuous and incremental backups. Whenever you write data into Redshift, copies of the data are propagated to other nodes synchronously for higher durability, and the data is also backed up into S3 continuously. Incremental is an important concept here: if you have a one-terabyte data warehouse that you've just backed up and you then load a hundred gigs of data, the next backup will only pick up the delta of one hundred gigs, not the entire 1.1 terabytes. The often understated feature here, I think, is streaming restore; this is something a lot of our customers say they're pleasantly surprised by after we talk them through it. Streaming restore lets you create a cluster from your backup no matter the size, whether it's 100 terabytes or a petabyte, and start querying the restored cluster in under three minutes. You can do writes on it, you can do reads on it; the data gets streamed from S3 behind the scenes, and the queries you run on the cluster determine the priority of the blocks being fetched. In most data warehousing cases, eighty to ninety percent of the data is not frequently accessed, so the hot portion of the data is loaded first and the rest later, which enables quicker time to query for restores.

Redshift is also fault tolerant. We monitor the health of each node within your clusters, detect failures, and take remedial actions. Redshift can tolerate multiple disk failures and multiple node failures; whenever a failure occurs we detect it and provision a replacement, and the replacement process works very similarly to streaming restore: the replacement node is provisioned with the bare minimum schema necessary to keep serving queries and then gets hydrated behind the scenes from the copies of the data on other nodes. It's also very easy to set up disaster recovery: with a few clicks on the management console you can say, here is my source region, move my backups to a destination region and keep them in sync, and restoration from the backups in the other region happens in the same streaming-restore manner, so it's very fast.

Security is an aspect that I'm sure is near and dear to most of you here, and from the get-go Redshift has been designed with enterprise-level security in mind. We have end-to-end encryption, starting with loading the data into S3 in an encrypted manner and loading it from S3 into Redshift encrypted in transit. You have a variety of encryption options available: you can use CloudHSM or an on-premises HSM to secure the data at rest and keep control over the keys. There are several levels of audit logging available, so that you can understand who is executing queries, what roles they have, at what times those queries were executed, who is logging in, and so on. We have also been through several compliance initiatives, notably the BAA, the healthcare-specific compliance requirement, and FedRAMP, the federal government standard, which is pretty stringent; these are enterprise-level compliance requirements that Redshift satisfies.
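As a small, hedged illustration of the audit angle mentioned above: besides the audit logs delivered to S3, recent query and login activity can also be inspected from SQL through Redshift's system tables. This is only a sketch assuming the standard STL system tables; the column choices are illustrative.

```sql
-- Who ran what in the last day, and when (STL_QUERY system table).
SELECT q.userid,
       u.usename,
       q.starttime,
       q.endtime,
       TRIM(q.querytxt) AS query_text
FROM stl_query q
JOIN pg_user u ON u.usesysid = q.userid
WHERE q.starttime > DATEADD(day, -1, GETDATE())
ORDER BY q.starttime DESC
LIMIT 50;

-- Recent connection attempts (STL_CONNECTION_LOG system table).
SELECT event, recordtime, username, remotehost
FROM stl_connection_log
ORDER BY recordtime DESC
LIMIT 50;
```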
One of the interesting aspects made possible in a cloud-based environment is the ability for us to do continuous deployments, which I'm sure a lot of you strive to achieve in your own services. On the product side we strive for continuous deployment as well, which means you have code coming in continuously: new features, reliability improvements, and security improvements all streaming in. We actually ship patches every two weeks for the most part, and we have released over 100 new features in the two-plus years that Redshift has been around, whereas with most on-premises data warehousing systems you have to wait six months or a year to get new features.

Redshift also provides a lot of analytical horsepower; it enables you to do a lot of interesting things around data science. We have approximate functions that let you do count-distinct style queries extremely quickly with some margin of error, typically in the range of one to two percent. We recently released user-defined functions, which make thousands of Python functions available to you: you can define your own functions, bring your own libraries, and call those functions from SQL for advanced analytics, data-related functions, and operations-research or optimization-related functions; Python makes all of these available, and with support for user-defined functions you can take advantage of them.
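Here is a hedged sketch of both features just mentioned: an approximate distinct count and a scalar Python UDF. The function, table, and column names are made up for illustration, and the UDF body is intentionally trivial.

```sql
-- HyperLogLog-based approximation: much faster than an exact
-- COUNT(DISTINCT ...) on large tables, with a small error margin.
SELECT APPROXIMATE COUNT(DISTINCT user_id) AS approx_users
FROM page_views;

-- Scalar user-defined function written in Python (plpythonu).
-- Hypothetical example: normalize a coupon code before matching.
CREATE OR REPLACE FUNCTION f_normalize_code(code VARCHAR(64))
RETURNS VARCHAR(64)
IMMUTABLE
AS $$
    if code is None:
        return None
    return code.strip().upper()
$$ LANGUAGE plpythonu;

SELECT f_normalize_code(' save20 ');  -- returns 'SAVE20'
```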

We talked a little bit about machine learning and the ability to define predictive analytics models, use Redshift as a data source, and run those models through the Amazon Machine Learning service. You can also use a variety of partner tools. We work with SAS pretty closely, so if you're using advanced statistical technologies like SAS you can take advantage of the processing power of Redshift and then feed the results into SAS for advanced analytics, and you can do the same with R as well. We have a fairly large ecosystem, and we understand that in the warehousing world a lot of you are probably migrating from other systems and have existing investments in BI tooling and ETL tooling that we respect, so we work with a lot of data integration partners and BI partners, and in addition there are a lot of system integrators we work with to help you migrate or get started with Redshift very quickly.

We follow a service-oriented architecture across AWS, so you have multiple services you can use as building blocks to put together any analytics solution you desire. Beyond machine learning, you can plug CloudSearch into Redshift for text-related search and other analytics. And we have Kinesis and Kinesis Firehose: with Firehose you can set Redshift as one of the destinations. All you have to do is define the throughput you require for your stream and specify a COPY command that Redshift executes on a periodic basis, typically every five minutes, so that you can do analytics in a near real-time manner.
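To make that Firehose-to-Redshift hookup a bit more concrete: Firehose stages incoming records in S3 and then issues a COPY into the target table on your behalf. The sketch below shows roughly the kind of COPY you would configure it to run; the table, bucket, and JSON mapping file are hypothetical, and the exact command Firehose generates may differ.

```sql
-- Roughly what the periodic load into the destination table looks
-- like: Firehose batches records into S3, then runs a COPY using
-- the copy options configured on the delivery stream.
COPY tweets (tweet_id, created_at, user_id, body)
FROM 's3://my-firehose-bucket/staging/'                 -- hypothetical staging prefix
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/FirehoseRedshiftRole'
JSON 's3://my-firehose-bucket/jsonpaths/tweets.json'    -- field-to-column mapping
GZIP
TIMEFORMAT 'auto';
```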
So, a few use cases and how customers are leveraging these features to do interesting things in their analytics environments. The first one is a somewhat hypothetical use case: we looked at how much it would cost to stream the entire Twitter firehose and analyze it through the various building blocks in our analytics portfolio. I think Twitter, as of a year or so ago, generated around 500 million tweets a day. Let's say you stream those through Kinesis and move them into Redshift or S3, depending on where you want to process them. Adding up the costs of each of these services, it comes to less than three dollars an hour to stream the entire Twitter firehose and perform analytics on it, less than a Starbucks coffee would cost you. The idea here is that you can perform really powerful analytics with Redshift, Kinesis, and the other services at a very cost-effective price point, you can scale these services as you need, and you have reserved instances through which you can get more discounts if you decide to commit for one or three years.

Another interesting example is Amazon.com, which as you know is one of the most highly trafficked websites in the world.

We get around 150 million visits on a monthly basis, and in an e-commerce use case you want to take that clickstream data and analyze purchase patterns and behaviors: who is purchasing which products, who is moving through the cart and not eventually purchasing, and then dig into the reasons why. This is a big data workload at petabyte scale: they're generating 2 terabytes of clickstream data a day, growing around 67 percent year over year, and the largest table is 400 terabytes, which is a significantly large footprint. With their legacy data warehouse, which was based on Oracle RAC, they could scan one week of data per hour, and that was not enough; they wanted to do more longitudinal analysis, and there were also space and storage constraints around the number of nodes you can have with RAC. So they moved to Redshift, and that let them go much faster. When they started looking at this project two years ago, around when Redshift was released, they could query one month of data in an hour with Redshift; today they can query 15 months of data in under 14 minutes. So they went from querying one week of data, to one month, to 15 months in under 15 minutes. They can load five billion rows in under ten minutes and perform really complex joins across huge data sets and tables; one query that used to take three days started taking two hours in Redshift, and their ETL load pipeline became significantly faster as well. It's a fairly large footprint, with lots of clusters and lots of nodes across many different groups within Amazon now using Redshift, and all of this they are able to do with two DBAs. I think that's a significant part of the equation, and it's not just the DBA aspect; there's also the data engineering aspect. Data engineers typically spend a ton of time figuring out the ETL process and recovering when jobs fail, which is another significant investment in time, and all of that was reduced considerably with Redshift.

Another example we touched on in the keynote is NTT DoCoMo. They are probably the largest telecom company in Japan and generate petabytes of data corresponding to their cell towers and mobile usage. They work on a hybrid model, where all of this data is generated in their on-premises environment and then moved to Redshift and AWS for analysis. This use case had very stringent security requirements for moving the data, and the components within Redshift that satisfy those requirements include things we already talked about: encryption at rest with the various levels of key management available, running Redshift in a VPC where you define your own subnets and networking topology to secure your data, and auditing all your SQL queries and user logins. All of this was done in a very fast time frame; their migration took, I think, a few weeks all the way through, and at this point they believe the cloud is at least as secure as, or more secure than, the environment they had been running in.

Another interesting use case, and we somehow tend to see these in Japan: there's a sushi chain with around 380 stores in Japan that uses its sushi plates as IoT devices. The plates are instrumented, they move across the store, people order and consume sushi, and the chain streams the data related to that food consumption through Kinesis.

That data then finds its way into Redshift within five minutes or so. This sort of architecture lets them combine their inventory information with consumption information so they can better understand the supply chain and consumer behavior, control costs, and bring in more efficiencies. One of the interesting things about big data is that it has traditionally had a more batch-oriented workflow, and immediate operational intelligence is something customers tell us is extremely hard to do. Services like Kinesis, and the ability to move that data very quickly into Redshift, enable near real-time analysis; it's not real time, but it's near real time. And you can mix and match environments: like NTT DoCoMo, you can keep running your own on-premises environments and move your analytics into the cloud to unleash value.

Overall, if there's one thought I want you to take from this session, it's that we continuously strive to have you spend less and less time on your databases and more time on your data. I think we have achieved that quite a bit over the last two years, and we have more ways to go. I'm happy you came here and listened, and I'm going to turn it over to Nam to talk about how Redshift fits into their analytics platform.

Thanks so much, Pavan. My name is Nam, and I'm an engineering manager at RetailMeNot. Lesser known: my sisters used to call me pepperoni nipples when I was growing up, and to be honest I feel like that nickname applies to a lot more people in this room than they realize. So who is RetailMeNot? RetailMeNot is a digital offers destination that connects consumers to savings. We have 50,000 stores, we typically have half a million coupons on at any time, and we connect every year with half a billion users.

This is our overall data flow. It starts at the top with visits: people come to our site and click around, which is the presentation platform's activity section, where we track everything from impressions to your favorites to your likes and dislikes. When you come to our site and find something you like, you'll typically go to a Kohl's or a Macy's, or sports stores in my case for triathlon equipment, and whenever you click through and make a purchase, that goes into financials, and all of it gets tracked. We also record a lot of third-party data: Google Webmaster Tools for SEO, ExactTarget for email tracking, plus a lot of demographics tracking as well as IP location tracking. The data processing box here is probably the majority of what we're going to be talking about; it produces a lot of interesting analytics for our content operations, marketing operations, and statistics that we then feed back into our presentation activities. Content operations are the ones in charge of making sure the coupons and the presentation of information on our website meet our customers' needs and give them the best experience. Marketing operations, of course, are in charge of making sure the money spent on marketing goes to the right channels, and all of that is based on information we provide: they decide how much money we spend on SEO marketing, email campaigns, TV ads, and so on, which in turn leads to more visits, completing the cycle of coupon activity at RetailMeNot.

Our data is typically in the hundreds of terabytes now, and I would say that's kind of the new norm. Back when I first started working with data, I had systems in the tens and hundreds of gigs and I thought that was big data, but these days if you're not in the terabytes you could probably just look under your couch cushion and find a terabyte of data somewhere. What's more interesting for us is our data growth. We started our data warehousing project back in 2012, and I think we had 500 gigs of data then; every year since, we've had growth of over a hundred percent, and that created a lot of problems for our legacy warehousing strategy at the time, which was basically taking the physical-hardware strategies you'd use with legacy data warehousing systems and just implementing them in the cloud as is. I think a lot of people in this room can hopefully relate to having different sources of data for warehousing.

We have a lot of operational data stores; these are our MySQL databases that house user information and coupon information, things that get presented on the site. We have a lot of log data, web logs and beacon event logs, which is probably the larger collection of data we have, and we also collect the third-party data I mentioned before. In the legacy world, we basically took this data and shoved it straight into our data warehouse, which was a single monolithic system and a single point of failure; for us this happened to be Vertica. I think the single monolithic data warehouse is falling by the wayside; a lot of people are realizing these systems are very, very hard to maintain, and Redshift has definitely given us a strategy to escape that paradigm. From the data warehouse we obviously produce a lot of deliverables: reports for various people, A/B testing for our websites, and statistics for content presentation, all out of our Vertica system.

This setup produced a lot of pain points for us. The first one is firefights. Could I get a raise of hands for anyone that's ever been on call for database systems? Yeah, firefights kind of suck, especially if you've ever dealt with a massively parallel clustered database system. Disk drives on any of the nodes can fail; if a node fails you get a call in the middle of the night, and it's very time consuming to fix, especially with these very particular database systems where recovery and redistributing data involve a lot of steps. That's always a pain point. The next problem we had with the central monolithic system was traffic jams. If you're in a big shared clustered environment, you may start a query and it just sits there waiting, nothing's going on, and it's not your fault: select top one, and it's just sitting around. In some cases there is an analyst you can blame, Joe with the thousand-line SQL statement and a whole bunch of case statements aggregating on the join of a case statement with another case statement on a join on another table, and it's just ridiculous. But a lot of times this happens simply because there are too many people on the system, and in the past we didn't really have a strategy for dealing with traffic jams other than killing people's queries or telling people they couldn't run them. That's obviously problematic and bad PR for your data team; you want to avoid that. Then there were processing windows. When we first started back in 2012, it was really easy to process 500 gigs of data in an hour; we'd start our processing around 2:00 a.m. and finish by 3:00. That turned into 5, turned into 7, turned into noon, and then someone said, hey, come on, you guys have got to do better than this, it's becoming a problem; our consumers need their data first thing in the morning and you have to get it done somehow. The other thing was scaling. In these systems, adding more nodes or replacing instance types is very manual; it'd be great if there were some kind of managed database service where you could just do a couple of button clicks and solve this. We typically had to do it through our stage, test, dev, and production environments, and it was a lot of extra work for not a lot of productivity gain.

So how did we solve this? I would classify it as adopting more cloud strategies for data warehousing, and overall we didn't have to change too much: we still have the same source DBs and the same log data, all of that stuff didn't change. One of the big changes was that we started loading everything into S3. If you're going to use Redshift, the fastest way to load data into it is from S3; we don't really have a lot of DynamoDB, which is another option, but S3 gives you a lot of flexibility, and I'm sure there are lots of talks about how awesome S3 is. They're true. Next is the Amazon Redshift instances, and notice there's not a single Redshift cluster here. Redshift by itself does solve a lot of the administration problems, but if we had used a single Redshift cluster I don't think it would have solved the concurrency and resource contention issues we had; I'll dive into the details of why we have so many clusters in a moment.

We still have the same set of deliverables; we essentially replaced the sources of data with data coming out of Redshift, still did reporting, still did A/B testing and content presentation, but after the migration we were able to start doing other types of things because we had some additional time on our hands.

So, the on-demand breakdown. If there's anything you should take out of this session, aside from the pepperoni nipples comment, it would be this slide. Starting from S3, the first class of clusters we have are on only as needed. We have a lot of analysts; they don't like to learn different languages, they're pretty set on their SQL, and coming from Vertica and SQL-99-compliant SQL they were able to use Redshift and spin up on-demand clusters for ad hoc analysis. These clusters typically live only for a day, then they shut down and you never need them again. This was a very flexible way to expose information to analysts for data they'd never seen before; they could run crazy queries that you typically don't want running on your production system, without any impact to anyone else in the organization. The next class of clusters are our ephemeral processing clusters. These are medium-sized clusters, generally on only one to three hours a day, kicking off early in the morning, 2:00 a.m.-ish. This was probably the biggest shift for us: we split out a lot of the systems and data processing from our central cluster, asked how many of them actually depend on each other, and for the ones that don't, scaled them out horizontally using these ephemeral processing clusters. One of the nice things about Redshift is that it maintains state when it shuts down, so every time you bring a cluster up it has all the data exactly as it was from the prior run of processing and you can just pick up where you left off; that was a huge benefit for us. The next class is business hours. Most of your analysts, I hope, do not work 24/7, and when we started implementing business-hours clusters we got some resistance from people saying, hey, I need access to my data on Saturday at 10:00 p.m. And we were like, well, we can give you a way to do it, but we'd like to shut it down over the weekend to save money, and you really don't have to work on Saturday at 10:00 p.m. most of the time. So they begrudgingly accepted, and they seem to be a lot happier on Mondays when they come in; at least that's been my interaction with some of these guys. The last class are the always-up clusters. These are typically what our Tableau instances connect to; external customers with reports to run hit these, and since we're also an international company, internal customers in the UK and France hit these always-up clusters too. The workflow here is that the ephemeral processing clusters turn on, produce aggregates, and spit them out to S3, and the always-up clusters then consume that data as it becomes available. They're usually only one-to-two-terabyte clusters because they don't need all of the raw data, and if you're churning through all that raw data for your reports, the reports are really slow; we like to have slightly faster reports.
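Here is a hedged sketch of that ephemeral-to-always-up handoff through S3, assuming hypothetical table names, bucket paths, and IAM role; the actual RetailMeNot jobs aren't shown in the talk.

```sql
-- On an ephemeral processing cluster: build a daily aggregate and
-- unload it to S3 as the handoff point.
CREATE TABLE daily_coupon_stats AS
SELECT store_id,
       TRUNC(clicked_at) AS click_date,
       COUNT(*)          AS clicks
FROM coupon_clicks
GROUP BY store_id, TRUNC(clicked_at);

UNLOAD ('SELECT * FROM daily_coupon_stats')
TO 's3://my-analytics-bucket/aggregates/daily_coupon_stats_'   -- hypothetical prefix
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftUnloadRole'
GZIP
ALLOWOVERWRITE;

-- On the always-up reporting cluster: pick the aggregates up once
-- they land, instead of holding all of the raw data.
COPY daily_coupon_stats
FROM 's3://my-analytics-bucket/aggregates/daily_coupon_stats_'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP;
```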

So what does this mean for the data team? Overall, shifting to Redshift, as Pavan mentioned, means moving to a managed database service, and our administration and firefights have become a lot easier. I typically look like this; notice that guy's not wearing any pants. When I get called in the middle of the night I'm also not wearing pants, but the time I spend is mostly just logging in and checking the status; I'm not actively doing migrations, bringing up instances, or adding storage to nodes, because all of that is handled by AWS. Most recently we had a cluster go down, probably last month, for half an hour in the middle of the night; we basically woke up the next day and asked, did you guys actually do anything? No? Well, okay, it's up. So it has been a lot easier for us after we switched over to Redshift. Scaling the number of clusters is a lot easier too; if you've ever worked with RDS, scaling the number of Redshift clusters is basically the same as scaling the number of RDS instances, and once again this gives us the ability to scale our data processing horizontally, greatly expanding the number of jobs we can run in a single night. Scaling the size of the clusters is just as easy as doing it with RDS as well: you select the instance type, select the number of nodes, and Redshift goes into a read-only state. For us it typically takes about five-ish hours for our 30-to-200-terabyte cluster sizes, and it's based more on the amount of disk space you use: if you're using a hundred percent of a cluster's disk it'll probably take a little longer, but when you're right around the sweet spot of 50 percent disk usage it takes about five hours. The cluster stays read-only while it resizes; once it's done it automatically switches to the new instance sizes and types, does all the rebalancing behind the scenes, and you really don't have to do much work or spend time babysitting it. As for processing windows, we now seem to have so much more time, and our analysts have actually started spinning up their own Redshift clusters to do their own processing. My favorite analyst basically giggled when I told him about this, and he's been a huge adopter; they each get their own little data ecosystem where they can process data completely untouched by other teams and systems, and then redistribute it throughout the organization once they unload it back to S3.

Lessons learned: these are the Homer Simpson moments we experienced during our migration; I wasn't allowed to put a Homer Simpson picture on here. First, automated cluster shutdown. A lot of these can be very, very large clusters; I think one of ours is around $350,000 a year, which for some of you may not be a lot of money and for some of you is. If you leave these clusters up, well, we had one cluster that was up for a month that nobody was really using; we just accidentally left it on. I would highly recommend having an automated shutdown process, like shutting clusters down every night, especially for data warehousing clusters when nobody's using them. Starting one up again is a lot easier than trying to get a rebate back from Amazon for a cluster that's been up an extra month; I don't think they offer that feature. Second, sort and distribution keys. Pavan mentioned this before, and it's hugely important to get the sort and distribution keys correct, especially for joins across multiple data sets; if you don't, you start to get chatter and cross-communication between all of the cluster nodes. Our worst-case queries took three to five hours, and after fixing the sort keys and things like that, they went down to 15 minutes, so query and performance optimization on these databases is very much worthwhile. Third, reserved instances. With RDS, EC2, and other instance families there are a lot of options, and sometimes, because of the sizes involved, the benefit of reserved instances isn't as pronounced. Redshift only has two (or four) instance types, so it's a lot easier to settle on the instance type you're going to be working with much earlier, and since you're probably dealing with multi-terabyte systems, the cost benefit is going to be much greater; for us, choosing reserved instances early on would have been well worth it. And the last one is automated versus manual backups. Redshift definitely has automated backups, similar to RDS, and you can pick the number of days for the retention period, but the tricky part is that when the cluster goes away, so do the automated backups. Automated backups are really good for forking your data, spinning off another cluster, or recovering to a prior state while your system is up, but if your system goes away and your automated backups disappear with it, you're out of luck. So definitely also take manual backups on a regular basis; otherwise you might be wishing you had someplace to restore from.

Benefits to the business: obviously the data team loves Redshift, but unless it benefits the business in some way, shape, or form, it's really just a data toy for us. On instance types alone, we had Vertica, our monolithic system, running on EC2 instances with multiple terabytes of EBS volumes attached to them.

Switching over to Redshift netted roughly a 50% cost reduction on EC2 instance types alone. The other benefit of Redshift is the licensing cost: with these legacy big data warehousing systems you're charged either by cores on your instance types or by storage, and with Redshift that's already baked into the cost, which is a very compelling reason to move over. We also saw about a 50% reduction in time spent on administration, and I'd say that's mostly database administration; you still have to do the query optimization, but database-as-a-service, I've loved it, and I don't know if I could ever go back to being a DBA. And then internal customers: I think the number one measure of success for a project in your company is how widely adopted it becomes. We typically had the BI systems and the finance teams as the primary customers of the data warehouse team at RetailMeNot; we've basically doubled that, and since I made these slides it's gone up even more. Once people see how easy it is to get access to information using this system, without impacting other teams or production systems, they jump on it right away. Information is quickly becoming the lifeblood of a lot of companies, and quick and easy access to your information is hugely beneficial for the organization. With that, I'd like to open it up to Q&A.
