– All right, so with that, I want to pass it to Simon and Denny. So take it away, both of you.
– Perfect. Simon, do you want to give it a start, or do you want me to give it a start? (laughing)
– By all means. I did it last time. You can start. Why not?
– Perfect. Okay, well, in that case, the first thing I’m going to do is plug the “Data Brew.” (laughing) In other words, we also talk about coffee eventually in this session. But the reason I’m talking about it is that it’s very akin to exactly what we’re talking about today, which is going from data warehousing to data lakes. In this case, it’s with Barry Devlin. He’s from the UK, from your side of the pond, Simon, and for those of you who may not know, I believe he’s described as the illegitimate grandfather of data warehousing, because he coined the term data warehousing even before Bill Inmon did. Okay, so for those of you who are familiar with Kimball or Inmon, there you go. So we had him, we had Susan O’Connell, and Donald Farmer. For those of you that are in the Microsoft SQL Server space, you’re pretty familiar with those names as well. So there you go. But the reason I start with that wasn’t just to chime in about the Data Brew vidcast series. It’s also to call out the fact that that’s where both Simon and I are coming from. We came from this context of SQL Server, relational database land, and from there we flowed through to SQL BI, big data, Spark, and now into the realm of Delta Lake, data lakes, lakehouses, right? And it’s been a fun, fun journey, to put it lightly. So, just in case you don’t know who I am, my name is Denny Lee. I’m a developer advocate. I’ve been working within this space for a long time. And yeah, we’re prepared for your questions, technical questions, okay?
So please chime in on any of the forums, whether it’s LinkedIn, Zoom, or YouTube. Simon, why don’t you go ahead and do a quick introduction for yourself, for those that may not know who you are?
– Sure, thank you. So, yeah, hello, I’m Simon. I’m from the UK, doing various things around data engineering and all that stuff. As Denny said, I came originally from heavily Microsoft BI land, and these days I spend most of my time building lakes and Spark and teaching people how to do framework-driven engineering. A lot at the moment, especially these past two weeks, I’ve been teaching SQL people how to work in a Databricks environment, which has been interesting: purely no Python, no Scala, no nothing, just SQL. And yeah, just trying to figure out the best patterns, the best ways of working and all that kind of stuff. So it’s interesting times. And I run a consultancy called Advancing Analytics. So obviously find us on YouTube and come and talk to us and stuff. There we go.
– Boom.
– [Simon] That’s enough self-promo.
– Oh, no, no, no, no, no. It’s perfectly fine for self-promo. We’re perfectly fine doing that. And by the way, just to provide some context, we will be talking a little bit about the Data + AI Summit, partially as a promotion for Data + AI Summit, but also because both Simon and I have some pretty cool sessions. So we’re probably just going to provide a little bit of quick context as a way to encourage you to attend our sessions, because guess what? We’ve got some cool geeky stuff. Nevertheless, I did notice on LinkedIn that some folks were saying there was a bunch of echoes. I’m sorry.
We’re sorry about that. We’re not entirely sure what’s going on on that front. It looks like this might be the restreaming service we’re using that’s echoing it out. At least from the other ends, in terms of Zoom and YouTube, we’re not getting those issues. So we’ll see what we can do to try to help with the restream. But in a worst-case scenario, if you can’t hear it properly, lower the volume perhaps, and then you can watch it on demand later on when we push it out, and there should not be any of those echo issues. So apologies in advance. Let’s dive in, because we do have some questions already. So Simon, did you want to tackle the first one? Or did you want me to tackle the first one?
– Oh, what should we go with as a first one? That’s the biggest question. Do we go straight into partitioning? (laughing)
– No, no, or do we go into ACID and concurrency algorithms? You know, I’m cool. Whatever we think is the right approach just to whet folks’ appetite. (laughing)
– To wet the whistle.
– Exactly, wet the whistle, there we go.
– I do have to apologize. There might be some explosions coming in the background, because today is November the fifth, which is Fireworks Day in England, when we celebrate blowing up someone who tried to blow up Parliament, because that’s what we do. So, you know, you might hear some things. Just ignore it. It’s fine. No one’s blowing up. It’s okay. Yeah. Okay. I mean, so the first one, so did we do that?
So we’ve got: how do you partition a Delta table that’s high volume and, in each delta window, has a high variety of data? I don’t think we covered that too much in previous ones. We can dig into that a little.
– Sure. Let’s do that.
– I mean, it’s one of the biggest things that we get. You’re building a lake, you’ve got your various structures, you’ve got your bronze, silver, gold, or raw, base, whatever you want to call it, your three tiers of “I’m building a lake and I’m segmenting data.” And almost every time I speak to people, they try and go, “I need to pick a partition, and I partition the same way all the way through.” And you absolutely don’t have to, right? For me, there’s source-based ETL partitioning: how are you trying to partition for speed to get data in, so you can get through your whole processing window fast? And in that case it tends to be purely system load date, right? What data have I changed today? What data am I loading today? Or what transaction date does this data pertain to? So you can do things like replace that one partition, or you can do merges and add the partition constraints, so you’re merging just into that window. And it’s all about how do I optimize that one write transaction? How do I get data into the table in the fastest way possible? But then once you’ve landed the table in some kind of customer-facing, analytics-style thing, your gold layer, your curated layer, whatever you want to call it, then it’s all about who’s going to query that. How do you make the query runtime faster? So it’s always partitioning for early-stage speed (how do I just slot in the latest data and then leave, and I’m done?) versus how do I make 80% of the queries faster? And that’s so much harder, right?
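As a rough illustration of the load-side partitioning Simon describes (replace one partition, or merge into just that window), here is a minimal PySpark-style sketch. The function names, the `load_date` column, and the aliases `t` (target) and `s` (source) are assumptions made for this example; `replaceWhere` is a Delta Lake writer option, and the merge condition string would be passed to a Delta `MERGE`.

```python
def partition_predicate(column, value):
    """Build a predicate string that pins an operation to a single partition."""
    return f"{column} = '{value}'"


def overwrite_one_partition(df, path, column, value):
    """Overwrite only the rows of one partition; the rest of the table is untouched.

    `df` is expected to be a Spark DataFrame that contains only rows
    belonging to the partition being replaced.
    """
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("replaceWhere", partition_predicate(column, value))
       .save(path))


def merge_condition(column, value, key):
    """Join condition for a MERGE that only touches one partition of the target.

    Assumes the target table is aliased as `t` and the incoming data as `s`.
    """
    return f"t.{partition_predicate(column, value)} AND t.{key} = s.{key}"
```

Pinning the partition in the condition is what lets the engine skip every other partition during the write, which is the point of partitioning for load speed rather than for query speed.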
Because if I’m doing ETL, I know roughly the chunks. I know I’m reloading in an hour batch or a day batch or a week or whatever it happens to be. Whereas if I’m serving something to the users, I’m having to guess and say, “What kind of search predicates are people going to put in the WHERE statement? How are people going to open Power BI or Tableau and filter and slice that data? What kind of queries are people going to write?” And that’s really hard to know upfront. So the art is in getting it down, getting it out there, trying to look at how people work with it, and then going, you know what, I might have to just lift that whole table up, repartition it in a different way, and put it down again, just because people aren’t hitting those partition keys. And it’s tricky. But it’s all about how you’re loading data, and then at some point you switch over to how people are going to query it.
– Awesome stuff, dude. Okay. So then I’ll take the next question. It looks like there was some context concerning Delta Lake and how we have time travel and history. The question was: given history and time travel, do you still recommend implementing a slowly changing dimension type 2 design? So just to provide a little quick context, and Simon, definitely chime in if you feel that I’m missing anything. We did this commonly in OLAP design, not necessarily just Analysis Services cubes, I mean OLAP queries in general, right? The context was that you had a dimension table and you had your fact table. Don’t worry, I’m not going to go into factless fact tables or anything like that. Just pure dimension and pure fact tables, and things change over time. So for the sake of argument, take demographic information, like your address. I used to live in St. John’s, Newfoundland, and then over time I moved to Montreal, and then over time I moved to Seattle.
Right? So the context is that that’s a slowly changing dimension, and we typically assign a surrogate key to handle it, because even if the primary key is me, Denny Lee, or whatever ID or name is associated with it, different facts at different time ranges would be associated with different dimension values based on the address that I was at. So in other words, where I was 25 years ago, any facts associated with that, there you go, they’re associated with St. John’s. Anything that was done 20 years ago, it was Montreal. And then the rest is in the Seattle area. The reason I bring that up is that people are saying, well, I don’t need to do that now with data lakes anymore, do I? And in fact, it’s definitely one of those: yes, you definitely can do it, and yes, you should do it. But it’s also an “it depends” question too. In other words, it’s not like you can just go ahead and put everything in the facts. You absolutely can, by the way. So for the sake of argument, I put all the demographic information into the facts and it’s all there, and that’s fine, except it’s exactly the same problem as we had with data warehousing traditionally or OLAP cubes traditionally: your dimensions don’t change that often, right? At least I certainly hope not. I hope you’re not moving that often. That would result in massively large dimension tables. There are definitely those situations, by the way, and that’s where the debate will come in.
But for slowly changing, not fast-changing, slowly changing dimensions, that information changes so slowly that you’re going to use a lot less space by simply placing it into a dimension table and having that surrogate key, as opposed to placing everything into facts. So the short statement is that, yeah, that’s even true for data lakes. And especially if you have a really wide demographics table, like if you include all the facets about a particular person as part of your machine learning algorithms, because you want to understand the various segmentations or demographics that they belong to. It’s still slow. It’s still not fast, right? It’s still slowly changing, even if you include all the other facets. Like, oh, you have kids. You’re not having kids fast enough to warrant that as a fast-changing dimension, right? So even with all that stuff included, it’s still a slowly changing dimension. That dimension table is still going to grow significantly slower than your fact table. And if you were to put all that information into the fact table, quote-unquote, it would just massively grow for no good reason at all.
Right? So yes, you absolutely would want to go ahead and do a slowly changing dimension design. And then there’s the technique. I’m not going to get too deep into that, and the reason is that I’m simply going to send you all a link. I’m going to paste it into LinkedIn in a second, and also into Zoom and YouTube. The reason why is that Douglas Moore and I (Douglas Moore also comes from a very relational database background) covered exactly that topic. We spent an hour just on SCD type 2. By the way, there’s also a session just on surrogate key generation as well, because, exactly to that point, those techniques are just as valid today as they were in the past. So like I said, if you were to ask me that question about fast-changing dimensions, Simon and I would go on for about five hours to explain where we would deviate and why. The reality is that when it comes to fast-changing dimensions, the idea of putting them into facts does work. But then that goes into factless fact tables and fun stuff like that, right? But when it comes to slowly changing dimensions, at least there, you absolutely are warranted in doing such a thing. So hopefully that answers that question. I’m going to paste the link now, just because I happen to have it handy. It’s on the same tech talk series that we’re doing right now, so it’s really handy. But yeah, hopefully that answers the question: should I do SCD type 2 in data lakes? And the answer unequivocally is yes, you should.
– Type 2, or 4, your choice.
– True, point taken. I just didn’t really want to go the type 4 route. I was trying to avoid talking about that, okay, if you don’t mind. All right. That’s a fair statement. Yes.
– That is fair. (laughing)
– Cool. Why don’t you tackle the next question while I go ahead and send over the link?
– Okay.
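To make the surrogate key idea from the SCD type 2 discussion concrete, here is a small plain-Python sketch (no Spark) that uses Denny’s address example. The row layout, the function names, the running-counter surrogate key, and the dates are all assumptions made for illustration; it is not the technique from the linked tech talk.

```python
from datetime import date


def apply_scd2(dim_rows, person_id, city, effective):
    """Close the current row for person_id if the city changed, then append a new version."""
    current = next(
        (r for r in dim_rows if r["person_id"] == person_id and r["is_current"]),
        None,
    )
    if current is not None and current["city"] == city:
        return dim_rows  # nothing changed, history stays as it is
    if current is not None:
        current["valid_to"] = effective
        current["is_current"] = False
    dim_rows.append({
        "sk": len(dim_rows) + 1,  # surrogate key: a simple running counter (assumption)
        "person_id": person_id,
        "city": city,
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })
    return dim_rows


def lookup_sk(dim_rows, person_id, fact_date):
    """Return the surrogate key of the dimension version valid on fact_date."""
    for r in dim_rows:
        if (r["person_id"] == person_id
                and r["valid_from"] <= fact_date
                and (r["valid_to"] is None or fact_date < r["valid_to"])):
            return r["sk"]
    return None


# Hypothetical dates; only the order of the moves comes from the talk.
dim = []
apply_scd2(dim, "denny", "St. John's", date(1990, 1, 1))
apply_scd2(dim, "denny", "Montreal", date(1998, 1, 1))
apply_scd2(dim, "denny", "Seattle", date(2003, 1, 1))
```

Each fact row then stores only the surrogate key returned by `lookup_sk` for its date, so the wide demographic attributes live once per version in the dimension instead of being repeated on every fact.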
I’m going to have a minor little rant. There’s one of the questions that’s very specific. It’s like a super niche question.
– Okay, do it.
– The saveAsTable function: does that use a separate catalog to the Hive catalog? So that’s really drilling down into a thing. Basically, if you’ve got a DataFrame, you’re writing something in Spark and you go, I’ve kind of got what I need, .saveAsTable, and I want that to register in Hive and then just select from the table at a later date. Right? Super, super useful. That, for me, is the most misleading, evil function known to man. Because yeah, it’s great. No, it doesn’t use a separate Hive catalog. It does register it with your Hive catalog, so it’s fine for registering. But what it does is make a separate copy of your data in DBFS. So you’re saying, I want to take something from my lake, my nice, curated, secure, managed lake, and I want to make a copy of it in the local Databricks storage, which is then not accessible to anything outside the Databricks ecosystem. For me, that’s generally a bad thing. Whenever we’re doing things, we nearly always use external references in Hive. So what you’re doing is: get your DataFrame, write it down to the lake in Delta format, with partitioning, with whatever structures you actually want, to keep it as a managed, curated, good data set. And then as a separate thing afterwards, go CREATE OR REPLACE TABLE ... USING DELTA LOCATION, and you point it to the lake. That means it’s an unmanaged Spark table. So basically there’s just a little bit of metadata in Hive that points to where that data lives in the lake. And then if you do a DROP TABLE and you delete that Hive reference, your data’s fine, your data is safe, your data is still in the lake. Whereas if you use saveAsTable, it
kind of intrinsically ties together your data, making another copy of your data, with that Hive reference. And if you do a DROP TABLE, it drops your data. And if you wanted to use, especially in Azure, Data Factory or Power BI or various other things to get hold of that data, you can’t see it from outside of the Databricks ecosystem. So your cluster has to be spun up to be able to query it, and all sorts of madness. So saveAsTable, when you first see it, it’s like, ah, that is super useful, yeah, I just want to save it to a table. But we nearly always recommend not using it, just because you’re intrinsically, tightly coupling the Hive registration with your actual data, and it nearly always leads to problems down the line. So we never use it. I don’t know if you agree with that, Denny. (laughing)
– No, actually, I completely understand the context, because in all seriousness, this is a much more complicated answer than most people realize, exactly to your point. Because when it comes to using your catalogs, you actually need to understand your overall ecosystem. It’s not just about the Delta Lake. Don’t get me wrong, Delta Lake is super important, obviously. It is, at the risk of sounding like Teradata, which is horrible (yes, I said that), your central source of truth. Right? But the thing is that even having a central source of truth, the value is what you get out of it. And so some folks are going to be able to have a more homogeneous design. I’m trying to answer in a very vendor-agnostic way, so I’m just going to say homogeneous versus heterogeneous. Okay? That’s all I’m going to say. Because if you’ve got a homogeneous system, you could theoretically get away with simply saying, okay, I’m going to use one metastore that’s tied super closely to my system, and be done for the day. If you’re in a heterogeneous system, that is almost never going to happen.
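Simon’s alternative to saveAsTable (write the data to a lake path in Delta format, then register an external table that only points at that path) could be sketched roughly as follows. The function names, table name, and path are hypothetical, and the sketch uses `CREATE TABLE IF NOT EXISTS` rather than the `CREATE OR REPLACE TABLE` form mentioned in the talk, purely as a conservative assumption about which Spark version is in use.

```python
def write_delta_to_lake(df, lake_path, partition_col):
    """Write the DataFrame to the lake in Delta format, partitioned as required."""
    (df.write
       .format("delta")
       .mode("overwrite")
       .partitionBy(partition_col)
       .save(lake_path))


def external_table_ddl(table_name, lake_path):
    """DDL that registers metadata only; the data files stay where they are in the lake."""
    return f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{lake_path}'"


def register_external_table(spark, table_name, lake_path):
    """Register the lake path as an unmanaged (external) table in the metastore."""
    spark.sql(external_table_ddl(table_name, lake_path))
```

Because the table is unmanaged, dropping it removes only the metastore entry; the Delta files at the lake path are left in place, which is the decoupling Simon is arguing for.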
Okay? And then, just to add a new wrinkle to it, if you’re multi-cloud or multi-environment, in other words on-prem and in-cloud and multi-cloud, which a lot of enterprises are either doing or starting to do right now, that in itself has its own set of problems. And so, exactly to Simon’s point, it comes down to how you decide what you do with your catalog and how you connect to it. Even though it just starts with a saveAsTable statement, that’s the implication. The implication is how you’re going to interact with your catalog. And that’s actually why there are vendors that build their own catalog services. Some of them are tightly integrated with their environment. Some of them are completely meant to be vendor agnostic. Some of them you have to maintain yourself, which blows, but it’s actually important in some cases. So that’s why, even though it’s a small little statement, saveAsTable, it implies so many different things that you and I could probably go on for hours just on this statement too. So to try to cut it short, so we can actually answer other questions: the reality is, exactly to Simon’s point, it really depends on your environment before you decide something as simple as saveAsTable or not. Cool. Do you want to dive into the next question, or do you want me to dive into it?
– You can pick a question. Why not?
– Perfect. I’m going to pick one from LinkedIn Live. There’s a question here from Thomas. Hopefully I said your name correctly. Can you use Delta Lake open source in AWS without the Databricks runtime? So that’s an easy answer. That’s why I decided to pick that one. Absolutely.
Go to delta.io, go to github.com/delta-io. You can absolutely download the code base yourself. You can run it on AWS without any problem. There are many in the community, I should put it that way, many enterprises that actually do that. There are some issues when it comes to exactly which version of Spark you’re using, so that’s the key thing. Whether you’re doing it yourself or through EMR or whatever flavor you’re using, you have to make sure you’re using, for example, Spark 2.4.1 at least as a minimum level. And especially if you’re using Delta Lake 0.7.0 or onwards, then in fact you need to be using Spark 3.0, because Delta Lake 0.7.0 is tied specifically to Spark 3.0. So you have to make sure you get the version numbers right. But outside of that, you absolutely, 100% can go ahead and use it in open source. And it is an open source project. You can go to GitHub, take a look at the code base, give us PRs, give us issues, because that’s actually what we want to have. So yeah, absolutely. Cool. All right, next question. Simon, you want to tackle this next one?
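As a sketch of what running open source Delta Lake on plain Apache Spark can look like, the settings below follow the Delta Lake 0.7.0 on Spark 3.0 pairing Denny mentions. The helper names and the app name are assumptions for this example, and nothing AWS-specific is shown.

```python
def delta_oss_conf(delta_version="0.7.0", scala_version="2.12"):
    """Spark settings that enable open source Delta Lake on a Spark 3.0 session."""
    return {
        # Pulls the Delta Lake jar that matches the Spark/Scala build.
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
        # Enables Delta's SQL commands.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Routes the session catalog through Delta's catalog implementation.
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name="delta-oss-demo"):
    """Create a SparkSession with the Delta settings applied."""
    # Imported lazily so the config helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_oss_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

S3 access itself (credentials, the Hadoop S3 connector, and Delta’s storage configuration for S3) depends on the environment and is deliberately left out of this sketch.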
That’s the one coming up the wire.
– Yeah. What are we doing? Well, is that a LinkedIn one, or is it, we’ve got one on Zoom actually.
– Yeah, sorry. I meant to say the one on Zoom. Sorry. (laughing) You want to start with that one, or I can.
– So many different places.
– Yeah, sorry for that.
– Okay. So: if I use Delta Lake or Spark with Tableau, do I need a caching layer for interactive analysis to get enough performance? You know what I’m going to say? What’s enough? (laughing) How much is enough performance? It is a delicate balance. It’s the consultant’s answer of “it depends.” Generally, you can do a lot. You can get a hell of a lot of performance over large data sets querying the Spark engine directly. Can you do millisecond-style, I want to click on things and have things immediately react? No. It’s a parallel system. There’s always an overhead of parallelism. So it’s always about what is enough. If people are aware that, actually, I’m querying a few billion rows here, and if I put a slicer in and it takes three, four, five seconds to come back, you know what, that’s cool, because I’ve never been able to get at this data before, but now I can, and that’s amazing. Great. If people are looking at it and going, what’s it doing? What?
No, that’s too slow, then yeah, you need a cache. So it depends on the style of data, the style of interactivity, and the expectations of the users in terms of what that turnaround speed needs to be. In terms of what I would recommend for an intermediate caching layer, it’s ecosystem dependent. I tend to be in Azure, and people tend to use Power BI for that, and you’ve got all sorts of different modeling techniques. Good news: most of the caching layers tend to have something these days. There’s a new, modern approach, which is actually exactly the same as the old OLAP approach, which is essentially saying, I can keep some stuff in memory and some stuff in Spark down here, and you can get that balance. That’s what I look for in a caching tool, right? So if someone tries to run a query on this grid, and they try to summarize things by product, that’s cool, that’ll get it from the cache. And if they, in that same dashboard, drill down and say, actually, give me the transaction level, then it is intelligent enough to go, that one I’m going to throw to Spark, because that’s where the big data engine is. That’s it. As long as the caching engine can do that. So all of your Tableaus, your [inaudible], Power BI, your various different analytics semantic layers, they all have things like that baked in. So as long as that makes sense, and you’re building models in a smart way. Essentially, cached data is expensive. To put something in an in-memory cache tends to cost money. You’re paying a premium for that stuff that has that interactive, fast reaction. So if you can split it out and say, 80% of our queries are at this level, we put that in cache, and then everything else, the occasional user who asks the whole question and wants the million-row answer, then they have to wait a few seconds until it goes back to the Spark engine, then that’s the kind of design
you’re looking for. So whatever tool can do that in your ecosystem makes sense.
– Perfect. I definitely would love to answer it as a two-parter, by the way. The first part is a small plug; you know, I have to plug it a little bit. There’s the Photon engine from Databricks, which we announced during the Microsoft Ignite conference, and it significantly improves query performance on your data lakes. Okay? So it’s designed specifically for that. It doesn’t preclude what Simon is saying, though. I want to be very clear about that. It basically covers the part Simon was describing: I want to be able to look at my billions of rows, and I want that to come back, not in five hours. Okay. Yes. That’s what the Photon engine is designed extremely well for. Versus, there are absolutely going to be some scenarios, just like you said, the 80% of your queries, where frankly you don’t need all the data, and you need to make the queries more efficient for that purpose: for canned reporting, for line-of-business reporting, things of that nature. Absolutely, then it makes a ton of sense for you to build whatever caching layer. And again, that’s very ecosystem dependent. The one thing I would like to call out is that it’s very reminiscent (and sorry, Simon, for me pulling up this old one again) of the Yahoo 24-terabyte cube, right? Basically we built this cube where we could get the queries down to six seconds. This is about 10 years ago now. The source of the data was multiple petabytes in a Hadoop cluster with thousands of nodes. And then we ended up processing the largest cube on the planet, 24 terabytes. Why did we do that? It’s for precisely the same reasons that we’re talking about right now. There was this idea that for a segment of the queries,
yeah, we were just going to use the Analysis Services cube to deliver that result, because it could deliver it in seconds as opposed to hours, or never delivering the answer at all. Versus, now I really need to go to the details. Yeah, this’ll take hours, but at least you can get it done as opposed to never getting it done. That was using Hadoop, and that’s why we would do that. Now obviously our technology has changed over time, and the reason I’m bringing up the scenario is not to recommend using Analysis Services and Hadoop. No, that’s definitely not the reason. The reason I’m saying this is that even though the technology has changed, our reasoning hasn’t. There are scenarios in which you’re going to need super fast, and it’s a subset of the data, and it makes sense for it to be a subset. You don’t want it to be all of the data. And there are scenarios where you absolutely need to look at all of the data. Those two rarely intersect, because it rarely makes sense to say, give me, in less than a second, all of the data. Okay, this is great, you got an answer, except it’s not very useful to you. So that’s more or less the point. Regardless of how the technologies have changed and how we’ve sped things up and all that stuff, there are definitely going to be times where a caching layer is appropriate. And there are definitely times where the caching layer, especially if it’s fast-changing data, is not going to be appropriate, and that’s okay. So just understand how to categorize which data is which, so that you can organize it accordingly.
– And a lot of the time it comes back to what I was saying last time, the whole real-time thing. Do you need to be real-time? It’s the same kind of question, because you’ve got that cost on the cache, and as you’re saying, sometimes the cache isn’t that useful. So it’s like, do you need the cache, or can you live without it? Do you need it that much?
And yeah, it’s all about what I end up spending on building the thing and engineering it, versus the cost of actually running it.
– Yeah, exactly to your point. Often when people think about that, it’s like, well, do you have unlimited money and unlimited resources? Sure, maybe it’s possible then. But the last time I checked, you don’t have unlimited money and unlimited resources, number one. And number two, not thinking about it (I’m being a little harsh here) is almost like being lazy, right? You do need to think about what subset of data is meant for that super-fast query, because it’s directly line of business, versus what set of queries is designed specifically for developing your new lines of business. And it’s not like one’s more important than the other. They’re both equally important, but they’re designed to do two very different things, and there has to be that separation when you do that. Cool. I think we probably nailed this one, so I think we’re good. (laughing) Okay. I know you’re an Azure guy more than an AWS guy, so I could possibly go with this question that I just saw on YouTube Live: what are the top five biggest differentiators between Azure Databricks and Databricks on AWS? I can definitely start if you’d like, and then you can do the color commentary. Sure. Okay. So honestly, there isn’t much difference between the two. It’s actually very ecosystem driven. What’s a big difference is how you set up your VPCs and VNets, security, IAM roles versus service principals. The Databricks ecosystem, whether it’s on AWS or on Azure, is pretty much the same thing for all intents and purposes. There isn’t really that much of a difference here. So what it really is about is more or less which cloud environment, which cloud ecosystem, you really want to use. And for those environments.
Yeah, there are going to be things like, okay, well, now I have to care about IAM security roles. Well, that’s obviously AWS specific. Or AAD security principals, or how you set up your VNets or ExpressRoute or whatever else. Right? But you’ll notice that my answers are very infrastructure related, not environment related. You can use, for the sake of argument, SageMaker on AWS, or you can use Azure ML on Azure. Okay, well, those are your differences, right? The differences are more about which cloud environment you’re more comfortable with. And in reality, it’s getting closer and closer to where it’s not just which one you’re in or which one you’re comfortable with. You’re going to have to be comfortable with more than one.
So it’s more about, okay, well then, how do you simplify using the various components on those cloud environments, more than anything else? So at least that’s my answer, real quick. I don’t know if there’s anything you’d like to add to that, Simon.
– I mean, honestly, I don’t have too much to add, just because I very rarely work in the AWS ecosystem. There are one or two things that occasionally, from an infrastructure point of view, you’ll see pop up in one and then balance out to the other a few months later. I think the only example I’ve got is spot instances. I think you can do that currently in AWS, and it’s coming in Azure later this year; they announced it at Ignite. It’s that kind of ability to say, I’ve got a job, and honestly, I don’t mind if it just disappears halfway through and then comes back, because it’s not a priority, but it’s nice and cheap, that kind of thing. But that’s infrastructure related, right? That’s because that kind of functionality was available in AWS, and so it got implemented. It’s coming in Azure, so it’ll get it from there. It’s about how it can take advantage of that underlying layer, but the engine itself...
– Yeah, exactly. There are really no differences between the cloud environments, right? It’s pretty much what we’ve released based on what infrastructure the cloud environments are able to provide. And that’s pretty much it.
– I think there’s some, like, deeper techie stuff with ADLS Gen2. It works slightly differently than it works with S3 buckets, which is slightly different than Blob, just in terms of...
– Absolutely. Exactly. If we want to go do that, for example, we can talk about the fact that ADLS Gen2 actually has different write consistency patterns versus S3. But the thing is that we really are diving deep now. (laughing) So, yeah.
– But that actually caught me out a few weeks ago.
– Oh true, okay. What happened?
– Because I was using Auto Loader.
– Okay.
– So Auto Loader is all sorts of cool. Essentially, on the Azure side, it uses Event Grid. So it’s basically an event watcher. You have your blob storage, you land a new file in there, it goes, “got a new file,” and it kicks off Auto Loader and starts running some stuff. Specifically, Event Grid kicks a thing and puts it into a queue. Databricks is looking at that queue and goes, what’s happened since I last ran, or it’s streaming, and so it just picks all this stuff up as it goes. Because we were using ADLS Gen2 for it, rather than just landing the file, the writer landed the file as a temp file and then did a rename, because it can rename the thing. And so it landed the first file, which kicked off the Event Grid job with a blob create event, and then the rename changed it to the right file name. So the file didn’t get picked up because of the way they were writing it down, whereas doing the same thing in Blob, it was happy, because it doesn’t do the rename. (laughing) I was just trying to unpick why it couldn’t see the file, and it was all because of how we were writing it down: it was renaming it after we saw it. By the time we looked at the file, it had already been renamed. Man, that took me a long time to figure out what was going on. [inaudible] (laughing)
– Yeah. So exactly. What’s interesting is there’s a question that came up on LinkedIn, but I’m not sure how to answer it because it’s a little too vague. But it goes exactly with what you’re talking about, Simon. When you’re dealing with situations that are much closer to real-time, or at least near real-time, streaming style, and the volume is so high, right, exactly to your point, then all of a sudden things like how the underlying disk works start impacting you. This gets us into, for example, the small file problem. If you have lots of little files, what ends up happening?
Basically, that allows you to optimize the writes, especially for your streaming application, but it certainly doesn’t allow you to optimize your reads. So then you actually have to do file compaction of some type in order to be able to read those efficiently, right? That’s why, when you run streaming applications, they’re typically loading this stuff into memory, so that you don’t actually have to hit the disk I/O, excuse me, the storage I/O that comes into play every time you’re trying to read the stuff right off the disk, right? So fun stuff like that; those all come into play massively. And so in the end, the deeper you go into running streaming applications, and even though it’s just Auto Loader, it’s actually starting to go down that route, the more likely you’re going to have to be cognizant or aware of the underlying storage system
that you’re playing with. – But I think there were two questions I need to hook into that. One about OPTIMIZE and VACUUM and how often you run them, and we can come on to that in a sec; that was one of the last of the bunch. And there’s one about leaving a cluster on when you’re streaming: is that kind of a requirement? I think that is in there somewhere. – Okay. – And the interesting pattern I’ve been doing recently is the trigger-once streaming process. – Oh yes. A fan of that one. – Essentially, most people write batch queries, right? Most people need to process data hourly, daily, whatever it is. And you can go down a rabbit hole of trying to work out, how do you programmatically say, just get the files I need, just process up to that point? Maybe you watermark it, maybe you’ve got some kind of CDC, maybe something’s dropping files and you’re collecting them, whatever it happens to be. But we’ve been using Auto Loader recently, not as an always-on stream, but triggered once. There’s a little option you can put in saying, I’m going to trigger this once: it runs a streaming job, uses the full Spark Structured Streaming setup, pushes the rows through, does a micro-batch and goes, I’m finished. So we’re using that at the moment to say, get anything that’s come in since the last run. Previously we had all the frameworks saying, this is how you work out the files that came in, this is the file pattern to match for files for this hour, pass a parameter and look for this file, and all that kind of stuff. Now it’s just, run it, and it’ll get anything it hasn’t got yet. And so you can run it hourly, you can run it daily, you can run it once a month, and it will still just go, I’ll get anything that came in since I last ran. But you’re still dealing with all those streaming problems. So we’ve essentially taken a batch process and added streaming problems, just to make life interesting.
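The run-once pattern described here can be modelled in a few lines of plain Python. This is only a toy sketch, not Auto Loader or Spark code: the `Checkpoint` class, the `run_once` function, and the file names are all hypothetical, made up for illustration. Each run processes whatever has landed since the previous run, records it in the checkpoint, and stops, so the same function behaves the same whether it is scheduled hourly, daily, or monthly.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a streaming checkpoint: remembers processed files."""
    seen: set = field(default_factory=set)


def run_once(landing_zone, checkpoint, process):
    """Process every file not yet recorded in the checkpoint, then stop."""
    new_files = sorted(name for name in landing_zone if name not in checkpoint.seen)
    for name in new_files:
        process(name)
        checkpoint.seen.add(name)  # record progress only after processing succeeds
    return new_files


landing_zone = ["file_a.json", "file_b.json"]
checkpoint = Checkpoint()
processed = []

first = run_once(landing_zone, checkpoint, processed.append)   # picks up both files
landing_zone.append("file_c.json")                             # a new file lands later
second = run_once(landing_zone, checkpoint, processed.append)  # picks up only the new one
third = run_once(landing_zone, checkpoint, processed.append)   # nothing new, does nothing
```

In PySpark Structured Streaming, the corresponding setting is the `trigger(once=True)` option on the stream writer, with the checkpoint location playing the role of the `seen` set in this model.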
But if you’re worried about using streaming but not wanting to have the cluster turned on all the time, you can do that kind of pattern, just to go, I still want to stream, I still want to have things coming in and landing in an event hub; turn on every hour, just get everything out, and go. Get that pattern working, and then if you later switch it and go, actually, we’re now getting enough value to leave it turned on all the time, there’s no code change. You just change the parameter you’re putting into the trigger, and the same thing will just run as a full-time streaming job. – Exactly. Yeah. And I did want to add, for example, from Summit, I want to say two years ago now, Comcast actually went on stage to talk about their environment, and it’s exactly this trigger once that you’re talking about, right? Because they had converted their batch jobs into streaming jobs, what they literally did was go from, if I recall the numbers correctly, 84 batch jobs that they had to maintain down to just three streaming jobs, in essence. And because they were able to get down to that, it was easier to maintain, but even better still, they went from something like 640 VMs down to 64. So they saved a massive amount of resources and time; even though ultimately they had an always-on scenario, they were actually saving a massive amount of money in the process, right? So it’s actually powerful. And then, this is very apropos, we’ll probably answer this one other question before we go into the Data + AI Summit stuff: how do you manage backfills in that case? Right?
It just came through on YouTube, and that’s actually extremely appropriate. This is, by the way, why you’ll often hear us talk about the medallion architecture, you know, bronze, silver, gold. The idea is that because you’re doing trigger once, and you’re using that queue, Event Grid or whatever it is that you want to use as, in essence, your queue mechanism, you don’t actually have to move the files. Okay? A traditional system is more like, okay, I take the files and copy them over to indicate that I need to go reprocess. Instead, your backfill is more like, okay, if the streaming job shuts down, start it back up. It has a checkpoint; it goes back and grabs the files it needs from where it last left off. But the reason I brought up the medallion architecture is, how about if there are actually errors in the silver and gold, as you’re trying to filter and augment, or as you’re trying to aggregate or build your features? Because you’ve got all the original data, all sitting in the original location, you change the checkpoint, change the start point: instead of starting from the most recent file, go back to the beginning, reprocess, rebuild your silver, rebuild your gold, using streaming. Obviously you’re going to need to knock up the number of nodes in order to catch up faster, right? But that’s a temporary process. Knock it up, process, and once you’re done, roll it back down, and you’re back up and running again, right? And that process is more complicated to do in batch than in streaming, because from the standpoint of streaming, all you’re doing is aiming from a new starting point.
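The backfill idea above, keeping the raw data in place and replaying it from a new starting point, can be sketched in plain Python. This is a toy model under stated assumptions, not Delta Lake or Spark code: `run_stream`, the two transforms, and the sample records are hypothetical, and an integer offset stands in for a real checkpoint.

```python
def run_stream(bronze, transform, silver, start_offset):
    """Apply transform to bronze records from start_offset on; return the new offset."""
    for record in bronze[start_offset:]:
        silver.append(transform(record))
    return len(bronze)


def buggy_transform(record):
    # Hypothetical bug: the amount is left as a string.
    return {"id": record["id"], "amount": record["amount"]}


def fixed_transform(record):
    return {"id": record["id"], "amount": int(record["amount"])}


# Bronze keeps every raw record, so it can always be replayed.
bronze = [
    {"id": 1, "amount": "10"},
    {"id": 2, "amount": "25"},
    {"id": 3, "amount": "7"},
]

# Normal operation: the checkpoint (an offset here) advances with each run.
silver = []
checkpoint = run_stream(bronze, buggy_transform, silver, start_offset=0)

# Restart after a shutdown: resume from the checkpoint, nothing is reprocessed.
checkpoint = run_stream(bronze, buggy_transform, silver, start_offset=checkpoint)
rows_before_backfill = len(silver)

# Backfill after fixing the bug: discard silver, aim at offset 0, replay.
silver = []
checkpoint = run_stream(bronze, fixed_transform, silver, start_offset=0)
total_amount = sum(row["amount"] for row in silver)
```

The same function handles the restart and the full rebuild; only the starting offset changes, which is the point being made about streaming backfills.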
That’s all you’re doing. So it’s an important call-out. – We’ve made it more complicated, because you can go to the next stage, right? – Sure. Sure. – So again, I don’t like the term medallion architecture, but in a sense, you know, you’re landing into your first layer, raw or bronze, and that’s just an append stream, right? I don’t care if it’s a duplicate, I don’t care if I’ve seen it before, I don’t care if it’s an update; you’re just tracking the data. – Exactly. Yeah. – And that’s just it. And we’re streaming, Delta streaming. So raw is a Delta table, and we’re just insert-appending into there. Then we’re doing a stream from that Delta table into a middle-tier, silver, base, whatever you want to call it, Delta table. And, I did a video on this, you can do merging on a stream, you know: per micro-batch, with foreachBatch, do a Delta merge. So actually you can just drop a new file, and even if you’ve loaded that file previously, it’ll work out the merge and just kind of slot those records in. Now, it’s not the most performant for a high-volume live stream, right? But for trigger once, if you’re just saying, use this mechanism, it’ll just turn on and churn through anything that’s come in. It’s quite a neat way of getting the changes and merging them into silver, without having to reset checkpoints, without having to do a load of manual work for a backfill. You just reload the file into Blob and it’ll just go. – Exactly. So last time I checked, that wasn’t that complicated then. What are you talking about, dude? (laughing) Exactly. Okay, cool. All right, we’ve only got a few minutes left, so let’s end with a quick call-out on our respective sessions. Do you want to start, or do you want me to start on this one? – Yeah.
I mean, I can go on that. – Please, go. (inaudible) Okay. – Okay, the name of your session for Data + AI Summit. Let’s start with that, and then we’ll go from there. (laughing) – I think it’s something along the lines of designing lakehouse models for SQL and Databricks, something along those lines, essentially. Fair enough? – Sure. – I can’t come out with the proper title. I don’t know it off the top of my head. Is that bad? Should I never... – It’s okay. By the way, this tells everybody who’s still watching us, who hasn’t been basically disgusted with our banter here, that yes, these questions really are quite live, because even though we’ve talked about this, I still managed to surprise my co-host. (laughing) – “Achieving Lakehouse Models for Spark 3.0,” – Damn. – which is essentially some of this stuff: how you model things for the end user. So not the engineering, not the getting data in, but actually, what are the things that you can do because of Spark 3, because of adaptive query execution and dynamic partition pruning, because of some of the Delta functions? How do you actually give the SQL people a good place to work? – Perfect. – Thank you. – Bam, that was beautiful. Definitely take a look at his session. It’s awesome, especially for our European friends, because it is scheduled for that time zone, but it will be available on demand if you don’t want to wake up at three in the morning. Funnily enough, that’s not my favorite thing to do, but there you go. Okay. And then in my case, I’ll freely admit that I forgot the title of our session too. Okay.
(inaudible) (laughing) We either called it “Unpacking the Transaction Log v2” or “Under the Sediments v2.” (laughing) I forgot which one we chose, but it is a technical dive into how the transaction log of Delta Lake works. Obviously we’re showing demos on Databricks, but this is not Databricks specific; this is very much about how open source Delta Lake works. We dive into how the V2 checkpoint works. It’s a follow-up to the “Unpacking the Transaction Log” tech talk that we did originally, but it follows up with the V2 design. So it’s not just V2 of the video or the tech talk; it’s actually V2 of the checkpoint design itself as well. So it’s a fun play on words. So there’s that session. We also have three fun AMA sessions, as Karen called out in the beginning. I’ll be emceeing two of them. One will be for VIPs, specifically on lakehouses, okay, if you want to think about it from the concept or the paradigm side. The other part, absolutely, is if you want to dive deeper into the technical.
The other one that I’m emceeing is on the Photon engine and Delta Lake, and understanding the technical details. So it’s myself with four other engineers. So come prepared with your geeky questions, because that’s the stuff that we’re more comfortable answering anyways. And for all those who are data scientists, we did not forget about you. We actually have an MLflow session specifically for you, where you get to ask questions of the MLflow engineers and my compatriot Jules Damji. He’ll be emceeing that. So lots of fun. And then of course, as Karen called out right in the beginning, we will have a fun AMA session with Malcolm Gladwell. If you have not already read his book “Outliers,” do it. Another good book, if you are a geek and you don’t like talking to people, is “Talking to Strangers”; that’s a great book as well. And the reason I’m promoting him is because he’s a fellow Canuck. That’s right, go Canadians. But nevertheless, if you don’t want to read but would rather listen, he’s got an amazing podcast series called “Revisionist History.” It’s an excellent podcast series. I highly recommend it; it’s probably a lot more organized than myself and Simon, I would hazard to guess. Yes. What do you think, Simon? I think that’s a fair statement. – How dare you. (laughing) – Yes. So, okay, I think that does it for our little call-outs. Karen, why don’t you finish the show for us?
(laughing) – Thanks, Denny. Thanks for adding some details on Summit. Denny is way more involved in Summit, with sessions and whatnot, so I’m glad he’s able to share a little bit more about what we have going on for you all. We also have three meetups, too; they’re on the Data + AI Online Meetup group page, if you’re interested in joining us there as well. I also just checked my email, and I have a few folks that emailed me for VIP passes, which I’m excited about. I do have a few left, so email me or message me on LinkedIn and I’ll send one your way. We’d love to have you there. With that, thank you so much for your time, Simon and Denny. We’ll let you get back to your session planning for Summit, and thanks everyone for joining us. Take care, have a good day.