
A very warm welcome to all of you to this Spark tutorial from Edureka. Before I start, can I get a quick confirmation that I am loud and clear? On your right-hand side you will find a chat option and a question panel; you can type in either of them. Very good, thank you, thank you Saurabh. So, just the way you responded to that question, feel free to interrupt me at any time whenever you have a doubt, and I will take your questions. Let's start.

What can you expect from this webinar? What is Apache Spark, and why are we learning this new technology? In today's world you must be hearing a lot that Apache Spark is the next big thing. Why are people saying that? What are the features of Apache Spark that make people talk like that? What are the use cases related to Apache Spark, and what does the Apache Spark ecosystem look like? We will also do a hands-on example during the session, and at the end I will walk you through a project related to Apache Spark. That is what you can expect from this session.

Moving further: before we even talk about what Apache Spark is, it is very important to understand Big Data, because that is what Apache Spark will be used on. So what does this keyword Big Data really mean? That is the first thing we are going to discuss. If I ask you what Big Data is, what would be your response? Can I get some answers? On your right-hand side you will see a question panel, you can answer from there. Please help me make this a little interactive, it will really help you understand the topic well. I assure you that by the end of this class you will all leave with a good understanding of Apache Spark, but you need to help me keep it interactive.

So, what do you understand by the keyword Big Data? Very good. One of you is saying that it refers to the huge data that is generated every minute on the internet from various sources. Very good answer. So we are saying that a large amount of data is generated on the internet and in corporate networks, and it can be text, images, video, streams. Very good. But look at that statement again: you are calling a large volume of data Big Data. Is that really the case? Can I call volume alone Big Data?
No, volume is just one property of the data. If I want to define Big Data properly, I need to define it in broader terms. It is not only the volume; it is also the variety of sources the data is generated from, for example Facebook, news portals, the medical domain, all of these generate Big Data in very different shapes. And in the end we also have to talk about the velocity, the speed at which this data is growing. Look at Facebook: it is only a 10 or 12 year old company, not a very old company at all, and in those 10 to 12 years Facebook's data has grown exponentially; they are dealing with a truly huge amount of data. A few months back Mark Zuckerberg, the CEO of Facebook, posted on his timeline that Facebook today has as many users as there were people living in the whole world a hundred years ago. That is a big statement. No, Sameer, we also deal with unstructured data, I am coming to that point.

So with respect to the number of users alone it already sounds like huge data. Now think about the activities you do on Facebook: you post status updates, you type messages, you upload your pictures, your videos, your audio. Is that the kind of nicely formatted data we used to store in an RDBMS? The answer is no. It is a different category of data, and that category is called unstructured data. Can our RDBMS systems deal with that kind of data? No. An RDBMS can deal with structured data, data that has some sort of pattern. When we talk about Hadoop, we also talk about audio and video, in other words unstructured data. So variety is another property of the data we deal with. We cannot simply say that because the data is huge we call it Big Data; that is just one property. Even if unstructured data is small in size, we may still have to use Hadoop and the other Big Data tools to work with it, because an RDBMS is not efficient at solving those kinds of problems. And whatever data you get can also have quality problems: there can be missing data, there can be corrupt data. How trustworthy the data is, that property is called veracity, and it is also a property of Big Data.

So Big Data is not just about volume; it is a combination of factors: volume, velocity, variety and veracity. Like I just said, Facebook has grown its data enormously in about 12 years; the number of users alone sounds like Big Data, and on top of that users are constantly doing activities on the platform, so imagine how much data Facebook must be dealing with. It is not only Facebook. On Instagram, thousands upon thousands of posts are liked every single minute; I am talking per minute, not per day. On YouTube, hundreds of hours of video are uploaded every minute, and yet when you search for something on YouTube, is your query ever slow? No. How are they able to handle all that data so efficiently? On Twitter, hundreds of thousands of tweets happen every minute. So much activity is happening per minute; imagine what is happening per day.

In fact there is a fun statistic which says that the amount of data in the world doubles roughly every two years. If you burned all the data we have today onto discs and stacked them up, the pile would reach the moon, and not just once but twice. That is the amount of data we are dealing with at the current moment.

Now imagine what is going to happen by 2020. Whenever I take a batch I always tell people: you are all sitting on a data bomb, and that bomb is going to go off very soon. What is currently happening is that only 4 to 5% of the companies that work with data have really realised the potential of that data. The challenge is that the rest are hesitant to move towards Big Data, unsure whether to use the Hadoop tools or not, because they are afraid of what will happen if they shift to the Big Data domain tomorrow: will they get good support, will they find enough skilled people to solve problems for them? They are still thinking about all of this, and for the same reason they are hesitant to adopt Big Data tools. But they cannot stay like this for long, because there will definitely come a stage where they will not be able to use an RDBMS, or any traditional system, at all.
In that situation they will have to make the transition. It is expected that by 2020 that 5% of companies will grow to 40%. Even right now, when you go to Indeed.com or Naukri.com, you see so many jobs popping up for Apache Spark, Big Data and so on; imagine what is going to happen by 2020. There will be a huge demand and a much smaller supply of people, I can definitely say that.

In your company today, let us say you work somewhere that uses databases, there must be senior managers above you, maybe senior directors, maybe VPs, and you must sometimes think: these people were really lucky, they started their careers 20 years back when Oracle and the RDBMS were just coming up, and today they are VPs while I am still sitting in a software developer position. That is a very common thought, I am pretty sure some of you have had it. Well, you are now sitting in exactly that position. Tomorrow's generation, your future colleagues, are going to think about you in exactly the same way: these people were lucky, the Big Data domain was just coming up, they started early with Apache Spark, and today they are VPs. You will be occupying those positions very soon, because this domain is going to explode, that is for sure. And this is not just me saying it; these are the predictions of research firms and analysts, and I am not talking about small ones. You can simply read their blogs and you will find all of this. In fact, a lot of analysts have gone so far as to say that in the next five years, companies that do not transform towards data and tools like Apache Spark will not even be able to survive in the market.

Now imagine how much data we will be dealing with by 2020: malls, shopping carts, vehicles, every kind of device generating events. You might have heard of IoT, the Internet of Things; that alone requires Big Data, because it generates so much data. So a lot will be happening around you.

Talking about Big Data Analytics: what exactly is it, what do we actually do there? First, let us understand what analytics is. Analytics is the process where you are given data and you generate some meaningful insight from it. You want to extract information from the data, because when the data is just sitting with you, you have no idea what is inside it; as an analyst, you want to generate meaningful information out of it. That is analytics. The major challenge with Big Data is that the data has grown so much in volume: how can we analyse it, and how can we use it to gain business insight? That whole domain is called Big Data Analytics. There are generally two sorts of analytics: the first is called batch analytics, and the second is called real-time analytics. What are they? Let us understand them one by one.
Everybody here must be using a washing machine at home, or at least must have heard of one. What do you normally do? You collect your clothes over a few days and then wash them all together someday, rather than washing each item the moment you take it off. That style of working, where you collect data first and process it later, is called batch processing. So when you run processing on historical, already collected data, that is batch processing.

For real-time processing, let us take another example: credit card transactions. I am pretty sure most of you use a credit or debit card online; even if you make a payment to Edureka, you probably do it online. Now let us say you are sitting in Bangalore right now and you do a credit card transaction, and just ten minutes later the same card is swiped in the US. Is that possible? Definitely not. Do you think it makes sense for the bank to let that transaction go through and only later check whether it was genuine? Definitely not; if the fraud goes through, it is their loss. So as soon as the bank receives a real-time event saying that someone is trying to swipe your card at a location that does not look genuine, they will either send you an OTP, or block the transaction, or immediately give you a call and ask
whether you actually made this transaction, because it looks unusual to them. Only once you approve it will they let the transaction happen. Now, is that processing happening on historical data or on current data? On current data. Which means we are doing the processing in real time: as soon as I swipe the card, the system has to wake up, run its algorithms and decide whether to allow this transaction or not. That second type of processing is called real-time processing.

So, to summarise the difference: batch processing works on historical data, while real-time processing works on the data arriving right now.

I just mentioned the credit card use case in banking; real-time analysis is also very important for government agencies, for example when you apply for an Aadhaar card in India. Amina is asking for one more instance of real-time processing. Sure, take stock market analysis. Firms like Goldman Sachs and Morgan Stanley have developed smart algorithms: you hand over your money or your stocks to them, and the algorithm predicts which stock prices are going to rise and which are going to fall. Of course they do not make those algorithms public, that would be their loss. The point is that these algorithms run in real time: if they detect an unusual event because of which the market, or a particular stock, could fall, they will immediately sell so that the customer does not take a loss; and if they detect an event in real time where a stock can make a profit, they will buy that stock straight away. So these companies are all doing real-time processing.

Similarly there are many other examples: telecom companies, and healthcare, where it is extremely important. A patient comes in, and as soon as the patient arrives you want immediate insights from whatever information is available, and based on that you start treating the patient. All of that happens in real time as well.
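To make the batch versus real-time distinction concrete, here is a minimal plain-Scala sketch. This is not code from the session; the Txn type, its field names and the simple fraud rule are made up purely for illustration.

```scala
// A made-up transaction type, just for illustration.
case class Txn(card: String, city: String, amountInr: Double)

// Batch style: collect the data first, process it later
// (like collecting clothes all week and washing them on Sunday).
def batchCityReport(history: Seq[Txn]): Map[String, Int] =
  history.groupBy(_.card)
         .map { case (card, txns) => card -> txns.map(_.city).distinct.size }

// Real-time style: check every event the moment it arrives
// (like the bank reacting to a suspicious swipe immediately).
def looksGenuine(txn: Txn, lastKnownCity: Option[String]): Boolean =
  lastKnownCity.forall(_ == txn.city)   // false => send an OTP, block, or call the customer
```

The same data could flow through either style; the difference is purely about when the processing happens.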
Now, why Apache Spark, when Hadoop is already there? Why were we talking about batch processing and real-time processing at all? Let us understand that.

Point number one, and this is very important: in Hadoop you can only do batch processing; Hadoop is not meant for real-time processing. Say you collect data on day one; only on day two do you process it, or even if it is just an hour old, it is still historical data by the time you touch it. You are never processing it the instant it arrives; that is how Hadoop systems work. With Apache Spark there is no such delay: as and when the data comes in, you can process it immediately. Now you might ask me, so is Spark only for real-time data? No. Spark can deal with historical data, that is, batch processing, as well as real-time processing. It can do both kinds of processing, and that is a big advantage of Apache Spark.

Is that the only advantage? No, let us note two more things about Apache Spark. We just said Hadoop gives you batch processing while Spark also gives you real-time processing; in addition, Spark can take data from multiple sources, and it is very easy to use. Has anybody here already written MapReduce programs? Samir has; so Samir, you can confirm this: it is not easy. For a beginner, learning MapReduce is not a simple task; it takes time, and the programs are complicated to write. With Spark, things are much easier. And Spark has one more major advantage: faster processing. Spark programs run very fast in comparison to MapReduce programs.

Let us now understand this in detail. Once I explain this part, it will be completely clear to you why MapReduce is slower, why Apache Spark is faster, and how Spark works. Give me a moment while I share my screen and switch to the whiteboard.

Let us go step by step and first understand what the problem with MapReduce was; remember, I just said MapReduce is slower, so what is the reason? Take an example: say I have a file containing words like apple, banana, orange, repeated many times. I am assuming all of you already know how data is stored and processed in Hadoop, that we split data into 128 MB blocks; if you do not, please let me know. Say this file is 256 MB. If I divide it using the default block size, how many blocks will it create? Two blocks of 128 MB each.

Now say your boss comes to you with a problem: in this file there are only three keywords, apple, banana and orange; tell me how many times apple occurs, how many times banana occurs, and how many times orange occurs. You start working on it and think it is an easy problem: you can divide the file into two 128 MB blocks and work on it in a distributed fashion. To keep the notation simple, let me denote apple by a, banana by b and orange by o. Your first approach is this: the first time apple appears, you write (apple, 1); the first time banana appears, (banana, 1); the first time orange appears, (orange, 1). When apple comes again, you look back, see that apple has already occurred with a count of one, and increase the count to two. You keep doing this for your entire first block, and for the second block, which may be running on some other machine, you do exactly the same thing. Now, what would the next step be?
The next step is that you combine the outputs from the two blocks. Say from block one you got (a, 20) and from block two you got (a, 34); similarly you got some count for banana from each block, say 56 from one of them, and likewise for orange. In the end you add these up, a: 20 plus 34, and so on for banana and orange, and you take the solution to your boss and tell him the problem is solved.

Your boss is not going to be happy with you. Why? Because there is a problem with this approach; it is not the right approach. Can anybody tell me where the performance bottleneck is? Someone is saying the aggregation part; no, the aggregation is not the issue. Someone else says one block has to wait for the other; no, that is not it either, let us say that is not a problem and both blocks run in parallel. So what is the real problem, and what would the solution be?

Think about this 128 MB block. Do you think it is small? When it contains nothing but text, 128 MB is a huge number of words. Every time an element comes in, you are going back and checking whether that element has occurred before, and then adding to its count. Don't you think that is killing your performance? That lookup on every single incoming element is the major bottleneck of this algorithm.

So how does MapReduce solve this? The bottleneck came from looking back, so let us simply remove it. This time, when apple appears I write (apple, 1); when banana appears, (banana, 1); when orange appears, (orange, 1). When apple comes again, I do not look back at all; I just write (apple, 1) again. Whatever key arrives, I simply append a one to it. The second block does exactly the same thing; nobody ever checks the previous entries.

In the next step, wherever apple appears, I want to bring those entries together: I combine the entries from both machines, so all the (apple, 1) pairs from both blocks come together, all the (banana, 1) pairs come together, and so on. How can we do that?
By doing a sort: we bring everything together on one machine and then perform a sorting step, so all the (apple, 1) entries line up next to each other, then all the (banana, 1) entries, then all the (orange, 1) entries.

What is the next step? Wherever the ones for a key have been brought together, we simply add them up. If (apple, 1) appears three times, the output is (a, 3); similarly (b, 3), and so on for whatever the real counts turn out to be; I am just giving small numbers as an example. Can everybody smell the solution now? We aggregate everything per key and produce the final counts.

This is how MapReduce solves the problem. If you look back at what we did, the first step is called the Mapper phase, the second part, bringing the keys together and sorting them, is called the Sort and Shuffle phase, and the third step, adding up the values, is called the Reducer phase. Those are the three steps involved in a MapReduce program.
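To tie the apple, banana, orange walk-through together, here is a minimal sketch of the same three phases expressed as ordinary Scala collection operations. This is not real Hadoop MapReduce code, just the same idea in miniature on a tiny made-up word list.

```scala
val words = Seq("apple", "banana", "orange", "apple", "banana", "apple")

// Mapper phase: emit (word, 1) for every word, never looking back.
val mapped = words.map(w => (w, 1))
// e.g. (apple,1), (banana,1), (orange,1), (apple,1), ...

// Sort and shuffle phase: bring all pairs with the same key together.
val shuffled = mapped.groupBy(_._1)
// e.g. apple -> List((apple,1), (apple,1), (apple,1)), banana -> ...

// Reducer phase: add up the ones for each key.
val counts = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// e.g. (apple,3), (banana,2), (orange,1)
```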
We have understood the steps, but why is MapReduce slower? That is still a mystery, so let us clear it up. Again, I am assuming you know the basic facts about Hadoop; if not, please ask me. Assume the replication factor is one. I have two data nodes, the machines where the data actually resides: block b1 sits on the hard disk of data node one, and block b2 on the hard disk of data node two.

Now, where does processing happen: at the disk level, or in memory? Can I get an answer? Memory; processing always happens in memory. So when the mapper code, the first code to run, arrives at data node one, block b1 has to be copied from the disk into the memory of that machine before the mapper can execute; similarly, b2 is copied into the memory of data node two. Whether or not you come from a computer science background, you have probably heard that input/output operations, that is, reading data from disk or writing data to disk, degrade performance, because of disk seeks and so on. Notice that copying each block into memory is already one input/output operation. The mapper then produces its output; call the output on data node one O1 and the output on data node two O2.

These outputs are written back to the disk: O1 is stored back on data node one and O2 on data node two, which is again an input/output operation. There is a question in the panel: what happens if the block size is large, will the memory be enough? For now, assume the memory is big enough to hold at least 128 MB; in MapReduce, if you have 128 MB of data but less than 128 MB of free memory, you will simply get an error. Spark, by the way, has a very smart way of dealing with this: it can manage even with less memory, and that is a very interesting story we will touch on, but MapReduce just throws an error. Clear, Sarah? That is also the reason we divide our data into 128 MB blocks, so that the memory is at least enough to handle one block at a time.

What happens next? The sort and shuffle step runs on one machine, say data node one. So the mapper output from the other machine has to travel to it: O2 is transferred over the network. The sort and shuffle runs and produces an output, call it O3, which is again saved to disk. After that the reducer runs: it brings O3 back into memory and writes the final output to disk. Now count the input/output operations in this one program: the mapper did disk I/O, the sort and shuffle did a network transfer plus more disk I/O, and the reducer did disk I/O again. That much disk activity in a single program is the technical reason your MapReduce programs are slower. If you have ever executed a word count in MapReduce, you will have noticed that it does not give an immediate output; it takes a good amount of time, and this is why. Thanks, Ratish.

Let us move on. That is the problem with MapReduce; now let us see how Apache Spark solves it and why it is faster. Take another file, with numbers in it: say one part has 1, 3, 5, 6, 7, 8, another part has 34, 78, 3, 6, another has 23, 67, 1, 9, and so on. This file is 384 MB, and its name is F.txt. I am also going to write a line of code on the board that may look alien to you; do not worry about it yet, I will explain it shortly. In this example I again have a cluster: a name node and a set of data nodes. Since F.txt is 384 MB, it will obviously be divided into three blocks, b1, b2 and b3, each of 128 MB, and once again I am assuming the replication factor is one.

Now, where does this file live? It is in HDFS, which means it sits on the disks of the data nodes: b1 on one machine's disk, b2 on another's, and b3 on a third's.

Before we look at the first line of code, let me ask you one thing: what is the main entry point of a Java program, without which you cannot run anything? The main function. In Apache Spark there is also one main entry point without which none of your applications will work, and it is called the SparkContext, usually written as sc. It lives on the master machine. Just as every Java project has its own main function, every Spark application has its own separate SparkContext.

Now the first line of code. Ignore the "RDD" part for a moment; think of it like a data type, the way String is a data type in Java, and "number" is simply the variable name. We have just seen that sc is the SparkContext. textFile is an API of Apache Spark; we study it in much more detail in the regular sessions, but briefly, textFile takes whatever file name you give it, here F.txt, goes and finds that file, and loads it into the memory of your machines. What does that mean here? F.txt lives across three machines as blocks b1, b2 and b3. So b1 is copied, not moved, copied from disk into the RAM of its machine; b2 is copied into the RAM of its machine; b3 into the RAM of its machine. These three pieces, sitting together in memory across the machines, are collectively called an RDD, and we have given this RDD the name number, the numberRDD.
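For reference, that first line would look roughly like this in the Scala Spark shell, where the SparkContext sc is already created for you. The HDFS path below is made up for illustration; point it at wherever your F.txt actually lives.

```scala
// In spark-shell, `sc` (the SparkContext) already exists.
val numberRDD = sc.textFile("hdfs:///user/edureka/F.txt")

// Nothing is printed and nothing is actually loaded yet (more on that under
// lazy evaluation). Conceptually, the three 128 MB blocks of F.txt become
// the partitions of this RDD, one per HDFS block.
numberRDD.getNumPartitions
```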

So what is an RDD? It is a distributed dataset sitting in memory. The full form is Resilient Distributed Dataset. Let me ask you: is this a distributed dataset or not? Yes, clearly it is. But what do you understand by resilient? Resilient essentially means reliable. Now that should raise a question in your mind: I am keeping this data in RAM, which is the most volatile thing in my whole system; if you restart your machine, everything in RAM is erased. So how can I still say the RDD is resilient, when I could lose the data the moment a machine restarts?

Remember the replication factor. Say the replication factor is two: then b1 also sits on the disk of some other machine, b2 on another, and b3 on another. Now suppose the machine that was holding b3 in memory goes down. What have we lost? The in-memory copy of b3. But is b3 available anywhere else? Yes, on the disk of another machine. So Spark will immediately load b3 into the memory of that machine; b1 and b3 may now sit together in the memory of one machine, and the RDD is formed again. So even if you lose data, or lose an entire machine, it does not matter; the RDD takes care of it. That is the resilient part.

Let us take one more step. We have created numberRDD; now I am creating a second RDD, filter1, and I am creating it from numberRDD by writing numberRDD.map. Again, map is an API; we cover it properly in the regular sessions, but briefly, whatever code you write inside map will be executed on the data. Right now I have just written an English placeholder there, "logic to find values less than 10"; in a real program you would replace that placeholder with actual code, which could be a Python function, could be a Scala function, whatever you want to write, and the map API will be responsible for executing it. One more point before we continue: RDDs are always immutable. Once block b1 is sitting in memory, you will never be able to change it; you can only derive new blocks from it.

So what actually happens? Say you have written a Scala or Python function that picks out all the values less than 10. In block b1, suppose the values below 10 are 1 and 3, so the output from b1 is (1, 3); call that new block b4. From b2 the output is, say, (3, 6); call it b5. From b3 it is (1, 9); call it b6. Notice what happened: the code executed on block b1, which was sitting in memory, and a new block b4 was created; b1 itself was not modified in any way. In exactly the same manner, b5 is generated from b2 and b6 from b3, all of it in memory. So now b1 and b4 sit together in the memory of one machine, b2 and b5 together on another, and b3 and b6 together on the third. Collectively, b4, b5 and b6 form a new RDD, and the name of that RDD is filter1. Clear, everyone? That is what an RDD is and how RDDs work; this is how Spark operates.
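The second line would look roughly like this. The session leaves the logic inside map as a placeholder, so this is just one reasonable way to express "values less than 10", assuming the file holds comma-separated numbers as text, as in the drawing.

```scala
val filter1RDD = numberRDD
  .flatMap(_.split(","))     // break each line into individual number strings
  .map(_.trim.toInt)         // parse them into integers
  .filter(_ < 10)            // keep only the values below 10

// numberRDD is untouched (RDDs are immutable); filter1RDD is a new RDD whose
// partitions (the b4/b5/b6 blocks in the drawing) are derived from b1/b2/b3.
```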
Now let me ask you a question: don't you think this will be faster? Are we doing many input/output operations the way MapReduce was? No. The only input/output happens at the very first step, when F.txt is read; after that the data always stays in memory, so there is no further disk I/O, and that is exactly why Spark gives you output faster than MapReduce. But won't the RAM become a constraint? Yes, that concern is valid, but Spark has a big advantage here: even if your RAM is not enough, it can still handle it, through a concept called pipelining. I am not going to cover pipelining in this session, but it is a very interesting concept: Spark does not simply spill the extra data to disk, yet it still manages, and that is what makes Spark such a smart framework and one of the reasons people are going for it. You must be curious how that works; that is exactly what we go through in the regular sessions. Another question from the panel: is there any limitation on the number of concurrent read requests? No, you can read as many times as you want; it is only concurrent writes that are a problem; there is no limitation on reads.

Now let us take one more step. Look at what we have: filter1 RDD is dependent on numberRDD. And what is numberRDD dependent on?

On F.txt, of course. So can you see there is a graph here: filter1 depends on number, and number depends on F.txt. This graph is maintained by the SparkContext as soon as you write these statements, and it is called the DAG, the directed acyclic graph; it is also called the lineage. The lineage maintains all the dependency information: filter1 has a dependency on number, number has a dependency on F.txt. This dependency graph is a very important part of Spark.

Notice also that b4 was generated from b1, b5 from b2, and b6 from b3; in other words, the filter1 RDD was generated with the help of the numberRDD. number was already an RDD, and from it we created a new RDD called filter1. This step of producing one RDD from another is called a Transformation.

Are we printing any output anywhere here? No, we are only keeping data in memory. In Java we would use a print statement; in Spark we do not print like that, instead we have collect. If I want to print b4, b5 and b6, that is, print the filter1 RDD, I can write filter1.collect, and that will bring b4, b5 and b6 back through the SparkContext and display them. Whenever you bring back or print output like this, it is called an Action. So there are two major kinds of steps when you work with Apache Spark: Transformations, which convert one form of RDD into another, and Actions, which produce output. These are very important points to keep in mind.
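Here is a minimal sketch of the lineage and the action just described, continuing the same example; toDebugString and collect are standard RDD calls.

```scala
// The lineage (DAG) that the SparkContext keeps for us:
// filter1RDD depends on numberRDD, which depends on F.txt.
println(filter1RDD.toDebugString)

// collect() is an action: only now does Spark actually read the file,
// run the transformation, and bring the (small) result back to the driver.
val smallValues = filter1RDD.collect()
smallValues.foreach(println)
```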
detected maybe they’re expecting that this do not it sounds like a genuine connections can we learn with the package but MapReduce it is impossible to do because they cannot even perform on Real Time processing, secondly, even if we try to apply on historical data it will be slower that’s a challenge there In medical domain also we apply apache Spark a lot, so these are the areas where Apache Spark is getting used

Talking about Spark itself now: Apache Spark is an open-source cluster computing framework. It is available to you free of cost, which is also one of the important reasons it is so popular. You can perform real-time processing on it, batch processing, every kind of processing. You get a simple programming model, data parallelism, and fault tolerance; we have already seen the resilient part, and that reliability is what we call fault tolerance. A question from the panel: what will I get as output if I use the collect function immediately after creating the RDD? It will simply print out the original data as it was loaded; in fact I will do a practical execution in a few minutes and show you exactly how you load data and look at what is inside an RDD.

Can Spark be used with Hadoop, and can it be used standalone? Yes to both. You do not need a Hadoop cluster at all: you can set up Spark on your own simple Windows machine, keep the files locally, and start working without requiring HDFS or an RDBMS. I will show you an example of that as well, so you are clear about how the standalone setup works.

So you can see the advantages for yourself. Spark gives you up to 100x faster processing speed; I am not talking about double or triple, I am talking about 100x, which makes Spark very powerful, and it is why you hear so much about companies migrating from MapReduce to Apache Spark. Caching is also very powerful: you can cache, or persist, your data in memory, which helps a great deal in many cases; we go into persistence in detail in the regular sessions. You can deploy your applications through YARN or as a standalone cluster, which is a very good feature: if you have already configured your Hadoop cluster, you do not need to change anything specifically for Apache Spark; you can use the same cluster you were using for MapReduce. And Spark can be programmed in multiple languages: Scala, Python, Java and R; those four languages are supported at the current moment.

Spark with HDFS is a powerful combination, because you can execute your Spark applications on top of HDFS very easily. Spark can also sit alongside MapReduce in the same Hadoop cluster: you can run some applications with MapReduce and, on the very same cluster, run your Spark applications; there is no need to maintain separate clusters for Spark and for MapReduce. The same goes for YARN: most of the older MapReduce applications were deployed on YARN, and Spark can take advantage of that, which makes life very easy for companies migrating from MapReduce to Apache Spark, because they do not need to change the cluster manager at all. For those who do not know what YARN is, in brief, it is a cluster resource manager. Let us see a few more things.
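Coming back to the caching point mentioned above, here is a minimal sketch of how you would ask Spark to keep an RDD in memory between actions; the RDD name simply continues the earlier example.

```scala
import org.apache.spark.storage.StorageLevel

// If filter1RDD will be reused by several actions, ask Spark to keep it in
// memory after the first computation instead of recomputing it from F.txt.
filter1RDD.cache()                                 // shorthand for persist(MEMORY_ONLY)
// or, choosing a storage level explicitly:
// filter1RDD.persist(StorageLevel.MEMORY_AND_DISK)

filter1RDD.count()     // first action: computes the RDD and caches its partitions
filter1RDD.collect()   // second action: served from the cached partitions
```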

Hadoop, keep this in mind in fact you can take it as a extension of your Hadoop framework People have this confusion a lot they say that Spark we’re going to replace Hadoop, no It is not going to replaceable because they are still depleting all the things, you are using HDFS, you are using YARN but just that the processing style you are changing so Spark is not going to replace Hadoop, in fact you can call it as an extension of the Hadoop framework second time When we talk about Spark with MapReduce, now they can also work together, so sometimes they are not new applications, no not now they’re very rare applications but there can be applications there’s some part of the code they write it for splits back and some part of the code they write with MapReduce this is all possible, so let’s say a company’s transforming the codes no need for MapReduce to apache Spark they require time so maybe some part of the foot which is really important for them they can start processing it with respect to Apache Spark and rest of the map really what they can leave it as it is So you can keep on slowly converting like that because combinely also they can work, so if you Sparkle stand alone it does not provide any distributed by their definitely stepped it takes I mean because if you are already using it as a standalone let’s say if you are not using as the data that in that case definitely you are not liberating there’s Apacaha Spark making it as a single process Now moving further, what are the important features in apache Spark, definitely the speed, polygon, polygon means multiple languages which you can use are kala, Python, Java, so many languages You can perform so much of analytics in memory computation, when we are executing everything in memory this is called In-Memory Computation You can integrate Hadoop, you can also apply machine learning and this makes apache very powerful, it is so powerful that Hadoop obviously use not even or do this even now we have massout, anybody who have not heard about massout, I hope everybody must be having, if not and let me just explain you massout is a MapReduce programming framework which is used to write your machine learning algorithms so you can write your machine learning algorithms with Mahal now MapReduce is somahow to convert the problem in MapReduce pay and you get down but now MapReduce itself is slower plus, machine learning algorithms are very highly hydrated in nature because of this your execution will become very slow in mahal because machine learning algorithms are already slower in nature plus MapReduce programming is slower in nature now because of that mahao and got emptied sometimes asked to give an output, I am not talking about even minutes some time to execute even a smaller data set it’s something can go even hours Now this is a major problem with mahao, know what Spark did Spark come up with a very famous framework called SMLA, Spark MLA, his is a substitute for mahao Now in MNLA every processing is going to happen in memory so that there will be knowing to talk about operation even the hydration what is happening will be happening in the memory so this will make the things very fast, now because of this what happened that MapReduced programming which was used by mahal, people stopped using that Now what happened with this part they stopped using this mahal in fact the core developer of this mahal but did themselves migrated to words the MLA, now even if you talk to those core developers of mahal, they themselves are recommending that if 
you want to execute machine learning workloads, you should execute them in the Spark framework — run them with Spark MLlib rather than running them on Hadoop. That is the reason that for machine learning on Big Data, everyone is moving towards Spark MLlib. Let's see all of this in detail now.
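Just to make that in-memory iteration point concrete before we move on, here is a minimal sketch of the pattern MLlib relies on — cache a dataset once and reuse it across passes instead of re-reading it from disk every time (the points.txt file, the parsing and the ten-pass loop are made-up examples, not MLlib itself):

```scala
// Parse the input once and keep it in memory; every iteration then reuses the cached
// data instead of re-reading it from disk, which is what makes iterative algorithms fast.
val points = sc.textFile("points.txt")              // "points.txt" is a made-up file name
  .map(_.split(",").map(_.toDouble))
  .cache()

var result = 0.0
for (i <- 1 to 10) {                                // ten passes over the same cached data
  result = points.map(_.sum).reduce(_ + _)
}
println(s"result after 10 passes: $result")
```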

and we were just discussing these features: Spark can run up to 100x faster — why, we already know, we have already covered the speed. When we talk about polyglot, we have just discussed that you can write in Scala, Python, Java and R — many languages are supported. Now the next one is important: Lazy Evaluation. Let me take you back to the example — how does the execution actually happen here? First of all, it is not the case that as soon as you hit sc.textFile it immediately loads the file into memory; it does not work like that. What actually happens is that as soon as you hit that line it creates the B1, B2, B3 blocks of the num RDD, but they are empty initially, they hold no data. Then what happens? When you create the filter, it again creates the B4, B5 and B6 blocks, but they are all empty too, there is no data inside them. But as soon as you hit filter1.collect, now what happens? It goes to filter1, which is nothing but B4, B5, B6, and asks it to print its data. Filter1 says "I don't have any data, I am currently blank," so filter1 goes and requests the num RDD to give it the data. B1, B2, B3 are also empty right now, so they also say "I am blank," and the request goes to F.txt; F.txt loads the data into num, num loads the data into filter1, and then filter1 gives you the output. This is called Lazy Evaluation: until you hit an action, nothing is printed and no execution happens beforehand. All the execution starts only at the moment you hit an action. If you are coming from a Pig programming background, you have already seen this behaviour — until you write a DUMP statement, nothing executes. This property is called Lazy Evaluation. Why lazy evaluation? Because we do not want to occupy the memory unnecessarily until the moment we actually need to produce the output; when we are not going to display anything, we do not do any execution, so the data does not sit in memory unnecessarily. That is Lazy Evaluation. Let us come back to the slides now. The next property is real-time computing — as and when the data is coming in, you can immediately start processing it in memory; that is the fourth property, which we have already seen. The fifth: you can work with HDFS, you can work with MapReduce — the same things we discussed — and you can also apply machine learning on top. These are the major features of Spark.
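Coming back to the lazy evaluation example for a moment, in the spark-shell it looks like this — a minimal sketch, assuming F.txt holds one number per line, using the same num and filter1 names from the example:

```scala
// Nothing is read when these two lines run; Spark only records the lineage.
val num     = sc.textFile("F.txt")          // B1, B2, B3 stay empty for now
val filter1 = num.filter(_.toInt < 10)      // B4, B5, B6 stay empty too

// Only the action triggers the work: F.txt is loaded, the filter is applied,
// and the result is handed back to the driver.
filter1.collect().foreach(println)
```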
Now let's take a break, and after that I will talk about the ecosystem, because that is a detailed topic where I need to spend a good amount of time. After the break there are still a lot of topics to cover; we will also do a practical, followed by a walkthrough of the project, so you can see what kind of project you will be building in the later Apache Spark sessions. So let's take a break of 10 minutes and be back by 4:30, friends — we will start with the ecosystem and the practicals, and this is going to be very important, so please be back by 4:30. Okay, is everyone back? Can I get a quick confirmation — everyone back, able to hear me? Great. So let's move further: now, what does Spark consist of?

Everything we have been working with so far — for example creating RDDs — is part of Spark Core. Spark Core is the main engine, and all the libraries are built on top of it. For example we have Spark SQL: there you can write a query in the SQL style, and underneath it gets converted to run on Spark, which means the computation still happens in the same distributed way. The second thing is Spark Streaming — this is the major component because of which it became possible to do real-time processing; Spark Streaming helps you perform real-time processing. Then Spark MLlib, for the machine learning algorithms — I discussed this when I was talking about Mahout; Spark MLlib is essentially the replacement for Mahout, because the algorithms that were taking hours on Hadoop with YARN take only seconds or minutes in MLlib. That is a major improvement, and that is the reason people are shifting towards it. Fifth is GraphX, where you can do graph-style computation — friend recommendations on Facebook, for example: internally it builds a graph and gives you the answer; any graph sort of computation is done using GraphX. Then SparkR — this is a newly developed member, they are still working on it, it is right at the bleeding edge in these versions. R is an open source language used by analysts, and what Spark intends is to bring all those analysts onto the Spark platform; they are working hard on SparkR, and this is going to be the next big thing in the market. Now, how does this ecosystem look? There are multiple layers. When we talk about Spark Core, most of the time every computation happens in terms of RDDs, but in Spark SQL we have something called a DataFrame. A DataFrame is very analogous to an RDD, but the difference is that the data sitting in memory is in a tabular format — along with row information you also have column information — and that is the reason we do not call it an RDD; we call it a DataFrame. Similarly, in machine learning we also have something called an ML Pipeline, which makes it easier to combine multiple algorithms — that is what ML Pipelines do in MLlib.
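Just to picture the DataFrame idea, here is a minimal spark-shell sketch — people.json is a made-up sample file; in the 1.5.x shell a SQLContext is already available as sqlContext:

```scala
// A DataFrame is like an RDD, but tabular: rows plus column names and types.
val df = sqlContext.read.json("people.json")   // columns are inferred from the JSON fields

df.printSchema()                               // shows the column names and types
df.show()                                      // shows the rows in tabular form
df.registerTempTable("people")                 // so the same data can also be queried with SQL
```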
Now let's talk about Spark Core. As we already discussed, any data residing in memory is called an RDD, and Spark Core is the component that lets you work on a large-scale parallel system — the data is distributed, so the computation also happens in parallel. When we talk about the architecture of Spark, you can relate it to Hadoop: the master machine is where your driver program sits — just like the NameNode — and your Spark context runs there. Similarly, the worker nodes are like the DataNodes. On each worker node there has to be a place in memory where you keep your blocks, and that space in memory is called an executor — as you can see there are two worker nodes here, each with an executor, meaning the space in RAM where the blocks are kept. The logic — the code you execute on an RDD, for example that filter to get all values less than 10 — is called a task. In the middle there is the cluster manager, something like YARN or Mesos, which sits as the intermediate piece: everything goes through your Spark context, then the cluster manager takes charge of the execution, and the executors on your nodes are where the tasks actually run. You can also cache your data if you wish — you can cache or persist it.
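As a rough sketch of those moving parts — the driver creating a SparkContext, the cluster manager handing out executors, tasks running inside them, and caching — a standalone driver program could look like this; the master URL, memory setting and data.txt file are example values, not anything from the demo:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ArchitectureSketch")
      .setMaster("yarn-client")            // or "local[*]" when there is no cluster
      .set("spark.executor.memory", "2g")  // RAM per executor on each worker node

    val sc = new SparkContext(conf)        // the driver program's entry point to the cluster

    val rdd = sc.textFile("data.txt")      // partitions end up spread across the executors
      .filter(_.nonEmpty)
      .cache()                             // keep them in executor memory for reuse

    println(rdd.count())                   // the action is split into tasks, one per partition
    sc.stop()
  }
}
```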

Now let's talk about Spark Streaming — the real-time kind of processing we have been mentioning for a while. What happens here is that as soon as you receive the data, you split it into small batches — small chunks of data — and immediately process them in memory; that is done with the help of Spark Streaming, and the micro-batches of data you create are called DStreams. We are only talking at a very high level about all of this, because right now I just want to give you an idea of how things work — when we go into the full sessions these things are covered in detail; in just two and a half to three hours it is impossible to cover everything, so treat this as an overview of all the topics. Is the Spark engine the same thing as Spark in general? Yes — the Spark engine is what does the work: it converts your operations into RDD computations and helps you process the data; that is its role. Now, with Spark Streaming, as I was saying, you can pull real-time data in, and that data can come from multiple sources. You can use Kafka, you can use HBase, it can come in almost any format — and once you bring the data into the Spark system in real time, you can apply anything on top of it: you can apply Spark SQL, meaning you execute SQL on it; you can run your machine learning code on it; you can apply plain RDD code on it — anything — and then store the output wherever you want: HDFS, in memory, a SQL database, Kafka, Elasticsearch, whatever you like. The main point is that as soon as the data arrives you can start processing it, and the other libraries can pick it up immediately and act on it. So this is the typical picture: you pull data from Kafka, HDFS/S3 or any other source, bring it into Spark Streaming, then save it to HDFS, a database, or maybe a UI dashboard — wherever you want. Internally it works like this: you get an input data stream, you convert it into batches of small data, and then batch by batch you produce the output. The micro-batches you create can be thought of as lots of small RDDs — that is why it is drawn that way: a DStream is a series of small batches of data, each covering a short interval of time, and the outputs are produced batch by batch. That is the high-level picture of how Spark Streaming works.
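A minimal Spark Streaming sketch of that pipeline might look like the following — it reads from a socket just to keep things simple (Kafka, Flume or HDFS would plug in through their own connectors), cuts the stream into 5-second micro-batches and counts words in each batch; the host, port and word-count logic are illustrative choices, not part of the demo:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))      // one micro-batch every 5 seconds

val lines  = ssc.socketTextStream("localhost", 9999)   // the incoming DStream
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()        // could just as well be saved to HDFS, a database or a dashboard
ssc.start()
ssc.awaitTermination()
```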
Similarly we have Spark SQL, and this is a very powerful thing because it can give you results very quickly: if you know SQL, with Spark you can just execute your queries, and that is Spark SQL. Spark SQL can handle structured and semi-structured data, but it cannot handle unstructured data — after all we are running SQL-style queries, so it only makes sense on structured and semi-structured data. It supports various formats: you can bring the data in from Parquet, JSON, Hive and so on, and the queries you write can work against different RDDs and DataFrames as well — you can use DataFrames and convert them to RDDs too, so all of those things are possible in Spark SQL. The performance, if I compare it with Hive, is very high: in the benchmark chart the red bar is the Hadoop/Hive system, and you can easily see that Spark takes far less time in comparison — that is the major advantage of using Spark SQL. It supports JDBC (the Java database connectivity driver) and ODBC (the open database connectivity driver) for creating connections. The usual workflow is this: you have a data source from which you get the data, you convert it into the DataFrame API — a DataFrame, as we said, is analogous to an RDD but tabular, so it has rows as well as column information — then the Spark SQL service runs the computation and in the end you get the output; that is the high-level picture of how Spark SQL works. You can also create your own user-defined functions, just like in Hive. So if a pre-built function already exists you use it; if not, you create your own and then execute it. If you do not know UDFs from Hive: it is a general concept, not only in Hive, where you write your own function — for example your own Java or Scala code — and use it as a function inside your queries; that is called a UDF.
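To show what a UDF looks like in practice, here is a minimal sketch against the 1.5.x API — the initial function, the people table and its name column are made-up examples, assuming the table was registered earlier as a temp table:

```scala
// Register an ordinary Scala function so it can be called from SQL by name, then use it
// inside a query — the same idea as a Hive UDF.
sqlContext.udf.register("initial", (name: String) => name.take(1).toUpperCase)

sqlContext.sql("SELECT name, initial(name) AS first_letter FROM people").show()
```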

Now let's talk about MLlib, which is the machine learning library. There are two kinds of algorithms: supervised and unsupervised. In supervised learning you already know the output — you have past examples with known answers — and using that you predict something new; in unsupervised learning you do not know anything about the output beforehand, you do not even have previous outputs, and you still want to extract some structure from the data. MLlib handles both categories. In supervised learning we have examples like classification and regression; in unsupervised learning we have clustering, SVD and so on — all of these are available in the package. Is there any limitation here? No, there is no such limitation, Sammy — you can execute all of it. In fact, besides your Spark context you also have a Hive context, so if you want to execute a Hive query you can do it with the help of the Hive context; there is no such limitation, you can keep writing code in Hive and execute it directly on Spark. Moving further, what are the various data sources in Spark SQL? We already touched on this: we have Parquet, we have JSON, and — let me quickly go back and show you — you can also get data from CSV, from HBase, and from databases like Oracle and MySQL. All of these data sources are available, so there are a lot of sources you can pull from.
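To make the supervised side concrete, here is a minimal MLlib sketch: every LabeledPoint carries a known answer (the label) plus some features, the model learns from them and then predicts a label for an unseen point — the tiny hand-made dataset and the spam/not-spam reading of the labels are purely illustrative:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),   // 1.0 could stand for "spam"
  LabeledPoint(0.0, Vectors.dense(0.5, 0.1)),   // 0.0 for "not spam"
  LabeledPoint(1.0, Vectors.dense(1.8, 2.5)),
  LabeledPoint(0.0, Vectors.dense(0.2, 0.3))
)).cache()

val model = new LogisticRegressionWithLBFGS().run(training)
println(model.predict(Vectors.dense(1.9, 2.7)))  // predicted label for a new point
```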
So what generally happens in classification? To give you an example, you must know the spam folder in your email — I hope everybody has seen the spam folder in Gmail. When a new email comes in, how does Google decide whether it is spam or not spam? That is done as an example of classification. Clustering: you might have seen in Google News that when you search for something it groups all the related news together — that is clustering. Regression is also a very important one: let's say you have a house and you want to sell it, and you have no idea what the optimal price to list it at should be — regression helps you estimate that. Collaborative filtering: you might have seen on your Amazon page that they show you recommendations — "you might buy this because you bought that" — and that is done with the help of collaborative filtering; that algorithm is used for recommendations. GraphX is another important component: in GraphX you can model and solve problems as graphs. There are two building blocks: vertices and edges. Can you see in this picture — Bob, Carol — these are the vertices, which you can also call nodes, and the connection between them is called an edge, which denotes the relationship; if there is an arrow on the edge it is a directed graph, like we have also seen in the lineage graph. Now what are the use cases? There can be many, but let's see a few examples. All of you must have used Google Maps. On the back end, when you ask for a route, the graph processing does not just search for one path — it evaluates multiple paths and shows you the most optimal one, whether by time or by distance; computing over all those paths quickly is the kind of work that is done with GraphX. Similarly there are lots of examples around recommendations: you can see Twitter or LinkedIn giving you friend or connection suggestions — all of that can be done with graphs as well.
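As a minimal GraphX sketch of that vertices-and-edges picture (the little follows-graph is made up):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// People as vertices, directed "follows" relationships as edges — the same shape as the
// Bob/Carol picture on the slide.
val vertices = sc.parallelize(Seq(
  (1L, "Bob"), (2L, "Carol"), (3L, "Dave")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),      // Bob -> Carol
  Edge(2L, 3L, "follows"),      // Carol -> Dave
  Edge(3L, 1L, "follows")       // Dave -> Bob
))

val graph = Graph(vertices, edges)
println(graph.vertices.count() + " vertices, " + graph.edges.count() + " edges")

graph.outDegrees.collect().foreach(println)   // how many people each person follows
```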

So all these recommendations happen because a graph is generated underneath, the computation runs on it and it gives you the output — that is also GraphX at work, and GraphX is a very strong library to have. Now, before I move to the project I want to show you some practical work, how we execute things in Spark. Let me take you to the VM machine, which is provided by Edureka — these machines are provided by Edureka, so you need not worry about where to get the software from or what to install; everything is taken care of. Once you come into it you will see a machine like this — a fairly blank desktop — and to start working you open the terminal by clicking on the black terminal icon. After that you can start working with Spark. How do I work with Spark? To execute any Spark program using the Scala language, you type spark-shell. If you type spark-shell it takes you to the Scala prompt, where you can write your Spark program using the Scala programming language. Notice it also shows the Spark version — 1.5.2 — and you can see that a Spark context is available as sc; when you connect to the Spark shell this is available to you by default. Let it get connected, it takes some time. Now we are connected to the Scala prompt; if I want to come out of it I just type exit and it lets me out of the shell. Secondly, I can also write my programs in Python: if I want to program Spark with Python I connect with PySpark, so I just type pyspark to get connected. I am not connecting with it now because I am not going to use Python — I will be explaining everything with Scala — but if you want to, you can type pyspark. So let's get connected to the Spark shell again. Meanwhile, let us create a file. Currently, if you notice, I only have f.txt, so let's say I create a.txt; cat a.txt shows some data: one, two, three, four, five. This is my data. Now I want to push this file into HDFS. Let me first check whether it is already there: hadoop dfs -cat a.txt — okay, there is no such file, so let me put it into HDFS with hadoop dfs -put a.txt; this puts it in the default location in HDFS, and if I want to read it I can cat that path. I am assuming you are already familiar with these HDFS commands. You can see the one, two, three, four, five now coming from the Hadoop file system. Now I want to use this file inside Spark — how can I do that? Let me come back to the shell. In Scala we do not declare types the way we do in Java — in Java you would write something like int a = 10 — in Scala we do not write the type; instead we use var. If I write var a = 10, it automatically infers that it is an integer value, and
if you notice, it tells me that a is of type Int. Now if I want to update this value to 20, I can do that. But if I try to assign "ABC" to it, it will throw an error, because a is already defined as an Int

and you are trying to assign it a String value — that is why you get the error. Similarly, there is one more keyword called val. If I write val b = 10, it works almost the same way, but with one difference: if I then do b = 20, you will see an error. Why? Because when you define something as val it is a constant — it is not a variable anymore — and that is the reason that if you define something as val you will not be able to update its value. So this is how you program in Scala: var for variables and val for constants. Now let's use it for the example we have learnt. If I want to create an RDD: val number = sc.textFile — remember, this is the API we learned for reading a file — and I give it the file a.txt. If I give it a.txt, Spark creates an RDD; it tells me it created an RDD of type String. If I want to read this data, I call number.collect, and this prints the values that were in the file — can you see? The line you are seeing here is coming from memory; it is read from memory, and that is why it shows up in this particular way. So this is how you perform these steps. The second thing: I told you Spark can also work on a standalone system. Right now we executed this against HDFS; if I want to execute it against my local file system, can I do that? Yes. The difference is in the path: instead of giving the path as before, you put the file keyword in front — file:// — and then give the local path; for example /home/edureka/ is a local path, not an HDFS path, so you write file:///home/edureka/a.txt. If you give this, it loads the file into memory from your local file system instead of from HDFS — so in the second case I am not even using HDFS at all.
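Put together, the shell session just described looks roughly like this, with the same a.txt file and the /home/edureka local path from the demo:

```scala
var a = 10                                 // var: Scala infers Int, value can change
a = 20                                     // fine
// a = "ABC"                               // error: a is already an Int

val b = 10                                 // val: a constant
// b = 20                                  // error: reassignment to val

// RDD from the file sitting in the default HDFS location:
val number = sc.textFile("a.txt")
number.collect().foreach(println)          // the action prints 1 2 3 4 5

// The same file, but read from the local file system instead of HDFS:
val local = sc.textFile("file:///home/edureka/a.txt")
local.collect().foreach(println)
```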
Now, when I first ran that, I actually got an error — "input path does not exist" — can you tell me why? Because I had made a typo in the path. But notice something: why did I not get this error earlier, on the sc.textFile line itself? The file does not exist, yet there was no error there — and that is because of lazy evaluation. Lazy evaluation means that even though I gave the wrong path, Spark just created an empty RDD and did not execute anything, so the output — or in this case the error — only shows up when I hit the action, collect. To correct it I fix the path to edureka, and this time when I execute it, it works — you can see the output: one, two, three, four, five. So now you should be clearer about lazy evaluation: even if you give a wrong file name, nothing happens until an action runs. "Suppose I want to use Spark in production but not on top of Hadoop — is that possible?" Yes, you can do that, though usually that is not what people do. There are plenty of options — you can deploy it on an Amazon cluster, for example. How does it get distribution in that case? You would be relying on some other distributed storage system; without HDFS you will not get that HDFS-style redundancy, but something like Amazon's storage is enough for that. So that is how you would use it.

So this is how you work with Spark — as I said, I am just showing you how things work at a high level. Now let us see an interesting use case; for that let us go back to the PPT, because this is going to be very interesting. This use case is earthquake detection using Spark. In Japan, you might have heard, there are a lot of earthquakes — you may not have experienced them yourself, but you must have heard that many quakes happen in Japan. So how do we solve that problem with Spark? I am just going to give you a glimpse of the kind of problems we solve in the full sessions — we are not going to walk through it in detail here, but you will get an idea of how useful Spark is. As everybody knows, an earthquake is a shaking of the surface of the earth — your home starts shaking, all those events happen; if you are from India you might remember the earthquake that came from Nepal, and even a couple of days back there was another incident — these quakes keep on coming. The very important part is this: if the earthquake is a major one — a big quake, maybe followed by a tsunami, fires or other damage — it is very important to be able to estimate that the quake is coming; they should be able to predict it beforehand. It should not happen that they find out only at the last moment, after the quake has already arrived and the damage is done — they should be able to estimate and predict all of this in advance. Japan has exactly such a system in use today, so this is a real-world use case I am presenting: Japan already uses this kind of pipeline with Spark to tackle the earthquake warning problem, and we are going to see how. Now let's see what happens in the Japan earthquake model whenever an earthquake is coming. For example, at 2:46 p.m.
on 11 March 2011, Japan's earthquake early warning was issued. As soon as it was predicted, they immediately started sending alerts to schools, to lifts and elevators, to factories, to railway stations, through the TV stations — they told everyone almost immediately, so the students in school got time to go under their desks; the bullet trains were stopped before the shaking reached them, because bullet trains run at very high speed and they wanted to make sure there were no casualties; all the elevators that were running were stopped, otherwise accidents could have happened. Sixty seconds — sixty seconds before the main tremor they were able to inform almost everyone; they sent messages, they broadcast on TV, they did everything immediately so that whoever could receive the warning would get it, and that saved a huge number of lives. How were they able to achieve that? All of this was done with the help of Apache Spark — that is the important part here. You can see that everything they are doing, they are doing on a real-time system: they could not just collect the data and process it later; they did everything in real time. They collected the data, processed it immediately, and as soon as they detected the earthquake they sent out the warning — and this happened back in 2011. They have been using it regularly since, because Japan is one of the areas most frequently affected by all of this. So as I said, the main requirements are: you should be able to process the data in real time; you should be able to handle data from multiple sources, because the readings can come from many different sensors and event feeds based on which you predict that a quake may happen; and it should be easy to use, because if the system is too complicated for the people operating it, you may not be able to solve the problem;

and in the end it has to be able to send alert messages out to everyone — all of those things are taken care of with Spark. Now, there are two kinds of waves in an earthquake: a primary wave and a secondary wave. The primary wave is the one that starts first — it begins at the epicentre and expands outwards; the secondary wave is the more severe one which comes after the primary wave, and once it starts it can do the maximum damage, because the primary wave is only the initial wave and the secondary wave rides on top of it. There is more detail to the physics, but I am not going into it here. What we are going to do using Spark is compute our ROC — the area under the ROC curve — which we will then use to solve this problem, and we will calculate it with the Spark system. Let us come back to the machine. To work on this, first exit from the shell. I have already created this project and kept it here, because I just want to give you an overview. Let me go to my Downloads section: there is a project folder with an src directory inside — this is your project. Initially you will not have all of these directories; the target directory and the project directory get created later. We will be using SBT here. If you do not know SBT, it is the Scala Build Tool, which takes care of all your dependencies — it is very similar to Maven if you already know Maven, but at the same time I prefer SBT because it is easier to write than Maven. So you write a file called build.sbt, and in this file you give the name of your project, the version, the Scala version you are using, and the dependencies you have along with their versions — for example, for Spark I am using version 1.5.2, so you are saying that whatever my program needs related to Spark, go and get it from org.apache.spark, download it and install it; if I need a dependency for a Spark Streaming program, for this particular version 1.5.2, go and fetch that artifact; and the same for Spark MLlib.
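For reference, a build.sbt along those lines might look like the following — the project name and Scala version are my own example values, while the Spark artifacts and the 1.5.2 version are the ones mentioned in the demo:

```scala
// build.sbt — project metadata plus the Spark dependencies SBT should resolve
name := "earthquake-detection"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.5.2",
  "org.apache.spark" %% "spark-streaming" % "1.5.2",
  "org.apache.spark" %% "spark-mllib"     % "1.5.2"
)
```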
Once you have written the build.sbt, you create the folder structure: you need an src folder, inside that a main folder, and inside that a folder called scala. Inside that you keep your programs — you can see streaming.scala, network.scala and r.scala here; let's keep them as black boxes for now, that is where the code for this problem statement is written. Then you come out, go to your main project folder, and from there you run sbt package. It checks your program, and whatever dependencies you require — spark-core, spark-streaming, spark-mllib — it downloads and installs them. We are not going to run it now because I have already done it before and it takes some time; that is the reason I am not doing it. Once you have run sbt package, you will find that the target and project directories have been created. After that, you go to Eclipse — let me open my Eclipse; I already have the program in front of me, but let me show you how you bring it in. You go to the Import option, and with Import you select "Existing Projects into Workspace"; next, you need to select your main project —

in this case the r2 project — and click OK. Once you do that, the project directory appears here in Eclipse. Now go to src/main/scala; ignore the other programs — I only need r.scala, because that is where I have written the main function. After that, you right-click and choose Run As, Scala Application, and your code starts executing. Let's see the output: once it finishes executing you can see it prints the area under ROC with its value — this is all computed by the Spark program. There are other programs as well which help to stream the data in, but I am not walking through all of that. Now let's come back to the PPT and see what the next step is. You can see that an Excel sheet gets created where I keep my ROC values, and after you have the ROC you generate a graph. Now, one important thing about Japan: Japan is one of the areas most affected by earthquakes, and the problem is that you do not want to start sending alerts for every minor tremor. In fact the buildings and the infrastructure in Japan are designed in such a way that if any earthquake below magnitude six comes, there will be no damage — the homes are built to handle that. So below magnitude six they are not even worried, but above six they are. For that you generate a graph — you can do that with Spark as well — and once you generate the graph, if anything goes above six you immediately raise the alarm. If you visualise the result, this is what is happening: the chart shows my ROC, and if the earthquake magnitude is going to be greater than six, then raise the alert and warn all the people; otherwise stay calm. That is the kind of project we build in our Spark sessions.
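Purely as an illustration of that above-magnitude-six rule, here is a hypothetical sketch — the sample readings and the sendAlert helper are stand-ins, not the actual project code:

```scala
// Anything at or below magnitude 6 is ignored; anything above it triggers an alert.
def sendAlert(magnitude: Double): Unit =
  println(s"ALERT: earthquake of magnitude $magnitude detected")

val readings = sc.parallelize(Seq(3.2, 4.8, 6.4, 5.1, 7.0))   // made-up magnitude values

readings.filter(_ > 6.0)        // buildings handle anything below 6, so no alert for those
        .collect()
        .foreach(sendAlert)     // alerts only for 6.4 and 7.0
```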
And it is not the only project — we build several others as well; for example, a model along the lines of what Walmart does, analysing whatever sales are happening using Apache Spark and then visualising the output of that analysis. We walk you through all of these things in the full course; you learn every topic there and see how these projects use them. Right now, since you do not yet know every topic, you cannot get 100% of the project, but once you know each and every topic you will have a clear picture of how Spark handles all these use cases. So that is what we wanted to discuss in this second part. I hope this session was useful for all of you and that you got some insight into how Apache Spark works, why we go for Apache Spark, and what the important components are. Any questions from anyone? Please ask. "Is Apache Spark truly real-time?" Usually you can get very close to real time, but strictly speaking it is near real-time, because there will always be some small delay — even my voice reaches you with a delay of a few milliseconds, and even what you see on my screen is not the exact instant it happened. So in Spark you cannot call it strictly real-time; there will always be a minor delay, and that is what we call near real-time — that is generally what we design for. Any other questions from anyone? "This session was very helpful, I learned a lot today, thanks." Thank you. So if you want to learn all of this in detail you can get in touch with Edureka — I am also one of the trainers there — and let me tell you,

this is one of the hottest topics in the market, and right now there are a lot of jobs available — do not just take my word for it, go and explore yourself and you will see how many Big Data jobs are out there. That is the reason so many people are moving towards Apache Spark, and at Edureka I have seen many students learn it and shift their careers — a lot of people have successfully got jobs in this domain. So thank you everyone for making this interactive; I hope you loved this Edureka session, and I would love to see you in another Edureka session some time. Thank you, everyone. I hope you enjoyed listening to this video; please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to our Edureka channel to learn more. Happy learning!
