So guys, the goal of this stage is to deploy the model into a production, or maybe a production-like, environment. This is basically done for final user acceptance: the users have to validate the performance of the model, and if there are any issues with the model or any issues with the algorithm, they have to be fixed in this stage. So guys, with this we come to the end of the data lifecycle. I hope this was clear.

Statistics and probability are essential because these disciplines form the basic foundation of all machine learning algorithms, deep learning, artificial intelligence, and data science. In fact, mathematics and probability are behind everything around us; from shapes, patterns, and colors to the count of petals in a flower, mathematics is embedded in each and every aspect of our lives. With this in mind, I welcome you all to today's session.

So I'm going to go ahead and discuss the agenda for today with you all. We're going to begin the session by understanding what data is. After that, we'll move on and look at the different categories of data, like quantitative and qualitative data. Then we'll discuss what exactly statistics is, the basic terminologies in statistics, and a couple of sampling techniques. Once we're done with that, we'll discuss the different types of statistics, which involve descriptive and inferential statistics.

Then, in the next section, we'll mainly be focusing on descriptive statistics. Here we'll understand the different measures of center, measures of spread, and Information Gain and entropy. We'll also understand all of these measures with the help of a use case, and finally we'll discuss what exactly a confusion matrix is. Once we've covered the entire descriptive statistics module, we'll discuss the probability module. Here we'll understand what exactly probability is and the different terminologies in probability. We'll also study the different probability distributions. Then we'll discuss the types of probability, which include marginal probability,
joint probability, and conditional probability. Then we'll move on and discuss a use case where we'll see examples that show us how the different types of probability work, and to better understand Bayes' theorem we'll look at a small example. Also, I forgot to mention that at the end of the descriptive statistics module we'll be running a small demo in the R language. For those of you who don't know much about R, I'll be explaining every line in depth, but if you want a more in-depth understanding of R, I'll leave a couple of blogs and a couple of videos in the description box; you all can definitely check out that content.

Now, after we've completed the probability module, we'll discuss the inferential statistics module. We'll start this module by understanding what point estimation is. We'll discuss what a confidence interval is and how you can estimate the confidence interval. We'll also discuss margin of error, and we'll understand all of these concepts by looking at a small use case. We'll finally end the inferential statistics module by looking at what hypothesis testing is. Hypothesis testing is a very important part of inferential statistics, so we'll end the session by looking at a use case that discusses how hypothesis testing works, and to sum everything up, we'll look at a demo that explains how inferential statistics works.

Alright, so guys, there's a lot to cover today, so let's move ahead and take a look at our first topic, which is: what is data? Now, this is quite a simple question. If I ask any of you what data is,
you'll say it's a set of numbers or some sort of documents stored on your computer. But data is actually everything. Look around you: there is data everywhere. Each click on your phone generates more data than you know, and this generated data provides insights for analysis and helps us make better business decisions. This is why data is so important.

To give you a formal definition: data refers to facts and statistics collected together for reference or analysis. This is the definition of data in terms of statistics and probability. So, as we know, data can be collected, it can be measured and analyzed, and it can be visualized by using statistical models and graphs.

Now, data is divided into two major subcategories. First we have qualitative data and quantitative data. Under qualitative data we have nominal and ordinal data, and under quantitative data we have discrete and continuous data.

Let's focus on qualitative data first. This type of data deals with characteristics and descriptors that can't be easily measured, but can be observed subjectively. Qualitative data is further divided into nominal and ordinal data. Nominal data is any sort of data that doesn't have any order or ranking. An example of nominal data is gender: there is no ranking in gender, there's only male, female, or other, right?
There is no one-two-three-four or any sort of ordering in gender. Race is another example of nominal data.

Now, ordinal data is basically an ordered series of information. Let's say that you went to a restaurant, and your information is stored in the form of a customer ID; so basically you are represented by a customer ID. Now, you would have rated their service as either good or average. That's how ordinal data is, and similarly they'll have a record of other customers who visited the restaurant, along with their ratings. So any data which has some sort of sequence or some sort of order to it is known as ordinal data. So guys, this is pretty simple to understand.

Now, let's move on and look at quantitative data. Quantitative data basically deals with numbers; you can understand that from the word quantitative itself. Quantitative is basically quantity, so it deals with numbers; it deals with anything that you can measure objectively. There are two types of quantitative data: discrete and continuous data.

Discrete data is also known as categorical data, and it can hold a finite number of possible values. Now, the number of students in a class is a finite number; you can't have an infinite number of students in a class. Let's say in your fifth grade there were a hundred students in your class: there weren't an infinite number, there was a definite, finite number of students in your class. That's discrete data.

Next we have continuous data. This type of data can hold an infinite number of possible values. So when I say the weight of a person is an example of continuous data, what I mean to say is: my weight can be 50 kg, or it can be 50.1 kg, or it can be 50.001 kg, or 50.0001, or 50.023, and so on; there are an infinite number of possible values, right?
So this is what I mean by continuous data. This is the difference between discrete and continuous data.

I'd also like to mention a few other things here. There are a couple of types of variables as well: we have discrete variables and continuous variables. A discrete variable, also known as a categorical variable, can hold values of different categories. Let's say that you have a variable called message, and there are two types of values this variable can hold: your message can either be a spam message or a non-spam message. That's when you call a variable a discrete or categorical variable, because it can hold values that represent different categories of data.

Now, continuous variables are basically variables that can store an infinite number of values, so the weight of a person can be denoted as a continuous variable. Let's say there is a variable called weight; it can store an infinite number of possible values, and that's why we call it a continuous variable. So guys, basically a variable is anything that can store a value, right? So if you associate any sort of data with a variable, it will become either a discrete variable or a continuous variable.

There are also dependent and independent types of variables. Now, we won't discuss all of that in depth, because that's pretty understandable; I'm sure all of you know what independent and dependent variables are. A dependent variable is any variable whose value depends on some other, independent variable.
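As a side note, the distinction between a discrete (categorical) variable and a continuous variable can be sketched in code. (A small Python illustration; the variable names and checks here are made up for this example, not part of the session.)

```python
# A discrete/categorical variable: only a finite set of values is possible.
MESSAGE_CATEGORIES = {"spam", "not_spam"}  # hypothetical category set

def set_message_type(value: str) -> str:
    """Only values from the fixed category set are valid."""
    if value not in MESSAGE_CATEGORIES:
        raise ValueError(f"unknown category: {value!r}")
    return value

# A continuous variable: any value in a range is possible.
def set_weight(kg: float) -> float:
    """50, 50.1, 50.001, ... are all valid weights."""
    if kg <= 0:
        raise ValueError("weight must be positive")
    return float(kg)

print(set_message_type("spam"))  # a discrete/categorical value
print(set_weight(50.001))        # a continuous value
```

The point is simply that the spam/non-spam variable can only ever take one of two category values, while the weight variable can take any of infinitely many values in its range.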
So guys, that much knowledge I expect all of you to have. Alright, so now let's move on and look at our next topic, which is: what is statistics? Coming to the formal definition: statistics is an area of applied mathematics which is concerned with data collection, analysis, interpretation, and presentation. Now, usually when I speak about statistics, people think statistics is all about analysis, but statistics has other parts to it: data collection is also a part of statistics, as are data interpretation and presentation. We're going to use statistical methods to visualize data, to collect data, and to interpret data. So this area of mathematics deals with understanding how data can be used to solve complex problems. Now, I'll give you a couple of examples that can be solved by using statistics.

Because we're focusing on statistics and probability: under probability sampling we have three different types — we have random sampling, systematic sampling, and stratified sampling. And just to mention the different types of non-probability sampling: we have snowball, quota, judgment, and convenience sampling. Alright, now guys, in this session I'll only be focusing on probability sampling, so let's move on and look at the different types of probability sampling.

So what is probability sampling? It is a sampling technique in which samples from a large population are chosen by using the theory of probability. There are three types of probability sampling. First we have random sampling. In this method, each member of the population has an equal chance of being selected in the sample; each and every individual, or each and every object, in the population has an equal chance of being part of the sample. That's what random sampling is all about: you are randomly going to select any individual or any object, so this way each individual has an equal chance of being selected, correct?
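To make random sampling concrete, here is a small sketch. (The session's demo later uses R, but this illustration is in Python; the population here is just a made-up list of IDs, and the fixed seed is only there so the example is reproducible.)

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k members without replacement; every member has an equal chance."""
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    return rng.sample(population, k)   # uniform sampling without replacement

population = list(range(1, 101))       # a toy population of 100 individual IDs
sample = simple_random_sample(population, 10)
print(sample)       # 10 distinct, randomly chosen IDs
print(len(sample))  # 10
```

Every ID in the population is equally likely to land in the sample, which is exactly the defining property of random sampling described above.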
Next we have systematic sampling. In systematic sampling, every nth record is chosen from the population to be a part of the sample. Now refer to the image that I've shown here: out of these six groups, every second group is chosen as a sample. So every second record is chosen, and this is how systematic sampling works: you're selecting every nth record and you're going to add that to your sample.

Next we have stratified sampling. In this type of technique, a stratum is used to form samples from a large population. So what is a stratum? A stratum is basically a subset of the population that shares at least one common characteristic. Let's say that your population has a mix of both male and female; you can create two stratums out of this, where one will have only the male subset and the other the female subset. This is what a stratum is: a subset of the population that shares at least one common characteristic. In our example, it is gender. So after you've created the stratums, you're going to use random sampling on these stratums to choose a final sample; random sampling meaning that all of the individuals in each of the stratums will have an equal chance of being selected in the sample, correct?

So guys, these were the three different types of sampling techniques. Now, let's move on and look at our next topic, which is the different types of statistics. So far we've discussed the basics of statistics, which is basically what statistics is, the different sampling techniques, and the terminologies in statistics. Now we'll look at the different types of statistics. There are two major types of statistics: descriptive statistics and inferential statistics. In today's session we will be discussing both of these types in depth. We'll also be looking at a demo, which I'll be running in
the R language, in order to make you understand what exactly descriptive and inferential statistics are. We're going to start with the basics, so don't worry if you don't have much knowledge; I'm explaining everything from the basic level.

Alright, so guys, descriptive statistics is a method which is used to describe and understand the features of a specific data set, by giving a short summary of the data. So it is mainly focused on the characteristics of the data, and it also provides a graphical summary of the data. Now, in order to make you understand what descriptive statistics is, let's suppose that you want to gift all your classmates a t-shirt, so you need to study the average shirt size of a student in the classroom. If you were to use descriptive statistics to study the average shirt size of students in your classroom, then what you would do is record the shirt size of all students in the class, and then find out the maximum, minimum, and average shirt size of the class.

Coming to inferential statistics: inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. So, in simple words, it generalizes from a large data set and it applies probability to draw a conclusion. It allows you to infer population parameters based on a statistical model, by using sample data. So if we consider the same example of finding the average shirt size of students in a class...

...whether or not a game can be played. Alright, so that's why the play variable has two values: it has no and yes. No, meaning that the weather conditions are not good and therefore you cannot play the game; yes, meaning that the weather conditions are good and suitable for you to play the game. Alright, so that was our problem statement; I hope the problem statement is clear to all of you. Now, to solve such a problem, we make use of something known as decision trees. So guys, think of an inverted tree, where each branch of the tree denotes some
decision. Each branch point is known as a branch node, and at each branch node you're going to take a decision, in such a manner that you will get an outcome at the end of the branch.

Now, this figure here basically shows that out of 14 observations, 9 observations result in a yes, meaning that out of 14 days, the match can be played on only nine days. So here, if you see, on day 1, day 2, day 8, day 9, and day 11, the Outlook has been sunny. So basically we try to cluster the data set depending on the Outlook: when the Outlook is sunny, this is our data set; when the Outlook is overcast, this is what we have; and when the Outlook is rain, this is what we have. So when it is sunny, we have two yeses and three nos. When the Outlook is overcast, we have all four as yes, meaning that on the four days when the Outlook was overcast, we can play the game. Now, when it comes to rain, we have three yeses and two nos.

So if you notice here, the decision is being made by choosing the Outlook variable as the root node. The root node is basically the topmost node in a decision tree. Now, what we've done here is created a decision tree that starts with the Outlook node, and then split the decision tree further depending on its values: Sunny, Overcast, and Rain.

Now, as we know, Outlook has three values, so let me explain this in a more in-depth manner. What you're doing here is making the decision tree by choosing the Outlook variable as the root node; the root node is basically the topmost node in a decision tree. Now the Outlook node has three branches coming out of it, which are Sunny, Overcast, and Rain; basically, Outlook can be sunny, it can be overcast, or it can be rainy. These three values are assigned to the immediate branch nodes, and for each of these values the possibility of play being equal
to yes is calculated. So the Sunny and the Rain branches will give you an impure output, meaning that there is a mix of yes and no: there are two yeses and three nos here, and three yeses and two nos over here. But when it comes to the Overcast value, it results in a hundred percent pure subset. This shows that the Overcast value will result in a definite and certain output.

This is exactly what entropy is used to measure: it calculates the impurity or the uncertainty. So the lesser the uncertainty, or the entropy, of a variable, the more significant that variable is. When it comes to Overcast, there's literally no impurity in the data set; it is a hundred percent pure subset, and we want variables like these in order to build a model. Now, we don't always get lucky, and we don't always find variables that will result in pure subsets. That's why we have the entropy measure: the lesser the entropy of a particular variable, the more significant that variable will be.

So in a decision tree, the root node is assigned the best attribute, so that the decision tree can predict the most precise outcome; meaning that at the root node you should have the most significant variable. That's why we've chosen Outlook. Now, some of you might ask me: why haven't you chosen Overcast? Overcast is not a variable; it is a value of the Outlook variable. That's why we've chosen Outlook: because it has a hundred percent pure subset, which is Overcast.

Now, the question in your head is: how do I decide which variable or attribute best splits the data? Right now, I looked at the data and I told you that, you know, here we have a hundred percent pure subset. But what if it's a more complex problem and you're not able to see which variable will best split the data? So guys, when it comes to decision trees, Information Gain and entropy will help you understand which variable will
best split the data set; that is, which variable you have to assign to the root node, because whichever variable is assigned to the root node will best split the data set, and it has to be the most significant variable. So how we do this is by using Information Gain and entropy.

So, from the total of 14 instances that we saw, nine of them said yes and five of the instances said no, meaning that you cannot play on that particular day. So how do you calculate the entropy? This is the formula: you just substitute the values into it. When you substitute the values into the formula, you will get a value of 0.940. This is the entropy, or the uncertainty, of the data present in our sample.

Now, in order to ensure that we choose the best variable for the root node, let us look at all the possible candidates for the root node: you can have Outlook, you can have Windy, Humidity, or Temperature. These are four variables, and you can have any one of these variables as your root node. But how do you select which variable best fits the root node?
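One way to answer that question is simply to compute the numbers. Below is a short sketch that computes the entropy of the full 14-day set and the information gain of each candidate attribute, using the yes/no counts quoted in this session. (This is Python rather than the R used in the demo; also, the per-value split for Windy, 3 yes / 3 no under true and 6 yes / 2 no under false, is the standard play-tennis breakdown, assumed here because the session only quotes the 6-true / 8-false totals.)

```python
from math import log2

def entropy(counts):
    """H = -sum(p * log2(p)) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, splits):
    """Gain = H(parent) minus the weighted average entropy of the splits."""
    total = sum(parent)
    remainder = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent) - remainder

print(round(entropy([9, 5]), 3))  # 0.94 -- entropy of the full data set

# (yes, no) counts for each value of each attribute
gains = {
    "outlook":     information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]),
    "windy":       information_gain([9, 5], [[3, 3], [6, 2]]),
    "humidity":    information_gain([9, 5], [[3, 4], [6, 1]]),
    "temperature": information_gain([9, 5], [[2, 2], [4, 2], [3, 1]]),
}
for name in sorted(gains, key=gains.get, reverse=True):
    print(name, round(gains[name], 3))  # outlook comes out highest, ~0.247
```

Running this reproduces the values discussed next: roughly 0.247 for Outlook, 0.15 for Humidity, 0.048 for Windy, and 0.029 for Temperature, which is why Outlook wins the root node.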
That's what we're going to see by using Information Gain and entropy. So guys, now the task at hand is to find the information gain for each of these attributes: for Outlook, for Windy, for Humidity, and for Temperature, we're going to find the information gain. Now, a point to remember is that the variable that results in the highest information gain must be chosen, because it will give us the most precise output information.

So let's calculate the information gain for the attribute Windy first. Here we have six instances of true and eight instances of false. When you substitute all the values into the formula, you will get a value of 0.048. Now, this is a very low value of information gain, so the information that you're going to get from the Windy attribute is pretty low.

So let's calculate the information gain of the attribute Outlook. From the total of 14 instances, we have five instances which say sunny, four instances which are overcast, and five instances which are rainy. For sunny we have two yeses and three nos, for overcast we have all four as yes, and for rainy we have three yeses and two nos. So when you calculate the information gain of the Outlook variable, you'll get a value of 0.247. Now compare this to the information gain of the Windy attribute: this value is actually pretty good, right? We have 0.247, which is a pretty good value of information gain.

Now, let's look at the information gain of the attribute Humidity. Over here we have seven instances which say high and seven instances which say normal. Under the high branch node, we have three instances which say yes and the remaining four instances which say no; similarly, under the normal branch, we have six instances which say yes and one instance which says no. So when you calculate the information gain for the Humidity variable,
you're going to get a value of 0.151. Now, this is also a pretty decent value, but when you compare it to the information gain of the attribute Outlook, it is less.

Now let's look at the information gain of the attribute Temperature. So the Temperature attribute can hold three values: hot, mild, and cool. Under hot we have two instances which say yes and two instances of no; under mild we have four instances of yes and two instances of no; and under cool we have three instances of yes and one instance of no. When you calculate the information gain for this attribute, you will get a value of 0.029, which is again very low.

So what you can summarize from here is: if we look at the information gain for each of these variables, we'll see that for Outlook we have the maximum gain. We have 0.247, which is the highest information gain value, and you must always choose the variable with the highest information gain to split the data at the root node. So that's why we assign the Outlook variable to the root node.

Alright, so guys, I hope this use case was clear. If any of you have doubts, please keep posting them in the comments. Now, let's move on and look at what exactly a confusion matrix is. The confusion matrix is the last topic for descriptive statistics; after this I'll be running a short demo where I'll be showing you how you can calculate mean, median, mode, standard deviation, variance, and all of those values by using R.

So let's talk about the confusion matrix. Now guys, what is a confusion matrix? Don't get confused, this is not a complex topic. A confusion matrix is a matrix that is often used to describe the performance of a model, right?
And this is specifically used for classification models...

...and how to study the variables by plotting a histogram. Don't worry if you don't know what a histogram is; it's basically a frequency plot, there's no big science behind it. Alright, this is a very simple demo, but it also forms a foundation that every machine learning algorithm is built upon. You can say that most machine learning algorithms, actually all the machine learning algorithms and deep learning algorithms, have this basic concept behind them: you need to know how mean, median, mode, and all of that are calculated.

So guys, I'm using the R language to perform this, and I'm running it in RStudio. For those of you who don't know the R language, I'll leave a couple of links in the description box; you can go through those videos. So what we're doing is randomly generating numbers and storing them in a variable called data. So if you want to see the generated numbers, just run the line data; this variable basically stores all our numbers.

Now, what we're going to do is calculate the mean. All you have to do in R is specify the function mean along with the data that you're calculating the mean of, and I've assigned this whole thing to a variable called mean, which will just hold the mean value of this data. So now let's look at the mean; for that we use the function print, with mean. So our mean is around 5.99.

Next is calculating the median. It's very simple, guys: all you have to do is use the function median and pass the data as a parameter to this function. That's all you have to do; R provides functions for each and everything. Statistics is very easy when it comes to R, because R is basically a statistical language. So all you have to do is name the function, and that function is already built into R. So your median is around 6.4.

Similarly, we will calculate the mode. Alright, let's run this
function; I basically created a small function for calculating the mode. So guys, this is our mode, meaning this is the most recurrent value.

Right, now we're going to calculate the variance and the standard deviation. For that, again, we have a function in R called var; all you have to do is pass the data to that function. Similarly, we'll calculate the standard deviation, which is basically the square root of your variance, and then we'll print the standard deviation, right? This is our standard deviation value.

Now, finally, we will just plot a small histogram. A histogram is nothing but a frequency plot: it'll show you how frequently a data point occurs. So this is the histogram that we've just created. It's quite simple in R, because R has a lot of packages and a lot of inbuilt functions that support statistics. It is a statistical language that is mainly used by data scientists, data analysts, and machine learning engineers, because they don't have to sit and code these functions; all they have to do is mention the name of the function and pass the corresponding parameters.

So guys, that was the entire descriptive statistics module, and now we will discuss probability. Okay, so before we understand what exactly probability is, let me clear up a very common misconception. People often tend to ask me this question: what is the relationship between statistics and probability?
So, probability and statistics are related fields. Probability is a mathematical method used for statistical analysis; therefore we can say that probability and statistics are interconnected branches of mathematics that deal with analyzing the relative frequency of events. They're very interconnected fields: probability makes use of statistics, and statistics makes use of probability. So that is the relationship between statistics and probability.

Now, let's understand what exactly probability is. Probability is the measure of how likely an event is to occur. To be more precise, it is the ratio of desired outcomes to the total outcomes. Now, the probabilities of all outcomes always sum up to 1, and a probability cannot go beyond 1. So your probability can be 0, or it can be 1, or it can be in the form of decimals like 0.52 or 0.55, or it can be 0.5, 0.7, 0.9, but its value will always stay in the range between 0 and 1, okay?

Another famous example of probability is the rolling-a-die example. So when you roll a die, you get six possible outcomes: the one, two, three, four, five, and six faces of the die. Now, each possibility has only one outcome. So what is the probability that on rolling a die
you will get a 3? The probability is 1/6, right? Because there's only one face which has the number 3 on it, out of six faces; there's only one face which has the number three. So the probability of getting a 3 when you roll a die is 1/6. Similarly, if you want to find the probability of getting the number 5, again the probability is going to be 1/6. So all of these will sum up to 1. Alright, so guys, this is exactly what probability is. It's a very simple concept; we all learnt it from 8th standard onwards.

Now, let's understand the different terminologies that are related to probability. There are three terminologies that you often come across when we talk about probability. We have something known as a random experiment: it's basically an experiment or a process for which the outcomes cannot be predicted with certainty. That's why you use probability: you're going to use probability in order to predict the outcome with some sort of certainty. The sample space is the entire possible set of outcomes of a random experiment, and an event is one or more outcomes of an experiment.

So consider the example of rolling a die. Let's say that you want to find out the probability of getting a 2 when you roll the die. Finding this probability is the random experiment. The sample space is basically your entire set of possibilities: the outcomes 1 to 6 are all your possible outcomes, so they represent your sample space. Now, an event is one or more outcomes of an experiment; so in this case, my event is getting a 2 when I roll the die, right?
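If you want to convince yourself of that 1/6 figure, you can also simulate the random experiment we just described and check that each face turns up about one sixth of the time. (A Python sketch; the roll count and seed are arbitrary choices for the illustration.)

```python
import random

def relative_frequency(face, rolls=100_000, seed=42):
    """Estimate P(die shows `face`) as hits / total over many simulated rolls."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(rolls) if rng.randint(1, 6) == face)
    return hits / rolls

estimate = relative_frequency(3)
print(estimate)                      # close to 1/6 = 0.1667
print(abs(estimate - 1 / 6) < 0.01)  # True: the frequency settles near 1/6
```

The more rolls you simulate, the closer the relative frequency gets to the theoretical probability of 1/6.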
So my event is getting a 2 when I roll the die. So guys, this is basically what random experiment, sample space, and event really mean.

Alright, now let's discuss the different types of events. There are two types of events that you should know about: disjoint and non-disjoint events. Disjoint events are events that do not have any common outcome. For example, if you draw a single card from a deck of cards, it cannot be both a king and a queen, correct? It can either be a king or it can be a queen. Now, non-disjoint events are events that have common outcomes. For example, a student can get a hundred marks in statistics and a hundred marks in probability; also, the outcome of a ball delivered can be a no ball and it can also be a six, right? So this is what non-disjoint events are. These are very simple to understand.

Now, let's move on and look at the different types of probability distributions. I'll be discussing the three main probability distribution functions: the probability density function, the normal distribution, and the central limit theorem. The probability density function, also known as the PDF, is concerned with the relative likelihood of a continuous random variable taking on a given value. So the PDF gives the probability of the variable lying between the range a and b; basically, what you're trying to do is find the probability of a continuous random variable over a specified range. Now, this graph denotes the PDF of a continuous variable. This graph is also known as the bell curve, right?
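Before going through its properties in detail, here is a quick numeric sanity check you can run on a density function. (A Python sketch, written against the standard normal density; the trapezoidal integration is just an approximation over a wide finite range, and SciPy is deliberately avoided so the snippet stays self-contained.)

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal (Gaussian) distribution."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def area_under(f, a, b, n=100_000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

# The total area under the curve is 1 (approximated over a wide range here).
print(round(area_under(normal_pdf, -10, 10), 4))  # 1.0

# P(a <= X <= b) is the area between a and b; one standard deviation on
# either side of the mean covers about 68% of the probability.
print(round(area_under(normal_pdf, -1, 1), 3))    # 0.683
```

This matches the properties discussed next: the whole curve bounds an area of 1, and the probability of landing in any interval is the area over that interval.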
It's famously called the bell curve because of its shape, and there are three important properties that you need to know about a probability density function. First, the graph of a PDF is continuous over a range; this is because you're finding the probability that a continuous variable lies between the ranges a and b. The second property is that the area bounded by the curve of the density function and the x-axis is equal to 1; basically, the area below the curve is equal to 1, because it denotes probability, and again, probability cannot range beyond 1: it has to be between 0 and 1. Property number three is that the probability that a random variable assumes a value between a and b is equal to the area under the PDF bounded by a and b. What this means is that the probability value is denoted by the area of the graph: whatever value you get here is the probability that the random variable lies between the range a and b.

So I hope all of you have understood the probability density function. It's basically the probability of finding the value of a continuous random variable between the range a and b.

Now, let's look at our next distribution, which is the normal distribution. The normal distribution, which is also known as the Gaussian distribution, is a probability distribution that denotes the symmetric property of the mean, meaning that the idea behind this function is that data near the mean occurs more frequently than data away from the mean. What it means to say is that the data around the mean represents the entire data set, so if you just take a sample of data around the mean, it can represent the entire data set. Now, similar to the probability density function, the normal distribution appears as a bell curve.

Now, when it comes to the normal distribution, there are two important factors: we have the mean of the population and the standard deviation. So, the mean
determines the location of the center of the graph, right, and the standard deviation determines the height of the graph. Okay, so if the standard deviation is large, the curve is going to look something like this: it'll be short and wide. And if the standard deviation is small, the curve is tall and narrow. All right, so this was it about the normal distribution. Now let's look at the central limit theorem. The central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. Now, that's a little confusing, okay? Let me break it down for you. In simple terms, if we had a large population and we divided it into many samples, then the mean of all the samples from the population will be almost equal to the mean of the entire population, right? Meaning that each of the samples is normally distributed. So if you compare the mean of each of the samples, it will almost be equal to the mean of the population. This graph basically gives a clearer understanding of the central limit theorem: you can see each sample here, and the mean of each sample is almost along the same line, right? Okay, so this is exactly what the central limit theorem states. Now the accuracy, or the resemblance to the normal distribution, depends on two main factors, right?
So the first is the number of sample points that you consider, and the second is the shape of the underlying population. Now the shape obviously depends on the standard deviation and the mean of a sample, correct? So guys, the central limit theorem basically states that each sample will be normally distributed in such a way that the mean of each sample will coincide with the mean of the actual population. All right, in short terms that's what the central limit theorem states. And this holds true mainly for a large data set; for a small data set there are more deviations when compared to a large data set, and this is because of the scaling factor: a small deviation in a small data set will change the value very drastically, but in a large data set a small deviation will not matter at all. Now let's move on and look at our next topic, which is the different types of probability. This is an important topic, because most of your problems can be solved by understanding which type of probability you should use to solve the problem, right?
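Before that, the central limit theorem just described can be illustrated with a quick simulation. This is only a sketch: the uniform population, the sample size of 50, and the counts below are arbitrary choices for illustration, not numbers from the session.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 values drawn uniformly between 0 and 100,
# so the population itself is NOT normally distributed.
population = [random.uniform(0, 100) for _ in range(100_000)]
population_mean = sum(population) / len(population)

# Draw many independent samples and record each sample's mean.
sample_means = []
for _ in range(1_000):
    sample = random.sample(population, 50)  # sample size 50
    sample_means.append(sum(sample) / len(sample))

# The sample means cluster around the population mean, and their
# distribution is approximately normal -- the central limit theorem.
mean_of_sample_means = sum(sample_means) / len(sample_means)
print(round(population_mean, 2), round(mean_of_sample_means, 2))
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the population is flat.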
So we have three important types of probability: marginal, joint, and conditional probability. Let's discuss each of these. Now, the probability of an event occurring unconditioned on any other event is known as marginal or unconditional probability. So let's say that you want to find the probability that a card drawn is a heart. The probability will be 13 by 52, since there are 13 hearts in a deck of cards and 52 cards in the total deck. So your marginal probability will be 13 by 52. That's about marginal probability. Now let's understand what joint probability is. Joint probability is a measure of two events happening at the same time. Let's say that the two events are A and B; the probability of events A and B occurring together is the intersection of A and B. So for example, if you want to find the probability that a card is a four and red, that would be a joint probability, because you're finding a card that is a four and the card has to be red in color. The answer to this would be 2 by 52, because we have one four in hearts and one four in diamonds, correct? Both of these are red in color, therefore our probability is 2 by 52, and if you reduce it further it is 1 by 26, right? So this is what joint probability is all about. Moving on, let's look at what exactly conditional probability is: the probability of an event or outcome based on the occurrence of a previous event or outcome is a conditional probability. Now, Naive Bayes is a supervised learning classification algorithm, and it is mainly used in Gmail spam filtering. A lot of you might have noticed that if you open up Gmail you'll see that you have a folder called spam; all of that is carried out through machine learning, and the algorithm used there is Naive Bayes, right?
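The marginal and joint probabilities from the card examples above are easy to verify by enumerating a deck in code; a minimal sketch:

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Marginal probability: P(card is a heart) = 13/52
p_heart = Fraction(sum(1 for _, suit in deck if suit == "hearts"), len(deck))

# Joint probability: P(card is a four AND red) = 2/52
red_suits = {"hearts", "diamonds"}
p_four_and_red = Fraction(
    sum(1 for rank, suit in deck if rank == "4" and suit in red_suits),
    len(deck),
)

print(p_heart, p_four_and_red)  # Fraction reduces 13/52 to 1/4 and 2/52 to 1/26
```

Using `Fraction` keeps the arithmetic exact, which matches the by-hand reductions in the lecture.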
So now let's discuss what exactly the Bayes theorem is and what it denotes. The Bayes theorem is used to show the relation between one conditional probability and its inverse. Basically, it's nothing but the probability of an event occurring based on prior knowledge of conditions that might be related to the same event. So mathematically, the Bayes theorem is represented like it is shown in this equation. The left-hand side is what is known as the posterior: it refers to the probability of occurrence of event A given an event B, right? The second term is referred to as the likelihood ratio: this measures the probability of occurrence of B given an event A. Now P of A is also known as the prior, which refers to the actual probability distribution of A, and P of B is again the probability of B, right? This is the Bayes theorem. In order to better understand it, let's look at a small example. Let's say that we have three bowls: bowl A, bowl B, and bowl C. Bowl A contains two blue balls and four red balls, bowl B contains eight blue balls and four red balls, and bowl C contains one blue ball and three red balls. Now if we draw one ball from each bowl, what is the probability of drawing a blue ball from bowl A if we know that we drew exactly a total of two blue balls? If you didn't understand the question, please read it again; I shall pause for a second or two. Right, so I hope all of you have understood the question. Now what I'm going to do is draw a blueprint for you and tell you how exactly to solve the problem, but I want you all to give me the solution to this problem, right?
I'll draw a blueprint and tell you what exactly the steps are, but I want you to come up with a solution on your own. The formula is also given to you; everything is given to you. All you have to do is come up with the final answer. Right, let's look at how you can solve this problem. First of all, let A be the event of picking a blue ball from bowl A, and let X be the event of picking exactly two blue balls, because these are the two events whose probability we need to calculate. There are two probabilities you need to consider here: one is the event of picking a blue ball from bowl A, and the other is the event of picking exactly two blue balls. These two are represented by A and X respectively. So what we want is the probability of occurrence of event A given X, which means: given that we're picking exactly two blue balls, what is the probability that we are picking a blue ball from bowl A? By the definition of conditional probability, this is exactly what our equation will look like, correct? This is the probability of occurrence of event A given event X; this is the probability of A and X together; and this is the probability of X alone. And what we need to do is find these two probabilities: the probability of A and X occurring together, and the probability of X. Okay, this is the entire solution. So how do you find the probability of X? X basically represents the event of picking exactly two blue balls, and there are three ways in which it is possible: you pick one blue ball from bowl A and one from bowl B; in the second case you pick one blue ball from bowl A and another blue ball from bowl C; and in the third case you pick a blue ball from bowl B and a blue ball from bowl C. Right?
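If you want to check your answer later, the blueprint above can be written out in a few lines of Python. This is a sketch assuming the bowl contents stated earlier (A: 2 blue and 4 red, B: 8 blue and 4 red, C: 1 blue and 3 red):

```python
from fractions import Fraction

# Probability of drawing a blue ball from each bowl.
p_blue_a = Fraction(2, 6)    # bowl A: 2 blue out of 6 balls
p_blue_b = Fraction(8, 12)   # bowl B: 8 blue out of 12 balls
p_blue_c = Fraction(1, 4)    # bowl C: 1 blue out of 4 balls

# X = exactly two blue balls among the three draws (three disjoint ways).
p_ab = p_blue_a * p_blue_b * (1 - p_blue_c)  # blue from A and B, red from C
p_ac = p_blue_a * (1 - p_blue_b) * p_blue_c  # blue from A and C, red from B
p_bc = (1 - p_blue_a) * p_blue_b * p_blue_c  # blue from B and C, red from A
p_x = p_ab + p_ac + p_bc

# A and X together: the blue ball from bowl A appears in the first two cases.
p_a_and_x = p_ab + p_ac

# Conditional probability P(A | X) = P(A and X) / P(X)
p_a_given_x = p_a_and_x / p_x
print(p_x, p_a_and_x, p_a_given_x)
```

Dividing P(A and X) by P(X) is exactly the conditional-probability definition from the blueprint.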
These are the three ways in which it is possible, so you need to find the probability of each of them. Step two is that you need to find the probability of A and X occurring together: this is the sum of terms 1 and 2, because in both of those cases we are picking a blue ball from bowl A, correct? So find out this probability and let me know your answer in the comment section. We'll see if you get the answer right. I gave you the entire solution; all you have to do is substitute the values. If you want a second or two, I'm going to pause on the screen so that you can go through this more clearly. Remember that you need to calculate two probabilities: the first is the event of picking a blue ball from bowl A given that you're picking exactly two blue balls; the second is the event of picking exactly two blue balls. These are the two probabilities you need to calculate, so remember that, and this is the solution. All right, so guys, make sure you mention your answers in the comment section. For now, let's move on and look at our next topic, which is inferential statistics. So guys, we just completed the probability module; now we will discuss inferential statistics, which is the second type of statistics. We discussed descriptive statistics earlier. Like I mentioned, inferential statistics, also known as statistical inference, is a branch of statistics that deals with forming inferences and predictions about a population based on a sample of data taken from the population. And the question you should ask is: how does one form inferences or predictions from a sample? The answer is, you use point estimation.
Okay, now you must be wondering what point estimation is. Point estimation is concerned with the use of sample data to measure a single value which serves as an approximate value, or the best estimate, of an unknown population parameter. That's a little confusing, so let me break it down for you. For example, in order to calculate the mean of a huge population, what we do is first draw out a sample of the population and then find the sample mean. The sample mean is then used to estimate the population mean. This is basically a point estimate: you're estimating the value of one of the parameters of the population — basically the mean; you're trying to estimate the value of the mean. This is what point estimation is. The two main terms in point estimation are something known as the estimator and something known as the estimate. The estimator is a function of the sample that is used to find out the estimate. In this example it's basically the sample mean: a function that calculates the sample mean is known as the estimator, and the realized value of the estimator is the estimate. So I hope point estimation is clear. Now, how do you find the estimates?
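As a quick illustration of the idea just described, here is a minimal sketch in Python. The population values are invented for illustration, and the estimator is simply the sample-mean function:

```python
import random

random.seed(42)

# Hypothetical population of 10,000 values (made up for illustration).
population = [random.gauss(mu=70, sigma=12) for _ in range(10_000)]
true_mean = sum(population) / len(population)

def sample_mean(sample):
    """The estimator: a function of the sample."""
    return sum(sample) / len(sample)

# Draw one sample; the realized value of the estimator is the estimate.
sample = random.sample(population, 200)
estimate = sample_mean(sample)  # point estimate of the population mean
print(round(estimate, 1), round(true_mean, 1))
```

The estimate will not equal the population mean exactly, which is exactly why interval estimates and margins of error come up next.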
There are four common ways in which you can do this. The first one is the method of moments: what you do is form an equation using the sample data set, and then you analyze the corresponding equation for the population as well, like the population mean, population variance, and so on. In simple terms, you're taking known facts about the population and extending those ideas to the sample; once you do that, you can analyze the sample and estimate more essential or more complex values. Next we have maximum likelihood: this method basically uses a model to estimate a value, and maximum likelihood is majorly based on probability, so there's a lot of probability involved in this method. Next we have the Bayes estimator: this works by minimizing the error, or the average risk, and it has a lot to do with the Bayes theorem. Let's not get into the depth of these estimation methods. Finally, we have the best unbiased estimators: in this method, several unbiased estimators can be used to approximate a parameter. So guys, these were a couple of methods that are used to find the estimate, but the most well-known method is interval estimation. This is one of the most important estimation methods, and this is where the confidence interval also comes into the picture. Apart from interval estimation we also have something known as the margin of error, and I'll be discussing all of this in the upcoming slides. So first let's understand what an interval estimate is. An interval, or range of values, which is used to estimate a population parameter is known as an interval estimate, right?
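An interval estimate is normally built as a point estimate plus or minus a margin of error, computed as z times the standard deviation over the square root of n. A minimal sketch — the z-score of 1.96 corresponds to a 95% confidence level, and the other numbers are example values:

```python
import math

def margin_of_error(z, std_dev, n):
    """Margin of error for a sample mean: z * s / sqrt(n)."""
    return z * std_dev / math.sqrt(n)

# Example values: 95% confidence (z = 1.96), sample standard
# deviation 23.44, sample size 32.
e = margin_of_error(1.96, 23.44, 32)

# The interval estimate is the point estimate plus/minus the margin of
# error; sample_mean here is a made-up placeholder value.
sample_mean = 74.0
interval = (sample_mean - e, sample_mean + e)
print(round(e, 2), interval)
```

The width of the interval shrinks as n grows, which is why larger samples give tighter estimates.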
That's very understandable. Basically, what this is saying is that you're going to estimate the value of a parameter; let's say you're trying to find the mean of a population. What you're going to do is build a range, and your value will lie in that range, or interval. This way your output is going to be more accurate, because you've not predicted a single point estimate; instead, you have estimated an interval within which your value might occur, right? Now this image clearly shows how a point estimate and an interval estimate are different. To build one, you first pick a statistic; this can be anything, like the mean of the sample. Next you select a confidence level; the confidence level describes the uncertainty of a sampling method. After that you find something known as the margin of error — we discussed margin of error earlier, so you find this based on the equation that I explained in the previous slide — and then you finally specify the confidence interval. Now, let's look at a problem statement to better understand this concept. A random sample of 32 textbook prices is taken from a local college bookstore; the mean of the sample is such and such, and the sample standard deviation is 23.44. Use a 95% confidence level and find the margin of error for the mean price of all textbooks in the bookstore. Now, this is a very straightforward question; if you want, you can read the question again. All you have to do is substitute the values into the equation. So guys, we know the formula for the margin of error: you take the z-score from the table; after that we have the standard deviation, which is 23.44; and n stands for the number of samples, which here is 32, basically 32 textbooks. So approximately your margin of error is going to be around 8.12. This is a pretty simple question; I hope all of you understood it. Now that you know the idea behind the confidence interval, let's move ahead to one of
the most important topics in statistical inference, which is hypothesis testing. So basically, statisticians use hypothesis testing to formally check whether a hypothesis is accepted or rejected. Hypothesis testing is an inferential statistical technique used to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population. To understand the characteristics of the general population, we take a random sample and analyze the properties of the sample; we test whether or not the identified conclusion represents the population accurately, and finally we interpret the results. Now, whether or not to accept the hypothesis depends upon the percentage value that we get from the test. To better understand this, let's look at a small example. Before that, there are a few steps that are followed in hypothesis testing: you begin by stating the null and the alternative hypothesis (I'll tell you what exactly these terms are), then you formulate an analysis plan; after that you analyze the sample data, and finally you interpret the results. Now, to understand the entire hypothesis-testing process, we'll look at a good example. Consider four boys: Nick, John, Bob, and Harry. These boys were caught bunking class, and they were asked to stay back at school and clean their classroom as a punishment. So what John did is he decided that the four of them would take turns cleaning the classroom. He came up with a plan of writing each of their names on chits and putting them in a bowl; now every day they had to pick up a name from the bowl, and that person had to clean the class, right?
That sounds pretty fair. Now, it has been three days and everybody's name has come up except John's. Assuming that this event is completely random and free of bias, what is the probability of John not cheating — that is, the probability that he's not actually rigging the draw? This can be solved by using hypothesis testing. So we'll begin by calculating the probability of John not being picked for a day. We're going to assume that the event is free of bias, so we need to find out the probability of John not cheating. First we find the probability that John is not picked on a given day: we get 3 out of 4, which is basically 75%. 75% is fairly high, but if John is not picked for three days in a row, the probability drops down to approximately 42%. Now, let's consider a situation where John is not picked for 12 days in a row: the probability drops down to 3.2 percent. At that point, the probability that John is cheating becomes fairly high, right?
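The numbers above are just powers of 3/4; a quick sketch, using nothing beyond the figures in the example, reproduces them:

```python
# Probability that John is NOT picked on a single day: 3 names out of 4.
p_not_picked = 3 / 4

# Not picked several days in a row (independent draws from the bowl).
p_3_days = p_not_picked ** 3    # about 0.42, i.e. roughly 42%
p_12_days = p_not_picked ** 12  # about 0.032, i.e. roughly 3.2%

print(round(p_3_days, 4), round(p_12_days, 4))
```

As the streak lengthens, the "he's just lucky" explanation becomes less and less plausible, which is the intuition behind the threshold discussed next.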
So in order for statisticians to come to a conclusion, they define what is known as a threshold value. Considering the above situation, if the threshold value is set to 5 percent, it would indicate that if the probability lies below 5%, then John is cheating his way out of detention; but if the probability is above the threshold value, then John is just lucky and his name isn't getting picked. So the probability... So basically, you'll encounter a lot of inconsistencies in the data set. This includes missing values, redundant variables, duplicate values, and so on. Removing such values is very important, because they might lead to wrongful computations and predictions. That's why at this stage you must scan the entire data set for any inconsistencies and fix them. Now, the next step is exploratory data analysis. Data analysis is all about diving deep into the data and finding all the hidden data mysteries — this is where you become a detective. EDA, or exploratory data analysis, is like the brainstorming stage of machine learning. Data exploration involves understanding the patterns and trends in your data, so at this stage all the useful insights are drawn and all the correlations between the variables are understood. So you might ask, what sort of correlations are you talking about?
For example, in the case of predicting rainfall, we know that there is a strong possibility of rain if the temperature has fallen low. Such correlations have to be understood and mapped at this stage. Now, this stage is followed by stage number 5, which is building a machine learning model. All the insights and patterns derived during data exploration are used to build the machine learning model. This stage always begins by splitting the data set into two parts: training data and testing data. Earlier in the session I already told you what training and testing data are. The training data will be used to build and analyze the model, and the logic of the model will be based on the machine learning algorithm that is being implemented. In the case of predicting rainfall, since the output will be in the form of true or false, we can use a classification algorithm like logistic regression. Choosing the right algorithm depends on the type of problem you're trying to solve, the data set you have, and the level of complexity of the problem. In the upcoming sections we'll be discussing the different types of problems that can be solved by machine learning, so don't worry if you don't know what a classification algorithm is or what logistic regression is. All you need to know is that at this stage you'll be building a machine learning model using a machine learning algorithm and the training data set. The next step in the machine learning process is model evaluation and optimization. After building a model using the training data set, it is finally time to put the model to the test: the testing data set is used to check the efficiency of the model and how accurately it can predict the outcome. Once you calculate the accuracy, any improvements to the model have to be implemented at this stage; methods like parameter tuning and cross-validation can be used to improve the performance of the model. This is followed by
the last stage, which is predictions. Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a categorical variable or a continuous quantity. In our case, for predicting the occurrence of rainfall, the output will be a categorical variable: our output will be in the form of true or false, yes or no. Yes basically represents that it is going to rain, and no represents that it won't rain — as simple as that. So guys, that was the entire machine learning process. Now, linear regression is one of the easiest algorithms in machine learning. It is a statistical model that attempts to show the relationship between two variables using a linear equation. But before we drill down into the linear regression algorithm in depth, I'll give you a quick overview of today's agenda. We'll start the session with a quick overview of what regression is, as linear regression is one type of regression algorithm. Once we learn about regression, its use cases, and its various types, we'll learn about the algorithm from scratch, where I'll walk you through its mathematical implementation first. Then we'll drill down to the coding part and implement linear regression using Python. In today's session we'll build the linear regression algorithm using the least squares method, check its goodness of fit — how close the data is to the fitted regression line — using the R-squared method, and then finally optimize it using the gradient descent method. In the last part, the coding session, I'll teach you to implement linear regression using Python. The coding session will be divided into two parts: the first part will consist of linear regression in Python from scratch, where you will use the mathematical algorithm you have learned in this session, and in the second part we'll be using scikit-learn for a direct implementation of linear regression. All right, I hope the agenda is clear to you guys. So let's
begin our session with what regression is. Well, regression analysis is a form of predictive modeling technique which investigates the relationship between variables. Its time complexity usually falls in either big O of x squared or big O of x to the n. Next, it is comprehensible and transparent: linear regression models are easily comprehensible and transparent in nature; they can be represented in simple mathematical notation and understood very easily. So these are some of the criteria on which you would select the linear regression algorithm. Next is where linear regression is used. First is evaluating trends and sales estimates. Linear regression can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, then conducting a linear analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, will give you a line that captures the upward trend in sales. After creating the trendline, the company could use the slope of the line to forecast sales in future months. Next is analyzing the impact of price changes. Linear regression can be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price of a certain product several times, it can record the quantity sold for each price level and then perform a linear regression with sold quantity as the dependent variable and price as the independent variable. This would result in a line that depicts the extent to which customers reduce their consumption of the product as the price increases, and this result would help in future pricing decisions. Next is the assessment of risk in the financial services and insurance domain. Linear regression can be used to analyze risk; for example, a health insurance company might conduct a linear regression analysis by plotting the number of claims per customer against their age, and they might discover that older customers tend
to make more health insurance claims. The results of such an analysis might guide important business decisions. All right, so by now you have a rough idea of what the linear regression algorithm is: what it does, where it is used, and when you should use it. Now, let's move on and understand the algorithm in depth. Suppose you have the independent variable on the x-axis and the dependent variable on the y-axis, and these are the data points: as the independent variable increases on the x-axis, so does the dependent variable on the y-axis. So what kind of linear regression line would you get? You would get a positive linear regression line, as the slope would be positive. Next, suppose you have an independent variable on the x-axis which is increasing, while the dependent variable on the y-axis is decreasing. What kind of line will you get in that case? You will get a negative regression line, as the slope of the line is negative. And this particular line, the line y = mx + c, is the line of linear regression, which shows the relationship between the independent variable and the dependent variable. This line is known as the line of linear regression. Okay?
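The positive and negative slopes described above fall straight out of the equation y = mx + c; a tiny sketch with arbitrary values of m and c:

```python
def line(m, c):
    """Return a function computing y = m*x + c."""
    return lambda x: m * x + c

rising = line(0.5, 1.0)    # positive slope: y grows as x grows
falling = line(-0.5, 5.0)  # negative slope: y shrinks as x grows

xs = [0, 1, 2, 3, 4]
print([rising(x) for x in xs])   # [1.0, 1.5, 2.0, 2.5, 3.0]
print([falling(x) for x in xs])  # [5.0, 4.5, 4.0, 3.5, 3.0]
```

The whole job of linear regression is to pick the m and c that best match the observed data.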
So let's add some data points to our graph. These are some observations, or data points, on our graph; let's plot some more. Now all our data points are plotted, and our task is to create the regression line, or the best-fit line. Once our regression line is drawn, it's time for prediction. Suppose this is our estimated, or predicted, value, and this is our actual value. Our main goal is to reduce the error — that is, to reduce the distance between the estimated or predicted value and the actual value. The best-fit line is the one with the least error, the least difference between the estimated and actual values; in other words, we have to minimize the error. That was a brief understanding of the linear regression algorithm; soon we'll jump to the mathematical implementation. But until then, let me tell you this: suppose you draw a graph with speed on the x-axis and distance covered on the y-axis, with time remaining constant. If you plot the speed traveled by a vehicle against the distance traveled in a fixed unit of time, you will get a positive relationship. Suppose the equation of the line is y = mx + c; then in this case y is the distance traveled in a fixed duration of time, x is the speed of the vehicle, m is the positive slope of the line, and c is the y-intercept of the line. Now suppose instead the distance is constant and you plot the speed of the vehicle against the time taken to travel a fixed distance; in that case you will get a line with a negative relationship, because the slope of the line is negative. The equation of the line changes to y = -mx + c, where y is the time taken to travel a fixed distance, x is the speed of the vehicle, m is the slope of the line (negative here), and c is the y-intercept. Now, let's get back to our independent and dependent variables. In those terms, y is our
dependent variable and x is our independent variable. Now, let's move on and see the mathematical implementation of these things. So we have x = 1, 2, 3, 4, 5; let's mark them on the x-axis: 0, 1, 2, 3, 4, 5, 6. And we have y = 3, 4, 2, 4, 5, so let's mark 1, 2, 3, 4, 5 on the y-axis. Now let's plot our coordinates one by one: x = 1 and y = 3 gives the point (1, 3); similarly we have (1, 3), (2, 4), (3, 2), (4, 4), and (5, 5). Moving on ahead, let's calculate the mean of x and y and plot it on the graph. The mean of x is 1 plus 2 plus 3 plus 4 plus 5 divided by 5, that is 3. Similarly, the mean of y is 3 plus 4 plus 2 plus 4 plus 5, that is 18, divided by 5, which is nothing but 3.6. So next we plot our mean, the point (3, 3.6), on the graph. Now, our goal is to find, or predict, the best-fit line using the least squares method, and in order to do that we first need the equation of the line. So let's find the equation of our regression line. Suppose this is our regression line, y = mx + c. We have an equation of the line, so all we need to do is find the values of m and c, where m equals the summation of (x minus x bar) times (y minus y bar), divided by the summation of (x minus x bar) whole square. Don't get confused; let me resolve it for you. As part of the formula, we first calculate x minus x bar. We have x as 1 and x bar as 3, so 1 minus 3 is minus 2. Next we have x equal to 2 minus its mean 3, that is minus 1. Similarly, 3 minus 3 is 0, 4 minus 3 is 1, and 5 minus 3 is 2. So x minus x bar is nothing but the distance of each point from the line x = 3, and y minus y bar is the distance of each point from the line y = 3.6. Fine, so let's calculate the values of y minus y bar, starting with y equal 3 minus the value of y bar, that is 3.6: 3 minus 3.6 is minus 0.6. Next is 4 minus 3.6, that is 0.4; then 2 minus 3.6, that is minus 1.6; then 4 minus 3.6, that is 0.4 again; and 5 minus 3.6, that is 1.4. So now we are done with y minus y bar. Next we calculate (x minus x bar) whole square: minus 2 whole square is 4, minus 1 whole square is 1, 0 squared is 0, 1 squared is 1, and 2 squared is 4. Fine, so now in our table we have x minus x bar, y minus y bar, and (x minus x bar) whole square. Next we need the product of (x minus x bar) times (y minus y bar): minus 2 times minus 0.6 is 1.2; minus 1 times 0.4 is minus 0.4; 0 times minus 1.6 is 0; 1 multiplied by 0.4 is 0.4; and 2 multiplied by 1.4 is 2.8. Now almost all the parts of our formula are done, so we take the summation of the last two columns: the summation of (x minus x bar) whole square is 10, and the summation of (x minus x bar) times (y minus y bar) is 4, so the value of m will be 4 by 10, that is 0.4. Let's put this value of m equals 0.4 into our line y = mx + c, and plug the mean point into the equation to find the value of c: we have y as 3.6 (remember, the mean of y), m as 0.4 which we calculated just now, and x as the mean value of x, that is 3. So 3.6 equals 0.4 times 3 plus c.
That is, 3.6 equals 1.2 plus c, so the value of c is 3.6 minus 1.2, that is 2.4. So we have m equals 0.4 and c equals 2.4, and when we finally write out the equation of the regression line, what we get is y equals 0.4 times x plus 2.4. That is the regression line; these are your actual points. Now, for the given m equals 0.4 and c equals 2.4, let's predict the value of y for x equal 1, 2, 3, 4, and 5. When x equals 1, the predicted value of y will be 0.4 times 1 plus 2.4, that is 2.8. Similarly, when x equals 2, the predicted value of y will be 0.4 times 2 plus 2.4, which equals 3.2. Likewise, for x equal 3, y will be 3.6; for x equal 4, y will be 4.0; and for x equal 5, y will be 4.4. Let's plot them on the graph: the line passing through all these predicted points and cutting the y-axis at 2.4 is the line of regression. Now your task is to calculate the distance between the actual and predicted values, and your job is to reduce that distance — in other words, to reduce the error between the actual and predicted values. The line with the least error will be the line of linear regression, or regression line, and it will also be the best-fit line. So this is how things work in a computer: it performs a number of iterations for different values of m, and for each value of m it calculates the equation of the line y = mx + c. Right?
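The hand computation above (m = 0.4, c = 2.4, and the predictions 2.8 through 4.4) can be checked with a short script; a minimal sketch:

```python
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]

x_bar = sum(x) / len(x)  # 3.0
y_bar = sum(y) / len(y)  # 3.6

# Least squares: m = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)**2)
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
denominator = sum((xi - x_bar) ** 2 for xi in x)
m = numerator / denominator  # 0.4
c = y_bar - m * x_bar        # 2.4, since the line passes through the mean point

predicted = [m * xi + c for xi in x]
print(round(m, 2), round(c, 2), [round(p, 2) for p in predicted])
```

The same closed-form answer is what the iterative search described next would eventually converge to.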
So as the value of m changes, the line changes, and the iteration will start from one. All right, and it will perform a number of iterations, so after every iteration what it will do is calculate the predicted values according to that line and compare the distance of the actual values to the predicted values, and the value of m for which the distance between the actual and the predicted values is minimum will be selected for the best fit line. All right. Now that we have calculated the best fit line, it's time to check the goodness of fit, or to check how good the model is performing. So in order to do that, we have a method called the R-square method. So what is this R-square? Well, the R-squared value is a statistical measure of how close the data are to the fitted regression line. In general it is considered that a high R-squared value means a good model, but you can also have a low R-squared value for a good model, or a high R-squared value for a model that does not fit at all. All right. It is also known as the coefficient of determination, or the coefficient of multiple determination. Let's move on and see how R-square is calculated. So these are our actual values plotted on the graph. We had calculated the predicted values of y as 2.8, 3.2, 3.6, 4.0, 4.4; remember, we calculated the predicted values of y from the equation y predicted equals 0.4 times x plus 2.4, for every x equals 1, 2, 3, 4 and 5, and from there we got the predicted values of y. All right, so let's plot them on the graph. So these are the points, and the line passing through these points is nothing but the regression line. All right. Now what you need to do is check and compare the distance of actual minus mean versus the distance of predicted minus mean. So basically, what you are doing is comparing the distance of the actual values from the mean to the distance of the predicted values from the mean. All right, so that is nothing but R-square. Mathematically, you can represent R-square as the summation of y predicted
values minus y bar, whole square, divided by the summation of y minus y bar, whole square, where y is the actual value, yp is the predicted value and y bar is the mean value of y, which is nothing but 3.6. Remember, this is our formula. So next what we'll do is calculate y minus y bar. We have y as 3 and y bar as 3.6, so we calculate 3 minus 3.6, that is nothing but minus 0.6; similarly, for y equals 4 and y bar equals 3.6 we have y minus y bar as 0.4; then 2 minus 3.6, that is minus 1.6; 4 minus 3.6, again 0.4; and 5 minus 3.6, that is 1.4. So we got the values of y minus y bar. Now what we have to do is take the squares. So we have minus 0.6 squared as 0.36, 0.4 squared as 0.16, minus 1.6 squared as 2.56, 0.4 squared as 0.16, and 1.4 squared as 1.96. Now, as part of the formula, what we need are the yp minus y bar values. So these are the yp values, and we have to subtract the mean from them, right? So 2.8 minus 3.6, that is minus 0.8; similarly we get 3.2 minus 3.6, that is minus 0.4; 3.6 minus 3.6, that is 0; 4.0 minus 3.6, that is 0.4; then 4.4 minus 3.6, that is 0.8. So we calculated the values of yp minus y bar; now it's our turn to calculate the values of yp minus y bar whole square. Next, we have minus 0.8 squared as 0.64, minus 0.4 squared as 0.16, 0 squared as 0.

So in logistic regression the value of y is discrete, or you can say categorical in nature, whereas in linear regression we have the value of y, or you can say the value you need to predict, within a range. That is the difference between linear regression and logistic regression. You must be having a question: why not linear regression? Now guys, in linear regression the value of y, or the value which you need to predict, is in a range, but in our case, that is in logistic regression, we just have two values: it can be either 0 or it can be 1. It should not entertain values which are below zero or above one. But in linear regression we have the value of y in a range, so here, in order to implement logistic regression, we need to clip this part, so we don't need the
value that is below zero, and we don't need the value which is above 1. So since the value of y will be between only 0 and 1, which is the main rule of logistic regression, the linear line has to be clipped at 0 and 1. Now, once we clip this graph, it would look somewhat like this. So here you're getting a curve which is nothing but three different straight lines, so we need a new way to solve this problem. So this has to be formulated into an equation, and hence we come up with logistic regression. So here the outcome is either 0 or 1, which is the main rule of logistic regression, but this clipped curve cannot be formulated directly, so to fulfil our main aim of bringing the values to 0 and 1, that is how we came up with logistic regression. Now, once it gets formulated into an equation, it looks somewhat like this. So guys, this is nothing but an S curve, or you can say the sigmoid curve, a sigmoid function curve. So this sigmoid function basically converts any value from minus infinity to infinity to the discrete values which logistic regression wants, or you can say the values which are in binary format, either 0 or 1. So if you see here, the values are either 0 or 1, and this curve is nothing but a smooth transition between them. But guys, there's a catch over here. So let's say I have a data point whose output is 0.8. Now, how can you decide whether your value is 0 or 1? Here you have the concept of a threshold, which basically divides your line. So here the threshold value basically indicates the probability of either winning or losing; by winning I mean the value equals 1, and by losing I mean the value equals 0. But how does it do that?
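The S curve and the threshold just described can be sketched directly; this is a small stand-alone illustration, not tied to any data set.

```python
import math

def sigmoid(z):
    # Maps any value from minus infinity to infinity into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    # The threshold rounds the probability to a discrete 0 or 1
    return 1 if p >= threshold else 0

print(sigmoid(0))                    # 0.5, the midpoint of the S curve
print(sigmoid(10), sigmoid(-10))     # very close to 1 and very close to 0
print(classify(0.8), classify(0.2))  # 1 0
```

The last line is exactly the 0.8-versus-0.2 example from the talk: anything at or above the 0.5 threshold becomes 1, anything below it becomes 0.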
Let's say we have a data point over here; say my cursor is at 0.8. So here I check whether this value is less than the threshold value or not. Let's say if it is more than the threshold value, it should give me the result as 1; if it is less than that, it should give me the result as 0. So here my threshold value is 0.5. I need to define that if my value, let's say 0.8, is more than 0.5, then the value shall be rounded off to 1, and let's say if it is less than 0.5, say I have a value of 0.2, then it should be reduced to 0. So here you can use the concept of a threshold value to find the output, and it should be discrete: it should be either 0 or it should be 1. So I hope you got this curve of logistic regression. So guys, this is the sigmoid S curve. Now, to make this curve we need to make an equation, so let me address that part as well. Let's see how an equation is formed to imitate this functionality. So over here we have the equation of a straight line, which is y equals mx plus c. So in this case I just have one independent variable, but let's say we have many independent variables; then the equation becomes m1 x1 plus m2 x2 plus m3 x3 and so on, till mn xn. Now, let us put in betas for the coefficients, so here the equation becomes y equals beta 1 x1 plus beta 2 x2 plus beta 3 x3 and so on, till beta n xn, plus c.
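In practice you don't fit those betas by hand. A hedged sketch of how this looks with scikit-learn's LogisticRegression, which fits the coefficients and applies the sigmoid and the 0.5 threshold for you; the toy data below is made up, not the course's data set.

```python
# Hedged sketch: scikit-learn fits the beta coefficients of the
# equation above and handles the sigmoid/threshold internally.
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # one feature, made up
y = [0, 0, 0, 1, 1, 1]                            # binary target

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[1.5]]))         # a discrete 0/1 label
print(model.predict_proba([[1.5]]))   # sigmoid probabilities per class
```

predict_proba exposes the raw sigmoid output, so you can apply your own threshold instead of the default 0.5 if the problem calls for it.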
So guys, the equation of a straight line has a range from minus infinity to infinity, but in our case, that is in logistic regression, the value which we need to predict, or you can say the y value, can have a range only from 0 to 1. So in that case we need to transform this equation. To do that, what we have done is divide y by 1 minus y. So now, if y equals 0, then 0 over 1 minus 0 is 0 over 1, which is again 0, and if we take y equals 1, then we get 1 over 1 minus 1, which is 1 over 0, and that is infinity. So here my range is now between 0 and infinity, but again, we want the range from minus infinity to infinity. So for that, what we'll do is take the log of this equation. So let's go ahead and take the logarithm of this equation: here we have log of y over 1 minus y, and this transforms it further to get the range between minus infinity and infinity, and this is your final logistic regression equation. So guys, don't worry, you don't have to write this formula or memorize this formula; in Python you just need to call the function, which is LogisticRegression, and everything will be done automatically for you. So I don't want to scare you with the maths and the formulas behind it, but it is always good to know how this formula was generated. So I hope you guys are clear with how logistic regression comes into the picture. Next, let us see what the major differences between linear regression and logistic regression are. First of all, in linear regression we have the value of y as a continuous variable, or the variables we need to predict are continuous in nature, whereas in logistic regression it is discrete.

Now, back in the demo, I'll just paste in .index and print this. So here the number of passengers in the original data set we have is 891, so around that number were traveling on the Titanic ship. So over here my first step is done, where we have just collected the data, imported all the libraries and found out the total
number of passengers who were on the Titanic. So now let me just go back to the presentation and let's see what my next step is. So we're done with collecting data; the next step is to analyze your data. So over here we will be creating different plots to check the relationships between variables, as in how one variable is affecting the other. So you can simply explore your data set by making use of various columns, and then you can plot graphs between them: you can either plot a correlation graph or you can plot a distribution curve, it's up to you guys. So let me just go back to my Jupyter notebook and let me analyze some of the data. Over here my second part is to analyze data, so I'll just put this in a heading 2; to do that I just have to go to the cell, click on Markdown and run it. So first let us plot a count plot, where we compare the passengers who survived and who did not survive. For that I will be using the Seaborn library, so over here I have imported seaborn as sns, so I don't have to write the whole name. I'll simply say sns.countplot, I set x as Survived, and the data that I'll be using is the Titanic data, or you can say the name of the variable in which you have stored your data set. So now let me just run this. So over here, as you can see, I have the Survived column on my x axis, and on the y axis I have the count. So 0 basically stands for did not survive and 1 stands for the passengers who did survive. So over here you can see that around 550 of the passengers did not survive and there were only around 350 passengers who survived, so here you can basically conclude that there are far fewer survivors than non-survivors. So this was the very first plot. Now let us draw another plot to compare the sexes, as in, out of all the passengers who survived and who did not survive, how many were men and how many were women. So to do that?
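The count plots being described here (both the plain one and the one about to be split by sex) can be sketched as below. This is a hedged stand-in: the real session loads the full Kaggle Titanic CSV, so a tiny fabricated frame is used here to keep the snippet self-contained; the column names Survived and Sex match the demo.

```python
import matplotlib
matplotlib.use("Agg")              # draw off-screen; no GUI needed
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the real Titanic data set
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1, 0, 1, 0],
    "Sex": ["male", "female", "male", "male",
            "female", "male", "female", "male"],
})

sns.countplot(x="Survived", data=titanic_data)             # survivors vs not
sns.countplot(x="Survived", hue="Sex", data=titanic_data)  # split by sex

# The numbers behind the bars, computed directly:
counts = titanic_data["Survived"].value_counts()
print(counts[0], counts[1])   # 5 did not survive, 3 survived
```

value_counts gives the same totals the bars show, which is a quick sanity check on any count plot.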
I'll simply say sns.countplot, I add the hue as Sex, since I want to know how many females and how many males survived, and then I'll be specifying the data; so I'm using the Titanic data set. Let me just run this; I had made a mistake over here. So over here you can see I have the Survived column on the x axis and I have the count on the y. Now, here the blue color stands for your male passengers and orange stands for your female. So, as you can see, for the passengers who did not survive, that has the value 0, we can see that the majority were male, and if we look at the people who survived, we can see the majority were female. So this basically concludes the survival rate by gender: it appears that on average women were more than three times more likely to survive than men. Next, let us draw another plot where we have the hue as the passenger class, so over here we can see which class the passenger was traveling in, whether class 1, 2 or 3. So for that I write the same command: I'll say sns.countplot, I keep my x axis as Survived, I'll change my hue to passenger class, so my variable is named Pclass, and the data set that I'll be using is the Titanic data. So this is my result: over here you can see I have blue for first class, orange for second class and green for the third class. So here the passengers who did not survive were mostly from the third class, or you can say the lowest class, or the cheapest class to get onto the Titanic, and the people who did survive mostly belonged to the higher classes; so here, 1 and 2 have more survivors than the passengers who were traveling in the third class. So here we have concluded that the passengers who did not survive were mostly from the third class, or you can say the lowest class, and the passengers who were traveling in first and second class tended to survive more. Next, I'll just draw a graph for the age distribution. Over here I can simply use my data; we'll be using the pandas library for this. I will
take the column, that is Age, and plot a histogram, so I'll say plot.hist. So you can notice over here that we have more young passengers, or you can say children, between the ages 0 to 10, then we have the average-aged people, and as you go higher, the smaller the population gets. So this is the analysis on the Age column: we saw that we have more young passengers and more middle-aged passengers traveling on the Titanic. So next, let me plot a graph of the fare as well. So I'll say Titanic data, I take Fare, and again I want a histogram, so I'll say hist. So here you can see the fare is mostly between zero and a hundred. Now let me add the bin size, so as to make it more clear: over here I'll say bins equals, let's say, 20, and I'll increase the figure size as well, so I'll say figsize, let's say with the dimensions 10 by 5. So this is more clear now. Next, let us analyze the other columns as well. So I'll just type in Titanic data, and I want the information as to what columns are left. So here we have PassengerId, which I guess is of no use; then we have seen how many passengers survived and how many did not; we also saw the analysis on the gender basis, where we saw whether the females tended to survive more or the males tended to survive more; then we saw the passenger class, whether the passenger was traveling in the first class, second class or third class. Then we have the name; on the name we cannot do any analysis. We saw the sex, and we saw the age as well. Then we have SibSp; this stands for the number of siblings or spouses who were aboard the Titanic. So let us do this as well: I'll say sns.countplot, I mention x as SibSp, and I will be using the Titanic data, so you can see the plot over here. So over here you can conclude that it has the maximum value at 0, so you can conclude that most passengers had neither siblings nor a spouse on board the Titanic; the second highest value is 1, and then we have various values for
2, 3, 4 and so on. Next, if I go above, we can do this for the Parch column as well: Parch is the number of parents or children who were aboard the Titanic, so we can draw a similar count plot for it. Then we have the ticket number; I don't think any analysis is required for Ticket. Then we have Fare, which we have already discussed, as in the people who tended to travel in the first class would have the highest fare. Then we have the Cabin number, and we have Embarked. So these are the columns that we'll be doing data wrangling on. So we have analyzed the data, and we have seen quite a few graphs from which we can conclude how one variable relates to another, or what the relationships are. So the third step is data wrangling. Data wrangling basically means cleaning your data. So if you have a large data set, you might have some null values, or you can say NaN values, so it's very important that you remove all the unnecessary items present in your data set, because keeping them directly affects your accuracy. So I'll just go ahead and clean my data by removing all the NaN values and the unnecessary columns which have null values in the data set. So next, to perform data wrangling, first of all I check whether my data set has nulls or not. So I'll say Titanic data, which is the name of my data set, and I'll say isnull. So this will basically tell me which values are null, and it will return a Boolean result. So this basically checks the missing data, and your result will be in Boolean format, as in, the result will be true or false: false means the value is not null and true means the value is null. So let me just run this. Over here you can see the values as false or true: false is where the value is not null, and true is where the value is null. So over here you can see in the Cabin column we have the very first value as null, so we have to do something about this. And you can see that we have a large data set, so the counting does not stop
and we can't actually see the sum of it all, but we can actually print the number of passengers who have a NaN value in each column. So I'll say Titanic_data.isnull, and I want the sum of it, so I'll say .sum. So this will basically print the number of passengers who have NaN values in each column, and we can see that we have 177 missing values in the Age column, the maximum in the Cabin column, and very few in the Embarked column, that is 2. So here, if you don't want to see these numbers, you can also plot a heat map and then analyze it visually. Let me just do that as well: I'll say sns.heatmap, and I'll set yticklabels to false. Let's run this. As we have already seen, there were three columns in which missing values were present. This one is Age, so over here almost 20% of the column has missing values; then we have the Cabin column, which is quite a large value; and then we have two values for the Embarked column as well. Let me add a cmap for color coding, so I'll say cmap; if I do this the graph becomes more readable, and over here yellow stands for true, or you can say the values that are null. So here we have computed that we have missing values in Age.
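The missing-value check just described can be sketched as below. Again a hedged stand-in: the demo uses the full Titanic CSV, so a fabricated mini-frame is used here (the NaN pattern is made up); the column names match the demo.

```python
import matplotlib
matplotlib.use("Agg")            # draw off-screen; no GUI needed
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up mini-frame standing in for the real Titanic data
titanic_data = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

nulls = titanic_data.isnull().sum()   # NaN count per column
print(nulls)                          # Age 2, Cabin 3, Embarked 1

# Visual version: the highlighted cells mark the nulls
sns.heatmap(titanic_data.isnull(), yticklabels=False, cmap="viridis")
```

isnull().sum() works because True counts as 1 when summed, so each column's total is exactly its number of missing entries.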
We have a lot of missing values in the Cabin column, and we have a very small number, which is not even visible, in the Embarked column as well. So to remove these missing values, you can either replace the values, putting in some dummy values, or you can simply drop the column. So here, let us first pick the Age column. First, let me just plot a box plot and analyze it, with one axis as Age. So I'll say sns.boxplot, I'll say x equals passenger class, so it's Pclass, I'll say y equals Age, and the data set that I'll be using is the Titanic data. You can see the ages in first class and second class tend to be older than what we have in the third class; well, that may depend on experience, how much you earn, or any number of reasons. So here we concluded that the passengers who were traveling in class 1 and class 2 tend to be older than what we have in class 3. So we have found that we have some missing values in Age; now, one way is to either just drop the column, or you can simply fill in some values for them. This method is called imputation. Now, to perform the data wrangling or cleaning, let us first print the head of the data set, so I'll say Titanic.head; let's say I just want the first five rows. So here we have Survived, which is again categorical, so on this particular column I can apply logistic regression; this can be my y value, or the value that you need to predict. Then we have the passenger class, we have the name, then we have the ticket number, and the Cabin. So over here we have seen that in Cabin we have a lot of null values, or you can say NaN values, which is quite visible as well. So first of all we'll just drop this column. For dropping it, I'll just say Titanic_data, I'll simply type in drop and the column which I need to drop, so I have to drop the Cabin column; I mention axis equals 1, and I'll say inplace equals
True. So now again I just print the head, and let us see whether this column has been removed from the data set or not. So I'll say Titanic.head; as you can see here, we don't have the Cabin column anymore. Now you can also drop the NA values: I'll say Titanic data dot dropna, to drop all the NA values, or you can say NaN, which is not a number, and I will say inplace equals True. So over here, let me again plot the heat map and check whether the values which were showing as null before have been removed or not. So I'll say sns.heatmap, I'll pass in the data set, I'll check isnull, I'll say yticklabels equals false, and I don't want color coding, so again I say false. So this will basically help me check whether my null values have been removed from the data set or not. As you can see here, I don't have any null values; it's entirely black. Now, you can actually check the sum as well: I'll just go above, copy this part, and use the sum function to calculate the sum. So here it tells me that the data set is clean, as in, the data set does not contain any null value or any NaN value. So now we have wrangled the data, or you can say cleaned the data. Here we have done just one step in data wrangling, that is just removing one column; now, you can do a lot of things: you can actually fill in the values with some other values, or you can just calculate the mean and then fill the null values with it. But now, if I see my data set, I'll say Titanic data dot head, I still have a lot of string values over here. These have to be converted to categorical variables in order to implement logistic regression. So what we will do is convert these into categorical variables via some dummy variables, and this can be done using pandas, because logistic regression just takes two values. So whenever you apply machine learning, you need to make sure that there are no string values present, because it won't be taking these as your
input variables. Using a string you can't predict anything, but in my case I have the Survived column to tell how many people tended to survive and how many did not; so 0 stands for did not survive and 1 stands for survived. So now let me just convert these variables into dummy variables. I'll just use pandas and say pd.get_dummies; you can simply press Tab to autocomplete; and I pass in the Sex column. You can simply press Shift + Tab to get more information on this: here we have the type DataFrame, and we have the PassengerId, Survived and passenger class. So if I run this, you'll see that in the female column 0 basically stands for not a female and 1 stands for a female; similarly for male, 0 stands for not male and 1 stands for male. Now, we don't require both these columns, because one column by itself is enough to tell us whether it's male, or you can say female, or not. Let's say I want to keep only male: if the value of male is 1, then it is definitely a male and it is not a female. So that is how you don't need both of these values, so for that I just remove the first column, let's say female; I'll say drop_first equals True, and it has given me just one column, which is male, with the values 0 and 1. Let me just set this as a variable, say sex, so over here I can say sex.head, and I just want to see the first five rows. So this is how my data looks now. Here we have done it for Sex; then we have the numerical values in Age, we have the numerical values for siblings and spouses, then we have the ticket number.

Let's say a cherry: any random permutation and combination can be possible, so in this case I'd say that the impurity is non-zero. I hope the concept of impurity is clear. So, coming back to entropy: as I said, entropy is the measure of impurity. From the graph on your left you can see that when the probability is zero or one, that is, when the sample is completely pure (all no or all yes), the value of entropy is zero, and when the
probability is 0.5, then the value of entropy is maximum. Well, what is impurity? Impurity is the degree of randomness, how random the data is; so if the data is completely pure, in that case the randomness equals 0, and the value of entropy will be zero as well. A question like "why is it that the value of entropy is maximum at 0.5?" might arise in your mind, right? So let me discuss that; let me derive it mathematically. As you can see here on the slide, the mathematical formula of entropy starts with minus the probability of yes. Let's move on and see what this graph has to say mathematically. Suppose S is our total sample space, and it's divided into two parts, yes and no, like in our data set the result for playing was divided into two parts, yes or no, which we have to predict: either we play or we don't, right? So for that particular case you can define the formula of entropy as: entropy of the total sample space equals negative of probability of yes multiplied by log of probability of yes with base 2, minus probability of no times log of probability of no with base 2, where S is your total sample space, P(yes) is the probability of yes and P(no) is the probability of no. Well, if the number of yes equals the number of no, that is, the probability of yes equals 0.5, since you have equal numbers of yes and no, then in that case the value of entropy will be one; just put the value in there. All right, let me just move to the next slide and I'll show you this. Alright, next: if it contains all yes or all no, that is, the probability for the sample space is either 1 or 0, then in that case the entropy will be equal to 0. Let's see it mathematically, one by one. So let's start with the first condition, where the probability was 0.5. So this is our formula for entropy, right?
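Before stepping through the cases on the slides, the formula can be checked numerically; a minimal sketch of the two-class entropy just defined:

```python
# H = -p_yes*log2(p_yes) - p_no*log2(p_no), with p_no = 1 - p_yes
import math

def entropy(p_yes):
    p_no = 1.0 - p_yes
    h = 0.0
    # By convention 0 * log2(0) is taken as 0, so skip zero probabilities
    for p in (p_yes, p_no):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy(0.5))   # 1.0 -> maximum, equal yes and no
print(entropy(1.0))   # 0.0 -> all yes, perfectly pure
print(entropy(0.0))   # 0.0 -> all no, perfectly pure

# Entropy of a 9-yes / 5-no split, the figure used for the play data:
print(round(entropy(9 / 14), 2))   # 0.94
```

The three printed cases are exactly the ones derived on the slides: maximum entropy at 0.5, zero entropy at either extreme.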
So there's our first case, right, which we discussed: when the probability of yes equals the probability of no, that is, in our data set we have equal numbers of yes and no. All right, so probability of yes equals probability of no, and that equals 0.5, or in other words, you can say that yes plus no equals the total sample space. Alright, since the probability is 0.5, when you put the values into the formula you get something like this, and when you calculate it, you will get the entropy of the total sample space as one. All right. Let's see the next case. What is the next case? Either you have all yes or you have all no. So if you have all yes, let's see the formula: you have all yes and 0 no, fine, so the probability of yes equals 1, and yes is the total sample space, obviously. So in the formula, when you put that in, you get entropy of the sample space equals negative of 1 multiplied by log of 1, and as the value of log 1 equals 0, the whole thing results in 0. Similar is the case with no; even in that case you will get the entropy of the total sample space as 0. So this was all about entropy. All right, next is: what is Information Gain?
Well, what Information Gain does is measure the reduction in entropy; it decides which attribute should be selected as the decision node. If S is our total collection, then Information Gain equals the entropy, which we calculated just now, minus the weighted average times the entropy of each feature. Don't worry, we'll just see how to calculate it with an example. Let's manually build a decision tree for our data set. So there's our data set, which consists of 14 different instances, out of which we have nine yes and five no. Alright, so we have the formula for entropy; just put the numbers in: since there are 9 yes, the total probability of yes equals 9 by 14, and the total probability of no equals 5 by 14, and when you put in the values and calculate the result, you will get the value of entropy as 0.94. All right. So this was your first step, that is, compute the entropy for the entire data set. Only now do you have to select which, out of Outlook, temperature, humidity and windy, should be the root node; big question, right? How will I decide that this particular node should be chosen as the base node, and on the basis of that only create the entire tree? Let's see. You have to do it one by one: you have to calculate the entropy and Information Gain for all of the different nodes. So starting with Outlook: Outlook has three different parameters, sunny, overcast and rainy. So first of all, see how many yes and no there are in the case of sunny; like, when it is sunny, how many yes and how many no
are there? So in total, we have two yes and three no in the case of sunny. In the case of overcast, we have all yes: if it is overcast, then we will surely go to play, it's like that. Alright, and next, if it is rainy, then the total number of yes equals 3 and the total number of no equals 2, fine. Next, what we do is calculate the entropy for each feature value. Here we are calculating the entropy when Outlook equals sunny. First of all, we are assuming that Outlook is our root node, and for that we are calculating the Information Gain for it. All right. So in order to calculate the Information Gain, remember the formula: it was entropy of the total sample space minus the weighted average times the entropy of each feature. All right. So what we are doing here is calculating the entropy of Outlook when it was sunny. So the total number of yes when it was sunny was 2, and the total number of no was 3, fine. So let's put that into the formula: since the probability of yes is 2 by 5 and the probability of no is 3 by 5, you will get something like this; all right, so you are getting the entropy of sunny as 0.971, fine. Next we will calculate the entropy for overcast: when it was overcast, remember, it was all yes, right, so the probability of yes equals 1, and when you put that in, you will get the value of entropy as 0, fine. And when it was rainy: rainy has 3 yes and 2 no, so the probability of yes in the case of rainy is 3 by 5 and the probability of no in the case of rainy is 2 by 5, and when you put the values of probability of yes and probability of no into the formula, you get the entropy of rainy as 0.971 as well. Now, you have to calculate how much information you are getting from Outlook; that equals the weighted average. All right, so what is this weighted average? It is based on the total number of yes and no. So information from Outlook equals 5 by 14 times the entropy; where does this 5 come from? We are calculating the total size of the sample space within that particular Outlook value, when it was sunny, right?
So in the case of sunny there were two yes and three no, all right, so the weight for sunny would be equal to 5 by 14. Alright, since the formula was 5 by 14 times the entropy of that feature value, and we calculated the entropy for sunny as 0.971, what we'll do is multiply 5 by 14 with 0.971, right? Well, this was the calculation for information when Outlook equals sunny, but Outlook also equals overcast and rainy. In that case, what we'll do is, similarly, calculate everything for overcast and rainy: for overcast the weighted term is 4 by 14 times its entropy, that is 0, and for rainy it is again 5 by 14 (3 yes and 2 no) times its entropy, that is 0.971. And finally we'll take the sum of all of them, which equals 0.693. Right, next we will calculate the Information Gain. What we did earlier was the information taken from Outlook; now we are calculating what information we are gaining from Outlook, right? Now, this Information Gain equals the total entropy minus the information that is taken from Outlook. All right, so the total entropy we had was 0.94, minus the information we took from Outlook, that is 0.693, so the value of the Information Gain from Outlook results in 0.247. All right. So next, let's assume that Windy is our root node. So Windy consists of two parameters, false and true. Let's see how many yes and how many no there are in the case of true and false. So when Windy has false as its parameter, then in that case it has six yes and two no, and when it has true as its parameter, it has 3 yes and 3 no. All right, so let's move ahead and similarly calculate the information taken from Windy, and finally calculate the information gained from Windy. Alright, so first of all, what we'll do is calculate the entropy of each feature value, starting with Windy equals true. So in the case of true we had an equal number of yes and an equal number of no; remember the graph, when we had the probability
as 0.5 as total number of years equal total number of know and for that case the entropy equals 1 so we can directly write entropy of room when it’s windy is one as we had already proved it when probability equals 0.5 the entropy is the maximum that equals to 1 All right Next is entropy of false when it is Vending I like so similarly just put the probability of yes and no in the formula and then calculate the result since you have six years and to nose So in total, you’ll get the probability of yes 6 by 8 and probability of no as 2 by 8 All right, so when you will calculate it, you will get the entropy of false as zero point eight one one Alright now, let’s calculate the information from windy So total information collected from Windy equals information taken when Wendy equal true plus Action taken when Wendy equal false So we’ll calculate the weighted average for each one of them and then we’ll sum it up to finally get the total information taken from windy So in this case, it equals to 8 by 14 multiplied by 0.8 1 1 plus 6 by 14 x 1 What is this? 8 it is total number of yes, and no in case when when D equals false, right? So when it was false, so total number of BS that equals to 6 and total more of know that equal to 2 that some UPS to 8 Alright, so that is why the waiter Resul results to Aid by 14 similarly information taken when windy equals true equals to 3 plus 3 that is 3 S and 3 no equal 6 divided by total number of sample space that is 14 x 1 that is entropy of true All right So it is 8 by 14 multiplied by 0.8 1 1 plus 6 by 14 x one which results to 0.89 to this is information taken from Windy All right Now how much information you are gaining from Wendy? 
So for that, what you will do: the total information gained from windy equals the total entropy minus the information taken from windy. All right, that is 0.94 minus 0.892, which equals 0.048. So 0.048 is the information gained from windy. Similarly, we calculated it for the rest too. So for Outlook, as you can see, the information was 0.693 and its information gain was 0.247; in case of temperature the information was around 0.911 and the information gain was 0.029; in case of humidity the information gain was 0.152; and in the case of windy the information gain was 0.048. So what we'll do is select the attribute with the maximum information gain, fine. Now we have selected Outlook as our root node, and it is further subdivided into three different parts: sunny, overcast and rain. So in case of overcast, we have seen that it consists of all yes, so we can consider it as a leaf node. But in case of sunny and rainy it's doubtful, as each consists of both yes and no, so you need to recalculate things, right? Again, for each of these nodes you have to recalculate: you have to again select the attribute which has the maximum information gain. All right, and this is how your complete tree will look. All right, so let's see when you can play. You can play when the Outlook is overcast; in that case you can always play. If the Outlook is sunny, you will further drill down to check the humidity condition. All right, if the humidity is normal then you will play; if the humidity is high then you won't play, right? When the Outlook predicts that it's raining, then you will further check whether it's windy or not: if it is a weak wind then you will go out and play, but if there is a strong wind, then you won't play, right? So this is how your entire decision tree would look at the end. Now comes the concept of pruning. Say you ask: what should I do to play?
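The entropy and information-gain arithmetic walked through above can be sketched in a few lines of Python (a minimal sketch of just the ID3 computation, using the play-tennis counts from the example, not the full tree builder):

```python
from math import log2

def entropy(yes, no):
    """Entropy of a yes/no split; 0 when the split is pure."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

# (yes, no) counts per Outlook value from the play-tennis table
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

total_yes = sum(y for y, n in outlook.values())   # 9
total_no = sum(n for y, n in outlook.values())    # 5
total = total_yes + total_no                      # 14

# information taken from Outlook = weighted average of each value's entropy
info_outlook = sum((y + n) / total * entropy(y, n) for y, n in outlook.values())
gain_outlook = entropy(total_yes, total_no) - info_outlook

print(round(info_outlook, 3))   # ≈ 0.694 (rounded to 0.693 in the walkthrough)
print(round(gain_outlook, 3))   # 0.247
```

Running the same weighted-average-of-entropies computation for temperature, humidity and windy reproduces the other information-gain values quoted above.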
Well, you have to do pruning, and pruning will decide how you will play. So what is this pruning? Well, pruning is nothing but cutting down the nodes in order to get the optimal solution. All right, so what does pruning do? It reduces the complexity. All right, as you can see on the screen, the tree now shows only the paths leading to yes, that is, all the results which say that you can play. Before we drill down to a practical session, a common question might come to your mind: you might think, are tree-based models better than linear models, right? You might think: if I can use logistic regression for a classification problem and linear regression for a regression problem, then why is there a need to use a tree? Well, many of us have this question in mind, and it's a valid question too. Actually, as I said earlier, you can use any algorithm; it depends on the type of problem you're solving. Let's look at some key factors which will help you decide which algorithm to use and when. The first point: if the relationship between the dependent and independent variables is well approximated by a linear model, then linear regression will outperform a tree-based model. Second case: if there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression model. Third case: if you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model, as decision tree models are simpler to interpret than linear regression. All right, now let's move on ahead and see how you can write a decision tree classifier from scratch in Python using the CART algorithm. All right, for this I will be using a Jupyter notebook with Python 3 installed on it. Alright, so let's open Anaconda and the Jupyter notebook. Where is that?
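As a quick illustration of the pruning idea before the from-scratch walkthrough, here is a sketch using scikit-learn's cost-complexity pruning (this assumes scikit-learn is installed; it is not the hand-built CART classifier the session goes on to write):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned CART tree keeps splitting until the training data is fit exactly
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# ccp_alpha > 0 cuts away nodes whose complexity outweighs their benefit
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree is simpler (fewer leaves) at a small cost in training fit
print(full.get_n_leaves(), pruned.get_n_leaves())
```

The pruned tree trades a little training accuracy for a much simpler, easier-to-read structure, which is exactly the complexity reduction described above.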
So this is our Anaconda Navigator, and I will directly jump over to the Jupyter notebook and hit … So the result is that his application for a loan will get approved, right? There is actually low risk or moderate risk, but there's no real issue of high risk as such, so we can approve the applicant's request here. Now, let's move on and look at the second category, where the person is earning from 15 to 35 thousand dollars. Now here the person may or may not pay back, so in such scenarios we will look at the credit history, as to what his previous history has been. Now if his previous history has been bad, like he has been a defaulter in previous transactions, we will definitely not consider approving his request, as he would be at high risk, which is not good for the bank. If the previous history of that particular applicant is really good, then, just to clarify our doubt, we will consider another parameter: that will be the debt. If he is already in really high debt, then the risk again increases and there are chances that he might not repay in the future, so here we will not accept the request of a person having high debt. If the person is in low debt and he has been a good payer in his past history, then there are chances that he will pay back, and we can consider approving the request of this particular applicant. And let's look at the third category, which is a person earning from 0 to 15 thousand dollars. Now, this is something which actually raises eyebrows, and this person will lie in the category of high risk. All right, so the probability is that his application for a loan would get rejected. Now, we'll get one final outcome from this income parameter, right? Now let us look at our second variable, that is age, which will lead into the second decision tree. Now, let us say the person is young, right?
So now we will look at whether he is a student. Now, if he is a student, then the chances are high that he won't be able to repay, because he has no earning source, right? So here the risk is too high and the probability is that his application for a loan will get rejected, fine. Now if the person is young and he's not a student, then we'll probably go on and look at another variable, that is the bank balance. Now let's look: if the bank balance is less than 5 lakhs, then again the risk rises and the probability is that his application for a loan will get rejected. Now if the person is young, is not a student, and his bank balance is greater than 5 lakhs, so he has a pretty good and stable bank balance, then the probability is that his loan application will get approved. If not, let us take another scenario: he's a senior, right? So if he is a senior, we will probably go and check his credit history: how well has he done in his previous transactions, what kind of a person he is, like whether he's a defaulter or a non-defaulter. Now if he has been only a fair kind of payer in his previous transactions, then again the risk rises and the probability of his application getting rejected actually increases, right? Now if he has been an excellent payer as per his transactions in the previous history, then again here there is the least risk and the probability is that his application for a loan will get approved. So now here these two variables, income and age, have led to two different decision trees, right? And these two different decision trees actually led to two different results. Now what random forest does is it will actually compile these two different results from these two different decision trees, and then finally it will lead to a final outcome. That is how random forest actually works, right? So that is actually the motive of random forest. Now let us move forward and see what random forest is, right?
You can get an idea of the mechanism from the name itself: random forest. A collection of trees is a forest, and that's probably why it's called a forest. And here the trees are also being trained on subsets which are selected at random, and therefore they are called random forests. So a random forest is a collection, or an ensemble, of decision trees. Here a single decision tree is actually built using the whole data set, considering all features, but in a random forest only a fraction of the rows is selected, and that too at random, and a particular number of features, also selected at random, are used to train each tree … and I will make the call on the majority voting, right? Here, as you can see in this picture, I had n different instances, then I created n different decision trees, and finally I will compile the results of all these n different decision trees and take my call on the majority vote, right? So whatever my majority vote says will be my final result. This is basically an overview of the random forest algorithm and how it actually works. Let's just have a look at this example to get a much better understanding of what we have learnt. So let's say I have this data set, which consists of four different attributes, right?
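The subset-and-vote mechanism just described can be sketched from scratch (a minimal bagging sketch; scikit-learn's DecisionTreeClassifier is assumed as the base learner, and the Iris data stands in for the weather table):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each tree on a random subset of the rows (a bootstrap sample)
trees = []
for _ in range(10):
    rows = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows])
    trees.append(tree)

def predict(x):
    """Each tree votes; the majority vote is the forest's answer."""
    votes = [int(t.predict([x])[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

print(predict(X[0]))  # class chosen by majority vote across the 10 trees
```

A production random forest (e.g. `sklearn.ensemble.RandomForestClassifier`) additionally samples a random subset of features at each split, which further de-correlates the trees.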
So basically it consists of the weather information of the previous 14 days, from D1 till D14, and Outlook, humidity and wind basically give me the weather condition of those 14 days. And finally I have play, which is my target variable: whether a match took place on that particular day or not, right? Now my main goal is to find out whether the match will actually take place if I have the following weather conditions on any particular day: let's say the Outlook is rainy that day, the humidity is high and the wind is weak. So now I need to predict whether I will be able to play the match that day or not. All right, so this is the problem statement, fine. Now, let's see how random forest is used to sort this out. Here the first step is to actually split my entire data set into subsets; I have split my entire 14 records into smaller subsets. Now, these subsets may or may not overlap, like there is a certain overlap between D1 till D3 and D3 till D6, fine; so there is an overlap at D3. So it might happen that there is some overlapping; you need not really worry about the overlapping, but you have to make sure that all those subsets are actually different, right?
So here I have taken three different subsets: my first subset consists of D1 till D3, my second subset consists of D3 till D6, and my third subset consists of D7 till D9. Now I will first be focusing on my first subset. Here, let's say that on a particular day the Outlook was overcast, fine. If yes, it was overcast, then the probability is that the match will take place. Overcast is basically when the weather is very cloudy, so if that is the condition then definitely the match will take place. And let's say it wasn't overcast; then you will consider the second most probable option, that will be the wind, and we will make a decision based on whether the wind was weak or strong. If the wind was weak, then you will definitely go out and play the match, else you would not. So the final outcome of this decision tree will be play, because here the ratio between play and no play is 2 : 1, so we get to a certain decision from our first decision tree. Now, let us look at the second subset. Since the second subset has a different set of records, this decision tree is absolutely different from what we saw for our first subset. So let's say if it was overcast, then you will play the match. If it isn't overcast, then you would go and look at the humidity; now further it will get split into two, whether it was high or normal. We'll take the first case: if the humidity was high and the wind was weak, then you will play the match; else, if the humidity was high but the wind was too strong, then you would not go out and play the match, right? Now let us look at the second node of humidity: if the humidity was normal and the wind was weak, then you will definitely go out and play the match, else you won't go out and play the match. So here, if you look at the final result, the ratio of play to no play is 3 : 2, so again the final outcome is actually play, right?
So from the second subset, we get the final decision of play. Now, let us look at our third subset, which consists of D7 till D9. Here, if again overcast is yes, then you will play the match; else you will go and check the humidity, and if the humidity is really high then you won't play the match, otherwise you will play the match. Again the probability of playing the match is yes, because the ratio of play to no play is 2 : 1, right? So: three different subsets, three different decision trees, three different outcomes, and one final outcome after compiling all the results from these three different decision trees. So I hope this gives a better perspective, a better understanding, of random forest and how it really works. All right, so now let's just have a look at various features of random forest. So the first and foremost feature is that it is one of the most accurate learning algorithms, right? Why is that so? Because single decision trees are actually prone to having high variance or high bias, and on the contrary, random forest averages the variance across the decision trees. So let's say the variance is, say, X for a single decision tree, but for the random forest we have implemented n decision trees in parallel; then my variance gets averaged out and my final variance actually becomes X/n. So that is how the overall variance goes down compared to other algorithms, right? Now the second most important feature is that it works well for both classification and regression problems, and by far, of what I have come across, this is one of the only algorithms which works equally well for both of them, be it a classification kind of problem or a regression kind of problem, right? Then, it runs really efficiently on large databases, so basically it's really scalable; it works whether you have a small database or a really huge volume of data, right?
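The variance-averaging claim above can be illustrated empirically (a sketch assuming scikit-learn; the exact scores depend on the data and the random seed, so treat the comparison as indicative rather than guaranteed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic problem on which a single deep tree tends to overfit
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Averaging many de-correlated trees reduces the variance of the
# single-tree predictions, which usually shows up as better test accuracy
print("tree  :", round(tree.score(X_te, y_te), 3))
print("forest:", round(forest.score(X_te, y_te), 3))
```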
So that's a very good part about it. Then the fourth most important point is that it requires almost no input preparation. Now, why am I saying this? Because it has got certain implicit methods which take care of and remove all the outliers and all the missing data, and you really don't have to take care of all that while you are in the stages of input preparation; random forest is here to take care of everything. And next, it performs implicit feature selection, right? So while we are implementing multiple decision trees, it has got an implicit method which will automatically pick up a random subset of features out of all your parameters and then go on implementing the different decision trees. So for example, if you just give one simple command, that, all right, I want to implement 500 decision trees, then random forest will automatically take care of it and implement all those 500 decision trees, and all those 500 decision trees will be different from each other, and this is because it has got implicit methods which will automatically select different parameters out of all the variables that you have, right? Then, it can easily be grown in parallel. Why is that so? Because we are actually implementing multiple decision trees, and all those decision trees are getting implemented in parallel. So if you say, I want a thousand trees to be implemented, all those thousand trees get implemented in parallel, and that is how the computation time reduces, right? And the last point is that it has got methods for balancing errors in unbalanced data sets. Now, what exactly are unbalanced data sets? Let me just give you an example. So let's say you're working on a data set, fine, and you create a random forest model and get 90% accuracy immediately. Fantastic, you think, right? So now you start diving deep, you go a little deeper, and you discover that ninety percent of the data actually belongs to just one class; then your entire decision is actually biased towards that one particular class. So random forest takes care of this thing, and it is really not biased towards any particular decision tree or any particular variable or any class; it has got methods which look after this and do all the balancing of errors in your data sets. So that's pretty much about the features of random forests. K-nearest neighbor is a simple algorithm which uses the entire data set as its training phase: when a prediction is required for unseen data, what it does is search through the entire training data set for the K most similar instances, and the data of the most similar instances is finally returned as the prediction. So hello and welcome, all, to this session; in today's session we will be dealing with the KNN algorithm. So without any further ado, let's move on and discuss the agenda for today's session. We'll start our session with what KNN is, where I'll brief you about the topic, and we'll move ahead to see its popular use cases … which are closest to this star. So in this case, after calculating the distance, we find that we have four blue points and two orange points which are closest to the star. Now, as you can see, the blue points are in the majority, so you can say that for k equals 6 this star belongs to class A, or the star is more similar to the blue points. So by now I guess you know how the KNN algorithm works and what the significance of K in the KNN algorithm is. So how will you choose the value of K? Keep in mind, K is the most important parameter in the KNN algorithm. So, let's see: when you build a k-nearest-neighbor classifier, how will you choose a value of K?
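One common way to answer this, sketched here with scikit-learn's cross-validation on the Iris data (an assumption for illustration; the session's own demo builds KNN from scratch instead):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score several candidate odd values of K with 5-fold cross-validation
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 20, 2)}

# Pick the K with the best mean cross-validated accuracy
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Odd values of K are tried so that yes/no votes cannot tie in a two-class problem.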
Well, you might have a specific value of K in mind, or you could divide up your data and use something like a cross-validation technique to test several values of K in order to determine which works best for your data. For example, if n equals 2,000 cases, then the optimal value of K lies somewhere between 1 and 19, but yes, unless you try it you cannot be sure of it. So, you know how the algorithm works on a higher level; let's move on and see how things are predicted using the KNN algorithm. Remember, I told you the KNN algorithm uses the least distance measure in order to find its nearest neighbors. So let's see how this distance is calculated. Well, there are several distance measures which can be used; to start with, we'll mainly focus on Euclidean distance and Manhattan distance in this session. So what is this Euclidean distance? Well, the Euclidean distance is defined as the square root of the sum of the squared differences between a new point X and an existing point Y. So for example, here we have points P1 and P2: point P1 is (1, 1) and point P2 is (5, 4). So what is the Euclidean distance between both of them?
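Both distance measures can be written directly from their definitions:

```python
from math import dist, sqrt

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Distance measured along the axes: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1, p2 = (1, 1), (5, 4)
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7
print(dist(p1, p2))       # 5.0 — the stdlib equivalent of euclidean()
```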
So you can see that the Euclidean distance is the direct distance between two points. So what is the distance between the points P1 and P2? We can calculate it as the square root of (5 minus 1) squared plus (4 minus 1) squared, which results in 5. So next is the Manhattan distance. Well, the Manhattan distance is used to calculate the distance between real vectors using the sum of their absolute differences; in this case, the Manhattan distance between the points P1 and P2 is mod of (5 minus 1) plus mod of (4 minus 1), which results in 4 plus 3, that is 7. So this slide shows the difference between the Euclidean and Manhattan distance from point A to point B: the Euclidean distance is nothing but the direct, or least possible, distance between A and B, whereas the Manhattan distance is the distance between A and B measured along the axes at right angles. Let's take an example and see how things are predicted using the KNN algorithm, or how the KNN algorithm works. Suppose we have a data set which consists of the height, weight and T-shirt size of some customers. Now, when a new customer comes, we only have his height and weight as the information, and our task is to predict the T-shirt size of that particular customer. For this we will be using the KNN algorithm. So the very first thing we need to do is calculate the Euclidean distance. Now say we have new data with height 161 centimeters and weight 61 kg. So the very first thing we'll do is calculate the Euclidean distance, which is nothing but the square root of (161 minus 158) squared plus (61 minus 58) squared, and the square root of that is 4.24. Let's drag and drop it; so these are the various Euclidean distances of the other points. Now let's suppose k equals 5; then what the algorithm does is search for the five customers closest to the new customer, that is, most similar to the new data in terms of its attributes. For k equals 5, let's find the top five minimum Euclidean distances. So these are the distances which we are going to use: one, two, three, four and five. So let's rank them in order: this is first, this is second, this is third, then this one is fourth, and again this one is fifth. So there is our order. So for k equals 5 we have four T-shirts which come under size M and one T-shirt which comes under size L, so obviously the best guess, the best prediction, for the T-shirt size for height 161 centimeters and weight 61 kg is M; or you can say that the new customer fits into size M. Well, this was all about the theoretical session, but before we drill down to the coding part, let me just tell you why people call KNN a lazy learner. Well, KNN for classification is a very simple algorithm, but that's not why it is called lazy. KNN is a lazy learner because it doesn't build a discriminative function from the training data; what it does is memorize the training data. There is no learning phase for the model, and all of the work happens at the time a prediction is requested. So that is the reason why KNN is often referred to as a lazy learning algorithm. So this was all about the theoretical session; now let's move on to the coding part. For the practical implementation, the hands-on part, I'll be using the Iris data set. This data set consists of 150 observations; we have four features and one class label. The four features include the sepal length, sepal width, petal length and petal width, whereas the class label decides which flower belongs to which category. So this was the description of the data set which we are using. Now, … getNeighbors: so for that, what we will be doing is defining a function getNeighbors; what it will do is return the K most similar neighbors from the training set for a given test instance. All right, so this is how our getNeighbors function looks: it takes the training data set, a test instance and K as its input. Here the K is nothing but the number of nearest neighbors you want to check for. All right, so basically
what you'll be getting from this getNeighbors function is K different points having the least Euclidean distance from the test instance. All right, let's execute it. So the function executed without any errors; let's test it. Suppose the training data set includes data like (2, 2, 2), which belongs to class A, and other data (4, 4, 4), which belongs to class B, and our test instance is (5, 5, 5). Now we have to predict whether this test instance belongs to class A or to class B. All right, for k equals 1 we have to find its nearest neighbor and predict whether this test instance will belong to class A or class B. All right, so let's hit the Run button. On executing it, you can see that the output is (4, 4, 4) and B: the new instance (5, 5, 5) is closest to the point (4, 4, 4), which belongs to class B. All right, now once you have located the most similar neighbors for a test instance, the next task is to predict a response based on those neighbors. So how can we do that? Well, we can do this by allowing each neighbor to vote for its class attribute and taking the majority vote as the prediction. Let's see how we can do that. So we have a function getResponse, which takes neighbors as its input. Well, these neighbors are nothing but the output of the getNeighbors function: the output of getNeighbors will be fed into getResponse. All right, let's hit the Run button; it's executed. Let's move ahead and test our getResponse function. So we have neighbors as (1, 1, 1), which belongs to class A, (2, 2, 2), which belongs to class A, and (3, 3, 3), which belongs to class B. So this response variable will store the value of getResponse on passing these neighbor values. All right, so what we want to check is: we want to predict whether that test instance (5, 5, 5) belongs to class A or class B when the neighbors are (1, 1, 1) a, (2, 2, 2) a and (3, 3, 3) b. So let's check our response. Now that we have created all the different functions which are required for a KNN algorithm, an important concern is: how do you evaluate the accuracy of the prediction? An easy way to evaluate the accuracy of the model is to calculate the ratio of the total correct predictions to all the predictions made. So for this I will be defining a function getAccuracy, and inside it I'll be passing my test data set and the predictions. Let's check the getAccuracy function; it executed without any error. Let's check it on a sample data set. So we have our test data set as (1, 1, 1), which belongs to class A, (2, 2, 2), which again belongs to class A, and (3, 3, 3), which belongs to class B. In my predictions, for the first test data it predicted that it belongs to class A, which is true; for the next it predicted that it belongs to class A, which is again true; and for the next it again predicted that it belongs to class A, which is false in this case, because that test data belongs to class B. All right, so in total we have two correct predictions out of three. So the ratio will be 2/3, which is nothing but 66.6%, so our accuracy rate is 66.6%. So now that you have created all the functions that are required for the KNN algorithm, let's compile them into one single main function. Alright, so this is our main function, and we are using the Iris data set with a split of 0.67, and the value of K is 3. Let's see what the accuracy score is and check how accurate our model is. So in the training data set we have 113 values, and in the test data set we have 37 values; these are the predicted and the actual values of the output. Okay, so in total we got an accuracy of ninety-seven point two nine percent, which is really very good. Alright, so I hope the concept of the KNN algorithm is clear. In a world full of machine learning and artificial intelligence surrounding almost everything around us, classification and prediction is one of the most important
aspects of machine learning. So before moving forward, let's have a look at the agenda. I'll start off this video by explaining to you guys what exactly naive Bayes is; then we'll understand Bayes' theorem, which serves as the logic behind the naive Bayes algorithm; moving forward, I'll explain the steps involved in the naive Bayes algorithm one by one; and finally, I'll finish off this video with a demo on naive Bayes using the sklearn package. Now, naive Bayes is a simple but surprisingly powerful algorithm for predictive analysis. It is a classification technique based on Bayes' theorem with an assumption of independence among predictors, and its name comprises two parts, 'naive' and 'Bayes'. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature; even if these features depend on each other, or upon the existence of the other features, all of these properties independently contribute to the probability of whether a fruit is an apple or an orange or a banana, and that is why it is known as naive. Now, a naive Bayes model is easy to build and particularly useful for very large data sets. In probability theory and statistics, Bayes' theorem, alternatively known as Bayes' law or Bayes' rule, describes the probability of an event based on prior knowledge of the conditions that might be related to the event. Bayes' theorem is a way to figure out conditional probability. Conditional probability is the probability of an event happening given that it has some relationship to one or more other events; for example, your probability of getting a parking space is connected to the time of day you park, where you park, and what conventions are going on at that time. Bayes' theorem is slightly more nuanced: in a nutshell, it gives you the actual probability of an event given information about tests. Now, if you look at the definition of Bayes' theorem, we can see that given a hypothesis H and the evidence E, Bayes' theorem states the relationship between the probability of the hypothesis before getting the evidence, which is P(H), and the probability of the hypothesis after getting the evidence, which is P(H|E), as: P(H|E) equals P(E|H) times P(H) divided by P(E). It's rather confusing, right? So let's take an example to understand this theorem. Suppose I have a deck of cards, and a single card is drawn from the deck of playing cards; the probability that the card is a king is 4/52, since there are four kings in a standard deck of 52 cards. Now, if King is the event 'this card is a king', the probability of King is 4/52, that is, 1/13. Now suppose evidence is provided, for instance someone looks at the card and reveals that the single card is a face card; the probability of King given that it's a face card can be calculated using Bayes' theorem by this formula. Since every king is also a face card, the probability of Face given that it's a king is equal to 1, and since there are three face cards in each suit, that is the jack, king and queen, the probability of a face card is equal to 12/52, that is 3/13. Now, using Bayes' theorem, we can find out the probability of King given that it's a face card, and our final answer comes to 1/3, which is also intuitively true: if you have a deck which has only face cards, there are three types of faces, which are the jack, king and queen, so the probability that it's a king is 1/3. Now, this is a simple example of how Bayes' theorem works. Let's look at the proof, as in how Bayes' theorem is derived. So here we have the probability of A given B and the probability of B given A. For a joint probability distribution over the sets A and B, with the probability of A intersection B, the conditional probability of A given B is defined as the probability of A intersection B divided by the probability of B, and similarly, the probability of B given A is defined as the probability of B intersection A divided by the probability of A. Now we equate the probability of A intersection B and the probability of B intersection A, as both are the same thing, and from this we get our final Bayes' theorem proof, which is: the probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B. Now, while this is the equation that applies to any probability distribution over the events A and B, it has a particularly nice interpretation in the case where A is represented as the hypothesis H and B is represented as some observed evidence E. In that case the formula is: P(H|E) is equal to P(E|H) times P(H) divided by P(E). This relates the probability of the hypothesis before getting the evidence, which is P(H), to the probability of the hypothesis after getting the evidence, which is P(H|E). For this reason P(H) is known as the prior probability, while P(H|E) is known as the posterior probability, and the factor that relates the two is known as the likelihood ratio. Using these terms, Bayes' theorem can be rephrased as: the posterior probability equals the prior probability times the likelihood ratio. So now that we know the maths involved behind Bayes' theorem, let's see how we can implement it in a real-life scenario. So suppose we have a data set in which we have the Outlook, the humidity and the wind, and we need to find out whether we should play or not on a given day. The Outlook can be sunny, overcast or rainy; the humidity can be high or normal; and the wind is categorized into two classes, which are weak and strong winds. First of all, we will create a frequency table using each attribute of the data set. So the frequency table for the Outlook looks like this: we have sunny, overcast and rainy. The frequency table for humidity looks like this, and the frequency table for wind looks like this: we have strong and weak for wind, and high and normal ranges for humidity. So for each frequency table, we will
generate a likelihood table now now the likelihood table contains the probability of a particular day suppose we take the sunny and we take the play as yes and no so the probability of Sunny given that we play yes is 3 by 10, which is 0.3 the probability of X, which is the probability of Sunny He is equal to 5 by 14 Now These are all the terms which are just generated from the data which we have here And finally the probability of yes is 10 out of 14 So if we have a look at the likelihood of yes given that it’s a sunny we can see using Bayes theorem It’s the probability of Sunny given yes into probability of yes divided by the probability of Sunny So we have all the values here calculated So if you put that in our base serum equation, we get the likelihood of Is a 0.59 similarly the likelihood of no can also be calculated here is 0.40 now similarly We are going to create the likelihood table for both the humidity and the win there’s a for humidity the likelihood for yes given the humidity is high is equal to 0.4 to and the probability of playing know given the Venice High is 0.58 the similarly for table wind The probability of e is given that the wind is week is 0.75 and the probability of no given that the win is week is 0.25 now suppose we have of day which has high rain which has high humidity and the wind is weak So should we play or not? 
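As a quick check, the likelihood quoted above (roughly 0.59 for yes given sunny) can be reproduced from the table values in a few lines of Python. This is only an illustrative sketch: the fractions below are the counts read off the transcript's frequency tables, and the variable names are my own.

```python
from fractions import Fraction

# Values assumed from the likelihood tables described above:
# P(Sunny | Yes) = 3/10, P(Yes) = 10/14, P(Sunny) = 5/14.
p_sunny_given_yes = Fraction(3, 10)
p_yes = Fraction(10, 14)
p_sunny = Fraction(5, 14)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny  # the two posteriors sum to 1

print(float(p_yes_given_sunny))  # 0.6  (the ~0.59 quoted above, exact in fractions)
print(float(p_no_given_sunny))   # 0.4
```

Working in exact fractions shows that the 0.59 quoted in the video is just a rounding artifact; the exact posterior is 3/5.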
For that, we use Bayes' theorem here again. The likelihood of yes on that day is equal to the probability of outlook rain given yes, times the probability of high humidity given yes, times the probability of weak wind given yes, times the probability of yes, which comes to 0.019. Similarly, the likelihood of no on that day comes to 0.016. Now, to get the probability of playing on that day, we just need to divide each likelihood by the sum of the likelihoods of both yes and no. So the probability of playing tomorrow, which is yes, is 0.55, whereas the probability of not playing is 0.45. This is based upon the data which we already have with us. So now that you have an idea of what exactly Naive Bayes is, how it works, and how it can be implemented on a particular data set, let's see where it is used in the industry.

Let's start with our first industrial use case, which is news categorization, or, to broaden the spectrum of this algorithm, text classification. News on the web is growing rapidly in the era of the information age, where each news site has its own layout and categorization for grouping news. This heterogeneity of layout and categorization cannot always satisfy an individual user's needs, and removing it by classifying the news articles according to user preference is a formidable task. Companies use a web crawler to extract the useful text from the HTML pages of news articles, and each article is then tokenized; these tokens are nothing but the categories of the news. In order to achieve better classification results, we remove the less significant words, which are the stop words, from the documents or articles, and then we apply the Naive Bayes classifier for classifying the news content.

Next is by far one of the best-known examples of the Naive Bayes classifier, which is spam filtering. Naive Bayes classifiers are a popular statistical technique for email filtering. They typically use bag-of-words features to identify spam email, an approach commonly used in text classification as well. It works by correlating the use of tokens with spam and non-spam emails, and then Bayes' theorem, which I explained earlier, is used to calculate the probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of an individual user, and it gives low false-positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with its roots in the 1990s. Particular words have particular probabilities of occurring in spam email and in legitimate email as well; for instance, most email users will frequently encounter the words "lottery" or "lucky draw" in spam email, but will seldom see them in other emails. The filter doesn't know these probabilities in advance and must first be trained to build them up. To train the filter, the user must manually indicate whether a new email is spam or not; for all the words in each training email, the filter adjusts the probability that each word will appear in spam or in legitimate email in its database. After training, the word probabilities, also known as the likelihood functions, are used to compute the probability that an email with a particular set of words belongs to either category. Each word in the email contributes to the email's spam probability; this contribution is called the posterior probability and is computed, again, using Bayes' theorem. Then the email's spam probability is computed over all the words in the email, and if the total exceeds a certain threshold, say 95%, the filter marks the email as spam.

Now, object detection is the process of finding instances of real-world objects, such as faces, bicycles, and buildings, in images or video. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category, and here again Naive Bayes plays an important role in the categorization and classification of objects.

Next, the medical area. There is an increasingly voluminous amount of electronic medical data, which is becoming more and more complicated, and the produced medical data has certain characteristics that make its analysis very challenging and attractive as well. Among all the different approaches, Naive Bayes is one of the most effective and efficient classification algorithms and has been successfully applied to many medical problems. Empirical comparison of Naive Bayes against five popular classifiers on medical data sets shows that Naive Bayes is well suited for medical applications and has high performance on most of the examined medical problems. In the past, various statistical methods were used for modeling in the area of disease diagnosis; these methods require prior assumptions and are less capable of dealing with massive, complicated, nonlinear, and dependent data. One of the main advantages of the Naive Bayes approach, which is appealing to physicians, is that all the available information is used to explain the decision. This explanation seems natural for medical diagnosis and prognosis; that is, it is very close to the way physicians diagnose patients.

Now, weather is one of the most influential factors in our daily life, to the extent that it may affect the economy of a country that depends on occupations like agriculture. Therefore, as a countermeasure to reduce the damage caused by uncertainty in weather behavior, there should be an efficient way to predict the weather. Weather prediction has been a challenging problem for the meteorological department for years; even after technological and scientific advancement, the accuracy of weather prediction has never been sufficient, and even today this domain remains a research topic in which scientists and mathematicians are working to produce a model or algorithm that will accurately predict the weather. A Bayesian-approach-based model is created where posterior probabilities are used to calculate the likelihood of each class label for an input data instance, and the one with the maximum likelihood is considered the resulting output. Earlier we saw a small implementation of this as well, where we predicted whether we should play or not based on the data which we had collected.

Now, there is a Python library known as scikit-learn which helps us build a Naive Bayes model in Python, and there are three types of Naive Bayes models under the scikit-learn library. The first one is Gaussian: it is used in classification, and it assumes that the features follow a normal distribution. Next we have multinomial: it is used for discrete counts. For example, say we have a text classification problem; here we consider Bernoulli trials, which is one step further, and instead of "word occurs in the document," we count how often a word occurs in the document. You can think of it as the number of times an outcome is observed over a given number of trials. And finally, we have the Bernoulli type of Naive Bayes: the Bernoulli model is useful if your feature vectors are binary, as in a bag-of-words model where the ones and the zeros are, respectively, the words which occur in the document and the words which do not occur in the document. Based on your data set, you can choose any of the models discussed here.

The first step is to separate the data set by class value, sorting the entire data set of instances into the appropriate lists; the separateByClass function does just that. As you can see, the function assumes that the last attribute is the class value, and it returns a map of each class value to the list of its data instances. Next, we need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we use it as the middle of our Gaussian distribution when calculating the probabilities; so this is our function for the mean. We also need to calculate the standard deviation of each attribute for a class value. The standard deviation is calculated as the square root of the variance, and the variance is calculated as the average of the squared differences of each attribute value from the mean. One thing to note here is that we are using the n-minus-one method, which subtracts one from the number of attribute values when calculating the variance.

Now that we have the tools to summarize the data, for a given list of instances we can calculate the mean and standard deviation for each attribute. The zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for each attribute. Next comes summarizing attributes by class: we pull it all together by first separating our training data set into instances grouped by class, then calculating the summaries for each attribute.

Now we are ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction. We can divide this whole method into four tasks: calculating the Gaussian probability density function, calculating class probabilities, making a prediction, and estimating the accuracy. To calculate the Gaussian probability density, we use the Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation of the attribute estimated from the training data; as you can see, the parameters are x, the mean, and the standard deviation. In the calculateProbability function, we calculate the exponent first, then calculate the main division; this lets us fit the equation nicely into two lines.

The next task is calculating the class probabilities. Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class. So now that we have calculated the class probabilities, it's time to finally make our first prediction: we calculate the probability of the data instance belonging to each class value, look for the largest probability, and return the associated class. For that we use the predict function, which takes the summaries and the input vector, which is basically all the attribute values input for a particular label. Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test data. For that we use the getPredictions method, which calculates the predictions based upon the test data set and the summary of the training data set. The predictions can then be compared to the class values in our test data set, and classification accuracy can be calculated as an accuracy ratio between 0 and 100 percent.
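The pieces just described (the mean, the n - 1 standard deviation, the Gaussian density, and the class-by-class prediction) can be sketched in a few lines. This is a minimal illustration rather than the demo's actual code: the function names and the toy summaries below are my own stand-ins.

```python
import math

def mean(numbers):
    return sum(numbers) / len(numbers)

def stdev(numbers):
    # Sample standard deviation: note the n - 1 in the denominator,
    # matching the "n minus one" method mentioned above.
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return math.sqrt(variance)

def calculate_probability(x, avg, sd):
    # Gaussian probability density of attribute value x.
    exponent = math.exp(-((x - avg) ** 2) / (2 * sd ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sd)

def predict(summaries, input_vector):
    # summaries: {class_value: [(mean, stdev) for each attribute]}
    # Multiply the per-attribute densities and return the most probable class.
    best_label, best_prob = None, -1.0
    for class_value, class_summaries in summaries.items():
        prob = 1.0
        for x, (avg, sd) in zip(input_vector, class_summaries):
            prob *= calculate_probability(x, avg, sd)
        if prob > best_prob:
            best_label, best_prob = class_value, prob
    return best_label

# Toy summaries: class 0 centered at 1.0, class 1 centered at 5.0.
summaries = {0: [(1.0, 0.5)], 1: [(5.0, 0.5)]}
print(predict(summaries, [1.2]))  # -> 0
print(predict(summaries, [4.8]))  # -> 1
```

Note that the per-class products here are unnormalized; dividing each by their sum would give the posteriors, exactly as we did for the play/no-play example.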
Now, the getAccuracy method will calculate this accuracy ratio. Finally, to sum it all up, we define our main function, and we call all these methods which we defined earlier, one by one, to get the accuracy of the model which we have created. As you can see, this is our main function, in which we have the file name, we have defined the split ratio, and we have the data set and the training and test data sets. We are using the splitDataset method; next we use the summarizeByClass function, and the getPredictions and getAccuracy methods as well. As you can see, the output tells us that we are splitting the 768 rows into 514 training rows and 254 test rows, and the accuracy of this model is 68%. Now, we can play with the amount of training and test data used, so we can change the split ratio to get a different sort of accuracy. Suppose I change the split ratio from 0.67 to 0.8: as you can see, we get an accuracy of 62 percent, so splitting at 0.67 gave us a better result, which was 68 percent.

So this is how you can implement a Gaussian Naive Bayes classifier; these are the step-by-step methods you need to follow if you write the Naive Bayes classifier yourself. But don't worry, we do not need to write all these lines of code to make a model; this is where scikit-learn comes into the picture. The scikit-learn library has a predefined method, or let's say a predefined function, for Naive Bayes which converts all of these lines of code into merely two or three lines. So let me just open another Jupyter notebook, and let me name it "sklearn Naive Bayes."

Here we are going to use the most famous data set, which is the iris data set. The iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher, and based on Fisher's linear discriminant model this data set became a typical test case for many statistical classification techniques in machine learning. So here we are going to use the GaussianNB model, which is already available in sklearn; as I mentioned earlier, there are three types of Naive Bayes, which are the Gaussian, the multinomial, and the Bernoulli, and here we are going to use the GaussianNB model, which is already present in the sklearn library, that is, the scikit-learn library. First of all, we need to import the sklearn datasets and metrics, and we also need to import GaussianNB. Once all these libraries are loaded, we load the data set, which is the iris data set. Next, we fit a Naive Bayes model to this data set. As you can see, we have very easily defined the model, which is GaussianNB; it contains all the programming which I just showed you earlier, all the methods which take the input, calculate the mean and the standard deviation, separate the data by class, and finally make predictions and calculate the prediction accuracy. All of this comes under the GaussianNB method, which is already present in the sklearn library; we just need to fit it to the data set which we have. Next, if we print the model, we see that it is the GaussianNB model. Then we make the predictions: the expected output is dataset.target, and the predicted output is obtained using the predict method of the model, and the model we are using is GaussianNB. Now, to summarize the model we created, we calculate the confusion matrix and the classification report. So guys, as you can see in the classification report, we have a precision of 0.96, we have a recall of 0.96, we have the F1-score and the support, and finally, if we print our confusion matrix, as you can see, it gives us this output. So, using the GaussianNB method, fitting the model you created to a particular data set and getting the desired output is very easy with the scikit-learn library.

So guys, this is it. I hope you understood a lot about the Naive Bayes classifier: how it is used, where it is used, the different steps involved in the classification technique, and how scikit-learn makes all of those techniques very easy to implement on any data set which we have.

Next, SVM, or support vector machine, is one of the most effective machine learning classifiers, and it has been used in various fields such as face recognition, cancer classification, and so on. Today's session is dedicated to how SVM works, the various features of SVM, and how it is used in the real world. So without any further ado, let's take a look at the agenda for today. We're going to begin the session with an introduction to machine learning and the different types of machine learning. Next, we'll discuss what exactly support vector machines are, and then we'll move on and see how SVM works and how it can be used to classify linearly separable data. We'll also briefly discuss how nonlinear SVMs work, and then we'll look at the use case of SVM in colon cancer classification. Finally, we'll end the session by running a demo where we'll use SVM to predict whether a patient is suffering from heart disease or not. Okay, so that was the agenda; let's get started with our first topic.

So, what is machine learning? Machine learning is the science of getting computers to act by feeding them data and letting them learn a few tricks on their own. We're not going to explicitly program the machine; instead, we're going to feed it data and let it learn. The key to machine learning is the data; machines learn just like us humans. Now, say you have rabbits and wolves on your land and you can't decide where to build your fence. One way to get around the problem is to build a classifier based on the position of the rabbits and wolves in your pasture. So what I'm telling you is, you can classify the group of rabbits as one group and draw a decision boundary between the rabbits and the wolves. If I do that and try to draw a decision boundary between the rabbits and the wolves, it looks something like this, and now you can clearly build a fence along this line. In simple terms, this is exactly how SVM works: it draws a decision boundary, which is a hyperplane, between any two classes in order to separate or classify them.

Now, I know you're thinking: how do you know where to draw the hyperplane? The basic principle behind SVM is to draw a hyperplane that best separates the two classes, in our case the two classes of rabbits and wolves. You start off by drawing a random hyperplane, and then you check the distance between the hyperplane and the closest data points from each class. These closest data points to the hyperplane are known as support vectors, and that's where the name comes from: support vector machine. Basically, the hyperplane is drawn based on these support vectors, and an optimum hyperplane will have a maximum distance from each of them. The hyperplane which has the maximum distance from the support vectors is the most optimal hyperplane, and this distance between the hyperplane and the support vectors is known as the margin. So to sum it up, SVM classifies data using a hyperplane such that the distance between the hyperplane and the support vectors is maximum; basically, your margin has to be maximum. That way you know that you're actually separating your classes well, because the distance between the two classes is maximum.

Now, let's try to solve a problem. Let's say that I input a new data point, and now I want to draw a hyperplane such that it best separates the two classes. So I start off by drawing a hyperplane like this, and then I check the distance between the hyperplane and the support vectors, trying to check whether the margin is maximum for this hyperplane. But what if I draw a hyperplane which is like this? Now I'm going to find the support vectors over here, then check the distance from the support vectors, and with this hyperplane it's clear that the margin is more: when you compare the margin of the previous one to this hyperplane, it is more. So the reason I'm choosing this hyperplane is that the distance between the support vectors and the hyperplane is maximum in this scenario. So guys, this is how you choose a hyperplane: you basically have to make sure that the hyperplane has a maximum margin, so that it best separates the two classes.

Okay, so far it was quite easy: our data was linearly separable, which means you could draw a straight line to separate the two classes. But what will you do if the data set is like this? You possibly can't draw a hyperplane like this; it doesn't separate the two classes at all. So what do you do in such situations? Earlier in the session I mentioned how a kernel can be used to transform data into another dimension that has a clear dividing margin between the classes of data; kernel functions offer the user this option of transforming nonlinear spaces into linear ones. A nonlinear data set is one that you can't separate using a straight line. In order to deal with such data sets, you're going to transform them into linear data sets and then use SVM on them. A simple trick would be to transform the two variables x and y into a new feature space involving a new variable called z. So guys, so far we were plotting our data in a two-dimensional space, correct?
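Before moving on, here is a tiny sketch of that trick. The transcript does not fix the mapping for z, so this assumes the common choice z = x^2 + y^2, which turns two concentric rings (not separable by any straight line in 2D) into two linearly separable groups in 3D.

```python
# Lift a 2D point (x, y) into 3D by adding z = x**2 + y**2.
# This mapping is an assumption for illustration; any mapping that
# spreads the classes apart along the new axis would do.
def lift(point):
    x, y = point
    return (x, y, x ** 2 + y ** 2)

# Class A sits on an inner ring, class B on an outer ring: no straight
# line separates them in the original 2D plane.
class_a = [(1, 0), (0, 1), (-1, 0), (0, -1)]
class_b = [(3, 0), (0, 3), (-3, 0), (0, -3)]

# After the lift, the flat plane z = 5 cleanly separates the two classes.
print(all(lift(p)[2] < 5 for p in class_a))  # True
print(all(lift(p)[2] > 5 for p in class_b))  # True
```

The separating plane in the lifted space corresponds to a circle back in the original 2D space, which is exactly the kind of boundary the kernel trick buys you.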
We will only using the X and the y axis so we had only those two variables X and Y now in order to deal with this kind of data a simple trick would be to transform the two variables X and I into a new feature space involving a new variable called Z. Ok, so we’re basically visualizing the data on a three-dimensional space Now when you transform the 2D space into a 3D space, you can clearly see a dividing margin between the two classes of data right now You can go ahead and separate the two classes by drawing the best hyperplane between them Okay, that’s exactly what we discussed in the previous slides So guys, why don’t you try this yourself dry drawing a hyperplane, which is the most Optimum For these two classes All right, so guys, I hope you have a good understanding about nonlinear svm’s now Let’s look at a real world use case of support Vector machines So guys s VM as a classifier has been used in cancer classification since the early 2000s So there was an experiment held by a group of professionals who applied svm in a colon cancer tissue classification So the data set consisted of about 2,000 transmembrane protein samples and Only about 50 to 200 genes samples were input Into the svm classifier Now this sample which was input into the svm classifier had both colon cancer tissue samples and normal colon tissue samples right now The main objective of this study was to classify Gene samples based on whether they are cancerous or not Okay, so svm was trained using the 50 to 200 samples in order to discriminate between non-tumor from tumor specimens So the performance of The svm classifier was very accurate for even a small data set All right, we had only 50 to 200 samples And even for the small data set svm was pretty accurate with its results Not only that its performance was compared to other classification algorithm like naive Bayes and in each case svm outperform naive Bayes So after this experiment it was clear that svm classify the data more effectively and 
it worked exceptionally good with small data sets Let’s go ahead and understand what exactly is unsupervised learning So sometimes the given data is unstructured and unlabeled so it becomes difficult to classify the data into different categories So unsupervised learning helps to solve this problem This learning is used to Cluster the input data in classes on the basis of their statistical properties So example, we can cluster Different Bikes based upon the speed limit their acceleration or the average Average that they are giving so and suppose learning is a type of machine learning algorithm used to draw inferences from data sets consisting of input data without labels responses So if you have a look at the workflow or the process flow of unsupervised learning, so the training data is collection of information without any label We have the machine learning algorithm and then we have the clustering malls So what it does is that distributes the data into different clusters and again if you provide any Lebanon new data, it will make a prediction and find out to which cluster that particular data or the data set belongs to or the particular data point belongs to so one of the most important algorithms in unsupervised learning is clustering So let’s understand exactly what is clustering So a clustering basically is the process of dividing the data sets into groups consisting of similar data points It means grouping of objects based on the information found in the data describing the objects or their relationships, so So clustering malls focus on and defying groups of similar records and labeling records according to the group to which they belong now This is done without the benefit of prior knowledge about the groups and their creator districts So and in fact, we may not even know exactly how many groups are there to look for Now These models are often referred to as unsupervised learning models, since there’s no external standard by which to judge the malls 
classification performance There are no right or wrong answers to these model and if we talk about why clustering is used so the goal of clustering is to determine the intrinsic growth in a set of unlabeled data sometime The partitioning is the goal or the purpose of clustering algorithm is to make sense of and exact value from the last set of structured and unstructured data So that is why clustering is used in the industry And if you have a look at the various use cases of clustering in Industry so first of all, it’s being used in marketing So discovering distinct groups in customer databases such as customers who make a lot of long distance calls customers who use internet more than cause they’re also using insurance companies for like identifying groups of Corporation insurance policy holders with high average claim rate Farmers crash cops, which is profitable They are using C Smith studies and Define probability areas of oil or gas exploration based Don’t cease make data and they’re also used in the recommendation of movies If you’d say they are also used in Flickr photos They also used by Amazon for recommending the product which category it lies in So basically if we talk about clustering there are three types of clustering So first of all, we have the exclusive clustering which is the hard clustering so here and item belongs exclusively to one cluster not several clusters and the datapoint belong exclusively to one cluster ER so an example of this is the k-means clustering so claiming clustering does this exclusive kind of clustering so secondly, we have overlapping clustering so it is also known as soft clusters in this and item can belong to multiple clusters as its degree of association with each cluster is shown and for example, we have fuzzy or the c means clustering which has been used for overlapping clustering and finally we have the hierarchical clustering so When two clusters have a parent-child relationship or a tree-like structure, then it is 
known as hierarchical cluster So as you can see here from the example, we have a parent-child kind of relationship in the cluster given here So let’s understand what exactly is K means clustering So today means clustering is an Enquirer them whose main goal is to group similar elements of data points into a cluster and it is a process by which objects are classified into a predefined number of groups so that they They are as much just similar as possible from one group to another group but as much as similar or possible within each group now if you have a look at the algorithm working here, you’re right So first of all, it starts with and defying the number of clusters, which is K that I can we find the centroid we find that distance objects to the distance object to the centroid distance of object to the centroid Then we find the grouping based on the minimum distance Past the centroid Converse if true then we make a cluster false We then I can’t find the centroid repeat all of the steps again and again, so let me show you how exactly clustering was with an example here So first we need to decide the number of clusters to be made now another important task here is how to decide the important number of clusters or how to decide the number of classes will get into that later So first, let’s assume that the number of clusters we have decided It is three So after that then we provide the centroids for all the Clusters which is guessing and the algorithm calculates the euclidean distance of the point from each centroid and assize the data point to the closest cluster now euclidean distance All of you know is the square root of the distance the square root of the square of the distance So next when the centroids are calculated again, we have our new clusters for each data point then again the distance from the points To the new classes are calculated and then again the points are assigned to the closest cluster And then again, we have the new centroid scattered and now 
these steps are repeated until the centroids repeat themselves, or the new centroids are very close to the previous ones So unless our output gets repeated, or the outputs are close enough, we do not stop this process: we keep on calculating the Euclidean distance of all the points to the centroids, then we calculate the new centroids, and that is how K-means clustering works, basically So an important part here is to understand how to decide the value of K, or the number of clusters, because it does not make any sense if you do not know how many clusters you are going to make So to decide the number of clusters we have the elbow method First of all, compute the sum of squared errors, which is the SSE, for some values of K, for example 2, 4, 6 and 8 Now the SSE is defined as the sum of the squared distances between each member of a cluster and its centroid; mathematically it is given by the equation which is provided here And if you plot K against the SSE, you will see that the error decreases as K gets larger This is because as the number of clusters increases, the clusters get smaller, so the distortion is also smaller Now the idea of the elbow method is to choose the K at which the decrease in the SSE slows abruptly So for example, if we have a look at the figure given here, we see that the best number of clusters is at the elbow; as you can see here, the graph changes abruptly after the number four So for this particular example, we're going to use four as the number of clusters Now, while working with K-means clustering there are two key points to know First of all, be careful about where you start: choose the first center at random, then the second center far away from the first center, and similarly choose the next center as far away as possible from the closest of the other centers And the second idea is to do many runs of K-means, each
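The elbow method can be sketched with scikit-learn, whose `inertia_` attribute is exactly the SSE. The blob data below is synthetic (four well-separated groups, so the elbow should land near K = 4):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# four well-separated synthetic blobs
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

sse = []
for k in (2, 4, 6, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ is the sum of squared errors

print(sse)  # decreasing; the drop flattens after K = 4
```

Plotting `sse` against K with matplotlib would give the elbow curve described above.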
with different random starting points, so that you get an idea of where exactly and how many clusters you need to make, where exactly the centroids lie, and how the data is converging Now K-means is not a perfect method, so let's understand the pros and cons of K-means clustering We know that K-means is simple and understandable; everyone gets it on the first go, and the items are automatically assigned to the clusters Now if we have a look at the cons: first of all, one needs to define the number of clusters, and that is a heavy task If we have three, four or ten categories, and we do not know what the number of clusters is going to be, it's very difficult for anyone to guess the number of clusters Next, all the items are forced into clusters, whether or not they actually belong to any of them; they are forced to lie in the category to which they are closest This again happens because of not defining the correct number of clusters, or not being able to guess the correct number of clusters And most of all, it's unable to handle noisy data and outliers Anyway, machine learning engineers and data scientists have to clean the data, but then again it comes down to the analysis they're doing and the method they're using; typically people do not clean the data specially for K-means clustering, and even if they clean it, sometimes noisy data and outliers remain, and they affect the whole model So that was all for K-means clustering So what we're going to do now is use K-means clustering for the movie dataset, so we have to find out the number of clusters and divide the data accordingly So the use case is that, first of all, we have a data set of five thousand movies, and what we want to do is group the movies into clusters based on their Facebook likes So guys, let's have a look at the demo here So first of all, what we're going to do is
import deepcopy, numpy, pandas and seaborn, the various libraries which we're going to use, and from matplotlib we use pyplot; we're also going to use the ggplot style And next what we're going to do is import the data set and look at the shape of the data set So if you have a look at the shape of the data set, we can see that it has 5043 rows with 28 columns, and if you have a look at the head of the data set, we can see its 5043 data points So what we're going to do is place the data points in a plot We take the director Facebook likes, and if we have a look at the data columns we have facenumber_in_poster, cast_total_facebook_likes, director_facebook_likes and so on So what we have done here is taken the director Facebook likes and the actor 3 Facebook likes, right, so we have 5043 rows and two columns Now, using the K-means from sklearn, what we're going to do is import it First we're going to import KMeans from sklearn.cluster Remember guys, sklearn is a very important library in Python for machine learning And the number of clusters we're going to provide is five Now again, the number of clusters depends upon the SSE, which is the sum of squared errors, or we could use the elbow method, so I'm not going to go into the details of that again So we're going to fit the data with the K-means fit method, and we find the cluster centers for the K-means and print them So what we find is an array of five cluster centers, and we print the labels of the K-means clusters Now next what we're going to do is plot the data with the new clusters which we have found, and for this we're going to use seaborn And as you can see here, we have plotted the data into the grid, and you can see we have five clusters So probably what I would say is that cluster 3 and cluster 0 are very, very close See, that's exactly what I was going to say, that initially
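A minimal sketch of the fit-and-inspect steps just described; the two like-count columns here are filled with random stand-in values rather than the real movie dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# stand-in for the two columns selected from the movie dataset
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "director_facebook_likes": rng.integers(0, 20_000, size=500),
    "actor_3_facebook_likes": rng.integers(0, 5_000, size=500),
})

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(df.values)

print(kmeans.cluster_centers_.shape)  # one (x, y) centre per cluster
print(np.unique(kmeans.labels_))      # the five cluster labels
```

In the demo the labels would then be passed to seaborn as the hue of a scatter plot to draw the clustered grid.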
the main challenge in K-means clustering is to define the number of centers, which is K So as you can see here, the third cluster and the zeroth cluster are very, very close to each other, so guys, they probably could have been one single cluster And the other disadvantage is that we do not exactly know how the points should be arranged, so it's very difficult when the data is forced into some other cluster, which makes our analysis a little different It works fine, but sometimes it might be difficult to work with K-means clustering Now, let's understand what exactly c-means clustering is So fuzzy c-means is an extension of K-means, the popular simple clustering technique Fuzzy clustering, also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster So K-means tries to find hard clusters, where each point belongs to exactly one cluster, whereas fuzzy c-means discovers soft clusters In a soft cluster, any point can belong to more than one cluster at a time, with a certain affinity value towards each Fuzzy c-means assigns a degree of membership, ranging from 0 to 1, to an object for a given cluster

Our minimum confidence value is 60 percent For that, we're going to generate all the non-empty subsets for each frequent itemset Now for I = {1, 3, 5}, we get the subsets {1, 3}, {1, 5}, {3, 5}, {1}, {3} and {5}; similarly for {2, 3, 5} we get {2, 3}, {2, 5}, {3, 5}, {2}, {3} and {5} Now the rule states that for every subset S of I, the output of the rule is S gives I minus S, that is, S recommends I minus S, and this rule is selected only if the support of I divided by the support of S is greater than or equal to the minimum confidence value Now, applying this rule to the itemsets of F3, we get rule 1, which is {1, 3} gives {1, 3, 5} minus {1, 3}; it means 1 and 3 give 5 So the confidence is equal to the
support of {1, 3, 5} divided by the support of {1, 3}, and that equals 2/3, which is 66 percent and is greater than the 60 percent, so rule 1 is selected Now if we come to rule 2, which is {1, 5} gives {1, 3, 5} minus {1, 5}, it means if we have 1 and 5, we're also going to have 3 Now to calculate the confidence of this one, we take the support of {1, 3, 5} divided by the support of {1, 5}, which gives us a hundred percent, which means rule 2 is selected as well But again, if you have a look at rules 5 and 6 over here: similarly, rule 5 is {3} gives {1, 3, 5} minus {3}; it means if we have 3, we also get 1 and 5 The confidence for this comes out at 50 percent, which is less than the given 60 percent target, so we're going to reject this rule, and the same goes for rule number 6 Now one thing to keep in mind here is that although rule 1 and rule 5 look a lot alike, they are not: it really depends what's on the left-hand side of the arrow and what's on the right-hand side of the arrow It's the if-then possibility I'm sure you guys can understand what exactly these rules are and how to proceed with them So, let's see how we can implement the same in Python, right?
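The support/confidence arithmetic above can be checked with a few lines of plain Python. The four transactions below are hypothetical, chosen only so that the counts match the worked example (support counts: {1,3,5} = 2, {1,3} = 3, {1,5} = 2, {3} = 4):

```python
transactions = [
    {1, 2, 3, 5},
    {1, 3, 5},
    {1, 3},
    {2, 3, 5},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence(S -> I - S) = support(I) / support(S), where I = S | rhs
    return support(lhs | rhs) / support(lhs)

print(round(confidence({1, 3}, {5}), 2))   # 0.67 -> selected (>= 0.6)
print(round(confidence({1, 5}, {3}), 2))   # 1.0  -> selected
print(round(confidence({3}, {1, 5}), 2))   # 0.5  -> rejected (< 0.6)
```

The three printed values reproduce the 66 percent, 100 percent and 50 percent confidences computed by hand for rules 1, 2 and 5.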
So for that, what I'm going to do is create a new Python file, and I'm going to use the Jupyter Notebook; you're free to use any sort of IDE I'm going to name it apriori So the first thing we're going to do is use the online transactional data of a retail store for generating association rules Firstly, what we need to do is get the pandas and mlxtend libraries imported and read the file So as you can see here, we are using the Online Retail .xlsx file, and from mlxtend we're going to import apriori and association_rules; they all come under mlxtend So as you can see here, we have the invoice number, the stock code, the description, the quantity, the invoice date, the unit price, the customer ID and the country Now next, in this step, what we're going to do is data cleanup, which includes removing the spaces from some of the descriptions, dropping the rows that do not have invoice numbers, and removing the credit transactions, because those are of no use to us So as you can see here at the output, we have about five hundred and thirty-two thousand rows with eight columns So after the cleanup, we need to consolidate the items into one transaction per row, with each product one-hot encoded For the sake of keeping the data set small, we are only looking at the sales for France So as you can see here, we have excluded all the other sales; we're just looking at the sales for France Now there are a lot of zeros in the data, but we also need to make sure any positive values are converted to 1 and anything less than or equal to zero is set to 0 So as you can see here, we still have 392 rows; we're going to encode the data and check again Now that we have structured the data properly, in this step what we're going to do is generate frequent itemsets that have a support of at least seven percent; this number is chosen so that we get close enough, and then generate the rules with their corresponding support, confidence and lift So as you can see here, the minimum support is 0.07 Now, what if we add another
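The consolidation-and-encoding step can be sketched with pandas alone. The invoices and items below are made up; the real demo does the same reshaping on the Online Retail file:

```python
import pandas as pd

# made-up line-item data standing in for the retail export
df = pd.DataFrame({
    "InvoiceNo":   ["536365", "536365", "536366", "536366", "536366"],
    "Description": ["MUG", "LANTERN", "MUG", "CANDLE", "CANDLE"],
    "Quantity":    [6, 2, 3, -1, 4],
})

# one row per invoice, one column per product, quantities summed
basket = (df.groupby(["InvoiceNo", "Description"])["Quantity"]
            .sum().unstack().fillna(0))

# any positive total becomes 1, anything else becomes 0
basket_sets = (basket > 0).astype(int)
print(basket_sets)
```

The resulting 0/1 matrix is exactly the shape that `apriori` from mlxtend expects as its input.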
constraint on the rules, such that the lift is greater than 6 and the confidence is greater than 0.8? So as you can see here, we have the left-hand side and the right-hand side of the association rule

So guys, this was a small example to show you how the reinforcement learning process works So you start with an initial state, and once a player clears that state, he gets a reward After that, the environment will give another state to the player, and after he clears that state, he's going to get another reward, and it's going to keep happening until the player reaches his destination All right, so guys, I hope this is clear Now, let's move on and look at the reinforcement learning definitions So there are a few concepts that you should be aware of while studying reinforcement learning Let's look at those definitions over here So first we have the agent Now, an agent is basically the reinforcement learning algorithm that learns from trial and error Okay, so an agent takes actions, like, for example, a soldier in Counter-Strike navigating through the game; that's an action Okay, if he moves left or right, or if he shoots at somebody, that's also an action So the agent is responsible for taking actions in the environment Now the environment is the whole Counter-Strike game Okay, it's basically the world through which the agent moves The environment takes the agent's current state and action as input, and it returns the agent's reward and its next state as output All right, next we have action Now all the possible steps that an agent can take are called actions So like I said, it can be moving right or left, or shooting, or any of that All right, then we have state Now state is basically the current condition returned by the environment So whichever state you are in, if you are in state 1 or if you're in state 2, that represents your current condition All right Next we have reward A reward is basically an instant return from the environment to appraise your last action Okay, so it can be anything
like coins, or it can be additional points So basically, a reward is given to an agent after it clears a specific stage Next we have policy A policy is basically the strategy that the agent uses to find out his next action based on his current state; policy is just the strategy with which you approach the game Then we have value Now, value is the expected long-term return with discount So value and action value can be a little bit confusing for you right now, but as we move further, you'll understand what I'm talking about Okay, so value is basically the long-term return that you get, with discount Okay, discount I'll explain in the further slides Then we have action value Now, action value is also known as Q-value Okay, it's very similar to value, except that it takes an extra parameter, which is the current action So basically, here you'll find out the Q-value depending on the particular action that you took All right, so guys, don't get confused between value and action value We'll look at examples in the further slides and you will understand this better Okay, so guys, make sure that you're familiar with these terms, because you'll be seeing a lot of these terms in the further slides All right Now, before we move any further, I'd like to discuss a few more concepts Okay, so first we will discuss reward maximization So if you haven't already realized it, the basic aim of the RL agent is to maximize the reward Now, how does that happen?
Let's try to understand this in depth So the agent must be trained in such a way that he takes the best action, so that the reward is maximum, because the end goal of reinforcement learning is to maximize your reward based on a set of actions So let me explain this with a small game Now, in the figure you can see there is a fox, there's some meat, and there's a tiger So our agent is basically the fox, and his end goal is to eat the maximum amount of meat before being eaten by the tiger Now, since the fox is a clever fellow, he eats the meat that is closer to him, rather than the meat which is closer to the tiger Now, this is because the closer he is to the tiger, the higher are his chances of getting killed So because of this, the rewards which are near the tiger, even if they are bigger meat chunks, will be discounted So this is exactly what discounting means So our agent is not going to eat the meat chunks which are closer to the tiger because of the risk All right, now, even though those meat chunks might be larger, he does not want to take the chance of getting killed Okay, this is called discounting Okay, this is where you discount: you improvise and you just eat the meat which is closer to you, instead of taking risks and eating the meat which is closer to your opponent All right Now, the discounting of rewards works based on a value called gamma We'll be discussing gamma in our further slides, but in short, the value of gamma is between 0 and 1 Okay, so the smaller the gamma, the larger is the discount Okay, so if the gamma value is smaller, it means that the agent is not going to explore, and he's not going to try and eat the meat chunks which are closer to the tiger Okay, but if the gamma value is closer to 1, it means that our agent is actually going to explore, and it's going to try and eat the meat chunks which are closer to the tiger All right, now, I'll be explaining this in depth in the further slides So don't worry if you haven't got a clear concept yet, but just
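The effect of gamma can be seen directly by computing a discounted return, the sum over t of gamma^t times r_t. This is a small illustrative calculation, not something from the session itself:

```python
def discounted_return(rewards, gamma):
    # G = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# a small meat chunk nearby now, a big chunk near the tiger three steps away
rewards = [1, 0, 0, 10]

print(discounted_return(rewards, 0.9))  # far reward still counts: 1 + 10*0.9**3, about 8.29
print(discounted_return(rewards, 0.1))  # far reward almost vanishes: 1 + 10*0.1**3, about 1.01
```

With gamma near 1 the distant, risky chunk dominates the return; with a small gamma it is discounted away, which is exactly the fox's behaviour described above.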
understand that reward maximization is a very important step when it comes to reinforcement learning, because the agent has to collect maximum rewards

And go back and forth through the different rooms and find the fastest route to the goal All right Now, let's look at an example Okay, let's see how the algorithm works Let's go back to the previous slide, and here it says that the first step is to set the gamma parameter Okay, so let's do that Now, the first step is to set the value of the learning parameter, which is gamma, and we have randomly set it to 0.8 Okay The next step is to initialize the matrix Q to 0 Okay, so we've set matrix Q to 0 over here, and then we will select the initial state Okay, the third step is to select a random initial state, and here we've selected the initial state as room number one Okay So after you initialize the matrix Q as a zero matrix, from room number one you can either go to room number three or room number five So if you look at the reward matrix, you can see that from room number one you can only go to room number three or room number five The other values are minus 1 here, which means that there is no link from 1 to 0, 1 to 1, 1 to 2, or 1 to 4 So the only possible actions from room number one are to go to room number 3 and to go to room number 5 All right Okay So let's select room number five, okay So from room number one, you can go to 3 and 5, and we have randomly selected five You can also select three, but for this example, let's select five over here Now from room five, you're going to calculate the maximum Q value for the next state based on all possible actions So from room number five, the next state can be room number one, four, or five So you're going to calculate the Q value for traversing 5 to 1, 5 to 4 and 5 to 5, and you're going to find out which has the maximum Q value, and that's how you're going to compute the Q value So let's implement our formula Okay, this is the Q-learning formula So right now we're traversing from room number one
to room number 5 Okay, this is our state So here I've written Q(1, 5) Okay, one represents our current state, which is room number one Okay, our initial state was room number one, and we are traversing to room number five Okay, it's shown in this figure, room number 5 Now for this we need to calculate the Q value Next, our formula says the reward matrix at the state and action So the reward matrix for (1, 5): let's look at (1, 5); (1, 5) corresponds to a hundred Okay, so our reward over here will be hundred, so R(1, 5) is basically hundred Then you're going to add the gamma value Now the gamma value we have initialized to 0.8, so that's what we have written over here, and we're going to multiply it with the maximum value that we're going to get for the next state based on all possible actions Okay So from 5, the next state is 1, 4 or 5 So if we traverse from five to one, that's what I've written over here; then 5 to 4, and you're going to calculate the Q value of 5 to 4 and 5 to 5 Okay, that's what I mentioned over here So Q(5, 1), Q(5, 4) and Q(5, 5) are the next possible actions that you can take from state 5 So R(1, 5) is hundred, okay, because from the reward matrix you can see that (1, 5) is hundred, and 0.8 is the value of gamma After that, we will calculate Q(5, 1), Q(5, 4) and Q(5, 5) Like I mentioned earlier, we're going to initialize matrix Q as a zero matrix, so that's why we're setting the value to 0, because initially, obviously, the agent doesn't have any memory of what is happening Okay, so he's just starting from scratch That's why all these values are 0: so Q(5, 1) will obviously be 0, Q(5, 4) will be 0, and Q(5, 5) will also be zero, and the maximum between these is obviously 0 So when you compute this equation, you will get hundred, so the Q value of (1, 5) is hundred So if our agent goes from room number one to room number five, he's going to have a maximum reward, or Q value, of hundred All right Now in the next slide you can see
that I've updated the value of Q(1, 5) Okay, we've set it to 100 All right, now similarly, let's look at another example so that you understand this better So guys, this is exactly what we're going to do in our demo; it's only going to be coded Okay, I'm just explaining our code right now, I'm just telling you the math behind it All right, now let's look at another example Okay, this time we'll start with a randomly chosen initial state Let's say that we've chosen state 3 Okay So from room 3, you can either go to room number one, two, or four Randomly, we'll select room number one, and from room number one, you're going to calculate the maximum Q value for the next state based on all possible actions So the possible actions from one are to go to 3 and to go to 5 Now we calculate the Q value using this formula, so let me explain this to you once again Now, (3, 1) basically represents that we're in room number three and we are going to room number one Okay, so this represents our action Okay, so we're going from 3 to 1, which is our action, and three is our current state Next we will look at the reward of going from 3 to 1 Okay, if you go to the reward matrix, (3, 1) is 0, okay Now this is because room number one is not the goal room, so there is no direct reward for moving into it Okay, so that's why the reward here is zero So the value here will be 0 After that we have the gamma value, which is 0.8, and then we're going to calculate the max of Q(1, 3) and Q(1, 5); out of these, whichever has the maximum value, we're going to use that Okay, so Q(1, 3) is 0 All right, 0, you can see here, (1, 3) is 0, and Q(1, 5), if you remember, we just calculated in the previous slide Okay, Q(1, 5) is hundred, so here I'm going to put a hundred So the maximum here is hundred So 0.8 into 100 will give us 80, so that's the Q value you're going to get if you traverse from three to one Okay, I hope that was clear So now we have traversed from room number three to room number one with a reward of 80 Okay,
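Those two hand-computed updates can be reproduced with NumPy. The reward matrix below is the standard 6-room layout this example assumes, with room 5 as the goal:

```python
import numpy as np

# reward matrix: -1 = no link, 0 = link, 100 = link into the goal room 5
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
Q = np.zeros((6, 6))
gamma = 0.8

def update(state, action):
    # Q(state, action) = R(state, action) + gamma * max over a of Q(action, a)
    Q[state, action] = R[state, action] + gamma * Q[action].max()
    return Q[state, action]

print(update(1, 5))  # R(1,5) = 100, Q[5] is still all zeros -> 100
print(update(3, 1))  # R(3,1) = 0, max of Q[1] is now 100 -> 0.8 * 100, about 80
```

The second update only yields 80 because the first one has already written 100 into Q(1, 5), which is exactly the "memory" the agent builds up slide by slide.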
but we still haven't reached the end goal, which is room number five So for our next episode, the state will be room number one So guys, like I said, we'll repeat this in a loop, because room number one is not our end goal Okay, our end goal is room number 5 So now we need to figure out how to get from room number one to room number 5 So from room number one, you can either go to three or five That's what I've drawn over here So if we select five, we know that it's our end goal Okay So from room number 5, then you have to calculate the maximum Q value for the next possible actions So the next possible actions from five are to go to room number one, room number four, or room number five So you're going to calculate the Q value of 5 to 1, 5 to 4 and 5 to 5, find out which is the maximum Q value here, and you're going to use that value All right So let's look at the formula now Now again, we're in room number one and we want to go to room number 5 Okay, so that's exactly what I've written here, Q(1, 5) Next is the reward matrix, so the reward R(1, 5), which is hundred All right, then we have added the gamma value, which is 0.8, and then we're going to find the maximum Q value from 5 to 1, 5 to 4 and 5 to 5 So this is what we're performing over here So Q(5, 1), Q(5, 4) and Q(5, 5) are all 0; this is because we initially set all the values of the Q matrix to 0 So you get hundred over here, and the matrix remains the same, because we had already calculated Q(1, 5), so the value of (1, 5) is already fed to the agent So when he comes back here, he knows, all right, okay, he's already done this before Now he's going to try and implement another method Okay, he's going to try to take another route, or another policy So he's going to try to go through different rooms and finally land up in room number 5 So guys, this is exactly how our code runs We're going to traverse through each and every node, because we want an optimum policy Okay, an optimum policy is attained only when you traverse
through all possible actions Okay So if you go through all possible actions that you can perform, only then will you understand which is the best action, which will lead us to the reward I hope this is clear Now, let's move on and look at our code So guys, this is our code, and it is executed in Python, and I'm assuming that all of you have a good background in Python Okay, if you don't understand Python very well, I'm going to leave a link in the description You can check out that video on Python and then maybe come back to this later Okay, but I'll be explaining the code to you anyway; I'm just not going to spend a lot of time explaining each and every line of code, because I'm assuming that you know Python Okay So let's look at the first line of code over here So what we're going to do is import numpy Okay, numpy is basically a Python library adding support for large multi-dimensional arrays and matrices, and it's basically for computing mathematical functions Okay, so first we want to import that After that, we're going to create the R matrix Okay, so this is the R matrix Next, we're going to create a Q matrix, and it's a 6-by-6 matrix, because obviously we have six states, starting from 0 to 5 Okay, and we are going to initialize the values to zero So basically, the Q matrix is going to be initialized to zero over here All right, after that, we're setting the gamma parameter to 0.8 So guys, you can play with this parameter and, you know, move it to 0.9 or lower it below 0.8 Okay, and you can see what happens then Then we'll set an initial state Okay, the initial state is set as 1 After that, we're defining a function called available_actions Okay So basically, what we're doing here is, since our initial state is one, we're going to check row number one Okay, this is row number one Okay, this is row number zero, this is row number one, and so on So we're going to check row number one, and we're going to find the values which are greater than or equal to 0,
because these values basically represent the nodes that we can travel to Now, the minus 1 values you cannot traverse to Okay, I explained this earlier: the minus one represents all the nodes that we cannot travel to, but we can travel to these other nodes Okay So basically, over here we are checking all the values which are equal to 0 or greater than 0; these will be our available actions So if our initial state is one, we can travel to other states whose value is equal to 0 or greater than 0, and this is stored in this variable called available_act All right, now this will basically get the available actions in the current state Okay, so we're just storing the possible actions in this available_act variable over here So basically, over here, since our initial state is one, we're going to find out the next possible states we can go to Okay, that is stored in the available_act variable Now, the next function chooses at random which action is to be performed within the range So if you remember, over here, so guys, initially we are in state number one Okay, our available actions are to go to room number 3 or room number 5 Okay Now randomly, we need to choose one room, so for that we're using this line of code, okay So here we are randomly going to choose one of the actions from available_act This available_act, like I said earlier, stores all our possible actions Okay, from the initial state Okay So once it chooses an action, it's going to store it in next_action, so guys, this will represent the next available action to take Now next is our Q matrix Remember this formula that we used So guys, this formula that we used is what we are going to calculate in the next few lines of code So this block of code is executing and computing the value of Q Okay, this is our formula for computing the value of Q: Q(current state, action) equals R(current state, action) plus gamma into the maximum value So here, basically, we're going to calculate the maximum index, meaning that we're going to
check which of the possible actions will give us the maximum Q value Right, if you remember, in our explanation over here, this value over here, max of Q(5, 1), Q(5, 4) and Q(5, 5): we had to choose the maximum Q value that we get from these three So basically, that's exactly what we're doing in this line of code, calculating the index which gives us the maximum value After we finish computing the value of Q, we'll just have to update our matrix After that, we'll be updating the Q value and will be choosing a new initial state Okay So this is the update function that is defined over here Okay, so I've just called the function over here So guys, this whole set of code will just calculate the Q value Okay, this is exactly what we did in our examples After that, we have the training phase So guys, remember: the more you train an algorithm, the better it's going to learn Okay, so over here I have provided around 10,000 iterations Okay So my range is 10,000 iterations, meaning that my agent will take 10,000 possible scenarios and go through 10,000 iterations to find out the best policy So here, exactly what I'm doing is: I'm choosing the current state randomly, after that I'm choosing an available action from the current state (so either I can go to state 3 or state 5), then I'm calculating the next action, and then I'm finally updating the value in the Q matrix And next, we just normalize the Q matrix So sometimes in our Q matrix the values might get large Okay, let's say they go to 500, 600; at that time we want to normalize the matrix Okay, we want to bring the values down a little bit Okay, because with larger numbers we won't be able to understand as easily, and computation would be very hard on larger numbers That's why we perform normalization: you're taking your calculated value, dividing it by the maximum Q value and multiplying by 100 All right, so you are normalizing it over here So guys, this is the testing phase Okay, here you will just randomly set a current state, and you won't give it any
other data, because you've already trained your model Okay, you're just going to give it a current state Then you're going to tell your agent: listen, you're in room number one, now you need to go to room number five Okay, so he has to figure out how to go to room number 5, because we have trained him now All right So here we have set the current state to one, and we need to make sure that it's not equal to 5, because 5 is the end goal So guys, this is the same loop that we executed earlier, so we're going to do the same iterations again Now if I run this entire code, let's look at the result So our current state here we've chosen as one Okay, and if we go back to our matrix, you can see that there is a direct link from 1 to 5, which means that the route the agent should take is one to five Okay, directly it should go from 1 to 5, because that way it will get the maximum reward Okay, let's see if that's happening So if I run this, it should give me a direct path from 1 to 5 Okay, and that's exactly what happened So this is the selected path: directly from one to five it went, and it calculated the entire Q matrix So guys, this is exactly how it works Now, let's try to set the initial state as, let's say, two So if I set the initial state as two, and if I try to run the code, let's see the path that it gives So the selected path is 2, 3, 4, 5 Now, it chose this path

This is done by associating the topmost priority location with a much higher reward than the usual ones So let's put 999 in the cell (6, 6) Now the table of rewards, with a higher reward for the topmost location, looks something like this We have now formally defined all the vital components for the solution we are aiming at for the problem discussed Now, we will shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning and Q-learning So first of all, we'll start with the Bellman equation Now consider the following square rooms, which is analogous to the actual
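The whole pipeline just walked through (R matrix, zero Q matrix, gamma 0.8, random training iterations, normalization, then a greedy test run) can be sketched end to end. The R matrix is the standard 6-room layout this example assumes; the path-length cap in the test loop is an added safety detail:

```python
import numpy as np

# reward matrix: -1 = no door, 0 = door, 100 = door into the goal room 5
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
Q = np.zeros((6, 6))
gamma = 0.8
rng = np.random.default_rng(0)

# training phase: many random (state, action) updates
for _ in range(10_000):
    state = rng.integers(0, 6)
    available_act = np.where(R[state] >= 0)[0]   # actions with a link
    next_action = rng.choice(available_act)
    Q[state, next_action] = R[state, next_action] + gamma * Q[next_action].max()

Q_normalized = Q / Q.max() * 100   # scale so the largest entry is 100

# testing phase: greedily follow the largest Q value until the goal
state, path = 2, [2]
while state != 5 and len(path) < 10:
    state = int(Q[state].argmax())
    path.append(state)
print(path)  # a route from room 2 to the goal room 5
```

Changing the starting room in the testing phase reproduces the behaviour described above: from room 1 the agent heads straight to 5, while from room 2 it has to pass through intermediate rooms first.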
environment from our original problem, but without the barriers. Now suppose a robot needs to go to the room marked in green from its current position A, using the specified directions. How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint which the robot will be able to follow: a constant value is specified in each of the rooms that will come along the robot's way if it follows the directions given above. In this way, if it starts at location A, it will be able to scan through these constant values and move accordingly. But this will only work if the directions are prefixed and the robot always starts at location A. Now consider the robot starting at this other location rather than its previous one. The robot now sees footprints in two different directions, and it is therefore unable to decide which way to go in order to reach the destination, which is the green room. This happens primarily because the robot does not have a way to remember the directions to proceed. So our job now is to enable the robot with a memory, and this is where the Bellman equation comes into play. The main purpose of the Bellman equation is to enable the robot with a memory; that's the thing we're going to use. The equation goes something like this: V(s) = max_a(R(s, a) + γV(s′)), where s is a particular state (a room), a is the action of moving between rooms, s′ is the state to which the robot goes from s, and γ (gamma) is the discount factor; we'll get into it in a moment. And obviously R(s, a) is the reward function, which takes a state s and an action a and outputs the reward, while V(s) is the value of being in a particular state, which is the footprint. We consider all the possible actions and take the one that yields the maximum value. There is one constraint, however, regarding the value footprint: the room marked in yellow, just below the green room, will
always have a value of 1, to denote that it is one of the nearest rooms adjacent to the green room. This also ensures that the robot gets a reward when it goes from the yellow room to the green room. Let's see how to make sense of the equation we have here. Let's assume a discount factor of 0.9; remember, gamma is the discount factor, so let's take 0.9. Now, for the room marked just below the yellow one, what will be V(s), the value of being in that state? For this room, V(s) = max_a(0 + 0.9 × 1), which gives us 0.9. Here the robot will not get any reward for going to the state marked in yellow, hence R(s, a) is 0, but the robot knows the value of being in the yellow room, hence V(s′) is 1. Following this for the other states, we get 0.9; then, if we put 0.9 back into this equation, we get 0.81, then 0.729, and then we reach the starting point again. So this is how the table looks with the value footprints computed from the Bellman equation. Now, a couple of things to notice here: the max function makes the robot always choose the state that gives it the maximum value of being in that state, and the discount factor gamma notifies the robot about how far it is from the destination. This is typically specified by the developer of the algorithm that would be installed in the robot. The other states can also be given their respective values in a similar way: as you can see here, the boxes adjacent to the green one have 1, and as we move away from 1 we get 0.9, 0.81, 0.729, and finally we reach 0.6561. Now the robot can proceed on its way to the green room utilizing these value footprints, even if it's dropped in any arbitrary room in the given location. Now, if the robot lands up in the highlighted sky-blue
area, it will still find two options to choose from, but eventually either of the paths will be good enough for the robot to take, because of the way the value footprints are now laid out. One thing to note is that the Bellman equation is one of the key equations in the world of reinforcement learning and Q-learning. So, if we think realistically, our surroundings do not always work the way we expect; there is always a bit of stochasticity involved in them. This applies to the robot as well: sometimes its machinery might get corrupted, sometimes it may come across some hindrance on its way which was not known to it beforehand, and sometimes, even if the robot knows that it needs to take the right turn, it will not. So how do we introduce this stochasticity in our case? Here comes the Markov decision process. Consider the robot currently in the red room, needing to go to the green room. Let's now say the robot has a slight chance of dysfunctioning, and might take the left, the right, or the bottom turn instead of taking the upper turn that gets it to the green room from where it is now (the red room). The question is: how do we enable the robot to handle this when it is out in the given environment? This is a situation where the decision making regarding which turn to take is partly random and partly under the control of the robot: partly random because we are not sure when exactly the robot might dysfunction, and partly under the control of the robot because it is still making the decision to take a turn on its own, with the help of the program embedded into it. So: a Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where the outcomes are partly random and partly under the control of the decision maker. Now we need to give this concept a mathematical shape, most likely an
equation, which can then be taken further. Now, you might be surprised that we can do this with the help of the Bellman equation with a few minor tweaks. If we look at the original Bellman equation, V(s) = max_a(R(s, a) + γV(s′)), what needs to be changed in it so that we can introduce some amount of randomness? As long as we are not sure when the robot might fail to take the expected turn, we are also not sure which room it might end up in, which is nothing but the room it moves to from its current room. At this point, according to the equation, we are not sure of s′, the next state (or room), but we do know all the probable turns the robot might take. In order to incorporate each of these probabilities into the equation, we need to associate a probability with each of the turns, to quantify the chance of the robot taking that turn. If we do so, we get V(s) = max_a(R(s, a) + γ Σ_s′ P(s, a, s′) V(s′)), where P(s, a, s′) is the probability of moving from room s to room s′ with action a, and the summation is the expectation over the randomness that the robot incurs. Now let's take a look at this example. When we associate the probabilities with each of these turns, we essentially mean that there is an 80% chance that the robot will take the upper turn. If we put all the required values into our equation, we get V(s) = max_a(R(s, a) + γ(0.8 × V(room up) + 0.1 × V(room down) + 0.03 × V(room left) + 0.03 × V(room right))). Note that the value footprints will not change just because we are incorporating stochasticity here, but this time we will not calculate those value footprints ourselves; instead, we will let the robot figure them out. Now, up until this point we have not considered
rewarding the robot for its action of going into a particular room; we are only rewarding the robot when it gets to the destination. Ideally, there should be a reward for each action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but it is much better to have some amount of reward for the actions than to have no rewards at all, right? This idea is known as the living penalty. In reality, the reward system can be very complex, and in particular modeling sparse rewards is an active area of research in the domain of reinforcement learning. So now that we have this equation, what we'll do is transition to Q-learning. This equation gives us the value of going to a particular state, taking the stochasticity of the environment into account. We have also learned, very briefly, about the idea of the living penalty, which deals with associating each move of the robot with a reward. So Q-learning poses the idea of assessing the quality of an action that is taken to move to a state, rather than determining the possible value of the state being moved to. Earlier we had 0.8 × V(s1), 0.1 × V(s2), 0.03 × V(s3) and so on. Now, if we incorporate the idea of assessing the quality of the action for moving to a certain state, the environment with the agent and the quality of the actions will look something like this: instead of 0.8 × V(s1) we'll have Q(s1, a1), then Q(s2, a2), Q(s3, a3). The robot now has four different states to choose from, and along with that there are four different actions for the current state it is in. So how do we calculate Q(s, a), that is, the cumulative quality of the possible actions the robot might take? Let's break it down. From the equation V(s) = max_a(R(s, a) + γ Σ_s′ P(s, a, s′) V(s′)), if we discard the max function, we are left with R(s, a) plus
γ Σ_s′ P(s, a, s′) V(s′). Essentially, in the equation that produces V(s) we are considering all possible actions and all possible states from the current state the robot is in, and then taking the maximum value caused by taking a certain action; without the max, the equation produces a value footprint for just one possible action. In fact, we can think of it as the quality of that action: Q(s, a) = R(s, a) + γ Σ_s′ P(s, a, s′) V(s′). Now that we have got an equation to quantify the quality of a particular action, we are going to make a little adjustment to it. We can now say that V(s) is the maximum of all the possible values of Q(s, a), right? So let's utilize this fact and replace V(s′) as a function of Q: Q(s, a) = R(s, a) + γ Σ_s′ P(s, a, s′) max_a′ Q(s′, a′). So the equation of V has now turned into an equation of Q, which is the quality. But why would we do that? This is done to ease our calculations, because now we have only one function, Q, which is also the core of Q-learning; we have only one function Q to calculate, and R(s, a) is a quantified metric which produces the reward for moving to a certain state. The qualities of the actions are called the Q values, and from now on we will refer to the value footprints as the Q values. An important piece of the puzzle is the temporal difference. Temporal difference is the component that will help the robot calculate the Q values with respect to the changes in the environment over time. So consider that our robot is currently in the marked state and it wants to move to the upper state. One thing to note here is that the robot already knows the Q value of taking the action of moving to the upper state, and we know that the environment is stochastic in nature, so the reward that the robot gets after moving to the upper state might be different from an earlier observation. So
how do we capture this change, the temporal difference? We calculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it: TD(s, a) = (R(s, a) + γ max_a′ Q(s′, a′)) − Q(s, a). The equation we just derived gives the temporal difference in the Q values, which further helps to capture the random changes the environment may impose. The new Q(s, a) is then updated as follows: Q_t(s, a) = Q_{t−1}(s, a) + α · TD_t(s, a). Here α (alpha) is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment; Q_t(s, a) is the current Q value and Q_{t−1}(s, a) is the previously recorded Q value. If we replace TD(s, a) with its full-form equation, we get Q_t(s, a) = Q_{t−1}(s, a) + α(R(s, a) + γ max_a′ Q(s′, a′) − Q_{t−1}(s, a)). Now that we have all the little pieces of Q-learning together, let's move forward to the implementation part. This is the final equation of Q-learning, right? So let's see how we can implement it and obtain the best path for any robot to take. To implement the algorithm, we need to understand the warehouse environment and how it can be mapped to different states. So let's start by recollecting the sample environment: as you can see here, we have L1, L2, L3 and so on up to L9, and, as you can see, we have certain barriers as well. So, first of all, let's map each of these locations in the warehouse to numbers, or states, so that it eases our calculations, right?
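Before moving to the notebook, the update rule just derived can be sketched in a few lines of Python. This is a minimal hypothetical example, not the presenter's code: α = 0.9 and γ = 0.75 match the parameters used in the implementation that follows, while the sample reward and next-state Q value are made-up numbers.

```python
alpha, gamma = 0.9, 0.75   # learning rate and discount factor

def td_update(q_sa, reward, max_q_next):
    # Temporal difference: TD = R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)
    td = reward + gamma * max_q_next - q_sa
    # Q_t(s,a) = Q_{t-1}(s,a) + alpha * TD
    return q_sa + alpha * td

# Hypothetical numbers: an unvisited Q value (0), a move reward of 1,
# and a best next-state Q value of 2.
q = td_update(0.0, reward=1.0, max_q_next=2.0)
print(q)  # 2.25
```

With α = 1 this would jump straight to the Bellman target R + γ max Q; a smaller α blends the new observation with the old estimate, which is what lets the robot absorb random changes gradually.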
So what I'm going to do is create a new Python 3 file in the Jupyter notebook, and I'll name it Q-learning. Okay, so let's define the states. But before that, we need to import NumPy, because we're going to use NumPy for this purpose, and let's initialize the parameters, that is, gamma and alpha. So gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next, we define the states and map them to numbers: as I mentioned earlier, L1 is 0, and so on; we have defined all the states in numerical form. The next step is to define the actions which, as mentioned above, represent the transitions to the next states. So as you can see here, we have an array of actions from 0 to 8. Now we define the reward table; as you can see, it is the same matrix that I showed you just now. If you look at it carefully, it encodes the barrier limitations depicted in the image: allowed transitions get a reward of 1, whereas a transition that is not allowed gets a reward of 0 to discourage that path (in tougher situations, what we do is put a −1 there so that it gets a negative reward). So in the above code snippet, as you can see, for each state we put ones for the states that are directly reachable from it. If you refer once again to the reward table we created above, this construction will be easy to understand. But one thing to note here is that we did not consider the top-priority location L6 yet. We would also need an inverse mapping from the states back to their original locations; it will make things cleaner when we reach the deeper parts of the algorithm. So for that, we build the inverse mapping, state_to_location: we take the distinct states and locations and convert them back. Now, what we'll do is define a function, get_optimal_route, which
will take a start location and an end location. Don't worry that the code is big; I'll explain each and every bit of it. So the get_optimal_route function will take two arguments, the starting location in the warehouse and the end location in the warehouse, respectively, and it will return the optimal route for reaching the end location from the starting location, in the form of an ordered list containing the letters. We'll start the function by initializing the Q values to all zeros, but before that, we need to copy the reward matrix to a new one; this is rewards_new. Next, we get the ending state corresponding to the ending location, and with this information the function will automatically set the priority of the given ending state to the highest one: we do not define it up front, but the function automatically sets the priority of the given ending state to 999. So we initialize the Q values to 0, and in the learning process, as you can see here, we take i in range(1000) and pick up a state randomly, using np.random.randint. For traversing through the neighbouring locations in the maze, we iterate through the new reward matrix and get the actions which are greater than 0; after that, we pick an action randomly from the list of playable actions, which leads us to the next state. We then compute the temporal difference, TD, which is the reward plus gamma times the Q value of the next state (taking np.argmax over Q of the next state) minus Q of the current state. We then update the Q values using the Bellman equation; as you can see here, we have the Bellman equation and we update the Q values with it. After that, we initialize the optimal route with the starting location. Now, here we do not know the next location yet,
so we initialize it with the value of the starting location. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration. In it, we fetch the starting state, fetch the highest Q value pertaining to the starting state, and get the index of the next state; but we need the corresponding letter, so we use the state_to_location mapping we just mentioned. After that, we update the starting location for the next iteration, and finally we return the route. So let's take a starting location of L9 and an end location of L1 and see what path we actually get. As you can see here, we get L9, L8, L5, L2, L1, and if you have a look at the image, starting from L9 and going L8, L5, L2, L1 is indeed the path with the maximum reward for the robot. So now we have come to the end of this Q-learning session, and I hope you got to know what exactly Q-learning is, with the analogy all the way from the number of rooms. I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make small changes to the Bellman equation, how to create the reward table and the Q table, how to update the Q values using the Bellman equation, and what alpha and gamma do.
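Putting the whole walkthrough together, here is a self-contained sketch of the described implementation. It is an assumption-laden reconstruction, not the presenter's verbatim notebook: the connectivity in the rewards matrix is reconstructed from the described warehouse layout, and the names get_optimal_route, location_to_state and state_to_location mirror the ones used in the walkthrough.

```python
import numpy as np

gamma, alpha = 0.75, 0.9   # discount factor and learning rate from the walkthrough

# Locations L1..L9 mapped to states 0..8, plus the inverse mapping.
location_to_state = {f"L{i + 1}": i for i in range(9)}
state_to_location = {state: loc for loc, state in location_to_state.items()}

# Reward matrix: 1 where a direct move between two locations is possible, else 0.
# The exact connectivity here is an assumption reconstructed from the layout.
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
])

def get_optimal_route(start_location, end_location, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    rewards_new = rewards.copy()
    end_state = location_to_state[end_location]
    rewards_new[end_state, end_state] = 999     # top priority for the goal cell

    Q = np.zeros((9, 9))
    for _ in range(n_iters):
        # Pick a random state, then a random playable action from it.
        state = rng.integers(0, 9)
        playable = np.where(rewards_new[state] > 0)[0]
        next_state = int(rng.choice(playable))
        # Temporal difference, then the Q-learning (Bellman) update.
        td = (rewards_new[state, next_state]
              + gamma * Q[next_state].max() - Q[state, next_state])
        Q[state, next_state] += alpha * td

    # Greedy walk from the start to the end state using the learned Q values.
    route = [start_location]
    state = location_to_state[start_location]
    while state != end_state:
        state = int(np.argmax(Q[state]))
        route.append(state_to_location[state])
    return route

print(get_optimal_route("L9", "L1"))
```

Under this assumed connectivity, the greedy walk reproduces the route reported in the session, L9 → L8 → L5 → L2 → L1; seeding the random generator just makes the sketch reproducible.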
