
inches and their width in inches, maybe you don't want to treat them equally, right? There's a lot more variance in height than in width; or maybe there is, and maybe there isn't.

So here on the left we don't have any scaling, and we see a very natural clustering. On the other hand, we notice on the y-axis the values range from not too far from 0 to not too far from 1, whereas on the x-axis the dynamic range is much less, from not too far from 0 to not too far from 1/2. So we have twice the dynamic range on one axis that we have on the other. Therefore, not surprisingly, when we end up doing the clustering, width plays a very important role, and we end up clustering it this way, dividing it along here. On the other hand, if I take exactly the same data and scale it, so that now the x-axis runs from 0 to 1/2 and the y-axis, roughly again, from 0 to 1, we see that suddenly, when we look at it geometrically, we end up getting a very different clustering.

What's the moral? The moral is that you have to think hard about how to select your features and how to scale them, because it can have a dramatic influence on your answer. We'll see some real-life examples of this shortly. But these are all important things to think about, and they all, in some sense, tie up into the same major point: whenever you're doing any kind of learning, including clustering, feature selection and scaling are critical. That is where most of the thinking ends up going; the rest gets to be fairly mechanical. How do we decide which features to use and how to scale them? We do that using domain knowledge. We actually have to think about the objects we're trying to learn about and what the objective of the learning process is.

So continuing, how do we do the scaling?
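The effect described here can be sketched with a toy example. All the points and numbers below are invented for illustration; the point is only that rescaling one feature can change which points end up nearest each other, and therefore how they cluster.

```python
# Toy illustration: feature vectors are (height, width). Height spans
# roughly 0..1 but width only 0..0.5, so raw distances are dominated
# by height differences.
def euclidean(v, w):
    return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5

a = (0.5, 0.10)
b = (0.5, 0.45)   # same height as a, very different width
c = (0.9, 0.10)   # same width as a, very different height

# Before scaling, a is nearer to b (width differences count for less)
print(euclidean(a, b), euclidean(a, c))

# Rescale width so both features span 0..1; now a is nearer to c
def rescale(p):
    return (p[0], p[1] * 2)

print(euclidean(rescale(a), rescale(b)), euclidean(rescale(a), rescale(c)))
```

Same data, different scaling, different nearest neighbor: exactly the kind of flip that changes the clusters you get.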
Most of the time, it's done using some variant of what's called the Minkowski metric. It's not nearly as imposing as it looks. We have the distance between two vectors, X1 and X2, and we use p to talk about, essentially, the power we're going to be using. We take the absolute difference between each element of X1 and X2, raise it to the p-th power, sum them, and then take the result to the power 1 over p. Not very complicated.

So let's say p is 2. That's the one you're probably most familiar with: effectively, all we're doing is getting the Euclidean distance. When we looked at the distance between our measured data and our predicted data, we used the mean square error. That's essentially the Minkowski distance with p equal to 2. That's probably the most commonly used, but an almost equally common choice sets p equal to 1, and that's something called the Manhattan distance. I suspect at least some of you have spent time walking around Manhattan, a small but densely populated island in New York. Midtown Manhattan has the feature that it's laid out in a grid. So what you have is a grid, and you have the avenues
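In symbols, dist(X1, X2, p) = (Σ_i |X1_i − X2_i|^p)^(1/p). A minimal sketch in Python (the function name is my own, not the lecture's code):

```python
def minkowski_dist(v1, v2, p):
    """Minkowski distance between two equal-length vectors.
    p = 1 gives the Manhattan distance; p = 2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)

print(minkowski_dist([0, 0], [3, 4], 1))  # 7.0 (Manhattan: walk 3 over, 4 up)
print(minkowski_dist([0, 0], [3, 4], 2))  # 5.0 (Euclidean: straight line)
```

The same two points are 7 blocks apart on the grid but only 5 units apart as the crow flies, which is exactly the difference between the two metrics discussed next.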

running north-south and the streets running east-west. And if you want to walk from, say, here to here, or drive from here to here, you cannot take the diagonal, because there are a bunch of buildings in the way. You have to move either left or right, or up or down. That's the Manhattan distance between two points. This is used, in fact, for a lot of problems; typically, when somebody is comparing the distance between two genes, for example, they use a Manhattan metric rather than a Euclidean metric to say how similar two things are. I just wanted to show that, because it is something you will run across in the literature when you read about these kinds of things.

All right. So far, we've talked about cases where things are comparable, and we've been doing that by representing each element of the feature vector as a floating-point number, so we can run a formula like that, subtracting one from the other. But we often, in fact, have to deal with nominal categories, things that have names rather than numbers. So for clustering people, maybe we care about eye color: blue, brown, gray, green. Or hair color. Well, how do you compare blue to green? Do you subtract one from the other? Kind of hard to do. What does it mean to subtract green from blue?
Well, I guess we could talk about it in the frequency domain of light. Typically, what we have to do in that case is convert them to numbers, and then have some way to relate the numbers. Again, this is a place where domain knowledge is critical. So, for example, we might convert blue to 0, green to 0.5, and brown to 1, thus indicating that we think blue eyes are closer to green eyes than they are to brown eyes. I don't know why we think that, but maybe we do. Red hair is closer to blonde hair than it is to black hair? I don't know. These are the sorts of things that are not mathematical questions, typically, but judgments that people have to make.

Once we've converted things to numbers, we then have to go back to our old friend scaling, which is often called normalization. Very often we try to contrive to have every feature range between 0 and 1, for example, so that everything is normalized to the same dynamic range, and then we can compare. Is that the right thing to do? Not necessarily, because you might consider some features more important than others and want to give them a greater weight. And, again, that's something we'll come back to and look at.

All this is a bit abstract, so I now want to look at an example: clustering mammals. There are, essentially, an unbounded number of features you could use: size at birth, gestation period, lifespan, length of tail, speed, eating habits, you name it. The choice of features and weighting will, of course, have an enormous impact on what clusters you get. If you choose size, humans might appear in one cluster; if you choose eating habits, they might appear in another. How should you choose which features you want? You have to begin by thinking about the reason
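Both ideas, mapping nominal values to numbers and normalizing numeric features to [0, 1], can be sketched in a few lines. The particular eye-color values below are illustrative domain judgments, not facts, and the height data is invented:

```python
# Nominal features mapped to numbers: a domain judgment, not mathematics.
# These particular values just encode "blue is closer to green than to brown."
eye_color = {'blue': 0.0, 'green': 0.5, 'brown': 1.0}

def normalize(values):
    """Linearly rescale a list of numbers so they span [0, 1]
    (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# E.g., heights in cm, squeezed onto the same 0-to-1 range as eye color
print(normalize([150, 160, 175, 200]))  # [0.0, 0.2, 0.5, 1.0]
```

Once every feature lives on the same 0-to-1 scale, a distance metric treats them all equally; whether they *should* be treated equally is, again, a domain question.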

details of the code. And then I have a distance metric; for the moment I'm just using simple Euclidean distance.

The next element in my hierarchy (not yet a hierarchy, it's still flat) is a cluster. At some abstract level, a cluster is just a set of points, the points that are in the cluster. But I've got some other operations on it that will be useful. I can compute the distance between two clusters, and as you'll see, I have single linkage, max linkage, and average linkage, the three I talked about last week. There's also this notion of a centroid; we'll come back to that when we get to k-means clustering, so we don't need to worry right now about what that is.

Then I'm going to have a cluster set. That's another useful data abstraction, and it's what you might guess from its name: just a set of clusters. The most interesting operation there is merge. As you saw when we looked at hierarchical clustering last week, the key step is merging two clusters. In doing that, I'm going to have a function called findClosest, which, given a metric and a cluster, finds the cluster most similar to self, because, as you will again recall from hierarchical clustering, what I merge at each step is the two most similar clusters. And then there are some details about how it works, which, again, we don't need to worry about at the moment.

Then I'm going to have a subclass of Point called Mammal, in which I represent each mammal by its dentition, as we've looked at before. Then, pretty simply, we can do a bunch of things with it. Before we look at the other details of the code, I want to run it and see what we get. I'm just going to use hierarchical clustering to cluster the mammals based upon this feature vector, which will be a list of numbers showing how many of each kind of tooth the mammal has. Let's see what we get.

So it's doing the merging. We can see that in the first step it merged beavers with groundhogs, and it merged grey squirrels with porcupines, and wolves with bears. Various other kinds of things, like jaguars and cougars, were a lot alike. Eventually, it starts doing more complicated merges. It merges a cluster containing only the river otter with one containing a marten and a wolverine, beavers and groundhogs with squirrels and porcupines, et cetera. And at the end, I had it stop with two clusters, and it came up with these clusters.

Now we can look at these clusters and ask: what do we think? Have we learned anything interesting? Do we think it makes sense? Remember, our goal was to cluster mammals based upon what they might eat, and we can ask, do we think this corresponds to that? No. All right, somebody said no. Now, why no? Go ahead.

AUDIENCE: We've got, like, a deer, which doesn't eat similar things to a dog. And we've got one type of bat in the top cluster and a different kind of bat in the bottom cluster. Seems like they would be even closer together.

PROFESSOR: Well, sorry. Yeah. A deer doesn't eat what a dog eats, and for that matter, we have humans here, and while some humans are by choice vegetarians, genetically, humans are essentially carnivores. We know that; we eat meat. And here we are with a bunch of herbivores, typically. Things are strange. By the way, bats might end up in either cluster, because some bats eat fruit and other bats eat insects, but who knows? So I'm not very happy.

Why do you think we got this clustering that maybe isn't helping us very much? Well, let's go look at what we did here. Let's look at test 0. So I said I wanted two clusters, I don't want it to print all the steps along the way, I'm going to print the history at the end, and scaling is identity.

Well, let's go back and look at some of the data. What we can see, or maybe we can't see too quickly, looking at all this, is that some kinds of teeth have a relatively small range, and other kinds of teeth have a big range. At the moment we're not doing any normalization, and maybe we're getting something distorted, where we're mostly looking at a certain kind of tooth because it has a larger dynamic range.

And in fact, if we look at the code, we can go back up and look at buildMammalPoints and readMammalData. So buildMammalPoints calls readMammalData and then builds the points, so readMammalData is the interesting piece. What we can see here is, as we read it in (this is just simply reading things in, ignoring comments, keeping track of things), when we come down here, I might do some scaling. So Point's scaleFeatures is using the scaling argument. Where's that coming from?
If we look at mammalTeeth, here in the Mammal class, we see that there are two ways to scale it. One is identity, where we just multiply every element in the vector by 1; that doesn't change anything. The other is what I've called 1/max. Here, I've looked at the maximum number of each kind of tooth, and I'm dividing 1 by that. So we could have up to three of one kind of tooth, four of another, six of this kind, whatever it is. By dividing by the max, I'm putting all the different kinds of teeth on the same scale: I'm normalizing. And now we'll see, does that make a difference? Well, since we're dividing by 6 here and by 3 here, it certainly could make a difference; that's a significant scaling factor, 2x. So let's go and change the code, or change the test, and look at test 0 (0, not "O") with scale set to 1/max.
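The two scalings just described can be sketched as scale vectors applied elementwise to a feature vector. The particular maximum counts (3, 4, 6) and the sample tooth vector below are illustrative, not the lecture's actual data file:

```python
def scale_features(vec, scales):
    """Multiply each feature by its corresponding scale factor."""
    return [v * s for v, s in zip(vec, scales)]

# "identity": multiply every element by 1 -- changes nothing
identity = [1.0, 1.0, 1.0]
# "1/max": divide each kind of tooth by its maximum count over all mammals
one_over_max = [1.0 / 3, 1.0 / 4, 1.0 / 6]

teeth = [3, 1, 6]                           # one mammal's tooth counts
print(scale_features(teeth, identity))      # unchanged: [3.0, 1.0, 6.0]
print(scale_features(teeth, one_over_max))  # every feature now lies in [0, 1]
```

After the 1/max scaling, a tooth type that can run up to 6 no longer contributes twice the dynamic range of one that tops out at 3, which is exactly the distortion the identity scaling was causing.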