[MUSIC PLAYING] CRAIG: Welcome to Talks at Google I’m very pleased to offer a discussion of the book “Meltdown.” Has any complex system that you work with ever failed? This discussion today by Chris Clearfield and Andras Tilcsik will tell you a little bit about why these complex systems fail And they’re going to offer some suggestions on how to fix it I’d like to welcome our authors Chris is a former derivatives trader, licensed commercial pilot, and he tells me he’s also a science geek And Andras holds the Canada Research Chair in Strategy, Organizations, and Society at the University of Toronto’s Rotman School of Management They’ve written a great book I’ve read it, and I hope you do, too Welcome, Chris and Andras [APPLAUSE] CHRIS CLEARFIELD: Thanks very much, Craig So before we dive into the book, we’d like to talk a little bit about how we got interested in this topic But we also wanted to thank you all Although probably not everyone here works on the Google Apps suite, this book would not have been written if it were not for Google Docs So you can pass that on to your colleagues Andras is in Toronto I’m in Seattle And we had many, many a Google Hangout with a Google Doc open where we were kind of pounding away So thank you for that because we wouldn’t be here otherwise So as Craig said, this book is about failure And it’s about how failure happens in complex systems And we just wanted to talk a little bit about how we got onto this topic and how we started thinking about it And then, go through– kind of just skim the surface of a little bit of our research, so that we can have a conversation about it Because I think you guys more than many, many organizations really think carefully about this stuff and are already working with complex systems And many times, working with them in a really effective way So our hope is that what we’re contributing is a perspective And that we can use that to start a conversation So Andras and I got into this research kind of 
from different areas As Craig said, I was a derivatives trader I traded during the financial crisis And so I saw different organizations that managed that process very well and some that managed it very poorly and went bankrupt And a lot of people saw the impacts of that And so at the same time, I was learning how to fly I was starting the process of becoming a pilot, and really interested in the lessons that aviation kind of had learned about how to manage these complex and interconnected systems And Andras in the meantime, was finishing his PhD in sociology, and looking at how organizations made decisions And it was really after the BP Deepwater Horizon oil spill that we kind of came together and started thinking about this a little bit more systematically And just started to try to identify kind of two– answer two big questions Which is 1, why are these kind of failures happening more and more? And 2, what are capable organizations doing about it? And kind of, what can we learn from that? So when we’re talking about big failures, today we’re going to start in what seems like an unlikely place, which is the festive holiday season So this example, it’s one of our favorite examples in the book It’s a social media campaign that Starbucks was running They had this hashtag, #spreadthecheer, right? And I mean, it seems so warm and fuzzy They wanted tweets kind of like this, “I love Starbucks gingerbread lattes #spreadthecheer.” And as part of this campaign, they had sponsored a big projection screen at an ice skating rink in a museum in London And so the idea was they would have this #spreadthecheer And then, those tweets would come up on this screen Yeah, you guys see where this is going People would enjoy it And listen, Starbucks is a sophisticated company, right? 
I mean, they think very, very much about their marketing And they had a content filter for these tweets, but the content filter broke And so early on, you started getting stuff like this, “I like buying coffee that tastes nice from a shop that pays its tax So I avoid @starbucks #spreadthecheer.” And Starbucks was enmeshed in this kind of tax controversy in the UK at the time Sorry Then, you saw, “#spreadthecheer, pay your tax bills, parasites.” And then, a couple of people who were a little less subtle in their critique And what happened was that people started tweeting about the fact that this was happening And then, we saw this kind of snowball, right?
So we saw this avalanche effect that we see in social media where tweet after tweet after tweet started happening And what’s interesting is even after they fixed the moderation filter, I mean, the cat was out of the bag, right? The problem was already happening And you know, Twitter was abuzz with this It also jumped to the traditional media Papers covered it So it was a PR meltdown And nobody died in it, but it’s a really kind of interesting place for our story to start because this kind of meltdown is not unique anymore We have the story of Knight Capital, a pretty well-known, at least in Wall Street circles, trader, who– basically, because of a DevOps error– accidentally didn’t roll out a piece of software on one of their eight servers They lost $500 million in half an hour And that’s a lot, even for Wall Street A lot of us have had this experience where we’ve seen or read about or even been in a situation where flights are delayed Not because there’s a problem with the planes or the pilots, but because there’s a problem with the airline reservation system There’s been a fire in a data center somewhere Or in this case, somebody accidentally pulled the power cord out at BA, and then the whole network goes down I’m from Seattle In Washington state, there was this amazing example where in the state’s prisoner management system, the Department of Corrections was releasing thousands of felons early for a decade because they had a bug in their code that miscalculated sentences and they didn’t know about it And even after they were told about it, it took them three years to fix it We all saw this happen in Hawaii, where a pretty bad user interface meant that this alert went out that a ballistic missile was inbound And then beyond that, they hadn’t designed for failure, right? 
So they couldn’t pull back the message They didn’t have the infrastructure to do that And so what all these things have in common on the surface is that they are all examples of small problems snowballing into these really big failures And that’s what we see over and over again And that’s kind of one of the things that unifies these sorts of failures that we looked at in the book And when we dug past the headlines, what we found was that there’s actually some pretty compelling research that describes kind of not the precise failures that are going to happen But in general, the dynamics that lead to these kinds of failures So these kinds of failures are more likely to happen in systems that are complex and tightly coupled And I’m going to talk a little bit about what those mean But when a system is complex and tightly coupled, it’s in what we think of as the danger zone It’s in this area where these failures are more likely to happen And when they happen, they are more likely to spiral out of control So from our definition of complexity, complexity means that the system is connected There’s a lot of connectivity It looks a lot like a web And it’s hard to understand what’s going on inside it So if you think about that could be because of computer code or it could be because it’s something like Deepwater Horizon, where the action, if you will, is happening miles under the surface of the ocean and miles under the surface of the earth And tight coupling is this idea that once the failure starts to happen, there’s not a lot of slack in our system to correct it So it moves faster than we can move And we can’t kind of put the genie back in the bottle Once the error starts happening, it’s hard to recover from And when we have a system that’s complex and tightly coupled, we tend to be in the danger zone Now, this is research that some of you might recognize It was done by a guy called Charles Perrow in the 1980s And Perrow’s 
research really stemmed out of the Three Mile Island nuclear meltdown And what he saw when he did this research initially is that there really weren’t that many systems in this danger zone It was kind of like nuclear power and like big kind of space/aerospace systems But what we have seen in the research for our book is that so many more systems are in this danger zone today Whether we’re talking about finance, or health care, or transportation, or even the kind of gadgets that we bring into our lives Whether we’re talking about our homes or internet-connected cars And so that’s kind of the setup And that’s sort of the bad news But the book is actually deeply optimistic And we’re going to have time to talk about a couple of solutions today And then, kick off a conversation So Andras is going to start with our first solution ANDRAS TILCSIK: Yeah, thanks, Chris And as Chris said, we don’t have time to go into all the solutions we discussed in this book, but we do have time to give you a few tidbits So let me start with this thing These sentences were written on a blog by a [INAUDIBLE] gentleman about five or six years ago He says, “She’s unbelievably beautiful to look at She stands extremely tall She’s the most beautiful.” You look at these, you think he’s probably writing about a supermodel who is also very tall and very pretty But it turns out he’s actually writing about an airplane
And he’s a KLM pilot, and he really loves the Airbus A330 If you read the whole blog post, there are like these 10 points in there It almost reads like a love letter to the plane And what he loves about it the most is that it has this very sleek, elegant, streamlined design, both on the outside and on the inside in the cockpit If you look at the cockpit design here, one of the things that this pilot loved the most is these small, little side-stick controllers They are like little joysticks that the pilots use to control the plane And he loves them because they are small, elegant They are out of the way They even leave space for a tray table It’s a French plane, so lunch is very important The pilots get to pull it out They have their lunch It’s very neat It doesn’t obscure the instrument panel It’s set aside And they are also both fully computerized So it looks like great design And if you compare that to the design of a Boeing 737 Similar plane in many other ways, but the setup of the controls is very different Here, instead of the elegant, little controls, we have these big W-shaped control yokes They stand on these control columns that are about 3 feet tall In fact, these things are so big and bulky that the seats of the pilots need to be cut out in the front They are sort of split over here to make space for them And these things are not fully computerized In fact, the two controls are mechanically, physically connected So if I pull mine back, it’s also going to go back on Chris’s side And in fact, if I pull back too hard and too fast, it probably hits him in the knee and makes him spill his lunch on his shirt, which is actually something that pilots complain about all the time So if you look at this, it’s big It’s bulky It’s oversized And it makes you spill your lunch on your shirt It looks like terrible design But it turns out that there’s something really helpful and something really beautiful about these ugly designs Or at least this particular ugly design 
And it’s that it makes everything the other pilot does immediately visible It’s literally in your face So if I’m pulling back or pushing forward, it’s very easy for Chris to see what’s happening and what I’m doing And in fact, with the side-stick controls, we’ve already seen a number of accidents– two or three in just the past few years– where one of the pilots on one of those Airbus planes got confused in the heat of the moment They were trying to manage a crisis Instead of pushing forward to avoid a stall, they were pulling back on the control And the other pilot, the more experienced pilot, could have caught that error and could have intervened, but they just didn’t see it because the side-stick control is to the side And that would never happen with these giant, ugly, bulky controls, which are literally in your face and probably hitting you in the knee and the belly And we actually see this sort of principle over and over again in our research for the book This idea that we often privilege elegance and love the sleek and shiny designs But often, those designs also make things less transparent And sometimes, that kind of transparency is really valuable And what we found is– what’s very interesting is that we see this principle not just in the design of physical systems We see it with airplanes and car design But we see it even in the design of financial or accounting systems Or to give you one more example that’s not technological People might remember this from last year, the Oscars mix-up in 2017 when the wrong film was announced CHRIS CLEARFIELD: “La La Land.” ANDRAS TILCSIK: It’s not “La La Land,” I’m sorry CHRIS CLEARFIELD: “Moonlight.” ANDRAS TILCSIK: It’s “Moonlight,” I am sorry Yes So part of that confusion came from the design of the envelope I mean, look at this It’s pretty elegant It’s this artful red envelope with gold lettering that’s supposed to be the category name, but we can’t really read it And especially if you are standing backstage, it’s really hard And 
if I grab the wrong envelope, it’s very hard for anyone to see that a mistake is being made And now, take a look at what they did this year They learned, right? I mean, this is not a pretty design I mean, I am pretty terrible at these things, but this is the kind of thing I would design if they told me to It says “Best Picture” with huge font And it says “Best Picture” again in small font And it’s not pretty
It’s ugly, but it’s very transparent You make it so transparent that when you’re sitting at home watching on TV, you can probably tell if there is a mistake And that’s really the same kind of principle we see with the airplanes, where the other pilot, the monitoring pilot, can intervene And with that, I’ll turn it over to Chris, who I think will talk about another way to get transparency CHRIS CLEARFIELD: Yeah So Andras was just talking about transparency through thinking about how we design our systems But another big way to get transparency is by learning about our systems And that’s because in particular, complex systems– we can’t sort of sit down ahead of time and write down all of the ways that our systems are going to fail We can’t sit down ahead of time and kind of just list ab initio, these are the things that can go wrong, because we don’t know These failures come from these interactions But what we can do is we can increase transparency by learning from the failures that do happen and preventing them from kind of spiraling out of control And so it turns out that there’s one big obstacle to learning about our systems And that is actually bosses Now, bosses see themselves as somebody who is very friendly They have an open door policy You know, they encourage people to speak up But what the research shows is that people who are managed often see their bosses more like this, as these kind of sort of scary, bear-like, shadowy creatures, right? 
And this is something where I think you guys probably have a more forward-thinking discussion than a lot of companies Because you know, Project Aristotle, one of the big things that came out of that was how psychological safety helps with performance And that’s really what we’re talking about here We’re not talking about it in the performance context, but we’re talking about it in the context of getting information about the things that are going wrong kind of up and distributed to decision makers as broadly as possible And so there’s a lot of really interesting research that goes into this We’re going to talk about two threads of that So if you take a look at these two– this graph So this is a graph of two kind of groups of different teams solving a complex problem Everybody has different pieces of information, and they have to bring that information together to solve the problem So the blue team and the red team Or sorry, the blue group and the red group, each of which is comprised of a number of different teams So if you look, the blue group proposes five solutions The red group proposes closer to seven The blue group shares about half as many facts before a decision is reached And that’s really interesting, right? I mean, the number of solutions is significantly different But the facts are just wildly different So the red groups have a much more thorough discussion before a solution is reached And you might ask, well, why? What’s kind of behind this? Is there something intrinsic about these managers of these groups as leaders? Or kind of what is it? Do they have MBAs? Do they have kind of more leadership experience? 
And it turns out it’s nothing like that It turns out that the intervention is very, very small The leaders of the blue groups are taught to start their discussions with this phrase– the most important thing is that we all agree on our decision Now, here’s what I think could be done The leaders of the red groups, on the other hand, are taught to start by saying– the most important thing is that we air all possible viewpoints to reach a good decision Now, what does each of you think should be done? That’s pretty dramatic, right? I mean, this very small intervention yields dramatically different discussions and numbers of solutions that are proposed Now, one critique of this study is you might look at it and say, well, this is just a psych study, right? So how does this actually apply in the real world? And I think actually, you guys– again, through Project Aristotle, you guys actually have some of the most exciting data about how this can apply in the real world But we also have a really nice context where we have a lot of good data about how this kind of thing applies And that comes actually from commercial aviation So if you look at the accident rate over the last four decades in commercial aviation, it has plummeted dramatically And that is despite the increasing complexity of the system and the kind of increasing technology in it And commercial aviation has come up with a number of things, a number of strategies and lessons that make them more effective at dealing with this kind of big system that they run But one of them is called crew resource management And crew resource management is, among other things, kind of a script for how flight crews can talk about the decisions that they need to make, and how they can surface issues
But also, how the captains can listen to those issues and encourage those voices of concern And one of the things that crew resource management does is it’s a script It’s a script for how first officers can share concerns and how captains should listen to those concerns And this isn’t a lot different than a kind of boss-employee relationship There’s a hierarchy in the cockpit The captain is usually more experienced than the first officer But the first step for the first officer to raise a concern is to start by getting the captain’s attention So you might say, hey, Andras ANDRAS TILCSIK: I’m the captain now? CHRIS CLEARFIELD: Andras is the captain ANDRAS TILCSIK: That’s awesome All right Hey, Chris CHRIS CLEARFIELD: Hey, Andras I’m worried that we’re going to be late for our talk at Google, because I’ve heard that traffic in the Bay Area is pretty notorious I’m going to state the problem I’m going to propose a solution I think we should leave really early from our hotel, so we make sure we get there on time And then, we get explicit agreement How does that sound to you? ANDRAS TILCSIK: Sounds great I’d hate to miss that talk CHRIS CLEARFIELD: Right? 
So what’s fascinating about this is not its sophistication, because it’s not I have a five-year-old This is how I talk with my five-year-old about how he can raise concerns, how he can ask for help But what’s really interesting about this is this had a dramatic effect on the accident rates in commercial aviation In particular, when the captain was the pilot flying That’s what’s really interesting about this So before, first officers wouldn’t challenge the captain It was more likely that the captain was the one flying during these crew-caused mistakes After crew resource management, this language of dissent was taught, but also learned by the captains, by the bosses And that changed dramatically And I think there’s a bigger lesson here because we don’t all fly in airline cockpits But what it teaches us is that speaking up and listening are teachable skills And that’s a really powerful thing I think that’s a really powerful takeaway And what that means is that there are techniques that we can use kind of in our day-to-day lives, in our organizations to help encourage that And that turns out to be one of the keys to learning from our systems, and adding transparency, and catching those small errors before they turn into the big ones ANDRAS TILCSIK: All right And I think we have time to talk about one more set of solutions And this is something that really surprised us This really wasn’t in our book proposal This was not something we thought about earlier, but it’s something that we kept running into as we were studying cases of failure and resilience And let me introduce this with an example as well Take a look at these two students, hypothetical students Let’s call them Student A and Student B And imagine that they are both applying to an elite, academically-rigorous university Which of them do you think would be a stronger candidate for admission? 
If you asked people that question individually, something like 95% or 98% of people will say it’s Student A And it makes a lot of sense, right? Student A has a higher GPA, better SAT scores, and their activities don’t seem to be inferior to those of Student B, unless you think that drama club is really that important But it really seems like the right answer is Student A But of course, this is when we asked people the question on an individual basis But as humans, we are funny creatures when we are in a group, in a social setup And one interesting experiment that was done actually just a couple of years ago by researchers at MIT and a couple of other places, what they did is they put people in a room They gave them the exact same problem that I just gave you, Student A versus Student B, and a number of other problems And they set up the room so that each room would only have one actual subject The other three people were actors who were working with the researcher And the researchers instructed the actors to give the wrong answer To say Student B when the problem was presented So they would say Student B, Student B, Student B And in a very large number of cases– and you might recognize this setup from the famous Asch conformity experiments that were done a few decades ago The actual subject would also say Student B, even though we know that when we asked people individually, a very small percentage of people would think that Student B is the right answer And a vast majority would agree that it’s actually Student A Yet here, after hearing these three strangers say Student B, Student B, Student B, we are much more likely to say it’s Student B. Much more likely to give the wrong answer What’s really interesting about this though,
is what happens when you rerun the same experiment with diverse groups In this case, racially-diverse groups So still the same setup We have three actors, the same problem, one actual subject We introduce some ethnic or racial diversity in this case, but the setup is the same The actors say Student B, Student B, Student B. But in these cases, the actual subject is much less likely to fall victim to this kind of conformity and much more likely to give the right answer, Student A And this is something we see across a very large number of studies done in the past five or so years, this effect that diversity makes everyone more skeptical And that’s a very different way of thinking about the effects or benefits of diversity than what we usually do Usually, we think diversity doesn’t really do anything Or if it does something, it’s because people bring different ideas to the table It’s this sort of beautiful group process where we each bring something unique, and then we learn from each other And there’s this great discussion and it’s this beautiful thing Well, what this research shows is that in many cases, diversity is much more painful than beautiful It’s almost like a speed bump effect It makes things less smooth, less pleasant It makes things harder, but it wakes us up And we see this kind of effect not just in the lab We see it in field studies as well as lab studies, across all kinds of domains, and across types of diversity as well Whether it’s diversity in surface-level things Things like race and gender and age But also, in things like professional background Things that are not on the surface, but they seem to trigger the same kind of effect And we see this in finance We see it in juries We see it in a corporate boardroom very, very consistently CHRIS CLEARFIELD: So I’m going to bring this home with kind of a summary of what we’ve just looked at So we started by talking about systems, and how we can use design to inject some transparency into our systems But design isn’t 
the only thing that can help our systems There’s a number of other things One example that we really like is you can use this idea of complexity and coupling, and you can sort of use that as a heuristic to figure out which of your systems you should be paying attention to, where these kinds of failures are more likely We talked about learning Learning by hearing voices of concern from inside your organization But there’s other ways of learning about our systems, too We can learn by obsessing about small failures that happen And really, using them as tools to increase our understanding of how our system might fail And finally, we talked about, broadly speaking, decision-making We talked about how diversity can be a boon to how we make decisions in groups But there’s other ways we can improve our decision-making too, from introducing outsiders– something that groups like NASA’s Jet Propulsion Laboratory does– to using kind of a set of predetermined criteria to score options that we have when we’re making a decision between a couple of things And so I think broadly speaking, the solutions that we have researched fall roughly into these categories But there’s a lot of nuance and there’s a lot of different texture to them And I mean, one thing that you might be thinking– and that we thought about a lot as we were writing the book– is that this stuff isn’t rocket science I mean, you don’t have to be a genius to introduce an outsider into your decision-making process And what we kind of came down to was that these solutions are simple, but that doesn’t mean that they’re easy And what we see over and over again is examples where instead of going for transparent design, we choose sort of sleek, shiny design Instead of listening to voices of concern, we ignore them and suppress them And you know, what happened in Flint, Michigan with the lead poisoning crisis is a great example of that Or rather, a tragic example of that And even though we know the value of diversity and how it 
actually helps groups make better decisions, a shocking number of groups making some of our most important decisions are strikingly homogeneous And so really, our kind of broad thesis is that the world has changed A lot more of these systems are in this danger zone They’re complex and tightly coupled And they lead to these kinds of idiosyncratic and yet somehow connected failures, caused by a shared set of things And many of our organizations, many of our most important organizations, their decision-making, the way they approach these things, hasn’t yet caught up And so our hope with this book is to really just start a conversation about that And so that’s where we’ll end for today Or rather, that’s where the conversation will begin So thank you for your attention, and we’d love to have a discussion if anyone has any questions
or thoughts [APPLAUSE] CRAIG: There’s a microphone over here If you have questions, please step up to the mike And the authors have said they’ll take as many questions as we have time for CHRIS CLEARFIELD: I have a flight at 6:00 But until then, I’m good CRAIG: So please, step up Have your shot at them CHRIS CLEARFIELD: Somebody has to be first AUDIENCE: I’ll be first then CHRIS CLEARFIELD: OK AUDIENCE: I’ve not had a chance to read through the whole book yet, but I wanted to ask your opinion of artificial intelligence Where does this apply to the system and the complexity of the systems that we’re seeing today? CHRIS CLEARFIELD: Yeah, that’s a great question So where does this apply to artificial intelligence? I mean, I think artificial intelligence in many ways is– I mean, I think it’s a big question, so there’s a lot of answers I mean, on the one hand, artificial intelligence is in some ways the ultimate black box, right? So it sort of takes transparency and entirely removes it and replaces it with a set of parameters that nobody but the model itself understands And so I think from that perspective, that I would say is a little bit concerning Do you have anything to add? ANDRAS TILCSIK: Yeah No, I would say that I think there’s potentially an upside, too, in terms of these issues And I think that this notion that AI, if used correctly and well-calibrated, will allow us to predict things in a more effective way, especially in these very complex systems So it’s this interesting tension between, are these things going to be the ultimate black box or are they going to be our most effective prediction machines that we can use to prevent, or at least foresee some of these failures? 
Or at least learn better from the enormous amounts of data that we are generating as we are trying to learn about these small failures And at Google, either way, I think the challenge is to sort of manage that trade-off and end up on the positive path CHRIS CLEARFIELD: And I think I would also kind of push the answer back a level and say that, I think whether or not AI is able to help or hurt depends on the organizations that are building the AI, too And whether or not they are able to surface voices of concern or pay attention to these things that aren’t quite right Because I think the other side of the AI question is that we’re building AI into more and more important systems We can talk more about this, but I think driverless cars are an excellent example of both the upside and the challenge of that Thanks Do you have a question? AUDIENCE: So I was thinking about the airplane example And you know, there’s this funny thing that when you talk to users, sometimes they don’t know what they want Or they think they know what they want, but really they’re wrong And maybe this is just UX design, which I don’t know anything about But I wonder if just as there are techniques for encouraging dissent in a [INAUDIBLE], are there techniques for getting users to be– like the people who designed the new cockpit style must have talked to the pilots And I guess they must have heard, oh, it’s a pain in the butt that we knock each other’s lunch on their shirts And they weren’t made to think of the things that they really valued about that interface Are there techniques? Just like for the [INAUDIBLE], are there techniques of getting users to better understand what are the mechanisms that are making their systems not work? 
ANDRAS TILCSIK: I think that’s a great question And I think what we’ve seen in the aviation case is that it’s not so much– I think it’s a huge challenge to get people to describe what they want and how– especially in the cockpit design, where it’s really hard for a pilot to describe how they might be interacting with this interface in a crisis situation People can explain, oh, on a normal day, it will be very nice to have my lunch And it’s going to be very comfortable But it’s really hard to kind of think through that And I don’t think that’s what aviation has done So I think with– well, maybe in the Airbus case they sort of talked to their pilots And maybe that’s part of the problem rather than watching what the pilots actually do The reason we are learning about these issues, and the reason I could show you the slide and talk about the benefits of that ugly but transparent design is that we are seeing accidents Or actually, some near misses where we see people in action And of course, aviation has a huge advantage in that we have cockpit recordings and black boxes And we know exactly what was happening and who was pulling back and who was pushing forward
But I think the broader lesson there is to– I mean, it’s really about the basic principle of design thinking, of watching people interact with these systems across a number of different situations rather than necessarily listening to them as they are explaining it CHRIS CLEARFIELD: I think there’s an element of it, too, that Andras kind of touched on, which is not designing for the blue sky day, right? Sort of pushing yourself and your design team to look for all– when you go off the happy path, right? And I think that you could imagine Airbus having a design where they have those lovely side sticks, but they are physically linked They are physically connected And for us, that’s kind of this– I think what we see is that the users can take you so far, but then you also have to think in these terms– if you think about transparency, complexity, and coupling as variables when you’re designing the system, then that can give you some really powerful insights into how to make sure that your sphere of design is broader, I guess I would say ANDRAS TILCSIK: Yeah We talked about this idea of learning from these small failures as they happen I mean, one of the hallmarks of a complex system is that you can’t sit down and just predict how things might fail, but you can learn something about how things might fail as people are using the system So another example is the design of gear shifters in cars, where we are actually seeing the exact same kind of process where there have been a number of designs in recent years that are very sleek, very elegant They are little dials and things like that Things that are mostly computerized, don’t have a lot of physical feedback And car companies introduced them because they could and they looked nice and they were pretty, but we started seeing a number of accidents where people didn’t know what gear their car was in They would get out of the car They would get injured or the car would get damaged And there was data about that, but
the companies didn’t really pay attention to that And often, that kind of data is much more valuable than asking a user, would you like this little dial? People are pretty terrible at knowing what they’ll actually use CHRIS CLEARFIELD: Other questions? Could you use the microphone? Thanks AUDIENCE: So going back to the Starbucks example, it sounded like these [INAUDIBLE] systems and there’s a lot of human factors involved This is not something new– trolling is pretty common in the Twitter world So looking back to your framework, what [INAUDIBLE] how could Starbucks have done things differently? I want to understand how these principles apply CHRIS CLEARFIELD: Yeah Well, I would start even a step back where you’re right that this is not an unfamiliar dynamic for– trolling is not an unfamiliar dynamic for Twitter But I think it’s also easy to forget how new Twitter is So if you think about Starbucks and how they’re kind of building up their experience marketing Most of that experience, or a lot of it, especially at the time when this was done, was built up in the non-Twitter world, right? It was built up in the– we come up with a campaign, we push it out to people, people like it or don’t like it As you said, this is an example where they have this kind of feedback What I think, I guess, is– I mean, one thing might be, be a little bit more skeptical of how you are engaging users Again, they designed for going off the happy path a little bit in that they had this content filter But the fact that the content filter broke, I think, on the face of it means that they didn’t put in the engineering talent or they didn’t take the process as seriously as they could have, as they maybe needed to And I think that is actually a surprising– that is actually something that companies have to learn from Just like United Airlines is seeing with– I mean, they just had PR disaster after PR disaster, right? It used to be that that would stay on the plane, right?
And maybe other customers on that flight would be aghast that somebody was pulled off in that way But now, it goes global And it goes global immediately And I think that the way that these controversies can spread means that these companies have to take a very different approach to managing not just their marketing, but also their kind of– their whole approach to this kind of thing ANDRAS TILCSIK: I’ll add two other things One is that while researching the book, we came across a number of these tools that teams can use
Little exercises that allow them to think about these sort of less expected risks more effectively than they normally do You’ll see a number of examples of that The other thing I’ll add relates to our point about diversity So the Starbucks controversy around wages and taxes at the time was really big and really prominent in the UK But this was a global campaign And I think the people rolling out the campaign might not have been aware of just how controversial and problematic this could be right at that time And I think it’s interesting when you end up with this– you are trying to run something globally, yet the team is so local Or the decision-makers are so local, but the consequences are global CHRIS CLEARFIELD: I think I’ll add one more thing to that, which is it seems like what Starbucks did– and we also saw this with– do you guys remember Tay, Microsoft’s Tay bot? So what happened in both of these cases– and I think there’s kind of a common lesson to them– is they made this product and they put it out in the world And they just kind of threw it over the wall And I mean, I can’t believe that Tay ran for as long as it did I mean, to me that’s really surprising If you’re going to build an AI or you’re going to build a system that has all of this feedback in it, then I think you need to be watching it much, much more closely And I think that’s a lesson that Starbuck– they could have had somebody monitoring it at some level And I mean, you guys know how to build tools to monitor things, right? They could have done that and taken that aspect of the process more seriously ANDRAS TILCSIK: Other questions? CHRIS CLEARFIELD: Yeah AUDIENCE: So my question is related to the failure itself So I’m wondering like, so there’s a book from [INAUDIBLE] about [INAUDIBLE] But my question would be about, like maybe failure is good for the system?
Like small-scale failures, you know, versus large failures So it kind of strengthens the system So in your framework, at what scale is failure good for the whole system? How do you measure the scale of a failure, whether it’s big or not? CHRIS CLEARFIELD: Yeah, that’s a great question I think we start with the premise that failure is a natural consequence of building any system And so what we have focused on is how you obsess about failures so that you learn from them before they become the big ones And also, how you recognize when your system is in a fundamentally different regime Because I think that’s another big thing Some systems, you can afford to have failures in because they’re local, and they’re not tragic, and they don’t propagate But if you think about like BP’s Deepwater Horizon oil rig You know, the consequences of that failure A, it was more likely to happen because of all of the factors that obscured the engineering that they were doing, right? It was under the ocean It was under the rock They couldn’t directly observe it They had to use simulations for a lot of stuff And they really relied on their gut and their intuition to decide how to kind of manage the drilling process And what we saw is that, A, that didn’t work And B, the consequences were not only tragic in that people died, but also expensive I mean, $50 billion And long-lasting And the environmental damage is still something that is present and being managed And so I think our perspective is that– it’s interesting you mentioned [INAUDIBLE] because the black swan comes up a lot And I think there probably are true black swans out there, but what we kind of think is that most of the time, you can find feathers long before the black swan happens, right?
So you can see these black feathers And the chance that your system isn’t telling you any information that there might be a problem, that’s pretty rare What’s much more common is that your organization isn’t learning from that information ANDRAS TILCSIK: Yeah So in a sense, it’s almost like the issue isn’t scale So whether you have a good failure or a bad failure isn’t simply about scale It’s sort of what you do with that failure, right? So obviously, Deepwater Horizon is a terrible failure It could have still been a good failure in a particular sense had the industry learned from it And that’s what we have seen in aviation They had some terrible accidents in the ’70s and ’80s, which then triggered all this learning I don’t think we are seeing that in offshore drilling And it’s really what you do with it CHRIS CLEARFIELD: Yeah And I think even in the Deepwater Horizon case in particular, there were lots of small failures that preceded that big failure, right?
There were warning signs that were ignored, and so on and so forth And I think the dynamics of that accident are really interesting and can be really instructive for thinking about your question, which is a really thoughtful question Thank you CRAIG: Other questions? I have one CHRIS CLEARFIELD: Great CRAIG: I didn’t hear two words very often in your talk One, testing And the other, redundancy And I think [INAUDIBLE] Would you have a comment on that? CHRIS CLEARFIELD: Yeah Testing is good I think part of why we didn’t talk about it is we know that that’s something that you guys already do a ton of You guys call it the Disaster Recovery Team, the DRT team Is that right? So I mean, I think as an organization, you all take a fantastic approach Actually, it looks a lot like Jet Propulsion Labs, where you’ve kind of incorporated these outsiders into how teams are designing and engineering things Yeah I mean, testing is super-important If you think about learning from failures that you see as this step, testing is almost the previous step It’s like– sorry Setting up situations where you’re trying to cause these failures so you can preemptively learn from them I mean, one of the things that we do see though, is that testing isn’t the whole enchilada It’s not enough because you can never really reproduce the real world And there’s a really interesting example– the Facebook IPO We go into kind of how NASDAQ tested for that And also, how that failure still happened So testing is awesome And I think we kind of [INAUDIBLE] it a little bit because we think that you guys are probably test evangelists Redundancy is really interesting I mean, redundancy has a huge upside But the other thing redundancy does is it adds a lot of complexity to the system And so you see kind of failure after failure where redundancy is the cause or exacerbates the issue in the first place And so yes, you need to have redundancy But I think you need to have redundancy in, as– I would say, almost like as 
redundancy can’t be an add-on It has to be kind of an inherent part of the setup And if you have that, it’s much more likely to be an upside rather than a downside ANDRAS TILCSIK: I’ll also say that I think one problem we see in a lot of these cases is that redundancy has these social and psychological effects on people So one thing we know is that when there is some redundancy in the system, even if it’s not perfect, people often tend to believe that it is perfect or at least very robust, which then leads them to take more risks So we talked about Deepwater Horizon just now I think that’s an amazing example of that There is a point in that story where one of the engineers on board says, I’m really not happy with what we are doing I’m very reluctant to go forward with this, but I know we have this redundant system We have this blowout preventer That if everything else falls apart, it will just cut things off And so he says, I guess that’s what we have that system for, so let’s do it And of course, when the explosion happens and the failure happens, it takes out that redundant system as well And I think– in fact, I’m pretty convinced that had they not had that redundant system, they probably would not have taken such a risky series of steps CRAIG: Any other questions? 
I’m going to take the moderator’s privilege and ask one more then CHRIS CLEARFIELD: Yes CRAIG: I read your book And one of the things that was fascinating to me is the discussion of bank failures and the diversity of the boards there I wonder if you could briefly say something about that, because I think that’s very relevant to decision-making at a place like Google CHRIS CLEARFIELD: Yeah, this is a fascinating– totally fascinating study where these researchers looked at the failure of community banks through the financial crisis And what they found is that banks that had more bankers on their board were actually more likely to fail, which is a little surprising Because you would think that bankers should be pretty good at managing a bank And it turned out it’s very similar to this diversity result that we saw That when the board had lawyers, or nonprofit people, or doctors, or accountants on it, there was this freeing up to ask questions And to say, hey, I don’t understand this Can you all explain this more? Or like no, we’re not going to go with this new product because we don’t understand it yet
It’s not like a risk effect It’s not that the banks with more bankers were making more money because they were taking more risks They were making the same amount of money, but just failing more So they had the same returns And I mean, to your point, I think that’s an example of two things And I think I can speak to this because I sort of straddled these worlds through my career Like normal people think differently than engineers some of the time And so having a group that has a mix of backgrounds can be really powerful to getting that kind of– that questioning and that diverse perspective But also, we should be thinking about how to bring outsiders into some of these most important decisions I mean, the example I think about is Apple slowed their iPhone down with the new operating system And that was probably a really good engineering decision It probably made sense They could kind of make the performance more predictable The phones didn’t turn off But that’s not how the public viewed that And I think if Apple had pulled in somebody off the street– well, that’s hard in this area because there’s so many engineers But pulled in a non-engineer, they might have said, this is crazy This is not a good solution And so I think that there’s an element of that And it’s interesting that surface-level diversity and the kind of professional diversity have such a similar and striking impact CRAIG: We have time for more questions CHRIS CLEARFIELD: We’ll also be up here for a minute if anybody has anything they want to chat about CRAIG: All right Well, let’s thank our speakers And I encourage you to read the book It is fascinating Thank you again ANDRAS TILCSIK: Thank you CHRIS CLEARFIELD: Thank you all very much [APPLAUSE]