So what I will do is go over a few well-known patterns, and later on show some implementations using Riemann, which is dedicated to [inaudible]. And this is because if you don't have any business [inaudible], why monitor? If your customers don't care, don't monitor. Okay, that's not the case for us.

The way we approach it is by first asking the business: what is your greatest fear? Then we work through that list. The first fear, which I think anyone with a public API has, but especially in the e-commerce business, is that the API will be unavailable, or in our case that we would return the result too late, too slowly, or, God forbid, that we expose some kind of 500 or 503 error message, which basically tells our customer: look, you need to make a decision, make it on your own. This is a big no-no when you talk with e-commerce merchants. They want a decision on what to do with the transaction in the middle of the workflow: someone clicked "I want to buy something", and they need to deliver the goods and send the receipt, and you can't stop that [inaudible]. So this is our biggest fear.

There are two ways to approach it. The first is to fix things automatically, because we as humans are too slow. It takes at least, let's say, 10 to 15 minutes for an alert to propagate, and understanding what the problem was takes even more time, so you need the system to react by itself. Then, for the rest of this talk, we'll talk about how to do alerting.

The first thing, before you start monitoring, is to understand what type of system you monitor. If you think about monitoring, it is about sending probes: you send a request and you expect a response. If your system is a best-effort system, then you don't always get the same response [inaudible], so the monitoring system has to be built accordingly. We have a low-latency system that basically handles any transaction and runs the business logic as best it can. Then you have an event stream processing system, which is for high throughput [inaudible]. And later on we saw that before the transaction, and mostly after the transaction, we need to run a consistent system, for example to send the bill to customers. There you do not want any mistakes. It is not a best-effort system: you don't send a bill to your customer saying "you need to pay us between one and ten dollars". I'm going to focus on this system specifically because [inaudible].

So, automatic failover. [inaudible] The most basic part: you have an exception, and you usually let it bubble up unless you know what to do with it. You need to let the process crash, fail fast, and then you have a system such as Upstart, [inaudible], or something like that, which restarts the process. This is the basic level. On top of it, [inaudible] the virtual machine can terminate at any moment, so you need [inaudible] to start a new virtual machine instead. On top of that you have the load balancer, which needs to decide where to route the transaction, and if for any reason it doesn't get a response from one of the instances, or from one of the processes inside the instances, it just routes it to another instance. And on top of that you can use a third party to [inaudible], which basically means: if this data center is off, just route all the traffic to the failover [inaudible]. This is the last resort. So this is automatic failover. You usually have one or all of those.

The second pattern is graceful degradation, and that's very important, because the code that changes the most is the Storm part. If Storm does not respond in time, you still need to say something to your customer. So you have the API server, written in Node.js, and it needs to make some kind of decision. This one is more stable because we make slower changes to it. And if Node.js has any problems at all, then [inaudible] needs to make some kind of decision, because as I've said, we need to provide our customers with a decision. We are decision as a service.

The last pattern [inaudible] is avoiding back pressure on your clients. Let's say one of our clients had some kind of problem and they want to resend the entire last day to us in one batch. We don't want that to interfere with the real-time system, because we need some kind of capacity planning and so on. On the other hand, we do not want to send back an error, as most major APIs do. We don't want to send back "listen, you exceeded your limit". So we don't want to apply back pressure. What we do is have some code in Node.js that does local monitoring per customer, and if the transactions per second are greater than a threshold, it sends the transaction to a different queue, and that queue has different workers: a different machine, a different process.
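The per-customer overflow routing just described can be sketched roughly as follows. This is a minimal Python illustration, not the Node.js code from the talk; the class name `OverflowRouter`, the queue labels, and the one-second sliding window are all assumptions made for the example.

```python
from collections import defaultdict, deque


class OverflowRouter:
    """Route each transaction to a real-time or an overflow queue, per customer.

    Hypothetical sketch: names and the sliding-window policy are assumptions.
    """

    def __init__(self, max_tps, window_seconds=1.0):
        self.max_tps = max_tps
        self.window_seconds = window_seconds
        # customer_id -> timestamps of transactions admitted to the real-time queue
        self._admitted = defaultdict(deque)

    def route(self, customer_id, now):
        admitted = self._admitted[customer_id]
        # Forget admissions that have fallen out of the sliding window.
        while admitted and now - admitted[0] >= self.window_seconds:
            admitted.popleft()
        if len(admitted) < self.max_tps:
            admitted.append(now)
            return "realtime"
        # Over the limit: no error goes back to the client, the transaction
        # is simply handed to a separate queue with its own workers.
        return "overflow"


router = OverflowRouter(max_tps=3)
burst = [router.route("customer-a", now=0.0) for _ in range(5)]
```

The point of the design is that the client never sees a rate-limit error; only the internal destination of the work changes, and other customers keep their own counters.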
Since the overflow queue has its own workers, the real-time workers are entirely unaffected. On top of that there are test probes, which are sent to each of the queues, because we want the test probes to cover everything. Inside the queue we have a priority queue, and the test probes have a lower priority, so if the queue is stuck the test probes are stuck too, and then we get a notification.

So these are our auto-healing patterns. Everything that is covered by auto-healing should not be handled by a human being; everything that is not covered by the automatic system needs human intervention, and for that you need some kind of alert.

The most basic monitoring system is: detect; then do some filtering, because we humans cannot absorb so many events, data, and phone calls; then trigger some kind of alert, and there is some kind of on-call timetable, so today I am on call [inaudible]. And once you get this phone call or this alert, you need to do something: you look at some kind of dashboard [inaudible].

So this is our simple monitoring system. These are the detection points. We have Pingdom; Pingdom just sends a simple HTTP request. We have our own system tests, which are basically Node.js test files that send a REST API request, then check the result and check that the state has changed in the system. Then we have our code, which is Node.js, Java, and Python, and it sends a lot of events: some are latency recordings, some are exceptions and irregularities. For the infrastructure, inside the operating system we have collectd, which is one of the oldest, most stable, and most common ways to monitor a Linux system. And here we have AWS, Amazon CloudWatch, which monitors the infrastructure that Amazon provides.

The next stage is to filter all this data. For the rest of this talk I will focus mostly on Riemann. Riemann does event stream processing, which is basically some kind of filtering, and it routes each event to the right alerting system. So if this is just [inaudible], I don't want it to wake me up at night. What I usually want is to send it to Slack, and then when I wake up in the morning I get a chat record of what happened. Production alerts go to PagerDuty with the right urgency settings; they recently introduced urgency settings. From there, PagerDuty knows which one of the developers is on call and what their preferred notification method is, and basically notifies us that we need to do some kind of diagnostics.

Any information that goes through Riemann also goes to our dashboard, which is open source: Elasticsearch, Logstash, and the UI, Kibana, which is one of the most customizable dashboards there is. [inaudible] But basically any developer can add to an existing dashboard or create a new dashboard, and there are around twenty dashboards like that.

The first thing you saw in the detection phase is that we have a lot of redundancy, and that's good, because monitoring systems are not as highly available as the rest of your system; you don't have enough time to make them so. So you need multiple monitoring systems in case your alerting system fails.

First there are the infrastructure alerts. They give a very fast [inaudible]; usually when we have a CPU spike it's a real problem. But the problem is that they don't give [inaudible], because it's pure detection. Then you have application alerts, and they tell you the story, they give you the root cause, but unless it's a fatal exception [inaudible] they are usually too noisy for generating alerts. Then you have the Pingdom probes: if you get one of those alerts, you know you're in trouble, because all Pingdom does is check that your HTTP service is online. That's it, and they do it great; they're very reliable. The problem is that it doesn't cover all of our customers' use cases. Our own probes, what we call system tests, cover the entire spectrum of our customers. So our system tests have better coverage, but as I said, since our system is a best-effort system, it is very difficult to get them right and to reduce the false alarms, and each time we gain a new customer with a new use case we need to do some tuning of these alerts.

So this is the PagerDuty mobile application; this is how it looks. This is an alert from [inaudible]: it saw one of the systems producing an error, and you can see that [inaudible]; it waited for five data points in this case [inaudible]. This is how it looks when a system test fails: we have a system probe that alerted Riemann, which, as we will see in a minute, alerted PagerDuty, and then after some time the test passes and Riemann closes the alert in PagerDuty. [inaudible] They recently introduced a new mobile UI [inaudible], you know, swipe left, swipe right.

So how do you filter the system tests? Using a state machine. You saw Riemann sitting before PagerDuty: we don't want every system test result to be sent to PagerDuty, first because they would throttle us, and second because [inaudible]. So we do event processing in Riemann, whose implementation I'll show you in a moment, but the logic goes like this. Look at the event. If the event has some kind of tag, for example the name of the test [inaudible], and each of those events carries the state of the test, whether it recently failed or passed, and there was some kind of state change, then do one of these: if the test passed, resolve the PagerDuty alert; if the state changed and the test failed, trigger PagerDuty. This is what you would see in the Riemann documentation. This is how it works in the code.

The reason for all these parentheses is that Riemann is based on Clojure, and Clojure is based on Lisp, so I will translate. All of these functions are streams, which means that you will never, or almost never, see the event itself [inaudible], as you would in other procedural languages; you just talk about functions and about transformations. So if the event is tagged like this, then pass the event into this stream. This is a function that gets an event. And this is the condition: if the state changed, assuming the state machine starts with this state, then check: if the recent state is passed, resolve the PagerDuty alert, and if the recent state is failed, trigger PagerDuty. This is exactly the same logic that you saw earlier in the diagram. We will do more and more code snippets and fewer and fewer flow charts.

The problem with this simplistic alerting approach is that if I wake up, resolve the alert in PagerDuty, and then go back to sleep, what happens? Riemann still thinks that it has alerted PagerDuty, but the problem persists, and I would like Riemann to reopen the PagerDuty alert. For example, there was a test that failed; Riemann still thinks that the latest state is that the test failed, so there is no reason to alert PagerDuty again. On the other hand, I manually resolved the alert using the PagerDuty mobile application, and then what happens is that the alert is not reopened while the test keeps failing.

This is how we solved it. This part is the same: only resolve if there was some kind of change and the new state is that the test passed. However, if the test failed, regardless of any previous state, then (we will talk about this construct in a moment) basically split the rest of the stream, like a GROUP BY in SQL: split the stream by the name of the host and the service. This is the machine running the test, and this is the name of the test. Then send it to PagerDuty, but make sure that you don't send more than one trigger per minute. All this complex logic is implemented in Riemann pretty easily. This is the code you saw earlier; now this "where" clause is outside the changed-state block, and you do the split by host and service, then limit to one per minute, and only then trigger PagerDuty.

[Question from the audience: is that a Riemann construct?] Yes, it's basically another function that gets an event [inaudible] and outputs the event at most once per minute. So Clojure is a little bit difficult to start with, and Riemann even more so, but once you're in it you get a lot of power [inaudible].

Now, once you get the alert, you need some diagnostic tools; you need some visualization. This is not the real dashboard, this is just a mock, but you can see a lot of information, like the latency, the time since we first saw this transaction. We report it for each one of the bolts in Storm, so we can use it to diagnose exactly where the problem is. We have a different view, like a timeline. Here are the names of the bolts [inaudible] and the metrics. How can you tell which one is taking too long? [inaudible] Usually most of those are pretty short, and you see the succession of events, and here there is a gap that shouldn't be there; maybe it's JVM garbage collection or whatever. And you can see that a few bolts take forever. Here it is more low-level. We can filter by the name of the branch tag: production or develop, or which version of production, or any developer who started a new branch, and then filter all this data based on that. And down below we have the list of all events. That is basically the diagnostics.

[Question from the audience: how does the data get to Riemann, push or pull?] Riemann is a push-based monitoring system. The way it works is that you either have a TCP connection, which is the preferred way, and then you also get back pressure, which could be a little bit dangerous if the client is not ready for it, or you can send via UDP, which is less recommended but safer on the client side. But then you are never sure whether there's a real problem or just a network partition [inaudible].
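The revised PagerDuty routing rule from this part of the talk (resolve only when the state changes to passed; trigger on every failure, split by host and service, at most one trigger per minute) can be sketched in Python as below. The talk implements this with Riemann streams in Clojure; the class name `AlertRouter`, the action strings, and the assumption that every test starts in the passed state are illustrative choices, not the speaker's code.

```python
class AlertRouter:
    """Decide what to send to the paging system for each test event.

    Hypothetical sketch of the rule described in the talk, not Riemann's API.
    """

    def __init__(self, min_trigger_interval=60.0):
        self.min_trigger_interval = min_trigger_interval
        self._last_state = {}    # (host, service) -> last seen state
        self._last_trigger = {}  # (host, service) -> time of last trigger

    def handle(self, host, service, state, now):
        key = (host, service)
        # Assumption: the state machine starts in the "passed" state.
        previous = self._last_state.get(key, "passed")
        self._last_state[key] = state

        if state == "passed":
            # Resolve only when the state actually changed to passed.
            return "resolve" if previous != "passed" else None

        if state == "failed":
            # Trigger regardless of the previous state, so an alert that was
            # resolved by hand is reopened while the test keeps failing,
            # but send at most one trigger per interval per host and service.
            last = self._last_trigger.get(key)
            if last is None or now - last >= self.min_trigger_interval:
                self._last_trigger[key] = now
                return "trigger"
        return None


router = AlertRouter()
actions = [
    router.handle("host1", "checkout-test", "passed", now=0),
    router.handle("host1", "checkout-test", "failed", now=10),
    router.handle("host1", "checkout-test", "failed", now=20),
    router.handle("host1", "checkout-test", "failed", now=75),
    router.handle("host1", "checkout-test", "passed", now=80),
    router.handle("host1", "checkout-test", "passed", now=90),
]
```

Keeping the throttle state per host and service is what stops one noisy test from suppressing triggers for a different test or machine.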
[inaudible] We built our own system based on Riemann. [inaudible]

Yes, so I'll repeat the question. The question was: how does each developer start a new branch, test their code, and get this dashboard? Basically we have a script that checks what branch you are currently on and then deploys the code with the exact same branch that you have; it even checks whether you forgot to commit [inaudible]. And when it reports to Riemann, we automatically add this information: what the name of the branch is, when the branch was deployed, and also some metadata such as the type of the branch. Then we filter on it either in Riemann or in Elasticsearch and Kibana.

The second fear that the business defined is that the JavaScript snippet that our customers embed would hurt the performance of our merchants' sites. The worst thing that can happen is that someone wants to buy something and it fails, and the second worst thing is that someone tries to browse the web page and it's too slow. So we want to know that our customer has a problem even before they know it. That's very important, because once you get the phone call from the customer, it involves the entire technical support, customer success, and all the managers. You don't want that. You want to be the first to know that there is a problem.

So what we defined is that we want to monitor each and every browser. For anyone who browses our customers' websites, we want to know about it and we want to know what's going on. Then we want to do some kind of aggregation per browser, because JavaScript problems are usually specific to certain browsers and their little differences. And then, as I'll talk about later, we can define alerts [inaudible]. So what do we want to know? We want to know about timeouts: someone downloaded the snippet but [inaudible] the page didn't load. We want to know about errors. And we want to know whether it's specific to one JavaScript snippet version, only the new version, or whether it is some kind of general problem no matter which version of the snippet is used [inaudible].

So we need to dig a bit deeper into Riemann and its internals to understand how to do it, because I think we're the first to abuse Riemann for monitoring browsers instead of servers [inaudible]. It wasn't designed for that from day one, but it works.

When you monitor a server using Riemann, Riemann holds an index. The index is a copy of the last event: the last event for each server, for each process running on that server. For example, you have an nginx on server number one, or let's say server 1001, and this specific event counts how many megabytes [inaudible] passed. So the event says the metric is five [inaudible], the host name is the IP address, and the service is whatever metric you want to monitor. What we see here is just an example of a test that we sent to two different machines: here the test failed, and here the test passed. What Riemann does is store the latest event that matches this key. So if this test runs again and the event's state changes to passed, it would only store the last one. It's like a basic cache.

What it also does, and this is the next part, is update the time to live. When the TTL expires, when the countdown clock runs out, Riemann sends a fake event which is almost the same as the event it stored; it just changes the state to expired. We are going to use that to understand whether someone just downloaded our JavaScript snippet or also loaded the page.

So the first thing we do is, instead of thinking of it as host and service, we think of it as the browser IP address and the cookie number. These two identifiers are unique to the browser, to the browsing session. And our event state would be "loaded", which means the page has loaded, or "downloaded", which means that they only fetched the JavaScript snippet but the page has not loaded. It's much more complex than that, but I'll leave it as a very simple state machine. So we have three states: JavaScript downloaded, page loaded, and expired, which basically means a timeout.

I promised you that I would explain how "by" works, and once you understand that, you will understand the tricks that we had to do to adapt Riemann to this. [inaudible] The "by" clause basically receives the fields to split on, which, as you saw before, were host and service, and each time there is a different combination of host and service, it creates a new stream. Think of it like JVM objects. And if there is an existing host and service, meaning this object has already been created, it just passes the event through to that object. So for example, if I have a browser and a certain cookie, and this is the first time I see this kind of event, then it will create a new stream object; if this is the second time I see this event, it uses the existing stream object and just passes the event to it. This is what lets us actually maintain state.

What we did is we forked Riemann and added a new construct [inaudible], which is almost the same as this one except that it does garbage collection. There are many more browsers in the world than enterprise or data center servers, and the default implementation never cleans up. Doing it generically in the main branch would require some kind of rewrite, so we took only our use case, which is pretty simple, because the index sends us an expired event, and once we see that event for an existing object, it automatically deletes the object, and then it is garbage collected by the JVM. This allows us to monitor more and more browsers without running out of memory.

I promised to show you some code. This is the state machine. It's much more complex than that, but this is a simplification. You have an event that says the JavaScript was downloaded, and then you get an event that the page is loaded. Sometimes you don't get that event, and then the Riemann index sends an expired event. And this is the Clojure implementation. [inaudible] Think of it as a lot of objects: each time we have a new browser or a new cookie, Riemann creates a new object. Once it does that, we check whether the state has changed [inaudible], and then it passes both the previous state and the current state to this function; this tells Riemann that we want both the previous event and the current one. And this is a simple "if", it's just in Clojure, so I'll read it as if it were a simple language: if the previous state is downloaded and the current state is loaded, then change the metric of the previous event to be the difference between the current time and the previous time. All we do is take the downloaded event and change the metric to hold the time it took to load the page. That is all this does. And then, if the previous state equals downloaded but the current state equals expired, then pass on the previous event, but this time change the metric to [inaudible].

So this is one stream. The next stream after it does the aggregation. This is, for example, a pie chart showing, for a certain time range, which browser had which load time. How do we do that? Now I want to split the stream. We had the split before, but this time we split by browser. So we have this split, and then we want a list that includes all the events from the past 60 seconds; this is what the fixed time window does. Then we want to do two things, as the arrows show: split the stream again and do two things. The first is to find the median event, add [inaudible] to it, and pass it on to whatever child stream there is, usually Logstash. In addition, count the number of events in the time window and pass that to the child stream, again Logstash. Then we can draw these kinds of charts in Elasticsearch and Kibana based on these metrics, and we can add alerts based on that, because the same data we see in Elasticsearch and Kibana is in Riemann, so we can also use Riemann to send the alerts.

The last fear we're going to talk about: when I first came [inaudible], I thought this was the number one fear, that we make a mistake. Apparently it is only the third on the business's list. Alerting on such a system, which is a best-effort system and is also complex, as I have shown you, is very difficult. What we need is a variable threshold. Monitoring based on variable thresholds is called anomaly detection.

Some requirements: I want a different threshold per customer. Some customers have a high decline rate and some don't, and I want to take that into account. And sometimes, for example at Christmas, the decline rate changes compared to, say, June or July. And I want to be able to define the sensitivity level of the alarm.

This is the architecture. Basically there is a process that reads historical data from Elasticsearch and alerts Riemann. Riemann, as usual, does all the filtering [inaudible] and decides whether to send it to Slack or to PagerDuty. The thing is, for real-time monitoring you don't need a disk-based system; this is all in memory only, and as long as you have enough RAM and CPU this system is very, very robust. Here we have disks, I/O limits, and stuff like that. So we need stored data only for anomaly detection [inaudible].

This is the alert definition: alert me if the probability [inaudible] is one in a million. If something happened whose probability of happening is one in a million, then please wake me up. This is the definition that I want. I don't know what the threshold is [inaudible], but this is what I want.

So we assume that the distribution of declines is a binomial distribution: we have this many transactions, this many of them were declined, and the probability of a decline is this. We use Elasticsearch to perform search queries. I perform two queries. The first one covers a 24-hour span, or a span between two dates, to calculate the per-customer probability for a transaction to be declined. It's pretty simple: give me the number of transactions in the past 24 hours, and give me the number of declined transactions. [inaudible] So basically, to do the math, we perform two Elasticsearch queries: give me the number of transactions in the past 24 hours, and give me the number of declined transactions.
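The "wake me up if this is a one-in-a-million event" rule under a binomial assumption can be sketched as follows. This is an illustrative Python version, not the speaker's implementation: the function names, the two-sided check, and the idea of passing in counts for a recent, shorter window are assumptions, and the baseline probability `p` is taken to be the declined count divided by the total count over the long window.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k declines among n transactions, for 0 < p < 1."""
    # Work in log space so that large n does not overflow.
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def is_anomalous(declined, total, p, rarity=1e-6):
    """True if seeing this few, or this many, declines is rarer than `rarity`."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Probability of `declined` or fewer declines.
    lower_tail = sum(binomial_pmf(i, total, p) for i in range(0, declined + 1))
    # Probability of `declined` or more declines.
    upper_tail = sum(binomial_pmf(i, total, p) for i in range(declined, total + 1))
    return lower_tail < rarity or upper_tail < rarity


# Baseline: 5,000 declines out of 100,000 transactions gives p = 0.05.
baseline_p = 5000 / 100000
no_declines = is_anomalous(0, 300, baseline_p)
typical = is_anomalous(15, 300, baseline_p)
```

Nobody has to pick a count threshold per customer here: the only tuning knob is `rarity`, and the count thresholds follow from each customer's own baseline rate and recent volume.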

Divide the declined count by the total, and this is our probability for a transaction to be declined. Then: give me the number of transactions in the past 30 minutes, and give me the number of declined transactions in the past 30 minutes, and then put that into the binomial distribution.

This is an Excel chart that helps us understand how it works. You can see a nice binomial distribution. [inaudible] Assume that in the past 24 hours we had, let's say, 100,000 transactions, and let's say 5,000 were declined, so this gives us a probability of five percent. And in the last 30 minutes we had 300 transactions. Then if zero of those are declined, that is an alert, because for 300 transactions with a 5% decline probability, for no transaction to be declined is so unusual that it is a problem. Or if out of the 300 transactions more than 36 were declined, then that is an alert too. So I don't have to define these numbers, 36 and 0. I don't have to do that. I just need to say: if this incident is so rare, once in a million, then wake me.

The way this is translated: this is the binomial distribution. This is the probability for seven declined transactions out of 300, this is the probability for six, five, four, and the probability for seven or fewer is the sum of all of those. So what we need to do in order to have this variable threshold is to sum probabilities and then compare to the threshold. It looks complicated, but it's basically in this Excel file. It also allows us to change the probability function and still use the same semantics of "wake me up".

There is only one problem, and this is the last slide. The problem is that the binomial distribution assumes the events are uncorrelated. And what would you do if you went to a website and someone declined your transaction? You would try it again and again and again, and these events are correlated, because there's a human there who creates this correlation. So what we have to do is use only the first transaction and ignore the rest of the retried transactions, and then we get back something that is closer to an uncorrelated distribution.

That's it. [inaudible] And of course we're hiring. Any last questions? How much
