
Good morning everyone, can everyone hear me? Good morning, thanks for coming. My name is Jessica, and I'm here today to talk about a large-scale multi-tenant search use case that we implemented on top of SolrCloud, and some of the automation tooling that came out of the design. Here's the agenda of the talk: first we'll talk about the use case and some of the design details, and Shalin's talk right after will include a lot more of the SolrCloud-side technical details that complement this section. Then we'll talk about some of the automation pieces that we implemented to make our lives easier, and then some of the measures that we put in to increase fault tolerance.

First, let's define some terms and concepts in case someone is not familiar. This is a core: a core is a physical Lucene index, and in case someone hasn't seen what Lucene looks like, this is what it looks like on disk. In some of the older writing about Solr, the term "core" has been conflated to mean the entire index or the Solr instance, because Solr used to hold only one core, but in this talk it just means one single on-disk Lucene index. A core lives on a node, and you can notice now that a node holds multiple cores. A node is basically a JVM instance running Solr, listening on one particular port. A node lives on a host; some people might decide to run multiple nodes on a host, but for the simplicity of this diagram it's just one node per host. As query volume or index size goes up, you can add more hosts just like this one, and SolrCloud links them together using ZooKeeper. ZooKeeper, for those who are not familiar, is a distributed coordination service that's often used to share state and configuration. Two terms that are relevant to this talk are the znode, which is the data register in ZooKeeper, and the watch, which is something you can set on a znode so that when the znode changes you get notified. All together, this setup is called a SolrCloud cluster.

In the last slide we talked about the physical topology of SolrCloud; now let's look at some of the conceptual terms. A collection is a logical unit of the search index sharing the same schema and configuration. For example, a store can have a products collection, and the products collection schema can include fields such as product name, SKU, categories, price, and so on, so that the user can search for a product by any of these fields. As the collection grows we can divide it into shards; a shard is basically a logical slice of the collection. As more queries come in, you can furthermore replicate these shards into replicas; a replica is basically an exact copy of its shard. Briefly going back to the physical topology, each one of these replicas is a core, and these cores can be scattered onto different nodes in the SolrCloud cluster; SolrCloud links them together using ZooKeeper into one distributed index.

We talked about how a collection can be divided into shards, so how do we determine, given a document, which shard it should go to? Prior to SolrCloud this was often done on the application side; in SolrCloud there is a concept called the document router, which determines what shard a particular document goes into. By default this is a hash-based router, which takes the document ID and hashes it into a 32-bit integer. If you imagine a collection as a hash ring, with the range of all integers lying on the circumference of a circle, then you can take the document ID hash, find where it falls on the ring, and that tells you which shard it belongs to. For example, if my document ID hash is 55, then it is between 0 and Integer.MAX_VALUE over 2, and so it belongs to shard 3. In version 4.1, Solr introduced the composite ID router, which is a type of hash-based router, but on top of that it allows you to specify a shard key along with the document: it takes the upper 16 bits of the shard key hash and appends the lower 16 bits of the document ID hash to make the final hash. What this does is that all the documents with the same shard key hash into the same pie wedge on this ring, so they are either co-located on the same shard, or at most on neighboring shards if the pie wedge happens to cross a shard boundary.
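To make the routing concrete, here is a rough Python sketch of both ideas. This is not Solr's actual code: Solr's router hashes IDs with MurmurHash3 into the signed 32-bit space, and the four-shard range layout below is just an assumed example.

```python
# Rough sketch of hash-ring routing; zlib.crc32 stands in for Solr's MurmurHash3.
import zlib

def hash32u(s: str) -> int:
    """Stand-in unsigned 32-bit hash of a string."""
    return zlib.crc32(s.encode("utf-8")) & 0xFFFFFFFF

def to_signed(h: int) -> int:
    """Map an unsigned 32-bit value into Java's signed int range."""
    return h - 2**32 if h >= 2**31 else h

# Assumed 4-shard collection: each shard owns one slice of the signed 32-bit ring.
SHARDS = {
    "shard1": (-2**31, -2**30 - 1),
    "shard2": (-2**30, -1),
    "shard3": (0, 2**30 - 1),
    "shard4": (2**30, 2**31 - 1),
}

def shard_for(doc_hash: int) -> str:
    """Default routing: find the shard whose hash range contains the document hash."""
    for name, (lo, hi) in SHARDS.items():
        if lo <= doc_hash <= hi:
            return name
    raise ValueError("hash outside the ring")

def composite_hash(shard_key: str, doc_id: str) -> int:
    """Composite ID idea: the upper 16 bits come from the shard key's hash and the
    lower 16 bits from the document ID's hash, so all documents with the same
    shard key stay in one narrow pie wedge of the ring."""
    combined = (hash32u(shard_key) & 0xFFFF0000) | (hash32u(doc_id) & 0x0000FFFF)
    return to_signed(combined)

print(shard_for(55))                                   # the slide's example: hash 55 lands in shard3
print(shard_for(to_signed(hash32u("doc-99"))))         # plain document ID routing
print(shard_for(composite_hash("tenant1", "doc-99")))  # tenant-scoped routing
```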

By the way, I visualize this as a pie wedge, but just to make sure everyone can see it: if you take a range of integers on the circumference of the circle, take the two endpoints, and draw lines to the middle, that's a pie wedge; I don't want anyone to be confused. Next, the Overseer is a special node in a SolrCloud cluster that's elected to handle cluster state updates and admin commands. And lastly, just a quick mention of Solr Lord, which is our automation driver application that I'll talk more about in the automation section. So that's all the terms and concepts; let's move on to the use case and design.

We were tasked to build a multi-tenant search platform on top of SolrCloud, with each cluster holding roughly a million distinct logical indexes. We don't know the schemas ahead of time, and a tenant can change its schema at any time. What's more challenging is that we don't know which logical index is going to grow to which size. All we can guess is that most of them will be small, maybe up to about a million documents; a few of them will be medium, in the tens of millions; and just a handful will be large, anywhere upwards from that.

Design point number one is to put multiple logical indexes into multiple collections. First, let's consider the two extreme options. Please let me draw your attention to the legend before I bring up the diagram: a square is a collection and a circle is a logical index. The first extreme is to put one logical index per collection, for a million collections. This is the natural thing to do, and if it worked well it would have been the easy thing to do. However, first of all, it's very inefficient to serve many tiny Lucene indexes, and we expect most of the indexes to be small, because each core has a non-trivial overhead. Also, SolrCloud wasn't really designed with this many collections in mind, so we expected that the cluster state might explode and we wouldn't be able to scale well. The other extreme is to put all million logical indexes into one collection, using a dynamic schema so that they can live together. The problem with this is that there is no isolation at all: if an average tenant is co-located with a hot tenant, its performance will be impacted. There would also be too much overlap in the composite ID pie wedges: if a large tenant causes its composite ID pie wedge to be split multiple ways, then another tenant that might be small but happens to overlap that pie wedge will also be split, and will have to fan out its queries to multiple shards. So we take the middle road: we put roughly a thousand logical indexes into each collection, giving us roughly a thousand collections. This gives us a better handle on isolation, and we also have a mitigation strategy called migrate, which I'll talk about a little later, for really large tenants. A thousand collections is much more achievable in SolrCloud in its current state; a million just sounded way too risky for us to do.
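As a concrete illustration of the middle road, here is a hedged sketch of creating one of these bucket collections through the standard Collections API. The collection name, config set name, host, and replication factor are made up for the example.

```python
# Hypothetical sketch: creating one of the ~1,000 "bucket" collections that will
# each hold roughly a thousand logical indexes. The parameters are the standard
# Collections API CREATE parameters; the names are invented.
import requests

SOLR = "http://solr-node1:8983/solr"

resp = requests.get(f"{SOLR}/admin/collections", params={
    "action": "CREATE",
    "name": "tenants_0001",                             # one bucket collection
    "numShards": 1,                                     # start unsharded; split later if needed
    "replicationFactor": 2,
    "collection.configName": "shared_dynamic_config",   # shared config with dynamic fields
    "router.name": "compositeId",                       # enables tenant!user!doc routing
    "wt": "json",
})
resp.raise_for_status()
print(resp.json()["responseHeader"])
```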
But even with a thousand collections, the monolithic cluster state still might not scale, in terms of both znode size and the number of watchers. Not only will the clusterstate znode be larger, there will also be more state changes, due to more collections and more nodes in the cluster. In other words, there will be more state changes triggering more watchers pulling a larger znode content. This is definitely a bottleneck that needs to be fixed, and you'll hear a lot more about the details from Shalin.

Since we're mapping multiple logical indexes into multiple collections, we need to handle the logical-index-to-collection mapping. At first we considered aliases, but having millions of znodes still seemed potentially scary, so we just implemented the mapping as a simple Solr collection managed by our application, supporting assign, lookup, and remove operations. This mapping is highly cacheable, because once we assign a logical index we never want to move it, except during a migrate.
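Here is a rough sketch of what that mapping collection could look like from the application side. The collection name, field names, and caching policy are illustrative rather than our exact implementation; it just shows the assign and cached-lookup operations against a plain Solr collection.

```python
# Hypothetical sketch of the logical-index -> collection mapping, stored as tiny
# documents in a dedicated Solr collection ("index_mappings" is a made-up name).
# Because an assignment almost never changes (only during a migrate), the lookup
# is aggressively cached in the client.
import functools
import requests

SOLR = "http://solr-node1:8983/solr"
MAPPING_COLLECTION = "index_mappings"

def assign(logical_index: str, collection: str) -> None:
    """Record that a logical index lives in a given bucket collection."""
    doc = {"id": logical_index, "collection_s": collection}
    requests.post(
        f"{SOLR}/{MAPPING_COLLECTION}/update?commit=true",
        json=[doc],
    ).raise_for_status()

@functools.lru_cache(maxsize=100_000)
def collection_for(logical_index: str) -> str:
    """Cached lookup; invalidate (cache_clear) only when a migrate moves a tenant."""
    resp = requests.get(
        f"{SOLR}/{MAPPING_COLLECTION}/select",
        params={"q": f"id:{logical_index}", "fl": "collection_s", "wt": "json"},
    )
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    if not docs:
        raise KeyError(f"no mapping for {logical_index}")
    return docs[0]["collection_s"]
```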

Design point number two is custom sharding. Since our tenant sizes vary, we need to shard the collections differently, and only when needed. Since most of our tenants are small, there is no need to shard most of the collections; but for a collection that does get big because of a large tenant, when we need to shard it, we need to split it at arbitrary ranges instead of always down the middle, as SolrCloud does by default. Let's look at an example. If this is the hash ring, look at the shard1 portion: you can see shard1 is getting large because the red tenant there has about 40 million documents, whereas the other tenants are still pretty small. If we issue a split, by default it's going to split the hash range down the middle, here. That's not very useful, right? The upper portion still has about 40 million documents, and now the lower portion is trivially small. What we really want to do is split right here, so that you have roughly equal numbers of documents on each side. Also, when a tenant grows so large that it's causing its composite ID pie wedge to be split multiple ways, and you can see that the blue and yellow tenants are now also having to fan out their queries, we want to be able to just migrate the super-large tenant out to its own collection so that we can completely isolate it. This is the mitigation strategy I was referring to earlier: it lets us scale the large tenant completely separately, even after the fact.

Design point number three is per-tenant and per-user routing. We briefly mentioned composite ID routing in the terms and concepts section, and traditionally this routing is just two-level. So if you imagine this collection with these tenant pie wedges, and you want to query just Jessica's documents, then you only have to hit shard1, because Jessica's pie wedge is entirely contained in shard1. But in our multi-tenant collections we want the flexibility of doing both per-tenant and per-user queries. So here, for example, if I want to query the documents of tenant1, which shards do I have to hit? That's right, both shard1 and shard2, because tenant1's pie wedge overlaps both of those shards. But what if I just want to query tenant1's user Lancelot's documents? In that case I only have to hit shard2, because Lancelot's pie wedge is a sub-range of tenant1's and is contained wholly in shard2. That way we can issue our queries more efficiently when we know exactly what we want; there's a small routing sketch below.
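Here is that routing sketch, assuming three-level composite IDs of the form tenant!user!doc and the standard _route_ request parameter (the three-level form needs a recent enough Solr 4.x); collection, field, and host names are made up.

```python
# A sketch of per-tenant vs. per-user routing. Document IDs carry the routing
# prefixes ("tenant1!lancelot!doc42"); at query time the _route_ parameter tells
# SolrCloud which shard(s) own that prefix so it can avoid a full fan-out.
import requests

SOLR = "http://solr-node1:8983/solr"
COLLECTION = "tenants_0001"

# Index a document under tenant1 / user lancelot (three-level composite ID).
requests.post(
    f"{SOLR}/{COLLECTION}/update?commit=true",
    json=[{"id": "tenant1!lancelot!doc42", "body_t": "grail location notes"}],
).raise_for_status()

# Per-tenant query: may fan out to every shard that tenant1's wedge overlaps.
per_tenant = requests.get(f"{SOLR}/{COLLECTION}/select", params={
    "q": "body_t:grail", "_route_": "tenant1!", "wt": "json",
}).json()

# Per-user query: only hits the shard(s) containing lancelot's sub-wedge.
per_user = requests.get(f"{SOLR}/{COLLECTION}/select", params={
    "q": "body_t:grail", "_route_": "tenant1!lancelot!", "wt": "json",
}).json()

print(per_tenant["response"]["numFound"], per_user["response"]["numFound"])
```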
So that concludes the major points of the design, and Shalin will give you a lot more detail on the SolrCloud side. Next, let's move to the automation section and the tooling that we implemented to make operations less cumbersome. The main goal of automation, to me, is: don't wake me up at night if it's not needed, and automate as much as possible. It is something that we truly believe in, because we have a small team and many large clusters to take care of; the fewer operational things we have to do manually, the more time we have for building things. So, meet Solr Lord. Solr Lord issues admin Collections API commands when certain conditions, configured as alarms, are triggered. Alarm configurations are small, so we store them in ZooKeeper; that way we can adjust the thresholds and have them picked up right away. These alarm conditions are evaluated based on aggregated stats collected by our custom CloudSolrServer wrapper, which collects the stats, puts them in buckets, periodically filters out the unimportant ones, and sends the important ones along so Solr Lord can figure out whether it needs to trigger alarms.
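As an illustration of the znode-plus-watch pattern this relies on, here is a hypothetical sketch using the kazoo ZooKeeper client; the znode path and the threshold fields are invented, not Solr Lord's real format.

```python
# Made-up sketch of small alarm configurations living in ZooKeeper and being
# picked up immediately via a watch, so threshold tweaks need no restart.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

ALARM_PATH = "/solrlord/alarms/add_replica"

# Seed an initial threshold config (normally an operator would do this).
config = {"queries_per_sec_per_replica": 500, "max_replicas_per_shard": 6}
if not zk.exists(ALARM_PATH):
    zk.create(ALARM_PATH, json.dumps(config).encode("utf-8"), makepath=True)

current = {}

@zk.DataWatch(ALARM_PATH)
def on_config_change(data, stat):
    """Fires once at registration and again whenever the znode changes."""
    global current
    if data:
        current = json.loads(data.decode("utf-8"))
        print("alarm thresholds now:", current, "znode version:", stat.version)
```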

Since Solr Lord might issue multiple commands into the cluster at one time, and since we still need to be able to create new collections in our live cluster while those commands are being processed, we need the Overseer's processing of these collection commands to be asynchronous and parallel; this is a new feature that we depend on. So let's look at some of the alarms. The first one is a simple add-replica alarm, which is triggered when a particular shard is getting too many queries per replica. In this case, Solr Lord simply issues an ADDREPLICA to the SolrCloud cluster, and that's that. One detail here is that our configuration only allows Solr Lord to add up to a limited number of replicas, because we don't want to give it unlimited range to add as many as it wants; if that limit is ever hit, we do want a pair of human eyes to go in and investigate.

Next is move replica. What if a node's overall stats are high, but due to multiple cores from different shards or different collections being hot at the same time, rather than one replica of a single shard triggering the add-replica alarm like before? In this case, what we want to do is move one of those cores off to a more idle node so that we can even out the CPU utilization. Since SolrCloud doesn't have a move primitive, we accomplish this by first issuing an ADDREPLICA, and then, once that replica has recovered all the documents and come up alive, Solr Lord can delete the original one. So that's move; the add-then-delete flow is sketched below.

The split-shard alarm is triggered when a particular shard is getting too large in terms of number of documents, or is getting too many updates. The first thing Solr Lord has to do when a split-shard alarm is triggered is figure out the proper split point, because as we've seen earlier, in a multi-tenant collection splitting down the middle of the hash range is not always the right thing to do. If a shard is being split due to the number of documents, Solr Lord does a binary search on the hash range to figure out a point at which there are roughly equal numbers of documents on each side. If it's triggered due to update rate, it will only take action if the aggregated stats actually point to a hot shard key, in which case it will try to split down the middle of that shard key's range. To do the split, Solr Lord issues a SPLITSHARD command, but this is not enough, because the resulting sub-shards are left on the same node as the leader; so Solr Lord also needs to move one of the resulting shards onto another node, by doing what the move-replica alarm just did, which is an add and then a delete.

So that's split shard; next is replace node. Sometimes nodes fail, due to a bad disk or really any number of reasons, and when that happens we need to be able to replace the node. To do that, Solr Lord first figures out which replicas were on that node by consulting the cluster state for the node that just went down. Once Solr Lord has figured out what replicas were there, it calls ADDREPLICA to add them all to the new replacement node. Since we always run multiple replicas to ensure availability in the cluster, the new node can recover the replicas by talking to one of the remaining live replicas in the cluster. Once they've become live, Solr Lord can clean up the state of the bad node by issuing DELETEREPLICAs. And that's replace.
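Here is a hedged sketch of that add-then-delete move flow, driven through the async Collections API (ADDREPLICA with an async ID, REQUESTSTATUS polling, then DELETEREPLICA). Names are made up, and the exact REQUESTSTATUS response shape can differ between versions.

```python
# Rough sketch of the "move" primitive: since SolrCloud has no MOVE command,
# add a replica on the target node, wait for the async command to finish, then
# delete the old one. Uses the standard ADDREPLICA / REQUESTSTATUS / DELETEREPLICA
# Collections API actions.
import time
import requests

SOLR = "http://solr-node1:8983/solr/admin/collections"

def collections_api(**params):
    params["wt"] = "json"
    resp = requests.get(SOLR, params=params)
    resp.raise_for_status()
    return resp.json()

def wait_for_async(request_id: str, timeout_s: int = 600) -> None:
    """Poll REQUESTSTATUS until the async collection command completes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = collections_api(action="REQUESTSTATUS", requestid=request_id)
        state = status.get("status", {}).get("state")
        if state == "completed":
            return
        if state == "failed":
            raise RuntimeError(f"async request {request_id} failed: {status}")
        time.sleep(5)
    raise TimeoutError(f"async request {request_id} did not finish in time")

def move_replica(collection, shard, old_replica, target_node):
    req_id = f"move-{collection}-{shard}-{int(time.time())}"
    collections_api(action="ADDREPLICA", collection=collection, shard=shard,
                    node=target_node, **{"async": req_id})
    wait_for_async(req_id)
    # The ADDREPLICA command has finished; a real driver would also confirm the
    # new replica is active in the cluster state before removing the old one.
    collections_api(action="DELETEREPLICA", collection=collection, shard=shard,
                    replica=old_replica)

move_replica("tenants_0001", "shard1", "core_node3", "solr-node7:8983_solr")
```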
Talking about node failures is a good segue into the last section of our talk, which is fault tolerance. Failures are a fact of life: as the number of nodes that we take care of grows, failures become an expected thing, and so we need to build fault tolerance into our design. First of all is rack awareness. Solr Lord is aware of our data center topology, so when it places replicas it never places more than one replica per rack; that way, if a single rack goes down, it never takes out more than one replica of any shard of any collection. You can manage the replica placement yourself as well when issuing Collections API commands: for the CREATE action, use the createNodeSet parameter, and for the ADDREPLICA action, use the node parameter (a small example follows below).
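A small sketch of doing that placement by hand with those two parameters; the rack map, node names, and collection name are invented for illustration.

```python
# Sketch of rack-aware placement using createNodeSet on CREATE and node on
# ADDREPLICA. Solr Lord derives the rack map from our data center topology;
# here it is hard-coded.
import requests

SOLR = "http://solr-node1:8983/solr/admin/collections"

RACKS = {
    "rackA": ["solr-node1:8983_solr", "solr-node2:8983_solr"],
    "rackB": ["solr-node3:8983_solr", "solr-node4:8983_solr"],
}

# Pick one node per rack so no rack ever holds two replicas of the same shard.
one_per_rack = [nodes[0] for nodes in RACKS.values()]

requests.get(SOLR, params={
    "action": "CREATE", "name": "tenants_0002",
    "numShards": 1, "replicationFactor": 2,
    "collection.configName": "shared_dynamic_config",
    "createNodeSet": ",".join(one_per_rack),   # restrict placement to these nodes
    "wt": "json",
}).raise_for_status()

# Later, add a third replica explicitly on a node in a rack not yet used.
requests.get(SOLR, params={
    "action": "ADDREPLICA", "collection": "tenants_0002", "shard": "shard1",
    "node": "solr-node5:8983_solr", "wt": "json",
}).raise_for_status()
```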

There's also now a JIRA tracking doing this at the SolrCloud level, where the administrator just has to communicate the topology to SolrCloud, and SolrCloud will do the right placement for you.

We've seen that a node can fail, a host can fail, and a rack can fail, so of course there's the possibility that a whole data center can fail. At a previous company I worked at, there were lots and lots of stories of data center failures; one of my favorites is when a hunter accidentally shot the fiber and the data center just dropped off the map. So we need cross-data-center replication to make sure that we have availability. To do this, when an update comes in, our SolrJ wrapper also puts it onto a local queue; this queue is then asynchronously replicated to the remote DC and replayed onto our buddy SolrCloud. In this model an update doesn't have to do a cross-DC write, and if the remote DC is down, updates can keep going, because they just stay on the queue until the other side comes back up. Essentially, our buddy cluster is our HA spare: when the primary data center goes down, we can route our traffic to the buddy. Similarly, when an update comes in there, it is placed onto DC2's local queue, which tries to replicate the update back to DC1; obviously, while DC1 is down it can't do that, so the update stays on DC2's queue until DC1 comes back up, and then it can be replicated back and replayed onto the primary SolrCloud. A detail here is that this replication flow might cause out-of-order updates, and we need to make sure that out-of-order updates do not result in inconsistencies between the clusters. To take care of this situation, we use the new DocBasedVersionConstraintsProcessorFactory feature, so that SolrCloud can be configured to accept our external version and discard older updates (there's a small sketch of this below). There's also now a JIRA tracking cross-data-center replication in SolrCloud itself, and it would be really great to have that out of the box.

In addition to the within-SolrCloud replication and the cross-DC replication, we still do periodic backups in case of user error. Our backup application takes periodic snapshots of all the cores in the cluster, figures out which segment files it doesn't already have from a previous snapshot, and only downloads those. It is also smart about throttling the download bandwidth so it doesn't overwhelm the data center network. It also has a one-button restore, to restore a single snapshot to either the same cluster or a completely different cluster, with a cluster state rewrite to map the cores onto different nodes. This is just a tool that we have in our back pocket for whenever we need it.
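Here is the sketch of the external-version protection mentioned above. It assumes an update processor chain configured with DocBasedVersionConstraintsProcessorFactory and a version field named external_version_l (an assumed name); with that in place, a replayed older update is dropped instead of overwriting newer data.

```python
# Sketch of sending updates that carry an application-side version so a
# configured DocBasedVersionConstraintsProcessorFactory can discard stale
# replays arriving from the cross-DC queue. Field and collection names are
# illustrative.
import requests

SOLR = "http://solr-node1:8983/solr"
COLLECTION = "tenants_0001"

def upsert(doc_id: str, body: str, external_version: int) -> None:
    doc = {"id": doc_id, "body_t": body, "external_version_l": external_version}
    requests.post(
        f"{SOLR}/{COLLECTION}/update?commit=true", json=[doc]
    ).raise_for_status()

upsert("tenant1!lancelot!doc42", "newer text", external_version=7)
# A delayed replay of an older update arrives later from the remote DC's queue;
# with the version-constraints processor in place it is ignored, not applied.
upsert("tenant1!lancelot!doc42", "older text", external_version=3)
```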
Lastly, I want to talk about a curious case of inconsistency that we found in our load tests; it's a pretty fun story. Because of the heavy load tests, all the chatter between the Solr nodes was triggering our kernel's SYN flood protection, so all the requests on the Solr ports started being throttled, while the requests on the ZooKeeper port remained fine. This is problematic, because in the update flow, when the leader tried to forward the update to the follower, that failed, and so the leader queued an in-memory "request recovery" to remind the follower to recover later. However, when that fired, it also failed. At that point the leader gives up, because the leader thinks: well, if I couldn't talk to the follower for so long, it must mean the follower is down, so when the follower comes back up it will just request recovery and get the updates. But the follower was actually up fine all this time, happily connected to ZooKeeper, so it never learned that it needed to recover at all. This led to SOLR-5495, which essentially tracks on ZooKeeper which followers need recovery. Also, as it was, the SolrCloud cluster returned success to the client even though the update had only been successfully applied on the leader. This is dangerous, because the leader could permanently go down before the followers have a chance to recover that update, and then it can be lost. That led to SOLR-5468, which essentially allows the user to specify a min_rf, a minimum replication factor, with their updates, so that if the update is successfully applied on fewer replicas than min_rf, the user can note this and decide to retry the update to achieve a higher replication factor.
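Here is a rough sketch of using min_rf from the client side; the retry policy is illustrative, and exactly where the achieved replication factor ("rf") shows up in the response can vary by version.

```python
# Sketch of the min_rf feature from SOLR-5468: send the minimum acceptable
# replication factor with the update, compare it to the achieved factor reported
# back, and retry if it fell short.
import time
import requests

SOLR = "http://solr-node1:8983/solr"
COLLECTION = "tenants_0001"

def update_with_min_rf(doc: dict, min_rf: int = 2, retries: int = 3) -> int:
    for attempt in range(retries):
        resp = requests.post(
            f"{SOLR}/{COLLECTION}/update",
            params={"commit": "true", "min_rf": min_rf, "wt": "json"},
            json=[doc],
        )
        resp.raise_for_status()
        header = resp.json().get("responseHeader", {})
        achieved = header.get("rf", 1)          # achieved replication factor
        if achieved >= min_rf:
            return achieved
        # The leader accepted the update but not enough replicas confirmed it;
        # wait for the cluster to heal and resend (re-sending the same id is safe).
        time.sleep(2 ** attempt)
    raise RuntimeError(f"could not reach replication factor {min_rf}")

update_with_min_rf({"id": "tenant1!lancelot!doc43", "body_t": "spam and eggs"})
```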

Together, this makes Solr more robust against inconsistencies, but there's still a lot more work to be done, and we're really excited to see it happening in the community. So lastly, before I hand this talk over to Shalin, I just want to point out that we're hiring. If you're passionate about building and managing robust, large-scale distributed systems, and you have a particular interest in Solr or distributed search, and I'm sure you all do, please talk to us after the talk, or email us at solrcloud-jobs@group.apple.com and a recruiter will get right back to you. Thank you.
