Thanks for having me. It’s really nice to get to come here and talk about this, as we haven’t actually talked about Guacamole before, so this is a good thing for us. I’m Tim O’Donnell, and I’m here with Ryan Williams, who’s back there somewhere. We’re at Jeff Hammerbacher’s lab at Mount Sinai, which is a computational lab. We focus on bioinformatics for cancer immunotherapy; after a lot of looking around at different problems, that’s the space we’ve landed in. All of our software is developed on GitHub, so you can check out everything we’re working on.

I wanted to front-load this with a slide on what Guacamole is and what we have right now, and then I’m going to go back, give the background, and introduce the whole thing. What we have is a bunch of abstractions for working with aligned short reads on Apache Spark, and using those abstractions we’ve implemented two variant callers, which are very much preliminary. We have a somatic joint caller that makes use of multiple DNA and RNA samples from the same patient, and we have an assembly caller that does HaplotypeCaller-style reassembly. If none of this makes any sense, don’t worry, we’re going to go back over it; I just wanted to front-load it. It takes about 30 minutes to run on our cluster on 250x data, and it scales up with data really easily, so adding more samples doesn’t really affect the runtime that much. I should mention that our callers are still primitive, so we’re not using them in our current science or clinical projects. This is really about development; we’re not ready to say, you know, use this as your variant caller.

The way I thought I’d go through this is: we’ll do some background on sequencing, then I’ll tell you about the abstractions we implemented, the infrastructure, then I’ll go through one of our variant callers and how it works, we’ll look at some preliminary results running on some data, and then I’ll
conclude.

As people probably know, cancer arises almost always, though not always, from DNA mutations in cells. Every tissue accumulates mutations, and then at some point you have an oncogenic event and now uncontrolled proliferation is happening. By the time we actually have a clinical cancer case that’s apparent and we do sequencing on a tumor, we’re sequencing a snapshot of a whole bunch of different clones that are developing and competing. The stuff that happened back here we’ll see in every cell, but the stuff that happened later we’ll only see in a minority.

There are a lot of reasons to do cancer sequencing. I think for Illumina, over half of all the sequencers they put out by now get used for cancer sequencing. These are just some things that came to mind quickly. You can understand the biology and use that to find drug targets; that’s the classical reason to do a lot of sequencing. The less biologically inspired, more data-driven version of that is identifying subtypes, where you don’t know why they’re different but you observe that people who fall into one subtype tend to respond similarly to treatment. Increasingly in our world, cancer immunotherapy, understanding the non-cancer cells and what they’re doing, especially with RNA sequencing, is important. You can also do things like build a phylogeny and trace how a metastasis spread. And there are now beginning to be some clinical applications, although I would say genome sequencing in a clinical context is still experimental and not quite there.

There’s been a lot of sequencing done. This is the TCGA project, which sequenced over 10,000 patients, over two petabytes of data. This is as of the last year, actually, and it was funded by the 2009 Recovery Act. So just from an
immunotherapy standpoint, one application I wanted to highlight is a phase one trial that we’re participating in, where we’re actually making cancer vaccines: you sequence the tumor and the normal and then use that to design a personalized vaccine.

Now I just wanted to get a feel for the group, to decide how much background to give. How many people are coming to this from the Spark context versus the sequencing side? How many people work in a sequencing core? OK. How many people use Spark? How many people know about DNA and RNA? OK, cool. I think we’ll maybe go fairly quickly through this section, then.

A cell has a nucleus, the nucleus has chromosomes, and a big component of the chromosomes is DNA. DNA is a sequence of four bases, and it can have small variations. You can also have large variations, which I’m not going to talk about, but they’re just as important. Small variations are things like substitutions, insertions, and deletions. Often this is studied in what’s called germline sequencing, where you’re looking at how all the cells in one person’s body differ from another person’s. We focus on cancer, so that’s looking at how someone’s tumor differs from the rest of their cells.

Next-generation sequencing is built on a vast number of different methods that have come out in the last 15 years; this is just a small sample of the ones you might come into contact with. By far the most common platform right now is Illumina, but there’s also some up-and-coming stuff. Everything I’m going to talk about today has only been developed and tested on Illumina data.

This is just an overview of the Illumina method. I’m not going to go into any detail on this, but I want to point out that what’s happening is you really just have a really expensive camera that’s taking photos after you wash bases over a chip that has your DNA hybridized to little
probes. That part happens in the wet lab. The first computational step is the base-calling step, where we need to go from these photos to deciding which base was read. This actually happens on the device, so it usually isn’t talked about much anymore.

Then there is read mapping, which takes the reads that come out. The output of an Illumina sequencer is several hundred million to several billion reads, where a read is usually a sequence of about 100 to 250 of the four bases A, C, T, and G. The first step in the methodology we’re talking about here is alignment, where we take each of these little gray bars and find where in the genome it best matches. The genome here is the human reference genome. There’s nothing special about it; it happens to just be a mixture of 10 people from Buffalo, New York, who had type 1 diabetes, I think. It’s just an arbitrary genome that was assembled, so we know the whole sequence of every chromosome, more or less, from start to end. We use an aligner, usually BWA, to perform this alignment.

Then comes variant calling. We focus on somatic variants, so for cancer, variant calling is the process of looking at the normal DNA reads, in what we’ll call a pileup, this vertical bar centered
at a particular position, and then at the tumor reads, and deciding whether there is probably a variant in the tumor that’s not also in the normal.

Yeah, so a variant is a mutation: a difference between the person’s cancer cells at that position and their normal cells. Here the reference genome has a G, and their normal DNA also has a G at that position. Normal just means not the tumor. And then their tumor has a whole bunch of C’s over here, so this would probably be a good somatic variant.

Yeah, so these reads are about 99 percent accurate, so there are errors, and there’s also an estimate of the error rate. A read, individually, is essentially a cluster in that photograph the sequencer is taking, where over time each dot is one base: as you go forward in time, that same position on the Illumina chip gives you a whole bunch of dots, and that’s the sequence. Maybe a better explanation is that it’s a contiguous region of the genome that we sequenced, possibly with errors.

From that diagram you would think somatic variant calling is pretty easy: we just need a program that looks at a bunch of C’s and decides it’s not a G. But surprisingly, at least for me coming into this, the agreement in somatic variant calling across tools is amazingly poor. If you look at the literature, the amount of disagreement is really amazing. A group in Denmark recently published this, but there are a million studies like it. They took a bunch of variant callers, and for some sample they had, they looked at what fraction of variants called by one variant caller were also called by another variant caller. Many of these numbers are below thirty percent of variants in agreement. This is really tough. For any conclusion you’re going to draw from your sequencing data, if you put in seventy
percent different input, you’re likely going to end up with a different conclusion. We’ve actually had cases where very small changes, even just small changes to a variant caller’s parameters, give you a different scientific result from the data. So this is worrying.

Yeah. So there are a few things going on. One issue is, right, it could be a sequencing error, so it could be an artifact where there’s no actual DNA in the sample with these bases at this position. Another issue, which maybe is what you’re getting at, is that a tumor is actually a mixture of clones, and some may have a C there and some may not. So there’s this question of what we actually mean by a somatic variant. Is that what you’re getting at? Yeah, so it’s really not well formalized what we mean by a somatic variant. You can do a back-of-the-envelope calculation showing that in any macroscopic-sized tumor, at any position, some cell has a mutation, and of course we can’t detect that with bulk sequencing. So what we really mean by somatic variant calling is just identifying variants that we can detect, such that if we go back and look using some validation technique, like amplifying the region around it, we’d detect it.
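
The "pretty easy" version of somatic calling described above, a program that looks at the tumor and normal pileups at one position and decides whether the tumor has a base the normal lacks, can be sketched roughly as follows. This is a deliberately naive illustration, not how Guacamole or any published caller works; the function name and both thresholds are made up for the example, and it ignores base qualities, strand, and every artifact the talk goes on to mention.

```python
from collections import Counter


def naive_somatic_call(reference_base, normal_bases, tumor_bases,
                       min_tumor_fraction=0.2, max_normal_fraction=0.02):
    """Return the alternate base if the tumor pileup supports it and the
    normal pileup does not; otherwise return None.

    Both thresholds are arbitrary illustrative values (assumptions).
    """
    # Count the non-reference bases observed in the tumor pileup.
    tumor_alt_counts = Counter(b for b in tumor_bases if b != reference_base)
    if not tumor_alt_counts or not normal_bases:
        return None

    # Take the most frequently observed non-reference base as the candidate.
    alt, alt_count = tumor_alt_counts.most_common(1)[0]
    tumor_fraction = alt_count / len(tumor_bases)
    normal_fraction = normal_bases.count(alt) / len(normal_bases)

    if tumor_fraction >= min_tumor_fraction and normal_fraction <= max_normal_fraction:
        return alt
    return None


# Toy pileups at one position: reference G, normal all G, tumor partly C.
call = naive_somatic_call("G", "GGGGGGGGGG", "GGGCCCCGGG")
```

With these toy pileups the call is `C`. If the normal pileup also carried the C at a comparable fraction, the function would return `None`, which is the germline-versus-somatic distinction in its crudest form.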

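The agreement numbers quoted earlier, the fraction of variants called by one caller that are also called by another, are simple to compute once each caller's output is reduced to a set of sites. Below is a minimal sketch under that assumption; the caller names and the toy variants are invented for illustration, and a real comparison would also have to normalize how indels are represented.

```python
from itertools import permutations


def fraction_also_called(calls_a, calls_b):
    """Fraction of caller A's variants that caller B also called."""
    calls_a, calls_b = set(calls_a), set(calls_b)
    if not calls_a:
        return 0.0
    return len(calls_a & calls_b) / len(calls_a)


def pairwise_agreement(call_sets):
    """Map each ordered pair of caller names to the fraction above."""
    return {
        (a, b): fraction_also_called(call_sets[a], call_sets[b])
        for a, b in permutations(call_sets, 2)
    }


# Made-up call sets keyed by (chromosome, position, ref, alt).
toy_calls = {
    "caller_x": {
        ("chr1", 100, "G", "C"),
        ("chr1", 250, "A", "C"),
        ("chr2", 40, "C", "T"),
        ("chr3", 7, "T", "G"),
    },
    "caller_y": {
        ("chr1", 100, "G", "C"),
        ("chr2", 99, "G", "A"),
    },
}

agreement = pairwise_agreement(toy_calls)
```

Note that the measure is asymmetric: in this toy data a quarter of `caller_x`'s calls are also made by `caller_y`, while half of `caller_y`'s calls are also made by `caller_x`.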
OK, so now I’m going to switch gears and talk about some of the infrastructure we built in Guacamole.

Apache Spark is a generalization of MapReduce. It’s a parallel computing framework that is way more expressive than MapReduce and much more enjoyable to program in; you can write much shorter programs that nicely capture what you want. This is an example program that I took from a blog post. You can see what the programming looks like: it’s really very straightforward, and you would be surprised, looking at this program, that it runs on a cluster and can work with a lot of data.

There is a fair amount of existing work in this area. GATK is a toolkit made by the Broad; it’s the Genome Analysis Toolkit. This is the standard set of tools that most labs use to call both germline variants, with the HaplotypeCaller, and somatic variants, with MuTect. They actually have their own built-in MapReduce framework for some parallelization. The issue is that they can’t take advantage of data locality. If you’re familiar with Hadoop, I’m thinking of its integration with HDFS, which knows where the data is on your cluster, so that when you send out your map function it will tend to run on the node that has the data. So GATK is sort of a step in that direction. Then the ADAM project, out of Berkeley, has worked on a lot of similar ideas, and there are a few others.

Basically what we thought was, maybe we can take what the GATK is doing and lift it into a Spark-type framework. What we ended up making was distributed implementations of the functionality you would use to write a typical variant caller. We use that to build pileups, and then we call variants.

We mentioned pileups before. A pileup is this highlighted vertical region, where we have the reference sequence, then we have
our reads aligned to the reference, and a pileup is, centered at one genomic position, which bases you read there.

A typical pileup-based analysis looks like this; I’m going to go through these steps in detail in a second. We partition the genome into genomic partitions, or tasks, and then we partition the reads according to that partitioning. This is our way of deciding how we’re going to split up the data. Then each partition takes its reads and builds pileups at each site, where it calls a user-defined function at that site. Whatever that user-defined function outputs, we collect all those results in your application, or you could write out a result.

The first thing you have to be aware of is that coverage is not uniform. By coverage I mean the number of reads mapping to a position in the genome. Often we look at plots like this; this is just from some exome sequencing run we had, where there’s a pretty clear, nice hump right around, probably, the desired read depth, and there’s a long tail that goes out. One thing that caught us is that this long tail is actually really, really long: we’ll find positions in the genome with a hundred thousand reads overlapping. So it’s critical to be aware of that.

Now I’ll go through the steps. Step one is partitioning the genome. The idea here is that we want to break the genome into contiguous intervals where each interval has about the same number of reads. In this diagram we have our chromosomes, and this is our depth. Where the depth is high we have fairly small partitions, and where it’s lower we have larger partitions; the idea is that they’ll all have about the same number of reads assigned to them. As I mentioned, the extremely high-depth regions we just throw
out; we don’t call variants there. That usually ends up being something like 10 to 20 kilobases.

Then step two is partitioning the reads. This is an all-to-all shuffle of all the reads, based on the genomic partitioning from step one. Here, zoomed in at a much finer level, we have two partitions, and we have reads that are in one partition and not the other, and one that spans both. Here read number two is going to get duplicated and sent to both partitions it overlaps. So we increase our data size a little bit; it’s typically small, like one percent. What this gives us is regions of the genome together with all the reads that overlap each region.

These partitions are now actual Spark partitions. Each Spark partition takes its reads, which are sorted and can also have spilled to disk, and creates pileups at each position that you asked to run this function over. The function it’s calling is a user-defined function, so this is a framework: you give it a function and it calls it, passing it the position in the genome and what the pileup was at that position. So here you have the bases we read at that position, and also quality scores and other information.

A couple of implementation notes. The repartitioning is done with a very well named Spark method, repartitionAndSortWithinPartitions, that does exactly what we want: it assigns each record to its partition by key and also sorts by position. I also wanted to mention that we never materialize pileups into RDDs. This might actually be changing right now, but historically we’ve been very careful to avoid that. These pileups are very large because they duplicate reads: if you do it naively, taking every position in the genome and then all the reads that overlap it, you can increase your data by a
factor of a hundred or a hundred fifty, that is, the length of your reads. Our initial tests with that kind of approach did not work at all, and it turned out to be very important to just keep the reads: you have RDDs of reads, but the actual pileup is just done in memory, and it’s also done with mutation, in the programming-language sense. Don’t worry about that if you’re not familiar with RDDs; that was just a little bit of detail.

There are a few other features that are part of the Guacamole framework. We can do sliding windows, where for each site you might say, run this function on all the reads mapping within a thousand bases of a position in the genome. More sophisticated variant callers that do things like local assembly want to work with that rather than the pileup. Also, we’re not really working with just reads here; we’ve generalized reads to this idea of regions, which is anything that maps to the genome. So if you have a huge set of variants, like a giant VCF file, you can actually use some of the same methodology to work with it.

OK, so now we’re going to go through a variant caller that we wrote using this framework. Any questions before that? Yeah, good question. We don’t do the alignment; the alignment comes to us from an aligner that was run previously, outside of our project. Having an aligner as part of this would be nice. It’s a significant undertaking, but it’s something we’ve talked about doing. It’s not clear that you’d actually want to, though, since there are easier ways to parallelize alignment.

So we actually run directly on the BAM files. We use Hadoop-BAM, and that allows us to, well, you just stick a BAM on
HDFS, it gets chopped up, and we run directly on that format. The BAM file usually already has an index, but no, we don’t do any additional indexing, and we don’t typically use the index that’s already there, because we’re usually using the whole file. If we run on a very small region, we’ll use the index to load just those reads, but in the common case we just load all the reads in the BAM.

Yeah, that’s a tough question, actually. We’re doing a lot of things that are actually inspired by ADAM. We’re less focused on file formats, and we’re a smaller project, pretty focused on making a variant caller we can use in our lab. But there’s really a lot there, and we’re learning a lot.

Oh yeah, sure. It’s a very fair question. There are complicated sequencing artifacts that people are still figuring out, so figuring out which things are sequencing artifacts that look an awful lot like a true variant is definitely part of it. Then there’s something specific to cancer: the subclonality, and the fact that you often have structural variation and copy number variants overlapping mutations in cancer, and those are also happening at their own level of cellularity. You also have things like normal contamination. But maybe people have other things to add to that.

That’s a really good question. We actually did do one case where we were getting frustrated with the lack of agreement among the variant callers we were running. We took one patient, an ovarian cancer patient, and did very high depth exome sequencing, 200x, and we did a technical replicate using the same sample. Then we compared the variant calls. I don’t have the slide in this presentation, but I can tell you the overlap is surprisingly low. I think it’s probably forty or fifty percent in common,
but that’s even when just running MuTect on two runs of the same sample. Actually, that story gets a lot better once you only look at variants with higher coverage that are also in the capture kit regions you intended to capture. One possible source of noise is trying to call variants outside of your capture region: even if it looks like you have coverage, you’re actually exposing yourself to a lot of bias there.

Yeah, good question. I haven’t seen a lot done with that. I think there will still be substantial disagreement as you increase the depth, but depth will definitely help. I wish I could say more than that, but it’s really not clear that just throwing more read depth at it will actually get you that much better agreement.

[Audience question, partly inaudible]
Got it. Right now, typically we’re just working with bulk tumors, and we don’t do any sorting first, although sorting would definitely increase our purity if we had a great way to sort out the cancer cells. [inaudible] We haven’t gotten that data, but that would be exciting.

Yeah, if you have the right parameters to cluster by, the right distance, but it’s kind of unclear what you would use. One thing you could do is, after you have your variants, maybe look at doing something like that. But you just don’t know where any read is coming from, which cell it’s coming from. The trouble is you’re doing bulk sequencing and you’re just getting a kind of random scoop of reads. [inaudible] Gotcha.

Yeah, so although the cancer does have, you know, sometimes as many as 40,000 mutations, still the vast majority of the genome is the same as the normal, and so when you have a read, usually you have zero or one mutations on it. So then it’s a question of whether you could do something like clustering reads, but you wouldn’t know which is an error. There’s also the issue of germline mutations: the vast majority of disagreements between the reference and a tumor are actually germline variants. I’m sorry, a germline variant is something you inherited from your parents that every cell of the body has. So you have millions of germline variants and at most a couple tens of thousands of somatic mutations. [inaudible]

So we wrote this framework, and then we thought about
what would be a good application, where we could do something better than what existed and take advantage of being able to compute on more data. One thing that came to mind was working with multiple samples from the same patient. In many of our projects, and this is one that
I happened to work on, which was a reanalysis of this Australian ovarian cancer study, where for all these patients (each one of these groups is a patient and the bars are the samples) we have multiple samples per patient. There are a lot of contexts like this, and even if you only have one sample, you might have both RNA and DNA. Doing joint calling over many samples at one position is really common for germline calling. There have been some attempts to do that for somatic, but not a lot that has really gotten into a package that’s easy to run. So we thought we’d start with this as a test case.

We wrote this thing called the somatic joint caller. It calls SNVs and small indels on any number of tumor and normal samples from the same patient, and it will also consider RNA sequencing if you have it. Overall the design is pretty similar to MuTect. I’ll go pretty quickly through the likelihood setup for it, but it comes in handy on the next slide.

For each sample we’re going to define a reference mixture and a variant mixture, where we’re taking this parameter as the variant allele frequency, and we’re just going to do something really simple to calculate it: the number of variant reads, so the number of C’s in that previous example, divided by the number of total reads. But we floor it at point
oh five, because that’s what we think our lower limit of detection is; it’s actually probably higher than that. Then we have a mixture, so we can say, for any allele, like C or G, what fraction of the alleles in the sample do you think are that allele. It’s just the variant allele frequency if it’s the variant sequence, and if it’s the reference sequence, it’s one minus the variant allele frequency. I just put that up there to show this formula.

So we can now compute a likelihood. Again, this is all being done individually for each sample. We compute the likelihood of the data, which is the pileup of reads we got, given a particular mixture. A particular mixture might be the reference mixture, which has a hundred percent of the alleles in the sample being reference, or it could be the variant mixture, where there’s, say, twenty percent variant and eighty percent reference. Then we make this very big assumption, which is completely not true but is a useful assumption: that given the mixture, all of the bases we read are independent of each other. I’ll come back to that. If we make that assumption, we can just take a product of the individual base likelihoods, and individual base likelihoods are easy to compute, because we know the quality score, which is something the sequencing platform gives us; if we trust that, we have a probability that the base is correct.

So we have a likelihood; all we need now is a prior, and you get a posterior. If you remember Bayes’ rule, it’s a way of connecting the probability of A given B to the probability of B given A. We computed, on the previous slide, the probability of the data given the mixture. Now if we define a prior, which is just the probability of the mixture if you don’t know any data, we can get the thing that we want, which is the probability of the mixture given the data. What we want is just, what’s the probability of this
variant versus reference mixture, given what we observed. And here’s a really simple prior. Also, I’m ignoring the denominator everywhere, because we end up just comparing posteriors, so everything is just up to proportionality. So, for the prior, there’s a really simple way to do that: we just say our prior on a mixture is about one in a million for the somatic variant mixture and essentially one otherwise. This will probably have to get better, but this is what we’re using right now.

Then for each sample we compare the posterior probabilities of the variant and reference mixtures. All of this so far is basically just what MuTect does. Where it changes slightly is that we also make a pooled sample by combining all the tumor DNA reads across all the samples, and we compute
the same quantities for that. So if you have a low allele fraction mutation that’s consistently found across your samples, we have a chance of being able to call it in the pooled sample even if we wouldn’t have had the power to find it in an individual sample. If the variant mixture has a higher posterior probability in any sample, we trigger a variant call there. There’s a subsequent step too, but basically the idea is that we look at every sample individually and we look at the pooled sample as well.

The way we incorporate RNA is also really simplistic. RNA alone should not trigger a variant call: you have a lot of variant bases in your RNA sequencing due to things like RNA editing, so it’s not a good idea to call variants from RNA alone, where the majority of sites will be false positives. However, if we have RNA support and we also have DNA support, that should make us more likely to call a variant. The way we implement that currently is just a change of prior: we have a more lenient prior on the variant mixture if there’s RNA support.

I mentioned that the assumption of all the read bases having independent errors is not true. The way most variant callers hack around this is that after the likelihood computation finds a bunch of plausible variants, they apply a bunch of filters that remove known sequencing artifacts. We’re experimenting with the same approach. We’ve implemented a small number of filters that are nothing new; they’re just what MuTect does.

Yeah, we actually don’t currently have that implemented. It would be really nice. Good question.

So that’s the somatic joint caller. Then there’s also the assembly caller, which I’m not going to talk much about, but I just wanted to point out that we also have it. Right now it’s germline; it will sort of do some somatic. It’s similar to the GATK HaplotypeCaller, and eventually we hope to make something comparable to MuTect2. We’re basically just testing it, really just debugging it, currently. So we’re
going to go through some preliminary results, which you’ll see are pretty thin right now; this is very early. Our testing cluster is pretty decent: it’s a hundred nodes with 24 cores each, 12 terabytes of memory, and a couple of petabytes of storage.

This is very early stuff that we did a while ago. It’s just looking at whether we can scale, before doing anything interesting: can we read in the reads and do something with them? As you would hope, there’s very good scaling with the number of nodes; it was almost linear. We were basically I/O bound, which is where you want to be. As for scaling as we add cores per node, because we’re I/O bound, you get less dramatic improvement with more cores.

Just some example runs that we’ve done recently: two whole-genome samples, 22 minutes; three whole-genome samples, 31 minutes. That’s kind of the range for where it is right now. It can probably be improved a lot, and we’re doing a lot of work on that right now.

Comparing to MuTect is a little hard, because it parallelizes differently, so finding a valid way to compare against it is difficult. MuTect kind of leaves the parallelization up to the user, so we took the typical way that it’s been run in our lab, which is to parallelize across chromosomes. For that, the slowest chromosome is the biggest one, so the run there took a hundred fifty-eight minutes, for chromosome one. So you can see there is an improvement; we’re faster, although we’re also using all of our compute cores. I’m actually not sure how many cores we typically use per node. Each node has 24 actual physical cores, but then there’s a
question of optimal scheduling, how many you actually use. Did that answer your question? Yeah, cool. We’re experimenting right now with moving to Google Cloud. I don’t have any data on that; this is all on our own cluster that we’ve been using. It’s a cluster that was bought, bare-metal, a couple of years ago.

Yeah, good question. We did some testing of the partitioning, and I’ll actually skip over that right now. I’m going to show three runs we did: one using TCGA data, one using the DREAM challenge, and one using a patient for whom we don’t have validation data, only MuTect calls. We also have a repository where we’re trying to collect variant calling benchmarks; it’s also a test harness for running Guacamole. The unfortunate thing is that for most of the data, like TCGA, we can’t distribute it to people, because they have to sign the agreement. But if you get your own data, you can use this to run Guacamole on it. Whether it’s this repository or something else, we’re interested in a collaborative effort of getting together benchmark data, if any of you are working on variant callers.

OK, so a small minority of TCGA samples have some sort of validation run on them. This is where you did exome sequencing, ran a variant caller, probably MuTect, got some variants, and then did targeted sequencing, sometimes with an orthogonal platform but usually just with Illumina at higher depth, to validate them. We looked through TCGA and found 700 samples, out of about 5,500 at the time we were doing this, that had some validation data. But there’s a big caveat to this, which is that both recall and precision are difficult to measure in this context. Take precision: say we made a call that wasn’t in the TCGA calls. Is that a good call? We have no idea whether it’s true; they didn’t try to validate things they
didn’t call. But even recall is subject to a more complex ascertainment bias, where they only validated calls that MuTect was able to make to begin with, so we’re kind of limiting ourselves to calls MuTect could make. And just because we make a call that wasn’t tested doesn’t necessarily mean it’s wrong.

With that caveat, this is what the data looks like, which is basically that we consistently underperform on recall against both MuTect and Strelka. This was using not hand-tuned, just hand-picked thresholds for filters and for likelihoods that we just made up for our first test. It’s difficult to optimize in that setting because you can’t measure precision: you can always get it to make all the TCGA calls, but it will just also make a lot of false calls, and you can’t tell whether you’re improving anything.

Then there’s the DREAM challenge, which is a public contest for somatic variant calling. It currently consists of synthetic data for a number of patients, in increasing difficulty, across a bunch of datasets, and eventually there will also be validation data for real patient sequencing. Again we saw that our hand-picked thresholds performed poorly. This is DREAM synthetic 1, which is the easiest challenge to call variants on. The gold set is the blue: that’s the true variants, which we know because it’s synthetic data. The green is MuTect: MuTect called almost all of the gold mutations, plus a bunch more that were false. And Guacamole, you can see, did pretty poorly: it called a lot of stuff that it shouldn’t have called and only a relatively small fraction of the true calls. So we looked into how we optimize this: how do we pick the right values?

There’s a bunch of parameters to pick, so how do you pick those parameters? The first thing we looked at was doing something kind of data-driven. We just asked: what happens if you use logistic regression on, basically, the features that go into your filters and your likelihoods? You just output all those features for all your sites and then do logistic regression with a 50 percent test/train split. So we took the DREAM data, chopped it into two pieces, trained a logistic regression model on one piece, and tested it on the second piece. And it does really well: it can almost completely match. This is a precision-recall curve, and it’s almost optimal. It actually has all the data it needs to make the right calls from just a really small number of features, which is possibly an artifact of the synthetic nature of this dataset. We also tried a random forest, and that does better, actually.

Then I just picked a pretty good-looking point on this precision-recall curve and looked at what calls were actually generated there. This is Guacamole, this is the published MuTect: you can basically match it on this data using a model.

Yeah, sure. Yes, we had enough data that we didn’t even do cross-validation; we just split into testing and training sets. So you’re asking... right, yeah. This is just on that one sample, that one synthetic dataset. The point there is just that there is apparently enough data in this small number of features we’re collecting to call variants; you just need a good way of integrating that data.

Yeah, for your question: it turns out we actually compute a bunch of likelihoods at every site, for each allele, and it turned out the tumor reference allele likelihood was the most important feature in the random forest. That was actually found also in a second dataset. I have no intuitive explanation for why that was the most important feature. But the other
point is that the traditional filtering model is to just have a bunch of hard filters: if something exceeds a threshold, say strand bias is more than a given number, you filter the call. Whereas these models have much more capacity. They can learn to do a lot more things by balancing, you know, strand bias against the likelihood being really high. Interpreting these models is something that’s totally open for us.

Also, just quickly, and this is basically the end: I looked at an ovarian cancer study patient. I just wanted to see how many new mutations you get from doing this calling. This is a case where we have two samples from this patient. Not too surprisingly, just looking at the published calls that were made by the group that put out this data, we’re really getting nothing from doing RNA sequencing. But if we go over here and look at calls that were not published, many of which may be false, we are getting like ten percent from the pooling. So there’s a possibility that there’s some advantage to doing that, but this definitely doesn’t establish that.

I’m going to wrap up now. Some open questions for the variant calling that we’ve implemented ourselves: should we stick with hard filters, which is like most variant callers, or should we just switch to some data-driven model? The people at the Broad actually, as far as I understand, use some data-driven models just after variant calling; they call it variant quality score recalibration. So I guess one question for us is, should we just kind of do that from the beginning? That’s basically a matter of getting to the third point, which is some validation data that we can train on. Then there’s also the question of how we integrate these two variant callers that we have: we have assembly calling and we also have this joint calling. This is an open source project with those issues open, and if you’re interested in working on these problems, we’d be happy to have
contributions. Yeah, so there’s this question: how do you take multiple samples and use
them when you’re generating haplotypes by doing assembly? One possibility would be just taking all your data, pooling it together, getting a bunch of haplotypes, and scoring everything. But you could also imagine doing it per sample. And then there’s also the question of RNA: how do you deal with introns that are near your region if you’re trying to do assembly calling? Should we try to assemble RNA at all?

All right, so, some Spark takeaways. Overall, this project took us a while to get things working on Spark. We started this project two years ago, and Spark has come a long way since then. If you have experience working with distributed systems, you may be able to use Spark effectively on your problem, but it’s not magic, and it’s not going to keep you from having to think about hard problems like skew. That early word count example in Spark that I showed, that’s not really our experience running a variant caller. We had to really dig into a lot of tuning and a lot of thinking about how to distribute this problem.

The ease of debugging is poor, but it’s improving: understanding why a Spark job failed, what memory threshold it exceeded. Similarly, finding the right settings for memory size and number of cores per node is difficult right now. Our method is kind of that we use Ryan as an oracle: if our Spark job crashes, I ask him what I should change, and he says to raise the executor memory limit. But the point also is that these parameters change for each dataset we run on. So right now we’re very much in a world of hand-optimizing: when we run on a new dataset, if it has more data, we’ll decrease the number of cores so that there’s more memory available, things like that. Eventually getting to a point where we look at the data and then decide what the right settings to run Spark with are is something that would be great to do.

Yeah, that’s it. I also wanted to thank everyone who worked on this in our lab. So it’s Ryan, with
myself, and then also future and Eliza check. So if there are any questions, or open discussion, since some people here work in similar spaces.

Yeah. So multi-region sequencing has been done a little bit. That’s usually the term for when you take a tumor and then you sample different parts of it. There are a couple of really interesting analyses that have looked across a large number of samples, and there is spatial heterogeneity, as you’d kind of expect. But we’re really not in that space. Given the cost of doing more than one sample, usually we just want more patients. We already don’t have enough patients to do the science, so we want to put our money into getting more patients, not getting more samples from the same patient.

I see. So actually, for ovarian cancer, heterogeneity is a poor prognostic marker, and I think for other cancers too. Right now the way of measuring heterogeneity is usually from just a single tumor sample with bulk sequencing. You can kind of look at the allele fractions and whether they tend to be lower; there are various tools that basically cluster allele fractions. But if we had multiple samples from one tumor, we could do a better job. Yeah, right. So yeah, that’s better. Yeah, but you just imagine, actually, right, so yeah, just keep your doing with its
around for our it’s. But I think it’s a good point. We made a best effort to try to get, basically, what the latency is if you have plenty of nodes and you want to run MuTect: well, you’re going to wait on chromosome 1. But there we would only use, with 23 chromosomes, 23 computers, whereas we’re using a hundred. Yeah is exact person who DNA genome data type all that step is it still probably out where to set anything going to get going now. I see. More questions? Yeah. Anyone have any ideas? Actually what’s new oh cool the perfect for personal genome or something talk about we’ve had very little.

The other question was: what’s a good book to read to get up to speed on some of this? One suggestion is Joel Dudley’s book. Yeah, cool. And a Johns Hopkins Coursera course. We have a giant list of resources for onboarding lab people; we can send that. Sure.

Yeah, I think you’re saying, if we knew the approximate mutation count, we could use that to get a prior on whether it’s... oh, if you know the purity. Yeah, that is a nice idea. So there are a lot of approaches to that. I see what you’re saying. Our samples usually come with a pathologist’s opinion on the purity. The pathologist takes a little piece of the tumor that’s going to be sequenced, and they’ll say, oh, I think this is thirty percent normal, 70 percent tumor. Then there are also methods using SNP arrays that can estimate the purity more accurately, where you look at changes in the minor allele frequency at a whole bunch of germline SNP positions. Those often actually disagree with the pathologist; usually it’s that the pathologist overestimated the amount of tumor in the sample. Variant callers commonly will take that information into account. Ours doesn’t, just because it’s very simplistic, but it totally could and should. And I think that kind of thing is also where this kind of approach with Spark
could be more useful, once you’re doing multiple reductions over the whole data. Currently we’re really looking at the data just once. But as we start adding more runs through the data, say we first estimate purity and then call variants, that’s the sort of problem where we might start seeing more advantages.

Sure. Yeah, so the short answer is I don’t know. It’s a complicated thing, because you could also just have differential expression within the clones: you might have that gene highly expressed in the RNA in one clone but not the other. When we looked at just how many variants we can find evidence for in the RNA, it’s fairly high: if the gene is expressed at all, we often can also see the variant. But I don’t know how that relates to subclonality. It’s an interesting problem that we haven’t studied but would like to. Cool, thanks, I guess.
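
The purity discussion above lends itself to a small worked example. The sketch below is not Guacamole’s method (the talk says its callers do not use purity yet). It assumes a clonal, heterozygous somatic variant in a diploid region with no copy-number change, so the expected variant allele fraction is purity / 2, and it compares a binomial likelihood at that fraction against a sequencing-error-only model. The function names and the 1 percent error rate are illustrative assumptions.

```python
import math


def expected_somatic_vaf(purity: float) -> float:
    """Expected allele fraction of a clonal heterozygous somatic variant.

    Assumes a diploid region with no copy-number change, so only one of the
    two copies in each tumor cell carries the variant.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity / 2.0


def binomial_log_likelihood(alt_reads: int, depth: int, allele_fraction: float) -> float:
    """Log probability of seeing alt_reads variant reads out of depth reads."""
    # Clamp the fraction so log(0) can never occur.
    p = min(max(allele_fraction, 1e-9), 1.0 - 1e-9)
    return (
        math.log(math.comb(depth, alt_reads))
        + alt_reads * math.log(p)
        + (depth - alt_reads) * math.log(1.0 - p)
    )


def somatic_log_odds(alt_reads: int, depth: int, purity: float,
                     error_rate: float = 0.01) -> float:
    """Log odds of 'real somatic variant' versus 'sequencing error only'.

    error_rate is a hypothetical per-read error rate chosen for illustration.
    """
    variant = binomial_log_likelihood(alt_reads, depth, expected_somatic_vaf(purity))
    noise = binomial_log_likelihood(alt_reads, depth, error_rate)
    return variant - noise
```

With the pathologist’s example of 70 percent tumor, the expected fraction is 0.35. Ten supporting reads out of 40 then favor the variant model (positive log odds), while zero supporting reads out of 40 favor the error-only model (negative log odds). A lower purity estimate pulls the expected fraction down, which is why an overestimated purity can make real variants look less likely than they are.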
