Good afternoon, my name is YuChing Yang and I am the Director of the Statistics Unit here at the UCLA Center for Health Policy Research. Welcome to today’s online seminar, “Combining Traditional Modeling with Machine Learning for Predicting COVID-19.” I’m thrilled to introduce our special guest today, Christina Ramirez, a professor of Biostatistics at the UCLA Fielding School of Public Health. Dr. Ramirez and her esteemed colleagues have been working tirelessly to predict and track COVID-19 cases and death rates using a unique combination of models. Globally, there are more than 13 million confirmed COVID-19 cases and a death toll of more than 580,000. With the recent and alarming surge in COVID-19 cases across the US, which we are now counting at almost 3.5 million confirmed cases and 138,000 lives lost, the role of public health research and its partners is even more crucial in slowing down and ultimately stopping the spread of the virus. The recent surge calls for more effort to prevent overburdening the health care system and to uphold the public health mission of protecting vulnerable populations. As states make decisions about reopening, they present an increasing health danger, prompting an even greater need to understand the disease and its spread in the community, not only in the state of California but throughout the nation, with healthcare experts using various models to help forecast case and death rates and ultimately identify hot spots and the need for targeting resources in those areas. While most use either compartmental SIRD models, curve fitting, or machine learning to model COVID-19 cases and deaths, Dr.
Ramirez and her colleagues have combined all three techniques into a single model. In this talk, she will share her groundbreaking and comprehensive model, which combines a traditional SIRD model with case velocity and machine learning to get precise, reliable estimates of COVID-19 case and death rates, shining a light on whether the pandemic is gaining speed and whether deaths are accelerating or stabilizing. This project also used the CHIS data from the UCLA Center for Health Policy Research to obtain an accurate snapshot of California’s data, so that morbidity and mortality rates are informed by sociodemographic factors such as age, race, and ethnicity, and by comorbidities or underlying health conditions. Dr. Ramirez and her team were also among a group of scientists who worked with lawmakers in Los Angeles, California, London, and even South Africa to enact masks as source control prior to the CDC and WHO recommendations. If you are interested in today’s slides, you can request them from our communications department at the email address shown on your screen. Please stay tuned for announcements on our upcoming seminar in August on work we are doing around COVID-19 in the Native Hawaiian and Pacific Islander community with a spectacular group of graduate students from across the country. Now, let’s begin today’s discussion with Dr. Ramirez. Hi. Thank you for inviting me. It’s a pleasure to be here. This work was done with the help of graduate students, and I really want to give a shoutout to

Greg Watson, who works at the Center for Health Policy Research. He is phenomenal, as are all of these graduate students. At the very beginning of March, I sent out an email asking which graduate students would be interested in working with me on these models, and Greg Watson, Di Xiong, Lu Zhang, Jay Xu, Joseph Zoller, Phillip Sundin, John Shamshoian, and Teresa Bufford came to my office, and at the beginning we were working 12 to 20 hours a day to get this model. Also with the help of Anne Rimoin and Marc Suchard here at UCLA. Thank you for joining me. This is not going to be a technical talk; I think there’s only one slide with equations, so don’t worry, I will explain all of this, and I’m so pleased to talk about what we’ve done for modeling COVID-19. So, we’ve already talked about the devastating toll that COVID has taken globally, but we really need accurate case estimates and mortality projections to be able to make good recommendations for decisionmakers. As YuChing was saying, there are three main types of models used for COVID prediction, and we’ll talk about each of them in order. The most common ones are epidemiological compartmental models; the very first Imperial College London model was a traditional SIRD model. Then there are curve fitting models; the very first one you’ve probably heard of was the IHME model, which started out as a curve fitting model. And lately we have machine learning models. I study machine learning, so it’s near and dear to my heart, and we actually combined all three of these in our models. First, let’s talk about what these compartmental models are. Ours specifically is a SIRD model; you’ve probably heard of SIR, SIRD, and SIRS variants. It’s really sort of simple: you envision the world in four different compartments. People who are susceptible to infection, and because this is a novel virus, we assume that everybody is susceptible. Susceptibles transition to infected, and once infected you can either recover or, sadly, you can die.
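As a rough illustration of the four-compartment idea, here is a minimal discrete-time SIRD simulation. The parameter values (transmission rate, roughly 14-day infectious period, death rate) are illustrative assumptions, not the talk’s fitted values, and the transitions are held constant over time, whereas the model described in the talk lets them vary.

```python
# A minimal SIRD sketch with illustrative (not fitted) parameters.
# S -> I at rate beta*S*I/N; I -> R at rate gamma; I -> D at rate mu.

def sird_step(s, i, r, d, beta, gamma, mu, n, dt=1.0):
    """Advance the four compartments by one Euler step of length dt."""
    new_inf = beta * s * i / n * dt   # S -> I
    new_rec = gamma * i * dt          # I -> R
    new_dead = mu * i * dt            # I -> D
    return s - new_inf, i + new_inf - new_rec - new_dead, r + new_rec, d + new_dead

def simulate(beta=0.3, gamma=1 / 14, mu=0.005, n=1_000_000, i0=100, days=120):
    """Run forward from i0 initial infections; everyone else starts susceptible."""
    s, i, r, d = float(n - i0), float(i0), 0.0, 0.0
    history = [(s, i, r, d)]
    for _ in range(days):
        s, i, r, d = sird_step(s, i, r, d, beta, gamma, mu, n)
        history.append((s, i, r, d))
    return history

hist = simulate()
# The compartments always sum to the population: people only move between them.
```

Because each step only moves people between compartments, the total population is conserved, which is a quick sanity check on any implementation like this.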
And so we have each of these four compartments, and we model, through a system of differential equations, the transition probabilities for each of these compartments. Okay, so this is the most common model; it’s been used for over 100 years to describe disease phenomena. We have this system of differential equations, and we can use the literature, we can use prior information, or we can derive parameter values empirically from the data to govern how people transition between these compartments. The solutions to those differential equations allow for easy joint modeling of the disease state, so recovery, hospitalizations, and other states can naturally be incorporated as separate compartments, which is why these models are so widespread. They also allow for projections forward. With curve fitting, the projections really can’t go that far, because you have to project a curve along; these SIRD models allow for much longer projections. But disease transmission rates have changed due to political and societal responses, these lockdowns and mask ordinances, and modelers have to adjust transmission rates upward and downward, which can be a little bit ad hoc. That’s because of the compartmental nature: we assume uniformity, or at least homogeneity, within each of these compartments. This is what brought up curve fitting models. There are a couple of different flavors of curve fitting models, and we’ll just talk about some of the more popular ones, the ones that you probably see on the internet. A lot of them are serial growth models. This is really simple: you figure that the number of new infections is a function of previous infections, and so often it can be weighted or scaled by the reproductive number. This is the R value you hear so much about, which, simply put,

is the number of people each infected person goes on to infect. We allow it to be time varying, because we know that it varies across time, and oftentimes it is also weighted by the amount of time between getting infected and infecting another person. Then we also have to model deaths, and deaths are often modeled as a second step. Usually we use something like a negative binomial model that predicts the number of daily deaths conditional on the recent number of infections. It’s also important to remember that a lot of the probabilities we’re modeling are conditional, not marginal, probabilities. Then we have statistical models. Here we can actually have a lot of flexibility: you want to fit the data that you observed as a function of time and other covariates. A lot of the popular ones are log-linear, ARIMA (autoregressive integrated moving average), exponential, and logistic models. Generally these can accommodate the sigmoidal shape of cumulative counts. Sigmoidal is that S-shape that you see, which has an exponential part and then sort of a logarithmic part. These types of models can account for this pretty easily, and they can also incorporate time-varying covariates like mobility data and social media information. I know Google and Apple have been tracking mobility, and we can put these into our models. But the caveat is that it’s hard to forecast trends forward, because they’re unknown in the future; we sort of know what they are for the short term, but we don’t know how trends may change. It can also be harder to incorporate information about spread if we’re only modeling deaths, because we know that deaths lag infections, but deaths give us more reliable data. A lot of the models that you see are modeling deaths, and IHME is one of them, because deaths are more reliably reported.
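The serial growth idea, new infections as a weighted function of recent infections scaled by a time-varying reproductive number, can be sketched as below. The generation-interval weights and R(t) values are invented purely for illustration.

```python
# Serial-growth (renewal) sketch: today's new infections are recent infections
# weighted by a generation-interval distribution and scaled by a time-varying
# reproductive number R(t). Weights and R(t) values here are invented.

def renewal(new_cases, r_t, w):
    """One renewal step: R(t) times the weighted sum of the last len(w) days."""
    recent = list(reversed(new_cases[-len(w):]))  # recent[0] is yesterday
    return r_t * sum(wk * c for wk, c in zip(w, recent))

w = [0.1, 0.2, 0.3, 0.25, 0.15]     # generation-interval weights (sum to 1)
cases = [100.0] * 5                 # seed the first few days
for t in range(30):
    r_t = 1.5 if t < 15 else 0.8    # e.g. an intervention drops R(t) below 1
    cases.append(renewal(cases, r_t, w))
# Cases grow while R(t) = 1.5, then decline once R(t) = 0.8.
```

Deaths would then be modeled as a second step conditional on recent infections, for example with a negative binomial, as described above.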
Every day I track down news stories about cases and testing and problems with the data, and so death data tend to be much more reliable in a statistical precision sense. We can incorporate all of this information, but deaths do lag infections, so whatever is going on in terms of interventions, we see it later in infections, then in hospitalizations, and then finally in deaths. Another caveat is that we often have to model one outcome at a time. We model infections, we model deaths, and then we need to take additional steps to predict other quantities. Machine learning: neural networks are often used in conjunction with compartmental models. This beta is a parameter that often relates S to I, and here are two good references where they’ve done a really nice job incorporating machine learning into these models; machine learning can also be used for curve fitting. To our knowledge, we’re the only ones that combine all three. Some other models to be aware of: there are agent-based models, which simulate the individuals of a population and how they interact. This gives us a really nice mechanism for modeling the effect of interventions, because you don’t have the homogeneity assumption where everybody in a compartment is the same. But it does require assumptions about human behavior and interactions within the population, as well as the infectivity of SARS-CoV-2, and this is not a trivial matter. Feynman, most famously, commenting on quantum mechanics, said how much harder his job would be if he had to model the interactions of electrons that had feelings. Modeling human behavior is not for the faint of heart. Okay, so let’s talk about our model. What we wanted to do is combine all three of these modeling techniques to get a very

flexible model. We use the SIRD compartmental model, which I talked about before: susceptible, infected, recovered, and dead. Then we fit a Bayesian non-linear mixed model to case velocity, and I’ll talk about this in a second. Then we used a Random Forest to model the death rate, and this is where we used the California Health Interview Survey, specifically for California. We had to use other data sets for other states, but the CHIS data was absolutely beautiful, and for our very first models we relied on it very heavily, especially for California. So, I swear this is the only slide that has equations on it, and I’ll talk about what they mean. This is just for people who want to know what our system of differential equations is. Really, all it is doing is telling us how people transition from being susceptible to being infected, and whether they recover or die. We allow these transition functions to vary in time, and then we want to be able to incorporate covariates and other information, as well as uncertainty. The transition between S and I is determined by this ξ(t); think of it as driven by the number of confirmed COVID-19 cases. Then we transition out of I with a rate parameter, which is the inverse of the number of days a person is expected to be infectious. The transition from I to death is determined by the death model, this θ(t), which is the Random Forest. As a caveat, we only count deaths from COVID, and we ignore births and imported cases; we’re just modeling this closed system. So if somebody gets hit by a car and it’s not due to COVID, we don’t count that. So, let’s talk about this ξ(t) term and how we transition between compartments. What we do is model the velocity of new cases, because we really want to know the trajectory. We look at the log cumulative case counts. We look at these, and these are monotonic, so they can never decrease.
The cumulative case count can never decrease. But we need continuity, so we fit a cubic spline to the observed log cumulative cases and then take the derivative at the observed time points. What this gives us is the instantaneous rate of new cases. This is very similar to the reproductive number, and we can estimate this velocity by fitting this cubic spline. That way, we can really see whether cases are accelerating, decelerating, or staying the same. But you might say, “oh, but this is a cumulative case count, it can never be negative!” Right. So we employed a log link to map this velocity onto the whole real line. This is where the non-linear Bayesian mixed model can be used to obtain location-specific estimates of this trajectory. We can do this even down to the ZIP code level, if we have the data. Currently we do it for counties, and some states actually provide us ZIP-code-level data, so we can do it by ZIP code; we also do it by state. By doing this, we can borrow strength across locations while accommodating individual variation, and this is where the mixed model comes in, because it allows us to do this in a statistical framework. All right, we do this 500,000 times, and each of these instances gets put into our model, so we can actually get estimates of uncertainty. Let’s just look at what this tells us. Here, for a couple of states, you can see the log cumulative cases and the velocity. Here are the cumulative cases for a couple of states: here’s California, here’s New York. Now let’s look at the velocities. You can see that there are a lot of differences in these velocities. It’s also important that we only did this once there were 100 or more confirmed cases, because this is where you have community spread. Now all 50 states have it, so we’re able to model every state, but obviously we can’t model counties that do not have a hundred cases or ZIP codes that don’t

because we don’t have community spread. But we can borrow strength across locations and pool those to be able to get estimates. And you can see that there’s a lot of structure here. So we fit this non-linear Bayesian model, we get the posterior estimates for these locations, and we can model whether or not they have interventions and which interventions they have, and incorporate uncertainty. What I was saying before is that we run the model separately for each of the half a million posterior samples of this ξ(t), so we can get interval estimates quantifying uncertainty. Okay, so now we need to transition out of I. We’ve gone from S to I, and now we want to figure out how we get out of I. We assume that the rate is the inverse of the expected number of infectious days. The literature says it’s about 14 days, which is why we have the 14-day quarantine. But we’re statisticians, and we never like to say we’re certain; we like to incorporate uncertainty. So we sample from a Gaussian with a mean of 14 and a variance of 1, to incorporate that uncertainty into our estimates. Then the transition to death is determined by a Random Forest, which is a machine learning algorithm that I find very useful in many applications.
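The case-velocity step described earlier, fitting a cubic spline to log cumulative cases and reading off its derivative, can be sketched as below. As a dependency-free stand-in, this sketch uses central differences on the log scale rather than an actual spline fit, which likewise estimates the instantaneous rate of new cases; the counts are synthetic.

```python
import math

# Velocity sketch: derivative of log cumulative cases. The talk fits a cubic
# spline and differentiates it; central differences are a simple stand-in here.

def log_velocity(cumulative):
    """Central-difference derivative of log cumulative counts, per day."""
    logc = [math.log(c) for c in cumulative]
    return [(logc[t + 1] - logc[t - 1]) / 2.0 for t in range(1, len(logc) - 1)]

# Synthetic exponential growth at 5% per day: log counts are linear in time,
# so the estimated velocity should be 0.05 everywhere.
counts = [100.0 * math.exp(0.05 * t) for t in range(10)]
vel = log_velocity(counts)
```

A positive velocity means cases are still accumulating; a rising velocity means the epidemic is accelerating, and a falling one means it is decelerating, which is exactly the trajectory question the talk cares about.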
It’s very fast, and of course we’re running half a million posterior samples, so we do need to make sure that our stuff scales. What is really nice about it is that it allows for easy incorporation of covariates (age, sex, race, comorbidities, density, anything that we think is useful), and what’s really nice about Random Forest is that, unlike traditional linear models, you can actually have more covariates than observations, and it ignores the ones that aren’t useful for prediction. So, we fuse all of these together so we can incorporate the strengths of three different modeling approaches. And as we were saying at the beginning, we really relied heavily on the CHIS data, especially for California. We live here, and we were really interested in modeling California at first, and CHIS gets really nice samples of things like comorbidities, race, and gender that are representative of California as a whole. So we could get really good estimates, especially for death, because we know that patients who are older or have comorbidities, and also patients of different sexes and races, have differing outcomes in terms of morbidity and mortality, and we wanted to be able to incorporate this into our models. Another really nice thing about Random Forest is that it allows you to look at covariate importance, and it really does find what we expected: that age and comorbidities, but also density, really matter. So people living in dense households have higher risk than those who are alone, or maybe in a single-family home. We can look at these and know that our model really makes sense against all the data that we’re seeing worldwide. This plot is just the mean squared error with the permutation importance and the variables’ values.
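Permutation importance, the measure behind the covariate-importance plot just described, can be sketched generically: shuffle one predictor column and see how much the prediction error worsens. The toy model and data below are hypothetical stand-ins for the fitted Random Forest.

```python
import random

# Permutation importance sketch: how much does prediction error worsen when one
# predictor column is shuffled? A toy closure stands in for the Random Forest.

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, col, seed=0):
    """Increase in MSE after shuffling column `col` of the predictor matrix."""
    base = mse(y, [predict(row) for row in X])
    shuffled = [row[:] for row in X]          # copy so X is untouched
    column = [row[col] for row in shuffled]
    random.Random(seed).shuffle(column)
    for row, val in zip(shuffled, column):
        row[col] = val
    return mse(y, [predict(row) for row in shuffled]) - base

# Toy data: the outcome depends only on x0 (think "age"), never on x1.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]
predict = lambda row: 3.0 * row[0]            # a model that learned the truth

imp0 = permutation_importance(predict, X, y, 0)
imp1 = permutation_importance(predict, X, y, 1)
# imp0 is large; imp1 is exactly 0, since x1 never affects predictions.
```

This is why, as noted above, predictors that are useless for prediction end up with near-zero importance: shuffling them does not change the model’s error.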
If you have questions, feel free to email me; this is not meant to be a technical talk. But just know that our death model really does make sense with what we’re finding across the world. Okay, so now let’s talk about predictive accuracy. I’m going to take a minute to explain what the MASE is: it’s just the mean absolute scaled error. I know it sounds like a lot. We compute it for the posterior median number of cases and deaths, and we do this separately. What we’re looking at is a ratio: our model versus a model that just says that cases or deaths follow a random walk of the training data. I see this a lot, everybody has a model these days, and I see it on the web where they’re just doing predictions based on where you were, maybe, the week before. This is our baseline for accuracy, a random walk forecast of the training data. So we get to see all the data and just predict one, two,

three, four weeks forward. So a MASE of one means that we’re not doing any better than just a random walk. Less than one means we’re doing much better, and you can sort of see that here. When we did this, it was in early April. So here it is across time, and we’re doing much better than a random walk: for cases we’re at 0.4 and for deaths 0.32, well below 1, suggesting that we have very reliable predictions. So let’s talk about some of the results, and here is California. This is data from yesterday; this is a really fast-moving epidemic, as you know. Let me explain what this output is. This green line is the cumulative cases, not on the log scale but on the native scale. The green line with the little shaded region is like our 95 percent credible interval, though it’s not quite a credible interval for the death model, because Random Forest is not a probability model; for the cases here, it is. Each of these data points is actual data, and we plot these after we fit our models. Our model is fitting pretty well. So this is the cumulative number of cases in California by time, and we’ve projected out through August. This one is active infections, which is different from new infections and different from cumulative infections, because people stay infectious for 14 days on average and can go on to infect other people, while people who were infected are becoming uninfected. That is what the pink region is: our 95 percent credible interval of where we expect the number of active infections to be. We don’t have any dots to plot here, because this is an estimate; nobody knows the number of active infections, and so these are our estimates.
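The MASE comparison described a moment ago can be written down directly: the model’s mean absolute error divided by the error of a naive one-step random-walk forecast over the training series. The numbers below are invented purely to show the computation.

```python
# MASE sketch: model MAE scaled by the MAE of a naive random-walk forecast
# (each day predicts the previous observed value) over the training series.
# A value below 1 means the model beats the naive baseline. Numbers invented.

def mase(actual, predicted, training):
    model_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_mae = sum(abs(b - a) for a, b in zip(training, training[1:])) / (len(training) - 1)
    return model_mae / naive_mae

training = [100, 110, 125, 138, 155, 170]   # observed series used for fitting
actual = [182, 195, 210]                    # what then actually happened
predicted = [180, 196, 208]                 # the model's forecasts
score = mase(actual, predicted, training)
# score is far below 1: these forecasts beat the random-walk baseline
```

Scaling by the random-walk error is what makes values like the 0.4 (cases) and 0.32 (deaths) quoted above directly interpretable: both are well under the break-even value of 1.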
The orange is our deaths, our death model, and because it is Random Forest, it’s very jagged; even though it’s sort of smooth, it is a discrete step function. Each of these dots is a data point, and there is a temporal trend in when cases and deaths are reported, with fewer deaths being reported on weekends. Then after weekends and holidays we see an increase in reported deaths, which is why it goes up and down like this. A little happy note, and I really, fingers crossed, hope that it stays true for our projections: each day, our projected number of deaths going forward is starting to slightly decrease. So hopefully in California we won’t see high levels of deaths like we saw in New York, but deaths do lag infections, and we are waiting to see what happened after the Fourth of July. This last one is R(t), the R value, which we allow to move across time. Let me explain a little bit about R. R(t) changes over time. When R equals one, the epidemic is stable: for every person that gets infected, they go on to infect one more person. At 1.2, for every 5 people that get infected, they go on to infect 6, and this is where we really start to worry. At two is where we start seeing exponential spread. We’re hoping that we’re going to come down, and we’re starting to see signs that the R(t) is decreasing in California, but not quite going below one, so we’re obviously watching this every day. Okay, Arizona. This is where I’m from and my family’s from, so I worry about it. There’s actually good news coming from Arizona: it looks like their cases, hopefully it stays this way, are peaking and starting to decrease. We’re also seeing this in deaths. But it’s a really wide interval, so it really depends on people really coming together, or coming together by staying apart, because you can see that the predicted R(t) is still well above

one. But I’m really hoping that the death rate stays low. In terms of new cases, it actually looks like they may be peaking, but that assumes that what’s happening continues to stay that way; still, that is an actual bit of good news that has been developing this week. New York is looking really good. They were hit really, really hard, and it looks like their death rate is not increasing. They did have a little bump up in transmission here, probably from Memorial Day and all the activities afterwards, but it looks like they are well under control. Georgia, not as much, but we are expecting an increase in deaths, though not as big an increase as we were seeing maybe a week ago. I tend to be an optimist and try to look at the happier side. Florida, we can see that we’re expecting a massive increase in deaths there and really high levels of transmission. Michigan was looking like it was above, but maybe we’re not going to see as large an increase in deaths; Minnesota as well. So another caveat: a lot of people use these online tools, which I think are great, and I’m really happy people are doing that, but you really need to, well, I’m a statistician, so I always like to look at the data. With North Carolina at the beginning of June, one tool was really different from what we were finding, so I wanted to plot what they were predicting (theirs is here in red) versus what our model was predicting, and overlay the actual new cases. They were predicting that North Carolina was okay, that its spread was not increasing, and that was the opposite of what our model was finding. Always look at the data. I don’t trust anybody else’s analysis; I always, always, always want the data, to do it myself. So: we combined three models so we can get more accurate predictions of cases and deaths. It’s very flexible, and we hope to be able to help policymakers make informed decisions with this.
Here is the paper, and I’m happy to send it to anyone who wants more information, and these are the two papers that I referenced in the talk. I’m happy to take any questions. Good afternoon everyone, my name is Tiffany Lopes and I’m the Director of Communications here at the UCLA Center for Health Policy Research. We’re going to be going through questions from the audience today; I encourage you to continue to use the Q&A function in Zoom to ask questions. We’ve got a lot of great questions to get through, so let’s begin with the first question. How do we know that recovery or death are the only outcomes? Is it possible to continue having this virus in some form, and should that be in a model, if so? Yeah, so our models are very flexible, and we can model many, many different outcomes for which we have data. Where we actually have data are cases and deaths; this is what is on the state dashboards. Hospitalization data is coming, but it has not been as reliable as we would hope; we could easily model hospitalizations. With any infection, you either recover or you die, so we just put it in these dichotomous terms, but there is a whole host of things that can be modeled in between. We just need the data to be able to get good models for this. I’m hoping that the hospitalization data will become really good, and that would be something. We do model it now, but separately, in the different states for which we have good hospitalization data. It is an excellent question, and yes, we are really limited by the data, but we do do that, and our models do allow for hospitalization data. Thank you. What data sources are best for this type of modeling, and do you believe that the shifting of COVID data from the CDC to a database more overseen by the presidential administration will affect our ability to model COVID accurately? For most people that I know that are modeling, we’re not using the CDC data. Most

people that I see were using COVID tracking data, which is done by The Atlantic or by The New York Times or by the states themselves. The CDC data tends to lag by at least a week oftentimes, and for this we need daily data, so I’m literally waiting for each state, each day, to upload their new data, and then we take it. So actually, this doesn’t affect our model at all, because we don’t use it. I do like the COVID Tracking Project and also 1Point3Acres; one of our students was a part of this, and they do a very good job. What a lot of people are doing is just scraping each of these state sites for the data, because we are finding that data reliability is a massive issue and it’s hard to reconcile the data from different sources. So a lot of people do scrape it from different sites. But as for us, we do not use CDC data, so it won’t change us. Got it. How well did your model predict the increased number of cases expected when California and Los Angeles rolled back the stay-at-home orders? I can show you, because we do have a Bayesian model. You can see, maybe I could have put little arrows for each of these things, that we can actually project out pretty far, but the more you project, the less reliable it is. We actually sort of see things before they’re reported. I think our model is really good and, especially in this R(t), really gives us an idea of where we’re going forward. The deaths do lag, but it’s actually quite accurate, and you can see with the deaths that it does track really well. Thank you. Someone is interested in knowing more about how to interpret R values such as 1.2 or 1.5. Aren’t those values very concerning given the rising number of cases, and shouldn’t those estimates be relevant to decisions such as whether to open schools?
Yes, and countries like Germany actually report their R values regularly, but you need to take the totality of evidence. This epidemic is really complex, and it’s really hard to summarize everything into one number. The easy way to understand R is when it’s one: for every one person, they infect one more person, so you see this plateau. At 1.2, for every five people, six people are being infected. At two is where it really becomes exponential: two people infect four, four people infect eight, eight infect sixteen, and so on, and that’s where you really see these uncontrolled dynamics. And we really do think policy makers need to look at everything, including the death curve, because what we really need to do is not overwhelm our hospital systems. If our hospitals fall, then people who are sick can’t get treatment, and perhaps people will die who could have been saved had there been a bed. Fortunately, in New York that did not happen; nobody was denied a bed that needed it, nobody was denied a ventilator, thankfully. And our ability to treat COVID has actually gotten better. The treatment is better, and we’re seeing this in the death rate. In New York, the death rate was incredibly high, and even though we’re seeing cases go high now, we’re not seeing, at least for right now, the death rates get high. I really think that our wonderful doctors and nurses and scientists have done a good job and been able to prevent death from COVID. So that really is good.
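The generation arithmetic just described, where R = 1 plateaus and R = 2 doubles every generation, is easy to tabulate; the starting sizes and number of generations below are arbitrary.

```python
# Generation-by-generation growth under different constant reproductive numbers:
# R = 1 plateaus, R = 1.2 turns five infections into six, R = 2 doubles.

def generations(r, start=1.0, n=5):
    """Sizes of successive infection generations under a constant R."""
    sizes = [start]
    for _ in range(n):
        sizes.append(sizes[-1] * r)
    return sizes

stable = generations(1.0)   # 1 -> 1 -> 1: the epidemic is flat
slow = generations(1.2)     # every five infections become six
fast = generations(2.0)     # 1 -> 2 -> 4 -> 8 -> 16 -> 32: exponential spread
```

Even the modest-looking 1.2 compounds: after many generations it still grows without bound, which is why values above one remain concerning.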

We’ve also learned how to expand hospitals, and I think the hospital system has been absolutely amazing. Opening schools is really tough, and I really do worry about the structural inequalities that fall on lower socioeconomic parents, especially parents of young kids, because they are more likely to be essential workers. I don’t even know how you make this decision: do you send your kids to school, do you stay home and watch your kids, or do you go out and work to be able to provide food for them? It’s a terrible choice, especially if you are a single parent. So the issues are more than just one number, and they need to be examined in their totality, because we know that COVID has disproportionately impacted communities of color and those of lower socioeconomic status, and closing the schools is going to burden them disproportionately as well. So I really worry about that impact. You mentioned earlier about modeling interventions. Could you talk a little bit more about it? Would these be the timing of reopening and universal masking rules, and can you talk a little bit about what you’ve done as far as masking? Yeah, so when we started modeling in March, and this is before the governments did any interventions, we were thinking to ourselves, how do we do this? What would be the best practices? So we looked throughout the literature, and we actually envisioned these interventions at the workplace, because an employer has control over their workplace. One of the things, besides hand sanitization and social distancing, was that we actually thought of masks as source control, because the employer has every incentive to keep their employees safe, but also their customers safe. If customers don’t feel safe, they’re not going to go into their establishment.
So we actually modeled the first interventions in the workplace, because we didn’t have anything in terms of a governmental intervention, and one of the things we really thought about was masks, especially because we were seeing reports of asymptomatic infection. If you don’t know you’re infected, we don’t want you to inadvertently transmit; we just wanted to figure out a way for people to keep their droplets to themselves. At this time, masks were really in short supply, and we did not want to take them away from the healthcare workers, because if healthcare falls, we all fall. So we thought scarves, bandanas, any sort of facial coverings, so that people keep their droplets to themselves, should be able to slow the spread. We could actually model this in terms of transmission and contact rates, and so we did. And then California was one of the first to enact it, and we can model it in terms of reduction in transmission, because with this Bayesian model it’s just one more covariate that you can put in. I hope that answers your question. I think so. Could you actually talk a little bit more about how you use time covariates, new cases on t1 and new cases on t2 from slide 21? Yes. If you’re familiar at all with Random Forest, you know the way that you have to put in the data. We know that you have this autocorrelation, and so we allowed this to be part of our X matrix. The X’s are your predictors, Y is your outcome. As you can expect, cases today depend a lot on cases up to two weeks back, because we know that people are infected for about two weeks. We allowed this to be in our model, even though we know that we have correlation here. There is actually a machine learning algorithm that does take this autocorrelation into account, which is

an extension of Random Forest, but it is rather computationally intensive, and we would like to put this out daily; it already takes overnight to run, because we do half a million posterior samples. But it doesn’t really change the variable importance in this case, because we’ve done all sorts of testing of what happens to variable importance under correlation. So we are actually getting good variable importance measures, and all of our usual suspects do appear, so we are pretty confident in this. When we match it to actual data, our death model seems to be doing a good job predicting.

Were you able to determine a range of lag time between an increase in community spread and increased risk of infection?

I’m not quite sure I understand the question. We do allow the infectious time to vary randomly, so we can incorporate that measure of uncertainty, but certainly if people were infectious for shorter periods of time we would see less community spread. Barring pharmaceutical intervention, I don’t really see that happening, unless of course you lock down and take those people out of circulation.

Got it. How far does the model predict the infection and death rates? Is it three months ahead or longer?

We can run it out until we run out of susceptibles, because it is a compartmental model. But of course, as with anything, the further you project out, the larger the credible intervals. So we usually project out about four weeks, but there is no theoretical limitation other than running out of people.

Okay, does your model help to better quantify the percentage of persons who have asymptomatic versus symptomatic infection?

No. We would need really good testing, and testing data, to be able to do that. The data that we get are on cases that have been identified and people that die. We are really data-driven, because we’re statisticians.
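The lagged-covariate construction described a moment ago can be sketched in miniature. Here scikit-learn’s RandomForestRegressor stands in for the team’s R implementation, and the case series is simulated; none of the numbers are from the actual model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Simulated daily new-case series (a stand-in for real surveillance data).
cases = np.abs(rng.normal(100, 20, 120)).cumsum() / 10

LAGS = 14  # people are infectious for roughly two weeks

# X matrix: row i holds cases for days i .. i+13; the outcome y is day i+14.
X = np.column_stack([cases[lag:len(cases) - LAGS + lag] for lag in range(LAGS)])
y = cases[LAGS:]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = rf.feature_importances_  # which lags drive the prediction
```

The lags are correlated with one another, which ordinary Random Forest ignores at fitting time; as noted above, the team checked that this does not distort the variable importance rankings in their setting.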
If we had really good data: Iceland has had some fantastic studies on transmission, because they have done widespread testing. But we are reliant on the data we have. That is a fantastic question, and a really important one, and as we are seeing more testing, we are seeing a lot more asymptomatic spread.

Someone mentioned that they recently read that the virus is mutating regularly and that newer strains are less lethal. Can your model adjust for the changing death rate and help understand the different strains of the virus?

Okay, so on mutation: it actually has not been shown that there is a change in lethality. The papers are saying that they believe there is an increased ability to infect, so increased infectivity, but they haven’t seen a change in pathogenesis. It doesn’t look like it’s more lethal, but it also doesn’t seem less lethal. But, yes, our model could take that into account. Bette Korber’s paper is really sort of brilliant, showing the co-evolution of these two strains, with one of them becoming dominant. But I have not read any papers that show any impact on the pathogenesis of the virus, and I look at the literature every day; it’s what I do. I haven’t seen any papers that suggest changes in lethality. But our model could easily incorporate that if we knew, if we were actually sequencing the virus and knew which strain somebody is infected with. That is another thing Iceland is doing, which is really interesting, because then you can see where the virus was imported.

Got it. Does your model account for heterogeneity in the number of infections caused by any infected individual?

Yes, it does. This is why we have this half a million posterior samples

from ξ(t), and for each of these samples we’re running a model, so we have half a million realizations of this model. It does incorporate a lot of this heterogeneity, and we also allow the infectious time to vary randomly, to incorporate that uncertainty, because statistics is never having to say you’re certain, and point estimates really don’t help inform policymaking; they give you too much confidence. You have to have a confidence interval or credible interval around it. This does incorporate the uncertainty that different people have different courses of infection: for some people it affects them terribly, and some people are asymptomatic. But we also have these superspreading events, so we need to be able to incorporate this uncertainty into our models.

Great, we have some questions about the modeling and the methods that are more technical. What was the fitting procedure you used for this model, and what methods did you use to get uncertainty estimates? Additionally, could you talk about how fusing the ML model helped over a baseline vanilla SIRD?

Yes. If you look at the vanilla SIRD, it really overestimated the number of infections and the number of deaths, because it didn’t take into account that there was this disproportionate impact by age, that the virus didn’t kill people equally. A regular SIRD model is governed by its dynamic equations, and it doesn’t naturally fit these covariates. So we did our model in R, we used Gibbs sampling, and we did a burn-in of about 100,000 and kept 500,000 posterior samples from our sampler. For each of these samples, we run through this SIRD model and then the death model, we get one instance of our output, and we do this 500,000 times to get estimates of uncertainty.
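The sample-then-simulate loop just described can be sketched as: draw parameters, push each draw through a bare-bones SIRD run, and summarize the resulting death counts with percentiles. The Gaussian “posterior” below is a stand-in for the real Gibbs sampler output, and every parameter value is illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def sird_deaths(beta, gamma, mu, N=1_000_000, I0=100, days=120):
    """One forward run of a minimal SIRD model; returns cumulative deaths."""
    S, I, D = float(N - I0), float(I0), 0.0
    for _ in range(days):
        new_inf = beta * S * I / N
        deaths = mu * I
        recovered = gamma * I
        S -= new_inf
        I += new_inf - deaths - recovered
        D += deaths
    return D

# Stand-in "posterior" draws: in the real pipeline these come from the
# Gibbs sampler (~100,000 burn-in, 500,000 kept draws). Only 1,000 here.
n_draws = 1_000
beta_draws = rng.normal(0.25, 0.02, n_draws)
mu_draws = np.abs(rng.normal(0.005, 0.0005, n_draws))

death_draws = np.array(
    [sird_deaths(b, 0.10, m) for b, m in zip(beta_draws, mu_draws)]
)
lo, med, hi = np.percentile(death_draws, [2.5, 50, 97.5])  # 95% credible band
```

Scaling the number of draws up to 500,000, as in the talk, changes nothing conceptually; it just tightens the Monte Carlo error on the credible-interval endpoints.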
Got it, and somebody’s interested in knowing the secret sauce: your software, and are you sharing any of your algorithms?

Oh yeah, absolutely, it is on GitHub, and if you get the paper you can replicate everything. It’s all done in R (I use RStudio), and we do make our code available. Our methods are explained in detail in our paper, which is available as a preprint; it is under revision for PLOS Computational Biology, and we are happy to share.

Thank you. We’ve got time for only a few more questions, and there are a lot of questions in the queue, so to anybody who hasn’t had their question answered, don’t worry, we will get to them after. But let’s get to a couple before we have to close up. The first derivative, the rate of case velocity, was said to be similar to the reproductive number, but looking at the graph, many of the values were below one. Is there a way to map the first derivative to the actual R?

An excellent question. We really didn’t do that, and we cannot map them one-to-one; they are not on the same scale, which for R runs from pretty much zero to infinity. But they are really quite similar, because the R value is showing whether your epidemic is increasing, decreasing, or plateauing, and the velocity is getting at that same thing. Velocity doesn’t quite have that interpretation of “if you have x people infected, they go on to infect this many others,” but R comes with its own assumptions, and we really just wanted to take the cases and see what the velocity is, naturally in the data we were given, in a framework where we can add covariates.

Got it. If you run your model out to its endpoint of no susceptibles left, what is your current estimate of the total number of deaths that would occur

in the U.S.A.?

Oh, an excellent question. I actually have not done that, but it would be really interesting. Like I said, as our technology increases, our ability to treat increases. I really have faith in our medical and scientific communities that even if our cases keep going up and every susceptible person gets infected, our medical capability will be able to save a lot of people who might have died had they been infected earlier in the epidemic. But I also hope everybody doesn’t get infected, so no, I have not done that.

This is a really interesting question: is it possible to include housing density in your models, to account for vulnerable populations increasingly having to move in together due to loss of employment in service sectors, as an attempted mitigation of homelessness?

Absolutely, and that is one of the things that we incorporated, this population density. For states, we just used the average density, knowing that New York City is much more dense than Wichita Falls. We wanted to be able to incorporate that, and it’s an absolutely brilliant question. We see this when we are looking at different counties, and actually different locations around Los Angeles County: the places that are hit harder are ones that are much more dense or have a higher proportion of multi-family living arrangements, because you just have that much more contact between people. And places that have mass use of subways; anywhere you see a lot of people congregated together, these give chances for the virus to infect other people. We put it in with our death model, but we could actually incorporate it in our mixed model as well.

Are there any policymakers that you’re working with? Are there any policymakers using your models, and which policies have been informed by your models?
So early on we were talking with Mayor Garcetti’s office about using masks, and with the Mayor of London and also some people in South Africa. Right now we are mainly working with businesses to help them open safely and protect their workers, but we’re happy to help anybody that wants our help.

Thank you. Unfortunately, those are all the questions we have time for this afternoon. If we didn’t get to your question, please email us at healthpolicy.ucla.edu and we will get back to you. I just want to thank you all for attending this month’s webinar on “Combining Traditional Modeling with Machine Learning for Predicting COVID-19,” and a big thank you to Dr. Christina Ramirez for presenting this very timely study. We will be posting the recording of this webinar online with closed captions within the next two weeks, so visit healthpolicy.ucla.edu. As a reminder, if you’d like a copy of today’s presentation, you can also email us at healthpolicy.ucla.edu. Stay tuned for details on our next webinar in August, featuring the students and activities of the Native Hawaiian Pacific Islander COVID-19 race tracker lab happening right here at the UCLA Center for Health Policy Research. Have a wonderful rest of the day. Thank you.