Dr. Linares: Good afternoon. My name is Dr. Deborah Linares. I serve as a health scientist and project officer in the Division of Research within the Office of Epidemiology and Research at the Maternal and Child Health Bureau, Health Resources and Services Administration. The Division of Research provides ongoing support for maternal and child health, or MCH, extramural research activity, including the Engaging Research, Innovations, and Challenges, or EnRICH, webinar series. You are joining a community of more than 100 participants with an interest in advancing MCH research. The EnRICH webinar series provides technical assistance and in-depth methodologic updates, stimulating interest in applied and translational MCH research. Today’s webinar is entitled “Record Linkage and Data Integration for Maternal and Child Health Research.” Before we start, I would like to briefly introduce our speaker for this afternoon, Dr. Russell Kirby. Dr. Kirby is a distinguished university professor and Marrell Endowed Chair at the University of South Florida. He is a perinatal and MCH epidemiologist with training in human geography and preventative-medicine epidemiology. In his 40-year career, Dr. Kirby has worked on MCH issues in state health agencies and academic medicine, focusing on population-based research using most national and state-level MCH secondary data sources. I will now turn the program over to Dr. Kirby.
Dr. Kirby: Okay. Well, welcome, everyone. I’ve been watching as the names of participants pop up, and there are people I know and quite a few people I don’t know. So welcome to the webinar. What we are going to focus on is record linkage and data integration. And the problem statement, kind of setting the frame for this, is that MCH professionals work, really, at the interface of several different domains: public health, clinical care, programs, education, and other kinds of social service programs, as well. And rarely does a single database include data on all of the phenomena we might be interested in incorporating into our analysis. Record linkage is a technique that we can use to link records on mothers and children across databases, also over time, as well as potentially multigenerationally. And data integration provides a basis for the storage of linkage results that we can then use in future analyses. I always have to have learning objectives. And I’ve got three here. I think these are measurable. Hopefully by the end of the webinar, you will all be able to differentiate between deterministic and probabilistic linkage methods and have some thoughts on how to select the appropriate methodology for whatever problem you have. You will be able to describe frameworks for data integration of population-based perinatal health data and, hopefully, also identify examples of research questions in our field that require record linkage in order to obtain the necessary data for analyses. We also have to have a disclosure statement. And the long and short is, even though I’ve been working in this field for a long time, I don’t have any current funding related specifically to this presentation, and, of course, we harmed no laboratory animals in the creation of this talk. And, then, more broadly, conflict of interest, which I need to disclose. Again, there’s nothing really that has any bearing on this particular presentation. But I do have relationships with the March of Dimes. I’ve worked for 
several pharmaceutical companies on scientific advisory committees for postmarketing exposure and so on. But none of that has anything to do with this presentation. So, let’s start with the beginning. What is record linkage? This is a quote from Ivan Fellegi, and for those of you who are interested in the history of record linkage, look up Ivan Fellegi on Google, and you’ll find that he’s really one of the pioneers in terms of the use of record linkage. Statistics Canada has, for over a half-century, been one of the leading places where methodology for record linkage has been developed. But basically the idea is that if we assume that there are records that relate to individuals and relate to entities, then record linkage is the operation that uses identifying information we find in a single record to seek another record in the file, or another record in another file, that refers to the same entity or individual.

And based on that, record linkage has been around for a very long time. Genealogy is a form of record linkage. You can even argue that the Bible has some aspects of record linkage to it, if you read it closely, as well. But in public health, the modern methods that we use really only date back to the 1960s, and the broad use of record linkage in public health, and in MCH more specifically, is really more of a phenomenon of the 1990s up to the present. This is just an example of another form of record linkage. I got this out of the Birmingham News when I worked in Alabama, and they’re looking at the organizational chart, which has a lot of linkages, and wondering if it’s an org chart or maybe it’s a family tree. And you’ll have to be the judge of that. But thinking about record linkage more specifically, I’m going to walk us through the basic questions of who, what, why, when, where, how. And we will look at each one of those and get a little bit of insight into them. As to which one is most important, I’m not actually sure which one. I think several of them are quite important. But I think probably, thinking about it from the standpoint of “why” is a really good idea. The very first thing that you need to consider if you are wanting to get information from more than one database is, what is the purpose of your study, and does record linkage really make sense? And although many people who know me think I’m a guru of record linkage, I always take the first step of thinking about, could I actually answer the question that has been posed without doing record linkage? Could I just do calculations based on examining numerators and denominators or even calculating ratios without doing record linkage? 
And if that could be done, the amount of effort that you go through to do record linkage may be much greater than the information need of the question that you’ve been asked. So you really have to think about that. But then if you do decide that record linkage is needed, you really need to think about, how can you structure the record linkage so that the results of the record linkage potentially can be useful to other people? It doesn’t make sense, for example, for eight different research groups in a state all to be doing transgenerational linkage of birth certificates to the birth certificate of the mother. That should be done once. An agency within the state health department should store that information and potentially make it available, as appropriate, to others, rather than redoing the same linkages over and over again. We also have to think about whether the record linkage is technically feasible. And sometimes record linkages that might sound like a good idea really turn out not to be. And then, again, as I mentioned, about whether record linkage is really necessary. Turning to “how,” there’s a variety of different questions we have to think about in terms of this. The first one is manual versus automated linkage. And sometimes a computer-assisted linkage where you’re looking at data on a screen might be sufficient. I had a project when I worked in Alabama where I had a list of children who had been diagnosed with autism spectrum disorder, and I wanted to find their birth certificates. And, you know, the number of records was only a few hundred, and they all were born in the same year. And rather than doing record linkage with a computer-assisted approach, actually merging the records with a program, I arranged to go down to the state health department and sat in the Vital Records office and pulled up the birth-certificate file on my screen and did a manual search, and I was able to get the whole job done in a day. I probably would have spent at least a 
week of programming time to do the job if I did it in an automated fashion, and I might not have had as good a success rate. So you have to really think about that in terms of making a choice. Another issue you have to think about is what kind of methodology really makes the most sense for your particular problem. And, in general, record-linkage methods fall into the categories of deterministic and probabilistic. I’m going to get into some of the details, the differences between them, a little bit later in the webinar. But the basic difference is that with deterministic methods,

you are trying to come up with exact matches, whereas with probabilistic methods, you are using weights and establishing probability of linkages rather than exact matches. And both of them are scientifically valid approaches. The probabilistic methods probably have a stronger theoretical basis because they can relate back to a whole body of statistical theory based on probability. But they are both valid methods to use. The major reason why we might use deterministic methods is that we might have detailed personal identifier information available, and if we have that, we may consider more heavily using deterministic methods. But you have to have identifiers in order to do the linkage. And without them, you’re really going to have a challenge. Basically, that means that you need to have variables that can be found in both of the databases that you want to work with and can be arrayed in a similar fashion, so that you have a similar way that the records are stored, similar variable names, and so on. You also, in terms of record linkage, have some special challenges that come up in terms of names and dates, and, again, we’ll talk a little bit more about both of those later. Names are great, but names are not always typed exactly the same. They’re not always stored exactly the same. There are variations in spelling that may be cultural and may have other aspects to them. And dates are potentially issues, as well. Sometimes you have incomplete dates. Sometimes you have transposed month and day and all sorts of things like that. Finally, in terms of the question of software, should you develop your own algorithms? Should you buy specialized software? Should you use a statistical software package? 
That is a question that comes up a lot. In this day and age, there are a lot of options in terms of software that’s been developed by government agencies that are pretty much freely available for use by public health specialists, but you can also develop your own. Here at the Florida Birth Defects Registry where I work, we actually use algorithms that we developed and do them within SAS. But we’ve been evaluating and improving them for 15 years. And, then, finally, if you’re going to do record linkage, it is imperative that you evaluate the linkage results. Irrespective of how you did the method, you want to be able to say something about how generalizable your findings are, who are the people who aren’t linked, and so on. Who should do the record linkage? This is one of those things that comes up, as well, in terms of the personnel. Should you have dedicated linkage specialists? If you have programs set up, can any statistician do it? And, then, of course, the question of whether the linkage staff should be subjected to personality profiles. Some of them develop a psychosis that I call “the urge to merge disorder.” And maybe some of you on the phone actually have that. I don’t know. But there’s also the question when we talk about “who” of defining what records are eligible to be included in the linkage. And, again, you have to think about that very carefully because sometimes the reason why a record may or may not be included in one of the databases that you’re using could have an implication for whether they’re likely to be found also in your target data set. So you have to think about that, as well. Then there is the “what.” What databases should we link? And, again, that has to be thought about very carefully. What are functional relationships between the records in each of the candidate data sets? And if you actually achieve the linkage, does it result in data that allows you to answer the question that you’re doing the linkage for? 
And then, again, also, how does the linkage support the needs for which the linkage was proposed? And it is very easy for the linkage to become the tail wagging the dog. It can sometimes become so complex and use up so many resources that it becomes the primary goal when really it is just a small piece of your larger research plan. So you want to make sure that you keep it in perspective. And, then, in terms of “what,” having a plan for how you’re going to store the data at the end is definitely very important, as well.

Then we have “where.” Where should the linkage be done? I have seen it done in a lot of different places. I have seen it done in the health statistics agency. I have seen it done in epidemiology agencies. I have seen it at university research centers, sometimes contracted to those. I have also seen it contracted to outside vendors. All of those might be appropriate. I worry the most about contracting it outside in that you potentially lose control over the process and potentially don’t get back all the information about the linkage and the linkage files at the end. And, then, there is the question of where and how should the linkage results be stored. And, again, do the researchers keep them, or do public agencies retain them? I think the public agencies should retain them. I think we should be building repositories of linkage results so that, over time, we are able to build up the ability to link across a wide array of different data sets and domains. And, then, of course, that requires a data structure that can support the storage of that information, either a structure that maintains the common link fields that each record might have been linked to or building a more complex relational structure. And, then, of course, the question of building full-linked files or stored linkage identifiers. At the minimum, you have to store the linkage identifiers. If you don’t do that, you really don’t have the linkage anymore. And, then, of course, I always like to point out, since I’m a geographer, that geocoding is a form of record linkage. It’s a form where we take locational information that might be present on our records and link to either an address file or other kind of administrative geography file, and that enables us to then link our records with other information. And, then, of course, when, how often should linkages be done? 
I don’t have a great answer to this. Here in Florida, we do it when we have the resources to do it because it is a fairly lengthy process. But some linkages probably need to be done immediately. I would say with infant deaths, I think infant death records should be linked to birth certificates immediately upon being filed. There’s all sorts of programmatic imperatives that require that, but very few states actually do it. On the other hand, the periodicity also can vary depending on when the various data sets are actually created. Hospital discharge records might be provided on a quarterly basis, but in some states they might be provided annually. Likewise, there could be registry needs. If you’re working with a registry that works based on passive case finding, they may have a need to annually re-create their data set. But they might also need to do something more frequently than that. Other elaborate linkages might be done a little bit less frequently. So, next, I have a few diagrams. I’m going to give you a few examples of diagrams for data integration. And these are taken from a variety of different programs that I have seen over the years, and we will talk a little bit about each. This first one is one that I put together, which is actually a model that we use in Florida. Basically, this has to do with linking data across pregnancies and across generations. What we do is we link the birth certificate records firstly to the child’s hospital discharge records but also to mothers’ hospital discharge records, and we do that linkage longitudinally so that we have hospital data for the child going from the first year, which is 1998, potentially up to the present. We do the same for mothers’ hospital records, where we’re also able to link before the birth, as well as after the birth. And, also, although we haven’t operationalized it ourselves, it’s a good idea to link the birth certificate of the child to the birth certificate of the mother so you can look at transgenerational effects. If 
you’re really interested in a life-course approach or in studying social determinants, there’s a lot of interesting things you can do if you make that kind of linkage. But the final thing that we do is we also link the birth certificates across mothers so that we can identify the sibship patterns for individual mothers.
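
The sibship idea can be illustrated with a minimal Python sketch. It assumes, hypothetically, that an earlier linkage step has already stamped each birth record with a maternal link identifier; the field names (`birth_id`, `mother_link_id`, `birth_date`) and the sample records are invented for illustration and are not the Florida system's actual layout.

```python
from collections import defaultdict

# Hypothetical birth records that already carry a maternal link ID
# assigned by an earlier linkage step (all values are made up).
births = [
    {"birth_id": "B003", "mother_link_id": "M10", "birth_date": "2002-11-30"},
    {"birth_id": "B001", "mother_link_id": "M10", "birth_date": "1999-03-14"},
    {"birth_id": "B002", "mother_link_id": "M22", "birth_date": "2000-07-01"},
]


def build_sibships(records):
    """Group births by mother and order each group by date of birth."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["mother_link_id"]].append(rec)
    # ISO-formatted dates (YYYY-MM-DD) sort correctly as plain strings.
    return {
        mother: [r["birth_id"] for r in sorted(recs, key=lambda r: r["birth_date"])]
        for mother, recs in groups.items()
    }


sibships = build_sibships(births)
```

Once sibships are stored this way, birth order and interpregnancy intervals can be derived per mother without repeating the linkage.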

Ultimately, if I was able to access the individual record of education data, I would be able to build educational outcome profiles that could be studied within families, for example, but I don’t have the ability to do that yet. But that’s the direction that one might go. This is an example of the PELL data system. This is from the state of Massachusetts. And PELL is Pregnancy and Early Life Longitudinal study. It is very similar in the middle, in the core, with linking vital records data to hospital records. But then it moves out and links with a variety of other program data: newborn hearing screening, Birth Defects Registry, program participation data, death certificates, hospital services utilization, and other sources. This is an example of how one might use the kernel that I just showed you and then link outward to other databases, as well. This one is from the state of North Carolina that shows, for their birth defects monitoring program, where they have a central registry, which is conveniently in the middle, and they link to vital records, birth and death certificates. They also link to Medicaid records, both for mother and baby, but they link to a variety of other service programs: early intervention, WIC, child service coordination, Health Department records, clinical records, and so on. That is another example of linkage where they start with the birth defects registry record and then kind of link out from there. And, then, this is a diagram. I don’t want you to pay attention to any specific element of it, but this is a diagram for the state of Florida that shows the variety of record linkages that we do. Some of the ones that I showed you a few minutes ago are embedded in this diagram. But the idea is to link the vital-records information but link out to hospital, link out to program data. We have a program called Children’s Medical Services. Link to birth defects and cancer, link to WIC, link to the Florida State Healthy Start data, and then link to a variety of 
other programs that we use a lot in MCH like PRAMS, immunization registry, and so on. Down here, I wanted just to call out that Florida, for a number of years, participated in a program that CDC manages called the SMART, the States Monitoring Assisted Reproductive Technology, and this was the program where the state sent individual identifiable records to the CDC, and they were linked with records about assisted reproductive technology, and that supported a lot of interesting population-based research on outcomes of A.R.T. This is not a diagram of databases, but it’s worth thinking about in terms of what direction one might be going. If you’re thinking about child health, there are really three major domains that you need to be thinking about, in addition to looking at things like diagnoses and so on. Firstly, children are growing, and you want to think about how we can measure their growth and take stock of that in a number of different dimensions. But the child is also developing, and the child’s development doesn’t occur necessarily on a linear path. Children who have disabilities or other kinds of impairments or delays may not be progressing as quickly as others. And then education. The child is learning like crazy throughout early life and into early childhood. And these are overlapping domains that need to be thought about. We don’t usually, in public health, have databases where we can measure all of these, particularly on a population level, but we should always be thinking broadly about child health whenever we’re considering it. And just to that end, I have tried to put together in this diagram what databases might look like that could capture some of that information for child health. The kernel of this diagram is what I call the Kirby Master File, or the KMF. I did not make up this name. It was made up by an assistant administrator at the Wisconsin Division of Health about 30 years ago when he was exasperated as I kept talking about 
this. But the idea is that you have as your kernel a database that includes information on all the children. And ideally it goes back to their birth certificate.

If the child was not born in your jurisdiction, you want to create some kind of a dummy record that has enough identifying information so that you can still link to other data sources. And, then, of course, these records should be linked to death certificates and pediatric cancer cases. We should be linking to hospital discharge data and, if we have them, ER and other outpatient data. If you happen to have all-payer claims, go for it. That should be linked in here, as well. But, then, the part that is a little bit more, I guess, proactive is thinking about what kind of educational data we could potentially link with our records. And so we have the child’s health status and school readiness at school entry. And there are a lot of different names that people have for what that might be, but that is very important. Data on special education placements and the reasons why children are in special education, and then educational outcome data, are all important to be able to link together with this. And, then, of course, we need to be able to link with birth defect surveillance data and other data that might measure special needs. I’ve got that down here. Then, of course, developmental disability surveillance, if that happens to be happening in your state. So that’s a model that you can think about for how to integrate data for looking at child health, growth, and development. This is a diagram that we published in our textbook on perinatal epidemiology. This was published about 10 years ago. And, again, there’s some additional databases that I did not touch on that are also included in here. I think I have immunization registry, child abuse and neglect, Child Protective Services, blood lead screening, potentially developmental disability services, if your state has them, and, of course, newborn metabolic screening and hearing screening and so on. So there is a whole array of domains of data that we can potentially link. But the key thing is you need a central holding place for it. And one of the 
cool things about this, if you think about it: let’s say hypothetically that you have linked the immunization registry with birth certificates and you have also linked the early intervention data set with birth certificates. By virtue of that linkage, you can also evaluate, for example, are children who are in early intervention up to date on their immunizations? Are there opportunities where we could make improvements? By virtue of the fact that you have them linked to a common data set, you also can create a data set where they are linked to each other. And this is a diagram that I just received a few weeks ago from my colleague Kay Johnson, who many of you may know, and she was thinking about developmental screening and all the kinds of programs that might be related to developmental screening and put this together. And, again, you can see that there’s a number of different public health programs that play a role in terms of developmental screening. And I’m not sure exactly how one would operationalize this from a linkage point of view, but it’s definitely something to think about in terms of looking toward the future. A few broader thoughts and concerns about record linkage. There’s the question of how might we incorporate our integrated health records and integrated databases. And we have some challenges with this. Right now a lot of states are wrestling with the problem of Neonatal Abstinence Syndrome, or N.A.S. Some people call it NAS. Some people look a little more narrowly just at opioid withdrawal syndrome. But this is something that a lot of state health departments are wrestling with right now. And the question then becomes, how can we put together a better approach to studying N.A.S.? 
One of the problems is that if all we have is hospital discharge data, we are only going to have information about a diagnosis, which may or may not actually be a valid diagnosis. We don’t have information about how the diagnosis was made. And we don’t know very much about treatment. We don’t know very much about long-term outcomes. And we have to think about how we can potentially pull clinical records in so that we can look at that more systematically. And, then, there’s the question, what can we learn from health claims? There are a few jurisdictions that have all-payer claims data, or APCD. But, again, how do we use these data? What are some of the methods we need to use for linking them with other sources and for analyzing them? I’ve done some work with claims data where the major challenge is that we don’t have the sociodemographic information

that we’re used to having in our public health databases. And if you link those records to public health databases, then you can retrieve that information and be able to analyze it better. Then I mentioned data structures. The data integration is really important and should be thought about at the beginning of a project and not as an afterthought, because the way that you do your linkage, or the things that you want to actually use the linkage for, to some extent are related to how you store the data. I’m going to move on. I just have this quote: “If you always do what you always did, you will always get what you always got.” And that is probably true in a lot of other aspects besides record linkage. I wanted to spend just a few minutes talking more specifically about research methods and, in particular, there’s two different classes of methods, deterministic and probabilistic. And we will take a look at both of them, starting with deterministic data linkage, and look at some of the key issues and concerns when we use this approach. So, firstly, if you want to do deterministic linkage, you need to make sure that you have variables that are common to both data sets, that are stored in a similar manner and coded in a similar manner. So, again, my examples here are primarily based on SAS, which I find a very useful environment for teaching principles of record linkage …but do a PROC CONTENTS and look. And then I just want to make... this is just a huge caveat. I can’t express it often enough. If you are using relational database software where you think that there are variables in different tables that represent data from different sources and that you can just do a join, don’t. You need to do a lot more looking at the data before you are going to get any kind of useful result from that. There are just so many differences in the way that information is stored that you will wind up with something that is not a very good result. And likewise, if you are just going to use a single 
identifying variable or require a match on that variable together with others, don’t. You need to gain strength from multiple variables in your linkage. Social Security number is great. It’s carried on a lot of health records, and it’s great, but if you rely only on the Social Security number, there is going to be a sizable proportion of your data set that you are not going to be able to link at all, and you might even throw them out because you could not link them, and that is not good. Because typically what we find is that the unlinked records in any record linkage are often very interesting and important. Sometimes they’re very high-risk cases, and throwing them out of the database leaves you with an incomplete understanding of the nature of the problem you’re studying. But on the other hand, once you have linked the records, creating a common identifier that you store in both data sets would be really great because then you would be able to put the data sets together readily in the future. Okay. 
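
As a rough illustration of matching on several variables at once and then stamping a common identifier on both files, here is a minimal Python sketch. It is not the SAS code used in the talk; the field names, the `norm` cleaning rule, the `L00001`-style link ID, and the sample records are all assumptions made for the example.

```python
import itertools

KEY_FIELDS = ("mother_last", "child_dob", "hospital")  # assumed link variables


def norm(value):
    """Uppercase and keep only letters and digits; missing stays missing."""
    if value is None:
        return None
    cleaned = "".join(ch for ch in str(value).upper() if ch.isalnum())
    return cleaned or None


def link_key(rec):
    """Composite key over all link variables, or None if any one is missing."""
    parts = tuple(norm(rec.get(field)) for field in KEY_FIELDS)
    return None if None in parts else parts


def deterministic_link(file_a, file_b):
    """Exact match on the composite key; unique matches get a shared link_id."""
    index_b = {}
    for rec in file_b:
        key = link_key(rec)
        if key is not None:
            index_b.setdefault(key, []).append(rec)
    counter = itertools.count(1)
    linked, unlinked_a = [], []
    for rec in file_a:
        candidates = index_b.get(link_key(rec), [])
        if len(candidates) == 1 and "link_id" not in candidates[0]:
            link_id = f"L{next(counter):05d}"
            rec["link_id"] = link_id            # stored on the first file
            candidates[0]["link_id"] = link_id  # and on the second file
            linked.append((rec["id"], candidates[0]["id"], link_id))
        else:
            unlinked_a.append(rec)  # kept for review, not discarded
    unlinked_b = [r for r in file_b if "link_id" not in r]
    return linked, unlinked_a, unlinked_b


births = [
    {"id": "B1", "mother_last": "O'Brien", "child_dob": "2001-05-02", "hospital": "H12"},
    {"id": "B2", "mother_last": "Garcia", "child_dob": "2001-05-03", "hospital": "H07"},
]
screens = [
    {"id": "S1", "mother_last": "OBRIEN ", "child_dob": "2001/05/02", "hospital": "h12"},
    {"id": "S2", "mother_last": "Garcia", "child_dob": "2001-05-03", "hospital": None},
]
linked, unlinked_a, unlinked_b = deterministic_link(births, screens)
```

The unlinked lists are returned rather than dropped so they can be studied or passed to a looser matching step, in line with the point that unlinked records are often the most informative.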
This was a little aside. Let’s say we have two different data sets. We have a birth-certificate data set. We have a newborn-screening data set. These are examples of four different variables that we might have on both of the data sets. We might have information about the mother’s name. We might have information about the date of birth of the child. But if we want to link on these, we have to make sure that these are all stored in the same way. When we’re looking at the names, for example, maiden name is another variable that might be there. It’s possible that in one of these data sets the maiden name has been stored rather than the legal last name. We have to make sure that the data are the same, and we have to make sure dates are formatted in the same way across both data sets, as well. Usually, I find I have to break the dates down into month, day, and year and recast them in a different metric in order to be able to assure that I am linking correctly. And, then, we also have, similarly, variables that are common on the child. The last, middle, and first name of the child, the gender, the date of birth. Maybe the newborn screening doesn’t have the date of birth, but it might have the screening date, and you can make some assumptions about dates that can potentially enable you to link there. Another important variable for this linkage is the hospital of birth,

which can be very useful in blocking your analysis. In fact, here it is here. Hospital. And, then, of course, the ZIP code of maternal residence is another. These are all fields that potentially could be useful in doing a deterministic linkage. What we next have to do is to look for missing data in the linkage variables. And you know what? If you have a record in one of your data sets that’s missing on all of the attributes you’re trying to match to and there are other records in the other data set that are missing on all the attributes, they’re going to match because they match on being missing. So you have to think about, what do you want to do with that? But, again, for the decision about how you want to handle that, what people sometimes do is they do a sort on the set of variables they are trying to do their linkage on and exclude any records that are missing, but don’t throw them away. Store them in another data set where you can pull them back in for further processing later. But you do have to be concerned about how you handle that. And, then, again, you’re looking for records that share the same values in each of the databases. And then when you find records that do share the same values, the question is, what do you do with those? And you also have to think about, what variables are the best variables? And, again, there are some scientific approaches to thinking about this, but, really, common sense can help a lot. If you have a variable that has a lot of missingness, probably not a great choice. And, then, again, what do you know about the variables? What kind of information do they have? How specific is it? 
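
The handling of missing link variables described here can be sketched in a few lines of Python. This is only an illustration under assumed field names (`mother_last`, `child_dob`, `hospital`) and invented records: incomplete records are held in a separate list rather than deleted, and a simple missingness rate gives a first look at which variables are weak candidates for linkage.

```python
LINK_VARS = ("mother_last", "child_dob", "hospital")  # assumed link variables


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def split_on_missing(records, link_vars=LINK_VARS):
    """Separate records with complete link variables from those held back."""
    complete, held = [], []
    for rec in records:
        if any(is_missing(rec.get(var)) for var in link_vars):
            held.append(rec)  # stored, not thrown away
        else:
            complete.append(rec)
    return complete, held


def missing_rate(records, var):
    """Share of records in which one link variable is missing."""
    if not records:
        return 0.0
    return sum(is_missing(rec.get(var)) for rec in records) / len(records)


screening = [
    {"id": "S1", "mother_last": "Lee", "child_dob": "2003-01-09", "hospital": "H01"},
    {"id": "S2", "mother_last": "", "child_dob": "2003-01-10", "hospital": "H01"},
    {"id": "S3", "mother_last": "Ortiz", "child_dob": "2003-01-11", "hospital": None},
    {"id": "S4", "mother_last": None, "child_dob": None, "hospital": None},
]
complete, held = split_on_missing(screening)
hospital_missing = missing_rate(screening, "hospital")
```

Only the `complete` list would go into an exact-match step, so two records can never link merely because both are blank; the `held` list stays available for later processing.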
You have to think about those kinds of things. We always recommend that you use the most discriminating combination of variables first and loosen the criteria as you go along. In terms of most strict to least strict: if you think about it, gender is not a particularly useful discriminating variable because in babies, almost all of them are going to fall into one of two categories. There’s a little bit of noise in terms of that, but they’re going to fall into one of two categories, and that’s not going to help you very much in linking the records because it doesn’t block you very specifically. But you want to look, again, at the variables and make decisions based on that. And then you want to start with your most strict criteria for linkage. If you can get a perfect match on a vector of five different variables, that is probably going to be a pretty good match. On the other hand, if the record falls through and only matches on linkage step six, where you have a perfect match on only two of the variables, you’re going to need to evaluate that record more closely to see whether it truly is a match. And, again, you start with the most strict and you go to least strict. Again, you always want to set things up so that you can merge back with the original data set. And you have to create some kind of an I.D. number that you retain across all the records no matter where they end up in your process, because you can very easily end up with a large number of files that you are working with, and you want to make sure that you don’t end up with multiple copies of the same record in the data set. And, again, really important to use BY variables. I’ve just made up this example here. Let’s say we have a birth-certificate file and some kind of medical record file, and we just say DATA LINKED; MERGE BCERT MED; You know what’s going to happen if you do that without a BY statement? 
If the birth-certificate file has N records and the medical record file has M records, you’re going to generate N-times-M records, because every birth-certificate record is going to match with every medical record, and that’s not going to be very useful. BY variables are essential. And you want to use them in a way that allows you to discriminate across the different variables that you want to match on.

It’s also very important to unduplicate. You actually need to unduplicate your files before you’ve linked and then again after you’ve linked. In SAS, NODUPKEY is a useful option. And there’s also a way that you can export the NODUP records so that you can use them for future analyses. Multiple births are a lot of fun, and that’s frequently a group we need to process separately from the rest of the birth certificates, and, again, we can subset them out here. And then you want to merge
by the linkage variables that you have chosen. You want to create a data set that only has the linked records, and keep track of what link level the records merged on. I’ve seen deterministic algorithms that have as many as 250 different steps in trying to come up with different combinations of variables. But you don’t want to throw away the records that fail the match at each step. You want to pull them in so that you can reanalyze them.

Also, if you remember back to the probability class that you had in your introductory statistics course, consider full replacement. It might be better if you actually run the entire data set through each link level. Then you’ll be able to get much more information about what links are possible. The problem is, it might actually be that you have a transaction record, say a newborn-screening record, that wants to match to more than one birth certificate. If you pull out records at the first match, that record doesn’t get a chance to possibly match with another record later on. But sometimes there are errors in our data that can potentially allow that to happen. So you want to be thinking about all the possibilities.

Okay. Then you want to put everything back together. You merge it all back together. What I typically do is create updated unlinked records to go to the next level of the linkage algorithm, and then put it all together. Always study the unlinked records. They are very interesting. They may actually tell you more than what you learned from other things. Be looking for bias, looking for systematic errors, hopefully things that you can potentially correct.

And then you want to evaluate the quality of the records that you actually have linked. I don’t ever want to see anybody on this call publishing a report about record linkage where they don’t tell me something about the linkage itself. I don’t want to see a methods section that says, “We linked hospital discharge records for infants with their birth certificates and analyzed 122,422
records.” That doesn’t tell me anything about which records didn’t match and what you did to try to learn more about your data set. Epidemiology is all about understanding what our reference population is, and if you don’t give me that information, I don’t know. So make sure you do that.

Okay. We are running a little late on time. I wanted at least to spend a little bit of time on probabilistic linkage, so I’ll do that really quickly. The idea here is that we use probabilities to determine whether a particular pair of records, one from each data set, refer to the same individual, and we calculate weights to quantify the likelihood that a particular pair are a true match. This is computationally intensive because we’re basically comparing each record in one data set with every record in the other data set.

Again, our probabilistic weights can be either nonspecific or specific to particular values in the data set. Thinking about nonspecific weights, we may be just looking for agreement on a particular variable. For example, direct agreement on date of birth gets a higher weight than a match on sex. Date of birth is much more specific: there are 365 different values, or 366 in a leap year, compared to sex, which typically has only two values. But then a disagreement on sex should get a higher penalty weight than a disagreement on date of birth. That makes sense, I hope.

Value-specific weights relate to particular variables and the values that they might hold. For example, the letter Z might get a higher weight than the letter S, but disagreement on the letter S might also be given a higher penalty than disagreement on the letter Z. The weights allow us to objectively reflect our confidence in a match. But there is individual choice involved. And we also have to think, when we’re all done, what is the overall score that we want to set as our cutoff for throwing out matches that are not sufficiently high probability?
And that is a subjective activity that you have to really know your data to understand.

Now, probabilistic linkage methods. A lot of people write their own programs. There are a lot of packages out there. Some of them are expensive and difficult to use. Pretty much all of them are actually a little complicated until you understand the basics. Some of them are actually available as freeware or shareware.
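
To make the agreement and disagreement weights concrete, here is a minimal Python sketch. The talk does not name a weighting formula, so this assumes the common log-ratio form (often called Fellegi-Sunter style), where m is the probability that a field agrees for a true match and u is the probability that it agrees by chance for a non-match. The m and u values and the cutoff below are illustrative assumptions, not figures from the presentation.

```python
import math

# Assumed probabilities, for illustration only:
# m = P(field agrees | true match), u = P(field agrees | non-match).
FIELDS = {
    "dob": {"m": 0.95, "u": 1 / 365},  # many possible values, so chance agreement is rare
    "sex": {"m": 0.99, "u": 0.5},      # two values, so chance agreement is common
}

def agreement_weight(m, u):
    """Positive weight added when a field agrees."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Negative weight (penalty) added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))

def pair_score(rec_a, rec_b, fields=FIELDS):
    """Sum field weights for one candidate pair of records."""
    score = 0.0
    for name, p in fields.items():
        if rec_a[name] == rec_b[name]:
            score += agreement_weight(p["m"], p["u"])
        else:
            score += disagreement_weight(p["m"], p["u"])
    return score

CUTOFF = 5.0  # hypothetical threshold; choosing it is the subjective step

a = {"dob": "2015-03-01", "sex": "F"}
b = {"dob": "2015-03-01", "sex": "F"}
c = {"dob": "2015-03-01", "sex": "M"}

for name, p in FIELDS.items():
    print(name, round(agreement_weight(p["m"], p["u"]), 2),
          round(disagreement_weight(p["m"], p["u"]), 2))
print(pair_score(a, b) >= CUTOFF, pair_score(a, c) >= CUTOFF)
```

Under these assumed values, agreement on date of birth earns a larger weight than agreement on sex, while disagreement on sex carries the larger penalty, which is the pattern described above. Value-specific weights would replace the single u per field with one u per value, for example a smaller u for a rare initial such as Z.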

And I have this slide here. Don’t pay any attention to the dollar amounts. I haven’t really researched that in a long time. But Automatch is a program that has been around since the 1990s. It’s very expensive. Top-end corporations use it. It’s probably built into Oracle software. GRLS is another program that’s been around for a long time. It’s built into Oracle’s healthcare software. But there are a number of other programs out there that a lot of people use. LinkPro is a program that is commercially available. There’s also a version of it called Links. This is the software that was initially developed by the University of Manitoba for their integrated provincewide database. But there’s a whole bunch of other programs. And I think somebody asked a question about what CDC might have available. The Link King and Link Plus are two examples of software out there that were developed either by CDC or SAMHSA. There are others, as well. There’s also open-source freeware. FEBRL is a program that was developed at Australian National University. FRIL is a program developed at Emory but with input from people at CDC’s Birth Defects Center. These are very flexible programs. They’re not ideal.
They take quite a bit of time to learn. But they can generate very high-quality results. So you have choices on that.

Finally, linkage evaluation. I did want to spend just a minute on this. It’s a lot easier to do linkage evaluation with probabilistic methods because you’ve already built the process for it into your algorithms. But you still have to decide at what level of tolerance you will accept matches, and that, again, is something that is subjective. There are ideas out there in the health information management literature, but you really have to make your own decisions on that.

And then, again, document, document, document. I can’t tell you how many times I have encountered colleagues in state health departments who have been hired specifically to do record linkage, and they get there, and they have to start over from scratch because nobody’s kept the metadata that are necessary to enable them to do their job. So, document, keep track of everything that you do, and create kind of a resource book that outlines all of your methodologies and the things that you learned from doing it this time that you’re going to improve next time. Document all of that, because you might think that if you go through this process once, next year you can just run the whole program script and just change a couple of dates. And it doesn’t work that way. You really have to be paying attention at every step through the process.

We talked about linkage and data integration. Are they required? The answer is maybe, maybe not. A few things we need to think about: How precise is the need to know? Can the question be answered through calculations? Do we actually have individual-level records that have appropriate identifiers, and could we get the necessary permissions to do the linkage? How accurate do we want our linkage to be? What’s the match rate that we’re trying to achieve? Do we have the resources necessary to conduct and evaluate the linkage project?
Do we have the resources to analyze the data when we’re done? Can we store the results so that they can be used later and potentially be used to inform future analyses? All of those things need to be thought about.

I did want to give you my contact information if you want to contact me. You can also give me a call. And I will be happy to follow up with anything that you might want to ask me. I’m going to turn things back to Deborah. I think we might have run a little over. But I’ll turn things back.

Dr. Linares: Thank you so much, Dr. Kirby. What an informative and interesting presentation. We really appreciate you taking the time to share your expertise with the MCH community. We are now ready for the question-and-answer period. Our first question is, “How feasible is it to link PRAMS data with insurance claims data, ideally from one year before pregnancy to three years after delivery?”

Dr. Kirby: So, asking about linking PRAMS records. The first thing to note about that is that you probably don’t want to link PRAMS records. You want to link the birth certificates associated with PRAMS respondents because the PRAMS record itself isn’t going to have identifiers that will enable you
to do that kind of linkage. You really have to think about the nature of your inquiry to decide what kind of window you want to work with for linking back to health care records. Firstly, it depends on what health care records you’re working with. If you’re working with Medicaid records, you want to parse through the Medicaid data set and make sure that the mother was continuously eligible for Medicaid during the time period that you are trying to link records for. That’s going to subset things in possibly some ways you don’t like, but you’ll get incomplete data and potentially unusable data if you don’t do that. If you’re working, say, with a health plan (hypothetically, you might be in California and linking with the Kaiser Permanente database of Northern California, where you have a much more stable health-care population), that would probably work better. But you really need to think about the specific nature of the inquiry to decide. For example, for preconception health-care issues, it depends on what it is, as to whether you want to look just at three months prior to conception or a full year prior to conception or, if it’s a woman who had a previous birth, even trying to go back to right after the previous delivery. But it really is going to depend on how you want to do that.

Dr. Linares: Great. Our next question is, “If researchers are interested more in preconception health, pregnancy-related outcomes, and newborn health, are there any linked data sets available that you would recommend researchers use?”

Dr.
Kirby: That’s a really good question. I’m going to say it kind of depends. The unfortunate thing is that the United States is way behind many other Western countries in terms of thinking about data integration that would support that kind of research. We do have some sources that could be useful, but oftentimes they are databases that are hard to link. So, for example, I don’t know how many people on the call are familiar with the Listening to Mothers surveys. There’s a third wave, and I think they might actually be doing a fourth wave right now. This is a survey of women who have recently given birth, and it collects a wide array of information about the birth experience and how they interacted with health care providers, and it has some information on exposures and what kinds of health procedures they had during the pregnancy and so on. But you really can’t link it to anything because it’s a sample survey that doesn’t have any personal identifiers. So it gives us a lot of insights, but it doesn’t necessarily get us linked back to programs.

Likewise, we have a wide array of programs that collect information on early childhood and infant and toddler care. States have home visiting programs that are collecting data and so on, but that doesn’t necessarily enable us to look holistically. There are a lot of things we can do with it, but looking comprehensively from preconception care to early childhood, outside of being within one of the staff-model HMOs, is still a challenge. I’m not going to say that there aren’t any longitudinal research programs that would have those kinds of data. There probably are some. But for the kind of things that we do in population-based public health, it’s a bit more difficult. But we can always look to the future. It could be down the road.

Dr.
Linares: Great. Our next question is whether there are any state data sets linked with the National Survey of Children’s Health, especially for children with special health care needs. You mentioned earlier in your presentation some data sets for children with special health care needs and special education data.

Dr. Kirby: Exactly. The National Survey of Children’s Health was actually created in part to fill the gap, because we had much more limited data on special needs than many state Title V programs needed in order to understand their population and evaluate their programs. But the problem is the National Survey of Children’s Health is, again, a representative sample survey. It’s publicly available. It doesn’t have identifiers. I’m going to say it’s theoretically possible,
working with the Maternal and Child Health Bureau and the Bureau of the Census, you might be able to get permission to do a record linkage with vital statistics, for example. But I have not heard of anybody actually trying to do that, and I would think that, from a lot of perspectives, that would be a pretty challenging thing to actually do. What you can do, however, is build a state-level database of health care characteristics, economics, and other kinds of factors at the state level and use that information in a multilevel analysis, with the children nested within their state. But in terms of linking the NSCH directly on a record-level basis, I don’t really think we can do that. And I don’t know if there’s somebody from MCHB who wants to chime in, but I’m pretty sure that would be very difficult to do.

Dr. Linares: Thanks, Russell. We can inquire about that internally and get back to the person who asked that question. So, our next question is, “Do you have a recommendation for any linkage data sets to study the outcomes for children born to women with and without mental health and substance use issues?”

Dr.
Kirby: Ohh. That’s a really good question. It depends on what kind of programs the women are in. There actually are databases. SAMHSA has databases. Many states have databases relating to substance-abuse treatment that potentially could be used for that. Here in Florida, I have colleagues who have access to Medicaid data, and we’ve actually been looking at the association between mental health and substance use and risk for neonatal abstinence syndrome. The problem is that, again, it’s claims data. We don’t have a lot of demographic information that we can work with. And because of the nature of those data, the researchers who have them are prohibited from linking them with any other data sources. But I think there probably are similar databases available in many states, and it’s certainly worth exploring. Probably the best thing to do would be to talk with the state-level agencies that administer mental health and substance abuse programs in your state, learn what kind of data sources they have available, and see from there. I do know that in Florida our analyses would be much enriched if we had the ability to link those records with birth certificates, just for one example.

Dr.
Linares: Great. Thank you so much again for a great presentation and for sharing your expertise. If you did not get a chance to ask your question for the speaker, please still feel free to submit your question through the Q&A field. We’ll try to respond to your questions after the webinar. We are now almost at the end of our program. After this webinar, you will receive a request to complete an evaluation. We hope that you will fill this out and provide the MCH Division of Research with feedback on today’s event. Your response will help us plan future webinars in the EnRICH series. Thank you all for your attendance and participation. I also want to thank Jen Rogers, Rebecca Harnik, and Jim Wetherill at Altarum for helping to organize this event. An archive of today’s webinar will be available on the Division of Research website in several weeks. Have a wonderful afternoon, everyone.
