Gates Open Research

2572-4754

F1000 Research Limited

London, UK

10.12688/gatesopenres.15418.2

Data Note

Articles

Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling

[version 2; peer review: 1 approved, 2 approved with reservations]

Beatrix Haddock (1): Conceptualization, Investigation, Methodology, Software, Validation, Writing – Review & Editing

Alix Pletcher (1), https://orcid.org/0009-0009-1585-4766: Investigation, Methodology, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Nathaniel Blair-Stahn (1): Investigation, Methodology, Validation, Writing – Review & Editing

Keyes (1): Methodology, Writing – Review & Editing

Matt Kappel (1), https://orcid.org/0000-0003-2430-5661: Software

Steve Bachmeier (1): Supervision, Writing – Review & Editing

Syl Lutze (1), https://orcid.org/0009-0005-7858-0222: Investigation, Methodology, Writing – Review & Editing

James Albright (1): Software, Writing – Review & Editing

Alison Bowman (1): Investigation, Methodology, Writing – Review & Editing

Caroline Kinuthia (1): Project Administration, Supervision

Zeb Burke-Conte (1): Investigation, Methodology, Software, Validation, Writing – Review & Editing

Rajan Mudambi (1): Software, Supervision, Writing – Review & Editing

Abraham Flaxman (a, 1), https://orcid.org/0000-0001-6033-4713: Conceptualization, Funding Acquisition, Investigation, Methodology, Supervision, Writing – Review & Editing

1 Institute for Health Metrics and Evaluation, University of Washington, Seattle, Washington, 98195, USA

a abie@uw.edu

No competing interests were disclosed.

18 10 2024

2024

9 10 2024

2024

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed the work in developing, testing, and adopting new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets with configurable noise approaching the scale and complexity of the data on which large organizations and federal agencies, like the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods

We created the simulated population data available for noising with pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results

Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.

Keywords: entity resolution (ER), microsimulation

This work was supported by the Gates Foundation [INV-060835] and Cooperative Agreement CB21RMD0160001 with the US Census Bureau. The findings, interpretations, and conclusions expressed in this work are those of the authors and do not necessarily reflect the views of the US Census Bureau.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Revised Amendments from Version 1

The text of this manuscript has been revised to make it clearer that the simulated data described in this note is input to the pseudopeople Python package, and the pseudopeople package adds configurable data corruption algorithms such as noise introduced by typographic errors. We also edited to provide additional context around our simulated data approach, comparing it to perturbed-data approaches and work conducted in secure data enclaves, as well as adding a reference to another application of simulated data to record linkage that came out recently. In response to reviewer comments, we have also included additional limitations related to data we use to generate names and household structures for our simulants.

Introduction

Entity resolution (ER) is a foundational element of data science and has emerged as a crucial research task in a variety of disciplines, from the social sciences to epidemiology to forensics ¹. Put simply, ER is the process of linking the records corresponding to a single “entity” (e.g., an individual person) from one or multiple data sources when there is not a unique key on which to join them. In this context, an entity may be anything a row of data corresponds to, for example a person, household, business, or establishment. Record linkage of administrative data can enable the analysis of events across government services and systems ^{2–4}.

For researchers who work with large-scale, individual-level data, such as those working with the US Census Bureau, ER typically uses personally identifiable information (PII) such as name, address, date of birth, or government-issued identification numbers. Protecting PII is crucial to safeguarding individuals’ privacy, security, and personal well-being in an increasingly interconnected and data-driven world. As such, restrictions on access to these data have presented a barrier to developing and testing new methods and software for ER ¹. Although some research can proceed using perturbed or synthetic data, or with confidential data in a secure data enclave, technical hurdles inherent in these approaches appear to have kept them from removing this barrier so far.

In 2021, the US Census Bureau (USCB) awarded a cooperative agreement to the University of Washington’s Institute for Health Metrics and Evaluation (IHME) Simulation Science team to expand and improve ER methodological research and technology ⁵. As part of this work, we have used simulation to address the research barrier caused by PII. Through the development of a simulated version of various administrative datasets, including a simulation of the confidential data gathered by the USCB, we hope to help researchers develop new techniques for linking datasets together that are compatible with the privacy protections necessary for such sensitive and consequential information – and to do so without needing access to the real data. The goal of generating these data is to support ER methods research; the datasets themselves are not intended to replicate or reconstruct protected data for social scientific research. Our team is not the first to attempt such a data synthesis project. Prior approaches include the Australian National University’s “Freely extensible biomedical record linkage” ⁶ and Data Generator and Corruptor projects ⁷; and from the University of Arkansas Little Rock, the synthetic occupancy generator approach ⁸. There is also relevant work from the University of Edinburgh, which developed an R package for producing synthetic data called synthpop ⁹; and from the United Kingdom Ministry of Justice, which developed synthetic data for testing the Python package Splink ¹⁰. Recently, another group has used simulation specifically to generate data for record linkage, in what might be considered the first paper to use microsimulation for this purpose ¹¹. There are certain features of our project, however, that differentiate it from other efforts, most notably the scale of our simulated data: we have simulated a 100% sample of the USA over 20 years.

We have recently released pseudopeople, a Python software package that allows users to generate realistic simulated data about a fictional United States population over multiple decades. Both the generation and distribution of the dataset are governed by a system of “relational governance” ¹², in which data subjects play a central role; a paper on our governance approach is currently under preparation ¹². The package produces simulated datasets similar to census, survey, and administrative datasets routinely used by USCB in their data linkage practice, and it allows the user to configure the levels of noise in each dataset. This includes noise that leaves fields blank, chooses wrong options, replaces names with nicknames and fake names, swaps months and days in dates, misreports ages, and writes wrong digits in numbers and zip codes, as well as adding phonetic, optical character recognition, and typographical errors, following the approach pioneered by Christen and colleagues ^{6, 7, 13}. Readers interested in more details on how our pseudopeople software adds noise to the data generated by this simulation are referred to the pseudopeople documentation website, pseudopeople.readthedocs.io, which includes implementations and extensions of many of the data corruption approaches listed above.

The simulated datasets generated by pseudopeople are based on the results of an individual-based microsimulation built with our Vivarium simulation framework. This simulation is calibrated with real, publicly accessible data about the United States population, including realistic household and family structures, at a large scale. The purpose of this data note is to describe this simulation, which we hope will aid researchers in using our pseudopeople package to develop new algorithms and software.

Methods

Using Vivarium

Vivarium is a mature, open-source simulation framework ¹⁴ that uses standard scientific Python tools, such as NumPy and pandas ^{15, 16}. A simulation in Vivarium consists of user-written components that encapsulate the simulation logic, a machine-readable model specification that describes what components are in the model and how they are configured, and a data file containing all data used in the simulation. The framework provides a set of services to assist users in writing their model components, an engine for executing the simulations from both an interactive Python session and from the command line, and abstractions to help manage and format model input data. Models built in Vivarium are typically individual-based, representing people in a population as agents or “simulants,” each with their own age, sex, and other characteristics relevant to the specific model. They typically use discrete “time steps” at which events may occur. In this work, each Vivarium simulant represented a person living in the United States, and each discrete time step included changes relevant to data used in record linkage, such as births, deaths, moving to another address, and changing jobs.

We use a pre-established workflow when developing a Vivarium simulation, with roles for researchers and software engineers. The researchers lead the model development process through background research, conceptualizing modeling strategies, validating strategies with domain experts, guiding the conceptual development of the modeling software, and generating analytics for simulation inputs and outputs. The software engineers lead the development of simulation code, including model components and outputs, and tools supporting model and input data analytics.

In the following sections, we will cover the different input data sources and data processing strategies used to inform our simulation of the US population. We describe the Vivarium model components for simulant characteristics including basic demographics, household structure, mortality, fertility, migration, and employment dynamics. We also describe the addition of simulant names, physical addresses, employer names, and other attributes which we implemented as a post-processing step, rather than during the simulation itself.

Simulation time

We initialized our simulation to begin on January 1, 2019, and step forward in time with 28-day time steps until the simulation clock exceeded May 1, 2041. We chose this time step duration to balance the complexity of changes in demographics, housing, employment, etc. with the computational demand of running a simulation with over 300 million simulants.
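As a rough illustration of what this schedule implies, the sketch below counts the 28-day steps between the stated start date and the first clock value past the stated end date. The loop is only one reading of the stopping rule described above, and the function name `count_time_steps` is hypothetical; it is not taken from Vivarium.

```python
from datetime import date, timedelta

START = date(2019, 1, 1)    # simulation start date, from the text
END = date(2041, 5, 1)      # the run stops once the clock exceeds this date
STEP = timedelta(days=28)   # time step duration, from the text


def count_time_steps(start: date = START, end: date = END,
                     step: timedelta = STEP) -> int:
    """Count the steps taken until the clock first exceeds `end`.

    Assumption: the clock advances one full step at a time and the run
    ends as soon as the clock is strictly later than `end`.
    """
    clock = start
    steps = 0
    while clock <= end:
        clock += step
        steps += 1
    return steps


if __name__ == "__main__":
    print(count_time_steps())
```

Under this reading the run takes 292 steps, so each simulant is updated a few hundred times over the simulated period.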

Concept model

The concept model diagrams (Figure 1 and Figure 2) provide a visualization of the logical dynamics underlying this simulation and indicate how the various components of the simulation relate. The simulation components can be divided into three overarching categories:

i) simulation events (i.e., birth, death, migration, and employment change),

ii) simulant attributes (i.e., demographics, household structure, location, and government-issued identification numbers such as SSNs), and

iii) simulated dataset observers (i.e., how the simulants are observed over the course of the simulation, through routinely collected surveys and administrative datasets, such as the Decennial Census, tax forms, household surveys, and government-related social safety programs).

Figure 1. Simplified version of the simulation concept model diagram, denoting the four overarching simulation events that can occur (migration, employment, mortality, and fertility) and how each of these events is observed during the simulation run.

Figure 2. Concept model diagram showing the interaction between different components of the simulation, illustrated at the level of an individual simulant.

Each arrow in this diagram represents a dependence between two distinct components of the simulation; an arrow from component X to component Y indicates that X affects Y (for example, the employment component simulates changes in jobs, which leads to a change in income; the basic demographics of a simulant affect the probability of death so that older simulants are realistically more likely to die during a simulation time step).

Figure 1 shows what influences the occurrence of each simulation event and how these events are captured in our data collection, while Figure 2 shows how the simulant components interact with one another at the individual level. When a simulant undergoes an event (e.g., gives birth, changes jobs, changes address), the simulant’s attributes change accordingly. Those attributes are then captured by the observers.

Input data. We built the simulated datasets for pseudopeople from openly available input data, including data released publicly by the Social Security Administration (SSA) and the USCB. We derived physical addresses from the training data of the Python package libpostal, as repackaged by the deepparse project ¹⁷. In the sections that follow, we elaborate on how we used these data sources, and how our simulation could be extended to be even more realistic in future work.

Basic demographics. We initialized the simulated population’s demographic characteristics, including age and date of birth, race/ethnicity, sex, nativity (i.e., whether a simulant was born within the US), geographic area, and household structure by sampling from the 2016–2020 ACS Public Use Microdata Sample (PUMS) ¹⁸. By sampling from PUMS, we were able to match the univariate distribution of each attribute as well as joint distributions of arbitrary complexity between the attributes at the Public Use Microdata Area level, while also preserving structure within sampled household units. For instance, the PUMS data capture the age distribution of people in America, where more people were born from 1945 to 1965 than from 1965 to 1985. The PUMS data are not without limitations, however. For example, the granularity of the PUMS data is limited by privacy considerations, and specific details that might be crucial for a detailed analysis are sometimes obscured, which could affect the precision of simulations based on these data, especially in socioeconomic and health-related contexts.

Age is reported in PUMS in floored integer years, but our simulation uses precise ages in fractional years. We assigned simulants a uniformly random precise age consistent with their nominal age as sampled from PUMS. For ER research and development, it was particularly important that we did not generate simulants who are much more similar to one another than would be expected in a real population, which would make linkage unrealistically difficult. Our simulated population is the size of the US population, but every simulant is initialized from a person in PUMS, which is a 5% sample of the US. Therefore, many simulants are created from the same person in PUMS, which could create unrealistic clustering. To decrease similarity without assuming total independence between attributes, we perturbed age values at sampling time. In different components of the simulation, we sampled different entity types from the PUMS: entire households, individuals living in group quarters (GQs), or individuals living in households (non-GQs). For each entity sampled, we added a random age shift taken from a standard normal distribution to that entity’s age value(s). When perturbation led to a negative age value, we flipped the negative age value’s sign. We then defined each simulant’s date of birth to be consistent with their precise age.
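The age handling described above can be sketched as follows. This is a minimal illustration, not the simulation code: it assumes that one standard-normal shift is drawn per sampled entity and shared by all of its members, that the uniform fractional age is added before the shift, and that a year is 365.25 days long when converting to a date of birth. All function names are hypothetical.

```python
import random
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25  # assumption: average year length for date conversion


def perturbed_precise_age(nominal_age: int, shift: float,
                          rng: random.Random) -> float:
    """Turn a floored PUMS age into a perturbed precise age in years."""
    # Uniformly random precise age consistent with the floored age.
    precise = nominal_age + rng.random()
    # Add the entity-level shift; flip the sign if the result is negative.
    return abs(precise + shift)


def perturb_entity(nominal_ages: list[int], rng: random.Random) -> list[float]:
    """Perturb every age in one sampled entity (e.g., a household)."""
    shift = rng.gauss(0.0, 1.0)  # standard normal shift, one per entity
    return [perturbed_precise_age(age, shift, rng) for age in nominal_ages]


def date_of_birth(precise_age: float, as_of: date) -> date:
    """Date of birth consistent with a precise age on the date `as_of`."""
    return as_of - timedelta(days=precise_age * DAYS_PER_YEAR)


if __name__ == "__main__":
    rng = random.Random(2019)
    for age in perturb_entity([41, 39, 7, 0], rng):
        print(round(age, 2), date_of_birth(age, date(2019, 1, 1)))
```

Sharing one shift within an entity keeps the age gaps between household members roughly intact, while two simulants drawn from the same PUMS record receive different dates of birth.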

Sex is reported in PUMS as binary (male or female), so we initialized a sex attribute this way as well for each simulant. We mapped separate PUMS indicators of race and ethnicity to a single composite “race/ethnicity” indicator, with the following exhaustive and mutually exclusive categories: “White,” “Black,” “Latino,” “American Indian and Alaskan Native,” “Asian,” “Native Hawaiian and Other Pacific Islander,” and “Multiracial or Some Other Race.” We defined these categories in accordance with the guidelines provided by the US Office of Management and Budget (OMB) ¹⁹. Nativity describes whether a simulant was born in the United States or elsewhere, and we modeled this as a binary variable in our simulation. We used this nativity attribute to inform the likelihood that the simulant had a Social Security Number (SSN). Table 1 provides a sample of the basic demographics present in the simulated population used in pseudopeople (note that these are entirely simulated data and therefore do not constitute Controlled Unclassified Information under US Code).

Table 1. Sample of basic demographics from simulated population.

These are entirely simulated data and therefore do not constitute Controlled Unclassified Information under US Code.

Simulant ID	First name	Middle initial	Last name	Age	Date of birth	Sex	Race/ethnicity
2	Melanie	L	Herrod	26	8/5/1993	F	White
3	Jordan	C	Herrod	26	12/20/1993	M	White
923	John	E	Mckeever	77	6/29/1942	M	Black
6176	Gail	K	Durand	67	1/3/1953	F	Multiracial or Other
18770	Ann	J	Molina	60	10/24/1959	F	Latino

Household structure. Our simulants lived in either residential households or group quarters (GQ). We used the ACS PUMS data to inform the residential household structure regarding how each simulant is related to a reference person in their household. Simulants living in GQ do not have such a relationship and GQs do not have a reference person. Residential households and GQs have geographic locations as well as physical and mailing street addresses, which may be different, because some residential households receive mail at a PO box (we do not simulate other kinds of mailing-only addresses, such as rural route addresses).

PUMS data were not sufficient to identify precisely which type of GQ each simulant resided in; they only provided information on whether it was an institutional or non-institutional GQ. We subdivided institutional GQ into three mutually exclusive and collectively exhaustive categories of carceral, nursing homes, and other institutional. We also subdivided non-institutional GQ into college, military, and other non-institutional. We chose a GQ type uniformly at random for each simulant out of the three types consistent with their institutional/non-institutional status.
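A minimal sketch of this assignment is shown below. The category labels follow the text; the dictionary layout and the function name `assign_gq_type` are illustrative assumptions.

```python
import random

# GQ types grouped by the institutional status available in PUMS.
GQ_TYPES = {
    "institutional": ("carceral", "nursing home", "other institutional"),
    "non-institutional": ("college", "military", "other non-institutional"),
}


def assign_gq_type(status: str, rng: random.Random) -> str:
    """Pick one of the three GQ types for this status uniformly at random."""
    return rng.choice(GQ_TYPES[status])


if __name__ == "__main__":
    rng = random.Random(7)
    print(assign_gq_type("institutional", rng))
    print(assign_gq_type("non-institutional", rng))
```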

For simulants living in residential households, we modeled a relationship to the reference person of their household based on the relationship values in the PUMS ²⁰. Possible relationship values were reference person, biological child, adopted child, stepchild, sibling, parent, grandchild, parent-in-law, child-in-law, other relative, roommate, foster child, other non-relative.

Mortality. To model mortality, we used our standard Vivarium approach, informed by the age- and sex-specific estimates of all-cause mortality for the US in 2019 produced by the IHME Global Burden of Disease Study ²¹. When a simulant who was the reference person in a non-GQ household died, we made the oldest remaining simulant in that household the new reference person and updated all other relationships (this produces some households with an unrealistically young simulant as the reference person). Unlike many of our past Vivarium simulations, we did not model the underlying cause for any simulant’s death. However, we could extend this simulation to model specific causes of death in future iterations, for example to facilitate research and development in cancer registry linkage applications.

Fertility. We used our standard Vivarium approach to an age-specific fertility model in which each female simulant has a probability of having a birth event at each time step, derived from the age-specific fertility rate for the USA. In the current version of our model, only one female parent is identified, representing the simulant who gave birth. The birth event is considered to occur at a randomly chosen time during the 28-day time step, which informs the date of birth and age of the simulants born. We select a random 4% of birth events to be the birth of twins (two newborn simulants), and for the other birth events we add a single newborn simulant. We expect that the inclusion of twins will create some particularly challenging ER data, where simulants have the same last name, address, and date of birth. We do not include adoption or any other complexities of family structure.
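The sketch below illustrates one way a birth event of this kind could be drawn in a single time step. The conversion from an annual rate to a per-step probability, 1 − exp(−rate × Δt), is an assumption made for illustration; the text states only that the probability is derived from the age-specific fertility rate. The function names are hypothetical.

```python
import math
import random
from typing import Optional

STEP_DAYS = 28         # time step duration, from the text
TWIN_FRACTION = 0.04   # share of birth events that are twin births, from the text


def step_probability(annual_rate: float, step_days: int = STEP_DAYS) -> float:
    """Assumed rate-to-probability conversion: 1 - exp(-rate * dt in years)."""
    return 1.0 - math.exp(-annual_rate * step_days / 365.25)


def simulate_birth(annual_fertility_rate: float,
                   rng: random.Random) -> Optional[tuple[float, int]]:
    """Return None if no birth occurs in this step.

    Otherwise return (day offset within the step, number of newborns).
    """
    if rng.random() >= step_probability(annual_fertility_rate):
        return None
    day_offset = rng.uniform(0, STEP_DAYS)  # random time within the step
    n_newborns = 2 if rng.random() < TWIN_FRACTION else 1
    return day_offset, n_newborns


if __name__ == "__main__":
    rng = random.Random(42)
    # 0.1 births per person-year is a made-up rate for illustration only.
    events = [simulate_birth(0.1, rng) for _ in range(10_000)]
    print(sum(e is not None for e in events), "birth events in 10,000 draws")
```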

The newborn simulant inherits certain attributes from their mother simulant, including household, race/ethnicity, and last name (recall that the simulation associates a newborn with only a single parent, so these attributes are inherited from this individual unambiguously). These simplifying assumptions allowed us to avoid modeling the complex dynamics of relationships but precluded us from following the dominant patriarchal naming pattern present in the US. The nativity of children born in the simulation is set to reflect that they were born in the US; therefore, all children born in the simulation are assigned an SSN. Additionally, we assigned newborns a relationship to the reference person in their household (which is also their parent’s household) based on the relationship between their parent and the household reference person, using a set of logical business rules.

Migration. We attempted to include accurate patterns of migration in our simulation, as migration leads to changing addresses, which constitutes an important challenge in ER. As with basic demographics, all data informing migration in our simulation come from ACS PUMS. We used PUMS to calculate migration by demographics. Many attributes could explain moving behavior, and they may interact in complex ways in the real world. We modeled only some of this complexity and captured three types of household and individual migration events: migration within the simulation (domestic migration), migration into the simulation (in-migration), and migration out of the simulation (out-migration).

Domestic migration

We modeled domestic migration events as happening at a rate determined by age, sex, and race/ethnicity; we held these rates constant across time in the simulation. Individual domestic migration caused a single simulant to move and might reflect an individual moving out of their current living situation (i.e., GQ or residential household) and establishing a new one-person household, moving into GQ, or joining an existing residential household as a non-reference person. For individual migrations in which a simulant establishes a new household, we always classified the simulant as the reference person. For individual migrations in which a simulant joins an existing household, the simulant is always classified as an “Other nonrelative.” We assumed that simulants have at most a single individual migration event per time step.

When a simulant who was the reference person in a non-GQ household moved, we assigned the oldest remaining simulant in their household to be the reference person and updated all other relationships in the household according to logical business rules.
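The selection of a new reference person can be sketched as below. Only the "oldest remaining simulant" rule comes from the text; the rules used to update the other relationships are not reproduced here, and the data layout and function name are assumptions.

```python
from typing import Optional


def next_reference_person(ages_by_simulant: dict[int, float],
                          departed: int) -> Optional[int]:
    """Return the ID of the oldest simulant left after `departed` leaves.

    `ages_by_simulant` maps simulant IDs to ages for one non-GQ household.
    Returns None when nobody remains in the household.
    """
    remaining = {sid: age for sid, age in ages_by_simulant.items()
                 if sid != departed}
    if not remaining:
        return None
    return max(remaining, key=remaining.get)


if __name__ == "__main__":
    household = {101: 44.2, 102: 41.7, 103: 12.3}  # made-up IDs and ages
    print(next_reference_person(household, departed=101))
```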

Household domestic migration caused an entire household of more than one simulant to move as a unit. As with individual migration, we calculated the rate of household migration per household-year and stratified by demographics. Because households do not have overall demographic characteristics, we used the demographics of the reference person for this stratification. Unlike individual migration, we did not change any relationships in household migrations.

We used a simplifying assumption that all simulants who moved and were of working age (which we define as age 18 and older) changed employment.

International immigration

We modeled immigration by adding new simulants to our simulation to represent individuals moving into the US from other countries. We sampled simulants immigrating to the US from the subset of the 2016–2020 ACS PUMS who had immigrated to the US in the year before they were surveyed and did not perturb their age. This approach relies on our assumption that the number per year and demographic characteristics of recent immigrants in the 2016–2020 ACS PUMS will not change substantially in any future year of the simulation.

We modeled three kinds of immigration events in our simulation: household moves, GQ person moves, and non-reference-person moves. As with domestic migration, a household move is when an entire non-GQ household enters from outside the country as a unit, preserving relationships within the unit. A GQ person move is when a simulant enters from outside the country and joins group quarters. Because simulants who reside in GQs do not have tracked relationships in PUMS or our simulation, these moves have no relationship structure. Lastly, a non-reference-person move is when an individual simulant enters from outside the country and joins an existing non-GQ household with some relationship other than “reference person.”

We used the weighted number of last-year immigration events of each type from the ACS PUMS to inform the yearly rate at which immigration events of each move type occurred in our simulation. We simulated constant rates over time and did not model seasonal or temporal fluctuation in immigration.

International emigration

Emigration occurs when a simulant leaves the US to live in another country. We used the Net International Migration (NIM) estimates from the Census Bureau’s Population Housing Unit (PopEst) program to determine the number of emigrants per year, by subtracting immigration numbers from ACS to isolate emigration. The NIM estimates are made by the PopEst team by combining information about immigration from ACS with information about emigration from demographic analysis (for those born outside the US) and analysis of foreign censuses (for those born in the US) ²².

There are three types of emigration events that can occur in our simulation: household moves, GQ-person moves, and non-reference-person moves. These cause an entire household, a GQ person, or a household member who is not a reference person to leave the US, respectively. We stratified emigration rates by age group, sex, race/ethnicity, nativity, and US state of residence, and we assumed that these stratified rates were constant over time, without a long-term trend or seasonal variation. We stopped tracking households and individuals after an emigration event and assumed that they would not return to the US or appear in any pseudopeople data after they had emigrated.

Employment dynamics. We consider all simulants aged 18 years or older to be working age; all such simulants either have an employer or are considered unemployed. We only allow a single employer at a time for each simulant. We initialized the working-age simulants to be unemployed, employed in the military, or employed otherwise, and we considered the military to be a single employer. To employ the rest of the simulants (those with non-military jobs), we generated employers with an initial size attribute chosen from a skewed distribution to ensure that there are a few large employers and many small employers. In order to assign individual simulants to employers such that the size attribute is (roughly) accurate at the population level, we selected each simulant’s employer from the categorical distribution where the probability of each employer is proportional to its initial size attribute.

Working-age simulants (including those who are unemployed) change employment randomly at a rate of 50 changes per 100 person-years (a rate we selected subjectively to provide an appropriate challenge in record linkage). When a simulant changes employment, we sample a new employer with the same procedure used at initialization. This approach to selecting a new employer ensures that at the population level, the number of simulants employed by any given employer will remain roughly proportional to the initial size attribute sampled for that employer.
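The employer assignment and job-change logic can be illustrated with the sketch below. The text does not name the skewed distribution used for employer sizes, so the log-normal draw and its parameters are assumptions, as is the conversion of the annual change rate to a per-step probability. Function names are hypothetical.

```python
import math
import random

STEP_DAYS = 28      # time step duration, from the text
CHANGE_RATE = 0.5   # 50 employment changes per 100 person-years, from the text


def sample_employer_sizes(n_employers: int, rng: random.Random) -> list[float]:
    """Draw skewed initial size attributes (assumed log-normal)."""
    return [rng.lognormvariate(2.0, 1.5) for _ in range(n_employers)]


def sample_employer(sizes: list[float], rng: random.Random) -> int:
    """Pick an employer index with probability proportional to its size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=1)[0]


def maybe_change_employer(current: int, sizes: list[float],
                          rng: random.Random) -> int:
    """Apply one time step of the employment-change process."""
    # Assumed conversion of the annual rate to a per-step probability.
    p_change = 1.0 - math.exp(-CHANGE_RATE * STEP_DAYS / 365.25)
    if rng.random() < p_change:
        # Same sampling procedure as at initialization.
        return sample_employer(sizes, rng)
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = sample_employer_sizes(1_000, rng)
    employer = sample_employer(sizes, rng)
    for _ in range(13):  # roughly one simulated year of 28-day steps
        employer = maybe_change_employer(employer, sizes, rng)
    print("employer index after one year:", employer)
```

Because every draw uses the same size-proportional weights, the expected share of simulants at each employer stays proportional to its initial size attribute, which matches the population-level behavior described above.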

We also simulated income, which affected which datasets a simulant appeared in. For instance, the WIC dataset only recorded simulants with household income below a certain threshold. We approximated the income distribution with a log-normal distribution for each age group, sex, and race/ethnicity combination, fit to the ACS PUMS. See Appendix 1 for more detail on the distribution parameters used for each demographic group. We simulated only income earned through wages; unemployed simulants had no income. To simplify our model, we assumed statistical independence between wages and employer for employed simulants.

Post-processing

We added some elements to the simulated data after the simulation ran. This includes features that would require additional computing resources to track for the simulation’s duration, such as simulant names (first and last), employer names, and government-issued identification numbers (i.e., SSNs and ITINs).

We developed simulant first and last names based on two distinct data sources: first and middle names are sourced from SSA data, which allowed us to match name frequencies to age and sex; while last names are generated from Census data with hyphens and spaces added to make linking tasks realistically more challenging, which allowed us to match the frequencies by race/ethnicity ^{23, 24}. We generated SSNs in accordance with the current algorithm used to issue unique SSNs. We selected the first three digits uniformly at random from 001 to 899, excluding 666; the next two digits from 01 to 99; and the final four digits from 0001 to 9999 ²⁵. We generated Individual Taxpayer Identification Numbers (ITINs) for simulated 1040 filings by simulants without an SSN using a similar process ²⁶.
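The SSN ranges described above can be sketched as follows. The digit ranges come from the text; the hyphenated output format, the uniform draws for the middle and final groups, and the retry loop used to keep values unique are illustrative assumptions, and the function names are hypothetical.

```python
import random

# Valid leading three-digit groups: 001 to 899, excluding 666.
VALID_AREAS = [area for area in range(1, 900) if area != 666]


def generate_ssn(rng: random.Random) -> str:
    """Draw one simulated SSN in the form AAA-GG-SSSS."""
    area = rng.choice(VALID_AREAS)   # 001-899 except 666
    group = rng.randint(1, 99)       # 01-99
    serial = rng.randint(1, 9999)    # 0001-9999
    return f"{area:03d}-{group:02d}-{serial:04d}"


def generate_unique_ssns(n: int, rng: random.Random) -> list[str]:
    """Draw `n` distinct simulated SSNs by redrawing on collisions."""
    seen: set[str] = set()
    while len(seen) < n:
        seen.add(generate_ssn(rng))
    return sorted(seen)


if __name__ == "__main__":
    print(generate_unique_ssns(3, random.Random(11)))
```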

We based our simulated employer names on a database of 5,321,506 “location names” from the SafeGraph “Core Places of Interest USA” dataset released in June 2020 ²⁷. To create a representation of bigrams from this dataset, we constructed a directed multigraph. Each word in a location name was treated as a node, and we included special <start> and <end> nodes. We included a directed multi-edge for each occurrence of a word pair in sequence in each location name. To generate simulated employer names, we performed a random walk through the bigram graph. Starting from the <start> node, we traversed directed edges selected uniformly at random until we reached the <end> node or exceeded a predetermined maximum path length. We then combined the words associated with each node that was encountered along the path to form the simulated employer name. This approach resulted in a diverse range of names that maintained a realistic quality. In the sample data included with the pseudopeople package, the W2 and 1099 employer names in 2020 include 212 distinct names, and the three most common are “San Benito Martinez Landscape Supply”, “Tony's Family Practice Inc”, and “Pikes Creek Campground”.
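The bigram random walk described above can be sketched in a few lines. The three input location names, the maximum length of six words, and the function names are invented for this example; the multigraph is stored as a dictionary of successor lists, so a word pair that occurs more than once is proportionally more likely to be followed.

```python
import random
from collections import defaultdict

START, END = "<start>", "<end>"


def build_bigram_graph(names):
    """Map each word to a list of successors, one entry per observed occurrence."""
    graph = defaultdict(list)
    for name in names:
        words = [START, *name.split(), END]
        for current, following in zip(words, words[1:]):
            graph[current].append(following)  # repeated entries act as multi-edges
    return graph


def random_walk_name(graph, rng, max_words=6):
    """Walk from <start> along uniformly chosen edges until <end> or max_words."""
    words = []
    node = START
    while len(words) < max_words:
        node = rng.choice(graph[node])
        if node == END:
            break
        words.append(node)
    return " ".join(words)


# Hypothetical location names standing in for the SafeGraph data
location_names = ["Main Street Bakery", "Lake Street Dental", "Main Line Supply"]
graph = build_bigram_graph(location_names)
employer_name = random_walk_name(graph, random.Random(0))
```

With these inputs the walk can recombine fragments into names that never appeared in the source list, such as "Lake Street Bakery", which is the behavior that gives the generated employer names their variety.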

Dataset validation

For dataset validation, we followed the standard workflow used across all of our Vivarium models, using a process often referred to as verification and validation (V&V) ²⁸. In this process, model results are verified by the research team by checking that a given model approximately replicates target values it was explicitly designed to replicate (e.g., verifying that the proportion of simulants living in group quarters as opposed to individual households matched the value specified by the research team). Results are also validated by the research team, ensuring that model results are logically viable or sensible (e.g., checking that the US population size and structure does not change drastically over the time period modeled). In the event that model results did not meet verification and validation criteria, model implementation and/or design were iteratively adjusted appropriately until criteria were satisfied.

We validated pseudopeople datasets through automated testing conducted on the engineering side as well as manual, systematic testing of the simulated population and post-processing data on the research side. In an effort to be as systematic as possible with our user-led data testing processes, we aspired to specify our verification and validation strategies before the synthetic population model was developed by our engineers. For example, we used an interactive simulation in a Jupyter Notebook to verify that simulants were dying at the age- and sex-specific all-cause mortality rates estimated for the USA by the IHME Global Burden of Disease Study.
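A verification check of this kind can be sketched as a comparison between an observed event rate and its target. The toy cohort, the target rate of 0.01 deaths per person-year, and the tolerance are all hypothetical values chosen for illustration; they are not the Global Burden of Disease estimates or the criteria used by the research team.

```python
import random


def simulate_deaths(n_simulants, annual_rate, years, rng):
    """Toy cohort: each simulant faces a constant annual death probability."""
    deaths = 0
    person_years = 0.0
    for _ in range(n_simulants):
        for _ in range(years):
            person_years += 1.0
            if rng.random() < annual_rate:
                deaths += 1
                break  # a simulant who has died contributes no further person-years
    return deaths, person_years


def verify_rate(deaths, person_years, target_rate, rel_tol=0.1):
    """Return True when the observed rate is within rel_tol of the target."""
    observed = deaths / person_years
    return abs(observed - target_rate) <= rel_tol * target_rate


TARGET_RATE = 0.01  # hypothetical target, deaths per person-year
deaths, person_years = simulate_deaths(20_000, TARGET_RATE, 5, random.Random(0))
passed = verify_rate(deaths, person_years, TARGET_RATE, rel_tol=0.2)
```

In practice such a check would be repeated for each age and sex group, with the tolerance set according to the size of the group being examined.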

Dataset limitations

There are a variety of limitations to our simulation strategy which may affect its ability to reflect real-world dynamics, including but not limited to those regarding migration, employment, physical or mailing address, guardianship, household structure and simulant relationships, and simulant identification.

For instance, to make the simulation of complex migration dynamics possible, we made a series of assumptions regarding how simulants and households move around, into, and out of the simulation. We assumed that domestic and international migration do not change over time, but rather remain at the average rate from 2016 to 2020 in each future year of the simulation. When a simulant moves, we assume that their mailing address, physical address, and employer all change. In addition, complexities particular to household sub-structure interacting with migration are largely not captured in our simulation. For instance, a child can move out of their household without a parent, or a simulant could move without their spouse into a different household. Additionally, we assume that relationship does not affect emigration rates and that all household types are equally likely to have a simulant move out of or into them. Furthermore, for any individual migration of a simulant from one household into another, we assign the relationship “other nonrelative” in their new household. Thus, as time passes within the simulation, the proportion of households with irregular relationship structures grows. In the sample data included with the pseudopeople package, 4% of rows in the 2020 decennial census have “Other nonrelative” as the relationship to the reference person, and in 2030 this rises to 16%. Even in the early years of our simulation, it is possible that there are rare, but challenging, household structures which are not sampled in ACS and therefore not represented in our data either. For example, a very small fraction of very large households might present a problem in real-world linkage work that would not be identified when testing with pseudopeople data.

Similarly, there are several assumptions that we made to simplify our model of employment dynamics in our simulation. We do not model retirement, and each simulant can only have one employer at a time. There is a myriad of business dynamics that we currently do not model, including new businesses opening, existing businesses closing, business name changes, or business mergers and acquisitions. As with household physical and mailing addresses, when a business address is vacated, it is not reused. In effect, this likely makes business record linkage with these data easier than it will be in practice.

Our age-specific fertility and mortality models do not account for variations related to income or race/ethnicity, and in future iterations of this work, we wish to address more complicated dynamics between the various elements of our simulation.

There are also limitations in simulant identification because of privacy protections in the name data we have used. The data on first names exclude names with fewer than five occurrences, while the data on last names include only names with at least 100 occurrences. Furthermore, we did not model the correlation between first and last names explicitly. We hope to address these limitations in future refinements of our model.

Results

Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub (github.com/ihmeuw/pseudopeople/issues). These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.

Table 2 shows a sample of the simulated data that might be found in administrative sources on income tax (note that these are entirely simulated data and therefore do not constitute Confidential Unclassified Information under US Code).

Table 2. Sample of tax data from simulated population.

These are entirely simulated data and therefore do not constitute Confidential Unclassified Information under US Code.

Simulant ID	First name	Last name	SSN	Employer	Wages	Tax Form
4	Eric	Alonso Tellez	584-16-0130	Pikes Creek Campground	$10,192	W2
5	Erin	Alonso Tellez	854-13-6295	Red’s Dairy Queen	$28,355	W2
5	Erin	Alonso Tellez	854-13-6295	Warrensburg	$18,243	W2
5621	Derick	Castillo	674-27-1745	Nashville City Properties	$7,704	W2
5623	Heather	Castillo	794-23-1522	Ecr Whipple Oliver Finley Shoe Sensation	$3,490	1099

Conclusions

By generating population data with complexity and scale comparable to the data held by large organizations and federal agencies, such as the US Census Bureau, we hope to circumvent the common data privacy- and access-related barriers to ER research and development. We intend for this data note to serve as a comprehensive guide for researchers contemplating the use of pseudopeople to develop and test new theories, algorithms, and software systems.

Acronyms

Acronym	Full Form
ACS	American Community Survey
CPS	Current Population Survey
ER	Entity Resolution
GQ	Group Quarters
IHME	Institute for Health Metrics and Evaluation
ITIN	Individual Taxpayer Identification Number
NIM	Net International Migration
OMB	Office of Management and Budget
PII	Personally Identifiable Information
PUMA	Public Use Microdata Area
PUMS	Public Use Microdata Sample
SSA	Social Security Administration
SSN	Social Security Number
USCB	United States Census Bureau
V&V	Verification and Validation
WIC	Special Supplemental Nutrition Program for Women, Infants, and Children

Data availability

While a small amount of pseudopeople data is openly available as part of the Python package, access to the full datasets will require users to be both transparent and accountable to a committee of interested parties, including civil society organizations, privacy experts, and data subject representatives. To read more about how to access the full datasets associated with pseudopeople, please visit our website at https://www.pseudopeople.org/.

Software availability

Source code available from: https://github.com/ihmeuw/vivarium_census_prl_synth_pop

Archived source code at time of publication: https://doi.org/10.5281/zenodo.10967291 ²⁹.

License: BSD 3-Clause License.

The steps to reproduce and run the software are available at the site listed above. Please note that although this software is freely available, running the simulation and post-processing at full-USA scale is a resource-intensive process that requires substantial computational resources. Our approach used parallelization across 334 individual jobs, with each job accommodating an optimal population size estimated to be between 750,000 and 1.5 million individuals. For a run of 1 million individual simulants, we used approximately 55 gigabytes of memory and a runtime of 21.5 hours.

References

1. Binette, Steorts: (Almost) all of entity resolution. Sci Adv. 2022;8(12):eabi8021. PMID: 35333582. doi: 10.1126/sciadv.abi8021
2. Connelly, Playford, Gayle: The role of administrative data in the big data revolution in social science research. Soc Sci Res. 2016;59:1–12. PMID: 27480367. doi: 10.1016/j.ssresearch.2016.04.015
3. Fischer, Richter FGC, Anthony: Leveraging administrative data to better serve children and families. Public Adm Rev. 2019;79(5):675–83. doi: 10.1111/puar.13047
4. Christen, Schnell: Thirty-three myths and misconceptions about population data: from data capture and processing to linkage. Int J Popul Data Sci. 2023;8(1):2115. PMID: 37636835. doi: 10.23889/ijpds.v8i1.2115. PMCID: PMC10454001
5. United States Census Bureau: Four cooperative agreements: census bureau research on record linkage and entity resolution. Census.gov. [cited 2023 Nov 1].
6. Christen: Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery; (KDD ’08), 2008;1065–8. doi: 10.1145/1401890.1402020
7. Tran, Vatsalan, Christen: GeCo: an online personal data generator and corruptor. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY, USA: Association for Computing Machinery; (CIKM ’13), 2013;2473–6. doi: 10.1145/2505515.2508207
8. Talburt, Zhou, Shivaiah: SOG: a synthetic occupancy generator to support entity resolution instruction and research. In: 2009.
9. Nowok, Raab, Dibben: Synthpop: bespoke creation of synthetic data in R. J Stat Softw. 2016;74(11):1–26. doi: 10.18637/jss.v074.i11
10. Lindsay, Kennedy, Hepworth: Splink: latest developments and applications. Int J Popul Data Sci. 2023;8(2):2245. doi: 10.23889/ijpds.v8i2.2245. PMCID: PMC10929522
11. Schnell, Weiand: Microsimulation of an educational attainment register to predict future record linkage quality. Int J Popul Data Sci. 2023;8(1):2122. PMID: 37649490. doi: 10.23889/ijpds.v8i1.2122. PMCID: PMC10463005
12. Viljoen: A relational theory of data governance. Yale LJ. 2021;131:573.
13. Christen, Ranbaduge, Schnell: Linking sensitive data. 2020. doi: 10.1007/978-3-030-59706-1
14. Ihmeuw/vivarium: archival release. [cited 2023 Nov 1].
15. Harris, Millman, van der Walt: Array programming with NumPy. Nature. 2020;585(7825):357–62. PMID: 32939066. doi: 10.1038/s41586-020-2649-2. PMCID: PMC7759461
16. McKinney: Data structures for statistical computing in python. In: Austin, Texas; 2010;56–61.
17. Yassine, Beauchemin, Laviolette: Leveraging subword embeddings for multinational address parsing. In: 2020 6th IEEE Congress on Information Science and Technology (CiSt). 2020;353–60. doi: 10.1109/CiSt49399.2021.9357170
18. United States Census Bureau: Public Use Microdata Sample (PUMS). Census.gov. [cited 2023 Nov 1].
19. Federal Register: Revisions to the standards for the classification of federal data on race and ethnicity. 1997; [cited 2023 Nov 1].
20. 2016-2020 ACS PUMS Data Dictionary.
21. Wang, Abbas, Abbasifard: Global age-sex-specific fertility, mortality, Healthy Life Expectancy (HALE), and population estimates in 204 countries and territories, 1950-2019: a comprehensive demographic analysis for the Global Burden of Disease study 2019. Lancet. 2020;396(10258):1160–203. PMID: 33069325. doi: 10.1016/S0140-6736(20)30977-6. PMCID: PMC7566045
22. Bhaskar, Cortés, Scopilliti: Estimating net international migration for 2010 demographic analysis: an overview of methods and results.
23. Popular baby names. [cited 2023 Nov 1].
24. Comenetz: Frequently occurring surnames in the 2010 census.
25. The social security number verification service. [cited 2023 Nov 1].
26. Individual Taxpayer Identification Number. Internal Revenue Service, [cited 2023 Nov 1].
27. Places data curated for accurate geospatial analytics. SafeGraph, [cited 2023 Nov 1].
28. Allen, Collins, Rankin: Enabling model complexity through an improved workflow. In: Washington D.C.; 2019.
29. Albright, Haddock, Bachmeier: ihmeuw/vivarium_census_prl_synth_pop: data release (v2.0.1). Zenodo. 2024. http://www.doi.org/10.5281/zenodo.10967291

Reviewer response for version 2

doi: 10.21956/gatesopenres.17690.r38295

Rainer Schnell, Referee, University of Duisburg-Essen, Duisburg, Germany (https://orcid.org/0000-0001-7843-4974)

Competing interests: No competing interests were disclosed.

28 October 2024

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approve with reservations

The paper has improved a lot.

It is now clearer that detailed corruption is done within "pseudopeople".

However, neither the intended workflow nor the relationship between the input generated here, pseudopeople and vivarium is evident from the paper alone.

Since this is a microsimulation, a reference to the standard textbooks (for example, Handbook of Microsimulation, Practical Microsimulation or Spatial Microsimulation (...)) seems to be required.

Since the aim of the paper seems to generate records for ER, the dependencies of the changes of QIDs due to the simulated processes are essential for an unbiased estimate of successful ER. For example, when does a name change to which variant, caused by marriage, divorce, free will or naturalization? I cannot find a description in the text, although these processes are essential to the problems of ER in practice. If these processes are implemented, they need parameters derived from empirical studies, which should be referenced.

The authors state: "There are certain features of our project, however, that differentiate it from other efforts, most notably the scale of our simulated data: We have simulated a 100% sample of the USA over 20 years." Population microsimulation is common; for example, the UK models and the German Mikrosim project cover a whole nation for 20 years or more (which hardly makes sense in microsimulation). Furthermore, there is a Canadian and a Swedish microsimulation of the population. So, the special feature is that it is a US model. Neither the population covering aspect, the duration, nor the ideas of the main mechanisms of the microsimulation are new.

Are sufficient details of methods and materials provided to allow replication by others?

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Partly

Are the protocols appropriate and is the work technically sound?

Partly

Reviewer Expertise:

microsimulation, record linkage, PPRL, census operations

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer response for version 2

doi: 10.21956/gatesopenres.17690.r38294

James D. Gaboardi, Referee, Oak Ridge National Laboratory, Oak Ridge, USA (https://orcid.org/0000-0002-4776-6826)

Competing interests: No competing interests were disclosed.

22 October 2024

Recommendation: approve

I have no further comments.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

GIScience, Spatial Optimization, Geocomputation, Research Software Science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer response for version 1

doi: 10.21956/gatesopenres.16769.r36776

Rainer Schnell, Referee, University of Duisburg-Essen, Duisburg, Germany (https://orcid.org/0000-0001-7843-4974)

Competing interests: No competing interests were disclosed.

17 July 2024

Recommendation: reject

This is an interesting project and an interesting article.

My main concerns relate to the assumptions in using the names. The names are from top-lists and therefore do not contain rare names. The entropy of the QIDs is therefore underestimated, and linkage quality will be overestimated.

Furthermore, first names and last names are used independently and this will also underestimate the entropy in real data sets. Again, linkage quality will be biased.

By inflating ACS data, the same problem will occur. For example, rare household compositions, such as very large households, will be rare or even non-existent in ACS. Exactly these households will cause problems in real-world linkages. For example, similar names and birthdays at the same address are more common in census data than in samples of 1% or even 5% of the population. Furthermore, using name distributions independent of very large household compositions will cover up what is, in my experience, the main census linkage problem: separating very similar persons. These "rare" cases, amounting to only a few percent of a population, will cause underestimation of subpopulations when a biased sample is used (given that the aim is census estimates).

The terminology is a bit strange. For example, naming the simulated entities "simulants" is unusual in microsimulation (see, for example, O'Donoghue).

Another example is “shards”, used as a synonym for computational jobs. However, in general, a shard is a horizontal partition of data in a database or search engine (cited from Wikipedia), not a job. There are many more issues like these, and it might be helpful to check the usage of the terminology again.

I miss a clear statement of which parameters of the model are available within the programs provided or as parameter files. The time available for reviews does not permit running these programs for replication; therefore, many more details on what can be done with the data available at the GitHub link are required.

Whether a corrupter for QIDs is part of the program collection is also unclear to me. If so, the details of the parameters and their dependency on the simulated processes are of prime interest. For example, the text states: "These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers." Therefore, is this paper describing "pseudopeople", "Vivarium", their joint usage or something else?

Some previous work is not cited. For example, to the best of my knowledge, we have published the first paper on using microsimulation for generating data for record-linkage (Schnell/Weiand 2023).

Furthermore, the book on linking techniques by Christen et al. discusses many previous generators and corrupters systematically.

Are sufficient details of methods and materials provided to allow replication by others?

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Partly

Are the protocols appropriate and is the work technically sound?

Partly

Reviewer Expertise:

microsimulation, record linkage, PPRL, census operations

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

References

1. Schnell, Weiand: Microsimulation of an educational attainment register to predict future record linkage quality. Int J Popul Data Sci. 2023;8(1):2122. PMID: 37649490. doi: 10.23889/ijpds.v8i1.2122
2. Christen, Ranbaduge, Schnell: Linking Sensitive Data. 2020. doi: 10.1007/978-3-030-59706-1
3. O'Donoghue: Practical Microsimulation Modelling. 2021. doi: 10.1093/oso/9780198852872.001.0001

Abraham Flaxman, University of Washington, Seattle, USA

Competing interests: No competing interests

2 October 2024

This is an interesting project and an interesting article.

My main concerns relate to the assumptions in using the names. The names are from top-lists and therefore do not contain rare names. The entropy of the QIDs is therefore underestimated, and linkage quality will be overestimated, and furthermore, first names and last names are used independently and this will also underestimate the entropy in real data sets. Again, linkage quality will be biased.

Response: We appreciate the reviewer identifying these important limitations, and we have added text to our limitations section to call attention to them. In future work, we hope to compare entropy from our approach and some of the datasets without such redactions, such as from voter registration data, and also to enhance our approach to have more realistic marginal distributions for first and last names as well as joint distributions over (first name, last name) pairs.

Response: We have added to the limitations section to call attention to this limitation as well, and appreciate the reviewer highlighting it for us.

The terminology is a bit strange. For example, naming the simulated entities "simulants" is unusual in microsimulation (see, for example, O'Donoghue).

Another example is “shards”, used as a synonym for computational jobs. However, in general, a shard is a horizontal partition of data in a database or search engine (cited from Wikipedia), not a job. There are many more issues like these, and it might be helpful to check the usage of the terminology again.

Response: This is due to some internal jargon that has developed over time in our multidisciplinary group at IHME, and we appreciate the reviewer identifying it. We can easily change our terminology to avoid some of the cognitive burden of unfamiliar language, and we have removed the term “shard” as part of this effort. However, we appreciate the precision that “simulant” brings to discussing data fields that would be considered confidential information if they were about a real person, and we have kept this terminology, which we hope becomes more common over time.

Response: This can be an important limitation for users, and we have added details on the parameters which are inherent to the simulation and cannot be modified in the pseudopeople “post processing”.

Response: Our goal was to describe a specific simulation that we built with our Vivarium framework, to generate data for our pseudopeople package. Upon reflection, this is sure to be really confusing for anyone outside of our team, and we apologize for not making it clearer! We have added more detail about what is in pseudopeople and what is in this simulation; this was also confusing to reviewer 1.

Some previous work is not cited. For example, to the best of my knowledge, we have published the first paper on using microsimulation for generating data for record-linkage (Schnell/Weiand 2023).

Response: Thank you for highlighting this relevant literature, to which we have added a citation.

Furthermore, the book on linking techniques by Christen et al. discusses many previous generators and corrupters systematically.

Response: Thank you for highlighting this, which is now even more important for us to refer to, since we have added information on the pseudopeople package that uses the simulated data described in the first draft of our paper, and the data corrupters included therein are substantially influenced by Christen et al’s FEBRL and GeCO methods.

Reviewer response for version 1

doi: 10.21956/gatesopenres.16769.r37051

James D. Gaboardi, Referee, Oak Ridge National Laboratory, Oak Ridge, USA (https://orcid.org/0000-0002-4776-6826)

Competing interests: No competing interests were disclosed.

17 July 2024

Recommendation: approve

The authors present a thorough and holistic population simulation procedure with Vivarium and the means to access it with pseudopeople. This framework is operationalized through the use of publicly available data products from the US Census Bureau, the Social Security Administration, and other auxiliary datasets. The overarching purpose of the framework is synthetic Entity Resolution (ER), and the implementation takes chronology and life events into account (e.g., age, residence change, birth, death). Person- and household-level records are simulated that are also linked to related demographic information, such as employment and migration. The authors produce large-scale simulated census information for a span of 20 years into the future, a sample of which is entirely open-sourced while more detailed data is available through request.

I am curious about the authors' insights into the potential for integration into other simulation frameworks, such as ChiSim ¹ and UrbanPop ^{2,3}. As a full disclosure, I am affiliated with the UrbanPop project.

C.M. Macal, N.T. Collier, J. Ozik, E.R. Tatara, and J.T. Murphy (2018). "Chisim: An agent-based simulation model of social interactions in a large urban area," in 2018 Winter Simulation Conference (WSC), pp. 810–820, IEEE. DOI: 10.1109/wsc.2018.8632409.

J.V. Tuccillo, R. Stewart, A. Rose, N. Trombley, J. Moehl, N.N. Nagle, and B. Bhaduri (2023) "UrbanPop: A spatial microsimulation framework for exploring demographic influences on human dynamics," Applied Geography, vol. 151, pp. 102844. DOI: 10.1016/j.apgeog.2022.102844.

J.V. Tuccillo and J.D. Gaboardi (2023) "Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit," Proceedings of the 22nd Python in Science Conference, pp. 93-100. DOI: 10.25080/gerudo-f2bc6f59-00c.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

GIScience, Spatial Optimization, Geocomputation, Research Software Science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

References

1. Macal, Collier, Ozik, Tatara, Murphy: Chisim: An agent-based simulation model of social interactions in a large urban area. 2018 Winter Simulation Conference (WSC). 2018. doi: 10.1109/wsc.2018.8632409
2. Tuccillo, Stewart, Rose, Trombley, Moehl, Nagle, Bhaduri: UrbanPop: A spatial microsimulation framework for exploring demographic influences on human dynamics. Applied Geography. 2023;151:102844. doi: 10.1016/j.apgeog.2022.102844
3. Tuccillo, Gaboardi: Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit. Proceedings of the 22nd Python in Science Conference. 2023;93–100. doi: 10.25080/gerudo-f2bc6f59-00c

Abraham Flaxman, University of Washington, Seattle, USA

Competing interests: No competing interests

2 October 2024

Thank you for this positive assessment and for the fun opportunity to read up on and think about these frameworks. At a high level, it seems that some integration must be possible! If I understand correctly, the time scale of the ChiSim and UrbanPop projects is substantially finer-grained than what we used in pseudopeople: while they capture patterns of mobility over the course of a day, we have focused on internal migration patterns that might happen only once in a year. That said, the general area is very similar, and the work in our setting might be useful for this line of research. And vice versa! Thank you for calling our attention to this work.

Reviewer response for version 1

doi: 10.21956/gatesopenres.16769.r36771

Hye Chung Kum, Referee, Population Informatics Lab, Texas A&M University, College Station, Texas, USA (https://orcid.org/0000-0002-6882-8053)

Competing interests: No competing interests were disclosed.

17 July 2024

Recommendation: approve with reservations

This paper describes a Python package that can simulate population data based on distributions of various individual and household characteristics in datasets released by the US Census Bureau, with the explicit purpose of supporting and facilitating entity resolution (ER) algorithm and software development without accessing real confidential data. ER algorithms require access to personally identifiable information (PII) such as names, SSN, and DOB, making privacy a major concern in this type of research. Using simulated data, such as the data described in the paper, is one approach to dealing with the privacy issues in ER. It relies on the existing Vivarium simulation platform that the same team developed, but attempts to incorporate some of the complexities of the ER problem with real data, such as difficulties with twins.

The paper is well written and easy to follow for the most part. Here are some questions/comments to consider for improvement.

[Major comments]

Data errors (e.g., typos in names and dob, flipped first/last name in different systems) or variations (e.g., nicknames, change in last name due to marriage, Jr and Sr) are a common source of difficulties in ER using real data which does not seem to be modeled yet. Maybe at least mention it as important future work.

Similarly, duplicate records that exist in almost any real data make ER much more difficult because one cannot assume a one-to-one match unless the tables have been deduplicated first. One of the main difficulties in ER is that a duplicate record for one person cannot be mathematically differentiated from very similar records for 2 people such as twins, father/son, or even totally unrelated people (to date, I have met 2 people who told me after an ER talk that they know “of” this other person with the EXACT same first/last name and DOB because they are in the same city and they have been confused before with the other person). Often it requires an explicit field stating that the person is a twin to be certain.

I highly recommend reading the following paper for ideas for future work to make the simulation more realistic for ER research as well as adding a limitation section to the current paper. Such a section would be very important for potential users to know what the simulated data does not cover, and may need to still consider in their ER algorithm.

Goth G. Running on EMPI. Health information exchanges and the ONC keep trying to find the secret sauce of patient matching. Health data management. 2014;22(2):52-, 4, 6 passim.

Also, consider the following for the limitation section. It seems that this simulated data may be more supportive of ER algorithms that leverage structure (e.g., household structure or clustering) without pair generation. However, many real-world applications do not have much structure information to use, and may still need to consider pairwise ER algorithms. Complexities for this type of ER algorithm may not have been well modeled here (see point above).

A few sentences in the introduction describing data governance/privacy concerns, even for simulated data, would be important. To my understanding, at a minimum, a simulation of this scale may have simulants that are too similar to real people, even if the data were all made up, which may still give rise to privacy issues. It is very important that everyone becomes more educated on the fundamental properties of information privacy. For example, information privacy is not a binary property but rather a continuous concept: with every use, there is some level of disclosure/risk. There is no way to benefit from using data for social good/research with 0 risk. The more people understand this, the more constructive the conversations about the risks and benefits of using large datasets about people.

A small paragraph on the pros/cons of approaches to privacy protection (e.g., simulated data, perturbed data, using secure data enclaves for access to real data) for ER research will be helpful to frame the paper and better understand when to use the simulated population or other alternatives. Some papers on perturbed data or using data enclaves are below.

Ramezani M, Ilangovan G, Kum HC. Evaluation of machine learning algorithms in a human-computer hybrid record linkage system. In CEUR workshop proceedings 2021 Jan (Vol. 2846, No. 4).

Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/186390.

P. Christen, “Preparation of a real temporal voter data set for record linkage and duplicate detection research,” 2013.

Kum HC, Ahalt S. Privacy-by-design: Understanding data access models for secondary data. AMIA Summits on Translational Science Proceedings. 2013;2013:126.

“We select a random 4% of birth events to be the birth of twins (two newborn simulants), and same last name, address, and date of birth.”: simulating multiple births is one of the key parts of making the simulated data useful for the intended purpose of ER. There is no reference for why 4% was chosen; it may be worth using a slightly higher percentage (see comment below). Also, in my experience of real data, twins often have even more similarity (e.g., similar first name or same first letter in first name; SSN only 1 digit off). I am quoting from my published work in ER on this topic, which includes a reference. “The most difficult links to resolve involve twins. In this case, much of the identifying information is validly the same or very similar. Often, SSNs are only one digit off, and one system might have assigned the SSN in one way, while another system has it assigned in the other way. These types of data errors make it almost impossible to automatically resolve entities without human intervention. Multiple birth rates have been rising in the United States, with twin birth rates at 29.3 per 1,000 births in 2000 [Reynolds et. al. 2003]. That is approximately six twins in every 103 children born, not including triplets and higher-order births. These are substantial numbers that need to be considered when performing record linkage on people level data. In one system of education data that performs record linkage on a regular basis, we have seen a twin field being regularly collected to differentiate data errors from real twins.” [Kum et. al. 2013]

Kum HC, Ahalt S, Pathak D. Privacy-preserving data integration using decoupled data. Springer New York; 2013.

Reynolds MA, Schieve LA, Martin JA, Jeng G, Macaluso M (2003) Trends in multiple births conceived using assisted reproductive technology, United States, 1997–2000. Pediatrics 111 (Supp 1):1159–1162

Why are marriage and fathers not modeled in the births? Jr/Sr in US naming culture do cause complexities in ER, so I am not sure I would agree with the stated hypothesis: “We hypothesize that this limitation in realistic naming will not lead to substantially harder or easier data for linkage.”

“In the event that model results did not meet verification and validation criteria, model implementation and/or design were iteratively adjusted appropriately until criteria were satisfied.”: it would be helpful to have some sense of how often these adjustments were needed, and how much of an adjustment was needed. This would help readers judge how close or not the final simulated results were before the fine tuning.

“Thus, as time passes within the simulation, the proportion of households with irregular relationship structures grows.”: it is understandable that this is a limitation of this type of simulation, but a bit more explanation on issues such as the following would be useful. (1) what are the issues related to this happening for ER; (2) how much is this a problem (e.g., rate of issues); and (3) limits on when the simulation should no longer be used due to this issue.

“As with household physical and mailing addresses, when a business address is vacated, it is not reused. In effect, this likely makes business record linkage with these data easier than it will be in practice.” It is unclear why the business address is not reused especially given this statement that for ER for businesses it may not be a good choice since this was supposed to target ER usage.

Also, how names for business were generated using graphs was confusing. Some concrete examples would be good to include.

Given that this is based on the Vivarium simulation platform, some brief explanation of what it is would be helpful. Either do not refer to it, if readers do not need to know anything about it, or explain a little about what it is beyond the fact that it was used, so that the paper can stand alone.

[Clarification questions]

“28-day time steps”: is there a reason not to use 30 days = 1 month? 28 days is the length of the shortest month, but it is unclear why this is a good choice. It would seem that 1 year would not be modeled well this way. Twelve 30-day steps give a 360-day year, which may be better.

“When perturbation led to a negative age value, we flipped the negative age value’s sign”. Why not just use a perturbation that would not give a negative value? It seems that flipping the negative value would not result in the target distribution.

“change employment randomly at a rate of 50 changes per 100 person-years (a rate we selected subjectively to provide an appropriate challenge in record linkage)”: It is unclear how employment is directly related to ER, except maybe indirectly through change of address, which in this particular simulation is tied to employment. Actually, it was unclear whether the employment change came before the move, or the other way around, or both. It is also unclear why this rate is appropriate. It may be better to describe ER in terms of changes in address rather than employment. If not, a bit more clarity on how ER is related to employment would be helpful.

“For instance, a child can move out of their household without a parent”: I assume this means even an infant can move out on their own. If we consider child adoptions/foster care this may be fine. But move out rates for children under 18, and adults should be quite different. This does not seem too difficult to implement and may be important to consider. Or for simplicity, maybe adoptions are not modeled but an age cutoff is implemented.

Are sufficient details of methods and materials provided to allow replication by others?

Partly

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Partly

Are the protocols appropriate and is the work technically sound?

Partly

Reviewer Expertise:

Entity Resolution; Perturbing data for entity resolution; Privacy issues in entity resolution

Flaxman

Abraham

University of Washington, Seattle, USA

Competing interests: No competing interests.

2 10 2024

Comment: This paper describes a Python package that can simulate population data based on distributions of various individual and household characteristics in datasets released by the US Census Bureau, with the explicit purpose of supporting and facilitating entity resolution (ER) algorithm and software development without accessing real confidential data. ER algorithms require access to personally identifiable information (PII) such as names, SSN, and DOB, making privacy a major concern in this type of research. Using simulated data, such as the data described in the paper, is one approach to dealing with the privacy issues in ER. It relies on the existing Vivarium simulation platform that the same team developed, but attempts to incorporate some of the complexities of the ER problem with real data, such as difficulties with twins.

The paper is well written and easy to follow for the most part. Here are some questions/comments to consider for improvement.

Response: Thank you for this assessment, and for taking the time to offer this valuable feedback.

Comment: [Major comments]

Response: Thank you for calling attention to this. We have developed the pseudopeople Python package (https://pseudopeople.org/), which includes options for adding various kinds of noise to the simulated data described in this paper. These allow a configurable amount of typos and other sources of error, including some of those listed above, and we intend to add other noise such as the reviewer has identified in future updates to pseudopeople. In this paper we hope to complement the pseudopeople documentation with a detailed description of the simulation process.

As we continue to add additional realism to challenge ER algorithms, it will be an interesting challenge to balance what we can make a configurable add-on (like the rate of first/last name flipping) and what we need to include in the simulation itself (e.g. the marriage-related name changes might need to be simulated together with the marriages) and therefore may not be able to make easily configurable. We have added text to the abstract and introduction to emphasize that this paper is focused on the simulation and not the data errors.
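To illustrate what a configurable add-on of this kind could look like, the following is a minimal sketch of first/last name flipping applied after simulation. The function `flip_names`, its `flip_probability` parameter, and the example records are hypothetical and are not part of the pseudopeople API.

```python
import random


def flip_names(records, flip_probability, seed=0):
    """Return copies of the records with first and last name swapped
    independently in each record with the given probability.

    Hypothetical illustration only; not the pseudopeople implementation.
    """
    rng = random.Random(seed)
    noised = []
    for record in records:
        record = dict(record)  # copy so the clean data stays untouched
        if rng.random() < flip_probability:
            record["first_name"], record["last_name"] = (
                record["last_name"],
                record["first_name"],
            )
        noised.append(record)
    return noised


# Made-up records standing in for simulated output.
records = [
    {"first_name": "Alex", "last_name": "Rivera"},
    {"first_name": "Sam", "last_name": "Lee"},
]
never_flipped = flip_names(records, flip_probability=0.0)
always_flipped = flip_names(records, flip_probability=1.0)
```

Because the noise is applied to the finished records, the rate can be changed without rerunning the simulation. A marriage-related name change, by contrast, depends on events inside the simulation and could not be added this way.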

Comment:

Response: As with our previous response, this is an important area that we hope to address in more detail in future work, and we appreciate the reviewer calling attention to some of the unique challenges here. In our next round of pseudopeople enhancements, we plan to add “simple duplication” as an additional configurable type of “row_noise”. More complex duplication, such as may be caused by individuals who are reported in two distinct households during a decennial census, is something we have a start on in our simulation already and hope to also enhance in future work. Another area for enhancement that we did not think of until reading this comment is the prospect of including an explicit field stating that the person is a twin. Just such a field is present in the Washington State Drivers License database (and was repurposed to create a twin registry for scientific research in Washington).

Comment:

Goth G. Running on EMPI. Health information exchanges and the ONC keep trying to find the secret sauce of patient matching. Health data management. 2014;22(2):52-, 4, 6 passim.

Response: Thank you for pointing out this reference. We agree that it admirably identifies additional future work and limitations. We believe that many of these (e.g. appending a title to the surname, as in FLAXMANPHD) will be worthy additions to pseudopeople, but we did not feel any were so relevant to our simulation as to merit a specific addition to the limitations section of this paper.

Comment:

Response: We appreciate the reviewer calling attention to the potential for using household structure in ER, which we agree is an underdeveloped avenue worthy of future research. We agree that many real-world applications of ER will not be able to use such an approach, because household structure is not frequently available, and we hope that our simulation and our Python package will help both in developing novel methods and in understanding when these methods are more or less applicable.

Comment:

Response: Thank you for calling attention to this important (and subtle) feature of simulated data of this scale. Both the generation and distribution of the dataset are governed by a system of “relational governance” which we developed for this simulated-but-realistic data together with the data ethicist Os Keyes. In Os’s approach, data subjects play a central role; a paper on their approach to data governance is currently in preparation. We have added a sentence to this effect to the introduction section.

Comment:

Ramezani M, Ilangovan G, Kum HC. Evaluation of machine learning algorithms in a human-computer hybrid record linkage system. In CEUR workshop proceedings 2021 Jan (Vol. 2846, No. 4).

P. Christen, “Preparation of a real temporal voter data set for record linkage and duplicate detection research,” 2013.

Kum HC, Ahalt S. Privacy-by-design: Understanding data access models for secondary data. AMIA Summits on Translational Science Proceedings. 2013;2013:126.

Response: We appreciate the reviewer’s suggestion and have added a sentence along these lines in the introduction. We will also consider all of these valuable references when framing and positioning our data governance approach in the paper that Os is currently working on.

Comment:

Kum HC, Ahalt S, Pathak D. Privacy-preserving data integration using decoupled data. Springer New York; 2013.

Reynolds MA, Schieve LA, Martin JA, Jeng G, Macaluso M (2003) Trends in multiple births conceived using assisted reproductive technology, United States, 1997–2000. Pediatrics 111 (Supp 1):1159–1162

Response: Thank you for these valuable references. We chose 4% of births without a suitable data source, and plan to improve this in future work. Fortunately, our ad hoc number is similar to the evidence you shared: our 4% of births for 2020-2040 is not too much higher than the 3% found in 2000. Your identification of similar SSNs is another challenge that we should incorporate when enhancing our twin generation component.
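The comparison between these rates can be made concrete with a short calculation. This sketch assumes, following the reviewer's "six twins in every 103 children" reading, that each rate is a fraction of birth events that produce two newborns. The function name is ours, for illustration only.

```python
def twin_share_of_children(twin_event_fraction):
    """Fraction of newborns who are twins when the given fraction of birth
    events produce two newborns and every other event produces one.

    Per 1 birth event: expected newborns = 1 + p, expected twins = 2 * p.
    """
    p = twin_event_fraction
    return 2 * p / (1 + p)


# 4% of birth events are twin births, as in the simulation.
share_simulated = twin_share_of_children(0.04)  # 8 twins per 104 children

# Roughly 3% of births, the rounded figure cited for 2000.
share_2000 = twin_share_of_children(0.03)  # 6 twins per 103 children
```

Under this reading, about 7.7% of simulated newborns are twins, compared with about 5.8% at the rounded 2000 rate.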

Comment:

Response: Good point. We did not consider the challenge of distinguishing between a father and son whose names are the same except for Jr and Sr suffixes. We have removed this hypothesis, and in our next major update to our simulation, we hope to include fathers, as well as Jr/Sr naming practices.

Comment:

Response: This is a very reasonable request that is unfortunately very difficult to meet, due to a lack of careful information keeping. We do have a complete revision history of the development of our model available publicly on GitHub, which (at the time I write this) shows 640 code commits grouped into 359 pull requests, but we do not have a straightforward way to identify which of these were changes in response to verification and validation issues. In the future, I think we can easily incorporate a “tagging” approach to PRs to identify code changes from V&V.

Comment:

Response: We believe that this will not be an issue for most common approaches to ER, and will only be relevant for experimental approaches that use household structure as part of their linkage strategy. In the publicly available sample data, in the 2020 decennial census, 4% of rows have “Other nonrelative” as their relationship to the reference person. In 2030, this rises to 16%, and in 2040, it is 20%. We plan to address this by adding more complex logic for relationship changes following migration in a future update. We have added a sentence about this to the limitations section.

Comment:

Response: This is a limitation that we plan to address in a future update, and we agree with the reviewer that reusing business addresses will make this data more suitable for testing business linkage algorithms.

Comment:

Also, how names for business were generated using graphs was confusing. Some concrete examples would be good to include.

Response: Good idea, we have added a sentence with examples from the sample data distributed openly with the pseudopeople package.

Comment:

Response: We feel that it is valuable to refer to the Vivarium software that we used to implement this simulation, and we agree with the reviewer that it is important to include a brief explanation of what Vivarium is. Since we already attempted such an explanation in our original draft, we have revised it to include additional detail in the first paragraph of the methods section. Perhaps the reviewer had some additional details in mind, however, and we welcome further feedback about how we can make the paper stand alone.

[Clarification questions]

Comment:

Response: We prefer to work in units of days or weeks, since larger units of time like months and years have annoying variation in length. So 28 days = 4 weeks is similar in length to one month, but always includes (for example) the same number of weekend days. This is not particularly important for this simulation but can matter for simulations or data that include day-by-day variation. We are confident that changing the timestep to 30 days would not lead to any difference in the difficulty of record linkage tasks with this simulated data.
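The weekday property can be checked with a short sketch. The start date and the helper names below are arbitrary choices for illustration and are not taken from the simulation code.

```python
from datetime import date, timedelta

STEP = timedelta(days=28)  # four whole weeks per time step


def step_dates(start, n_steps):
    """Dates at the start of each time step, plus the final end date."""
    return [start + i * STEP for i in range(n_steps + 1)]


def weekend_days(start, length_days):
    """Count Saturdays and Sundays in a window beginning at `start`."""
    return sum(
        (start + timedelta(days=k)).weekday() >= 5 for k in range(length_days)
    )


dates = step_dates(date(2020, 1, 1), 13)  # arbitrary illustrative start date

# Every 28-day step starts on the same weekday and holds 8 weekend days.
start_weekdays = {d.weekday() for d in dates}
weekends_28 = {weekend_days(d, 28) for d in dates[:-1]}

# 30-day windows starting on the first of each month vary in weekend days.
weekends_30 = {weekend_days(date(2020, m, 1), 30) for m in range(1, 13)}

# Thirteen 28-day steps span 364 days, one day short of a common year.
days_covered = (dates[-1] - dates[0]).days
```

The trade-off is that step boundaries drift against the calendar: thirteen steps fall one day short of a 365-day year.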

Comment:

Response: We prefer this approach to obtaining a truncated distribution that does not include an unexpectedly high density at zero (as truncation would) and does not shift the mean as much as a non-negative perturbation (like adding an exponentially distributed offset) would. It is developed in detail in Bernard W. Silverman, Density Estimation for Statistics and Data Analysis, 1986 (p. 30).
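The difference between reflecting and clipping at zero can be sketched as follows. This assumes Gaussian perturbation noise and a made-up population of very young simulants. The noise distribution, standard deviation, and function names are illustrative assumptions and are not taken from the simulation code.

```python
import random


def perturb_age_reflect(age, sd, rng):
    """Perturb an age and flip the sign of any negative result."""
    return abs(age + rng.gauss(0.0, sd))


def perturb_age_clip(age, sd, rng):
    """Comparison method: perturb an age and clip negative results to zero."""
    return max(0.0, age + rng.gauss(0.0, sd))


rng = random.Random(0)
ages = [0.5] * 10_000  # hypothetical infants, where negative draws are common

reflected = [perturb_age_reflect(a, 1.0, rng) for a in ages]
clipped = [perturb_age_clip(a, 1.0, rng) for a in ages]

# Clipping piles every negative draw onto exactly zero; reflection spreads
# those draws over small positive ages instead.
share_zero_reflected = sum(v == 0.0 for v in reflected) / len(ages)
share_zero_clipped = sum(v == 0.0 for v in clipped) / len(ages)
```

With these settings roughly 30% of the clipped values are exactly zero, while reflection produces essentially none, which is the spike at zero that the response describes avoiding.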

Comment:

Response: This particular ER challenge might be a bit niche, but it arises in systems like the US Census Bureau’s Person Identification Validation System, where a tax ID is available in employment records. As the reviewer notes, changes in address represent an important challenge in this setting, but changes in employment when address does not change could potentially be a barrier to high-quality record linkage in this setting as well.

Comment:

Response: We do include some of this complexity in our current simulation, and as we write, “We modeled domestic migration events as happening at a rate determined by age, sex, and race/ethnicity”, but we also want to note that surprising or even illogical household composition can arise from the limited approach we have taken. We hope to increase the realism of these dynamics in the future.