Child development with the D-score: turning milestones into measurement [version 1; peer review: 1 approved with reservations]

The chapter equips the reader with a basic understanding of the robust psychometric methods needed to turn developmental milestones into measurements. It introduces the fundamental issues in defining a unit for child development and demonstrates the relevant quantitative methodology. The chapter • reviews quantitative approaches to measuring child development; • introduces the Rasch model in a non-technical way; • shows how to estimate model parameters from real data; • puts forth a set of principles for model evaluation and assessment of scale quality; • analyses the relation between early D-scores and later intelligence; • and compares the D-scores from three studies that all use the same instrument.

Introduction
This introductory section outlines why we utilize the D-score: • reviewing key discussions about the first 1000 days in a child's life (1.1) • highlighting the relevance of early childhood development for later life (1.2) • discussing the use of stunting as a proxy for development (1.3) • pointing to existing instruments to quantify neurocognitive development (1.4) • explaining why we have written this chapter (1.5) • delineating the intended audience (1.6)

First 1000 days
The first 1000 days refers to the time needed for a child to grow from conception to the second birthday. It is a time of rapid change. During this period the architecture of the developing brain is very open to the influence of relationships and experiences (Shonkhoff et al., 2016). Early experiences affect the nature and quality of the brain's developing architecture by reinforcing some synapses and pruning others through lack of use. The first 1000 days shape the brain's architecture, but higher-order brain functions continue to develop into adolescence and early adulthood (Kolb et al., 2017).
The classic nature versus nurture debate contrasts the viewpoints that variation in development is primarily due to either genetic or environmental differences. The current scientific consensus is that both genetic predisposition and ecological differences influence all traits (Rutter, 2007). The environment in which a child develops (before and soon after birth) provides experiences that can modify gene activity (Caspi et al., 2010). Negative influences, such as exposure to stressful life circumstances or environmental toxins may leave a chemical signature on the genes, thereby influencing how genes work in that individual.
During the first 1000 days, infants are highly dependent on their caregivers to protect them from adversities and to help them regulate their physiology and behavior. As Figure 1.1 illustrates, caregivers can do this through responsive care, including routines for sleeping and feeding. To reach their developmental potential, children require nutrition, responsive caregiving, opportunities to explore and learn, and protection from environmental threats (Black et al., 2017). Gradually, children build self-regulatory skills that enable them to manage stress as they interact with the world around them (Johnson et al., 2013).

Relevance of child development
The first 1000 days is a time of rapid change. Early experiences affect brain development and influence lifelong learning and health (Shonkhoff et al., 2016). Healthy development is associated with future school achievement, well-being, and success in life (Bellman et al., 2013).
Professionals and parents consider it important to monitor children's development. Tracking child development enables professionals to identify children with signs of potential delay. Timely identification can help children and their parents to benefit from early intervention. In a normal population, developmental delay affects about 1-3% of children. A delay in development may indicate underlying disorders. About 1% of children have an autism spectrum disorder (Baird et al., 2006), 1-2% a mild learning disability, and 5-10% have a specific learning disability in a single domain (Horridge, 2011).
Children develop at different rates, and it is vital to distinguish those who are within the "normal" range from those who are following a more pathological course (Bellman et al., 2013). There is good evidence that early identification and early intervention improve the outcomes of children (Britto et al., 2017). Early intervention is crucial for children with developmental disabilities because barriers to healthy development early in life impede progress at each subsequent stage.
Monitoring child development provides caregivers and parents with reliable information about the child and an opportunity to intervene at an early age. Understanding the developmental health of populations of children allows organisations and policymakers to make informed decisions about programmes that support children's greatest needs (Bellman et al., 2013).

Stunting as proxy for child development
Stunting is the impaired physical growth and development that children experience from poor nutrition, repeated infection, and inadequate psychosocial stimulation. Linear growth in children is commonly expressed as length-for-age or height-for-age in comparison to normative growth standards (Wit et al., 2017). According to the World Health Organization (WHO), children are stunted if their height-for-age is more than two standard deviations below the Child Growth Standards median. Stunting caused by chronic nutritional deprivation in early childhood is used as an indicator of child development (Perkins et al., 2017).
There is consistent evidence for an association between stunting and poor child development, despite heterogeneity in the estimation of its magnitude (Miller et al., 2016; Sudfeld et al., 2015). Considering impaired linear growth as a proxy measure for child development is easy to do, and quite common. Yet, using impaired height growth as a measure for child development is not without limitations: • The relation between height and child development is weak after adjustment for age; • Height is a physical indicator that does not take into account a direct evaluation of a child's cognitive or mental performance; • There is considerable heterogeneity in heights of children all over the world; • Height is not sensitive to rapid changes in child development.

Measuring neurocognitive development
Assessment of early neurocognitive development in children is challenging for many reasons (Ellingsen, 2016). During the first years of life, developmental change occurs rapidly, and the manifestation of different skills and abilities varies considerably across children. Moreover, a child's performance on a cognitive task is very susceptible to measurement setting, timing and the health of the child that day.
Recently, a toolkit was published that reviews 147 assessment tools developed for children ages 0-8 years in low- and middle-income countries. Some of the most widely used tools include the Ages & Stages Questionnaires (ASQ), Achenbach Child Behavior Checklist (CBCL), Bayley Scales of Infant Development (BSID), Denver Developmental Screening Test (DEN), Griffiths Scales of Child Development (GRF), Mullen Scale of Early Learning (MSEL), Strengths and Difficulties Questionnaire (SDQ), Wechsler Intelligence Scale for Children (WISC), and its younger age counterpart Wechsler Preschool and Primary Scale of Intelligence (WPPSI).
Each of these tools has its strengths and limitations. For example, the ASQ and DEN are screeners for general child development. The CBCL and SDQ are screeners for behavioral and mental health, not cognition or general development. DEN is relatively easy and quick to administer, but not very precise. It is out of production, not being sold or re-normed. The BSID, MSEL, and GRF provide a clinical assessment at the individual level and require a skilled professional to administer. Some instruments collect observations through the caregiver (ASQ), whereas others emphasize traits and behavior over performance (SDQ, CBCL). Also, the age ranges to which the instruments are sensitive vary. Furthermore, they may cover different domains of development.
The ideal child development assessment would be easy to administer and would have high reliability, validity, and cross-cultural appropriateness. It should also show appropriate sensitivity in scores at different ages and ability levels. It is no surprise that no test can meet all of these criteria. Many tests are too long, difficult to administer, lack cross-cultural validity, or have low reliability. Also, many instruments are proprietary and costly to use.

Why this chapter?
We believe that there cannot be one instrument for measuring child development that is suitable for all situations. In general, the tool needs tailoring to the setting. For example, to find a delayed child, we need an instrument that is precise for that individual child, and that is sensitive to different domains of delay. In contrast, if we want to estimate the proportion of children that is developmentally on track in a region, we need one culturally unbiased, relatively imprecise low-cost measurement made on many children across many ages. The optimal instrument will look quite different in both cases.
We also believe that there can be one scale for measuring child development and that this scale is useful for many applications. Such a scale is similar to well-known measures for body height, body weight or body temperature. These measurements have a clearly defined unit (i.e., centimetre, kilogram, degree Celsius), which moreover is assumed to be constant across all scale locations. We express measurements as the number of scale units (e.g. 92 cm). Note that there may be multiple instruments for measuring a child's height (e.g. ruler, laser distance meter, echolocation, ability to reach the door handle, and so on). Still, their result translates into scale units (cm here). The opposite is also true, and perhaps more familiar. We may have one instrument and express the result in multiple units (e.g. cm, inches, light-years).
Instruments and scales are different things. Currently, instruments for measuring child development define their own scales, which renders the measurements made by distinct tools incomparable. No measurement unit for child development yet exists. It would undoubtedly be an advance if we could tailor the measurement instrument to the setting while retaining the advantage of a scale with a clearly defined unit across different tools. We can then compare the data collected by distinct devices. This chapter explores the theory and practice for making that happen.

Intended audience
We aim for three broad audiences: • Professionals in the field of child growth and development; • Policymakers in international settings; • Statisticians, methodologists, and data scientists.
Professionals in child development will become familiar with a new approach to measuring child development in early childhood. We plan to separate the measurement instrument from the scale used to express the result. This formulation allows the user to select the instrument most suited for a particular setting. Since instruments differ widely in age coverage, length, administration mode, and domain coverage (Boggs et al., 2019), the ability to choose the instrument, while not giving up comparability, represents a significant advance over routines that marry the scale to the instrument.
Policymakers in international settings wish to know the effect of different interventions on child development. Gaining insight into such effects is not so easy since different studies use different instruments. The ability to place measurements made by different instruments onto the same scale will allow for a more accurate understanding of policy effects. It also enables the setting of priorities and actions that are less dependent on the way the data were collected.
Statisticians and data scientists generally prefer numeric values with an unambiguous unit (e.g., centimeters, kilograms) over a vector of dichotomous data points. This chapter shows how to convert a series of PASS/FAIL scores to a numeric value with interval scale properties. The existence of such a scale opens the way for the application of precise analytic techniques, similar to those applied to child height and body weight. The techniques have a solid psychometric backing, and also apply to other types of problems.

Short history
The measurement of child development has quite an extensive history. This section • reviews definitions of child development (2.1) • discusses concepts in the nature of child development (2.2) • shows a classic example of motor measurements (2.3) • summarizes typical questions whose answers need proper measurements (2.4)

What is child development?
In contrast to concepts like height or temperature, it is unclear what exactly constitutes child development. Shirley (1931) executed one of the first rigorous studies in the field with the explicit aim that the many aspects of development, anatomical, physical, motor, intellectual, and emotional, be studied simultaneously.
Shirley gave empirical definitions of each of these domains of development.
Certain domains advance through a fixed sequence. Figure 2.1 illustrates the various stages needed for going from a fetal posture to walking alone. The ages are indicative of when these events happen, but there is a considerable variation in timing between infants.
Gesell (1943) (p. 88) formulated the following definition of development: Development is a continuous process that proceeds stage by stage in an orderly sequence.
Gesell's definition emphasizes that development is a continuous process. The stages are useful as indicators to infer the level of maturity but are of limited interest by themselves. Liebert et al. (1974) (p. 5) emphasized that development is not a phenomenon that unfolds in isolation.
Development refers to a process in growth and capability over time, as a function of both maturation and interaction with the environment.
Cameron & Bogin (2012) (p. 11) defined an endpoint of development, as follows: "Growth" is defined as an increase in size, while "maturity" or "development" is an increase in functional ability… The endpoint of maturity is when a human is functionally able to procreate successfully … not just biological maturity but also behavioural and perhaps social maturity.
Berk (2011) (p. 30) presented a dynamic systems perspective on child development as follows: Development cannot be characterized as a single line of change, and is more like a web of fibres branching out in many directions, each representing a different skill area that may undergo both continuous and stagewise transformation.
There are many more definitions of child development. The ones described here illustrate the main points of view in the field.

Theories of child development
The field of child development is vast and spans multiple academic disciplines. This short overview, therefore, cannot do justice to its enormous richness. Readers new to the field might orient themselves by browsing through introductory academic titles (Berk, 2011; Santrock, 2011), or by searching for the topic of interest in an encyclopedia, e.g., Salkind (2002).
The introductions by Santrock (2011) and Berk (2011) both distinguish major theories in child development according to how each answers the following three questions.

Continuous or discontinuous?
Does development evolve gradually as a continuous process or are there qualitatively distinct stages, with jumps occurring from one step to another? Many stage-based theories of human development have been proposed over the years: social and emotional development by psycho-sexual stages introduced by Freud and furthered by Erikson (Erikson, 1963), Kohlberg's six stages of moral development (Kohlberg, 1984) and Piaget's cognitive development theory (Piaget & Inhelder, 1969). Piaget distinguishes four main periods throughout childhood. The first period, the sensorimotor period (approximately 0-2 years), is subdivided into six stages. When taken together, these six stages describe "the road to conceptual thought." Piaget's stages are qualitatively different and aim to unravel the mechanism involved in intellectual development.
On the other hand, Gesell and others emphasize development as a continuous process. Gesell (1943) (p. 88) says: A stage represents a degree or level of maturity in the cycle of development. A stage is simply a passing moment, while development, like time, keeps marching on.

One course or multiple parallel tracks?
Stage theorists assume that children progress sequentially through the same set of stages. This assumption is also explicit in the work of Gesell.
The ecological and dynamic systems theories view development as continuous, though not necessarily progressing in an orderly fashion, so there may be multiple, parallel ways to reach the same point. The developmental path taken by a given child will depend on the child's unique combination of personal and environmental circumstances, including cultural diversity in development. Figure 2.2 illustrates that children vary in appearance.

Nature or nurture?
Are genetic or environmental factors more important for influencing development? Most theories generally acknowledge the role of both but differ in emphasis. In practice, the debate centres on the question of how to explain individual differences.
Maturation is the process of becoming fully developed, much like the natural unfolding of a flower. The process depends on both genetic factors (species, breed) as well as environmental influences (sunlight, water, nutrition). Some theorists emphasize that differences in child development are innate and stable over time, although there may be differences in unfolding speed due to different environments. Others argue that environmental factors drive differences in development between children, and changing these factors could very well impact child development.
Our position in this debate has practical implications. If we believe that differences are natural and stable, then it may not make much sense trying to change the environment, as the impact on development is likely to be small. On the other hand, we may consider developmental potential as evenly distributed, with its expression governed by the environment. In the latter case, improving life circumstances may have substantial pay-offs in terms of better development.

Shirley's motor data.
For illustration, we use data on locomotor development from a classic study on child development among 25 babies. Shirley (1931) collected measurements of the baby's walking ability, starting at ages around 13 weeks, in an ingenious way. The investigator laid out a sheet of white paper twelve inches wide on the floor of the living room and lightly greased the soles of the baby's feet with olive oil. The baby was then invited to "walk" on the sheet. Of course, very young infants needed substantial assistance. The footprints left behind were later coloured with graphite and measured. Measurements during the first year were repeated every week or bi-weekly.
For ease of plotting, the categories on the vertical axis are equally spaced. The height of the jump from one stage to the next has no sensible interpretation. We might be inclined to think that the vertical distance portrays how difficult it is to achieve the next stage, but this is inaccurate. Instead, the ability needed to reach the next stage corresponds to the length of the horizontal line between stages. For example, on average, the line for stepping is rather short in all plots, so going from stepping to standing is relatively easy.
… occasions that showed a jump. Thus the data collection needs to be intense and costly to obtain individual curves. Fortunately, there are alternatives that are much more efficient.

Typical questions asked in child development
The emotional, social and physical development of the young child has a direct effect on the adult he or she will become. We may be interested in measuring child development for answering clinical, policy or public health questions.

Level | Question
Individual | What is the child's gain in development since the last visit?
Individual | What is the difference in development between the child and peers of the same age?
Individual | How does the child's development compare to a norm?
Group | What is the effect of this intervention on child development?
Group | What is the difference in child development between these two groups?
Population | What is the change in average child development since the last measurement?
Population | What was the effect of implementing this policy on child development?
Population | How does this country compare to other countries in terms of child development?

Motivation for age-based measurement.
Milestones form the basic building blocks for instruments to measure child development. Methods to quantify growth using separate milestones relate the milestone behaviour to the child's age. Gesell (1943) (p. 89) formulated this goal as follows: We think of behaviour in terms of age, and we think of age in terms of behaviour. For any selected age it is possible to sketch a portrait which delineates the behaviour characteristics typical of the age.
There is an extensive literature that quantifies development in terms of the ages at which the child is expected to show a specific behaviour. The oldest methods for quantifying child development calculate an age equivalent for achieving a milestone, and compare the child's age to this age equivalent. Figure 3.1 graphs the ages at which each of the 21 children enters a given stage in Shirley's motor data of Table 2.1. Since standing follows stepping, children who can stand are older than the children who are stepping. Hence the ages for standing are located more to the right.

Age equivalent and developmental age.
Since age and development are so intimately related, we can express the difficulty of a milestone as the mean age at which children achieve it. For example, Stott (1967) (p. 25) defines the age equivalent and its use for measurement, as follows: The age equivalent of a particular stage is simply the average age at which children reach that particular stage. Thus, a child that is stepping beyond the age of 16.1 weeks is considered later than average, whereas a child already standing before 27.2 weeks is considered earlier than average. We may also calculate the age delta as the difference between the child's age and the norm age, and express it as "two weeks late" or "three weeks ahead." Summarizing age deltas over different milestones has led to concepts like developmental age as a measure of a child's development.
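The age equivalent and the age delta can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the four mean ages are the values quoted in this chapter for Shirley's motor data, while the dictionary keys and function names are hypothetical labels chosen for this example.

```python
# Age equivalents in weeks, as quoted in the text for Shirley's motor data.
# The key names are hypothetical labels chosen for this sketch.
AGE_EQUIVALENT = {
    "stepping": 16.1,
    "standing": 27.2,
    "walking_with_help": 43.3,
    "walking_alone": 63.3,
}


def age_equivalent(ages_at_entry):
    """Mean age at which a group of children entered a stage."""
    return sum(ages_at_entry) / len(ages_at_entry)


def age_delta(age_at_entry, milestone):
    """Child's age at entry minus the norm age.

    Positive values mean later than average, negative values earlier.
    """
    return age_at_entry - AGE_EQUIVALENT[milestone]


def mean_age_delta(entries):
    """Average age delta over several milestones.

    entries: dict mapping milestone label to the child's age at entry.
    """
    deltas = [age_delta(age, name) for name, age in entries.items()]
    return sum(deltas) / len(deltas)
```

For instance, `age_delta(18.1, "stepping")` is about two weeks, which reads as "two weeks late" in the terminology above.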

Limitations of age-based measurement.
Age-based measurement is easy to understand and widely used in the popular press, but it is not without pitfalls:
1. Age-based measurement requires us to know the ages at which the child entered a new stage. The mean age can be a biased estimate of item difficulty if visits are widely apart, irregular or missing.
2. Age-based measurement can inform us whether a child is achieving a given milestone early or late. However, it does not tell us what behaviours are characteristic for children of a given age.
3. Age-based measurement cannot exist without an age norm. When there are no norms, we cannot quantify development.
4. Age-based measurement works only at the item level. Although we may average age deltas over milestones, the choice of milestones is arbitrary.

Probability-based measurement
An alternative is to calculate the probability of achieving a milestone at a given age and compare the child's response to that probability.
The passing probability is an interpretable and relevant measure. An operational advantage of the approach is that the  necessary calculations place fewer demands on the available data and can be done even for cross-sectional studies. Figure 3.3 plots the percentage of children achieving each of Shirley's motor stages against age. There are four cumulative curves, one for each milestone, that indicate the percentage of children that pass.
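As a rough sketch of how such percentages could be obtained from cross-sectional data, the Python function below groups one PASS/FAIL observation per child into age bins and computes the proportion passing in each bin. The function name, the record layout and the bin width of four weeks are assumptions made for this example; they do not come from the chapter.

```python
from collections import defaultdict


def pass_proportion_by_age(records, bin_width=4):
    """Proportion of children passing a milestone within each age bin.

    records: iterable of (age_in_weeks, passed) pairs, one per child,
    where passed is 1 (PASS) or 0 (FAIL).
    Returns a dict {bin_start_age: proportion_passed}, ordered by age.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [passes, total]
    for age, passed in records:
        bin_start = int(age // bin_width) * bin_width
        counts[bin_start][0] += int(passed)
        counts[bin_start][1] += 1
    return {start: n_pass / n_total
            for start, (n_pass, n_total) in sorted(counts.items())}
```

Because each child contributes only a single observation, no repeated visits are needed, which is the operational advantage mentioned above.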

Example of probability-based measurement.
In analogy to the age equivalent introduced in Section 3.1.2 we can define the difficulty of the milestone as the age at which 50 per cent of the children pass. In the figure, observe that there is a gradual decline in steepness as we move from stepping to walk_alone. For example, we need an age interval of 13 weeks (33 - 20) to go from 10 to 90 per cent in standing, but need 19 weeks (71 - 52) to go from 10 to 90 per cent in walking alone. Thus, one step on the age axis corresponds to different increments in probability. The flattening pattern is typical for child development and represents evidence that development proceeds faster at earlier ages.
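The quantities in this paragraph can be read off an empirical curve by interpolation. The sketch below is one possible way to do so, assuming the curve is given as a list of (age, percentage) points. Linear interpolation and the function names are choices made for this example, not the method used to draw the figure.

```python
def age_at_percentage(curve, target):
    """Linearly interpolate the age at which `target` per cent pass.

    curve: list of (age, percentage) points sorted by age, with
    non-decreasing percentages.
    """
    for (age0, pct0), (age1, pct1) in zip(curve, curve[1:]):
        if pct0 <= target <= pct1 and pct1 > pct0:
            return age0 + (target - pct0) * (age1 - age0) / (pct1 - pct0)
    raise ValueError("target percentage is outside the range of the curve")


def age_interval_10_90(curve):
    """Age span (in the units of the curve) needed to go from 10 to 90 per cent."""
    return age_at_percentage(curve, 90) - age_at_percentage(curve, 10)
```

Feeding in only the two points quoted in the text, `[(20, 10), (33, 90)]` for standing and `[(52, 10), (71, 90)]` for walking alone, returns the intervals of 13 and 19 weeks. The milestone difficulty would be `age_at_percentage(curve, 50)` on a curve with more points.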

Limitations of probability-based measurement.
Probability-based measurement is a popular way to create instruments for screening on developmental delay. For example, each milestone in the Denver II (Frankenburg et al., 1992) has markers for the 25th, 50th, 75th and 90th age percentile.
1. The same age step corresponds to different probabilities.
2. The measurement cannot exist without some norm population. When norms differ, we cannot compare the measurements.
3. Interpretation is at the milestone level, sometimes supplemented by procedures for counting the number of delays. No aggregate takes all responses into account.

Motivation for score-based measurement.
Score-based measurement takes the responses on multiple milestones and counts the total number of items passed as a measure of development. This approach takes all answers into account, hence leading to a more stable result.
One may order milestones in difficulty, and skip those that are too easy, and stop administration for those that are too difficult.
In such cases, we cannot merely interpret the sum score as a measure of development. Instead, we need to correct for the subset of administered milestones. The usual working assumption is that the child would have passed all easier milestones and failed on all more difficult ones. We may repeat this procedure for different domains, e.g. motor, cognitive, and so on. Figure 3.4 shows a gross-motor score calculated as the number of milestones passed. It varies from 0 to 3.
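The working assumption about skipped milestones can be made explicit in code. The following Python sketch assumes milestones are indexed from easiest (0) to most difficult and that a contiguous block was administered. The function name and the input layout are hypothetical.

```python
def corrected_sum_score(responses, n_items):
    """Sum score corrected for start and stop rules.

    responses: dict {item_index: 0 or 1} for the administered milestones,
    with indices 0..n_items-1 ordered from easiest to most difficult.
    Milestones easier than the first administered one count as passed,
    milestones harder than the last administered one count as failed.
    """
    if not responses:
        raise ValueError("at least one milestone must be administered")
    first, last = min(responses), max(responses)
    if first < 0 or last >= n_items:
        raise ValueError("item index out of range")
    missing = [i for i in range(first, last + 1) if i not in responses]
    if missing:
        # The simple correction does not cover gaps inside the window.
        raise ValueError(f"intermediate milestones not administered: {missing}")
    assumed_passes = first  # all milestones easier than the starting point
    return assumed_passes + sum(responses.values())
```

The sketch deliberately refuses to score a record with gaps between the first and last administered milestone, because the assumption about easier and harder milestones says nothing about those.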

Example of score-based measurement.
The plot suggests that the difference in development between scores 0 and 1 is the same as the difference between, say, scores 2 and 3. This is not correct. For example, suppose that we express the difficulty of the milestone as an age equivalent. From Section 3.1.2 we see that the difference between stepping and standing is 27.2 - 16.1 = 11.1 weeks, whereas the difference between walking alone and walking with help is 63.3 - 43.3 = 20 weeks. Thus, according to age equivalents, scores 0 and 1 should be closer to each other, and scores 2 and 3 should be drawn further apart.

Limitations of score-based measurement.
Score-based measurement is today's dominant approach, but is not without conceptual and logistical issues.
1. The total score depends not only on the actual developmental status of the child, but also on the set of milestones administered. If a milestone is skipped or added, the sum score cannot be interpreted anymore as a measure of developmental status. It might be possible to correct for starting and stopping rules under the assumptions described in Section 3.3.1, but such corrections become involved if intermediate milestones are missing.
2. It is not possible to compare the scores made by different instruments. Some instruments allow conversion to age-conditional scores. However, the sample used to derive such transformations pertains to that tool and does not generalise to others.
3. Domains are hard to separate. For example, some cognitive milestones tap into fine motor capabilities, and vice versa. There are different ways to define domains, so domain interpretation varies by instrument.
4. Administration of a full test may take substantial time. The materials are often proprietary and costly.

Motivation for unit-based measurement.
Unit-based measurement starts by defining ideal properties and derives a procedure to aggregate the responses on milestones into an overall score that will meet this ideal.
Section 2.4 highlighted questions for individuals, groups and populations. There are three questions: • What is the difference in development over time for the same child, group or community?
• What is the difference in development between different children, groups or populations of the same age?
• How does child development compare to a norm?
In the ideal situation, we would like to have a continuous (latent) variable D (for development) that measures child development. The scale should allow us to quantify the ability of persons, groups or populations from low to high. It should have a constant unit so that a given difference in ability refers to the same quantity across the entire scale. We find the same property in height, where a distance of 10 cm represents the same amount for molecules, people or galaxies. When these conditions are met, we say that we measure on an interval scale.
If we succeed in creating an interval scale for child development, an enormous arsenal of techniques developed for quantitative variables opens up to measure, track and analyze child development. We may then evaluate the status of a child in terms of D points gained, create age-dependent diagrams (just like growth charts for height and weight), and devise age-conditional measures for child development and intelligent adaptive testing schemes. Promising studies on Dutch data (van Buuren, 2014) suggest that such benefits are well within reach. Figure 3.5 is similar to Figure 3.3, but with Age replaced by Ability. Also, modelled curves have replaced empirical ones, but this is not essential.

Example of unit-based measurement.
We estimated the ability values on the horizontal axis from the data. The values correspond to the amount of development at each visit. Likewise, we calculated the logistic curves from the data. These reflect the probability of passing each milestone at a given level of ability. The increase in ability that is needed to go from 10 to 90 per cent is about five units here. Since all curves are parallel, the interval is constant for all scale locations. Thus, the scale is an interval scale with a constant unit of measurement, the type of measurement needed for answering the basic questions identified in Section 3.4.1.
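The constant interval follows directly from the shape of parallel logistic curves, as the following Python sketch shows. It assumes a logistic curve with one difficulty per milestone and a common `scale` parameter. The function names and the `scale` parameter are assumptions for this example, and the scaling actually used in the figure is not stated in this section.

```python
import math


def pass_probability(ability, difficulty, scale=1.0):
    """Logistic curve: probability of passing a milestone at a given ability."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty) / scale))


def ability_at_probability(p, difficulty, scale=1.0):
    """Inverse of the logistic curve: ability at which the pass probability is p."""
    return difficulty + scale * math.log(p / (1.0 - p))


def logistic_interval_10_90(difficulty, scale=1.0):
    """Increase in ability needed to go from 10 to 90 per cent passing."""
    return (ability_at_probability(0.9, difficulty, scale)
            - ability_at_probability(0.1, difficulty, scale))
```

Under these assumptions the interval equals `2 * scale * math.log(9)`, roughly 4.4 times the scale, whatever the difficulty of the milestone. An interval of about five units would therefore correspond to a scale slightly above one.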

Limitations of unit-based measurement.
While unit-based measurement has many advantages, it cannot perform miracles.
1. An important assumption is that the milestones "measure the same thing," or put differently, are manifestations of a continuous latent variable that can be measured by empirical observations. Unit-based measurement won't work if there is no sensible latent scale.
2. The portrayed advantages hold only if the discrepancies between the data and the model are relatively small. Since the simplest and most powerful measurement models are strict, it is essential to obtain a good fit between the data and the model.
3. The construction of unit-based measurement requires psychometric expertise, specialized computer software and considerable sample sizes.

A unified framework
This section brings together the four approaches outlined above into a unified framework. More generally, measurement is the process of locating milestones and children on a line. This line represents a latent variable, a continuous construct that defines the different poles of the concept that we want to measure. A latent variable ranges from low to high.
The first part of measurement is to determine the location of the milestones on the latent variable. In many cases, the instrument maker has already done that. For example, each length marker on a ruler corresponds to a milestone for measuring length. The manufacturer of the ruler has already placed the marks at the appropriate places on the tool, and we take for granted that each marker has been calibrated correctly.
A milestone for child development is similar to a length marker, but
• we may not know how much development the milestone measures, so its location on the line is unknown, or uncertain;
• we may not know whether the milestone measures child development at all, so that it may have no location on the line.
The second part of measurement is to find the location of each child on the line. For child height, this is easy: We place the horizontal headpiece on top of the child's head and read off the closest height marker. Since we lack a physical ruler for development, we must deduce the child's location on the line from the responses on a series of well-chosen milestones.
By definition, we cannot observe the values of a latent variable directly. However, we may be able to measure variables (milestones) that are related to the latent variable. For example, we may have scores on tasks like standing or walking with help.
The measurement model specifies the relations between the actual measurements and the latent variable. Under a given measurement model, we may estimate the locations of milestones and children on the line. Section 4.5 discusses measurement models in more detail.

Why unit-based measurement
This section distinguished four approaches to measure child development: age-based, probability-based, score-based and unit-based measurement. Table 3.1 summarizes how the approaches evaluate on nine criteria.
Age-based measurement expresses development in age equivalents, whose precise definition depends on the reference population. Age-based measurement does not support multiple milestones and does not use the concept of a latent variable.
Probability-based measurement expresses development as age percentiles for a reference population. It is useful for individual milestones but does not support multiple items or a latent variable interpretation.
Score-based measurement quantifies development by summing the number of passes. Different instruments make different selections of milestones, so the scores taken are unique to the tool. Thus comparing the measurements obtained by different devices is difficult. Skipping or adding items requires corrections.
Unit-based measurement defines a unit by a theoretical model. When the data fit the model, we are able to construct instruments that produce values in a standard metric.

The D-score
Section 2 provided historical background on the nature of child development. Section 3 discussed four general quantification approaches. This section explains how to apply the unit-based approach to arrive at the D-score scale. The text illustrates the process with real data.
• Dutch Development Instrument (DDI) (4.1)
• Milestone passing by age and by D-score (4.2, 4.3)
• How do age and D-score relate? (4.4)
• Role of the measurement model (4.5)
• Item and person response functions (4.6)
• Engelhard invariance criteria (4.7)
• Why the Rasch model? (4.8)

The milestones form two sets, one for children aged 0-15 months, and another for children aged 15-54 months. The YHC professionals administer an age-appropriate subset of milestones at each of the scheduled visits, thus building a longitudinal developmental profile for each child.

Description of SMOCC study.
The Social Medical Survey of Children Attending Child Health Clinics (SMOCC) study is a nationally representative cohort of 2,151 children born in The Netherlands during the years 1988-1989 (Herngreen et al., 1994). The study monitored child development using observations made on the DDI during nine visits covering the first 24 months of life. The SMOCC study collected information during the first two years on 57 (out of 75) milestones.
The standard set in the DDI consists of relatively easy milestones that 90 per cent of the children can pass at the scheduled age. This set is designed to have maximal sensitivity for picking up delays in development. A distinctive feature of the SMOCC study was the inclusion of more difficult milestones beyond the standard set. The additional set originates from the next time point. The success rate on these milestones is about 50 per cent. Table 4.1 shows the 57 milestones from the DDI for ages 0-30 months as administered in the SMOCC study. Items are sorted according to debut, the age at which the item appears in the DDI. The response to each milestone is either a PASS (1) or a FAIL (0). Children who did not pass a milestone at the debut age were re-measured on that milestone during the next visit. The process continued until the child passed the milestone.

Codebook of DDI 0-30 months.
4.2 Probability of passing a milestone given age
Figure 4.1 summarizes the response obtained on each milestone as a curve against age. The percentage of pass scores increases with age for all milestones. Note that curves on the left have steeper slopes than those on the right, thus indicating that development is faster for younger children.
The domain determines the colour (blue: gross motor, green: fine motor, red: communication). In general, domains are well mixed across age, though around some ages, e.g., at four months, multiple milestones from the same domain appear.
How can the relation between per cent pass and age be so different from the relation between per cent pass and the D-score? The next section explains the reason.

Relation between age and the D-score
• In the default orientation (age on the horizontal axis, D-score on the vertical axis), we see a curvilinear relation between the age and item difficulty.
• Rotate the graph (age on the horizontal axis, passing percentage on the vertical axis). Observe that this is the
• Rotate the graph (D-score on the horizontal axis, passing percentage on the vertical axis). Observe that this pattern is the same as in Figure 4.2 (with equal slopes).
All patterns can co-exist because of the curvature in the relation between D-score and age. The curvature is never explicitly modelled or defined, but is a consequence of the equal-slopes assumption in the relation between the D-score and the passing percentage of a milestone.
4.5 Measurement model for the D-score

What are measurement models?
From section 3.5 we quote: The measurement model specifies the relations between the data and the latent variable.
Item response theory (IRT) models enable quantification of the locations of both items (milestones) and persons on the latent variable. We reserve the term item for generic properties, and milestone for child development. In general, items are part of the measurement instrument, and persons are the objects to be measured.
An IRT model has three major structural components:
• Specification of the underlying latent variable(s). In this work, we restrict ourselves to models with just one latent variable. Multi-dimensional IRT models do have their uses, but they are complicated to fit and not widely used;
• For a given item, a specification of the probability of success given a value on the latent variable. This specification can take many forms. Section 4.6 focuses on this in more detail;
• Specification of how the probability models for the different items should be combined. In this work, we restrict ourselves to models that assume local independence of the probabilities. In that case, the probability of passing two items is equal to the product of the success probabilities.
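Local independence can be illustrated in a few lines. The sketch below takes the success probabilities of the items at a fixed ability as given; the function name and the numbers are hypothetical and serve only to show how probabilities multiply.

```python
from itertools import product

def joint_probability(pass_probs, responses):
    """Probability of a PASS (1) / FAIL (0) pattern under local independence.

    pass_probs: success probability of each item at a fixed ability.
    responses:  observed 0/1 response on each item.
    """
    prob = 1.0
    for p, x in zip(pass_probs, responses):
        prob *= p if x == 1 else 1.0 - p
    return prob

# Hypothetical success probabilities of two items for one child.
both_pass = joint_probability([0.9, 0.5], [1, 1])

# The probabilities of all four response patterns add up to one.
total = sum(joint_probability([0.9, 0.5], pattern)
            for pattern in product([0, 1], repeat=2))
```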

Adapt the model? Or adapt the data?
The measurement model induces a predictable pattern in the observed items. We can test this pattern against the observed data. When there is misfit between the expected and observed data, we can follow two strategies:
• Make the measurement model more general;
• Discard items (and sometimes persons) to make the model fit.
In this work, we opt for the rigorous Rasch model (Rasch, 1960) and will adapt the data to reduce discrepancies between model and data. Arguments for this choice are given later, in Section 4.8.

Item response functions
Most measurement models describe the probability of passing an item as a function of the difference between the person's ability and the item's difficulty. A person with low ability will almost inevitably fail a difficult item, whereas a highly able person will almost surely pass an easy item.
Let us now introduce a few symbols. We adopt the notation used in Wright & Masters (1982). We use $\beta_n$ (ability) to refer to the true (but unknown) developmental score of child n. Symbol $\delta_i$ (difficulty) is the true (but unknown) difficulty of an item i, and $\pi_{ni}$ is the probability that child n passes item i. See Appendix A for a complete list.
The difference between the ability of child n and the difficulty of item i is $\beta_n - \delta_i$. In the special case that $\beta_n = \delta_i$, the person will have a probability of 0.5 of passing the item.

Logistic model.
A widely used method is to express differences on the latent scale in terms of logistic units (or logits) (Berkson, 1944). The reason for preferring the logistic over the linear unit is that its output is a probability value that maps to discrete events. In our case, we can describe the probability of passing an item (milestone) as a function of the difference between $\beta_n$ and $\delta_i$ expressed in logits. Figure 4.5 shows how the percentage of children that pass the item varies in terms of the ability-difficulty gap $\beta_n - \delta_i$. The gap can vary either by $\beta_n$ or $\delta_i$ so that we may use the graph in two ways:
• To find the probability of passing items with various difficulties for a child with ability $\beta_n$. If $\delta_i = \beta_n$ then $\pi_{ni} = 0.5$. If $\delta_i < \beta_n$ then $\pi_{ni} > 0.5$, and if $\delta_i > \beta_n$ then $\pi_{ni} < 0.5$. In words: If the difficulty of the item is equal to the child's ability, then the child has a 50/50 chance to pass. The child will have a higher than 50/50 chance of passing items with lower difficulty and a lower than 50/50 chance of passing items with difficulties that exceed the child's ability.
• To find the probability of passing a given item $\delta_i$ for children that vary in ability. If $\beta_n < \delta_i$ then $\pi_{ni} < 0.5$, and if $\beta_n > \delta_i$ then $\pi_{ni} > 0.5$. In words: Children with abilities lower than the item's difficulty will have a lower than 50/50 chance of passing, whereas children with abilities that exceed the item's difficulty will have a higher than 50/50 chance of passing.
Formula (4.1) defines the standard logistic curve:
$$\pi_{ni} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)} \tag{4.1}$$
One way to interpret the formula is as follows. The logarithm of the odds that a person with ability $\beta_n$ passes an item of difficulty $\delta_i$ is equal to the difference $\beta_n - \delta_i$ (Wright & Masters, 1982). For example, suppose that the probability that person n passes milestone i is $\pi_{ni} = 0.5$. In that case, the odds of passing are equal to 0.5/(1 - 0.5) = 1, so log(1) = 0 and thus $\beta_n = \delta_i$. If $\beta_n - \delta_i = \log(2) = 0.693$, person n is two times more likely to pass than to fail. Likewise, if the difference is $\beta_n - \delta_i = \log(3) = 1.1$, then person n is three times more likely to pass than to fail. And so on.
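The log-odds interpretation can be verified with a short calculation. The sketch below is a minimal illustration of the standard logistic curve; the function names are ours and are not taken from any package.

```python
import math

def pass_probability(beta, delta):
    """Standard logistic curve: probability that a child with ability `beta`
    passes an item with difficulty `delta` (both in logits)."""
    gap = beta - delta
    return math.exp(gap) / (1.0 + math.exp(gap))

def odds(p):
    """Odds of passing versus failing."""
    return p / (1.0 - p)

# Gaps of 0, log(2) and log(3) logits give odds of 1, 2 and 3.
for gap in (0.0, math.log(2), math.log(3)):
    p = pass_probability(gap, 0.0)
    print(f"gap = {gap:.3f}  probability = {p:.3f}  odds = {odds(p):.1f}")
```

The log of the odds returns the gap, which is why one logit is a natural unit for the difference between ability and difficulty.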

Types of item response functions.
The standard logistic function is by no means the only option to map the relationship between the latent variable and the probability of passing an item. The logistic function is the dominant choice in IRT, but it is instructive to study some other mappings. The item response function maps success probability against ability. Figure 4.6 illustrates several other possibilities. Let us consider five hypothetical items, A-E. Note that the horizontal axis now refers to the ability, instead of the ability-difficulty gap in Figure 4.5.
• A: Item A is the logistic function discussed in Section 4.6.
• B: For item B, the probability of passing is constant at 30 per cent. This 30 per cent is not related to ability. Item B does not measure ability, only adds to the noise, and is of low quality.
• C: Item C is a step function centred at an ability level of 1, so all children with an ability below 1 logit fail and all children with ability above 1 logit pass. Item C is the ideal item for discriminating children with abilities above and below 1. The item is not sensitive to differences at other ability levels, and often not so realistic in practice.
• D: Like A, item D is a smoothly increasing logistic function, but it has an extra parameter that allows it to vary its slope (or discrimination). The extra parameter can make the curve steeper (more discriminatory) than the red curve, in the limit approaching a step curve. It can also become shallower (less discriminatory) than the red curve (as plotted here), in the limit approaching a constant curve (item B). Thus, item D generalizes items A, B or C.
• E: Item E is even more general in the sense that it need not be logistic, but can be any monotonically increasing function. As plotted, the item is insensitive to abilities between -1 and 0 logits, and more sensitive to abilities between 0 and 2 logits.
These are just some examples of how the relationship between the child's ability and passing probability could look. In practice, the curves need not start at 0 per cent or end at 100 per cent. They could also be U-shaped, or have other non-monotonic forms. See Coombs (1964) for a thorough overview of such models. In practice, most models are restricted to shapes A-D.
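The shapes of items A to D are easy to write down as functions of ability. The sketch below is illustrative only: the difficulty of 0 logits and the slope of 0.5 are assumed values, and the behaviour of the step function exactly at the threshold is an arbitrary choice.

```python
import math

def item_a(ability, difficulty=0.0):
    """Item A: standard logistic curve (difficulty is an assumed value)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_b(ability):
    """Item B: constant 30 per cent, unrelated to ability."""
    return 0.30

def item_c(ability, threshold=1.0):
    """Item C: step function, FAIL below the threshold and PASS above it."""
    return 1.0 if ability > threshold else 0.0

def item_d(ability, difficulty=0.0, slope=0.5):
    """Item D: logistic curve with a slope (discrimination) parameter.
    slope = 1 gives item A, a large slope approaches a step function,
    and a slope near 0 approaches a constant curve."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))
```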

Person response functions.
We can reverse the roles of persons and items. The person response function tells us how likely it is that a single person can pass an item, or more commonly, a set of items. Let us continue with items A, C and D from Figure 4.6, and calculate the response function for three children, respectively with abilities $\beta_1 = -2$, $\beta_2 = 0$ and $\beta_3 = 2$.
Figure 4.7 presents the person response functions for three persons with abilities of -2, 0 and +2 logits. We calculate the functions as the average of response probabilities on items A, C and D. Thus, on average, we expect that child 1 will pass an easy item of difficulty -3 about 60 per cent of the time, whereas for an intermediate item of difficulty -1 the passing probability would be 10 per cent. For child 3, with higher ability, these probabilities are quite different: 97 per cent and 90 per cent. The substantial drop in the middle of the curve is due to the step function of item C.

Engelhard criteria for invariant measurement
In this work, we strive to achieve invariant measurement, a strict form of measurement that is subject to the following requirements (Engelhard Jr., 2013, p. 14):

1. Item-invariant measurement of persons: The measurement of persons must be independent of the particular items used for the measuring.
2. Non-crossing person response functions: A more able person must always have a better chance of success on an item than a less able person.
3. Person-invariant calibration of test items: The calibration of the items must be independent of the particular persons used for calibration.
4. Non-crossing item response functions: Any person must have a better chance of success on an easy item than on a more difficult item.
5. Unidimensionality: Items and persons take on values on a single latent variable. Under this assumption, the relations between the items are fully explainable by the scores on the latent scale. In practice, the requirement implies that items should measure the same construct (Hattie, 1985).

Three families of IRT models support invariant measurement:
1. Scalogram model (Guttman, 1950)
2. Rasch model (Andrich, 1978; Rasch, 1960; Wright & Masters, 1982)
3. Mokken scaling model (Mokken, 1971; Molenaar, 1997)

The Guttman and Mokken models yield an ordinal latent scale, while the Rasch model yields an interval scale (with a constant unit).

Why take the Rasch model?
• Invariant measurement: The Rasch model meets the five Engelhard criteria (c.f. Section 4.7).
• Interval scale: When it fits, the Rasch model provides an interval scale, the de-facto requirement for any numerical comparisons (c.f. Section 3.4.1).
• Parsimonious: The Rasch model has one parameter for each item and one parameter for each person. The Rasch model is one of the most parsimonious IRT models, and can easily be applied to thousands of items and millions of persons.
• Specific objectivity: Person and item parameters are mathematically separate entities in the Rasch model. In practice, this means that the estimated difference in ability between two persons does not depend on the difficulty of the test. Also, the estimated differences in difficulties between two items do not depend on the abilities in the calibration sample. The property is especially important in the analysis of combined data, where abilities can vary widely between sources. See Rasch (1977).
• Fits child development data: Last but not least, as we will see in Section 6, the Rasch model provides an excellent fit to child development milestones.

Computation
This section explains the basic computations needed for fitting and evaluating the Rasch model. We distinguish the following steps:
• Identify nature of the problem (5.1)
• Estimation of item parameters (5.2)
• Anchoring (5.2.2)
• Estimation of the D-score (5.3)
• Estimation of age-conditional references (5.4)
Readers not interested in these details may continue to model evaluation in Section 6.

Identify nature of the problem
The SMOCC dataset, introduced in Section 4.1.2, contains scores on the DDI of Dutch children aged 0-2 years made during nine visits. Table 5.1 contains data of three children, measured on nine visits between ages 0-2 years. The DDI scores take values 0 (FAIL) and 1 (PASS). In order to save horizontal space, we truncated the column headers to the last two digits of the item names.
Since the selection of milestones depends on age, the dataset contains a large number of empty cells. Naive use of sum scores as a proxy to ability is therefore problematic. An empty cell is not a FAIL, so it is incorrect to impute those cells by zeroes.
Note that some rows contain only 1's, e.g., in row 2. Many computer programs for Rasch analysis routinely remove such perfect scores before fitting. However, unless the number of perfect scores is very small, this is not recommended because doing so can severely affect the ability distribution.
In order to effectively handle the missing data and to preserve all persons in the analysis we separate estimation of item difficulties (c.f. Section 5.2) and person abilities (c.f. Section 5.3).

Pairwise estimation of item difficulties.
There are many methods for estimating the difficulty parameters of the Rasch model. See Linacre (2004) for an overview.
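We will use the pairwise estimation method. This method writes the probability that child n passes item i but not item j, given that the child passed exactly one of them, as $\exp(-\delta_i)/(\exp(-\delta_i) + \exp(-\delta_j))$. This expression follows from formula (4.1), with $\delta_i$ the difficulty of item i; the ability $\beta_n$ cancels, so the probability is the same for every child.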
The method optimizes the pseudo-likelihood of all item pairs over the difficulty estimates by a simple iterative procedure.
Zwinderman (1995) has shown that this procedure provides consistent estimates with an efficiency similar to that of the computationally more intensive conditional and marginal maximum likelihood methods.
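To make the idea concrete, the sketch below estimates difficulties from pairwise counts. It is not the routine used in the sirt package; it is a simple fixed-point (minorization-maximization) iteration of the kind used for paired-comparison models, chosen here as a stand-in. It works with the easiness $\varepsilon_i = \exp(-\delta_i)$: under formula (4.1), the probability that a child passes item i and fails item j, given exactly one pass, is $\varepsilon_i/(\varepsilon_i + \varepsilon_j)$. The function name and the input layout are assumptions, and every item needs at least one "pass i, fail j" count.

```python
import math

def pairwise_difficulties(counts, iterations=500):
    """Estimate Rasch difficulties from pairwise counts.

    counts[i][j] = number of children who passed item i and failed item j.
    Returns difficulties in logits, centred so that they sum to zero.
    """
    k = len(counts)
    easiness = [1.0] * k  # easiness_i = exp(-delta_i)
    for _ in range(iterations):
        updated = []
        for i in range(k):
            others = [j for j in range(k) if j != i]
            wins = sum(counts[i][j] for j in others)
            denom = sum((counts[i][j] + counts[j][i]) / (easiness[i] + easiness[j])
                        for j in others)
            updated.append(wins / denom)
        # Fix the origin of the scale: geometric mean of the easiness equals 1.
        geo_mean = math.exp(sum(math.log(e) for e in updated) / k)
        easiness = [e / geo_mean for e in updated]
    return [-math.log(e) for e in easiness]
```

Because only children with exactly one pass on a pair contribute to the counts, children with perfect scores can stay in the data without distorting the estimates.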
The beauty of the method is that it is independent of the ability distribution, so there is no need to remove perfect scores. We use the function rasch.pairwise.itemcluster() as implemented in the sirt package (Robitzsch, 2016).

• Both the zero of the logit scale and its variance depend on the sample used to calibrate the item difficulties.

Rescaling preserves the properties of the Rasch model. To make the scale independent of the specified sample, we transform the scale so that two items will always have the same value on the transformed scale. The choice of the two anchor items is essentially arbitrary, but they should correspond to milestones that are easy to measure with small error. In the sequel, we use the two milestones listed in Table 5.2 to anchor the D-score scale. With the choice of Table 5.2, D-score values are approximately 0D around birth. At the age of 1 year, the score will be around 50D, so during the first year of life, one D unit corresponds to approximately a one-week interval. Figure 5.2 shows the difficulty estimates in the D-score scale.
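Anchoring amounts to a linear transformation that is fixed by two points. The sketch below shows the arithmetic; the logit difficulties and the target values of the two anchor items are hypothetical numbers chosen for illustration, not the entries of the anchoring table.

```python
def anchor_transform(logit_a, logit_b, target_a, target_b):
    """Return (intercept, slope) of the linear map that sends the logit
    difficulties of the two anchor items to their fixed target values."""
    slope = (target_b - target_a) / (logit_b - logit_a)
    intercept = target_a - slope * logit_a
    return intercept, slope

def to_d_scale(logit, intercept, slope):
    """Convert a logit value (item difficulty or ability) to the anchored scale."""
    return intercept + slope * logit

# Hypothetical anchor items: logit difficulties -2.0 and 1.5,
# fixed at 20 and 40 units on the transformed scale.
intercept, slope = anchor_transform(-2.0, 1.5, 20.0, 40.0)
```

Since the map is linear with a positive slope, differences on the logit scale are multiplied by a constant, so the interval property of the scale is retained.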

Estimation of the D-score
The second part of the estimation process is to estimate a D-score. The D-score quantifies the development of a child at a given age. Whereas the instrument developer is responsible for the estimation of item parameters, D-score estimation is more of a task for the user. To calculate the D-score, we need the following ingredients:
• Child's PASS/FAIL scores on the milestones administered;
• The difficulty estimates of each milestone administered;
• A prior distribution, an estimate of the D-score distribution before seeing any PASS/FAIL score.
Using these inputs, we may use Bayes' theorem to calculate the position of the person on the latent variable.

Role of the starting prior.
The first two inputs to the D-score will be self-evident. The third component, the prior distribution, is needed to be able to deal with perfect responses. The prior distribution summarizes our knowledge about the D-score before we see any of the child's PASS/FAIL scores. In general, we like the prior to be non-informative, so that the observed responses and item difficulties entirely determine the value of the D-score. In practice, we cannot use a truly non-informative prior because that would leave the D-score for perfect responses (i.e., all PASS or all FAIL) undefined. The choice of the prior is essentially arbitrary, but we can make it in such a way that its impact on the value of the D-score is negligible, especially for tests where we have more than, say, four items.
Since we know that the D-score depends on age, a logical choice for the prior is to make it dependent on age. In particular, we will define the prior as a normal distribution with a mean equal to the expected mean in Figure 4.3 at the child's age, and with a standard deviation that is considerably higher than in Figure 4.3. Numerical example: the mean D-score at the age of 15 months is equal to 53.6D. The standard deviation in Figure 4.3 varies between 2.6D and 3.0D, with an average of 2.9D. After some experimentation, we found that using a standard deviation of 5.0D for the prior yields a good compromise between non-informativeness and robustness of D-score estimates for perfect patterns. The resulting starting prior for a child aged 15 months is thus N(53.6, 5), a normal distribution with mean 53.6D and standard deviation 5D.
The reader now probably wonders about a chicken-and-egg problem: To calculate the D-score, we need a prior, and to determine the prior we need the D-score. So how did we calculate the D-scores in Figure 4.3? The answer is that we first took a rougher prior, and calculated two temporary models in succession, using the D-scores obtained after solution 1 to inform the prior before solution 2, and so on. It turned out that the D-scores in Figure 4.3 hardly changed after two steps, and so we stopped there. Figure 5.3 illustrates starting distributions (priors) chosen according to the principles set above for the ages of 1, 15 and 24 months. As expected, the assumed ability of an infant aged one month is much lower than that of a child aged 15 months, which in turn is lower than the ability of a toddler aged 24 months. The green distribution for 15 months corresponds to the normal distribution N(53.6, 5).

Starting prior: Numerical example.
Another choice that we need to make is the grid of points on which we calculate the prior and posterior distributions. Figure 5.3 uses a grid from -10D to +80D, with a step size of 1D. These are fixed quadrature points, and there are 91 of them. While these quadrature points are sufficient to estimate the D-score for ages up to 2.5 years, it is wise to extend the range for older children with higher D-scores.
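The starting prior on the grid can be sketched as follows. The grid limits, the mean of 53.6D and the standard deviation of 5D come from the text; the function names are ours, and normalising the discretised density so that it sums to one is an assumption about how the grid is used.

```python
import math

def quadrature_points(lower=-10, upper=80, step=1):
    """Fixed, evenly spaced quadrature points (in D units)."""
    return list(range(lower, upper + 1, step))

def normal_prior(qp, mean, sd):
    """Normal prior evaluated on the grid and normalised to sum to one."""
    density = [math.exp(-0.5 * ((q - mean) / sd) ** 2) for q in qp]
    total = sum(density)
    return [d / total for d in density]

qp = quadrature_points()
prior_15m = normal_prior(qp, mean=53.6, sd=5.0)  # starting prior at 15 months
```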

EAP algorithm.
The algorithm for estimating the D-score is known as the Expected a posteriori (EAP) method, first described by Bock & Mislevy (1982). Calculation of the D-score proceeds item by item. Suppose we have some vague and preliminary idea about the distribution of D, the starting prior (c.f. section 5.3.1), based on age. The procedure uses Bayes rule to update this prior knowledge with data from the first item (using the child's FAIL/PASS score and the estimated item difficulty) to calculate the posterior. The next step uses this posterior as prior before processing the next item, and so on. The procedure stops when the item pool is exhausted. The order in which items enter does not matter for the result. The D-score is equal to the mean of the posterior calculated after the last question.
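The item-by-item updating can be sketched in a few lines. The example is a simplified illustration, not the authors' implementation: the `scale` argument (D units per logit) and the item difficulties are hypothetical values, so the output is not meant to reproduce the figures reported in the text. A response of `None` marks a milestone that was not administered and leaves the posterior unchanged.

```python
import math

def normal_prior(qp, mean, sd):
    """Normal prior evaluated on the grid and normalised to sum to one."""
    density = [math.exp(-0.5 * ((q - mean) / sd) ** 2) for q in qp]
    total = sum(density)
    return [d / total for d in density]

def eap(responses, deltas, qp, prior, scale=1.0):
    """Expected a posteriori estimate on a fixed grid of quadrature points.

    responses: 1 (PASS), 0 (FAIL) or None (not administered), one per item.
    deltas:    item difficulties on the same scale as the grid.
    Returns the posterior mean and the posterior standard deviation.
    """
    posterior = list(prior)
    for x, delta in zip(responses, deltas):
        if x is None:
            continue  # no measurement: the posterior equals the prior
        for k, q in enumerate(qp):
            p = 1.0 / (1.0 + math.exp(-(q - delta) / scale))
            posterior[k] *= p if x == 1 else 1.0 - p
        total = sum(posterior)
        posterior = [w / total for w in posterior]  # posterior becomes next prior
    mean = sum(q * w for q, w in zip(qp, posterior))
    sd = math.sqrt(sum((q - mean) ** 2 * w for q, w in zip(qp, posterior)))
    return mean, sd

qp = list(range(-10, 81))
prior = normal_prior(qp, mean=53.6, sd=5.0)
deltas = [46.0, 47.0, 52.0, 44.0, 49.0]  # hypothetical difficulties
child_1 = [1, 1, 1, 1, None]             # all PASS, last item missing
child_2 = [1, 0, 1, 1, 0]                # two FAIL scores
d1, sem1 = eap(child_1, deltas, qp, prior, scale=2.0)
d2, sem2 = eap(child_2, deltas, qp, prior, scale=2.0)
```

Each update multiplies the current distribution by the likelihood of one response, so processing the items in a different order, or all at once, gives the same posterior.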

EAP algorithm: Numerical example.
Suppose we measure two boys aged 15 months, David and Rob, by the DDI. Table 5.3 shows the difficulty of each milestone (in the column labelled "Delta"), and the responses of David and Rob for the standard five DDI milestones for the age of 15 months.
The mean D-score for Dutch children aged 15 months is 53.6D, so the milestones are easy to pass at this age, with ddicmm037 being the most difficult. David passed all milestones but has no score on the last. Rob fails on ddifmm012 and ddigmm067. How do we calculate the D-score for David and Rob? Figure 5.4 shows how the prior transforms into the posterior after we successively feed the measurements into the calculation. There are five milestones, so the calculation comprises five steps:
1. Both David and Rob pass ddifmd011. The prior (light green) is the same as in Figure 5.3. After a PASS, the posterior will be located more to the right, and will often be more peaked. Both happen here, but the change is small. The reason is that a PASS on this milestone is not very informative. For a child with a true D-score of 53D, the probability of passing ddifmd011 is equal to 0.966. If passing is so common, there is not much information in the measurement.
2. David passes ddifmm012, but Rob does not. Observe that the prior is identical to the posterior of ddifmd011. For David, the posterior is only slightly different from the prior, for the same reason as above. For Rob, we find a considerable change to the left, both for location (from 54.3D to 47.1D) and peakedness. This one FAIL lowers Rob's score by 7.2D.
3. Milestone ddicmm037 is more difficult than the previous two milestones, so a pass on ddicmm037 does have a definite effect on the posterior for both David and Rob.
4. David's PASS on ddigmm066 does not bring any additional information, so his prior and posterior are virtually indistinguishable. For Rob, we find a slight shift to the right.
5. There is no measurement for David on ddigmm067, so the prior and posterior are equivalent. For Rob, we observe a FAIL, which shifts his posterior to the left.
We calculate the D-score as the mean of the posterior. David's D-score is equal to 55.7D. Note that the measurement error, as estimated from the variance of the posterior, is relatively large. Rob's D-score is equal to 47.7D, with a much smaller measurement error. This result is consistent with the design principles of the DDI, which is meant to detect children with developmental delay.
The example illustrates that the quality of the D-score depends on the match between the true (but unknown) D-score of the child and the difficulty of the milestone.

Technical observations on D-score estimation
• Administration of a too easy set of milestones introduces a ceiling with children that pass all milestones, but whose true D-score could extend well beyond the maximum. Depending on the goal of the measurement, this may or may not be a problem.
• The specification of the prior and posterior distributions requires a set of quadrature points. The quadrature points are taken here as the static and evenly-spaced set of integers between -10 and +80. Using other quadrature points may affect the estimate, especially if the range of the quadrature points does not cover the entire D-score range.
• The actual calculations are here done item by item. A more efficient method is to handle all responses at once. The result will be the same.

Motivation.
The last step involves estimation of an age-conditional reference distribution for the D-score. This distribution can be used to construct growth charts that portray the normal variation in development. Also, the references can be used to calculate age-standardized D-scores, called DAZ, that emphasize the location of the measurement in comparison to age peers.
Estimation of reference centiles is reasonably standard. Here we follow van Buuren (2014) to fit age-conditional references of the D-score for boys and girls combined by the LMS method. The LMS method by Cole & Green (1992) assumes that the outcome has a normal distribution after a Box-Cox transformation. The reference distribution has three parameters, which model respectively the location (M), the spread (S), and the skewness (L) of the distribution. Each of the three parameters can vary smoothly with age.  The area between the -2SD and +2SD lines delineates the D-score expected if development is healthy. Note that the shape of the reference is quite similar to that of weight and height, with rapid growth occurring in the first few months. The references are purely cross-sectional and do not account for the correlation structure between ages. For prediction purposes, it is useful to extend the modelling to include velocities and change scores.

Conversion of D to DAZ, and vice versa.
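Suppose that $M_t$, $S_t$ and $L_t$ are the parameter values at age t. Cole (1988) shows that the transformation
$$Z = \frac{(D/M_t)^{L_t} - 1}{L_t S_t}$$
converts the D-score D of a child aged t into its standard normal deviate Z (the DAZ). We may derive any required centile curve from Table 5.4. First, choose $Z_\alpha$ as the Z-score that delineates 100α per cent of the distribution, for example, $Z_{0.05} = -1.64$. The D-score that defines the 100α centile is equal to
$$D_\alpha = M_t (1 + L_t S_t Z_\alpha)^{1/L_t}$$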
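The conversion in both directions can be sketched with the usual LMS formulas, including the logarithmic form that applies when L equals zero. The parameter values in the example are hypothetical (only the median of 53.6D at 15 months is taken from the text); actual values would be read from the reference table at the child's age.

```python
import math

def d_to_daz(d, L, M, S):
    """Convert a D-score to an age-standardized score (DAZ) under the LMS model."""
    if L == 0:
        return math.log(d / M) / S
    return ((d / M) ** L - 1.0) / (L * S)

def daz_to_d(z, L, M, S):
    """Inverse conversion: the D-score located at Z-score z."""
    if L == 0:
        return M * math.exp(S * z)
    return M * (1.0 + L * S * z) ** (1.0 / L)

# Hypothetical LMS values at one age.
L_t, M_t, S_t = 1.5, 53.6, 0.05
p5 = daz_to_d(-1.64, L_t, M_t, S_t)  # D-score at the 5th centile
z = d_to_daz(50.0, L_t, M_t, S_t)    # DAZ of a child with D = 50
```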

Evaluation
The properties of the Rasch model (c.f. Section 4.8) only hold when the data and model agree. It is, therefore, essential to study and remove discrepancies between model and data. This section explains several techniques that aid in the evaluation of model fit.
• Item fit (6.1)
• Person fit (6.2)
• Differential item functioning (6.3)
• Item information (6.4)
• Reliability (6.5)
These topics address different aspects of the solution. In practice, we have found that item fit is the most critical concern.

Item fit
The philosophy of the Rasch model is different from conventional statistical modelling. It is not the task of the Rasch model to account for the data. Rather it is the task of the data to fit the Rasch model. We saw this distinction before in Section 4.5.2. The goal of model-fit assessment is to explore and quantify how well empirical data meet the requirements of the Rasch model. One way to gauge model-fit is to compare the observed probability of passing an item to the fitted item response curve for endorsing the item.
The fitted item response curve for each item i is modeled as:
$$p_{ni} = \frac{\exp(\hat{\beta}_n - \hat{\delta}_i)}{1 + \exp(\hat{\beta}_n - \hat{\delta}_i)}$$
where $\hat{\beta}_n$ is the estimated ability of child n (the child's D-score), and where $\hat{\delta}_i$ is the estimated difficulty of item i. This is equivalent to formula (4.1) with the parameters replaced by estimates. Section 5 described the process of parameter estimation in some detail.

Well-fitting item response curves.
The study of item fit involves comparing the empirical and fitted probabilities at various levels of ability. Figure 6.1 shows the item characteristic curves of two DDI milestones. The orange line represents the empirical probability at different ability levels.
The dashed line represents the estimated item response curve according to the Rasch model. The observed and estimated curves are close together, so both items fit the model very well.

Item response curves showing severe underfit.
There are many cases where things are less bright. Figure 6.2 shows three forms of severe underfit from three artificial items. These items were simulated to have a low fit, added to the DDI, and we estimated their parameters by the methods of Section 5. For the first item, hypgmd001, the probability of passing is almost constant across ability, so retaining this item essentially only adds to the noise. Item hypgmd002 converges to an asymptote around 80 per cent and has a severe dip in the middle. The strong relation to age causes the drop. Item hypgmd003 appears to have the wrong coding. Also, we often see the spike-like behaviour in the middle when two or more different items erroneously share identical names.
Removal of items with a low fit can substantially improve overall model fit.

Figure 6.3 shows two artificial items with two forms of overfitting. The curve of item hypgmd004 is much steeper than the modelled curve. Thus, just this one item is exceptionally well-suited to distinguish children with a D-score below 50D from those with a score above 50D. Note that the item isn't sensitive anywhere else on the scale. In general, having items like these is good news, because they allow us to increase the reliability of the instrument. One should make sure, though, that FAIL and PASS scores are all measured (not imputed) values.

Item response curves showing overfit.
Multiple perfect items could hint at a violation of the local independence assumption (cf. Section 4.5). Developmental milestones sometimes have combinations of responses that are impossible. For example, one cannot walk without being able to stand, so we will not observe the inconsistent combination (stand: FAIL, walk: PASS). This impossibility leads to more consistent responses than would be expected by chance alone. In principle, one could combine two such items into one three-category item, which effectively sets the probability of inconsistent combinations to zero.
Item hypgmd005 is also steep, but has an asymptote around 80 per cent. This tail behaviour causes discrepancies between the empirical and modeled curves around the middle of the probability scale. In general, we may remove such items if a sufficient number of alternatives is available.

Item infit and outfit.
We quantify item fit by item infit and outfit. Both are aggregates of the model residuals. The observed response $x_{ni}$ of person $n$ on item $i$ can be 0 or 1.
The standardized residual $z_{ni}$ is the difference between the observed response $x_{ni}$ and the expected response $P_{ni}$, divided by the expected binomial standard deviation,

$$z_{ni} = \frac{x_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}$$

The interpretation of infit and outfit is as follows:
• If infit and outfit > 1.0, then the item is not fitting well (underfit). The amount of underfit is quantified by infit and outfit, as in Figure 6.2;
• If infit and outfit < 1.0, then the item fits the model better than expected (overfit). Overfitting is quantified by infit and outfit, as in Figure 6.3.
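Both statistics can be computed directly from the residuals. The sketch below assumes the conventional Rasch definitions: infit is the sum of squared residuals divided by the sum of the expected response variances, and outfit is the mean of the squared standardized residuals. The function names and the data are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Expected response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) for one item.

    responses: observed 0/1 scores, one per child
    abilities: estimated abilities in the same order (logit scale)
    """
    squared_residuals, variances, z_squared = [], [], []
    for x, b in zip(responses, abilities):
        p = rasch_probability(b, difficulty)
        w = p * (1.0 - p)                      # expected binomial variance
        squared_residuals.append((x - p) ** 2)
        variances.append(w)
        z_squared.append((x - p) ** 2 / w)     # squared standardized residual
    infit = sum(squared_residuals) / sum(variances)
    outfit = sum(z_squared) / len(z_squared)
    return infit, outfit

# Five hypothetical children and one item with difficulty 0
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = item_fit([0, 0, 1, 1, 1], abilities, 0.0)  # able children pass
reversed_ = item_fit([1, 1, 0, 0, 0], abilities, 0.0)   # able children fail
print(consistent, reversed_)
```

The consistent pattern yields values below 1.0 (overfit), whereas the reversed pattern, which resembles a wrongly coded item, yields values well above 1.0 (underfit).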
Infit is more sensitive to disparities in the middle of the probability scale, whereas outfit is the better measure for discrepancies at probabilities close to 0 or 1. Lack of fit is generally easier to spot at the extremes. The two measures are highly correlated. Achieving a good infit is more important than achieving a good outfit.
Values near 1.0 are desirable. There is no hard-and-fast cut-off value for infit and outfit. In general, we want to remove underfitting items with infit or outfit values higher than, say, 1.3.
Overfitting items (with values lower than 1.0) are not harmful. Preserving these items may help to increase the reliability of the scale. The cut-off chosen also depends on the number of available items. When there are many items to choose from, we could use a stricter criterion, say infit and outfit < 1.0, to select only the absolute best items. Figure 6.4 displays the histogram of infit and outfit values for the 57 milestones from the DDI (see Section 4.1).

Infit and outfit in the DDI.
Most infit values are within the range 0.6-1.1, thus indicating excellent fit. The two milestones with low infit values are ddigmd052 and ddigmd053. These two items screen for paralysis in newborns, so the data contain hardly any fails on these milestones. The outfit statistics also indicate a good fit.

Person fit
Person fit quantifies the extent to which the responses of a given child conform to the Rasch model expectation. The Rasch model expects that a more able child has a higher probability of passing an item than a less developed child. Person fit analysis evaluates the extent to which this is true.

Person infit and outfit.
In parallel to item fit, we can calculate person infit and person outfit. Both statistics evaluate how consistent a person's responses are with the model. Outlying answers that do not fit the expected pattern increase the outfit statistic. The outfit is high, for example, when the child fails easy items but passes difficult ones. The infit is the information-weighted fit and is more sensitive to inlying, on-target, unexpected responses.
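Similar to item fit, person fit is also calculated from the residuals, but aggregated differently. We calculate person infit as

$$\mathrm{Person\ infit} = \frac{\sum_{i} (x_{ni} - P_{ni})^2}{\sum_{i} P_{ni}(1 - P_{ni})}$$

where the sums run over the items administered to child $n$.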
A threshold for person fit > 3.0 is customary to clean out children with implausible response patterns.
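The cleaning rule can be illustrated with a short sketch. It assumes the conventional definitions (residuals aggregated over the items taken by one child) and applies the 3.0 threshold to person outfit; the function names, item difficulties and response patterns are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def person_fit(responses, difficulties, ability):
    """Return (infit, outfit) for one child across the items administered."""
    squared_residuals, variances, z_squared = [], [], []
    for x, d in zip(responses, difficulties):
        p = rasch_probability(ability, d)
        w = p * (1.0 - p)
        squared_residuals.append((x - p) ** 2)
        variances.append(w)
        z_squared.append((x - p) ** 2 / w)
    return (sum(squared_residuals) / sum(variances),
            sum(z_squared) / len(z_squared))

difficulties = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]  # easy to hard, hypothetical
ability = 0.0                                     # hypothetical child

plausible = person_fit([1, 1, 1, 0, 0, 0], difficulties, ability)
implausible = person_fit([0, 0, 0, 1, 1, 1], difficulties, ability)

for label, (infit, outfit) in (("plausible", plausible),
                               ("implausible", implausible)):
    print(label, round(infit, 2), round(outfit, 2), "flag" if outfit > 3.0 else "keep")
```

A child who fails all easy items but passes all difficult ones exceeds the threshold by a wide margin, while the expected pattern stays far below it.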

Relevance of DIF for cross-cultural equivalence.
An essential assumption in the Rasch model is that a given item has the same difficulty in different subgroups of respondents. Climbing stairs is an example where this assumption is suspect. The exposure to stairs, and hence the opportunity for a child to practice, varies across different cultures. It could thus be that two children with the same ability but from different cultures have different success probabilities for climbing stairs. When these probabilities systematically vary between subgroups, we say there is Differential Item Functioning, or DIF (Holland & Wainer, 1983). DIF is undesirable since it can make the instrument culturally biased.

How to detect DIF?
Zumbo (1999) provided a clear definition of DIF:
DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure.
There are various ways to detect DIF. Here we will model the probability of endorsing an item by logistic regression using the observed item responses as the outcome. Predictors include the ability, the grouping variable, and the ability-grouping interaction. If the latter two terms explain the residual variance of the item scores after adjusting for ability, the item shows DIF for that group. DIF can be visually inspected by plotting the curves for the subgroups separately.
There are two forms of DIF:
• Uniform DIF: The item response curves differ between groups in location, but are parallel;
• Non-uniform DIF: The item response curves differ between groups in location, in slope and possibly in other characteristics.
These forms correspond to, respectively, the main effect of group and the ability-group interaction in the logistic regression model. Figure 6.6 shows an example comparing boys and girls. For both milestones, the item response curves are similar for boys and girls, so we see no evidence of DIF here. Figure 6.7 displays two milestones with DIF between boys and girls. Provided that the ability estimate (as estimated from all items in the model) is fair for both boys and girls, we see that milestone ddifmm019 ("Takes off shoes and socks") is easier for girls by about 0.86 logits (= the difference in ability at the intersection of 50 per cent). Conversely, milestone ddigmm064 ("Crawls forward, abdomen on the floor") is easier for boys by about 0.84 logits. These are the most substantial differences found for sex in the DDI. Both are uniform DIF.
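The logistic regression approach to DIF can be sketched without external libraries. The example below is an illustration under stated assumptions, not the analysis used for the DDI: it fits the model by Newton-Raphson, uses simulated data with a uniform DIF of 0.86 logits and no non-uniform DIF, and omits the significance tests that a real analysis would add. All function names and numbers other than 0.86 are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(rows, y, iterations=15):
    """Maximum-likelihood logistic regression by Newton-Raphson."""
    k = len(rows[0])
    coef = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(c * v for c, v in zip(coef, x))))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        coef = [c + s for c, s in zip(coef, solve(hess, grad))]
    return coef

def dif_design(ability, group):
    """Predictors: intercept, ability, group, ability-by-group interaction."""
    return [1.0, ability, float(group), ability * group]

# Simulated item with uniform DIF only
random.seed(1)
rows, y = [], []
for _ in range(4000):
    ability = random.gauss(0.0, 1.5)
    group = random.randint(0, 1)
    logit = ability - 0.5 + 0.86 * group
    rows.append(dif_design(ability, group))
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-logit)) else 0)

intercept, slope, uniform_dif, nonuniform_dif = fit_logistic(rows, y)
print(round(uniform_dif, 2), round(nonuniform_dif, 2))
```

The group coefficient estimates the horizontal shift between the two curves (uniform DIF), and the interaction coefficient estimates the difference in slope (non-uniform DIF). In this simulation the former should land near 0.86 and the latter near zero, up to sampling error.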

Examples of DIF.
In practice, milestones with opposite directions of DIF in the same instrument will cancel one another out, so one need not be overly concerned in that case. However, we should be careful when the tool consists of milestones that all have DIF in the same direction. The DDI did not contain items for which the ability-group interaction was statistically significant, so we conclude that there is no non-uniform DIF in the DDI.

Item information at a given ability.
Items are generally sensitive to only a part of the ability scale. Item information is a psychometric measure that quantifies how illuminating the item is at different levels of ability. We may visualize item information as a curve per item.
The formula to obtain the item information is the first derivative of the item response curve and can be written as follows:

$$I_i(\delta) = P_i(\delta)\,\bigl(1 - P_i(\delta)\bigr)$$

where $P_i(\delta)$ is the conditional probability of endorsing item $i$, and where $\hat\delta_i$ is the estimated item difficulty in the logit scale. For example, for milestone ddicmm039 ("Says three words"), $\hat\delta_i$ equals 4.06. Figure 6.8 displays the item information curves for two milestones from the DDI. Note that the amount of information for the item is maximal around the item difficulty.
The probability of endorsing milestone ddicmm039 for a child with an ability of 2 logits is $\exp(2 - 4.06) / (1 + \exp(2 - 4.06)) \approx 0.113$, so the item information at that ability is $0.113 \times (1 - 0.113) \approx 0.100$.
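The calculation for ddicmm039 is easy to reproduce. The sketch below assumes the Rasch form of the item response curve and takes the information as the probability of a PASS times the probability of a FAIL; only the difficulty of 4.06 and the ability of 2 logits come from the text, and the function names are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of endorsing the item (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Item information at a given ability: P * (1 - P)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)

delta = 4.06  # estimated difficulty of ddicmm039, from the text

p_at_2 = rasch_probability(2.0, delta)
info_at_2 = item_information(2.0, delta)
info_max = item_information(delta, delta)  # ability equal to difficulty

print(round(p_at_2, 3), round(info_at_2, 3), info_max)
```

The information peaks at 0.25 when ability equals difficulty and falls off symmetrically on either side, which is the shape shown in Figure 6.8.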

Item information at a given age.
In practice, it is often interesting to express the item information against age. By doing so, one can identify at what ages an item provides the most information. Figure 6.9 shows that the sensitive age ranges differ considerably between items. Suppose we use 0.05 as a criterion. Then ddigmm060 is sensitive between ages 4-8 months, a period of four months. Item ddicmm039 is sensitive in the period 10-19 months, a range that is about twice as broad.
The symmetric nature of the curves in Figure 6.8 is not present in Figure 6.9. In general, the relation between age and item sensitivity is more complicated than the relationship between ability and item sensitivity.
The item information by age curve helps to determine at what ages we should administer the item. The item will be most informative if delivered at the age at which 50% of the children will pass the milestone. This age corresponds to an item information of 0.5 × 0.5 = 0.25. Administering the item closely around that age provides the most efficient measurement of ability. When space is at a premium (e.g. as in population surveys), using a well-chosen set of age-sensitive milestones will help in reducing the total number of milestones.
In other contexts, milestones may be used as a screening instrument to identify developmental delay. In that case, it is more efficient to administer items that are very easy for the age, e.g. milestones on which, say, 90% of the children will pass.
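A short worked equation may help to see why easy items suit screening. It is a sketch under the Rasch model, not a result from the chapter. An item that children of a given ability $\beta$ pass with probability 0.9 has a difficulty $\delta$ about 2.2 logits below that ability:

$$P = 0.9 \;\Rightarrow\; \beta - \delta = \ln\frac{0.9}{0.1} = \ln 9 \approx 2.20$$

For the typical child the item information is only $0.9 \times 0.1 = 0.09$, against the maximum of $0.25$. The item reaches that maximum for a child whose ability is about 2.2 logits lower than typical for the age, which is the region where a screening instrument needs its precision.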

Reliability
The reliability is a one-number summary of the accuracy of an instrument. Statisticians define reliability as the proportion of variance attributable to the variation between children's abilities relative to the total variance. More specifically, the reliability $R$ of a test is written as

$$R = \frac{\sigma_\beta^2}{\sigma_\beta^2 + \sigma_e^2}$$

where $\sigma_\beta^2$ is the variance of the abilities and $\sigma_e^2$ is the error variance.
In general, high reliability is desirable. We often use reliability to decide between instruments. Cronbach's α is a widely used estimate of the lower bound of the reliability of a test. In the Rasch model, we can estimate reliability by the ratio

$$\hat R = \frac{\hat\sigma_\beta^2 - \hat\sigma_e^2}{\hat\sigma_\beta^2}$$

where $\hat\sigma_\beta^2$ is the variance of the estimated abilities and $\hat\sigma_e^2$ is the estimated error variance. Estimation of $\hat\sigma_e^2$ is more complicated. We use the modelled person abilities and item difficulties to generate a hypothetical data set of the same size and same missing data pattern, and re-estimate the person ability from the simulated data. Then $\hat\sigma_e^2$ is computable as the variance of the difference between the modelled and re-estimated person ability.
The estimated variance of the modeled abilities is $\hat\sigma_\beta^2$ = 76.6, and the variance of the difference between modeled and re-estimated abilities is equal to $\hat\sigma_e^2$ = 1.74. The corresponding standard error of measurement (sem) is $\hat\sigma_e$ = 1.32 logits.
The estimated reliability in the SMOCC data is equal to (76.6 -1.74)/76.6 = 0.977. We may interpret this estimate in the same way as Cronbach's α, for which typically any value beyond 0.9 is classified as excellent. Note that the reliability is very high because of the large variation in D-scores. Newborns are very different from 2-year old toddlers, which helps to increase reliability. In practice, it is perhaps more useful to use a measure of accuracy that is less dependent on the variation within the sample. The sem, as explained above, seems to be a more relevant measure of precision.
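The arithmetic behind these numbers can be checked with a few lines. The sketch below uses the two variances reported in the text; the helper that computes the error variance from modelled and re-estimated abilities is hypothetical and assumes the sample variance, which may differ from the estimator the authors used.

```python
import math
import statistics

def error_variance(modelled, reestimated):
    """Variance of the difference between modelled and re-estimated abilities."""
    return statistics.variance([m - r for m, r in zip(modelled, reestimated)])

def rasch_reliability(var_ability, var_error):
    """Share of the ability variance that is not measurement error."""
    return (var_ability - var_error) / var_ability

var_ability = 76.6  # variance of the modelled abilities, from the text
var_error = 1.74    # variance of modelled minus re-estimated, from the text

reliability = rasch_reliability(var_ability, var_error)
sem = math.sqrt(var_error)
print(round(reliability, 3), round(sem, 2))
```

Because the sem does not involve the ability variance, it stays the same when the sample is restricted to a narrow age range, whereas the reliability would drop.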

Validity
Validity is a generic term that refers to the question of how well an instrument measures what it claims to measure. There are various aspects of validity. This section briefly reviews the main types of validity:
• Internal validity (7.1)
• External validity (7.2)

Content validity.
Content validity is the extent to which the D-score represents all facets of development. In contrast to "face validity," which assesses whether the test appears valid to respondents, content validity is about what is measured.
One important form of content validity is that we wish to make sure that the measurement scale represents the various developmental domains in a fair way. In the simplest case, we can assign each milestone uniquely to one domain and evaluate coverage by splitting the cumulative item information.

Construct validity.
Construct validity is the extent to which the D-score behaves like the theory says the construct should behave. For example, we expect that child development advances with age. Figure 4.3 provides convincing evidence that the D-score increases fastest in the first six months and keeps rising at a slower rate as children age. This phenomenon is consistent with theories in growth and child development.
In Section 4, we assumed that child development is a latent variable. Figure 7.2 provides one way to evaluate the validity of this assumption. The figure plots the item fit for each milestone coloured by domain. Items from different domains fit equally well, so there is no evidence that the D-score favours a particular area. Put in more technical terms, the DDI domains do not explain differences in the item fit residuals of the model.

Discriminatory validity.
Discriminatory validity indicates the extent to which the D-score can distinguish children with non-normal development from children that are developing normally. We may evaluate this by identifying children with lagging development, for example, indicated by reflex or tonus problems, and study whether the D-score can discriminate those children from the general population. Section 9.3 presents some examples.

Convergent and divergent validity.
Convergent validity is the extent to which the D-score relates to similar constructs. We measure it by the correlation between the D-score and the total score on Bayley-III or Denver.
The correlation with the other construct should be 0.6 or higher for good convergent validity. Unfortunately, at present, only limited data are available for the DDI, so we cannot assess convergent validity for the D-score at this point.
Divergent validity is the extent to which the D-score is uncorrelated with measures of a different construct. Figure 7.3 shows both convergent and divergent validity at work. The figure shows that, as expected, there is a strong and almost linear relation between body height and the D-score. However, after correction for the child's age, the relationship between height and D-score almost disappears. Thus, growth and development are entirely different concepts.
We can also evaluate the strength of the relations between the D-score and proxy measures of child development, such as stunted height growth (see section 1.3). The low correlation between DAZ and HAZ suggests that stunting is a poor proxy for child development.

Predictive validity.
Predictive validity refers to the degree to which the D-score predicts the score on a criterion that is measured later. For the D-score, we may use measures of IQ at school age as a possible criterion.
Vlasblom et al. (2019) found strong evidence that individual milestones of the DDI measured during the first years of life predict later intellectual functioning at ages 5-10 years. It is to be expected that the D-score, which builds upon these individual items, will also predict limited intellectual functioning, perhaps even better.

Precision
This section shows the properties of the D-score when calculated from short tests. The study of quick tests is useful because it reveals the behaviour of the D-score when the measurement is inherently imprecise.
This section covers:
• Structure of milestone subsets (8.1)
• Impact of short tests on D-score (8.2)
• Impact of short tests on predicting IQ (8.3)

SMOCC design: Standard and additional milestones
At each visit, the SMOCC study collected scores on a set of standard milestones (that about 90 per cent of the children will pass) and a set of additional milestones (that about 50 per cent of the children will pass). See Section 4.1.2.
The SMOCC dataset covers nine different waves. The set of milestones used in the DDI varies per visit. The number of standard milestones varies between 2 and 7 on various occasions. The additional milestones equal the standard ones from the next wave. For example, at the 2-month visit the two standard milestones are ddicmm030 and ddifmd002, and the five additional ones are ddicmm031 - ddigmd057. And so on.
8.2 D-score from short tests

Milestone sets.
In the analyses done thus far, we have calculated D-scores from responses on the combined (standard plus additional) milestones. Thus, at the 2-month visit, the D-score was calculated from 2 (standard) + 5 (additional) = 7 milestones.
In daily practice, the set of additional milestones is often lacking. This section explores the impact of using the (smaller) subset of standard milestones on measurement error and prediction.
This section reports and compares three D-scores:
1. D-score from standard milestones;
2. D-score from additional milestones;
3. D-score from all available milestones.
Estimation of 1 is more complicated than for 2 and 3, for the following reasons:
• There are fewer milestones, so the estimate is less precise and more influenced by choice of the prior distribution;
• The standard set contains only easy milestones, which are uninformative for the majority of children.
Figure 8.2 shows the D-score, calculated separately from the standard, additional and all milestones, for children aged two months. The colour of the dots represents the number of FAIL ratings within each set of milestones.

Milestone sets at month 2.
At month two there are just two standard milestones: ddicmm030 and ddifmd002. About 90 per cent of the infants will pass these. The green dots in the left-hand side figure represent the estimated D-scores corresponding to two passes. As explained in Section 5.3.2, we calculate the D-score with an age-dependent prior. If the ages vary (and they do), then the D-score for infants having the same total score will also vary.
If a child fails either ddicmm030 or ddifmd002, then the D-score is substantially lower. The left-hand figure shows a gap between the green dots (perfect score) and the yellow dots (one FAIL). The impact of a FAIL on the D-score is substantial. For example, the D-score of an infant with one FAIL on a standard milestone drops from about 20D to 14D. Thus, with these two milestones, there cannot be a D-score in the range 15D-18D. Whether this is acceptable depends on the purpose of the measurement. We can prevent gaps by measuring more milestones, e.g., milestones taken from the additional set. Another gap occurs between 14D and 11D. These gaps illustrate that precision is constrained if we administer only two milestones.
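The origin of these gaps can be made concrete with a small sketch. It assumes a posterior-mean ability estimate under a normal prior, computed on the logit scale by simple grid quadrature. The estimator described in Section 5 may differ in detail, and the difficulties and prior below are hypothetical rather than taken from the DDI.

```python
import math
from itertools import product

def rasch_probability(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def eap(responses, difficulties, prior_mean=0.0, prior_sd=1.0, n_points=601):
    """Posterior mean ability given 0/1 responses and a normal prior."""
    lo = prior_mean - 6.0 * prior_sd
    step = 12.0 * prior_sd / (n_points - 1)
    numerator = denominator = 0.0
    for k in range(n_points):
        theta = lo + k * step
        weight = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        for x, d in zip(responses, difficulties):
            p = rasch_probability(theta, d)
            weight *= p if x == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

difficulties = [-2.0, -2.0]  # two easy hypothetical milestones
estimates = {pattern: eap(pattern, difficulties)
             for pattern in product((1, 0), repeat=2)}
for pattern, value in estimates.items():
    print(pattern, round(value, 2))
```

With a fixed prior, two items of equal difficulty can produce only three distinct estimates, one for each number of passes, so the values in between are unreachable. In the data the prior depends on age, which spreads each cluster out vertically but does not close the gaps.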
The middle figure shows the estimated D-score at the same visit but now calculated from the five additional milestones (i.e., the standard milestones from month 3). Infants aged two months have approximately a 50 per cent chance of passing each.
Note that administration of the additional milestones will cover the range 14D-20D quite well. Note the ceiling is also higher with these milestones.
Note that the range of the estimated D-scores is quite similar in both plots. This similarity is a result of accounting for the difficulty level of milestones. The estimate of the D-score is unbiased for difficulty.
The figure on the right-hand side provides the D-score calculated from all milestones. We can easily recognise the points coming from the standard and additional sets. Also, there is a limited number of ratings on easier items that belong to month 1. We rescored these because the child failed these milestones at the previous visit. Rescoring effectively extends the range of possible D-scores to the lower end, so now we can find some children who have a D-score lower than 10D.
Figure 8.3 is the same plot as before, but now for month 3. Compared to Figure 8.2, all points shifted upwards because the children are now one month older.

Milestone sets at month 3.
The additional milestones from month 2 are the standard milestones of month 3. In Figure 8.2, there were at least 11 children (in purple) who failed all five additional milestones. One month later, one child has five fails. Figure 8.4 plots the D-score distribution for all occasions. Some observations:

Floor and ceiling effects.
• Ceiling effect: The ceiling effect (green) is most prominent in the standard set, but is also present in the other two sets. None of the three sets can filter out children with really advanced development. To achieve more precision at the upper end, we would need to include more difficult milestones.
• Floor effect: There are almost no floor effects in the standard and all sets. These sets discriminate well among children with delayed development, which was the designed purpose of the DDI. Note that floor effects are visible in the additional set.
• Average level: All three sets capture the overall relation between age and development. The additional set is quite efficient for measuring average levels of development but lacks detail on the extremes.
Figure 8.4 shows that a short test (5-6 milestones) can precisely measure the lower tail of the D-score distribution (standard set) or the middle of the D-score distribution (additional set), but cannot do both at the same time.
8.3 Impact of short tests on predicting IQ

Measurement and prediction.
In Section 8.2, we saw that a short test can measure the middle or one tail of the distribution, but cannot be precise for both at the same time.   If we want to identify children at risk for delayed development, we are interested in the lower tail of the distribution, so in that case, the standard set is suitable. But what set should we use if we want to predict a later outcome?
This section explores the effect of taking different milestone sets on the quality of prediction. Hafkamp-de Groen et al. (2009) studied the effect of the D-score on later intelligence, using a subset of 557 SMOCC children that were followed up at the age of five years.

UKKI.
The Utrechtse Korte Kleuter Intelligentietest (UKKI) (Baarda, 1978) is a short test to measure intelligence. The UKKI is a simple test with just three components:
• Redraw five figures (square, triangle, cross, trapezoid, rhomboid);
• Draw a human figure, with 28 characteristics, like legs, eyes, and so on;
• Give meaning to 13 words like knife, banana, umbrella, and so on.
Administration time is about 15-20 minutes. The UKKI has a reasonable test-retest reliability for group use (Pearson r = 0.74, 3-month interval). Figure 8.5 shows the empirical IQ distribution of 557 children. The mean IQ score is 108, and the standard deviation is 15, so the IQ scores of children in the sample are about half a standard deviation above the 1978 reference sample.
Figure 8.6 shows that the relation between the D-score 0-2 years and IQ at five years is positive for all milestone sets and all ages. The strength of the association increases with age. At the age of 2 years, the regression coefficient for the D-score is equal to β(D) = 1.4 (SE: 0.21, p < 0.0001), so on average an increase of 1.0 unit in the D-score at the age of 2 years corresponds to an increase of 1.4 IQ points at the age of five years.
Table 8.2 summarizes the Pearson correlations between the D-score and later IQ. The association between D-score and IQ is weak during the first year of life but gets stronger during the second year. In general, having more (and more informative) milestones helps to increase the correlation, but the effects are relatively small. So even from the standard set of seven easy milestones at 24 months, we obtain a reasonable correlation of 0.245.

Exploratory analysis.
All in all, these results suggest that neither the amount nor the difficulty level of the milestones is critical in determining the strength of the relation between the D-score and IQ.

Three studies
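This section compares child development between samples from three different studies:
• SMOCC, a representative sample of Dutch children (9.1)
• POPS, a cohort of Dutch children born very preterm or with very low birth weight (9.2)
• TOGO, a sample of children living near Kpalimé, Togo (9.3)
Each study used the same measurement instrument, the DDI (see Section 4.1). The section compares D-scores between studies.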
9.1 SMOCC study
Figure 9.1 shows the D-score distribution by age in the SMOCC data. The grey curves represent references calculated from the SMOCC data. The top figure illustrates the rise of the D-score with age, whereas the bottom chart shows that the DAZ distribution covers the references well.
The ceiling effect causes low coverage after the age of 24 months. There are also less prominent ceiling effects for younger children. Without these effects, the references would presumably show some additional variation.
9.2 POPS study
Figure 9.2 presents the D-score and DAZ distributions for the POPS cohort of children born very preterm or with very low birth weight. The distributions of the D-score and DAZ are similar to those found in the SMOCC study.
Since the D-scores are calculated using the same milestones and difficulty estimates as used in the SMOCC data, the D-scores are comparable across the two studies. When the milestones differ between studies (e.g. when studies use different measurement instruments), it is still possible to calculate D-scores. This problem is a little more complicated, so we treat it in Chapter II (van Buuren & Eekhout, 2021).
The primary new complication here is the question whether it is fair to compare postnatal age of children born at term with postnatal ages of very preterm children. This section focuses on this issue in some detail.

POPS design.
In 1983, the Project On Preterm and Small for Gestational Age Infants (POPS study) collected data on all 1338 infants in the Netherlands who had very preterm birth (gestational age < 32 weeks) or very low birth weight (birth weight < 1500 grams). See Verloove-Vanhorick et al. (1986) for details.
The POPS study determined gestational age from the best obstetric estimate, including the last menstrual period, results of pregnancy testing, and ultrasonography findings. The POPS study collected measurements on 450 children using the DDI at four visits at corrected postnatal ages of 3, 6, 12 and 24 months.

Age-adjustment.
Assessment of very preterm children at the same chronological age as term children may cause over-diagnosis of developmental delay in very preterm children. Very preterm children may require additional time that allows for development equivalent to that of children born at term.
In anthropometry, it is common to correct the chronological age of children born very preterm to enable age-appropriate evaluation of growth. For example, suppose the child is born at a gestational age of 30 weeks, which is ten weeks early. A full correction would deduct ten weeks from the child's postnatal age, and a half correction would deduct five weeks. In particular, we calculate the corrected age (in days) as:

$$\text{corrected age} = \text{postnatal age} - f \times (280 - \text{gestational age})$$

where all ages are in days, where 280 is the average gestational age in days, and where we specify several alternatives for $f$ as 1.00 (full correction), 0.75, 0.50 (half) or 0.00 (no correction).
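The correction is simple to express in code. The sketch below assumes the form described in the text, namely that a fraction f of the missed pregnancy duration (280 days minus the gestational age) is deducted from the postnatal age. The function name and the example child are hypothetical.

```python
def corrected_age(postnatal_age_days, gestational_age_days, f=1.0, term_days=280):
    """Postnatal age corrected for preterm birth, in days.

    f = 1.00 gives full correction, f = 0.00 gives no correction.
    """
    missed_days = term_days - gestational_age_days
    return postnatal_age_days - f * missed_days

# Hypothetical child born at 30 weeks (210 days), seen 100 days after birth
gestational_age = 30 * 7
for f in (1.00, 0.75, 0.50, 0.00):
    print(f, corrected_age(100, gestational_age, f))
```

For this child, full correction deducts the ten missed weeks (70 days) and half correction deducts five weeks (35 days), which matches the example in the text.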
Let's apply the same idea to child development. Using corrected age instead of postnatal age has two consequences: • It will affect the prior distribution for calculating the D-score; • It will affect DAZ calculation.
We evaluate these two effects in turn. Figure 9.3 plots the fully age-adjusted D-score against the unadjusted D-score. Any discrepancies result only from differences in the ages used in the age-dependent prior (c.f. Section 5.3.2).

Effect of age-adjustment on the D-score.
All points are on or below the diagonal. Age-adjustment lowers the D-score because a preterm is "made younger" by subtracting the missed pregnancy duration, and hence the prior distribution starts at a lower point. For example, the group of red marks with D-scores between 30D and 40D (age not corrected) will have D-scores between 20D and 30D when fully corrected. Note that only the red points (with perfect scores) are affected, thus illustrating that the prior has its most significant effect on the perfect response pattern. See also Section 5.3.1.
The impact of age-correction on the D-score is negligible when the child fails on one or more milestones.
Figure 9.4 illustrates that a considerable number of D-scores fall below the −2 SD line of the reference when age is not adjusted, especially during the first year of life. The pattern suggests that the apparent slowness in development is primarily the result of being born early, and does not necessarily reflect delayed development.

Effect of no age adjustment (f = 0.00) on the DAZ.
Full age correction has a notable effect on the DAZ. Figure 9.5 illustrates that the POPS children are now somewhat advanced over the reference children. We ascribe this seemingly odd finding to more prolonged exposure to sound and vision in air. Thus after age correction, development in preterms during early infancy is advanced compared to just-born babies.
Full age correction seems to overcorrect the D-score, so it is natural to try intermediate values for f between 0 and 1. The value of 0.73 is implausibly high, especially because this value is close to birth. Setting f = 0.75 seems a good compromise, in the sense that the average DAZ is close to zero in the first age interval. The average DAZ is negative at later ages. We do not know whether this genuinely reflects less than optimal development of very preterm and low birth weight children, so both f = 1.00 and f = 0.75 are suitable candidates.

Conclusions.
• Compared with the general population, more very preterm children reached developmental milestones within chronological age five months when chronological age was fully corrected;
• Fewer preterm children reached the milestones when chronological age was not corrected;
• Fewer children reached the milestones when we used a correction of f = 0.50;
• Similar proportions were observed when we used f = 0.75 within the first five months after birth;
• After chronological age five months, we observed similar proportions for very preterm and full-term children when chronological age was fully corrected;
• We recommend using full age correction (f = 1.00). This advice corresponds to current practice for growth and development. As we have shown, preterms may look better in the first few months under full age-correction. If the focus of the scientific study is on the first few months, we recommend an age correction of f = 0.75.

9.3 TOGO study
Figure 9.6 presents the D-score and DAZ distributions of a sample of children living near Kpalimé, Togo. While the primary trend with age conforms to the previous data, the distributions differ from those in Figure 9.1 and Figure 9.2 in two respects:
• Compression at the upper end: Most of the D-scores are above the median curve, which suggests that, at these ages, children living in Togo develop faster than children living in the Netherlands;
• Expansion at the lower end: There is a considerable variation in D-scores on the lower end, with many D-scores below the −2 SD curve, suggesting that some children are significantly more delayed than would be expected in both Dutch samples.
The D-scores are calculated using the same 57 milestones and difficulty estimates as before. The resulting D-score distribution is quite unusual. The main question here is what could explain the pattern found in the D-scores. This section explores this question in some detail.

Togo Kpalimé study, design.
If the D-score is to be a universal measure, then it should be informative in low and middle-income countries (LMIC) as well. We do not yet know much about the usability and validity of the D-score in LMICs. The western African country of Togo qualifies as a low-income country, with a 2017 GNI per capita of USD 610, compared to USD 46,180 in the Netherlands, and USD 744 for low-income countries in general (data.worldbank.org).
The data were collected by Cécile Schat-Savy, who initiated a youth health care centre modelled after the Dutch youth health care system in Kpalimé, Togo. See https://www.kinderhulp-togo.nl for more background. Data monitoring included a French translation of the DDI for measuring child development. The investigators gathered data from 9747 individuals in the 0-18 age range.
Participants include children and their parents who visited the Kpalimé health centre at least one time. Kpalimé is the fourth largest town in Togo, but the health centre also attracted parents and children from a wide surrounding rural area. Parents visited the health centre for several reasons, including  for a preventive health check or because of their child's apparent health problems.
The health centre targeted parents through information sessions for parents at primary schools. Parents paid a small amount of money per child (about USD 4.00 for children of 4 years or older, and USD 0.80 for children younger than four years). Four local data-assistants, some portrayed in Figure 9.7, digitized the data from paper archives. TNO Child Health in The Netherlands monitored the process and checked the data for completeness and consistency.
Here we use a subset of 2674 visits from 1644 unique children who scored on the 57 milestones of the DDI 0-2 years. We did not calculate D-scores when age or DDI milestones were missing, which left a dataset of 2425 visits from 1567 unique children. The number of visits per child varied from 1 to 9. The majority of children visited the centre once. Figure 9.8 is the same scatter plot as in Figure 9.6, but now marked by whether the physician registered signs of neuropathology in the form of tonus and reflex problems.

D-score labelled by neurological problem.
Many children with low D-scores also have tonus or reflex problems. This finding alone suggests that extreme D-scores are not artefacts (e.g. caused by a wrongly coded age), but indicate major adverse health conditions. Figure 9.9 identifies the children who had an Apgar score at 10 minutes after birth that was lower than 8. About half of these children had a D-score below the -2 SD curve.

D-score labelled by severe underweight.
Many children who visited the Kpalimé health centre had a low body weight for their age. Figure 9.10 marks the subset of severely underweight children (WAZ < -4). A substantial proportion of these children also had a very low D-score.

D-score labelled by severe stunting.
Figure 9.11 is similar to Figure 9.10, but now marked by the subset of severely stunted children (HAZ < -4). Here too, a sizable proportion has a low D-score.
Taken together, Figures 9.8 to 9.11 show that children with very low D-scores often experience (multiple) severe health problems. Those health problems may have substantially delayed their development. Figure 9.12 shows substantial differences in gross motor development between children from Togo and the Netherlands. For example, at the age of three months, about 30 per cent of the Dutch infants succeed in controlling their head when pulled to sitting. However, infants from Togo seem already capable of head control when they are just one month old.

Gross motor development.
Moreover, the advantage persists at least up to the age of two years: children in Togo can roll over and sit much earlier, or kick a ball without falling. As the documentary Babies shows, African children even manage to learn to walk with a tin can on their head, a craft that children in the west never achieve. Figure 9.13 shows a less pronounced but similar phenomenon for fine motor skills. These data suggest that children in Togo may have better fine motor skills than the children from the two Dutch cohorts. Figure 9.14 summarizes the data for three milestones on communication and language. In general, the success probability is similar in the three studies.

Communication and language.
One curious finding is the high proportion of passes on milestone ddicmm041 for the Togo children around the age of 18 months. Note that several of the green lines in Figures 9.12 to 9.14 start close to perfect scores, which makes it impossible to show the rising patterns found in the Dutch data.
It may indeed be true that children from Togo develop more rapidly than Dutch children. But we may also wonder: Could there just be reporting bias on the part of the parents who either do not understand the items or have the expectation to say "yes" even if the child can't do it? It would be desirable if these results could be backed up from other sources.

Conclusions
This section compared the D-scores estimated from the DDI administered to three different groups of children.
We found that
• The D-score by age plot showed a positive, curved relationship with age in all three studies;
• Children born very preterm or with very low birth weight had similar development to reference children when their age was corrected for early birth;
• A relatively small subset of children born in Togo had extremely low D-scores, not found in the Netherlands, likely the result of underlying neuropathology, severe underweight or severe stunting;
• On average, children from Togo seemed to have faster development during the first two years, especially in motor development, though there may be issues with reporting bias.
All in all, these findings support the usefulness and validity of the D-score as an informative summary of child development during their first two years of life.

Next steps
This section provides a quick overview of the relevance, concepts and techniques of the D-score. While the results obtained thus far are encouraging, some questions will certainly remain when we put the method into practice.
A rough selection of such questions includes:
• What is the added value of the D-score in practice?
• Does the D-score extend to higher ages?
• Is the assumption of uni-dimensionality reasonable for other ages and populations?
• Can we calculate the D-score from instruments other than the DDI?
• Is it reasonable to assume that milestone difficulty is identical in other populations?
• Does the method apply to caregiver-reported milestones?
• Would a dedicated D-score instrument be more efficient?
• How many milestones are "enough?"
• Can the same scale be used for measurement at individual, group and population levels?
• Can the D-score detect delayed development?
• Would the D-score help to target early interventions?
This section briefly reviews some of these issues.

Usefulness of D-score for monitoring child health
The D-score is a new approach to measure child development. The D-score is a scale for quantifying generic child development by a single number. Milestones are selected to fit the Rasch model. We can interpret the resulting measurements as scores on an interval scale, a requirement for answering questions like:
• What is the difference in development over time for the same child, group or population?
• What is the difference in development between different children, groups or populations of the same age?
• How does child development compare to a norm?
The concept of the D-score is broader than a score calculated from the DDI. Any instrument that fits the model underlying the D-score can be used to measure the child's D-score.
The precision of the measurement depends on the number of milestones and the match between milestone difficulty and person ability. We may thus tailor the measurement instrument to the question at hand.

D-chart, a growth chart for child development
The field of child growth and development roughly divides into two areas:
• The subfield child growth (or auxology) emphasizes body measures like height, weight, body mass index, and so on. It is a rigorous quantitative science with intimate ties to statistics since the days of Quetelet and Galton.
• The subfield child development is more recent and builds upon a wide-ranging set of domain-specific instruments for measuring motor, language, cognitive and behavioural states.
The growth chart is a widely used tool to monitor physical growth. The D-score can be used in a similar way to create the D-chart. Figure 10.1 shows the developmental paths of five randomly chosen children from the SMOCC study. Although the milestones differ across age, there is only one vertical axis. These trajectories will help to track the progress of a child over time.
The D-chart shows that it is straightforward to apply quantitative techniques from child growth to child development. Our hope is that the D-score aids in bridging the disparate subfields of child growth and child development. The lack of a universal measure for child development has long hampered the ability to estimate intervention effects or to compare populations. The D-score can be generalized to other instruments. We expect that the availability of a common yardstick will stimulate informed policy and priority setting. We hope a universal measure improves decision making, ultimately lowering the number of children not reaching their developmental potential.

D-score for international settings
Section 9 compared D-scores between three study samples. We restricted the analysis to studies that used the same instrument to measure child development (the DDI; in Togo, a French translation).
It is difficult to compare levels of child development worldwide. Existing estimates on children not reaching their developmental potential rely on proxies, such as stunting and poverty. While these proxies have been found to correlate with child development, they are only weak indicators of actual child performance. Arguably, the performance of a child on a set of well-chosen milestones is more informative for his or her future health and productivity than body height or parental income.
More than 150 instruments are available that quantify child development. Many of these tools produce not just one but many scores. Such an overwhelming choice may seem a luxury until we realize that we cannot compare their ratings. Of course, we could settle on just one instrument …, but that's never going to happen. While simple in theory, such pre-harmonization is complicated in practice. It requires significant and continued investments by a central agency. It does not address historical data, so it will be challenging to see secular trends. Also, pre-harmonization impedes the adoption of innovative techniques, e.g., using smartphone-assisted evaluations.
The D-score opens up an exciting alternative: agree on the scale, and leave some liberty to the data collector in the exact choice of the instrument. We could build upon the expertise of the data collector about the local population. Also, it will equip us to keep up with innovations in measurement.
The next chapter in our work will address some of the conceptual and technical issues that arise when we attempt to apply the D-score to other populations.

D-score from existing instruments
There is a vast base of historic child developmental data using existing instruments. The problem is that each instrument defines its own summaries, so we cannot compare scores across tools. Different instruments have different domains, various age forms, different stopping rules, diverse age norms, and so on. Yet, the milestones in these instruments are often very similar. Most tools collect data on milestones like:
• Can the child stack two blocks?
• Can the child roll over?
• Can the child draw a cross?
• Can the child stand?
• Can the child say "baba?"
With the D-score methodology in hand, we are ready to exploit the overlap in milestones shared by different instruments. Common items can act as bridges, so, with the appropriate item-level data, we may attempt calculating D-scores from other tools as well.
The task is to identify milestones that overlap between both instruments, filter out milestones that do not fit a joint model, and estimate the item difficulties of items that remain. Chapter II (van Buuren & Eekhout, 2021) will explore this possibility in more detail.
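The first step of this task, finding the milestones that instruments share and checking that those bridges connect all instruments, can be sketched as follows. This is a hedged illustration rather than the procedure used in our work: the instrument names and milestone labels are hypothetical, and it assumes that equivalent milestones have already been mapped to a common label. Filtering for model fit and estimating difficulties are not covered.

```python
from itertools import combinations


def shared_milestones(instruments):
    """Return {(name_a, name_b): shared milestones} for every overlapping pair."""
    bridges = {}
    for a, b in combinations(sorted(instruments), 2):
        common = instruments[a] & instruments[b]
        if common:
            bridges[(a, b)] = common
    return bridges


def is_linked(instruments):
    """True if all instruments are connected through chains of shared milestones."""
    names = sorted(instruments)
    if not names:
        return True
    bridges = shared_milestones(instruments)
    reached = {names[0]}
    frontier = [names[0]]
    while frontier:
        current = frontier.pop()
        for a, b in bridges:
            if current in (a, b):
                other = b if current == a else a
                if other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return reached == set(names)


# Hypothetical instruments, each described by a set of harmonized milestone labels.
instruments = {
    "instrument_a": {"stacks two blocks", "rolls over", "stands"},
    "instrument_b": {"rolls over", "draws a cross", "says baba"},
    "instrument_c": {"draws a cross", "walks alone"},
}

for pair, common in shared_milestones(instruments).items():
    print(pair, sorted(common))
print("all instruments linked:", is_linked(instruments))
```

In this example, instrument_a and instrument_c share no milestone directly, but both connect to instrument_b, so all three could in principle be placed on one scale.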

Creating new instruments for D-score
Extending the D-score to other instruments has the side effect of enlarging the item bank with useful items. As more and more data feed into the item bank, assessment of already present milestones may become more precise.
The enlarged and improved item bank then may act as the fundamental resource for creating instruments for particular settings. For example, if the interest is in finding the most advanced children, we may construct a difficult test that will separate the good from the best. Alternatively, we can use the item bank to create and administer computerized adaptive tests (Jacobusse & van Buuren, 2007; Wainer et al., 2000), a sequential method that selects the next milestone based on the previous test outcome.
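The sequential idea behind adaptive testing can be sketched in a few lines. The example below is a simplified illustration, not the algorithm of the cited references. It assumes a Rasch item bank, in which the most informative milestone is the one whose difficulty lies closest to the current ability estimate. The item names, difficulties, and the step-halving update of the ability estimate are hypothetical stand-ins; a real adaptive test would re-estimate ability from the full response pattern.

```python
def next_milestone(ability, item_bank, administered):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = {k: d for k, d in item_bank.items() if k not in administered}
    if not candidates:
        return None
    # Ties are broken by item name so that the selection is deterministic.
    return min(candidates, key=lambda k: (abs(candidates[k] - ability), k))


def run_adaptive_test(item_bank, respond, start=0.0, step=2.0, max_items=5):
    """Administer items one by one, moving the estimate up after a pass
    and down after a fail, with a step size that halves each time."""
    ability = start
    administered = []
    for _ in range(max_items):
        item = next_milestone(ability, item_bank, administered)
        if item is None:
            break
        passed = respond(item)
        administered.append(item)
        ability += step if passed else -step
        step /= 2.0
    return ability, administered


# Hypothetical item bank: milestone name -> difficulty in logits.
item_bank = {"m1": -3.0, "m2": -2.0, "m3": -1.0, "m4": 0.0,
             "m5": 1.0, "m6": 2.0, "m7": 3.0}

# Simulated child who passes every milestone at or below a true ability of 1.4.
true_ability = 1.4
ability, order = run_adaptive_test(
    item_bank, lambda item: item_bank[item] <= true_ability
)
print("administered:", order)
print("final estimate:", ability)
```

The design choice to target items near the current estimate follows from the Rasch model, where an item is most informative when the pass probability is one half.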
Our ongoing work will explore the conceptual and technical challenges, and propose an integrated approach to support instrument construction and validation.

A - Notation
The notation in this chapter follows Wright & Masters (1982).

Section  Symbol  Term         Description
4.6      β_n     Ability      True (but unknown) developmental score of child n
4.6      δ_i     Difficulty   True (but unknown) difficulty of item i
4.6      π_ni    Probability  True (but unknown) probability that child n passes item i
6.1      β̂_n     Ability      Estimated developmental score (D-score) of child n

Underlying data
The raw data needed to replicate these analyses are not public, so we cannot share them with this publication. However, the reader can apply for access to the data through the study contact. The table given below contains the contact information for each cohort included in this publication.
but users are free to redistribute, alter and combine the data, on the condition of giving appropriate credit with any redistributions of the material. The URL of the public data is https://d-score.org/childdevdata/.

Engelhard" could be classified as "good introductory books" (I would disagree with such a statement), these transport many of the myths of the RM compared to more complex IRT models. The myths surround the concepts of "sample-free measurement", "specific objectivity" and "invariant measurement" (Engelhard). After appropriate identification constraints of a latent variable, there is no difference concerning these concepts for the RM and the 2PL model. The 2PL model provides sample-free measurements just as the RM does, because the invariance property can hold for any IRT model. The invariance property is based on the local independence assumption and the absence of differential item functioning (DIF) with respect to age. Moreover, specific objectivity is usually equated with the concept of parameter separability for persons and items, which can also be achieved for the 2PL model (Irtel, Ballou). I do not ask the authors to adapt the D-score to the 2PL or other alternative IRT models, but to remove their flawed statements about the "unique properties" of the RM.
The authors should be more careful with the statement that cross-cultural DIF implies biased measurement. There is a rich literature arguing that the concepts of DIF and bias (or fairness) should not be equated (Camilli), and a DIF item is important for assessing differences between countries (or cultures). Hence, just removing DIF items is a misguided recommendation.

5. In the same vein, removing items from the scale that do not fit the RM might threaten validity. This should be emphasized in the section "validity", which is silent about this issue. I would rather see the RM as a device for linking items using equal weights for items. Typically, the RM is misspecified; but why bother about it?
6. The authors propose the pairwise estimation approach (PAIR) of Zwinderman. If I read the chapter correctly, items are (partly) administered to children depending on their (expected) level of ability. This means that items are missing at random (MAR). PAIR estimation does not work for (general) MAR data but provides consistent estimates under MCAR. It might be that a particular missingness design does not lead to biased estimates in the PAIR method, but the authors should elaborate on why they prefer this estimation approach over likelihood-based estimation approaches.

7. See above. I do not think that the information function is a valuable concept for assessing the uncertainty of the D-score. First, local dependence seems to be ignored. Second, it presupposes a correctly specified measurement model, which is certainly violated. Resampling item (groups) is a much better alternative for quantifying standard errors.
8.