1.10. Statistical Measures, Concepts, and Methods Common in Micro-simulation

1.10.1. Starting a Simulation and Keeping it Going: Population, States, Events

Dynamic micro-simulation creates a realistic depiction of a population and its changes over time by simulating a large sample of individuals, i.e., actors, who have the same distribution of individual characteristics as the real population. To start a simulation, a starting population file of micro-data records representing persons by a set of characteristics, i.e., states, and optional links to other actors, such as family or household members, is usually read in. During the simulation, as actors age, some of their characteristics and links change, and eventually they die. New actors are created and added at birth or after immigration, while others may leave the population due to emigration. Classic population projection models concentrate on three demographic processes: fertility, migration, and mortality. The model developed in this report extends this list of variables and processes with, for example, first marriage, transmission of ethnicity, school attendance, and educational attainment.

Processes correspond to events that happen at a single moment in time and change individual states:

  • Birth events increase the parity of the mother. At this event, the date is also recorded for the modeling of subsequent births, and a new actor (the baby) is created, inheriting some characteristics of the mother, like the province of residence. A permanent link between child and mother is created, allowing the child to access its mother’s characteristics. In our model, child mortality depends on the mother’s age and education at birth, and the child’s future educational attainment is influenced by the mother’s education.
  • Migration events change the province of residence of individuals. At immigration, new actors are created and their characteristics have to be initialized.
  • At death, the actor and links to other actors are removed.
  • At first union formation, an indicator is set and the time of the event is recorded. In more detailed models, partners are searched for in the population and couples are linked.
  • We model two primary education events: entering and graduating from primary school.

To realistically model events, we need measures of their occurrence in reality. These can be measures collected for a current period, such as a death register recording deaths by basic characteristics like age and sex. In our case, the most important data source is a population census, which collects both the current characteristics of individuals and households and information on some important recent events, like the number of births given in the past 12 months, the origin and timing of the most recent migration event, and the date of first marriage. A third data source is surveys, which typically are more detailed, but based only on a sample of the population. In our case, valuable survey information is the birth histories of women, used for modeling the timing of births (birth intervals). In addition, we use history data on children’s survival, or, where applicable, the dates of their deaths as reported by mothers, as the basis for child mortality modeling.

1.10.2. Measures of the likelihood of events: probabilities and rates

The two most important measures for the likelihood of events calculated from data are probabilities and rates.

  • Probability describes the likelihood that an event will occur for a single individual in a given time period and ranges from 0 to 1. It is calculated as the number of events that occurred in a time period divided by the number of people followed over that period. Note that such a measure is possible only if the event can happen at most once per individual and time unit. If repeated events are possible, we can still calculate the probability of experiencing at least one event, or a probability distribution over experiencing 0, 1, 2, or more events.
  • A rate is an instantaneous measure. While also based on a count of events, the measure includes time, specifically “time at risk,” in the denominator. As an example, if the probability of 50-year-old people dying within a year of their 50th birthday is 1 percent, this means that one person out of 100 alive on their 50th birthday will not experience their 51st. The 1 percent is calculated by dividing the number of deaths by the original number of people. In contrast, for a rate, we divide the number of events by the time people were at risk: those who died during the year were no longer at risk after the date of death. If from a population of 100 people one person died, then depending on when the death occurred, the calculated rate would lie between 0.01 (1/100 if the death occurred at the end of the period) and 0.0101 (1/99 if the death occurred at the beginning of the period). Rates and probabilities are very close for unlikely events and are therefore often confused. In contrast, if 90 of the 100 people die in an accident, then depending on the time of the accident, the measured rate would lie between 0.9 (90/100) and 9 (90/10). This huge span in measures disappears if we have a big sample, if events are spread over time, and if we can assume a constant risk over the time unit. If risks change over age or time, we typically measure them for single years of age or single calendar years, which are usually intervals small enough for the assumption to approximately hold.

Under the assumption of a constant risk over time span t, rates are mathematically linked to corresponding probabilities of survival, i.e., the probability that the event does not happen to a person over this time t:

S = exp(-r*t)

If an event can happen only once, its probability of occurrence is 1-S. If we know the probability of survival, under the assumption of a constant risk over time t, we can calculate the corresponding rate as:

r = -1/t * ln(S)
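
Under this constant-risk assumption, the conversion between rates and probabilities can be computed directly. A minimal Python sketch (illustrative values, not part of the model code):

    import math

    def prob_from_rate(rate, t=1.0):
        """Probability that a once-only event occurs within time t at a constant rate."""
        return 1.0 - math.exp(-rate * t)

    def rate_from_prob(p, t=1.0):
        """Constant rate implied by probability p of the event occurring within time t."""
        return -math.log(1.0 - p) / t

    print(prob_from_rate(0.01))  # ~0.00995: rate and probability nearly coincide for rare events
    print(rate_from_prob(0.9))   # ~2.30: for likely events, the rate far exceeds the probability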

The likelihood that an event will occur typically depends on specific circumstances. For example, mortality changes with age, but may also be impacted by other factors; when modeling child mortality, we also account for the mother’s characteristics. Accordingly, measures of likelihood have to be available in a simulation for each combination of influencing factors. This can be achieved in two ways. First, if there are only a few combinations and the sample size is large enough, we can calculate measures for each possible combination of circumstances. The second approach is to find a formula that calculates the likelihood of an event from the circumstances. A very popular class of such formulas are proportional models. In the case of rates, proportional hazard models divide risk into a baseline risk (e.g., mortality by age) and relative risks, assuming that specific circumstances (e.g., health status) modify this baseline proportionally over all ages. One of the convenient characteristics of rates is that they can be directly multiplied by relative risk factors. For example, a 100 percent increase in risk can be directly expressed as a multiplication of the initial risk by the factor 2. This does not work for probabilities: when doubling a probability, the resulting number has no interpretation as a proportional probability; in fact, the calculation may even lead to a value above 1. To overcome this problem and allow the estimation of proportional models from which probabilities can be derived, transformations are used. A typical model of this class is logistic regression, which, instead of estimating probabilities directly, estimates the log odds of an event. Odds are a transformation of probabilities typically used in gambling: a 75 percent chance to win can be expressed as a 75:25 chance, i.e., odds of 3.

odds = p / ( 1 - p )

Like rates, odds can be multiplied by proportional factors, which can be estimated in regression models.
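
A short Python sketch of why proportional factors are applied to odds rather than to probabilities (values illustrative):

    def odds(p):
        """Convert a probability to odds."""
        return p / (1.0 - p)

    def prob_from_odds(o):
        """Convert odds back to a probability."""
        return o / (1.0 + o)

    p = 0.75
    print(odds(p))                      # 3.0, i.e., a 75:25 chance
    print(prob_from_odds(2 * odds(p)))  # ~0.857: doubling the odds yields a valid probability
    print(2 * p)                        # 1.5: naively doubling the probability does not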

1.10.3. From Measures to a Simulation

Corresponding to the two basic measures for the likelihood of events—probabilities and rates—we can distinguish two types of dynamic simulation models: discrete time and continuous time.

Models based on probabilities are discrete time models, i.e., models in which states are updated in fixed time steps, such as each year. In the simulation, if an event that happens to an individual is a random process based on a known probability, we draw a random number between 0 and 1 and decide that the event has happened if the random number is smaller than the probability. Probabilities contain no information on when an event happens within a time period; the same is the case for discrete time models, which just model the before and after. Therefore, if more than one state changes in such models, there is no information available as to which event happened first. Also, we do not know the history of a person within the period: a person now living in the capital who lived there a year ago may have spent most of the year somewhere else and moved several times. This is not a big problem for rare events for which multiple occurrences within a single period are very unlikely. Also, sometimes the data do not allow a more accurate description of processes, as a survey may collect information only on the situation now and a year ago. A comparable situation is the use of panel data, i.e., linked cross-sectional data that collect states as snapshots at various points in time.
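
A minimal sketch of such a discrete time transition in Python, assuming a yearly time step and an illustrative probability:

    import random

    def event_happens(probability, rng):
        """One discrete time trial: the event occurs if a uniform draw falls below the probability."""
        return rng.random() < probability

    rng = random.Random(42)  # seeded for a reproducible run
    events = sum(event_happens(0.01, rng) for _ in range(100_000))
    print(events / 100_000)  # close to 0.01 for a large sample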

Models based on rates are continuous time competing risks models in which events can happen at any moment in time. Instead of moving time in single steps of fixed length, such as years, and updating all states at once, time advances with whatever event happens first. At the start of the simulation, the rates for each event are calculated and this information is used to calculate waiting times for all possible events. Assuming a constant risk over time, the waiting time to an event follows an exponential distribution and a random waiting time based on a given rate can be easily calculated using a simple formula.

random waiting time = -ln(RandomNumber) / r

The expected waiting time of an exponentially distributed random variable is just the inverse of its rate:

expected waiting time = 1 / r
median waiting time = -ln(0.5) / r = 0.69 / r
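
A minimal Python sketch of drawing exponentially distributed waiting times from a rate (the rate value is illustrative):

    import math
    import random

    def random_waiting_time(rate, rng):
        """Exponentially distributed waiting time for a constant rate.
        1 - rng.random() lies in (0, 1], so the logarithm is always defined."""
        return -math.log(1.0 - rng.random()) / rate

    rng = random.Random(1)
    rate = 0.2  # e.g., 0.2 events per year
    draws = [random_waiting_time(rate, rng) for _ in range(100_000)]
    print(sum(draws) / len(draws))  # close to the expected waiting time 1/r = 5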

Events compete, meaning that the event scheduled to happen first is the one that moves time forward to the point at which it happens. The occurrence of an event typically impacts other events. For example, for women of reproductive age, fertility risks typically increase on entering a first marriage. At the same time, the woman is no longer at risk of entering a first marriage, but becomes at risk of divorce (if modeled). Accordingly, at the occurrence of each event, a new list of possible events is created and all waiting times affected by the event are updated.
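
A sketch of this competing risks mechanism, with hypothetical events and rates; the earliest scheduled event wins and censors the others:

    import math
    import random

    rng = random.Random(7)

    def waiting_time(rate):
        return -math.log(1.0 - rng.random()) / rate

    # Hypothetical competing events with constant yearly rates:
    rates = {"death": 0.01, "first_marriage": 0.15, "emigration": 0.02}

    # Schedule all events; the first one in the queue moves simulation time.
    queue = {event: waiting_time(rate) for event, rate in rates.items()}
    first = min(queue, key=queue.get)
    print(first, queue[first])
    # After the event, affected rates change and the remaining
    # waiting times are recalculated from the new point in time.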

The assumption of a constant risk typically only holds for specific periods or “pieces” of time, so we talk about piece-wise constant risks. Even if nothing else happens, a waiting time has to be updated if it exceeds the piece of time for which the underlying rate is valid: for example, mortality risks stemming from life tables are typically valid for a single year of age, so the waiting time to death has to be updated at each birthday.

The micro-simulation model developed in this report follows a continuous time approach, and most events can happen at any point in time. Note that the approach does not limit one to using only models based on rates. For example, we model school entry and graduation once at a fixed point in time each calendar year, and the decisions to enter school or to graduate are based on probability parameters.

1.10.4. Estimating and calibrating probabilities and rates

In the simplest case, rates or probabilities used in a micro-simulation model can be obtained directly from published sources. In our case, age-specific mortality and fertility rates in the base versions of the model stem from rates provided by the ONS and are identical to rates used in macro population projections. More typically, micro-simulation requires more detailed measures, accounting for their variation by individual characteristics.

When selecting statistical methods for deriving the rates and probabilities used in the simulation, we mostly make use of proportional models. Such models are very useful in micro-simulation: model builders can combine information from different data sources, and model users can create easy-to-interpret scenarios. Proportional models divide the likelihood of events into a baseline factor that applies to all, and relative factors based on specific individual characteristics. Accordingly, scenarios can be created in which the likelihood changes for all individuals, e.g., a mortality trend lowering overall mortality over time, or only for specific groups. Empirically, relative differences in risks between population groups are often very persistent over time, and it can be a reasonable assumption that these differences will also persist in the future, at least in the absence of specific policy interventions.

For policy analysis, convergence scenarios, in which we assume that a gap between population groups (e.g., by ethnicity) closes over time, are very helpful, as they can be used to study the downstream effects of such a change. Proportional models are also useful for status quo scenarios, in which we assume no change in either the baseline or the relative factors. While such scenarios assume stability in behaviors on the individual level, aggregate outcomes will still change over time if the composition of the population changes. An important strength of micro-simulation models is this ability to distinguish composition change from behavioral change. This modeling approach can be best explained by means of the following five examples.

Example: calibrating published baseline rates to fit aggregate trends

For mortality modeling, we use published mortality rates stemming from a standard life table by age and sex. In the simulation, we use these rates as baseline hazards by age. At the same time, we support scenarios for increases in life expectancy over time. To do so, we introduce a second parameter: life expectancy by calendar year. To reach this target life expectancy, the mortality baseline is proportionally scaled by a factor that, when applied to all age-specific rates, modifies the life table so that the target life expectancy is reached. In our case, this factor is found by a numeric simulation automatically performed in the model; statistically, it is a relative risk that decreases over time as life expectancy increases.
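
Such a scaling factor can be found by a simple numeric search. A sketch under simplifying assumptions (toy rates, person-years approximated by the trapezoid rule; not necessarily the model’s actual algorithm):

    import math

    def life_expectancy(rates, scale=1.0):
        """Life expectancy at birth from age-specific mortality rates,
        assuming a constant risk within each single year of age."""
        alive, years = 1.0, 0.0
        for rate in rates:
            survivors = alive * math.exp(-scale * rate)
            years += (alive + survivors) / 2.0  # approximate person-years in the interval
            alive = survivors
        return years

    def calibrate_scale(rates, target_e0):
        """Bisection search for the factor that scales all rates to the target life expectancy."""
        lo, hi = 0.01, 10.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if life_expectancy(rates, mid) < target_e0:
                hi = mid  # too much mortality: try a smaller factor
            else:
                lo = mid
        return (lo + hi) / 2.0

    toy_rates = [0.002] * 50 + [0.02] * 30 + [0.2] * 20  # illustrative life table
    print(calibrate_scale(toy_rates, target_e0=75.0))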

Example: baseline hazard and relative risks estimated simultaneously

When modeling child mortality, we estimate the relative risks by mother’s education and mother’s age at birth, together with the baseline hazards by age of the child, from retrospective birth and death history data using a piece-wise constant hazard regression model. For estimating such a model, individual records are split into single records for each time span in which all covariates are the same: in our case, we get one record for each year a child is alive until age five. In contrast to this time-changing covariate (age), the other covariates (mother’s age group at birth, mother’s education level) stay the same in each of the created records. The only other two pieces of information needed to estimate the model are the time at risk in each time span (in our case, a full year if the child survives the year, the time lived at this age until the date of the survey, or the time until death if the child dies in a given year) and an indicator of whether the event (i.e., death) happened. The results of the regression are baseline hazards (death rates by age) and relative risks by mother’s education level and age group. Note that, by accounting for the time at risk, hazard regression models have no problem with “right censoring,” i.e., the fact that observations end at the moment they are collected, which is not necessarily at the end of the time intervals of the model (like age in our example). Also, the exit of persons from the sample for reasons other than the modeled event (e.g., emigration) can easily be handled by this approach. Mathematically, a piece-wise constant hazard model can be written as:

r_ij = a_j * exp(b' X_i)

    r_ij    hazard rate for individual i in time interval j
    a_j     baseline hazard for the time interval j
    X_i     vector of individual characteristics (covariates) of individual i
    b'      (transposed) vector of coefficients; their exponentials are the relative risks
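
A sketch of the record splitting described above, with hypothetical field names; each child contributes one record per year of age lived (up to age five), with its time at risk and an event indicator:

    import math
    from dataclasses import dataclass

    @dataclass
    class Child:
        age_at_death: float | None  # None if alive at the survey
        age_at_survey: float
        mother_edu: str
        mother_age_group: str

    def split_records(child, max_age=5):
        """One record per year of age lived, with exposure time and event flag."""
        end = child.age_at_death if child.age_at_death is not None else child.age_at_survey
        records = []
        for age in range(min(max_age, math.ceil(end))):
            exposure = min(end - age, 1.0)  # time at risk within this year of age
            died = (child.age_at_death is not None
                    and age <= child.age_at_death < age + 1)
            records.append((age, exposure, died, child.mother_edu, child.mother_age_group))
        return records

    print(split_records(Child(age_at_death=2.5, age_at_survey=4.0,
                              mother_edu="primary", mother_age_group="20-24")))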

Example: combining baseline hazards and relative risks stemming from different sources

One of the strengths of the proportional modeling approach is that baseline hazards and relative risks can stem from different data sources; this allows combining information from robust data sets for the baseline (e.g., administrative data, census) with far more detailed survey data for estimating relative risks. We used this approach as an option in modeling child mortality, taking the overall mortality rates from the base model and combining them with the relative risks estimated for child mortality. Combining information from two sources requires a calibration very similar to that in the first example above: for a given population composition by the characteristics used for the relative risks, the application of the relative risks to the calibrated baseline risk must result in the target baseline rate. (Again, this step is automatically performed in the micro-simulation model.) Cox regression is a statistical model widely used in the literature for estimating relative risks.
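
This calibration amounts to choosing a baseline such that, averaged over the population composition, the applied relative risks reproduce the target rate. A minimal sketch with hypothetical shares, relative risks, and target:

    # Hypothetical population shares by mother's education and estimated relative risks:
    shares = {"none": 0.3, "primary": 0.5, "secondary+": 0.2}
    rel_risk = {"none": 1.6, "primary": 1.0, "secondary+": 0.6}

    target_rate = 0.040  # illustrative overall child mortality rate from the base model

    # The calibrated baseline must reproduce the target rate when the
    # relative risks are averaged over the population composition:
    mean_rr = sum(shares[g] * rel_risk[g] for g in shares)
    baseline = target_rate / mean_rr

    for group in shares:
        print(group, baseline * rel_risk[group])  # group-specific calibrated rates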

All three of these examples concern rates and use the convenient feature that rates can be directly multiplied: doubling a rate means doubling the risk, which cuts the expected waiting time to the event in half. As noted above, probabilities cannot be linearly transformed like rates; also, as probabilities are limited to the interval 0 to 1, they are typically not estimated directly in regression models, as a mechanism has to be applied to limit regression results to the allowed interval. The most widely used transformation for estimating the probability of events is the logit (log odds), as used in logistic regression:

logit(p) = ln( p / (1-p)) = a + bX

As a result of the logistic regression, we get the coefficients a—the baseline—and a vector b of proportional factors from which we can calculate the probability of an event for a given vector of individual characteristics X.

p = exp(a+bX) / ( 1 + exp(a+bX) )

The estimated coefficients are interpreted as log odds or, more intuitively, as odds ratios when calculating their exponentials. For example, applying an odds ratio of 2 means that the odds of an event double from 1:x to 2:x. If chances were initially 50:50, they become 100:50, i.e., 2:1, or, expressed in probabilities, increase from 50 percent to about 67 percent.
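
A minimal Python sketch of the inverse logit and of applying an odds ratio to a probability (the odds ratio value is illustrative):

    import math

    def prob_from_logit(x):
        """Inverse logit: p = exp(x) / (1 + exp(x))."""
        return math.exp(x) / (1.0 + math.exp(x))

    def apply_odds_ratio(p, odds_ratio):
        """Scale the odds of p by an odds ratio and convert back to a probability."""
        o = p / (1.0 - p) * odds_ratio
        return o / (1.0 + o)

    print(apply_odds_ratio(0.5, 2.0))  # ~0.667, as in the example above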

Example: Logistic regression for modeling provincial differences and time trends

In our micro-simulation model, we used logistic regression models, for example, for deriving the probabilities to enter and to graduate from primary school. These models contain a baseline, a time trend, and proportional factors for provincial differences. Analysis has shown that provincial differences, expressed in log odds, are very persistent over time; the proportionality of the model is therefore a reasonable assumption. Accordingly, the base scenario assumes that these differences also persist in the future. Users can easily run alternative scenarios by, for example, assuming convergence of probabilities toward those of the best-performing province or, as presented in this report, assuming universal education is introduced immediately or phased in over the next 10 years for all provinces.

Example: Adding additional odds ratios to a model

As with continuous time models based on hazard regression, discrete time models based on probabilities also allow adding relative factors that might stem from different data sources. For example, in the models for primary school enrollment and graduation, we provide an option to add odds ratios by mother’s education, allowing the study of the inter-generational transmission of education. If this option is chosen, the additional factors are added in such a way that, by calibrating the base odds at the beginning of the simulation or at a specific year selected by the user, the overall probability of entering school is not changed. From this year onward, we then freeze the newly calibrated base odds. As a result, all projected changes on the aggregate level can be attributed to the changing educational composition of mothers.
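
A sketch of this calibration under simplifying assumptions (hypothetical shares, odds ratios, and target probability); a bisection search finds base odds such that adding the odds ratios leaves the overall probability unchanged:

    # Hypothetical shares of children by mother's education and added odds ratios:
    shares = {"none": 0.4, "primary": 0.4, "secondary+": 0.2}
    odds_ratio = {"none": 0.5, "primary": 1.0, "secondary+": 2.5}

    target_p = 0.80  # illustrative overall school entry probability to preserve

    def overall_p(base_odds):
        """Overall entry probability implied by base odds and group odds ratios."""
        return sum(s * (base_odds * odds_ratio[g]) / (1.0 + base_odds * odds_ratio[g])
                   for g, s in shares.items())

    lo, hi = 1e-6, 1e6
    for _ in range(80):  # overall_p is increasing in base_odds, so bisection converges
        mid = (lo + hi) / 2.0
        if overall_p(mid) < target_p:
            lo = mid
        else:
            hi = mid
    print("calibrated base odds:", (lo + hi) / 2.0)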

1.10.5. Probability Distributions and Origin-destination Matrices

Probability distributions are used for models with more than two possible outcomes. An example is the distribution of destination provinces of immigrants. If the destination distribution depends on the origin, distributions can be tabulated in origin-destination matrices. This approach is used for modeling internal migration, further divided by age group and sex. Like single probabilities, probability distributions and origin-destination matrices can be either directly tabulated from micro data or modeled; the choice depends on sample size and the desired number of covariates. Multinomial extensions exist for the logistic regression approach presented above; alternatively, multiple outcomes can be modeled by decision trees applying binomial models.
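
Drawing from such a distribution in the simulation reduces to comparing a uniform random number with cumulated probabilities. A minimal sketch with a hypothetical origin-destination matrix row:

    import random

    def draw_outcome(distribution, rng):
        """Draw one outcome from a discrete probability distribution."""
        u = rng.random()
        cumulative = 0.0
        for outcome, p in distribution.items():
            cumulative += p
            if u < cumulative:
                return outcome
        return outcome  # guard against floating point rounding

    # Hypothetical destination distribution for migrants from province A:
    row_a = {"A": 0.90, "B": 0.06, "C": 0.04}
    rng = random.Random(3)
    print(draw_outcome(row_a, rng))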

1.10.6. Micro-simulation Implementation

Once the required probabilities or rates are obtained and all regression coefficients are estimated, this information can be used to parameterize the micro-simulation model. Parameter tables can directly contain rates or probabilities by any number of dimensions (e.g., age, sex, region) or contain the regression coefficients, leaving the calculation of the used hazards or probabilities to the model. Model parameters are used in various ways:

  • Some parameters can be used directly. For example, in the base version of education, whether to admit people of school entry age into school is directly based on probabilities by calendar year, sex, and province contained in a three-dimensional parameter table. While the probabilities stem from a logistic regression of log odds, regression results were transformed outside the micro-simulation model into probabilities, making the parameter table more intuitive.
  • Some parameters require further calculations performed within the micro-simulation whenever they are needed. This is the case for parameter tables containing regression coefficients. For given individual characteristics, the regression formula is evaluated using the parameters. For example, child mortality is implemented using two parameters, an age baseline and relative risks. Based on this information, individual risks can be calculated by multiplication and updated at each birthday.
  • Some parameters are created within the model based on other parameters before the start of the simulation. For example, a “model-generated” parameter table of mortality hazards by age, sex, and calendar year is created before the simulation by scaling the values from a standard life table parameter to reach a target life expectancy for each year that stems from another parameter.
  • Some parameters are used indirectly as alignment targets. They are not used to schedule individual events but to set target values in the simulation. For example, in one model we use the age-specific fertility rates of the model’s base version to determine the target number of births matching existing macro projections, but use a more detailed model, including parity, education, and time since last birth, to distribute births to the most likely mothers based on relative differences in birth risks beyond age alone. In this case, the individual waiting times based on the refined model are used for ranking potential mothers, as sketched after this list.
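
A sketch of such alignment by ranking, with hypothetical names, rates, and target: each potential mother gets a stochastic waiting time from her refined birth rate, and the target number of births is assigned to those with the shortest waiting times:

    import math
    import random

    rng = random.Random(9)

    # Hypothetical refined birth rates (per year) for potential mothers:
    refined_rates = {"woman_1": 0.25, "woman_2": 0.10, "woman_3": 0.05}

    target_births = 1  # illustrative target from the aligned macro projection

    waits = {name: -math.log(1.0 - rng.random()) / rate
             for name, rate in refined_rates.items()}
    selected = sorted(waits, key=waits.get)[:target_births]
    print(selected)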

During the simulation, probabilities and rates are used to create individual events or decisions. Processes based on probabilities or rates are stochastic processes, meaning that they contain a random element that determines if or when an event happens to an individual. When simulating a large enough sample of individuals, the proportion of people experiencing an event of a given probability will come close to that probability, and the simulated number of events will come close to the expected value based on the given rates. For individuals, transitions remain binary: an event happens or it does not.

In the case of rates, waiting times are calculated and put into an event queue; the first event to happen censors the other events in the queue and causes an update of all other affected processes, as described previously. As with probabilities, the individual waiting times are stochastic, i.e., dependent on a random number in combination with the rate. If the sample is large enough, the average waiting time in the simulation will come close to its expected value, while individual waiting times are exponentially distributed.

In the case of probabilities, the model has to create regular (e.g., yearly) events at which a process or processes are updated. For example, school entry can be scheduled on a specific day each year. On this day, all eligible persons decide whether they enter school based on the given probabilities. Technically, this is done by drawing a uniformly distributed random number between 0 and 1 and comparing the result with the probability: the event happens if the random number is below the probability. Probability distributions and origin-destination matrices can also be used directly in combination with a random number to determine the outcome of a choice, like the destination province of a migrant.