1.6. Data Requirements
Micro-simulation is typically associated with high data requirements. This view stems from the predominant use of micro-simulation for modeling highly complex systems, e.g., the operations of social insurance systems in the context of social and demographic change. In contrast to such models, which must depict individual life-courses in great detail and include educational choices, employment, earnings, family dynamics, savings, health, and retirement decisions, the data requirements for population projection models are very modest and, for most countries, the necessary data are readily available.
DYNAMIS-POP requires two types of data:
- Population projection data as used in macro-projection models (and available online for most countries). This includes age-specific fertility patterns, projected total fertility rates, a standard life table, and projected life expectancies by period and sex.
- Four micro-data files for parameter estimation and the creation of the starting population. These files can typically be created from a population census and a household survey such as UNICEF’s Multiple Indicator Cluster Surveys (MICS) or USAID/ICF Macro’s Demographic and Health Surveys (DHS).
A file of current residents, typically compiled from a population census dataset. The production of partially or fully synthetic data as input to the model is an option worth considering. This approach (for which a specialized R package, simPop, is available, and which we document in a separate report) has two advantages: it produces anonymous datasets, and it offers a way to integrate data from multiple sources and to address quality issues in existing data.
M_ID Person ID (0,1,..)
M_HHID Household ID (0,1,..)
M_WEIGHT Sample weight (123.456)
M_AGE Age (in years, 16.789)
M_MALE Sex (female 0, male 1)
M_DOB District of birth (0..m, m = abroad)
M_DOR District of residence (0..n)
M_PDIST District 12 months ago (0..m, m = abroad)
M_EDUC Primary education (0 none, 1 incomplete, 2 completed)
M_PARITY Parity (0, 1..)
M_BIR12 Number of births in the past 12 months (0, 1, 2)
M_AGEMAR Age at first marriage (in years, 16.789, 999 never married)
M_AGEBIR Age at most recent birth (in years, 16.789, 999 childless)
M_ROB Region of birth (0..b, b = abroad)
M_ROR Region of residence (0..a)
M_PREG Region 12 months ago (0..b, b = abroad)
M_ETHNO Ethnicity (0..y)
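As an illustration, the checks below sketch how a residents file could be screened before it is passed to the model. This is a minimal sketch, not part of DYNAMIS-POP: it assumes the CSV file has a header row using the variable names listed above, and the function name `check_resident_rows` and the sample rows are hypothetical. Only a few of the documented value ranges are tested.

```python
import csv
import io

# Column names come from the variable list above. The assumption that the
# CSV carries them as a header row is ours, not the documentation's.
RESIDENT_COLUMNS = [
    "M_ID", "M_HHID", "M_WEIGHT", "M_AGE", "M_MALE", "M_DOB", "M_DOR",
    "M_PDIST", "M_EDUC", "M_PARITY", "M_BIR12", "M_AGEMAR", "M_AGEBIR",
    "M_ROB", "M_ROR", "M_PREG", "M_ETHNO",
]


def check_resident_rows(csv_text):
    """Return a list of (row number, problem) pairs for a residents file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    missing = [c for c in RESIDENT_COLUMNS if c not in present]
    if missing:
        # Row number 0 marks a problem with the header itself.
        return [(0, "missing columns: " + ", ".join(missing))]

    problems = []
    for n, row in enumerate(reader, start=1):
        if float(row["M_WEIGHT"]) <= 0:
            problems.append((n, "M_WEIGHT must be strictly positive"))
        if float(row["M_AGE"]) < 0:
            problems.append((n, "M_AGE must be non-negative"))
        if row["M_MALE"] not in ("0", "1"):
            problems.append((n, "M_MALE must be 0 or 1"))
        if row["M_EDUC"] not in ("0", "1", "2"):
            problems.append((n, "M_EDUC must be 0, 1 or 2"))
    return problems


# Hypothetical two-record file: the second record has a zero weight
# and an invalid sex code.
header = ",".join(RESIDENT_COLUMNS)
good = "0,0,123.456,16.789,0,3,3,3,2,0,0,999,999,1,1,1,0"
bad = "1,0,0,16.789,2,3,3,3,2,0,0,999,999,1,1,1,0"
for row_number, problem in check_resident_rows("\n".join([header, good, bad])):
    print(row_number, problem)
```

Further checks (for example, that district and region codes fall within the expected ranges) could be added in the same way once the country-specific code lists are known.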
A file of recent emigrants (people who emigrated in the past 12 months). As a proxy, this file is typically compiled from census information on household members living abroad.
M_WEIGHT Sample weight (123.456)
M_PDIST District of residence 12 months ago (0..n)
M_PREG Region of residence 12 months ago (0..x)
M_AGE Age (in years 18.901)
M_MALE Sex (0 female, 1 male)
A file of all child history records - births, deaths, vaccination - reported by women. This information is available in MICS as well as in DHS surveys:
M_WEIGHT Record Weight
M_INTERV Date of interview (months since 1900)
M_REGION Region (0,1..)
M_BIRTH Date of birth (months since 1900)
M_DEATH Date of death (months since 1900)
M_MALE Male (0/1)
M_AGEMO Mother's age at birth of child (months)
M_EDUCMO Primary education of mother (0 none / 1 some / 2 graduate)
M_ETHNO Ethnicity (0,1..)
M_VACC Child is vaccinated (0/1 one year old only; 999 others)
M_PCARE Mother received prenatal care (0/1 one year old only; 999 others)
A file of women recording all birth events. This information is available in MICS as well as in DHS surveys:
M_B01 Month of 1st birth (number of months since 1900; 9999 for none)
...
M_B14 Month of 14th birth (number of months since 1900)
M_WEIGHT Sample weight (123.456)
M_BIRTH Birth (number of months since 1900)
M_EDUC Primary education (0 none, 1 incomplete, 2 completed)
M_REG Region of residence (0..n)
M_INTERV Month of interview (number of months since 1900)
M_MAR Month of first marriage (number of months since 1900; 9999 never married)
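The month-based variables in the two survey files are counts of months since 1900. The sketch below shows one way to convert between a calendar year and month and such a count. The function names are hypothetical, and the offset is an assumption: here January 1900 maps to 0, whereas DHS century month codes map January 1900 to 1. Whichever convention is chosen should be checked against the source survey and applied consistently to all files.

```python
def months_since_1900(year, month):
    """Number of months elapsed between January 1900 and the given month.

    Assumption: January 1900 -> 0. Add 1 to the result if the DHS
    century-month-code convention (January 1900 -> 1) is required.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return (year - 1900) * 12 + (month - 1)


def calendar_month(months):
    """Inverse of months_since_1900: return (year, month)."""
    years, month_index = divmod(months, 12)
    return 1900 + years, month_index + 1


# Example: an interview held in June 2015.
interview = months_since_1900(2015, 6)
print(interview)                  # 1385
print(calendar_month(interview))  # (2015, 6)
```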
Population projection data can be copied directly into the corresponding model parameter tables, or be produced from csv files by the analysis R scripts provided.
Notes on variable construction:
- The terms “region” and “district” must be understood as “geographic area at first level” and “geographic area at second level”. This could be Region and District in some countries, State and Provinces in others, etc.
- The codes of regions and districts must be consistent over time (e.g., the codes for district of birth or previous residence must be fully compatible with the codes used for the current residence). If the administrative divisions of the country have changed over time, this must be addressed during data preparation.
- M_WEIGHT represents the sample weight or “weighting coefficient”. The value will be 1 for all observations when the data file is an exhaustive census. The sample weights can be calculated and calibrated based on published population tables. All weights must be strictly positive (but do not have to be integers).
- The M_AGE, M_AGEMAR, and M_AGEBIR variables correspond to exact age in years. In most survey and census datasets, the information is provided as age in completed years. In such cases, an “exact” age is obtained by adding a random value between 0 and 1 to the completed age. A good option is to add a value with three decimal places.
- All variables describing a “Time of …” or “Month of…” represent the number of months between January 1, 1900 and the date of occurrence of the event.
- The default format of the data files is comma-separated values (.CSV) text. If data are provided in a different format (e.g., Stata .dta or R .RData), the country-specific setup script for data analysis has to be adapted.
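The conversion from completed years to exact age described in the notes above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the random fraction is drawn uniformly from 0.000 to 0.999, and the value 999 (never married or childless in M_AGEMAR and M_AGEBIR) is treated as a code that must be left unchanged.

```python
import random


def exact_age(completed_years, rng):
    """Add a uniform random fraction with three decimals to a completed age."""
    fraction = rng.randrange(1000) / 1000  # one of 0.000, 0.001, ..., 0.999
    return round(completed_years + fraction, 3)


def exact_age_or_code(completed_years, rng, not_applicable=999):
    """Like exact_age, but pass the 'not applicable' code through untouched."""
    if completed_years == not_applicable:
        return not_applicable
    return exact_age(completed_years, rng)


# A seeded generator keeps the data preparation reproducible.
rng = random.Random(2024)
print(exact_age(16, rng))           # a value between 16.0 and 16.999
print(exact_age_or_code(999, rng))  # 999, left as is
```

Drawing the fraction with `randrange(1000)` rather than rounding a continuous draw guarantees that the result never rounds up into the next completed year.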
Figure: Population projection parameters