Andy Turner's MoSeS 2001 UK Demographic Initialisation Web Page @ School of Geography, University of Leeds

Introduction
- A web page about generating individual and household level population data for census output regions in the United Kingdom (UK) for 2001-04-29 - the 2001 UK Population Census enumeration date.
- The data was produced by running bespoke Java programs primarily developed by Andy Turner.
- This was UK demographic modelling work for MoSeS, a second phase research node of the ESRC funded National Centre for e-Social Science.
- MPJ Express was used to parallelise the Java programs to take advantage of multi-processor computer architectures
  - This allowed results to be generated within a more reasonable time frame on suitable available computer resources
    - The University of Leeds, School of Geography Beowulf (http://en.wikipedia.org/wiki/Beowulf_(computing))
    - The UK National Grid Service core site grid computers
- The original concept, outlined in the MoSeS Proposal, was to select Samples of Census 2001 Anonymised Records (SARs) to represent aggregate Census Area Statistics (CAS) populations
  - The result being an individual and household level population dataset for each CAS area in the UK.
  - This builds on the work of Williamson et al (1998) who applied this idea to integrate similar UK population census data for 1991
    - For further details see (Williamson (2003)) via the following URL:
      - http://pcwww.liv.ac.uk/~william/microdata/
  - Extending the work of Williamson et al (1998), this work focussed on developing a better type of Genetic Algorithm (GA) to use as the combinatorial optimisation search heuristic
    - A well designed GA can in principle search an entire solution space
      - It should be capable of producing as well fitting results as any other combinatorial optimisation search heuristic
        
        Including a well developed modified simulated annealing method as preferred by Williamson et al (1998)
        
        Depending on the nature of the data, there is an element of chance as to whether it will find the best solution before testing every possible solution
        
        If the solution space is smooth with only one peak of well fitting solutions, then optimisation is relatively trivial
        
        With a very large and diverse combinatorial search space with individual spikes of well fitting solutions then finding the optimal result is relatively difficult
        
        This is in part the nature of a pseudo random stochastic search heuristic especially those that focus the search around (in nearby parameter space of) previously tested solutions
- In summary, the basic idea is to select (well fitting) sets of records from the SARs to populate Census areas (for which there are published aggregate statistics)
  - Aggregate statistics from the selected SARs are compared with published aggregate CAS data to evaluate the fitness of a solution.
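
The comparison described above can be sketched in Java as follows. This is a minimal illustration and not the MoSeS code: the class name AggregateFitSketch, the choice of a sum of absolute differences as the measure, and the variable names and counts are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AggregateFitSketch {

    /**
     * Sum of absolute differences between counts aggregated from the selected
     * SAR records and the published CAS counts. A value of 0 is a perfect fit.
     * Variables missing from the sample are treated as a count of 0.
     */
    static long sumOfAbsoluteDifferences(Map<String, Long> sampleCounts,
                                         Map<String, Long> casCounts) {
        long total = 0L;
        for (Map.Entry<String, Long> cas : casCounts.entrySet()) {
            long sample = sampleCounts.getOrDefault(cas.getKey(), 0L);
            total += Math.abs(sample - cas.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts for one hypothetical area (not real census values).
        Map<String, Long> cas = new LinkedHashMap<>();
        cas.put("goodHealth", 200L);
        cas.put("limitingLongTermIllness", 50L);
        Map<String, Long> sample = new LinkedHashMap<>();
        sample.put("goodHealth", 195L);
        sample.put("limitingLongTermIllness", 53L);
        System.out.println(sumOfAbsoluteDifferences(sample, cas)); // prints 8
    }
}
```

A lower value indicates a better fitting set of records, so a search heuristic would try to minimise it.
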
- Age related control constraints are used to ensure the results match some detailed (aggregated) age profile of CAS.
- The resulting data was intended to be used as input to dynamic UK demographic models which stepped with an annual temporal resolution for the years 2001 to 2031
  - The results of the dynamic simulation in turn were to be inputs into applications, some of which focussed on support of dependent populations (the young, old and infirm).
- Contents:
  - Outline of Work
    - A brief summary of what was done
  - Genetic Algorithm
    - A general description
  - Data
    - An introduction to the SARs and the CAS
  - Permutations
    - Explains the need for a heuristic to search for a good solution: the number of potential solutions is so large that a brute force search which assesses them all is too computationally expensive
  - ISARHP_ISARCEP
    - Provides details of how the Individual SAR (ISAR) and CAS were integrated to produce an individual and household level population data set for the UK at Output Area (OA) level
  - HSARHP_ISARCEP
    - Provides details of how the ISAR, Household SAR (HSAR) and CAS were integrated to produce an individual and household level population data set for the UK at OA level
  - Results
    - Presentation of results showing how the two different results compare to the CAS for control constrained, optimisation constrained and unconstrained variables
  - Summary/Discussion/Conclusion
    - A summary outlining what could be done to enhance this work, and an honest reflection on whether this would be worthwhile given the resources it would require
  - References
  - Acknowledgements
Outline of Work
- MoSeS started in July 2005 and this part of the work (2001 UK Demographic Initialisation) went through two major iterations during the initial three year funding period.
  - The first major iteration used the Individual Licensed SAR (ISAR).
  - The second major iteration additionally used the Special Licensed Household SAR (HSAR).
  - These data are described in more detail in the Data Section
- In each major iteration there were many development iterations
  - At each iteration lessons were learned and information was gleaned to refine the process and understand computational requirements in terms of processing and data storage (time and memory).
  - New ways of presenting results were explored and attempts were made to automate the reporting of results.
- The first outputs for Leeds were produced in December 2005
  - These were based exclusively on the 2001 ISAR and CAS at Output Area (OA) level
    - For details of the 2001 ISAR data (Office for National Statistics (2006a)) see the Cathie Marsh Centre for Census and Survey Research (CCSR) 2001 Census Individual Licensed SAR Web Pages via the following URL
      - http://www.ccsr.ac.uk/sars/2001/indiv/
    - Details of the CAS data are available via
      - The UK Office for National Statistics (ONS) 2001 Census Area Statistics Tables Web Pages via the following URL
        
        http://www.statistics.gov.uk/census2001/cas_table_outlines.asp
      - The MIMAS Census Dissemination Unit (CDU) Casweb interface for downloading the data made available to the UK academic community via the following URLs
        
        http://cdu.mimas.ac.uk/2001/
        
        http://casweb.mimas.ac.uk/
- The code/process was parallelised and a first UK output was produced in January 2006
  - As with the results produced for Leeds, these results used a Household Formation Routine (HFR) developed by Belinda Wu (Birkin et al (2006b))
    - In the HFR, each household was formed from the individual Household Reference Persons (HRPs)
    - The HFR attempted to assign all other individuals to these households by matching HRP variables with other non-HRP records using the following ISAR variables
      - ISARDataRecord.get_AGE0()
        
        For details on the ISAR AGE0 variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/age0/
      - ISARDataRecord.get_MARSTAT()
        
        For details on the ISAR MARSTAT variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/marstat/
      - ISARDataRecord.get_RELTOHR()
        
        For details on the ISAR RELTOHR variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/reltohr/
      - ISARDataRecord.get_SEX()
        
        For details on the ISAR SEX variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/sex/
      - ISARDataRecord.get_FAMTYP()
        
        For details on the ISAR FAMTYP variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/famtyp/
      - ISARDataRecord.get_FNDEPCH()
        
        For details on the ISAR FNDEPCH variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/fndepch/
      - ISARDataRecord.get_HNELDERS()
        
        For details on the ISAR HNELDERS variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/hnelders/
      - ISARDataRecord.get_HNRESDNT()
        
        For details on the ISAR HNRESDNT variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/hnresdnt/
    - The HFR was untested and based on assumptions about the nature of households
      - Other ISAR variables could have been used to enhance it
      - It was a first attempt and believed to be useful
- The results were analysed and judged
  - Consultants reasoned that the results were not good enough to support the intended applications
    - This was subjective and no other benchmark results were available for comparison, but some way to reduce the difference between variables aggregated from the modelled individual level populations and those in the aggregate census data was wanted
    - Was the procedure outlined in the MoSeS Proposal (Birkin et al (2004)) inappropriate given the nature of the data, or was the implementation inadequate?
- Progress was presented at the Second International Conference on e-Social Science in July 2006 (Birkin et al (2006a))
  - Therein the programs that produce individual and household level population data by integrating Census Outputs are referred to as Population Reconstruction Models (PRMs)
- An attempt was made to improve the results by
  - Explicitly distinguishing between Household Population (HP) and Communal Establishment Population (CEP) and handling the optimisation of these populations for each region separately
  - Removing a constraint to Sample With No Replacement (SWNR)
    - To SWNR meant that for each region, multiple duplicate records were not allowed
      - The constraint was originally wanted by demographic expert consultants for reasons which were never clear
        
        Implementing SWNR was technically difficult
      - SWNR reduced the number of possible solutions, especially for regions with relatively homogeneous populations
  - Producing and analysing geographical maps of the differences to try to better understand the results and consider ways to improve them
  - Producing results at the Middle-level Super Output Area (MSOA) level
    - This was mainly an attempt to speed up the results generation process, but it was also possible it would lead to better fitting results at this scale
    - Yet, although there are fewer MSOA regions than Output Area (OA) regions, the increase in number of possible permutations for larger populations is more significant
      - Despite considerable effort, results were generated more slowly for a single MSOA than would be generated for all the OAs in the MSOA individually
    - The loss of spatial detail in the results was undesirable
    - Errors were comparable
      - At the MSOA level the fit of OA generated results (aggregated to MSOA level) was of the same order as that of results generated at MSOA level.
      - Despite an expected reduction in the effect of Small Cell Adjustment Measures (SCAM)
        
        SCAM is a disclosure protection measure which introduces error into CAS data with the aim of preserving confidentiality
        
        Details about SCAM and other disclosure protection measures for Census 2001 outputs are available via the following URL
        
        http://www.statistics.gov.uk/census2001/discloseprotect.asp
      - The main benefit in doing this work was that it highlighted the logical flaw in enforcing SWNR
        
        For some MSOA regions there were no control constraining results that could be generated with this constraint
      - Additionally, it added flexibility to the software being developed which became capable of producing output reports at the MSOA level
  - Removing optimisation constraints which measured how well the Household Formation Routine (HFR) had assigned individuals to households
    - They remained largely untested and were based on simplifying assumptions about household composition that were potentially too general
      - The main assumption was that each household was comprised of a single family, but it was realised that other types of household were also common in certain areas and this was considered to be a likely reason for the poor fit of aggregated ISAR statistics and CAS data for some Output Areas
- New results for the Leeds and the UK at OA level were generated and awaited analysis
  - These results did not indicate the households to which individuals in the household population were assigned
- At this stage the 2001 Census Special Licence Household SAR (Office for National Statistics (2006b)) was released and it was decided to develop a new procedure to use it for generating household populations
  - And so started the second major iteration of development
- Some details of the HSAR are provided in the Data Section
- There were a number of reasons for using the HSAR
  - As the data had only recently become available, they were of considerable interest to the demographic experts on the team
  - It was desirable to have a household level grouping
    - A Household Formation Routine (HFR) had been used to group ISARs in the household population into household groups; using the HSAR removed the need for such a routine, which remained largely untested and was based on simplifying assumptions about household composition that were potentially too general
- There were a number of issues with the HSAR
  1. It contained roughly 3.5 times fewer individual records (525725 records compared with 1843530 records in the ISAR).
    - Although the sample of about 200 thousand households (about 1%) was selected to be broadly representative, it is likely that with so many variables, there will be household profiles that are not represented.
  2. HSAR data for England and Wales were readily available, although via a specially arranged licence, but HSAR data for Scotland and Northern Ireland were not readily available and were kept locked in a secure facility
    - It was not feasible to move the entire modelling effort to work within the secure facility
    - The following two approaches were considered:
      1. Assume that, in general, households in Scotland and Northern Ireland are represented by those in the HSAR data for England and Wales.
      2. Reduce the scope of what was being done and only produce data output for England and Wales rather than the whole UK.
  3. There are differences in the variables in the HSAR and ISAR.
    - One of the most complicating differences concerns the age variable (AGEH in the HSAR and AGE0 in the ISAR)
      - AGEH coded age in 2 year bands up to age 80 (all higher ages were grouped together with a value of 80)
        
        http://www.ccsr.ac.uk/sars/2001/hhold/variables/ageh/
      - AGE0 coded single years of age up to 16 and from 75 to 95, with 5 year groups in between (all higher ages were grouped together with a value of 95)
        
        http://www.ccsr.ac.uk/sars/2001/indiv/variables/age0/
- Results for the UK at OA level were generated
- Work started to compare results from the newly developed GA methods with results produced from a simpler sampling re-weighting method
  - The first step involved producing results for a small subset of optimisation constraining variables for both methods
    - This was something of a benchmark test of the GA to see if it could work as well as the simpler method in simple cases
      - Results showed that the GA performed as well as the sampling re-weighting method for small numbers of constraining variables
  - One advantage of the GA method was that it was able to incorporate many variables into its optimisation function
    - What other methods could do this and produce comparable results?
Genetic Algorithm
- The following is a summary of Genetic Algorithm (GA) optimisation
  - Step 1. Generate (control constraining) initial results
  - Step 2. Breed the (control constraining) results to produce more (control constraining) results
  - Step 3. Measure the goodness of fit of the results (using optimisation constraints)
  - Step 4. Select/remove results based on the goodness of fit measure
  - Step 5. Repeat Steps 2 to 4 until convergence or a maximum number of iterations is reached
  - Step 6. Store the best fitting results or result
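
Steps 1 to 6 above can be sketched as a toy Java program. This is only an illustration under stated assumptions and not the MoSeS implementation: the class name GaLoopSketch, the toy problem (selecting a fixed number of values from a pool so that their total matches a target), the use of mutation as the only breeding method, and all parameter names are invented for the example. Records may be selected more than once, and survivors is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class GaLoopSketch {

    /** Goodness of fit: absolute difference between the selected total and the target (0 is best). */
    static int fitness(int[] selection, int[] pool, int target) {
        int sum = 0;
        for (int index : selection) {
            sum += pool[index];
        }
        return Math.abs(sum - target);
    }

    /** Mutation: copy the parent and replace one selected record, so the size is unchanged. */
    static int[] mutate(int[] parent, int poolSize, Random random) {
        int[] child = parent.clone();
        child[random.nextInt(child.length)] = random.nextInt(poolSize);
        return child;
    }

    static int[] optimise(int[] pool, int size, int target, int survivors,
                          int childrenPerSurvivor, int maxIterations, long seed) {
        Random random = new Random(seed);
        // Step 1: generate initial results that meet the control constraint (a fixed size).
        List<int[]> results = new ArrayList<>();
        for (int i = 0; i < survivors; i++) {
            int[] selection = new int[size];
            for (int j = 0; j < size; j++) {
                selection[j] = random.nextInt(pool.length);
            }
            results.add(selection);
        }
        Comparator<int[]> byFitness = Comparator.comparingInt(s -> fitness(s, pool, target));
        results.sort(byFitness);
        // Step 5: repeat Steps 2 to 4 until a perfect fit or the iteration limit is reached.
        for (int iteration = 0; iteration < maxIterations
                && fitness(results.get(0), pool, target) > 0; iteration++) {
            // Step 2: breed new results from the current ones (parents are kept as candidates).
            List<int[]> candidates = new ArrayList<>(results);
            for (int[] parent : results) {
                for (int c = 0; c < childrenPerSurvivor; c++) {
                    candidates.add(mutate(parent, pool.length, random));
                }
            }
            // Steps 3 and 4: measure goodness of fit and keep only the best fitting results.
            candidates.sort(byFitness);
            results = new ArrayList<>(candidates.subList(0, survivors));
        }
        // Step 6: return the best fitting result.
        return results.get(0);
    }
}
```

Because the parents are kept among the candidates at each iteration, the best fit in this sketch never gets worse from one iteration to the next.
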
- The summary of GA optimisation presented above is very general
- In terms of producing the individual and household level population results for the UK there are lots of details to consider
  - Granularity of the task
    - It is possible to attempt the optimisation for any region from the entire UK down to Output Area (OA) level
    - A single optimisation for the entire UK presents the greatest number of permutations of potential results
      - It also results in no detail in terms of the spatial resolution of results
    - Optimisation was considered at the following levels:
      - Output Area (OA) level
      - Ward level
      - Middle Super Output Area (MSOA) level
      - Local Authority District (LAD) level
    - The highest spatial resolution results would be at OA level
      - There are 223060 OA in the UK
        
        223060 is a considerable number of optimisations
        
        Supposing optimisation for each OA takes 5 minutes
        
        This amounts to
        
        Over 2.1 years of processing if the optimisations are done sequentially (one after the other)
        
        Only 5 minutes of processing if all the results are generated in parallel simultaneously
        
        This is possible as each optimisation can be generated independently
        
        In parallel processing lingo, the problem is "embarrassingly parallel"
  - Control constraints
    - A Control Constraint (CC) is a term used to represent a criterion for a result that has to be met, else the result is invalid
      - Others sometimes call this a "hard constraint"
    - This could be something seemingly simple, e.g. the number of people in the population has to be correct
      - But just because it is simple to describe, does not mean it is simple to impose to produce valid results
        
        Consider that in one type of optimisation we are selecting household records which may represent several individuals
    - It might also be relatively detailed, e.g. the age and gender/sex structure of the results has to be correct
    - Indeed any variable might be converted into a CC, and a complex set of CCs can be specified
      - With more CCs any results that are found should be better results, but the more complex and detailed the CCs are that are imposed, the fewer the number of potential results there are which makes optimisation more difficult
  - Optimisation constraints
    - These are used to measure the goodness of fit of a result.
    - An Optimisation Constraint (OC) is a term used to represent a variable which is input into a goodness of fit measure in the GA
      - Others sometimes call this a "soft constraint"
    - A set of OCs are chosen to try to get the result optimal for specific uses
      - Greater weight might be applied to different types of constraint
        
        For example consider the following in terms of health related variables
        
        An OC might be the difference in the number of people with limiting long term illness in the sample set and the number of people with limiting long term illness in Census Aggregate Statistics.
        
        Another OC might be the difference in the number of people with good health in the sample set and the number of people with good health in Census Aggregate Statistics.
        
        These OCs might be combined and each may be given a weighting factor and they might be further combined with other OCs.
    - One advantage of a genetic algorithm approach is that many OCs can be used and that these can be changed to re-optimise results to help produce results for a specific use
      - Population data can be re-optimised to be better suited for specific purposes
        
        For some applications it might be better to focus on getting better fitting results in terms of health and old age, employment and middle age, or educational attainment and school age population variables
  - Breeding
    - If Communal Establishment Population (CEP) is control constrained independently of Household Population (HP) there are effectively two genes in the genetic algorithm
      - One type of breeding that might be possible (depending on control constraints) is via a crossover method
        
        In this the CEP of one result might be put with the HP of another result to create a new result in breeding
    - Although crossover breeding might be useful, by far the most important breeding method used is mutation
      - For this some individuals or households in a population are swapped with others in such a way that control constraints are maintained.
      - Allowing the swapping of a large number of individuals or households increases the breadth of search for the genetic algorithm and reduces the likelihood of the optimisation getting stuck with a sub-optimal solution.
    - It requires a great deal of experimentation to investigate
      - How much breeding should take place at each optimisation iteration
      - Whether it is a good idea to try to keep a variety of solutions as opposed to only solutions bred from a single result
  - Convergence criteria
    - These try to prevent the algorithm running for too long
    - If an optimal result is found then there is no longer any need to continue trying to find a better result.
    - If an optimal result has not been found in a reasonable time, or if a specified number of breeding and selection iteration cycles have completed, then a convergence criterion may force the algorithm to stop optimising and return a result.
    - Another convergence criterion might keep track of the number of optimisation iterations that have been completed since a better fitting result was found
      - If no improvement is found after a given number of tries, the optimisation may be stopped and a result returned.
      - This can also be used to control how different genetic algorithm parameters can vary
        
        For instance different amounts of mutation might be tried, and/or different numbers of "survivors" from each iteration might be specified.
    - As with genetic algorithm parameters for breeding, there is a need for experimentation in order to discover sane convergence criteria.
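
The convergence criteria described above can be combined in a small Java helper. This is a sketch under assumptions: the class name ConvergenceSketch, its two limits, and the convention that a fitness of 0 means a perfect fit are illustrative and are not taken from the MoSeS code.

```java
final class ConvergenceSketch {

    private final int maxIterations;
    private final int maxIterationsWithoutImprovement;
    private int iterations = 0;
    private int iterationsSinceImprovement = 0;
    private double bestFitness = Double.POSITIVE_INFINITY;

    ConvergenceSketch(int maxIterations, int maxIterationsWithoutImprovement) {
        this.maxIterations = maxIterations;
        this.maxIterationsWithoutImprovement = maxIterationsWithoutImprovement;
    }

    /**
     * Record the best fitness found in the latest iteration (lower is better)
     * and report whether the optimisation should stop.
     */
    boolean shouldStop(double iterationBestFitness) {
        iterations++;
        if (iterationBestFitness < bestFitness) {
            bestFitness = iterationBestFitness;
            iterationsSinceImprovement = 0;
        } else {
            iterationsSinceImprovement++;
        }
        return bestFitness == 0.0                       // an optimal result was found
                || iterations >= maxIterations          // the iteration limit was reached
                || iterationsSinceImprovement >= maxIterationsWithoutImprovement; // stagnation
    }
}
```

A GA loop would call shouldStop once per breeding and selection cycle. Suitable values for the two limits would have to be found by the kind of experimentation mentioned above.
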
- The imposition of different GA control constraints can have a big effect on how to do breeding
  - With control constraints and breeding routines in place a considerable amount of experimentation is needed in order to
    - Set appropriate GA parameters for the
      - Size of initial set of results
      - Selection of results from each iteration
      - Convergence criteria
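
To illustrate how control constraints shape breeding, the following Java sketch shows one way a mutation could swap records while leaving control constrained counts unchanged. It is a hedged example, not the method used in MoSeS: the class ConstrainedMutationSketch, the stand-in SarRecord type and the idea of a single controlGroup label per record (for instance an age band and sex) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ConstrainedMutationSketch {

    /** Hypothetical stand-in for a SAR record: an identifier and a control constraint group. */
    static final class SarRecord {
        final int id;
        final String controlGroup; // hypothetical label, e.g. an age band and sex
        SarRecord(int id, String controlGroup) {
            this.id = id;
            this.controlGroup = controlGroup;
        }
    }

    /** Count of selected records in each control group. */
    static Map<String, Integer> controlCounts(List<SarRecord> selection) {
        Map<String, Integer> counts = new HashMap<>();
        for (SarRecord record : selection) {
            counts.merge(record.controlGroup, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Return a copy of the selection in which up to 'swaps' records are replaced by
     * pool records from the same control group, so controlCounts is unchanged.
     */
    static List<SarRecord> mutate(List<SarRecord> selection, List<SarRecord> pool,
                                  int swaps, Random random) {
        Map<String, List<SarRecord>> poolByGroup = new HashMap<>();
        for (SarRecord record : pool) {
            poolByGroup.computeIfAbsent(record.controlGroup, k -> new ArrayList<>()).add(record);
        }
        List<SarRecord> child = new ArrayList<>(selection);
        if (child.isEmpty()) {
            return child;
        }
        for (int s = 0; s < swaps; s++) {
            int position = random.nextInt(child.size());
            List<SarRecord> sameGroup = poolByGroup.get(child.get(position).controlGroup);
            if (sameGroup != null) { // records whose group is absent from the pool stay in place
                child.set(position, sameGroup.get(random.nextInt(sameGroup.size())));
            }
        }
        return child;
    }
}
```

In this sketch the number of swaps plays the role of the mutation amount: a larger value widens the search, while the like-for-like replacement keeps every bred result valid with respect to the control constraints.
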
Data
- This section provides an introduction to the 2001 Census data focussing on the Census data outputs integrated to produce individual and household level data at the Output Area (OA) level for the UK
- Census data outputs are produced from Census data which exists as an individual and household level data set for the UK, but which is not made available for most research
- If researchers could access the individual and household level Census data to produce demographic projections and for other uses, then this work which tries to integrate some census data outputs to provide an estimate of these data would be irrelevant
- The Census data outputs integrated in this work are the Individual and Household Samples of Anonymised Records and the Census Aggregate/Area Statistics
  - These are samples and generalisations of Census data that have been processed in numerous ways
    - Details of what is done to process these data and the details of resulting data are important, but they are not presented here
      - Hopefully it suffices to say that the results are but a shadow of the Census data they are derived from and the values of variables in different outputs can be grouped differently and the results may not necessarily add up.
- The 2001 Individual Licensed SAR (ISAR) "is a 3 per cent sample and contains 1843530 individuals and includes information on age, gender, ethnicity, health, employment status, housing, amenities, family type, geography, social class, education, distance to work, workplace, hours worked and migration. The data are available for England, Wales, Scotland and Northern Ireland" (Office for National Statistics (2006a), CCSR 2001 Census Individual Licensed SAR Web Page)
- These data are made available for academic research via the Cathie Marsh Centre for Census and Survey Research (CCSR) following the completion of registration and usage agreement documentation via the following URL
  - http://www.ccsr.ac.uk/sars/2001/indiv/
- The Special Licence Household SAR (HSAR) "is a 1 per cent sample of households and all those individuals in those households from the 2001 Census. It is a hierarchical file allowing linkages to be made between individuals within families and households. The Special Licence Household SAR contains information on age, gender, ethnicity, marital status, social class, education and employment status. It also includes household level variables, e.g. housing tenure and number of cars. A number of derived variables have been added, for example, the number of full time earners in a household or the age of the youngest dependent child in a household" (Office for National Statistics (2006b), CCSR 2001 Census Special Licence Household SAR Web Page)
- These data are made available for academic research via the UK Data Archive Economic and Social Data Service (ESDS) as Study Number 5278
- Details about this data are also made available by CCSR via the following URL
  - http://www.ccsr.ac.uk/sars/2001/hhold/
- The Census Aggregate/Area Statistics (CAS) "consist of a series of tables which provide detailed information down to the most local geographic level - the Output Area. The full set is composed of CAS tables, CAS Theme Tables with information on a particular population such as dependent children or pensioner households, and Univariate tables which provide a more detailed breakdown for a single topic" (ONS 2001 Census Area Statistics Tables Web Page)
- Details of these data are publicly available via the Office for National Statistics
  - Table outlines can be downloaded via the following URL
    - http://www.statistics.gov.uk/census2001/cas_table_outlines.asp
- These data are made available by the MIMAS Census Dissemination Unit (CDU) (to the UK academic research community) and can be accessed via the following URL
  - http://www.census.ac.uk/cdu/2001/
- Two results are in the process of being submitted to the UK Data Archive Economic and Social Data Service (ESDS)
  - On their own, these results provide only a list of SAR record identifiers for each Output Area
    - For any detailed use of the data, a copy of the ISAR is needed and for detailed use of household population produced from HSAR data, a copy of the HSAR data is also needed
    - In addition it is recommended that any use of these data begins with an analysis of the data with regard to the aggregate fit with respect to available CAS variables that are considered important
  - Once there is a study number for these data and they are available, details will be provided...
Permutations
- The general formula for finding the number of permutations of size p taken from n objects (npPermutations) is given by the following equation:
  - npPermutations = n! / ( n - p )!
    - the operator / is for division, and = is an equals assignment operator
    - 10 factorial (10!) means multiply 10, 9, 8, 7, 6, 5, 4, 3, 2, and 1
      - using * as a multiplication operator, this can be written as (10*9*8*7*6*5*4*3*2*1) or (10*9*...*2*1)
- So for p = 300 (close to the average population of a Census Output Area), the number of permutations of potential sets of 300 records from the ISAR (with n = 1843525) is:
  - npPermutations = 1843525! / ( 1843525 - 300 )!
  - npPermutations = (1843525*1843524*...*1843225*1843224*...*2*1) / (1843225*1843224*...*2*1)
  - npPermutations = (1843525*1843524*...*1843227*1843226)
  - Which is a bit over 1843226 raised to the power of 300
    - This is so large a number of possible permutations that a brute force search that evaluates them all is too computationally expensive
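
The size of this number can be checked without computing the factorials directly. The Java sketch below sums base 10 logarithms of the 300 terms in the product; the class and method names are assumptions made for the example, while n and p are the values quoted above.

```java
final class PermutationCountSketch {

    /** log10 of n! / (n - p)!, computed as a sum of logarithms to avoid overflow. */
    static double log10Permutations(long n, int p) {
        double log10 = 0.0;
        for (long k = n - p + 1; k <= n; k++) {
            log10 += Math.log10((double) k);
        }
        return log10;
    }

    public static void main(String[] args) {
        // n and p as quoted in the text above.
        double exponent = log10Permutations(1843525L, 300);
        System.out.printf("npPermutations is about 10^%.2f%n", exponent);
    }
}
```

Each of the 300 terms is a little over 10 to the power of 6.26, so the sum is a little under 1880: the number of permutations has about 1880 digits.
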
- Furthermore, consider that there are 223060 Output Areas in the UK
- Using detailed control constraints reduces the number of permutations
  - Exactly how many permutations there are will depend on the distribution of control constraining variable values in the SAR and in the distribution of data variables in an Output Area being processed
    - Detailed mathematical working has not been done to ascertain details of the permutations in each case
      - It was assumed that in each case at least one control constraining solution existed and it was hoped that there were many - providing a reasonably large and variable solution space for the GA to do its work
ISARHP_ISARCEP
- This section provides some details of the Genetic Algorithm (GA) implementation for creating a population result for each UK Output Area (OA) that specifies which ISARs represent the Household Population (HP) and which ISARs represent the Communal Establishment Population (CEP) including
  - Control Constraint (CC) details
  - Optimisation Constraint (OC) details
  - GA parameter details
- It also provides the details of the GA parameters used to produce the result submitted to the archive and analysed in the Results Section
- The following Java operator symbols are used in this section
  - == is an equality operator returning true if something preceding is equal to something subsequent
  - != is an equality operator returning true if something preceding is not equal to something subsequent
  - || is a logical OR operator returning true if something preceding is true or something subsequent is true
  - + is an addition operator which for numeric arguments produces the arithmetic sum
- The aim is to generate an Object[] population, for each CAS region, where:
  - population[0] provides references to ISAR records (ISARDataRecords) for the Household Population (HP); and,
  - population[1] provides references to ISARDataRecords for the Communal Establishment Population (CEP).
- Control Constraints
  - Counts of Household Population Household Reference Persons from CAS Table 003 (CAS003) with records represented by CAS003DataRecord
  - Counts of Household Population from CAS Table 001 (CAS001) with records represented by CAS001DataRecord
  - Counts of Communal Establishment Populations from CAS Table 003 (CAS003) with records represented by CAS003DataRecord
  - Additional control constraints considered:
    - Selecting only ISARDataRecords with ( CESTATUS == -9 ) for HP, and ISARDataRecords with ( CESTATUS == 1 || CESTATUS == 2 ) for CEP.
      - For details on the CESTATUS variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/cestatus/
      - Imposing this constraint is reasonably straightforward, however it was not done for the following reasons
        
        It was thought that it might not be possible for some areas (more so given the combined effect of other constraints); populations of similar types are expected to cluster (e.g. students); and the ISAR is only a 3% sample (if it were a 100% record this entire task would be approached differently, but 3% is reasonably small!)
        
        It is probably not as sensible as trying to achieve the desired effect by modifying the measurement of goodness of fit during optimisation
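
For illustration only, the additional CESTATUS constraint that was considered (and not imposed) could be expressed in Java along the following lines. The class CestatusFilterSketch and the RecordStub type are hypothetical stand-ins, since the real ISARDataRecord class is not reproduced here; the CESTATUS values -9, 1 and 2 are those given above, and the two element Object[] mirrors the population[0] and population[1] layout described earlier in this section.

```java
import java.util.ArrayList;
import java.util.List;

final class CestatusFilterSketch {

    /** Hypothetical stand-in for ISARDataRecord, holding only an identifier and CESTATUS. */
    static final class RecordStub {
        final long id;
        final int cestatus;
        RecordStub(long id, int cestatus) {
            this.id = id;
            this.cestatus = cestatus;
        }
    }

    /** Candidate for the Household Population (HP). */
    static boolean isHouseholdCandidate(RecordStub record) {
        return record.cestatus == -9;
    }

    /** Candidate for the Communal Establishment Population (CEP). */
    static boolean isCommunalCandidate(RecordStub record) {
        return record.cestatus == 1 || record.cestatus == 2;
    }

    /** Split records into HP candidates (index 0) and CEP candidates (index 1). */
    static Object[] partition(List<RecordStub> records) {
        List<RecordStub> householdCandidates = new ArrayList<>();
        List<RecordStub> communalCandidates = new ArrayList<>();
        for (RecordStub record : records) {
            if (isHouseholdCandidate(record)) {
                householdCandidates.add(record);
            } else if (isCommunalCandidate(record)) {
                communalCandidates.add(record);
            } // records with any other CESTATUS value are left out of both pools
        }
        return new Object[] { householdCandidates, communalCandidates };
    }
}
```

Restricting the pools in this way shrinks the set of records available for each population, which is one of the reasons given above for not imposing the constraint.
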
- Optimisation Constraints
  - The variables come from the following tables
    - CAS Key Statistics Table 008 (CASKS008) with records represented by CASKS008DataRecord
    - CAS Key Statistics Table 020 (CASKS020) with records represented by CASKS020DataRecord
    - CAS Key Statistics Table 09b (CASKS09b) with records represented by CASKS09bDataRecord
    - CAS Key Statistics Table 09c (CASKS09c) with records represented by CASKS09cDataRecord
    - CAS Table 001 (CAS001) with records represented by CAS001DataRecord
    - CAS Table 002 (CAS002) with records represented by CAS002DataRecord
  - From CASKS008 the following are used
    - peopleWhoseGeneralHealthWasGood = CASKS008DataRecord.getPeopleWhoseGeneralHealthWasGood()
    - peopleWhoseGeneralHealthWasFairlyGood = CASKS008DataRecord.getPeopleWhoseGeneralHealthWasFairlyGood()
    - peopleWhoseGeneralHealthWasPoor = CASKS008DataRecord.getPeopleWhoseGeneralHealthWasPoor()
    - peopleWithLimitingLongTermIllness = CASKS008DataRecord.getPeopleWithLimitingLongTermIllness()
    - peopleWithoutLimitingLongTermIllness = ( CASKS008DataRecord.getAllPeople() - CASKS008DataRecord.getPeopleWithLimitingLongTermIllness() )
  - ISAR aggregates for comparison with CASKS008 variables are derived from
    - ISARDataRecord.get_HEALTH()
      - For details on the ISAR HEALTH variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/health/
    - ISARDataRecord.get_LLTI()
      - For details on the ISAR LLTI variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/llti/
  - From CASKS020 the following are used
    - marriedOrCohabitingCoupleWithChildren = ( CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersCohabitingCoupleHouseholdsAllChildrenNonDependent() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersCohabitingCoupleHouseholdsWithDependentChildren() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersMarriedCoupleHouseholdsAllChildrenNonDependent() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersMarriedCoupleHouseholdsWithDependentChildren() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersMarriedCoupleHouseholdsNoChildren() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersMarriedCoupleHouseholdsNoChildren() )
    - loneParentHouseholdsWithChildren = ( CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersLoneParentHouseholdsAllChildrenNonDependent() + CASKS020DataRecord.getHouseholdsComprisingOneFamilyAndNoOthersLoneParentHouseholdsWithDependentChildren() )
  - ISAR aggregates for comparison with CASKS020 variables are derived from ISARDataRecord.get_FAMTYP() for Household Reference Persons
    - For details on the ISAR FAMTYP variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/famtyp/
  - From CASKS09b the following are used
  - From CASKS09c the following are used
  - ISAR aggregates for comparison with CASKS09b and CASKS09c variables are derived from
    - ISARDataRecord.get_SEX()
      - For details on the ISAR SEX variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/sex/
    - ISARDataRecord.get_ECONACT()
      - For details on the ISAR ECONACT variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/econact/
    - ISARDataRecord.get_AGE0()
      - For details on the ISAR AGE0 variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/age0/
  - From CAS001 the following are used (note that these could have been applied individually to HP and CEP)
  - ISAR aggregates for comparison with CAS001 are derived from
    - ISARDataRecord.get_SEX()
      - For details on the ISAR SEX variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/sex/
    - ISARDataRecord.get_AGE0()
      - For details on the ISAR AGE0 variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/age0/
  - From CAS002 the following are used
  - ISAR aggregates for comparison with CAS002 are derived from
    - ISARDataRecord.get_SEX()
      - For details on the ISAR SEX variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/sex/
    - ISARDataRecord.get_AGE0()
      - For details on the ISAR AGE0 variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/age0/
    - ISARDataRecord.get_MARSTAT()
      - For details on the ISAR MARSTAT variable see http://www.ccsr.ac.uk/sars/2001/indiv/variables/marstat/
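Each optimisation constraint compares a CAS count with the corresponding count aggregated from the selected SAR records. A minimal sketch of one possible goodness-of-fit measure follows, assuming a sum of squared errors. The measure actually used is not specified in this section, and the names FitSketch, derivedWithout and sumOfSquaredErrors are hypothetical.

```java
// Sketch only: an assumed sum-of-squared-errors measure, not necessarily the one used.
class FitSketch {

    // A derived count, in the style of peopleWithoutLimitingLongTermIllness above:
    // all people minus those with the characteristic.
    static long derivedWithout(long allPeople, long peopleWith) {
        return allPeople - peopleWith;
    }

    // Compares CAS counts with SAR-derived counts, category by category.
    // A value of 0 means the aggregated sample matches the CAS counts exactly.
    static double sumOfSquaredErrors(long[] casCounts, long[] sarCounts) {
        if (casCounts.length != sarCounts.length) {
            throw new IllegalArgumentException("count arrays must have equal length");
        }
        double sse = 0.0;
        for (int i = 0; i < casCounts.length; i++) {
            double difference = casCounts[i] - sarCounts[i];
            sse += difference * difference;
        }
        return sse;
    }

    public static void main(String[] args) {
        long[] cas = {10L, 5L, 3L};
        long[] sar = {8L, 5L, 4L};
        System.out.println("SSE: " + sumOfSquaredErrors(cas, sar));
        System.out.println("Without: " + derivedWithout(100L, 18L));
    }
}
```

Squaring the differences penalises large mismatches in a single category more heavily than several small ones; an absolute-difference measure would be an equally plausible choice.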
- GA Parameters
  - The following GA parameters were specified:
    - _InitialPopulationSize
      - This controlled the number of solutions that were initially pseudo randomly created prior to optimisation
      - (The terminology is slightly confusing: "population" is used here for a set of candidate solutions, while each solution itself represents a population of people)
    - _NumberOfOptimisationIterations
      - Control for the maximum number of optimisation iterations that would be done before a result was output
    - _MaxNumberOfSolutions
      - Control for the maximum number of best solutions which would be selected in each optimisation iteration to "survive"
    - _ConvergenceThreshold
      - Control for the number of consecutive iterations without a better solution being found, after which the best fitting result would be returned
    - _MaxNumberOfMutationsPerChild
      - Control for the maximum number of records that could be swapped in a breeding mutation
    - _MaxNumberOfMutationsPerParent
      - Control for the maximum number of times a solution would be used in breeding in an optimisation iteration
    - _RandomSeed
      - Control for the pseudo random number generator used for the stochastic elements of the heuristic
      - The pseudo-random number seed is set and used so that results can be readily recreated
  - For the result submitted to ESDS the run was done in two parts
    - The first part was essentially an initialisation and used the following parameters
    - The second part was based on the initialised results and optimisation used the following parameters
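The way the GA parameters listed above interact can be sketched with a toy example. This is an illustration under stated assumptions, not the MoSeS implementation: breeding is assumed to be mutation-only, a solution is an array of indices into a pool of integer-valued records, and the toy fitness (distance of the selected total from a target) is invented for the example. The class and method names are hypothetical; only the parameter names follow the list above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch only: a toy mutation-only GA. All size parameters are assumed to be at least 1.
class GaSketch {

    // Toy fitness: absolute difference between the selected records' total and a target.
    static long fitness(int[] solution, int[] pool, long target) {
        long sum = 0;
        for (int index : solution) {
            sum += pool[index];
        }
        return Math.abs(sum - target);
    }

    static int[] optimise(int[] pool, int solutionSize, long target,
            int initialPopulationSize, int numberOfOptimisationIterations,
            int maxNumberOfSolutions, int convergenceThreshold,
            int maxNumberOfMutationsPerChild, int maxNumberOfMutationsPerParent,
            long randomSeed) {
        Random random = new Random(randomSeed); // a fixed seed makes runs repeatable
        List<int[]> solutions = new ArrayList<>();
        for (int i = 0; i < initialPopulationSize; i++) {
            int[] solution = new int[solutionSize];
            for (int j = 0; j < solutionSize; j++) {
                solution[j] = random.nextInt(pool.length);
            }
            solutions.add(solution);
        }
        Comparator<int[]> byFitness = Comparator.comparingLong(s -> fitness(s, pool, target));
        int[] best = null;
        int iterationsWithoutImprovement = 0;
        for (int iteration = 0; iteration < numberOfOptimisationIterations; iteration++) {
            solutions.sort(byFitness);
            // Only the best maxNumberOfSolutions solutions survive this iteration.
            int keep = Math.min(maxNumberOfSolutions, solutions.size());
            List<int[]> survivors = new ArrayList<>(solutions.subList(0, keep));
            int[] candidate = survivors.get(0);
            if (best == null || fitness(candidate, pool, target) < fitness(best, pool, target)) {
                best = candidate;
                iterationsWithoutImprovement = 0;
            } else if (++iterationsWithoutImprovement >= convergenceThreshold) {
                break; // no better solution found for convergenceThreshold iterations
            }
            List<int[]> next = new ArrayList<>(survivors);
            for (int[] parent : survivors) {
                // Each survivor is used in breeding maxNumberOfMutationsPerParent times.
                for (int use = 0; use < maxNumberOfMutationsPerParent; use++) {
                    int[] child = parent.clone();
                    // Each child has between 1 and maxNumberOfMutationsPerChild records swapped.
                    int swaps = 1 + random.nextInt(maxNumberOfMutationsPerChild);
                    for (int k = 0; k < swaps; k++) {
                        child[random.nextInt(solutionSize)] = random.nextInt(pool.length);
                    }
                    next.add(child);
                }
            }
            solutions = next;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] pool = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
        int[] result = optimise(pool, 4, 12L, 6, 50, 3, 10, 2, 2, 42L);
        System.out.println("Fitness of best solution: " + fitness(result, pool, 12L));
    }
}
```

Because survivors are carried into the next iteration unchanged, the best fitness never gets worse from one iteration to the next, and running twice with the same seed returns the same solution.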
HSARHP_ISARCEP
- This section provides some details of the Genetic Algorithm (GA) implementation for creating a population result for each UK Output Area (OA) that specifies which HSARs represent the Household Population (HP) and which ISARs represent the Communal Establishment Population (CEP) including
  - Control Constraint (CC) details
  - Optimisation Constraint (OC) details
  - GA parameter details
- It also provides the details of the GA parameters used to produce the result submitted to the archive and analysed in the Results Section
- The aim is to generate an Object[] population, for each CAS region, where:
  - population[0] provides references to HSAR records (HSARDataRecords) for the Household Population (HP) Household Reference Person; and,
  - population[1] provides references to ISARDataRecords for the Communal Establishment Population (CEP).
- Control Constraints
  - The same control constraints are applied as in the ISARHP_ISARCEP optimisations with the exception of total population by gender/sex
    - Total population by gender/sex or any more detailed criteria is very difficult to constrain to as entire households are selected on the basis of a Household Reference Person constraint and there can be between 0 and 11 other persons in a household
- Optimisation Constraints (OCs)
  - Similar OCs were applied as in the ISARHP_ISARCEP optimisations, the differences are that
    - For CAS001 and CAS002 derived OCs, some additional constraints were specified in the age range 30 to 59
    - For CASKS09b and CASKS09c
      - A count of unemployed aged 50 and over for Females and Males was used
        
        Again, the greater age variable detail in the HSAR allows for this
        
        CASKS09bDataRecord.getMalesAge50AndOverUnemployed()
        
        CASKS09cDataRecord.getFemalesAge50AndOverUnemployed()
      - But other OCs were either omitted or further aggregated such that Female and Male counts were only detailed for Economically Active Employed and Economically Active Unemployed
        
        From CASKS09b the following were used
        
        ( CASKS09bDataRecord.getMalesAge16to74EconomicallyActiveEmployedFullTime() + CASKS09bDataRecord.getMalesAge16to74EconomicallyActiveEmployedPartTime() + CASKS09bDataRecord.getMalesAge16to74EconomicallyActiveSelfEmployed() )
        
        ( CASKS09bDataRecord.getMalesAge16to74EconomicallyActiveUnemployed() + CASKS09bDataRecord.getMalesAge16to74EconomicallyInactiveRetired() )
        
         From CASKS09c the following were used
        
        ( CASKS09cDataRecord.getFemalesAge16to74EconomicallyActiveEmployedFullTime() + CASKS09cDataRecord.getFemalesAge16to74EconomicallyActiveEmployedPartTime() + CASKS09cDataRecord.getFemalesAge16to74EconomicallyActiveSelfEmployed() )
        
        ( CASKS09cDataRecord.getFemalesAge16to74EconomicallyActiveUnemployed() + CASKS09cDataRecord.getFemalesAge16to74EconomicallyInactiveRetired() )
  - The SAR aggregate statistics for comparison are derived from HSAR aggregates for the HP, from ISAR aggregates for the CEP, and from both HSAR and ISAR aggregates for the total population
  - ISAR variables used for calculating aggregates are as detailed in Section ISARHP_ISARCEP
  - HSAR aggregates for comparison with CAS001 are derived from
    - HSARDataRecord.get_SEX()
      - For details on the HSAR SEX variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/sex/
    - HSARDataRecord.get_AGEH()
      - For details on the HSAR AGEH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/ageh/
  - HSAR aggregates for comparison with CAS002 are derived from
    - HSARDataRecord.get_SEX()
      - For details on the HSAR SEX variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/sex/
    - HSARDataRecord.get_AGEH()
      - For details on the HSAR AGEH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/ageh/
    - HSARDataRecord.get_MARSTAH()
      - For details on the HSAR MARSTAH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/marstah/
  - HSAR aggregates for comparison with CASKS008 variables are derived from
    - HSARDataRecord.get_HEALTH()
      - For details on the HSAR HEALTH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/health/
    - HSARDataRecord.get_LLTI()
      - For details on the HSAR LLTI variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/llti/
  - HSAR aggregates for comparison with CASKS09b and CASKS09c variables are derived from
    - HSARDataRecord.get_SEX()
      - For details on the HSAR SEX variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/sex/
    - HSARDataRecord.get_ECONACH()
      - For details on the HSAR ECONACH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/econach/
    - HSARDataRecord.get_AGEH()
      - For details on the HSAR AGEH variable see http://www.ccsr.ac.uk/sars/2001/hhold/variables/ageh/
- GA Parameters
  - These are the same as for the ISARHP_ISARCEP GA
  - Similarly, for the result submitted to ESDS the run was done in two parts
    - The first part was essentially an initialisation and used the following parameters
    - The second part was based on the initialised results and optimisation used the following parameters
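For this permutation, total population aggregates combine HSAR-derived counts for the HP with ISAR-derived counts for the CEP. A minimal sketch of that combination follows, assuming counts are held in maps keyed by a category label. The class name, method name and category labels are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: category labels and map-based storage are assumptions for illustration.
class TotalAggregateSketch {

    // Adds CEP counts (from ISAR records) to HP counts (from HSAR records), per category.
    // Categories present in only one input keep their single count.
    static Map<String, Long> total(Map<String, Long> hpFromHsar, Map<String, Long> cepFromIsar) {
        Map<String, Long> result = new TreeMap<>(hpFromHsar);
        for (Map.Entry<String, Long> entry : cepFromIsar.entrySet()) {
            result.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> hp = new TreeMap<>();
        hp.put("femalesAge30to59", 40L);
        hp.put("malesAge30to59", 38L);
        Map<String, Long> cep = new TreeMap<>();
        cep.put("malesAge30to59", 2L);
        System.out.println(total(hp, cep));
    }
}
```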
Results
- There are a considerable number of variables in the derived data and Census outputs which can be compared
- An initial analysis of results was based on
  - Graphs of the aggregated statistics at OA level with a summary linear regression
  - Geographical maps of the residuals
- A more in-depth analysis would investigate outliers and consider why the results had poor fit for those OAs
  - Such an in-depth analysis has not been done and is not presented here...
- Three types of results variables are distinguished
  - Control Constraints (CCs)
  - Optimisation Constraints (OCs)
  - Non-Constraints (NCs)
- For CCs, there should be a perfect correlation between aggregate model results and the CC variables derived from CAS tables
  - For both outputs, this was indeed found to be the case
- For OCs, a strong correlation between aggregate model results and the CAS OC variables is wanted/expected
  - It is shown that for some variables, there is a better fit than for others
- NCs represent variables that could have been used for constraining, but were not
  - Where there is a strong correlation, this indicates that the variable is correlated with the others that were used as CCs or OCs
- ISARHP_ISARCEP
  - To use these results a user requires a copy of the ISAR (which is reasonably easy to obtain for UK academics)
  - OC graphs
    - http://www.geog.leeds.ac.uk/people/a.turner/projects/MoSeS/documentation/demography/results/2001PopulationInitialisation/NSSE/ISARHP_ISARCEP_2_1000_2_2_2_2_0_1/UK/OptimisationConstraints/OA.xhtml2.0.html
  - ...
- HSARHP_ISARCEP
  - To use these results a user requires a copy of both the ISAR and HSAR
  - OC graphs
    - http://www.geog.leeds.ac.uk/people/a.turner/projects/MoSeS/documentation/demography/results/2001PopulationInitialisation/NSSE/HSARHP_ISARCEP_2_1000_2_2_2_2_0_1/UK/OptimisationConstraints/OA.xhtml2.0.html
  - ...
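The initial analysis described in this section compares aggregated model results with Census counts at OA level using a summary linear regression and maps of residuals. A minimal sketch of an ordinary least squares fit and its residuals follows. The class and method names are hypothetical, the arrays stand in for per-OA counts, and this is not the code used to produce the linked graphs.

```java
// Sketch only: ordinary least squares on per-OA (census count, model count) pairs.
class RegressionSketch {

    // Returns {slope, intercept} of the least squares line y = slope * x + intercept.
    // Assumes x and y have equal length and x is not constant.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    // Residual for each OA: observed y minus the value predicted by the fitted line.
    static double[] residuals(double[] x, double[] y, double[] line) {
        double[] result = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            result[i] = y[i] - (line[0] * x[i] + line[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] census = {1.0, 2.0, 3.0, 4.0};
        double[] model = {3.0, 5.0, 7.0, 9.0};
        double[] line = fit(census, model);
        System.out.println("slope=" + line[0] + " intercept=" + line[1]);
    }
}
```

For a Control Constraint the fitted line should have a slope of 1 and an intercept of 0 with all residuals zero; for Optimisation Constraints and Non-Constraints the residuals indicate which OAs fit poorly.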
Summary/Discussion/Conclusion
- Control Constraints (CCs) and Optimisation Constraints (OCs)
  - CCs reduce the solution space and provide interchangeable parts of solutions that can be more readily mixed and matched
    - What CCs can be used and what CCs cannot be used?
      - This is data specific in that solutions under some CCs may be found for some areas, but not others
    - With more and more complex CCs it becomes increasingly likely that for a given area, there are no viable solutions
    - Additionally, the more CCs there are, the more complex it becomes to swap a set of records with another set of records without breaking these CCs.
    - In general, given the nature of CAS data, it seems that optimisation is more important than constraining.
- In undertaking this work a greater understanding of the nature of the data and the task itself was developed
  - One of the most important things for anyone working with UK 2001 Census data to understand is how the data are aggregated and why they do not add up as might logically be expected
  - This exercise would not be necessary if the individual and household level Census data that is collected were made readily available for research
- Suggestions for further work
  - New results could be generated for different optimisation functions and these could be compared and contrasted and even developed from the outputs submitted to the archive
    - All the source code for generating the data is available as open source Java programs
  - Similar processing of other Census data using this approach might be attempted
    - Including other 2001 Census data
      - SAM data could be used
      - More controlled microdata might also be used
- This work is something of a waste of time and effort
  - There would be no need to re-integrate census data outputs to approximate census data if the data were made available to trusted individuals and groups for research purposes
References
- Birkin et al (2006a) Birkin M., Turner A.G.D., Wu B. (2006) A Synthetic Demographic Model of the UK Population: Methods, Progress and Problems. Paper presented at the Second International Conference on e-Social Science, UK, June (2006-06).
- Birkin et al (2006b) Birkin M., Turner A.G.D., Wu B. (2006) Proof of Concept for a Dynamic Simulation Model of the UK Population. Abstract submitted for the Second International Conference on e-Social Science, UK, June (2006-06).
- Birkin et al (2004) Birkin M, Dew P, Rees P, Chen H, Clarke M, Keen J, Xu J (2004) MoSeS Proposal Submitted to ESRC (with deletion of confidential information) [On-line] URL: http://www.geog.leeds.ac.uk/people/a.turner/projects/MoSeS/documentation/proposal/proposal.doc [Accessed 2011-03-03]
- Williamson P. (2003) Small-area synthetic population microdata [online] URL: http://pcwww.liv.ac.uk/~william/microdata/ [Accessed 2011-03-03]
- Williamson P., Birkin M., Rees P. (1998) The estimation of population microdata by using data from small area statistics and samples of anonymised records. Environment and Planning A, 30, 785-816.
- Census Data References
  - ONS 2001 Census Area Statistics Tables Web Page [On-line] URL: http://www.statistics.gov.uk/census2001/cas_table_outlines.asp [Accessed 2011-03-08]
  - census.ac.uk Web Site [On-line] URL: http://www.census.ac.uk/ [Accessed 2011-03-08]
  - MIMAS Census Dissemination Unit Web Page [On-line] URL: http://www.census.ac.uk/cdu/2001/ [Accessed 2011-03-08]
  - Office for National Statistics (2006a) 2001 United Kingdom Sample of Anonymised Records, Individual Licensed File distributed by the Cathie Marsh Centre for Census and Survey Research, University of Manchester
  - CCSR 2001 Census Individual Licensed SAR Web Page [On-line] URL: http://www.ccsr.ac.uk/sars/2001/indiv/ [Accessed 2011-03-08]
  - Office for National Statistics (2006b) 2001 United Kingdom Sample of Anonymised Records, Household Special Licensed File [computer file] distributed by the UK Data Archive, University of Essex
  - CCSR 2001 Census Special Licence Household SAR Web Page [On-line] URL: http://www.ccsr.ac.uk/sars/2001/hhold/ [Accessed 2011-03-08] For citation use: CCSR 2001 Census Special Licence Household SAR Web Page (http://www.ccsr.ac.uk/sars/2001/hhold/)
  - CCSR Sample of Anonymised Records Web Page [On-line] URL: http://www.ccsr.ac.uk/sars/2001/ [Accessed 2011-03-08]
  - ONS Samples of Census 2001 Anonymised Records Web Page [On-line] URL: http://www.statistics.gov.uk/census2001/cn_117.asp
  - ONS Census 2001 Web Page [On-line] URL: http://www.statistics.gov.uk/census2001/ [Accessed 2005-07-01] (Internet Archive URLs: http://waybackmachine.org/*/http://www.statistics.gov.uk/census2001/, http://replay.waybackmachine.org/20050401091030/http://www.statistics.gov.uk/census2001/)
- Miscellaneous References
  - Oracle Java Web Site [On-line] URL: http://www.oracle.com/technetwork/java/
  - MPJ Express Web Site [On-line] URL: http://mpj-express.org/ [Accessed 2011-03-08]
  - Java Code References
Acknowledgements
- This work was funded by the UK Economic and Social Research Council under the project code RES-149-25-0034
- Belinda Wu provided the Household Formation Routine Java code used in the production of preliminary results
- Thanks to
  - The MoSeS team and to colleagues at the National Centre for e-Social Science for your help and encouragement
  - Data providers and the Economic and Social Data Service for disseminating results
    - Details of this (and derived) publications are to be made known to SAR data providers under data usage agreement terms...
  - The UK National Grid Service for helping to process the results and to their funders for providing this invaluable resource
  - The University of Leeds and particularly the Centre for Computational Geography and School of Geography for your help and support
- Copyright: Census output is Crown copyright and is reproduced with the permission of the Controller of HMSO and the Queen's Printer for Scotland
