Marcus Blake & Stan Openshaw.
School of Geography, Leeds University, Leeds. LS2 9JT
A new 1991 census based research classification has been developed under the aegis of an ESRC research project. The new system is called GB-Profiles '91 and has been developed solely for academic purposes. There is no relationship whatsoever with the Super Profiles system of CDMS. GB-Profiles '91 uses the best available computer methods run on the best available computer hardware. It classifies the 150,000 (130,000) smallest areas in Britain for which 1991 census data are available (EDs in England & Wales, OAs in Scotland) into a relatively small number of distinctive residential area types based on an assessment of their multivariate census data profile. It is expected to be of considerable research value, particularly when linked to unit postcodes. In building such a system there are a number of key design issues, particularly the choice of:
1. variables,
2. classification method,
3. result resolution,
4. and incorporation of prior knowledge.
The choice of variables is important because the results are to a large extent determined by it. Sadly, in many previous studies there is no clear explanation or audit trail as to how the variables were chosen. An exception is the 1981 Super Profiles system; see Charlton et al (1985). The choice of variables has to reflect the purpose of the exercise. It is tedious, and it is hard because sensible indicator variables have to be constructed with care from a set of almost 10,000 available census counts for each small area.
Here the small areas used are the Census Enumeration Districts (EDs), and the source of data is the Small Area Statistics or SAS. The SAS is "a predefined set of cross tabulations of two or more census variables which are made available by the Census Offices in a machine readable format for a wide range of different areal units throughout the whole of Great Britain" (Cole, 1993, p. 201).
This statement makes two important points. First, the census covers the whole of Great Britain. Every person must, by law, complete the census questionnaire, so coverage is high. The Census Validation Survey (CVS) checks indicated that approximately 299,000 people (0.6% of the enumerated population) were missed by the 1991 census, but this is itself estimated to be only a third of the actual under-enumeration (Wriggins, 1993). Even with an estimated 1 million people missing, the census still represents a broad and accurate data set.
The second point is that it is available for a wide variety of different spatial units, including the enumeration districts (EDs and OAs) at which the census was taken. The SAS is the only Census data set that allows data to be output at this low level of resolution, as confidentiality is maintained by two processes. First, by restricting output to those EDs which have more than 50 persons or 25 households, and second by "blurring" the results (adding +1, 0 or -1 to the census counts in a quasi-random manner). Although these confidentiality enhancing measures do allow data to be released for small areas, they cause severe methodological problems that have to be addressed when classifying census data.
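The effect of blurring on small counts can be illustrated with a short sketch. It is not the Census Offices' actual procedure, which is not described here: the uniform choice between -1, 0 and +1, the clamping at zero, and the function names `blur` and `max_relative_error` are all assumptions made for illustration.

```python
import random


def blur(count, rng):
    """Add -1, 0 or +1 to a count (assumed uniform), never going below zero."""
    return max(0, count + rng.choice((-1, 0, 1)))


def max_relative_error(count):
    """Largest relative change that a +/-1 perturbation can make to a count."""
    return 1.0 / count if count > 0 else float("inf")


rng = random.Random(1991)  # fixed seed so the sketch is repeatable
for true_count in (2, 20, 200):
    blurred = blur(true_count, rng)
    worst = 100 * max_relative_error(true_count)
    print(true_count, blurred, f"worst case {worst:.1f}%")
```

A perturbation of one person is negligible for a count of 200 but can change a count of 2 by half, which is why rates built from small ED counts need careful treatment.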
The research objective of the census classification project sponsored by the ESRC is to provide a census data representative small area classification of Britain's residential areas. The key phrase here is "census data representative". The aim is not solely to define areas of urban or rural deprivation or areas of affluence; or areas dominated by this or that socio-economic mix that happens to be of narrow or specialist attention and importance. There are, of course, valid objectives for special purpose classifications of Britain, but the objective here is to seek a more general (broad based) census data descriptive representation. It is thought that this will appeal to the majority of the potential prospective social science users and is in the best traditions of the original geodemographic systems. It is also likely that this goal will provide maximum added-value to the 1991 database.
In practice, this means that a broadly representative set of census variable based indicators needs to be created. It follows, therefore, that in principle the importance of each general type of variable should broadly reflect the nature and content of the original census questionnaire rather than the 100 or so tables of cross-tabulated counts; the latter are a reflection of the perceived user needs for the census outputs. This desire for coverage and representativeness of the census questionnaire is, in practice, modified by an assessment of the sorts of variables previously and typically used, and by the thought that data redundancy may be induced or removed later via statistical means. A consideration of the indicators used by previous classifications is also useful because it represents a not insignificant transfer of intellectual and conceptual thinking that may well be worth recycling. Indeed, the lineage of some census indicator variables can be traced back to the 1966 census!
The strategy here, then, is to review the variables used by others and then supplement and modify this list by reference to the principles of coverage and representativeness that are considered so important.
Table 1: Major players in the British Geodemographic Industry
Company              Associated Products
CACI (+ PinPoint)    ACORN
CDMS                 Super profiles
CCN                  MOSAIC
Infolink             DEFINE
Equifax Europe       IMAGES
Euro Direct          Neighbours & PROSPECTS
Table 2 provides a list of the salient features of both the 1981 and 1991 systems. They can be differentiated by the number and type of data sources they use, the spatial unit they are based on, whether Principal Components Analysis is used to reduce redundancy within the variables, and the number of groups used to classify the data. It is interesting that a number of the systems claim to be based on unit postcodes, whereas the 1991 census only reported population and household counts at the ED level. This mixture of census ED and unit postcode data (units which are typically one tenth the size of EDs) causes all manner of problems and it is by no means clear whether the resulting classifications are actually any improvement over a purely ED based one, although they may often be perceived to be better by the end users.
Table 2: Description of the Major Geodemographic Systems
Classification          Supplier                 Sources of variables     # of variables       Area unit  PCA used  # of groups
1981 SYSTEMS
ACORN                   CACI                     Census81 (only)          41                   ED         no        11, 38
DEFINE                  Infolink                 Census81                 67                   Postcode   yes       10; 47; 423
                                                 credit data
                                                 electoral roll
                                                 PAF data
MOSAIC                  CCN                      Census81                 38 4 11 link only 1  Postcode   no        12; 38; 57
                                                 credit activity data
                                                 electoral roll
                                                 PAF data
                                                 CCJ data
PiN                     CACI                     Census81 (only)          104                  ED         yes       12; 25; 60
Superprofiles           CDMS                     Census81                 55 + 10 (affluence)  ED         yes       11 Lifestyle grps; 37 Groups; 150 Clusters
                                                 TGI data (small scale)   25?
1991 SYSTEMS
ACORN91                 CACI                     Census91 (only)          79                   ED         yes       6 Categories; 17 groups; 54 Types
DEFINE91                Infolink                 Census91                                                 yes
                                                 credit activity data
                                                 electoral roll
                                                 unemployment stats
IMAGES                  Equifax Europe (UK) Ltd  Census91                                      Postcode
                                                 NDL data
MOSAIC91                CCN                      Census91                 87                   Postcode   No        11; 62
                                                 credit activity data
                                                 electoral roll
                                                 PAF data
                                                 CCJ data
                                                 retail access data
Neighbours & PROSPECTS  Euro Direct Database     Census81 +91                                  ED?        ?
                        Marketing Ltd
PiN91                   CACI                     Census91 (only)          49                   ED         No        6, 17, 42
Source: P. Sleight, 1993, Target Market Consultancy
As more and more data are collected in computer readable form, the number of different data sets available to marketing companies increases. Table 2 lists the varied sources used by each of the systems. They range from systems that restrict themselves solely to census variables (ACORN) to systems where the census forms only a minority of the total data used (MOSAIC). The most commonly used non-census data sets are the electoral roll, credit activity data, county court judgments and data associated with retail activity obtained either from surveys or lifestyle databases.
None of these extra-census data sets are currently available to academics so the use of similar variables is not an option considered here. However, it might be noted that mixing different data sources with different levels of spatial resolution and sampling characteristics is not necessarily or automatically an advantage, whilst it is a further source of methodological problems.
Also, it is recognized by most observers (Sleight, 1993; Openshaw, 1993) that there is now a general trend within the industry for systems to become more specific. Companies now produce a range of products to cover the wide range of situations where these systems can be applied. Such focusing is not relevant here because we seek to develop a general purpose census classification as a descriptive summary (or surrogate) of the multivariate complexity of the 1991 census.
Each of the Census agencies listed in Table 1 was contacted and asked for information on the system concerned; specifically, each was asked for a list of the census variables used in their system. Because the present competitive situation has increased the commercial value of this information, only one company (Infolink) was prepared to provide it. Unfortunately, the Infolink variables appear to be based on the total counts from each of the 88 tables in the SAS, which restricts their utility. A list of the census variables used in the CCN system MOSAIC was acquired (see Appendix A) and this formed the initial basis of the selected variables.
Useful comparisons were made with the census variables used by the 1981 Super Profiles and ACORN systems (see Appendices B & C). The other agencies would only provide general information and the relevant brochures. A list of the variables
Further, some of these brochures provide enough detail to derive the general census topics involved in each of the geodemographic systems. For example, the CACI ACORN brochure provides short descriptions of each of the 54 ACORN types. Type 10.32 is described as "Home Owning areas with Skilled Workers". To produce such a grouping CACI would have had to have used Census data on Tenure (more specifically Table S20, households owned outright) and data on social class. Indeed, this illustrates another useful principle. The variables defined here as of interest should be sufficient to identify any of the labeled area types described in the various other commercial systems.
There are also situations where the choice of variable is hard to discern. For example, it is unclear how CACI establish home owning areas (Type 9.28) or what data they use to measure affluence (several adjectives are used in the descriptions: Wealthy, Affluent, Well-off and Prosperous). In this situation no attempt was made to guess; variables derived from these descriptions were only selected when it was obvious from the pen picture that they were included in the classification.
From these three sources (the 1991 and 1981 lists and the analysis of pen descriptions of some of the better documented systems) an in-depth knowledge of the various selections of census variables was acquired. This knowledge was used in selecting the specific variables from the large number available.
The statistics provided by the SAS are those that the OPCS perceived to be needed after an extensive consultation process. For example, there is a strong emphasis on dependency within the 1991 census (lone parents, number of dependent children etc.) because of the recent moves by the government to review the structure of the Welfare State.
The content of the 1991 census forms remains fairly similar to that of 1981. Table 3 provides a list of the questions asked by the 1991 census questionnaire. There were five major changes from the 1981 census. These were additional questions on ethnic origin, long-term illness, the existence of central heating, the term-time address of students, and on the number of hours worked in the previous week.
Table 3: 1991 Census Questions
Questions on Households
  type of accommodation
  extent of sharing
  tenure
  number of rooms
  availability of bath & WC
  central heating
  number of cars & vans
  lowest floor level of accommodation (Scotland only)
Questions on the individual
  sex
  date of birth
  marital status
  relationship in household
  ethnic group
  whereabouts on Census night
  usual address
  term-time address (for students)
  usual address one year ago
  country of birth
  long-term illness
  whether working in week before Census
  hours worked weekly
  occupation
  industry
  address of place of work
  means of travel to work
  higher qualifications
  Scottish Gaelic (Scotland only)
  Welsh language (Wales only)
These topics have been regrouped under the eight headings listed below.
Table 4: Variable Selection Headings
Selection Groups        Census Topics
Demographic             age; sex; marital status
Ethnic                  birth place; nationality; ethnic group
Housing                 usual residence; housing (number); rooms (number); tenure; household amenities; availability of cars & vans
Household Composition   a combination of most of the other individual and household census topics aggregated to the level of the household, e.g. couple households with dependent child(ren) and no car
Socio-economic          economic position; occupation; place of work; industry; qualifications
Migration               migration
Health                  limiting long term illness
Travel-to-work          journey to work
These eight headings form the framework that structures the selection of potential variables, ensuring that all the important areas of the census are included in the final classification.
In considering suitable variables for a new general residential classification it is important not only to know which variables others have used, but also what the SAS counts represent. The OPCS provide a detailed explanation of the definitions and classifications used to aggregate the census returns into counts and tables (OPCS, 1992).
Equally important with all Census statistics is to remember which population the counts are drawn from. This is especially true of the SAS because the population being counted can vary both between and within tables (Residents in Households, Students and schoolchildren aged 5 and over, or Persons aged 60 and over with limiting long-term illness). The majority of the tables are based either on the usual resident population or on the number of households. The full list of denominators used to create the rates is given in Appendix C.
This section describes and explains the selected variables. They are grouped under each of the eight topic headings listed above. A list of the associated SAS reference codes is provided in Appendix C.
The SAS (Table S02) breaks down the usual resident population into 5 year age groups and provides counts for the total, single/widowed/divorced, and married men and women. So which variables would best represent this mass of data? In general commercial systems aggregate the age groups into five or six unequal groups. Here, seven different generations are identified ranging from infants to the aged.
The population base for these variables is the usually resident population: 1991 base (which is referred to as Residents). This is the most common base used within the SAS and is a count of...
"...all the persons recorded as resident in households in an area, even if they were present elsewhere on census night, plus residents in communal establishments who were present in the establishment on Census night." (OPCS, 1992, p. 7)
1  0 - 4     infants
2  5 - 14    children
3  15 - 24   young adults
4  25 - 44   adults
5  45 - 64   middle aged
6  65 - 74   recently retired
7  75 - 84   elderly
8  85+       the aged
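The aggregation of five-year age counts into the age groups listed above, expressed as percentages of Residents, can be sketched as follows. This is a minimal illustration only: it assumes the input is a mapping from the lower age of each five-year band (0, 5, 10, ... with 85 standing for 85+) to a count, which simplifies the real layout of Table S02, and the names `AGE_GROUPS`, `group_for_band` and `age_percentages` are hypothetical.

```python
# (lowest age, highest age or None for open-ended, label) for each group.
AGE_GROUPS = [
    (0, 4, "infants"),
    (5, 14, "children"),
    (15, 24, "young adults"),
    (25, 44, "adults"),
    (45, 64, "middle aged"),
    (65, 74, "recently retired"),
    (75, 84, "elderly"),
    (85, None, "the aged"),
]


def group_for_band(lower_age):
    """Return the group label for the five-year band starting at lower_age."""
    for low, high, label in AGE_GROUPS:
        if lower_age >= low and (high is None or lower_age <= high):
            return label
    raise ValueError(f"no age group for band starting at {lower_age}")


def age_percentages(band_counts):
    """Convert five-year band counts into percentages of all residents."""
    total = sum(band_counts.values())
    percentages = {label: 0.0 for _, _, label in AGE_GROUPS}
    if total == 0:
        return percentages  # empty area: avoid dividing by zero
    for lower_age, count in band_counts.items():
        percentages[group_for_band(lower_age)] += 100.0 * count / total
    return percentages


example = {0: 10, 5: 10, 10: 10, 15: 20, 25: 50}
print(age_percentages(example)["children"])  # 20.0
```

Because every group boundary falls on a multiple of five, each five-year band belongs to exactly one group and no count has to be split.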
The increase in the numbers of students, working women and lone parents makes these groups particularly important to differentiate within a 1990's classification.
9 Married population
10 Single population (S02)
From the analysis of the commercial systems four other social groups are generally identified as being important: pensioners, working women, lone parents and students. These groups have grown significantly in the last decade and it will become increasingly necessary to differentiate them from the general population.
Pensioners also form an increasingly significant proportion of the population. The OPCS define pensionable age as the minimum age at which a person may receive a national insurance retirement pension i.e. 60 for women and 65 for men.
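The pensionable age rule can be written as a simple predicate. The sketch below applies it to individual (age, sex) records purely to make the definition concrete; the SAS itself supplies counts rather than individual records, and the "F"/"M" coding and the function names are assumptions for illustration.

```python
def is_pensionable(age, sex):
    """True if a person has reached pensionable age: 60 for women, 65 for men."""
    if sex == "F":
        threshold = 60
    elif sex == "M":
        threshold = 65
    else:
        raise ValueError(f"unknown sex code: {sex!r}")
    return age >= threshold


def percent_pensionable(residents):
    """Percentage of (age, sex) records that are of pensionable age."""
    if not residents:
        return 0.0
    pensioners = sum(1 for age, sex in residents if is_pensionable(age, sex))
    return 100.0 * pensioners / len(residents)


sample = [(59, "F"), (60, "F"), (64, "M"), (65, "M")]
print(percent_pensionable(sample))  # 50.0
```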
11 Resident persons who are of pensionable age
12 Working women
13 Total 'Lone' Parents
14 Students 16+
15 Total Imputed Residents (S19)
16 White
17 Black
18 Indian
19 Pakistani
20 Bangladeshi
21 Chinese + Other
Ethnicity forms an important variable in all the commercial systems; there are many clusters characterized by having a large multi-ethnic component. It is important to be able to distinguish between these clusters by also including associated variables.
The SAS provides a large number of different variables associated with Ethnic Groups. One of the main topics is the tenure associated with these different groups. Although many multi-ethnic areas do tend to be associated with poorer areas, there is a danger of labeling all these areas as less well-off when a significant proportion are not. These 18 variables provide the detail required to differentiate between these areas. Tenure is used to distinguish between financially stable and financially stressed multi-ethnic regions.
                               Owner   Council
Black (grps)                    22       23
Indian, Pakistani, Ban'deshi    24       25
Chinese & others                26       27
A migrant household is defined as a household whose head is a migrant (the head of household is the first usually resident adult mentioned on the census form). A wholly moving household is a household whose resident members aged one year and over were migrants with the same postcode of usual residence one year before the census.
Only the net result of any moves is recorded so if a person returned to an address after moving within the year he or she would not be recorded as a migrant. Similarly any moves within the year are not recorded.
Different types of move can be distinguished depending on which boundaries are crossed (ward, district, county, standard regions and country). It should be borne in mind that the census includes internal migrants and immigrants to Great Britain, but not of course emigrants from Great Britain who are not enumerated.
Here two migration variables were used: the total number of resident migrants and the number of resident migrants who were also pensioners. The former allows areas with a large number of new residents to be distinguished, for example new housing estates. The latter identifies areas which attract older residents who have usually recently retired, for example coastal retirement areas and retirement homes.
28 Total migrants
29 Pensioner migrants
Tenure is used by all the commercial classifications in one form or another. Superprofiles includes five variables and ACORN four. Both of these 1981 based systems distinguish between furnished and unfurnished flats. Here more emphasis is placed on home owners because of the increase in their numbers during the 1980's and these two tenures are aggregated under the privately rented category.
Generally the rented category, and especially accommodation rented from Local Authorities etc., has been used by researchers as a measure of lack of resources and residential insecurity. In contrast, because of the financial commitment required to purchase a house, house ownership is seen as a surrogate for long term financial stability.
The OPCS classify tenure into 8 categories; here these are aggregated into six percentage variables.
30 Owned Outright
31 Mortgaged
32 Private Rented
33 Rented from HA, LA, NT
34 detached
35 semi-detached
36 terraced
37 flats
38 bedsits
Superprofiles and ACORN include variables on households without WCs. Today this is generally agreed to be a universal amenity and is therefore dropped in favour of the central heating variable. But, as a measure of deprivation, one problem with this variable is that while a large proportion of households have central heating, many households cannot afford to run it (even more so now there is VAT on fuel).
39 No central heating
40 Lacking bath and shower
41 No car
42 2+ cars
The number of households which suffer from overcrowding is only 109,000, 0.5% of the total housing stock; therefore the problems caused by the ecological fallacy are likewise increased.
43 More than 1.5 ppr
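The overcrowding measure (more than 1.5 persons per room, abbreviated ppr) can be made concrete with a short sketch. It works on hypothetical (persons, rooms) pairs for individual households, whereas the SAS provides ready-made counts; the function names are assumptions for illustration.

```python
def persons_per_room(persons, rooms):
    """Persons per room for one household; rooms must be positive."""
    if rooms <= 0:
        raise ValueError("a household must have at least one room")
    return persons / rooms


def is_overcrowded(persons, rooms, threshold=1.5):
    """True if the household has more than `threshold` persons per room."""
    return persons_per_room(persons, rooms) > threshold


def percent_overcrowded(households):
    """Percentage of (persons, rooms) households that are overcrowded."""
    if not households:
        return 0.0
    crowded = sum(1 for persons, rooms in households if is_overcrowded(persons, rooms))
    return 100.0 * crowded / len(households)


sample = [(4, 2), (3, 2), (2, 4), (1, 1)]
print(percent_overcrowded(sample))  # 25.0
```

Note that a household at exactly 1.5 persons per room, such as three people in two rooms, is not counted, because the variable is defined as more than 1.5.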
44 Households with 7+ rooms
45 couple hhld, aged 16-24 without child(ren)
46 couple hhld, aged 16-24 with child(ren)
47 couple hhld, aged 25-34 without child(ren)
48 couple hhld, aged 25-34 with child(ren)
49 couple hhld, aged 35-54 without child(ren)
50 couple hhld, aged 35-54 with child(ren)
51 couple hhld, aged 55-75 plus
From this mass of data in Table S87 the following variables have been picked; again tenure is used as a surrogate for financial stability.
                                      Owner   Council
No Family Household                    52       53
1 Couple Households (no children)      54       55
1 Couple Households (with children)    56       57
2+ Family Households                   58       59
60 Households with dependents
61 Economically active
62 Self-employed
63 Unemployment
The limitations with these figures are that for certain areas of Great Britain the figures may be substantially out of date because of the extent to which unemployment has increased since April 1991.
Also, a change in the level of unemployment in an area may be more related to the local economy than to the quality of the residential neighbourhood. The late 1980's recession has affected many low unemployment areas, thereby reducing the value of unemployment as an indicator of residential characteristics.
64 Agric./Forestry/Fishing
65 Energy & Water
66 Manufacturing
67 Construction
68 Distribution & Catering
69 Transport
70 Banking & Finance
71 Professional                        1, 2, 3, 4
72 Intermediate & Junior non-manual    5, 6
73 Manual workers                      8, 7, 9, 10, 11, 12
74 Farmers & agricultural workers      13, 14, 15
75 Armed Forces & Other                16, 17
It is used here as a measure of education, which is also associated with higher earnings, better levels of health and in general a higher standard of living.
76 Workers with higher degrees
77 Workers with other qualifications
This may well be effective in picking up broad regional differences in the general level of well-being across Britain. Its usefulness as a local residential area discriminator is as yet unexplored.
Those residents that were economically inactive because of long-term illness were also included. This population is dependent on social and health care services and/or families for their way of life.
78 Total persons with LLI (S12)
79 Train Bus
80 Car
81 Work at home
V82 Medical & Care Ests.
V83 Detention centres & Defence Est.
V84 Education Ests.
V85 Hotels & Other Ests.
It is most important to be clear about the purpose of the exercise. The choice of variables and their specification has to reflect the explicit purpose. It would seem that many previous classifications have, at best, been "purpose vague". It is obvious, but important nonetheless to be quite frank. Different variables will almost certainly produce different results, and whatever purpose is reflected, this will probably map onto the available set of 10,000 census variables in many, many, different ways producing many possible different classifications.
The aim of this classification project has been to be "census data representative". The aim is not solely to define areas of urban or rural deprivation or areas of affluence; or areas dominated by this or that socio-economic mix that happens to be of narrow or specialist attention and importance. There are, of course, valid special purpose objectives for a classification of Britain's residential areas, but the objective here is to seek a more general (broad based) census data descriptive representation. It is thought that this will appeal to the majority of the potential prospective social science users and is in the best traditions of the original geodemographic systems. It is also likely that this goal will provide maximum added-value to the 1991 database.
The variables defined here essentially reflect the past experience of the researchers, those used by other commercial organizations, and the desire for coverage of the census topics. Table 6 summarizes what has been achieved. It is inevitable that coverage is uneven and contains possibly high levels of data redundancy. What is done to either reduce or remove or retain redundancy is a subject for a separate study. The object here was to reduce almost 10,000 potential variables to a much more manageable number for subsequent analysis and classification.