Geographical data mining: key design issues

Stan Openshaw, Centre for Computational Geography, School of Geography, University of Leeds, Leeds LS2 9JT United Kingdom
Email: stan@geog.leeds.ac.uk

1. Introduction

The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data driven inductive approaches to geographical analysis and modelling to facilitate the creation of new knowledge and aid the processes of scientific discovery. Indeed a number of Data Mining tools now exist that are designed to assist the process of exploring large amounts of data in search of recurrent patterns and relationships; for example Intelligent Data Miner, MineSet, Clementine, or STATISTICA etc. These packages have been developed mainly for the purpose of analysing very large commercial databases in order to model and predict customer-buying behaviour. This emphasis on prediction may well limit their usefulness in GIS where spatial pattern recognition may be a more usual activity. Nevertheless, in a GIS context there are strong arguments to be made in favour of the development and application of Data Mining tools in order to exploit the flood of geoinformation.

Many data miners appear to believe that their tools will work with any and all data regardless of its subject origins. Maybe this merely perpetuates a myth that data mining is universally useful. In fact a cynic might well argue that one of the main purposes of data mining is to help users with data warehouses to believe that their investments are worthwhile in the short-term while they try and identify how best to exploit it in the longer and medium term. It is very dangerous to overlook the features of geographical data that make it special. Geographical Data Mining (GDM) is to be regarded as a special type of data mining that seeks to perform similar generic functions as conventional data mining tools, but modified to take into account the special features of geoinformation, the rather different styles and needs of analysis and modelling relevant to the world of GIS, and the peculiar nature of geographical explanation. The focus here is on developing an explicitly Geographical Data Mining technology.

2. Understanding the geographical analysis task

2.1 Why geographical data mining?

The special features of geographical data are briefly summarised in Table 1.

Table 1 Special features of geoinformation

observations are not independent
data uncertainty and errors are often spatial structured
whole map statistics are seldom helpful
non-stationarity is to be expected
relationships are often geographical localised - rather than global
non-linearity is the norm
data distributions are non-normal
high levels of multivariateness but with redundancy
time often interacts with space
most GIS data layers are categorical
the locational element is important
the modifiable nature of all spatially aggregated data
results reflect definitional dependencies
there can be a fair proportion of junk data

It can be argued that conventional data mining tools are still useful, in the same way that conventional statistical methods can be applied to spatial data even if they cannot always be used safely to analyse or model it. Indeed, there is no doubting that many conventional statistical tools continue to be useful and that they have already well served quantitative geography for more than 3 decades; however, the view here is that many of these conventional, general purpose, statistical techniques are not sufficiently focused on, and tailored to, the special needs of geographical analysis. This same criticism applies to Data Mining tools.

There are some geoinformational data mining tasks that may be usefully performed by conventional data mining software. Table 2 outlines the range of tools that most data mining packages offer and many of these methods could be usefully applied to spatial data.

Table 2 Typical generic data mining functions

exploratory data analysis tools
linked map and graph displays
other visual data mining tools
most multivariate statistical methods
linear and logistic regression
classification
decision trees and regression trees
association rules detection
neural networks
memory based reasoning
sequence discovery

For example, data reduction tools, such as multivariate classification, can be useful as a means of summarising the essential features of large spatial data sets; for instance, to create geodemographics classifications. Similarly, modelling tools such as neural networks and decision trees can be readily applied to some geographic problems. It can be argued that while these methods ignore all of the special features outlined in Table 1, they still "work" to some degree but there are many exploratory geographical analysis types of data mining tasks that they cannot perform.

It is important to understand that if you use conventional data mining tools, then you are implicitly forced to accept the key assumption that geodata is the same as any other data and there is nothing special about geographical information or geographical analysis. If you input some X, Y referenced data into a data mining package and expect it to identify localised clusters of excess incidence of a disease, then you would probably be disappointed. These packages could only treat the X,Y co-ordinates as if they were merely two ordinary variables (such as age or income) and it is very likely that nothing useful would be achieved. There is no mechanism for handling location or spatial aggregation or for coping with spatial entities or spatial concepts such as a region or circle or distance or map topology. Conventional data mining tools may be very powerful but they also continue the geographical neglect inherent in conventional statistical methods.

The view expressed here is that what is needed are new types of data mining tools that can handle the special nature of spatial information, and also capture the spirit and essence of geographicalness that a GIS-minded data miner would expect to have available. There is one further problem that needs to be dealt with. If you equate geographical data mining with exploratory spatial analysis maybe some will be misled into believing that this problem has already been solved. This overlooks the massive difference between exploring a manageable small data set with few variables and the need to perform the same process on massive databases (with two or three orders of magnitude more cases) and possibly high levels of multivariate complexity. Human being-based graphical explorations of spatial data does not scale. It is a good approach for re-working quantitative geography problems from the 1970s, but cannot cope with massive data volumes or high levels of multivariate complexity. It is not a particularly GIS-relevant technology. Visualisation is a useful tool, but only for studying complex problems characterised by low dimensionality; anything else is probably beyond the reach of manual-based exploratory styles of analysis.

A related problem is the small number of skilled analysts compared with the potentially massive number of GIS databases that could be usefully analysed. Data mining tools are not much help because none of them have been designed to be a push-button act or to offer any automated analysis capabilities. Indeed they are closely associated with a graphics-based exploratory statistical analysis paradigm with its emphasis on visual programming and the visualisation of 3-D images as a guide to pattern recognition. It is suggested that maybe this expert-based visualisation process is too limiting to adequately handle the problems of geographical data mining, although the proof of this statement may need to be tested by empirical experiment.

The best long-term solution is to ignore more or less everything that conventional data mining tools do and to start from scratch to identify what it is they need to do in order to be useful in a geographical context, and if they are to justify the adjective "geographical." Once there is a clear idea of what the needs are then maybe the required methods can be developed from the available statistical and computational intelligence technologies. This paper focuses on developing an understanding and outlines a potential solution to these design issues.

2.2 Geographical analysis machines

In terms of developing a GIS-relevant GDM approach, there are two very different strategies that can be adopted:

a human being based exploratory graphical methods approach, and
the development of automated analysis machines.

The latter is arguably most relevant to GDM because it seeks to automate the search and analysis task concentrating on those aspects that are least well handled at present. The idea of developing an automated geography can be traced to Dobson (1983); however, the first prototype of an automated Geographical Analysis Machines (GAM/1) was developed in the mid 1980s; see Openshaw et al., (1987). The idea was very simple. If you have a large point-referenced database and are interested in searching for localised clustering but you have no pre-defined hypotheses to narrow down the spatial search (or you have exhausted all the available a priori hypotheses), then why not look everywhere and examine all locations for evidence of clustering over a wide range of spatial scales? GAM/1 was built by applying a simple circle-based test for localised clustering to all possible locations on the map. It has some claim to being the original geographical data miner.

The justification for automation was as follows:

there are many possible locations of potential localised clusters (many millions),
the search should be locationally unbiased and geographically comprehensive, and
it should handle spatial data uncertainty.

Subsequently, the explosion of GIS databases have added:

the need to speedily analyse in hours, rather than months or years, geographical data sets,
an increasing need to perform routine analysis for monitoring purposes perhaps in real-time,
there are many databases that are available for analysis and few skilled spatial analysts,
there are an increasing number of data sets that have a high imperative for geographical analysis once they exist,
as computer hardware has become faster, so analysis becomes easier and user expectations are increasing, and
automation makes it easier to develop user-friendly interfaces so that even the novice has some prospect of being able to perform some useful geographical analysis on the same day the work started.

In the late 1980s, GAM/1 was criticised on the grounds that it was a database trawler and a data mining tool. A decade later this intended insult has turned into a complement. Subsequent empirical evaluation showed that the current GAMK version is a very powerful cluster detector that in blind testing on rare disease data worked better than any of the alternative statistical methods; see Alexander and Boyle (1996)\; however, it is only a descriptive tool. It suggests where you should look for localised clustering in geographic space but it cannot prove that the apparent clusters it finds are real. There is no test of overall statistical significance although informal methods are provided in the web version of GAM that can help the end-user develop a better understanding of what the results may mean. Arguably, in many practical applications, GAM is in fact all you may need to meet many exploratory needs. It provides a good starting point for further investigation and maybe even action.

Of course GAM is only a tool (as indeed all Data Mining methods are) that will assist the end-user explore their data. Unlike most existing data mining tools with their emphasis on visual interfaces and skilled users, GAM is much more like a one-stop shop. It is highly automated and focuses on that part of the process that the human end-user is least well able to perform (the search process) but it then leaves the end-user to do what the computer cannot (decide whether the results are meaningful or just spurious junk). Unfortunately, GAM uses a brute force search that is not readily scalable to higher dimensional data search spaces. A space-time version has recently been developed (GAMK-T) but this is almost certainly the last stage in the development of GAM because of the potentially massive increase in computer times if any more search dimensions are added. As teraflop supercomputers become more easily available in the next few years, then brute force exploratory geographical analysis approaches, which are massively parallelisable, may well become a feasible strategy. Right now the available supercomputers are not sufficiently fast, and alternative approaches based on smart search methods are the best current strategy.

The space-time GAMK-T method avoids most of the computational problems by clever programming algorithms. The incorporation of time into exploratory geographical analysis is extremely useful, but it immediately causes other problems. In particular, there is very little practical experience of space-time geographical analysis because until recently there were little data and very few statistical methods have been developed to measure or detect space-time patterning. It is a new application area particularly for GIS, and it also raises some potential difficulties. For instance, if you have geodata for 1,000 time intervals, then there are many different ways of aggregating time. Many of the permutations of time period may yield a geographical data set that, if analysed, could contain patterns if you were able to perform the analysis and identify the best time permutations. There also are instantly a few orders of magnitude more results for you to examine.

2.3 The emergence of Geographical Explanations Machines

Other automated geographical analysis machines have been developed to explore different geographical data dimensions. Openshaw et al., (1990) described the development of a Geographical Correlates Exploration Machine (GCEM). This is an automation of a traditional map-based GIS analysis tool; that of map overlay. It has long been found useful to overlay two coverages and then map the data for the zones created by the union of the two maps. This procedure can and has been applied to up to five coverages. The hope is that some of the polygons created by the map overlay process will define "interesting" results (i.e., unusually high- or low-valued polygons) worthy of further investigation. Here the principal search problem is that of deciding which sets of maps to overlay. GCEM automates this search process although, because of the curse of dimensionality (there are 2^M-1 permutations of M maps) it can only realistically handle relatively modest numbers of coverages albeit far more than can be performed by purely manual means.

Recently the GCEM idea has been re-developed as a special case of GAMK; see Openshaw (1998). Instead of overlaying complete maps, the data inside each of the GAM search circles are examined using permutations of coverages for subsets of homogenous cases that yield extremely unusual results. This new method is termed the Geographical Exploration Machine (GEM). It is a fully automated form of analysis, that is easy to use, and is another form of a geographical data mining tool. The data mining for localised clusters is now taking place in a geographic space with multivariate map attributes. The output is similar to that produced by GAMK except that it is far more voluminous since each map showing a surface of evidence is indexed by a unique map coverage permutation. It also requires extensive compute times for large data sets and this may well require the use of high-performance computers for a while until workstations become faster and can cope. Fortunately, the code can be readily parallelised for message passing with an adjustable level of granularity to ensure efficient parallel performance; see Openshaw and Turton (1999a).

2.4 GIS data characteristics

Unfortunately neither GAMK nor GAMK-T nor GEM are sufficient to handle the full range of potential GDM needs. The problem is that GDM really needs a fusion of all three tools. Openshaw (1994,5) pointed out that GIS databases consist of three broad classes of data types that form a complex multivariate tri-space and that it is these spaces, and the interactions between them, that form the hyperspace that GDM needs to be able to search. These spaces are composed of:

1. Geographical coordinates,

2. Temporal coordinate(s), and

3. Multiple attributes relating to the geographical entities.

The other important characteristic to note here is that all three 'spaces' may (sometimes or always) interact with each other to further complicate the task. Finally, note that all three types of space are measured in different units that cannot be directly related to each other. For example, spatial co-ordinates measured in degrees latitude and longitude or in metres cannot be converted into time co-ordinates measured in years, days, hours, or seconds. Additionally, neither of these positional co-ordinates in space-time can be directly related to any of the other multivariate descriptions about the entities in the database; for example, percentage unemployment or geology type, or height above sea level. Finally the form of measurement also may vary. Spatial and temporal coordinates may well be real continuous numbers albeit bounded at fixed extremal values; however, the other multivariate attributes can be expressed in all possible forms of measurement; continuous, integer, real, ordered multistates, unordered multistates, and (maybe soon) fuzzy qualitative. A really useful GDM has to be able to handle all forms of data and measurement scales that characterise geodata as well as cope with the special features of spatial data previously noted in Table 1. In particular, some regard needs to be given to handle data uncertainties that (unlike non-spatial data) are not easily characterised as sampling error but are considerably more complex.

3. What should a Geographical Data Miner tool be able to do?

3.1 Basic functionality

The end-user requirements from a successful GDM application would almost certainly include some but not necessarily all of the following simultaneously:

Serve basic spatial data exploratory needs,
Have the potential to create new insights, ideas, and hypotheses from the analysis,
Offer artistic impressions of pattern structure to stimulate the imagination,
Spot major unusual localised database patterns and detect empirical location-based regularities, and
Be easy to use and meet basic GISability criteria (Openshaw and Fischer, 1995).

Many of these functions are obviously inter-related and one tool may be able to perform more than one simultaneously; however the data mining task is not straightforward because of the generally multivariate tri-space characteristics of the GIS databases of greatest interest. Indeed, the complexity of this task should not be underestimated. Nearly all existing geographical analysis tools only work in a 2-D spatial context and ignore the more complex hyperspaces by implicitly holding them constant, often without the user realising the damage that this may be doing to the unknown patterns and relationships contained in the data.

It now seems that exploring 2-D map based data for evidence of localised clustering is becoming fairly routine and trivial even if reaching this position has taken more than 35 years. Furthermore, it is only in the last 5 years that the necessity for finding localised, non-stationary, patterns has been properly recognised. The original whole map point pattern nearest neighbour methods, while rightfully recognised as a useful advance in the 1950s, fossilised a style of whole map analysis mentality that took more than 3 decades to overturn, and even now there are still advocates of it. This happy situation for purely spatial analysis rapidly deteriorates when more complex space-time-attribute data interactions in Table 3 are considered.

Table 3 Trispace characteristics of a GIS database relevant to GDM

Datatype	Nature of Data
1.	geography data
2.	time data
3.	multiple attribute data
4.	geography and time data
5.	time and multiple attribute data
6.	geography and multiple attribute data
7.	geography, time, and multiple attribute data

In essence, most current geographical analysis tools only work with data type 1 and they implicitly hold all the other spaces constant. Time Data (data type 2) can be handled via well-developed time series methods, but these are implicitly space free, other than the time series data doubtlessly relates to a geographical area of some kind. It is important to note that geography space is not an important part of the time series model. Attribute only data (type 3) can be handled via conventional multivariate statistical (and current data mining) methods, but space and time can at best only be represented badly; by pretending they are the same as any other variables. For example, it hardly makes sense to use X, Y or time coordinates as if they are variables similar to percentage unemployment, does it?

Space-time data interactions (data type 4) are far harder to analyse if space (i.e., location) is handled in a non-trivial fashion. Indeed only a small number of space-time clustering measures exist (i.e., Knox and Mantel statistics) and very few space-time models have so far been developed (i.e., STARIMAs). Time and multiple attribute data (data type 5) are even harder to analyse. The best example would be multiple time series modelling for a fixed area. Space and multiple attribute data interaction (data type 6) could be seen to be handled by a regionalisation method. Spatial zones can be classified or aggregated on the basis of their multivariate similarity subject to contiguity constraints being placed on the aggregation process but this represents only one type of this class of data space interaction. Finally, data type 7, space-time-multiple attribute data, is least well served of all, however, it is the development of GDM tools that can handle this data type that is the greatest challenge. Methods that can handle data with this level of complexity should have little difficulty with all the rest.

Note that in this classification of the space only data (represented by X, Y co-ordinates or polygons or some other locational descriptor) methods all relate implicitly to a single time period and data type. The challenge is how best to progress the exploratory geographical analysis tool-kit from only being capable of handling data type 1 to incorporate all the other types. Note though that data types 2, 3, and 5 may well have no geographical expression, but this may nevertheless still be a valid result for geodata and it is something to bear in mind when designing a GDM, namely that the patterns may not exist in geographical space but in other dimensions of the data hyperspace.

3.2 Data strangulation

The basic idea of "letting the data speak" is an accepted principle in data mining and exploratory data analysis. The problem is how to achieve this objective in practice. Letting the data speak implies an absolute minimum of selection, coding, classification, reduction, and data modification prior to any analysis being performed; however, all these aspects are strongly entrenched in current research practice. For example it may appear to be common sense in spatial epidemiology to define the disease categories of interest, to fix the time periods to be studied, and to agree that the most relevant spatial scale at which to analyse the data. All these seemingly harmless decisions and research protocols are placing invisible, but real, restrictions on what the data can tell us about themselves and the patterns and relationships they contain. Consider an example. The prior definition of disease categories appears extremely sensible, except (1) all disease codings are artificial, subjective, and may not be based on any good understanding of biological mechanisms; and (2) the purpose in exploratory analysis is to find unusual geographical localised clusters as clues to unknown processes and causal mechanisms, so how confident can you be that the selected disease categories are the most relevant ones to use and how certain can you be that those not selected are not linked via not yet discovered common causes? Why not develop a GDM that will identify which disease code combinations may interact with other attributes and relevant time periods so that together they produce the strongest evidence of localised clustering if only we had been clever enough to specify it in the first place? This is most important because you would expect interaction effects. The identification of patterns depend on what data you choose to study, and, if you are uncertain, then you should not constrain the method by artificially, accidentally, or unintentionally, strangling the data before you even start to analyse it.

A key assumption in GDM is that whatever patterns or relationships exist in a database, the ease by which they can be uncovered depends on not unnecessarily complicating the task by muddying any data structure that exists. For example, imagine a disease occurs in a local area in only one particular time period (viz a time and space localised epidemic); suppose this occurs in time 50 to 70. Now imagine how hard it would be to "find" this cluster if all the data are aggregated into a single time period; (1 to 365), or if it is split into 30 unit chunks so that the 50 to 70 period is now split into two parts. Likewise, imagine that the disease was not of a homogenous single type but was categorised as being 1, 9, 14, 19, and 20 out of the 30 disease codes being analysed lumped altogether, or re-categorised in groups of 5. It is more than likely that the wrong aggregation or disaggregation of either time or disease type would interact with spatial location to yield data that contained impossible to find patterns, except by chance. The only solution is to analyse the data free of all additional modifications imposed by the analyst. This is in keeping with the spirit of data mining even if it appears initially alien. Note this need only becomes a problem when rich databases exist.

3.3 Visiting hyperspace and leaving a trail

The multivariate and trispace data domains constitute an immensely complex hyperspace. The challenge in designing a GDM is how best to be able to cope with this complexity. There are basically three possible strategies you could adopt:

1. Progressively re-map the hyperspace greatly reducing its dimensionality until it becomes manageable; however dropping dimensions, variables, and re-classification all run the risk of data strangulation;

2. Develop some kind of virtual reality hyperspace explorer and data visualiser so that you can go into the database's hyperspace and visit it to see what it looks like without reducing any of the complexity, and

3. Create GDM agents that are able to visit these spaces for you and then report back when it seems that useful and understandable results have been obtained.

The latter strategy seems to offer the most promise; see Openshaw (1994, 1995). The basic idea is very simple and quite generally applicable. The problem is that it can be operationalised in many different ways not all of which will work, and that developing a practically useful system requires considerable experimentation.

The justification for this third approach is as follows. Human beings can only easily handle a small number of dimensions of multivariate space. They live in a 3-D world, are 3-D, and have impossible problems in visualising four, five, or fifty more dimensions. It is possible to create intelligent agents or artificial creatures or robots able to search these higher dimensional spaces because they can be created as higher dimensional entities with the same number of dimensions as the hyperspaces they have to live in. The aim is to devise a pattern (or relationship) search using these creatures and then follow their movements in hyperspace, but projecting it back into a visual geographical or map space (perhaps also with a time dimension). In this way the map acts as a hyperindex to the more complex spaces within which they live. The only critical assumption here is that the primary interest in GIS relevant data mining is in the location of the patterns (and their nature) in a 2-D map space. It may be that patterns have no distinctive geographical expression as indeed they would if data type interactions 2, 3, and 5 were dominant. It would be nice, but not essential, if a GDM could also handle these cases.

3.4 How to parameterise the search

The GDM problem is essentially a problem of search in a very complex multidimensional space. The search task can be viewed as identifying a localised piece of this multidimensional space that contains unusual amounts of unexpected patterns. This is equivalent to specifying the database query that will define the space region that contains the strongest or most unusual or most interesting patterns. The difficulty is that there is a large, almost infinite, number of possible database queries that could be generated in the search for whatever is defined as being "good" or "interesting" results. Fortunately, many search problems can be solved as optimisation problems. If the search domain can be parameterised, then the challenge reduces to how best to search the parameter space to optimise a pattern measuring function that is constructed so that it will identify whatever are considered to be interesting results. A complication for the optimisation process is that in the GDM, the data being explored is a mix of continuous and categorical measurements and there could be multiple good and potentially useful solutions. The search also has to be capable of handling all the trispace interactions shown in Table 3, but without knowing in advance which ones are most relevant.

In geography space, a GDM only needs X and Y coordinates to define a location being examined, and a measure of the extent of the space based at that location. For instance in GAM, the search could be parameterised by X, Y and R, where R is the radius of a circle based on the point X, Y, (See Openshaw, 1998; Openshaw et al., 1999). Note that it makes little difference whether a circle or a rectangle is used. The CAS method of Openshaw et al., (1989) used rectangles. The circle has the advantage of only needing fewer parameters and is rotationally invariant. In Besag and Newell?s (1991) version of GAM, the R is the K^th nearest neighbour distance for location X, Y. The MAPEX algorithm of Openshaw and Perree (1996) used a genetic algorithm to find good X, Y and R values. The swarm and flock optimisation method of Macgill et al., (1999) uses a different search strategy but it is still looking for optimal X, Y, and R values. So X, Y, and R can be used to parameterise the geographical space and, with a little thought, this can be extended into other spaces.

Once geography and time space are considered together, then there is an instant additional problem. Time is measured in non-map units (i.e., seconds, days, years) rather than in terms of map distance units; however, the simplest parameterisation is to merely add an additional term to the GAM parameterisation; so that you X, Y, R, and equivalent values for time T and Tr, where Tr is the time distance (i.e., a measure of time radius or localness) around time T for the data in the geographical space described by location X, Y, and radius R. This presents some potential problems; for example, time may not be as continuous as geographic space and often seems to be lumpy by comparison. Whereas there is no difficulty in imagining multiple separate clumps of patterns across the map, the pattern that exists in each clump need not necessarily be composed of cases that are approximately contiguous in time; they too could be clumped in time. Maybe, in practice, this is not a problem if you restrict the clustering in time to being similar in nature to the clustering in space; i.e., localised. There is no reason in principle why localised time clustering cannot have a similar spatial distribution at different moments in time; indeed one imagines that it probably will do so. The other problem is that time clustering may occur independently of geographical clustering and vice versa and the parameterisation and pattern measuring function have to be able to handle these aspects.

The next hurdle is the need to include "other attributes" in the parameterisation of the search. These "other attributes" in a GIS database context are variables such as map layers and categories interpreted in the widest possible sense. They include both digital map layers, choropleth maps, derived GIS results, such as buffered polygons and pseudo map layers that provide context. Pseudo map layers could be values from a choropleth map, or variables such as the percentage of unemployment rates within P km of a location expressed either as a categorical value or as a continuous variable. This pseudo map data layer notion is useful as a means of incorporating in a GDM a wider range of non-digital map variables, such as census data. This is a good example of a pseudo map data layer, while it does not exist as a conventional map coverage, it can nevertheless be treated in the same manner; see Openshaw and Turton (1999b).

The simplest solution to parameterising the search to handle these "other attributes" is to extend the central value and radius representation used for handling time if the variable is more or less continuous as an integer or real number; however, map coverages present a harder problem as the values are qualitative and no longer have any natural order to them; for example, the categorical values used to represent different geological structures. This problem was implicitly solved in the multiple overlay permutation search described in Openshaw et al., (1990). The simplest approach is to add to the [X, Y, R, T, Tr] a set of C_k variables, where C_k represents coverage k. The search takes place within the specified space-time envelope for each subset of data that are identical in terms of a set of specified C_k coverages. Note that this does not require that all the sets of individual coverage categories are specified, only that each data subset is identical in terms of the selected coverages. This may be regarded as being too severe with no flexibility for fuzzyness and OR type conditions to be applied. A less restrictive approach involves adding an extra dimension to C_k to specify explicitly one or more categories within each coverage. So C_k becomes D_ik which is a 0-1 variable with an entry for each coverage K and for each category i in each coverage k. The match can either be perfect or a measure of distance included in the parameterisation, so that H_k is the maximum allowed Hamming distance for any data case from D_ik in coverage K. This H_k can be regarded as the equivalent to R in geography space or Tr in time space. The D_ik values could also be coded as 0, 1, 2 where 2 indicates and ignores this coverage-category. In theory this would permit the method maximum freedom to flexibly find the best attribute interactions by permitting access to the full set of permutations that could be used. Note that if time is coded (or can be recoded) as a sufficiently discrete integer with not too many values then this attribute also can be handled in a similar manner. The only other possibility would be to permit multiple time points joined together by AND, OR logic but maybe this is becoming too sophisticated.

In essence the parameterisation is defining a multi-dimensional trispace search query that defines a chunk of space within which some measure of pattern is to be computed. It also can be visualised as a complex series of logical if conditions connected by ANDs and Ors. This is why a GDM might also be regarded as an Intelligent Query System (IQS), a system that suggests questions for you to investigate. Indeed this may be the best way to view it but it is not the best way to program it. The search (X, Y, R) becomes find all data within distance R of point X, Y and the search (X, Y, R, T, Tr) becomes find all data within distance R of point X, Y and within Tr of time T. Alternative you can cast this parameterisation as a set of if conditions; for instance, IF a case is within R of X, Y and its time coordinates are within Tr of T and its coverage 1 category 1 or 2 or 3 matches the target (etc., for other coverages and categories) then include it in the data to be analysed. The handling of multiple time domains merely requires the addition of a few additional IF AND statements. Note that how you subsequently decide to encode the parameterisations mainly reflects the optimisation technology being used; for instance a genetic programming solution would be different from a bit string representation designed for a genetic algorithm. Again, once you know what it is you wish to do, its practical operationalisation is often relatively straightforward although questions of efficiency and performance still need to be investigated. Clearly some representations will turn out to be far more suitable for GDM than some others, and experimentation is probably the only way to proceed.

3.5 Devising an appropriate objective function

The aim is to define optimal values for the parameters so that they optimise some pattern measuring function. The final design question concerns the nature of the objective function which, when optimised, will provide a solution to a GDM application. In some ways this is harder than designing the problem parameterisation. The original method proposed in Openshaw (1994, 1995) involved the use of a count statistic with significance levels measured by means of a Monte Carlo method. This count statistic was essentially a count of the number of data cases that occur within the search region compared with what would be expected if the data were random. Experimentation has shown that this test probably has a low level of power because there is no population at risk data being used. The alternative proposed here makes use of the GAMK accumulated surface of evidence approach; see Openshaw and Craft (1991), Openshaw (1998), Openshaw et al (1999).

The GAMK method works as follows:

Step 1: Retrieve all cases within a search region

Step 2: Apply a significance test procedure to consider whether the observed rate or count statistic is unusual.

Step 3: If the probability is less than a threshold, then save details of the search region together with the number of excess cases, and

Step 4: Convert the excess cases into a cumulative density surface using a kernel smoothing procedure that uses circle radius as a bandwidth.

This surface of evidence method offers a number of attractions:

it is easily visualised and can be animated with slider bars for scale (radius size),
it has been shown to be robust to multiple testing effects by summing up the results from many different searches,
it is easily understood,
it is not a test of significance but a description of pattern, and
it provides a useful overall summary of where on the map pattern is located and where it is not.

Note that this method differs from a purely kernel density approach because only those circles that pass the hurdle of a local significance testing procedure are allowed to contribute to it. Additionally, their contribution depends on the size of the excess with a bandwidth that reflects the area of the circle being examined.

The GAMK method will work in a more general GDM context provided a population at risk measure is available. The GDM in common with GAM and GEM requires that you have a value or count of something of interest and a measure of either the population at risk or an expected value (perhaps a model has been used to remove non-spatial covariates, for example, age/sex differences in a disease rate). The statistical test applied in Step 2 depends on the incidence of the feature being studied. If this is rare then Poisson probabilities could be used. If very common, then a binomial probability may be more appropriate. You also can apply an assumption-free Monte Carlo significance test and, even a bootstrap Z statistic. In practice, the results of GAMK are insensitive to the choice of significance testing procedures because of the kernel smoothing in Step 4.

3.6 Global optimality or nearly optimal or good results

The GAMK objective function for any individual search region is a circle size weighted excess incidence statistic. This can be used on GDM on the grounds that the GDM is really only interested in looking for localised geographical clustering and that nothing else is of interest and because the results are most easily visualised in a map domain.

There also is another difficulty. The most obvious optimisation algorithms make use of evolutionary programming techniques. Other candidates include focused Monte Carlo and Swarm optimisation; however, they all attempt to find a single global best although there can there can be no assurance that a global optimum result is obtained. Except in very simple low-dimensional applications with few coverages, the global results are unknown unless they can be identified by brute force on a supercomputer. These optimisation methods can handle multi-global solutions and make no assumption of global convexity or continuous functions. Unfortunately, in the end they only ever find one (or a set of very similar) result. This may not be the global best, but hopefully it will be a "good" solution- although no theoretical proof will ever be available. The solution used in Openshaw (1995) is to restart the search having removed all the incidence data that contributed to the best result. This stop-delete data - restart process may have to be repeated several times, but it appears to be very effective. The surface of evidence results are accumulated over several runs of GDM.

3.7 Handling time-varying population at risk data

A further problem needs some attention. Time can affect the values of the data being analysed. The GIS coverages can change but these effects (if measurable) can be handled by adding more coverages to the data being explored. Far more serious is how to handle time-related changes in the population at risk. Typically the time resolution of the variable of interest may be far better than that of the population at risk; for example, disease incidence data might be monthly, but the best census-based population at risk data are updated once every 10 years (120 times poorer resolution). There are various possible approaches:

Ignore the problem on the grounds that any patterns created by this neglect may still be of considerable interest;
Argue that this is a data quality issue for the user that needs to be addressed;
Use an expected value based on historic data or long-term average values or estimates of a maximum value so as to minimise the risk of false patterns being uncovered; and
Estimate population-at-risk values for each time period for which there are data to be analysed, perhaps using linear interpolation.

The first strategy is the one recommended here, but the GDM can easily be modified to handle the other approaches as it only affects how you calculate the expected values.

3.8 Indexing the results of hyperspaces

A final design question concerns how to understand what the GDM has found. With GAMK this is fairly obvious because there is only a single map, although this can be indexed by scale. Here, as in GEM, there may be many multiple sets of GAMK results corresponding to possibly multiple local optima found by the search process characterised by potentially multiple different search criteria. The solution is to generate the results file even if it runs to several hundred megabytes and then apply summary measures, visualisation, and animation tools to it after the search process is complete. A Java-based viewer has been developed that allows the user to explore, visualise, animate, and study any of the results found during and at the end of the search.

There are three useful outputs:

A summary view in geography space of any and all clustering that has been found;
List of best search regions together with coordinates in hyperspace, and
Hypertext linkages or databases that connected (2) to (1) and show the contribution (2) makes to 1.

3.9 Fuzzy search

A final modification concerns the spatial search. The X, Y, and R map search parameterisation is critical to all else that follows. It is extremely crisp in that it ignores data uncertainty. It also is impossible to detect certain spatial patterns, such as doughnuts, by a single circle. Fortunately it is straightforward to add extra parameters to the search problem so that circle radius R is replaced by a fuzzy set in which the membership of belonging to the circle can be made to vary with distance from the circle centroid. A single fuzzy set parameterised as a triangle would add a considerable extra degree of flexibility.

4. Conclusions

The paper has focused on the design issues in building a GIS-appropriate Geographical Data Mining tool. How well it performs is the subject of ongoing research. Preliminary results will be presented at GeoComputation '99 using synthetic data with varying levels of pattern complexity. Synthetic data are used because otherwise there is no way of assessing how good or bad the results are. There is currently little or no experience of exploratory GIS- relevant analysis going beyond the map. There is no literature of useful results being found by such study because the technologies needed are only now starting to be developed. Comparisons with conventional data mining software (Clementine and Intelligent Data Miner) with GAMK and GAMK-T also will be described. The belief is that GDM actually works surprisingly well, at least on synthetic data. Its application to real-world data, which contains unknown patterns, is the next step. There is little doubt that the development and application of GDM tools will become a major future GeoComputation research theme.

Acknowledgments

The research reported here is supported by ESRC Grant No. R237260. The assistance of Ian Turton, Andy Turner and James Macgill in researching aspects of GDM technologies are gratefully acknowledged.

References

Alexander, F. E., Boyle, P., 1996 Methods for Investigating localised Clustering of Disease, IARC Scientific Publications No 135, Lyon, France

Besag, J. ,Newell, J., 1991, 'The detection of clusters in rare disease', Journal of the Royal Statistical Society Ser. A, 143-155

Dobson, J. E., 1983, 'Automated geography', The Professional Geographer 35, 135-143

Macgill,J.,Openshaw, S. 1998, The use of flocks to drive a Geographic Analysis Machine., Proceedings of Geocomputation 98,Bristol.

Openshaw, S., Charlton, M E., Wymer, C., Craft, A., 1987, 'A Mark I Geographical Analysis Machine for the automated analysis of point data sets', International Journal of Geographical Information Systems 1, 335-358.

Openshaw, S., Wilkie, D., Binks, K., Wakeford, R., Gerrard, HH, Croasdale, MR., 1989, 'A method of detecting spatial clustering of disease', in WA Crosbie and JH Gitlus (eds) Medical Response to the Effects of Ionising Radiation, Elsever, London, p295-308.

Openshaw, S., Cross, A., Charlton, M E., 1990, 'Building a prototype geographical correlates exploration machine', International Journal of Geographical Information Systems 3, 297-312.

Openshaw, S., Craft, A. 1991, 'Using the Geographical Analysis Machine to search for evidence of clusters and clustering in childhood leukaemia and non-Hodgkin lymphomas in Britain.' In: Draper, G., ed., The Geographical Epidemiology of Childhood Leukaemia and Non-Hodgkin Lymphoma in Great Britain 1966-83, London, HMSO, p109-122

Openshaw, S., Fischer, M M., 1995, 'A framework for research on spatial analysis relevant to geo-statistical information systems in Europe', Geographical Systems 2, 325-337

Openshaw, S. Perrée, T., 1996, 'User centred intelligent spatial analysis of point data', in D. Parker (ed) Innovations in GIS 3 Taylor and Francis, London p.119-134

Openshaw, S., 1994, 'Two exploratory space-time attribute pattern analysers relevant to GIS' in S Fotheringham and P Rogerson (eds) GIS and Spatial Analysis Taylor and Francis, London, p83-104

Openshaw, S., 1995 'Developing automated and smart spatial pattern exploration tools for geographical information systems applications', The Statistician 44, 3-16

Openshaw, S. and Openshaw, C., 1997, Artificial Intelligence in Geography. Wiley, Chichester

Openshaw, S., 1998, 'Building automated Geographical Analysis and Exploration Machines' in P A Longley, S M Brooks, R Mcdonnell, B Macmillan (eds) Geocomputation: A primer, Wiley Chichester p95-115

Openshaw, S. and Turton, I., 1999a An introduction to High Performance Computing and the Art of Parallel Programming: for geographers, social scientists, and engineers. Routledge, London. (forthcoming)

Openshaw, S., Turton, I., 1999b, 'Using a Geographical Explanations Machine to Analyse Spatial Factors relating to Primary School Performance' (forthcoming)

Openshaw, S., I. Turton, J. Macgill, J. Davy, 1999. 'Putting the Geographical Analysis Machine on the Internet' In B Gittings (eds) Innovations in GIS 6, Taylor and Francis (forthcoming)