Testing space-time and more complex hyperspace geographical analysis tools

Stan Openshaw, Andy Turner, Ian Turton, James Macgill

Centre for Computational Geography, School of Geography, University of Leeds, Leeds LS2 9JT, UK

Chris Brunsdon, Department of Town and Country Planning, University of Newcastle, Newcastle Upon Tyne, NE1 7RU

Abstract. The paper presents the results of a novel experiment that seeks to compare the performance of several alternative exploratory geographical analysis methods. A number of simulated data sets containing different amounts of geographical, temporal, and attribute related patterns are created and analysed using various Geographical Analysis Machines, commercial data mining software, smart geographical analysis tools, and artificial life based approaches.

Keywords: exploratory geographical analysis, data mining, artificial life, space-time analysis, smart geographical analysis tools

1 Introduction
2 What is special about geographical data mining?
3 Two Synthetic Data Generators 3.1 Objectives
3.2 Synthetic data generator 1
3.3 Synthetic data generator 2 3.3.1 Computational issues
3.3.2 Adding Time and Attribute Interaction 4 Geographical Analysis Tools 4.1 GAM/K
4.2 GAM/K-T
4.3 MAPEX
4.4 GDM/1
4.5 Commercial Data Mining Tools 4.6 Flocking Hunting Agents 5 Results
6 Conclusions

Acknowledgements
References

1 Introduction

The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data driven inductive approaches to geographical analysis. Indeed as end-user GIS applications mature so new analysis needs are starting to emerge. For example, in the UK the Crime and Disorder Act (1998) places a statutory responsibility on the 464 District Authorities to perform crime pattern analysis. Geographically referenced crime data is not new, most Police Forces have, or are developing, on-line crime reporting systems with GIS or digital map based command and control computer systems. However, other than mapping, the idea and need to perform spatial analysis is relatively new and can be regarded as an activity that has been created as a by-product of GIS success. A similar need for spatial analysis tools can be identified in many other GIS application areas. The problem for both researchers and developers is how best to try and meet these new emerging needs for useful geographical analysis. In the commercial sector, many large retail and financial organisations have been investing heavily in the development of data warehouses. They know that their future competitiveness may well depend on the uses they make of their data resources. Much of their data is or can be geographically referenced and may well contain business significant geographical patterns and relationships so tools are needed that can help find them.

Data mining is business analyst terminology for a particular form of interactive data analysis which employs at least one intuitive human expert and their available computing resources to explore and model patterns in data. Data mining techniques are synonymous with exploratory data analysis tools which are applied in a particular aspect of the search for interesting patterns and relationships in data. It follows that a specific data mining application aims to generate information from data to help further create understanding of a particular problem or class of events via some kind of interactive exploratory data analysis using a software package of data mining techniques. Data mining is defined by one developer of data mining software as: "... the process of discovering valid, previously unknown, and ultimately comprehensible information from large stores of data. You can use the extracted information to form a prediction or classification model, or to identify similarities between database records. The resulting information can help you make more informed decisions." Another developer data mining software describe it as: "... referring to the use of a variety of techniques to identify `nuggets' of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but of low value in its raw form, and little use can be made of it directly; it is the hidden information in the data which is valuable." The emphasis in these definitions is on the use of information extracted from data bases to make informed business decisions. Such emphasis will probably be used to shield data mining systems from criticisms that may surface as analysts try to apply the software to problems outside the context of business decision making such as in GIS. Such criticism is likely due to the exaggerated and often misleading claims made by the developers about the analysis capabilities of data mining systems. It may be that this hype is aimed at encouraging and promoting the use of non-linear data lead exploratory analysis, but none-the-less it all too often leads to expectations of the systems to solve problems that are currently far beyond their capabilities. Two Crows (a well recognised data mining consultancy) write "A great deal of what is said about data mining is incomplete, exaggerated, or wrong. Data mining has taken the business world by storm, but as with many new technologies, there seem to be a direct relationship between its potential benefits and the quantity of often-contradictory claims, or myths, about its capabilities and weaknesses."

So data mining packages have mainly been developed for the purpose of analysing very large commercial databases in order to model and predict customer buying behaviour. This emphasis on prediction may well limit their usefulness in spatial analysis where testing hypotheses and exploratory spatial analysis may be a more usual activity. However, it is not a particularly sound defence to argue that data mining tools are intended for enterprise business orientated applications not involving GIS because there are many potential enterprise applications that are based around GIS databases of one sort or another. However, GIS developers and spatial scientists have been slow to develop equivalent geographical data mining tools partly because of the historical emphasis on confirmatory methods and to a lesser extent on low dimensional linked map-graph statistical exploration tools. Maybe also the new needs for geographical analysis in GIS data rich environments also caught many unawares.

2 What is special about geographical data mining?

It can be argued that conventional data mining tools can be usefully applied to mine GIS databases to extract pattern in the same way that conventional statistical methods can be applied to spatial data. There are some geoinformational data mining tasks that may be usefully performed by conventional data mining software. Table 1 outlines the range of tools that most data mining packages offer and many of these methods could be usefully applied to spatial data. For example, data reduction tools, such as multivariate classification, can be useful as a means of summarising the essential features of large spatial data sets; for instance, to create geodemographic classifications. Similarly, modelling tools such as neural networks and decision trees can be readily applied to some geographic problems. It can be argued that whilst these methods ignore all of the special features of geographical data; see Table 2. They still "work" to some degree but there are also many exploratory geographical analysis types of data mining task that seemingly they cannot perform.

However, there is a major potential problem in that if you use conventional data mining tools then you are implicitly forced to accept the key assumption that geographical data are the same as any other data and that there is nothing special about geographical information or indeed geographical analysis that will prevent it being performed by conventional methods. If you input some X, Y referenced data into a data mining package and expected it to identify localised clusters of excess incidence of a disease, then you would probably be very disappointed. These packages could only treat the X,Y co-ordinates as if they were merely two ordinary variables (such as age or income) and it is very likely that nothing useful would be achieved. There is no mechanism for handling location or spatial aggregation or for coping with spatial concepts or even mapping. Conventional data mining tools maybe very powerful but they continue the geographical neglect inherent in conventional statistical methods.

Openshaw (1999) argues that what is now needed are new types of data mining tools that can handle the special nature of spatial information and also capture the spirit and essence of geographicalness that a GIS minded data miner would expect to have available. There is a further problem that needs to be dealt with. If you simply equate geographical data mining with exploratory spatial analysis then maybe some will be misled into believing that this problem has already been solved. However this overlooks the massive difference between exploring a manageable small data set with few variables and the need to perform the same process on massive databases (with two or three orders of magnitude more cases) and possibly high levels of multivariate complexity. Human being based graphical explorations of spatial data just does not scale well. The bottleneck is not computational but is a result of limits on the speed and skills of the spatial analyst. There comes a point where adding extra dimensions to the analysis hyperspace overwhelms human abilities. Visualisation tools are useful but there are also limits as to what they can deliver and, in general, GIS databases often present too much complexity for such a simple minded (albeit technically sophisticated) approach.

This paper describes and presents the results of a series of empirical experiments concerned with evaluating the abilities of a range of data mining tools in detecting patterns in synthetic spatial data. The patterns can be purely spatial, or temporal, or space-time, or space-time-attribute based. Synthetic data sets are used so that the "true" results are known allowing the abilities of the various methods to be investigated. The importance of going beyond purely spatial patterns reflects the growing number of GIS databases that include time as well as a plethora of other variables. This represents a considerable methodological challenge as it is clear that most of the existing exploratory analysis tools available for use with geographical data only function well in a two dimensional map space and may be incapable of finding patterns in more complex time and multiple attribute spaces. Openshaw (1994, 1995) identifies seven possible interactions of the trispace that characterises GIS databases; see Table 3. The need is for tools relevant to GIS which can successfully search for patterns in some or all of these hyperspaces. The problem is that these hyperspaces interact to create or hide patterns. For example, suppose you have a database with X, Y for space, T for time, and C type of event. It is obvious that different events may well contain different geographical patterns. The same event may also exhibit different geographical patterns for different time periods. The essence of the problem is that you cannot study the time-event type interactions separately or sequentially as it may be that the strongest patterns are found only when certain time periods and event types are analysed together. Currently the best available methods tend to totally ignore these interactions and would study the data either together as a single data set or else rely on a priori research design decisions that effectively strangle the data before it can speak, albeit unintentionally.

Section 3 outlines two synthetic data generators. Section 4 outlines the five methods that are applied. Section 5 which presents the results and ideas for further research are contained in the conclusion.

3 Two Synthetic Data Generators

3.1 Objectives

A major perceived difficulty is the current lack of major practical success stories in real world geographical analysis applications that can be used to demonstrate successful projects and hence function as exemplars. For instance, in spatial epidemiology there are very few documented successes whereby exploratory geographical analysis tools have been used to correctly identify hitherto unknown disease clusters. Indeed many researchers are fairly pessimistic about the prospects; see Alexander and Boyle (1996); although this seems to reflect a reluctance to perform exploratory analysis and a preference for testing hypotheses. Maybe also the lack of portable tools or sophisticated exploratory analysis functions in vendor GIS systems has not helped. A similar situation exists in other areas; viz. crime pattern analysis. It would appear that users require proof that the methods work (as well as being easy to use) before using them. Seemingly no-one wants to be first to "risk" the routine application of geographical analysis tools in everyday end-user orientated GIS applications. As a result there are currently no generally available pattern benchmark data sets that can be used to measure the performance of old or new methods of exploratory geographical analysis.

The strategy adopted here is to create synthetic data sets with varying degrees of pattern and then assess the success of a selection of methods in analysing these data. This need becomes even more critical if the patterns being concealed in these data sets are not just localised spatial clusters of varying intensity but also include space-time interactions and more complex structures. The data sets are available on the WWW for others to test out their favourite methods. There are problems with this strategy in that: the synthetic data may contain unrealistic degrees of pattern, the patterns may be too hard or too easy compared with the still unknown patterns that exist in the real-world, there is an implicit assumption that each synthetic pattern is findable (but it may not be for all sorts of reasons), and so far only fairly low dimensional databases have been created. In some ways the unknown cunning and skills of the synthetic pattern generator is being pitted against the abilities of the analysis methods being tested. It is a useful start but it is by no means definitive.

The study region was defined as Yorkshire and Humberside as this yielded a sufficiently large data set. The data relates to 10,430 Census Enumeration Districts (EDs) for which persons were used as the population at risk factor. Each census ED had a corrected 100 metre grid-reference attached to it. For the purposes of this exercise 1,000 events were to be generated from a total population at risk of 4,820,129 persons.

3.2 Synthetic data generator 1

The first data generator used created 10 synthetic data sets that displayed varying degrees of spatial and temporal clustering; from purely random to 25% clustered. The spatial clustering algorithm is that described in Alexander et al (1996). Each synthetic data set had a different degrees of clustering and often different parent locations. The random cases were selected by multinomial allocation with multinomial probabilities proportional to the population at risk (i.e. total persons). This approximates an inhomogeneous Poisson process the intensity of which depends on the distribution of the population. The non-random events were allocated as follows:

identify parent locations
cases are allocated to clusters using an inhomogeneous Poisson process that reflects a Gaussian risk function. There is a distance cut-off so that 90% of the cases will occur within 10 km of their parent.
A time clustering component was added by assigning each random spatially clustered event a uniformly distributed time period between 1 and 255. Those cases which are part of the spatial clusters are assigned a uniformly distributed random limited to time periods between 50 and 60.

This produced a number of data sets with varying degrees of spatial patterns but with a common space-time component. There is no attempt to generate attribute interactions and the attribute field was randomly set to 1 or 2 with equal frequency. Table3 describes the characteristics of these data.

3.3 Synthetic data generator 2

The second data generator is more ambitious and included the full range of trispace interactions described in Table 2. The true results were only revealed after the analysis task had been completed. The method is based loosely on the "raised incidence" model. Consider first the space-only case; under a null hypothesis of no clustering other than that due to variation in underlying population at risk, the probability of incidence in any place would be proportional to the population at risk in that place. If we denote the probability of incidence at a point (x,y) by p(x,y), then this simple model is

p1(x,y) = kh(x,y) (i)

where h(x,y) is the density of population at risk at (x,y), and k is a constant of proportionality chosen to ensure that p(x,y) integrates to unity. Clearly, this model is of little use if one wishes to consider clustered data. However, the model can be modified, so that the likelihood of incidents occurring around some places exceeds that expected due to population density alone. One way of achieving this is by multiplying (i) by a spatial kernel function centered around some point (x1,y1), with a bandwidth b. For example one could have

p2(x,y) = kh(x,y)exp[((x-x1)²+(y-y1)²)/2b²] (ii)

Here the likelihood of a point occurring at location (x,y) depends not only on the population at risk, but also on the closeness to some `hot point' (x1,y1). Points will cluster spatially around (x1,y1). The role played by the bandwidth, b, is to control the `tightness' of the cluster. Low values of b will produce more concentrated clusters. Note finally that although (ii) described a clustered process, in reality it is likely that some cases will be part of a cluster while others will be general `background' cases. Indeed, if (ii) were the only generating process, identification of clusters would be a trivial point-plotting exercise. For this reason, the final model will be a mixture of both processes:

p3(x,y) = k (ah(x,y)exp[((x-x1)²+(y-y1)²)/2b²] + (1-a)h(x,y)) (iii)

The newly introduced parameter a can be interpreted as the proportion of the data which is part of the cluster.

3.3.1 Computational issues

The next issue to be addressed is how one can generate random points from the distribution p3 above. To do this in practice, two issues must be considered. Firstly, the data apply to discrete spatial units (census enumeration districts), and not to continuous space. Secondly, it is not immediately clear how one can simulate random numbers in the distributional form given above. To resolve the first problem, instead of randomly generating a real-number pair (x,y), we generate an index to the enumeration district. Call this index i. Since there are 10,430 enumeration districts, i is a random integer between 1 and 10,430. Of course, different values of i will have different probabilities of selection. For example, in the simple model (i), the probability of selection for a given i is proportional to the population at risk resident in the corresponding enumeration district. For the `hot point' model (ii), the probability of selection for a given i is proportional to the population at risk multiplied by the `kernel factor'

exp[((x-x1)²+(y-y1)²)/2b²]. Here (x,y) for each enumeration district is taken as the zonal centroid.

Having re-specified the models for discrete data, the next problem is that of actually generating the data. For models (i) and (ii) this can be achieved using rejection sampling, as set out below

Step 1: For each enumeration district, compute a number proportional to the probability of selection. Store these in an array X.

Step 2. Compute M, the maximum value in array X.

Step 3. Generate a uniformly random integer in the range 1...10400. Call this J.

Step 4. Generate a uniform continuous number in the range 0...M. Call this U.

Step 5. If U < X(J) then return j as the selected index. Otherwise repeat from step 3.

(NB. It is not necessary to compute the normalising constant, k in this algorithm.)

Thus, we now have a method for generating cases for models (i) and (ii) - the next stage is to generate the mixture model (iii). This is relatively simple once there is a method for generating models (i) and (ii). If one wishes to condition on a value of a (say 0.2), and a given sample size n (say 1000), then draw a selections from model (i), and n(1-a) selections from model (ii), and merge these. For the example values suggested above, one would draw 800 model (i) selections, and 200 model (ii) selections.

3.3.2 Adding Time and Attribute Interaction

The above section sets out a method for drawing spatially clustered data, but does not consider time or attribute information. Recall that the aim here is to detect interactions between space, time and attributes in clusters. Generating random time or attribute data in itself is straightforward, given the methods described in Section 3. For example, suppose we wished to generate a day in the range 1...365 in addition to the spatial information. Assuming initially that there were time clusters, but that these were independent of any spatial clustering, one could use the mixture method set out in Section 3 to generate a data set. One could choose a `hot day' and a bandwidth, similar to the `hot point', and have some cases clustered around this point in time (with probability a), whilst others occurred uniformly throughout the year (with probability (1-a)). To introduce a degree of interaction here, one could link the probability of the observation coming from the clustered model to the spatial location. For example, suppose the probability a was a kernel function of (x1,y1), with a maximum height of 0.75 - this implies that incidents occurring near to the `hot point' are more likely to also cluster in time - around the `hot day'. When the spatial location of an observation lies exactly on the `hot point', then there is a 0.75 chance that the observation will also be clustered in time. Finally, one could extend the method to incorporate attributes in the same way. Initially this is carried out only for a single, dichotomous attribute taking the values 0 and 1. In the unclustered version, the respective probabilities are c0 and c1 for the values to be selected. However, around the point (x1,y1) these probabilities may be adjusted using a kernel model. All of these kinds of interaction models are used in this study.

4 Geographical Analysis Tools

4.1 GAM/K

GAM/K is designed to detect localised spatial clustering without knowing either where to look or at what scales to look for patterns. Location and scale can both interact in that clustering may occur at different scales on different parts of the map. The algorithm is extensively described elsewhere; see Openshaw (1998), Openshaw et al (1987, 1999a, b). However, there is no provision in GAM/K for detecting space-time or attribute specific clustering other than the ability of the end-user to segment the data into appropriate subsets. It is not expected to do so well on the more complex data sets because here no attempts are made to subset the data. This may undervalue its potential in the real-world because not all end-users would be that ignorant of their data or the patterns they suspect it may well contain.

4.2 GAM/K-T

This is a recently developed version of GAM/K in which permutations of time period are also examined. The search is now focused on identifying where to look for localised clusters, at what spatial scales, and in which time periods. All three now interact. Tests indicate it will detect space, space-time, or time only patterns although there are now multiple output maps necessitating the development of special results viewers. GAM/K-T, in common with GAM/K, employs an exhaustive search strategy. The additional computational load of handling time was reduced by algorithmic design improvements and by data sparsity which dramatically reduces the maximum number of time permutations that need to be searched.

4.3 MAPEX

MAPEXplorer (MAPEX) seeks to perform a GAM/K and GAM/K-T function via a smart search based on a genetic algorithm. The original method is described in Openshaw and Perree (1996) only performed spatial searches. This has now been extended to handle space-time patterns. The hope is that MAPEX will perform well on the space and space-time pattern detection tasks but it will be unable to handle (except by accident) attribute clustering. Although once again in a real-world application the end-user is unlikely to always be totally ignorant of the data.

4.4 GDM/1

Geographical Data Miner (GDM/1) is another recent development. Its design is described in Openshaw (1999). GDM is a development of MAPEX to handle event characteristics and to include GIS coverage linkages as performed by GEM; see Openshaw (1998), Openshaw and Turton (1999ab). The search now has to handle where to look for localised clustering, at which spatial scales, at which time periods, and identify what event characteristics define the patterns. All the spaces interact to create a highly complex search hyperspace. It is emphasised that this not an uncommon problem in GIS where it is increasingly common for the data in a GIS data base can contain some, or all, of the following: map location information, time, event attributes, location attributes. Location attributes are based on typical GIS coverage details; i.e. geology, rainfall, etc; that apply to all the data. Event attributes only apply to the cases (events) which are being analysed. Here time and type are both event attributes that do not exist for locations in the database which are not events. So the data being analysed consists of two types of data: (1) X, Y for census Ed and population of census Ed and (2) X, Y for event, time, and type. Note that the type (1) population data could also be indexed by time but here it is constant. The design purpose in GDM/1 was to extend the MAPEX approach to handle space only, space-time space-event type, time-event type, and space-time-event type interactions. Note that a spatial dimension is always present which makes sense because of the geographical purpose of the method. However, experiments indicate that when space is unimportant then the search circles will either grow so large as to encompass all the data or else focus on disjoint subsets of it that cumulative cover the entire map. The interaction effects are handled via a series of AND and OR implicit operators. Another way of thinking about GDM is as an intelligent query generator. If only you knew in advance what selection to make to detect the strongest pattern then all would be well. The GDM searches the universe of all possible relevant queries (given the available data) to suggest the most promising ones (as measured by some pattern detecting statistic) to make. GDM too needs a sophisticated viewer in order to make sense of the results.

4.5 Commercial Data Mining Tools

Data mining tools need to be guided by users who understand the problem, the data and the general nature of the analytical methods involved. The methods now used in data mining are usually extensions and generalisations of analytical methods known for decades. The recent interest in data mining is almost completely due to the improved user interfaces that make these techniques more available (and more easy to sell) to businesses that have already spent vast amounts of money on a data warehouses. So there is an instant problem with virtually all conventional data mining packages. Despite their high cost, there is almost certainly little prospect of finding any explicit geographical analysis tools. Many will not even draw maps. The synthetic data was pre-processed to become a complete flat file for each census Ed; thus X,Y, population, 0-1 event, time, type. Each event is represented as either a 0 or 1 count (some records may be duplicated). Where the event count is zero the time and type values were randomly generated with a uniform distribution between relevant limits (1-365, 1-2). Three different strategies were used:

compute event rates for each Ed and attempt to predict the rates given population, time, type, X, and Y; co-ordinates using neural networks, rule generators and decision tree type data mining tools;
classify the data and then displaying incidence rates for each group in the classification using available classification tools such as Kohonen nets and K-means, and
train a neural net to model a version of the data that was purely random and then apply to the synthetic data to identify the largest residuals which would provide evidence of clustering.

The greatest problems here are the learning curves of using the two different data mining packages, numerous operational problems with them, and the large number of alternatives that could be investigated under each of the three headings. Nor was it possible to investigate hybrid combinations of different tools. For example, decision trees or other classifiers can be used to select between alternative data mining methods for a given data set. Also it is quite apparent that a GIS skilled data miner could readily use the visual programming interface provided by these data mining packages to create completely new hybrid tools that either may mimic some of the functions of exploratory geographical analysis methods or provide totally different alternative approaches. In general none of the three direct approaches listed above worked at all well. The problems are intensified by the absence of any mapping capability and other difficulties to do with visualising the results. It is possible that with more complex multiple category data there will be more opportunities for data mining packages to find interesting relationships as relative importance of the spatial variables may reduce. Certainly at present with these synthetic test data sets the two data mining systems appear to struggle, however, further evaluation work is needed before a firm conclusion can be drawn. It is important to note that these tentative, preliminary views are limited to the application of the basic core data mining tools the systems do provide opportunities for developing more innovative analysis procedures. Data mining is not a push button technology and rests heavily on the skills and experience of the analysts and their understanding of the underlying algorithms they are using. By comparison the other methods are designed to be almost totally automated and human skills are only relevant to looking at the final results.

4.6 Flocking Hunting Agents

The flock engine utilises a number of independent agents (called geoBoids) that explore the spatial database foraging for clusters. They communicate and direct their movements co-operatively by borrowing behavioural traits that are based on flocking motion (Reynolds 1987). Each member of the flock is able to evaluate and broadcast its performance. Based on this, other flock members can choose to steer towards well-performing geoBoids to assist them or to steer to avoid poorly performing boids. In addition, each geoBoid changes its own behaviour based on its performance. For example, a poorly-performing boid will speed up in order to find a more interesting area. Likewise, a well-performing geoBoid will slow down to investigate an interesting region more carefully. Finally, if a boid finds an empty section of space with zero population, it masks off that area blocking other boids from entering it.

For the Time-Space-Attribute analysis a Genetic Algorithm based optimiser was build and retro-fitted to the flock based search. Interaction between the flock of agents and the GA is, at this time, is minimal, essentially, the Flock suggests geographic regions of interest, and the GA tries to find the best combination of attribute and time span to describe any clustering found there. As its stands there is no `memory' within the GA to carry information between each suggestion from the flock. In a future, more integrated, system each agent in the flock would carry around its own mini-GA allowing that side of the search to take advantage of the inter agent communication. Macgill (1998) goes into more detail about the flock algorithm.

5 Results

The results are summarised in Table 5.

The GAM/K worked surprisingly well and found most of the clusters in both data sets despite the clever time-attribute interactions in data 2. GAM/K told you where the clusters are located fairly rapidly. The results are easy to interpret and the technology is fairly mature and well established. It may not work so well if the data contained more complex structure although the two data sets are by no means simple. It struggled most with data 2 cluster 2. This was because the cluster centre is a rural location and the results were pulled towards and mixed with the nearest large town which had a weakening effect. This problem may well have been created by the size threshold parameter used by GAM/K.

The GAM/K-T results are similar to GAM/K except that it correctly identified the time clustering in data 2. It struggled with data 1 time clusters (as did all the time sensitive methods) leading to the suggestion that the data are incorrect in that the intended 50 to 60 time cluster was randomised over the complete range.

The MAPEX software performs quite well. It easily finds spatial clustering and space-time clustering in two of the three data 2 clusters. It is interesting that this method was able to find the locations of clusters without any awareness of the interactions with the attribute. This was unexpected and may well suggest that the attribute interactions were insufficiently subtle. However, it totally fails to find cluster 1 indicating a likely software bug in the coding of the co-ordinate ranges. Genetic Algorithms are fault tolerant which is a useful property as well as a nuisance. On data 1 there is a good performance in detecting the spatial clustering but almost total failure to identify the correct time period. This mirrors the GAM/K-T performance and may well indicate that these data are not time clustered as expected. Clearly further investigation is needed here.

The GDM/1 results are potentially promising but reflect the early stage in its development. It works quite well but is a little unreliable and requires large amounts of processor time. In some cases each data required a day of CPU time on a fast workstation. It also displays the same generic fault as did MAPEDX when processing data 2 clusters. However, it correctly spotted that the attribute information for data 1 was random (by ignoring it) but it seemed to experience the same difficulties in identifying the time period, again supporting the data error hypothesis. It is a pity that the results for the space-attribute interaction data sets are not yet available.

The flocking results are currently only available for data 2. The flock worked well in detecting purely spatial clustering. The poorest results relate to those data sets where the cluster radius size was large (20km). Here the boids became very susceptible to artefacts from the background population that tended lead the flock away from the real cluster centers. For the smaller radius clusters the performance was good in that it often found two of the clusters with strong indications of the third (slightly offset to the nearest population center). The majority of the development of the GA side of the method had gone into the detection of attribute interactions and the system correctly identified all but two of the attribute, space-attribute, time-attribute effects. It also identified time only clustering but more work will be needed on the time-space interactions of the system. With closer integration between the GA and the flock agents the performance of the system should improve even further.

Finally, there are no Data Mining results because none of the methods that were applied work particularly well. It was also very difficult to interpret the outputs in a manner equivalent to that used for the other methods.

There are also some other problems in that the interpretation of whether or not a synthetic data cluster was found was subjective. The identification and measurement of false positives is also fraught with difficulty because of probabilistic uncertainties as to what is an "real" error. Maybe a more deterministic synthetic data generator needs to be developed so that the deviations in space-time-attribute interactions can be directly assessed. There is clearly scope for considerable additional research in these and related aspects.

6 Conclusions

The results are very interesting and clearly require further study and investigation. They demonstrate that there are now fairly reliable methods capable of detecting spatial clustering and space time clustering. There ability to also analyze space-time-attribute and space-attribute interactions is less well understood. One surprise that in retrospect should have been obvious was the apparent failure of the commercial data mining packages that were investigated although their testing is continuing. Another surprise was the need to develop data result viewers that would allow the end-user to interpret the results. The extension of exploratory geographical analysis into more complex hyper-spaces generates orders of magnitude more results to investigate. The design objective of creating an intelligent human-machine partnership may require re-thinking. As it stands there is considerable subjectivity in "interpreting" the results. Maybe the old GAM/K approach of doing this interpretation automatically or of adding a results filter before human beings start to visualise them might well be a useful subsequent development. Meanwhile we would hope that other interested researchers will use our data sets to test out and develop new methods of geographical analysis.

Acknowledgements

Census data used here was purchased by ESRC/JISC for the academic sector and all results based on it are undoubtedly Crown Copyright. ESRC funded much of the GAM/K-T, MAPEX, and GDM1 research via grant No. R237260. The assistance of Capital Bank plc in providing the data mining evaluation is gratefully acknowledged.

References

Alexander, F. E., Williams, J., Maisonneuve, P., Boyle, P, 1996,'The simulated data-sets', in F E Alexander and P Boyle (eds) Methods for Investigating localised Clustering of Disease, IARC Scientific Publications No 135, Lyon, France p21-27

Besag, J. ,Newell, J., 1991, 'The detection of clusters in rare disease', Journal of the Royal Statistical Society Ser. A, 143-155

Dobson, J. E., 1983, 'Automated geography', The Professional Geographer 35, 135-143

Macgill,J.,Openshaw, S. 1998, The use of flocks to drive a Geographic Analysis Machine., Proceedings of Geocomputation 98,Bristol.

Openshaw, S., Charlton, M E., Wymer, C., Craft, A., 1987, 'A Mark I Geographical Analysis Machine for the automated analysis of point data sets', International Journal of Geographical Information Systems 1, 335-358.

Openshaw, S., Craft, A. 1991, Using the Geographical Analysis Machine to search for evidence of clusters and clustering in childhood leukaemia and non-Hodgkin lymphomas in Britain. In: Draper, G., ed., The Geographical Epidemiology of Childhood Leukaemia and Non-Hodgkin Lymphoma in Great Britain 1966-83, London, HMSO, p109-122

Openshaw, S., Fischer, M M., 1995, `A framework for research on spatial analysis relevant to geo-statistical information systems in Europe', Geographical Systems 2, 325-337

Openshaw, S. Perrée, T., 1996, `User centred intelligent spatial analysis of point data', in D. Parker (ed) Innovations in GIS 3 Taylor and Francis, London p.119-134

Openshaw, S., 1994, "Two exploratory space-time attribute pattern analysers relevant to GIS" in S Fotheringham and P Rogerson (eds) GIS and Spatial Analysis Taylor and Francis, London, p83-104

Openshaw, S., 1995 `Developing automated and smart spatial pattern exploration tools for geographical information systems applications', The Statistician 44, 3-16

Openshaw, S. and Openshaw, C., 1997, Artificial Intelligence in Geography. Wiley, Chichester

Openshaw, S., 1998, `Building automated Geographical Analysis and Exploration Machines'in P A Longley, S M Brooks, R Mcdonnell, B Macmillan (eds) Geocomputation: A primer, Wiley Chichester p95-115

Openshaw, S. and Turton, I., 1999a An introduction to High Performance Computing and the Art of Parallel Programming: for geographers, social scientists, and engineers. Routledge, London. (forthcoming)

Openshaw, S., Turton, I., 1999b, 'Using a Geographical Explanations Machine to Analyse Spatial Factors relating to Primary School Performance' (forthcoming)

Openshaw, S., Turton, I, Macgill, J., Davy, J., 1999b, `Putting the Geographical Analysis Machine on the Internet' In B Gittings (eds) Innovations in GIS 6, Taylor and Francis (forthcoming)

Openshaw, S. 1999, 'Geographical Data Mining: key design issues', Proceedings of GeoComputation '99 (forthcoming)

Openshaw, S., Turton, I., Macgill, J, 1999a, 'Using the Geographical Analysis Machine to analyse limiting long term illness', Geographical & Environmental Modelling 3, 83-99

Reynolds, C W., 1987, 'Flocks, herds, and schools: a distributional behavioural model', Computer Graphics 21, 25-34

Table 1 Typical generic data mining functions

exploratory data analysis tools
linked map and graph displays
other visual data mining tools
most multivariate statistical methods
linear and logistic regression
classification
decision trees and regression trees
association rules detection
neural networks
memory based reasoning
sequence discovery

Table 2 Trispace characteristics of a GIS database

Datatype Nature of Data Interaction

_________________________________________________________________________________

1. spatial data

2. time data

3. multiple attribute data

4. geography and time data

5. time and multiple attribute data

6. geography and multiple attribute data

7. geography, time, and multiple attribute data

Table 3 Characteristics of the synthetic data sets 1

Data Set	Clusters	%Clustered	Size of Clusters
1	3	20	132 48 20
2	1	20	200
3	0	0.0	0
4	4	15	123 1 1 25
5	2	10	87 13
6	2	5	45 5
7	2	2.5	24 1
8	1	2.5	25
9	1	1.0	10
10	2	1.0	8 2
11	0	0.0	0

Table 4 Characteristics of the synthetic data sets 2

Data Set	Clusters	%Clustered	Cluster Radius km	Type of Clustering
1	3	30	5	Space
2	3	30	20	Space
3	3	60	5	Space
4	3	60	20	Space
5	3	30	5	Space-Time
6	3	30	20	Space-Time
7	3	60	5	Space-Time
8	3	60	20	Space-Time
9	3	30	5	Space-Attribute
10	3	30	20	Space-Attribute
11	3	60	5	Space-Attribute
12	3	60	20	Space-Attribute
13	3	30	5	Space-Time -Attribute
14	3	30	20	Space-Time-Attribute
15	3	60	5	Space-Time-Attribute
16	3	60	20	Space-Time-Attribute

Table 5. Summary of Results

Dataset

GAMK

GAMKT

MAPEX

FLOCK

GDM1

4(2)¹

S10

2+?²

1+?²

2+?²

1+?²

2+?²

C10

2+?²

1+?²

C11

2+?²

C12

2+?²

C13

2+?²

C14

2+?²

C15

2+?²

C16

2+?²

Notes: C = number of clusters

T = Time clustering (Y/N)

A = Attribute interaction (Yes/Space/Time)

1 two of the clusters in this data set are very small, all the methods missed these two clusters

2 one of the clusters was very rural and these methods centred the detected cluster over a nearby town.

Testing space-time and more complex hyperspace geographical analysis tools

Stan Openshaw, Andy Turner, Ian Turton, James Macgill

Centre for Computational Geography, School of Geography, University of Leeds, Leeds LS2 9JT, UK

Chris Brunsdon, Department of Town and Country Planning, University of Newcastle, Newcastle Upon Tyne, NE1 7RU

Contents

1 Introduction

2 What is special about geographical data mining?

3 Two Synthetic Data Generators

3.1 Objectives

3.2 Synthetic data generator 1

3.3 Synthetic data generator 2

3.3.1 Computational issues

4 Geographical Analysis Tools

4.1 GAM/K

4.2 GAM/K-T

4.3 MAPEX

4.4 GDM/1

4.5 Commercial Data Mining Tools

4.6 Flocking Hunting Agents

5 Results

6 Conclusions

Acknowledgements

References

Table 1 Typical generic data mining functions

Table 2 Trispace characteristics of a GIS database

Table 3 Characteristics of the synthetic data sets 1

Table 4 Characteristics of the synthetic data sets 2

Table 5. Summary of Results