Abstract. The paper presents the results of a novel experiment
that seeks to compare the performance of several alternative exploratory
geographical analysis methods. A number of simulated data sets containing
different amounts of geographical, temporal, and attribute related patterns
are created and analysed using various Geographical Analysis Machines,
commercial data mining software, smart geographical analysis tools, and
artificial life based approaches.
Keywords: exploratory geographical analysis, data mining, artificial life, space-time analysis, smart geographical analysis tools
Data mining is business analyst terminology for a particular form of interactive data analysis which employs at least one intuitive human expert and their available computing resources to explore and model patterns in data. Data mining techniques are essentially exploratory data analysis tools applied to the search for interesting patterns and relationships in data. It follows that a specific data mining application aims to generate information from data to help create understanding of a particular problem or class of events via some kind of interactive exploratory data analysis using a software package of data mining techniques. Data mining is defined by one developer of data mining software as: "... the process of discovering valid, previously unknown, and ultimately comprehensible information from large stores of data. You can use the extracted information to form a prediction or classification model, or to identify similarities between database records. The resulting information can help you make more informed decisions." Another developer of data mining software describes it as: "... referring to the use of a variety of techniques to identify `nuggets' of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but of low value in its raw form, and little use can be made of it directly; it is the hidden information in the data which is valuable." The emphasis in these definitions is on the use of information extracted from databases to make informed business decisions. Such emphasis will probably be used to shield data mining systems from criticisms that may surface as analysts try to apply the software to problems outside the context of business decision making, such as in GIS.
Such criticism is likely due to the exaggerated and often misleading claims made by the developers about the analysis capabilities of data mining systems. It may be that this hype is aimed at encouraging and promoting the use of non-linear, data-led exploratory analysis, but nonetheless it all too often leads to expectations that the systems will solve problems that are currently far beyond their capabilities. Two Crows (a well recognised data mining consultancy) write: "A great deal of what is said about data mining is incomplete, exaggerated, or wrong. Data mining has taken the business world by storm, but as with many new technologies, there seem to be a direct relationship between its potential benefits and the quantity of often-contradictory claims, or myths, about its capabilities and weaknesses."
Data mining packages have mainly been developed for the purpose of analysing very large commercial databases in order to model and predict customer buying behaviour. This emphasis on prediction may well limit their usefulness in spatial analysis, where testing hypotheses and exploratory spatial analysis may be more usual activities. However, it is not a particularly sound defence to argue that data mining tools are intended for enterprise business orientated applications not involving GIS, because there are many potential enterprise applications that are based around GIS databases of one sort or another. Meanwhile, GIS developers and spatial scientists have been slow to develop equivalent geographical data mining tools, partly because of the historical emphasis on confirmatory methods and to a lesser extent on low dimensional linked map-graph statistical exploration tools. Perhaps the new needs for geographical analysis in GIS data rich environments also caught many unawares.
2 What is special about geographical data mining?
It can be argued that conventional data mining tools can be usefully applied to mine GIS databases to extract pattern in the same way that conventional statistical methods can be applied to spatial data. There are some geoinformational data mining tasks that may be usefully performed by conventional data mining software. Table 1 outlines the range of tools that most data mining packages offer and many of these methods could be usefully applied to spatial data. For example, data reduction tools, such as multivariate classification, can be useful as a means of summarising the essential features of large spatial data sets; for instance, to create geodemographic classifications. Similarly, modelling tools such as neural networks and decision trees can be readily applied to some geographic problems. It can be argued that, whilst these methods ignore all of the special features of geographical data (see Table 2), they still "work" to some degree. There are, however, many exploratory geographical analysis types of data mining task that they seemingly cannot perform.
However, there is a major potential problem in that if you use conventional data mining tools then you are implicitly forced to accept the key assumption that geographical data are the same as any other data and that there is nothing special about geographical information or indeed geographical analysis that will prevent it being performed by conventional methods. If you input some X, Y referenced data into a data mining package and expected it to identify localised clusters of excess incidence of a disease, then you would probably be very disappointed. These packages could only treat the X, Y co-ordinates as if they were merely two ordinary variables (such as age or income) and it is very likely that nothing useful would be achieved. There is no mechanism for handling location or spatial aggregation or for coping with spatial concepts or even mapping. Conventional data mining tools may be very powerful but they continue the geographical neglect inherent in conventional statistical methods.
Openshaw (1999) argues that what is now needed are new types of data mining tools that can handle the special nature of spatial information and also capture the spirit and essence of geographicalness that a GIS minded data miner would expect to have available. There is a further problem that needs to be dealt with. If you simply equate geographical data mining with exploratory spatial analysis then maybe some will be misled into believing that this problem has already been solved. However, this overlooks the massive difference between exploring a manageable small data set with few variables and the need to perform the same process on massive databases (with two or three orders of magnitude more cases) and possibly high levels of multivariate complexity. Human-based graphical exploration of spatial data just does not scale well. The bottleneck is not computational but is a result of limits on the speed and skills of the spatial analyst. There comes a point where adding extra dimensions to the analysis hyperspace overwhelms human abilities. Visualisation tools are useful but there are also limits as to what they can deliver and, in general, GIS databases often present too much complexity for such a simple minded (albeit technically sophisticated) approach.
This paper describes and presents the results of a series of empirical experiments concerned with evaluating the abilities of a range of data mining tools in detecting patterns in synthetic spatial data. The patterns can be purely spatial, or temporal, or space-time, or space-time-attribute based. Synthetic data sets are used so that the "true" results are known, allowing the abilities of the various methods to be investigated. The importance of going beyond purely spatial patterns reflects the growing number of GIS databases that include time as well as a plethora of other variables. This represents a considerable methodological challenge as it is clear that most of the existing exploratory analysis tools available for use with geographical data only function well in a two dimensional map space and may be incapable of finding patterns in more complex time and multiple attribute spaces. Openshaw (1994, 1995) identifies seven possible interactions of the trispace that characterises GIS databases; see Table 3. The need is for tools relevant to GIS which can successfully search for patterns in some or all of these hyperspaces. The problem is that these hyperspaces interact to create or hide patterns. For example, suppose you have a database with X, Y for space, T for time, and C for type of event. It is obvious that different events may well contain different geographical patterns. The same event may also exhibit different geographical patterns for different time periods. The essence of the problem is that you cannot study the time-event type interactions separately or sequentially as it may be that the strongest patterns are found only when certain time periods and event types are analysed together. Currently the best available methods tend to totally ignore these interactions and would study the data either together as a single data set or else rely on a priori research design decisions that effectively strangle the data before it can speak, albeit unintentionally.
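The seven interactions of the trispace are simply the non-empty combinations of the three data spaces. A minimal sketch of that enumeration follows; the function name and the string labels are illustrative assumptions, not taken from Openshaw (1994, 1995).

```python
from itertools import combinations

# Labels for the three spaces are assumed for illustration only.
SPACES = ("geography", "time", "attribute")


def trispace_interactions(spaces=SPACES):
    """Return every non-empty combination of the given data spaces."""
    return [
        combo
        for size in range(1, len(spaces) + 1)
        for combo in combinations(spaces, size)
    ]


if __name__ == "__main__":
    for number, combo in enumerate(trispace_interactions(), start=1):
        print(number, " and ".join(combo))
```

With three spaces there are 2^3 - 1 = 7 combinations, which matches the count given in the text.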
Section 3 outlines two synthetic data generators. Section 4 outlines the five methods that are applied. Section 5 presents the results, and ideas for further research are contained in the conclusion.
The strategy adopted here is to create synthetic data sets with varying degrees of pattern and then assess the success of a selection of methods in analysing these data. This need becomes even more critical if the patterns being concealed in these data sets are not just localised spatial clusters of varying intensity but also include space-time interactions and more complex structures. The data sets are available on the WWW for others to test out their favourite methods. There are problems with this strategy in that: the synthetic data may contain unrealistic degrees of pattern, the patterns may be too hard or too easy compared with the still unknown patterns that exist in the real world, there is an implicit assumption that each synthetic pattern is findable (but it may not be for all sorts of reasons), and so far only fairly low dimensional databases have been created. In some ways the unknown cunning and skills of the synthetic pattern generator are being pitted against the abilities of the analysis methods being tested. It is a useful start but it is by no means definitive.
The study region was defined as Yorkshire and Humberside as this yielded a sufficiently large data set. The data relates to 10,430 Census Enumeration Districts (EDs) for which persons were used as the population at risk factor. Each census ED had a corrected 100 metre grid-reference attached to it. For the purposes of this exercise 1,000 events were to be generated from a total population at risk of 4,820,129 persons.
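As a rough indication of how sparse the synthetic events are, the figures above imply the following overall rate and average number of events per ED. This is plain arithmetic on the stated totals, not a result reported in the paper.

```latex
\[
\text{overall rate} = \frac{1000}{4\,820\,129} \approx 2.07 \times 10^{-4}
\ \text{events per person},
\qquad
\text{mean events per ED} = \frac{1000}{10\,430} \approx 0.096 .
\]
```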
3.2 Synthetic data generator 1
The first data generator created 10 synthetic data sets that displayed varying degrees of spatial and temporal clustering, from purely random to 25% clustered. The spatial clustering algorithm is that described in Alexander et al (1996). Each synthetic data set had a different degree of clustering and often different parent locations. The random cases were selected by multinomial allocation with multinomial probabilities proportional to the population at risk (i.e. total persons). This approximates an inhomogeneous Poisson process, the intensity of which depends on the distribution of the population. The non-random events were allocated as follows:
$$p_1(x,y) = k\,h(x,y) \qquad \text{(i)}$$
where h(x,y) is the density of population at risk at (x,y), and k is a constant of proportionality chosen to ensure that $p_1(x,y)$ integrates to unity. Clearly, this model is of little use if one wishes to consider clustered data. However, the model can be modified, so that the likelihood of incidents occurring around some places exceeds that expected due to population density alone. One way of achieving this is by multiplying (i) by a spatial kernel function centred around some point $(x_1,y_1)$, with a bandwidth b. For example one could have
$$p_2(x,y) = k\,h(x,y)\exp\left[-\frac{(x-x_1)^2+(y-y_1)^2}{2b^2}\right] \qquad \text{(ii)}$$
Here the likelihood of a point occurring at location (x,y) depends not only on the population at risk, but also on the closeness to some `hot point' $(x_1,y_1)$. Points will cluster spatially around $(x_1,y_1)$. The role played by the bandwidth, b, is to control the `tightness' of the cluster. Low values of b will produce more concentrated clusters. Note finally that although (ii) describes a clustered process, in reality it is likely that some cases will be part of a cluster while others will be general `background' cases. Indeed, if (ii) were the only generating process, identification of clusters would be a trivial point-plotting exercise. For this reason, the final model will be a mixture of both processes:
$$p_3(x,y) = k\left(a\,h(x,y)\exp\left[-\frac{(x-x_1)^2+(y-y_1)^2}{2b^2}\right] + (1-a)\,h(x,y)\right) \qquad \text{(iii)}$$
The newly introduced parameter a can be interpreted as the proportion of the data which is part of the cluster.
3.3.1 Computational issues
The next issue to be addressed is how one can generate random points from the distribution $p_3$ above. To do this in practice, two issues must be considered. Firstly, the data apply to discrete spatial units (census enumeration districts), and not to continuous space. Secondly, it is not immediately clear how one can simulate random numbers in the distributional form given above. To resolve the first problem, instead of randomly generating a real-number pair (x,y), we generate an index to the enumeration district. Call this index i. Since there are 10,430 enumeration districts, i is a random integer between 1 and 10,430. Of course, different values of i will have different probabilities of selection. For example, in the simple model (i), the probability of selection for a given i is proportional to the population at risk resident in the corresponding enumeration district. For the `hot point' model (ii), the probability of selection for a given i is proportional to the population at risk multiplied by the `kernel factor' $\exp[-((x-x_1)^2+(y-y_1)^2)/(2b^2)]$. Here (x,y) for each enumeration district is taken as the zonal centroid.
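The discrete selection weights described above can be sketched as follows. The sketch assumes each enumeration district is held as an (x, y, population) tuple and that the kernel is the Gaussian form with a negative exponent, so that weights fall off with distance from the hot point. The function names and data layout are hypothetical.

```python
import math


def kernel_factor(x, y, x1, y1, b):
    """Gaussian kernel factor: 1 at the hot point (x1, y1), decaying with distance."""
    return math.exp(-((x - x1) ** 2 + (y - y1) ** 2) / (2.0 * b ** 2))


def selection_weights(districts, hot_point=None, bandwidth=None):
    """Unnormalised selection weight for each district centroid.

    districts: list of (x, y, population) tuples (assumed layout).
    With no hot point this is model (i): weight = population at risk.
    With a hot point it is model (ii): population times the kernel factor.
    """
    weights = []
    for x, y, population in districts:
        weight = float(population)
        if hot_point is not None:
            weight *= kernel_factor(x, y, hot_point[0], hot_point[1], bandwidth)
        weights.append(weight)
    return weights
```

The weights are left unnormalised because only their relative sizes matter for the sampling step.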
Having re-specified the models for discrete data, the next problem is that of actually generating the data. For models (i) and (ii) this can be achieved using rejection sampling, as set out below:
Step 1. For each enumeration district, compute a number proportional to the probability of selection. Store these in an array X.
Step 2. Compute M, the maximum value in array X.
Step 3. Generate a uniformly random integer in the range 1...10,430. Call this J.
Step 4. Generate a uniform continuous number in the range 0...M. Call this U.
Step 5. If U < X(J) then return J as the selected index. Otherwise repeat from step 3.
(NB. It is not necessary to compute the normalising constant, k, in this algorithm.)
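The five steps above can be sketched in Python as follows. This is an illustrative version rather than the authors' code: indices are 0-based instead of 1-based, and the function name and the optional `rng` argument are assumptions.

```python
import random


def rejection_sample(weights, rng=None):
    """Return one index drawn with probability proportional to weights[index].

    weights plays the role of array X (step 1); no normalising constant is needed.
    """
    rng = rng or random.Random()
    m = max(weights)                      # step 2: M, the maximum value in X
    if m <= 0:
        raise ValueError("at least one weight must be positive")
    while True:
        j = rng.randrange(len(weights))   # step 3: uniform random index J
        u = rng.uniform(0.0, m)           # step 4: uniform number U in 0...M
        if u < weights[j]:                # step 5: accept J, otherwise retry
            return j
```

The expected number of attempts per accepted case is M divided by the mean weight, so the method slows down when one district has a much larger weight than the rest, as can happen with a tight kernel.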
Thus, we now have a method for generating cases for models (i) and (ii); the next stage is to generate the mixture model (iii). This is relatively simple once there is a method for generating models (i) and (ii). If one wishes to condition on a value of a (say 0.2), and a given sample size n (say 1000), then draw n(1-a) selections from model (i), and na selections from model (ii), and merge these. For the example values suggested above, one would draw 800 model (i) selections, and 200 model (ii) selections.
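A minimal sketch of the merge step is given below. For brevity it uses the standard library's `random.choices` for the weighted draws in place of the rejection sampler, which is a substitution made here and not the paper's method. The function name and the returned (index, is_clustered) pairs are assumptions.

```python
import random


def draw_mixture(background_weights, clustered_weights, n, a, rng=None):
    """Draw n district indices: n(1-a) from model (i) and n*a from model (ii).

    Returns a shuffled list of (district_index, is_clustered) pairs.
    """
    rng = rng or random.Random()
    n_clustered = round(n * a)
    n_background = n - n_clustered
    indices = range(len(background_weights))
    # Weighted sampling with replacement (stand-in for rejection sampling).
    background = rng.choices(indices, weights=background_weights, k=n_background)
    clustered = rng.choices(indices, weights=clustered_weights, k=n_clustered)
    cases = [(i, False) for i in background] + [(i, True) for i in clustered]
    rng.shuffle(cases)  # merge so the cluster cases are not stored contiguously
    return cases
```

With n = 1000 and a = 0.2 this yields 800 background cases and 200 clustered cases, as in the example above.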
3.3.2 Adding Time and Attribute Interaction
The above section sets out a method for drawing spatially clustered data, but does not consider time or attribute information. Recall that the aim here is to detect interactions between space, time and attributes in clusters. Generating random time or attribute data in itself is straightforward, given the methods described in Section 3. For example, suppose we wished to generate a day in the range 1...365 in addition to the spatial information. Assuming initially that there were time clusters, but that these were independent of any spatial clustering, one could use the mixture method set out in Section 3 to generate a data set. One could choose a `hot day' and a bandwidth, similar to the `hot point', and have some cases clustered around this point in time (with probability a), whilst others occurred uniformly throughout the year (with probability (1-a)). To introduce a degree of interaction here, one could link the probability of the observation coming from the clustered model to the spatial location. For example, suppose the probability a was a kernel function centred on (x1,y1), with a maximum height of 0.75. This implies that incidents occurring near to the `hot point' are more likely to also cluster in time, around the `hot day'. When the spatial location of an observation lies exactly on the `hot point', then there is a 0.75 chance that the observation will also be clustered in time. Finally, one could extend the method to incorporate attributes in the same way. Initially this is carried out only for a single, dichotomous attribute taking the values 0 and 1. In the unclustered version, the respective probabilities are c0 and c1 for the values to be selected. However, around the point (x1,y1) these probabilities may be adjusted using a kernel model. All of these kinds of interaction models are used in this study.
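The space-time interaction described above might be sketched as follows. Several details are assumptions made for illustration: the spatial kernel is Gaussian, clustered days are drawn from a normal distribution around the hot day and clamped to 1...365, and the function and parameter names are hypothetical.

```python
import math
import random


def draw_day(x, y, hot_point, space_bandwidth, hot_day, time_bandwidth,
             max_prob=0.75, rng=None):
    """Draw a day of the year for a case located at (x, y).

    The chance of the day being time-clustered peaks at max_prob on the
    hot point and decays with distance. Returns (day, is_time_clustered).
    """
    rng = rng or random.Random()
    dist_sq = (x - hot_point[0]) ** 2 + (y - hot_point[1]) ** 2
    p_clustered = max_prob * math.exp(-dist_sq / (2.0 * space_bandwidth ** 2))
    if rng.random() < p_clustered:
        # Assumed time kernel: normal around the hot day, clamped to the year.
        day = round(rng.gauss(hot_day, time_bandwidth))
        return min(365, max(1, day)), True
    # Background case: uniform over the year.
    return rng.randint(1, 365), False
```

The dichotomous attribute could be handled the same way, by letting the probability of the value 1 move from its background level towards an adjusted level as the kernel weight increases.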
4.4 GDM/1
Geographical Data Miner (GDM/1) is another recent development. Its design is described in Openshaw (1999). GDM is a development of MAPEX to handle event characteristics and to include GIS coverage linkages as performed by GEM; see Openshaw (1998), Openshaw and Turton (1999ab). The search now has to handle where to look for localised clustering, at which spatial scales, at which time periods, and identify what event characteristics define the patterns. All the spaces interact to create a highly complex search hyperspace. It is emphasised that this is not an uncommon problem in GIS, where it is increasingly common for the data in a GIS database to contain some, or all, of the following: map location information, time, event attributes, location attributes. Location attributes are based on typical GIS coverage details (e.g. geology, rainfall) that apply to all the data. Event attributes only apply to the cases (events) which are being analysed. Here time and type are both event attributes that do not exist for locations in the database which are not events. So the data being analysed consist of two types of data: (1) X, Y for census ED and population of census ED and (2) X, Y for event, time, and type. Note that the type (1) population data could also be indexed by time but here it is constant. The design purpose in GDM/1 was to extend the MAPEX approach to handle space only, space-time, space-event type, time-event type, and space-time-event type interactions. Note that a spatial dimension is always present, which makes sense because of the geographical purpose of the method. However, experiments indicate that when space is unimportant then the search circles will either grow so large as to encompass all the data or else focus on disjoint subsets of it that cumulatively cover the entire map. The interaction effects are handled via a series of implicit AND and OR operators. Another way of thinking about GDM is as an intelligent query generator. If only you knew in advance what selection to make to detect the strongest pattern then all would be well. GDM searches the universe of all possible relevant queries (given the available data) to suggest the most promising ones (as measured by some pattern detecting statistic) to make. GDM too needs a sophisticated viewer in order to make sense of the results.
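The idea of a query generator can be illustrated with a toy exhaustive search. This is not the GDM/1 algorithm: the search strategy, the candidate grids and the scoring statistic (observed minus population-expected count) are all assumptions chosen to keep the sketch short, and every name in it is hypothetical.

```python
from itertools import product


def in_circle(px, py, cx, cy, radius):
    """True if point (px, py) lies within the circle centred on (cx, cy)."""
    return (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2


def best_queries(districts, events, centres, radii, windows, types, top=3):
    """Score every (circle, time window, event type) query; return the best.

    districts: (x, y, population) tuples; events: (x, y, day, kind) tuples.
    A type of None means "any event type".
    """
    total_pop = sum(pop for _, _, pop in districts)
    scored = []
    for (cx, cy), radius, (t0, t1), kind in product(centres, radii, windows, types):
        pop_inside = sum(pop for x, y, pop in districts
                         if in_circle(x, y, cx, cy, radius))
        # Events satisfying the time AND type parts of the query.
        matching = [e for e in events
                    if t0 <= e[2] <= t1 and (kind is None or e[3] == kind)]
        observed = sum(1 for e in matching
                       if in_circle(e[0], e[1], cx, cy, radius))
        expected = len(matching) * pop_inside / total_pop
        scored.append((observed - expected, (cx, cy, radius, t0, t1, kind)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]
```

Even this small example shows why the number of candidate queries grows multiplicatively with each added dimension, which is the motivation for a smarter search than enumeration.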
4.5 Commercial Data Mining Tools
Data mining tools need to be guided by users who understand the problem, the data and the general nature of the analytical methods involved. The methods now used in data mining are usually extensions and generalisations of analytical methods known for decades. The recent interest in data mining is almost completely due to the improved user interfaces that make these techniques more available (and easier to sell) to businesses that have already spent vast amounts of money on data warehouses. So there is an instant problem with virtually all conventional data mining packages. Despite their high cost, there is almost certainly little prospect of finding any explicit geographical analysis tools. Many will not even draw maps. The synthetic data was pre-processed to become a complete flat file with one record for each census ED; thus X, Y, population, 0-1 event, time, type. Each event is represented as either a 0 or 1 count (some records may be duplicated). Where the event count is zero the time and type values were randomly generated with a uniform distribution between relevant limits (1-365, 1-2). Three different strategies were used:
For the Time-Space-Attribute analysis a Genetic Algorithm based optimiser was built and retro-fitted to the flock based search. Interaction between the flock of agents and the GA is, at this time, minimal: essentially, the flock suggests geographic regions of interest, and the GA tries to find the best combination of attribute and time span to describe any clustering found there. As it stands there is no `memory' within the GA to carry information between each suggestion from the flock. In a future, more integrated, system each agent in the flock would carry around its own mini-GA, allowing that side of the search to take advantage of the inter-agent communication. Macgill (1998) goes into more detail about the flock algorithm.
5 Results
The results are summarised in Table 5.
The GAM/K worked surprisingly well and found most of the clusters in both data sets despite the clever time-attribute interactions in data 2. GAM/K told you where the clusters are located fairly rapidly. The results are easy to interpret and the technology is fairly mature and well established. It may not work so well if the data contained more complex structure although the two data sets are by no means simple. It struggled most with data 2 cluster 2. This was because the cluster centre is a rural location and the results were pulled towards and mixed with the nearest large town which had a weakening effect. This problem may well have been created by the size threshold parameter used by GAM/K.
The GAM/K-T results are similar to GAM/K except that it correctly identified the time clustering in data 2. It struggled with data 1 time clusters (as did all the time sensitive methods) leading to the suggestion that the data are incorrect in that the intended 50 to 60 time cluster was randomised over the complete range.
The MAPEX software performs quite well. It easily finds spatial clustering and space-time clustering in two of the three data 2 clusters. It is interesting that this method was able to find the locations of clusters without any awareness of the interactions with the attribute. This was unexpected and may well suggest that the attribute interactions were insufficiently subtle. However, it totally fails to find cluster 1 indicating a likely software bug in the coding of the co-ordinate ranges. Genetic Algorithms are fault tolerant which is a useful property as well as a nuisance. On data 1 there is a good performance in detecting the spatial clustering but almost total failure to identify the correct time period. This mirrors the GAM/K-T performance and may well indicate that these data are not time clustered as expected. Clearly further investigation is needed here.
The GDM/1 results are potentially promising but reflect the early stage in its development. It works quite well but is a little unreliable and requires large amounts of processor time. In some cases each data set required a day of CPU time on a fast workstation. It also displays the same generic fault as did MAPEX when processing data 2 clusters. However, it correctly spotted that the attribute information for data 1 was random (by ignoring it) but it seemed to experience the same difficulties in identifying the time period, again supporting the data error hypothesis. It is a pity that the results for the space-attribute interaction data sets are not yet available.
The flocking results are currently only available for data 2. The flock worked well in detecting purely spatial clustering. The poorest results relate to those data sets where the cluster radius size was large (20km). Here the boids became very susceptible to artefacts from the background population that tended to lead the flock away from the real cluster centers. For the smaller radius clusters the performance was good in that it often found two of the clusters with strong indications of the third (slightly offset to the nearest population center). The majority of the development of the GA side of the method had gone into the detection of attribute interactions and the system correctly identified all but two of the attribute, space-attribute, and time-attribute effects. It also identified time only clustering but more work will be needed on the time-space interactions of the system. With closer integration between the GA and the flock agents the performance of the system should improve even further.
Finally, there are no Data Mining results because none of the methods that were applied worked particularly well. It was also very difficult to interpret the outputs in a manner equivalent to that used for the other methods.
There are also some other problems in that the interpretation of whether or not a synthetic data cluster was found was subjective. The identification and measurement of false positives is also fraught with difficulty because of probabilistic uncertainties as to what is a "real" error. Maybe a more deterministic synthetic data generator needs to be developed so that the deviations in space-time-attribute interactions can be directly assessed. There is clearly scope for considerable additional research in these and related aspects.
6 Conclusions
The results are very interesting and clearly require further study and investigation. They demonstrate that there are now fairly reliable methods capable of detecting spatial clustering and space-time clustering. Their ability to also analyze space-time-attribute and space-attribute interactions is less well understood. One surprise that in retrospect should have been obvious was the apparent failure of the commercial data mining packages that were investigated, although their testing is continuing. Another surprise was the need to develop data result viewers that would allow the end-user to interpret the results. The extension of exploratory geographical analysis into more complex hyper-spaces generates orders of magnitude more results to investigate. The design objective of creating an intelligent human-machine partnership may require re-thinking. As it stands there is considerable subjectivity in "interpreting" the results. Maybe the old GAM/K approach of doing this interpretation automatically, or of adding a results filter before human beings start to visualise them, might well be a useful subsequent development. Meanwhile we would hope that other interested researchers will use our data sets to test out and develop new methods of geographical analysis.
Besag, J., Newell, J., 1991, 'The detection of clusters in rare disease', Journal of the Royal Statistical Society Ser. A, 143-155
Dobson, J. E., 1983, 'Automated geography', The Professional Geographer 35, 135-143
Macgill, J., Openshaw, S., 1998, 'The use of flocks to drive a Geographic Analysis Machine', Proceedings of Geocomputation 98, Bristol.
Openshaw, S., Charlton, M E., Wymer, C., Craft, A., 1987, 'A Mark I Geographical Analysis Machine for the automated analysis of point data sets', International Journal of Geographical Information Systems 1, 335-358.
Openshaw, S., Craft, A. 1991, Using the Geographical Analysis Machine to search for evidence of clusters and clustering in childhood leukaemia and non-Hodgkin lymphomas in Britain. In: Draper, G., ed., The Geographical Epidemiology of Childhood Leukaemia and Non-Hodgkin Lymphoma in Great Britain 1966-83, London, HMSO, p109-122
Openshaw, S., Fischer, M M., 1995, `A framework for research on spatial analysis relevant to geo-statistical information systems in Europe', Geographical Systems 2, 325-337
Openshaw, S. Perrée, T., 1996, `User centred intelligent spatial analysis of point data', in D. Parker (ed) Innovations in GIS 3 Taylor and Francis, London p.119-134
Openshaw, S., 1994, "Two exploratory space-time attribute pattern analysers relevant to GIS" in S Fotheringham and P Rogerson (eds) GIS and Spatial Analysis Taylor and Francis, London, p83-104
Openshaw, S., 1995 `Developing automated and smart spatial pattern exploration tools for geographical information systems applications', The Statistician 44, 3-16
Openshaw, S. and Openshaw, C., 1997, Artificial Intelligence in Geography. Wiley, Chichester
Openshaw, S., 1998, `Building automated Geographical Analysis and Exploration Machines' in P A Longley, S M Brooks, R Mcdonnell, B Macmillan (eds) Geocomputation: A primer, Wiley, Chichester, p95-115
Openshaw, S. and Turton, I., 1999a An introduction to High Performance Computing and the Art of Parallel Programming: for geographers, social scientists, and engineers. Routledge, London. (forthcoming)
Openshaw, S., Turton, I., 1999b, 'Using a Geographical Explanations Machine to Analyse Spatial Factors relating to Primary School Performance' (forthcoming)
Openshaw, S., Turton, I, Macgill, J., Davy, J., 1999b, `Putting the Geographical Analysis Machine on the Internet' In B Gittings (eds) Innovations in GIS 6, Taylor and Francis (forthcoming)
Openshaw, S. 1999, 'Geographical Data Mining: key design issues', Proceedings of GeoComputation '99 (forthcoming)
Openshaw, S., Turton, I., Macgill, J, 1999a, 'Using the Geographical Analysis Machine to analyse limiting long term illness', Geographical & Environmental Modelling 3, 83-99
Reynolds, C W., 1987, 'Flocks, herds, and schools: a distributed behavioral model', Computer Graphics 21, 25-34
_________________________________________________________________________________
1. spatial data
2. time data
3. multiple attribute data
4. geography and time data
5. time and multiple attribute data
6. geography and multiple attribute data
7. geography, time, and multiple attribute data
Data Set | Clusters | %Clustered | Size of Clusters |
1 | 3 | 20 | 132 48 20 |
2 | 1 | 20 | 200 |
3 | 0 | 0.0 | 0 |
4 | 4 | 15 | 123 1 1 25 |
5 | 2 | 10 | 87 13 |
6 | 2 | 5 | 45 5 |
7 | 2 | 2.5 | 24 1 |
8 | 1 | 2.5 | 25 |
9 | 1 | 1.0 | 10 |
10 | 2 | 1.0 | 8 2 |
11 | 0 | 0.0 | 0 |
Data Set | Clusters | %Clustered | Cluster Radius km | Type of Clustering |
1 | 3 | 30 | 5 | Space |
2 | 3 | 30 | 20 | Space |
3 | 3 | 60 | 5 | Space |
4 | 3 | 60 | 20 | Space |
5 | 3 | 30 | 5 | Space-Time |
6 | 3 | 30 | 20 | Space-Time |
7 | 3 | 60 | 5 | Space-Time |
8 | 3 | 60 | 20 | Space-Time |
9 | 3 | 30 | 5 | Space-Attribute |
10 | 3 | 30 | 20 | Space-Attribute |
11 | 3 | 60 | 5 | Space-Attribute |
12 | 3 | 60 | 20 | Space-Attribute |
13 | 3 | 30 | 5 | Space-Time-Attribute |
14 | 3 | 30 | 20 | Space-Time-Attribute |
15 | 3 | 60 | 5 | Space-Time-Attribute |
16 | 3 | 60 | 20 | Space-Time-Attribute |
Notes: C = number of clusters
T = Time clustering (Y/N)
A = Attribute interaction (Yes/Space/Time)
1 two of the clusters in this data set are very small, all the methods missed these two clusters
2 one of the clusters was very rural and these methods centred the detected cluster over a nearby town.