AUTOCLUST: Automatic clustering via boundary extraction for mining massive
point-data sets
Vladimir Estivill-Castro and Ickjai Lee
Department of Computer Science & Software Engineering, The University
of Newcastle, Callaghan, NSW 2308, Australia
E-mail: vlad@cs.newcastle.edu.au,
ijlee@cs.newcastle.edu.au
Abstract
Widespread clustering methods require user-specified arguments and prior
knowledge to produce their best results. This demands pre-processing and/or
several trial and error steps. Both are extremely expensive and inefficient
for massive data sets. The need to find best-fit arguments in semi-automatic
clustering is not the only concern: the manipulation of data to find the
arguments opposes the philosophy of ''let the data speak for themselves''
that underpins exploratory data analysis. Our new approach consists of
effective and efficient methods for discovering cluster boundaries in point-data
sets. The approach automatically extracts boundaries based on Voronoi modelling
and Delaunay Diagrams. Parameters are not specified by users in our automatic
clustering. Rather, values for parameters are revealed from the proximity
structures of the Voronoi modelling, and thus, an algorithm, AUTOCLUST,
calculates them from the Delaunay Diagram. This not only removes human-generated
bias, but also reduces exploration time. The effectiveness of our approach
allows us to detect not only clusters of different densities, but also sparse
clusters adjacent to high-density clusters. Multiple bridges linking clusters
are identified and removed. All this is performed within O(n
log n) expected time, where n is the number of data points.
We evaluate AUTOCLUST's time efficiency and clustering quality. We compare
and contrast AUTOCLUST with other algorithms for clustering large geo-referenced
sets of points. A series of detailed performance comparisons with both
synthetic data sets and real data sets confirms the virtues of our approach.
1 Introduction
As the size of spatial data increases exponentially
and the structure of data becomes more complex, data mining and knowledge
discovery techniques become essential tools for successful analysis of
large spatial data sets [17].
Clustering is central to Spatial Data Mining [7,20]
or Geographical Data Mining [24].
It partitions point-data sets into smaller groups according to proximity. The resulting
groups represent geographically interesting concentrations in which further
investigations should be undertaken to find possible causal factors. Numerous
clustering approaches have been suggested within data mining [1,12,13],
spatial databases [4,5,16,20,25,26,30],
statistics [9,18,19,28]
and GIS communities [6,7,15,22].
Different clustering methods have been proven to offer different capabilities,
different domains of applicability and different computational requirements.
Nevertheless, clustering methods share their needs for user-specified
arguments and prior knowledge to produce their best results. Such information
needs are supplied as density threshold values, merge/split conditions,
number of parts, prior probabilities, assumptions about the distribution
of continuous attributes within classes, and/or kernel size for intensity
testing (for example, grid size for raster-like clustering and radius for
vector-like clustering). This parameter tuning is extremely expensive and
inefficient for huge data sets because it demands pre-processing and/or
several trial and error steps.
The need to find best-fit arguments in wide-spread semi-automatic clustering
is not the only concern: the manipulation of data to find the arguments
opposes Openshaw's famous postulation [23]
to ''Let the data speak for themselves''. This postulation underpins exploratory
spatial data analysis. Hypotheses should be uncovered from the data, not
from the user's prior knowledge and/or assumptions. So far, clustering
approaches detect relatively high-density areas (localised excesses) within
the study region. However, this clustering principle alone has difficulty
in determining exactly what is high density and what is relatively low density.
Thus, it frequently fails to find slightly sparse groups when they are
adjacent to high-density groups, despite their great importance for further
investigations. We present specific illustrations of these problems in
Section 2.
We overcome these problems by precise modelling of spatial proximity
in association with density. Density is a natural principle, but it should
be combined with relative proximity. Precise modelling of spatial proximity
means a capability to obtain unambiguous information about global effects
and local effects since geographical phenomena are the result of global
first-order effects and local second-order effects [2].
Global effects indicate the degree of excess in a global sense, while local
effects represent local variations. Combining both effects reveals localised groups.
Gold [10]
argues that the traditional modelling approaches are not able to capture
spatial proximity uniquely or consistently for point-data sets. As an alternative,
Voronoi modelling has several advantages for representing proximity relations of points
over conventional vector-like or raster-like modelling. In particular,
Voronoi models uniquely represent discrete point-data
sets and consistently capture spatial proximity [10].
We show that, as a consequence, automatic clustering is feasible.
Delaunay Diagrams (these remove ambiguities from Delaunay Triangulations
when co-circular quadruples are present), as duals of Voronoi Diagrams,
are used as the source of our approach. Section 3
discusses the use of Delaunay Triangulation in modelling spatial proximity
and in clustering. With this structure, we find clusters by locating sharp
gradient changes in point density (cluster boundaries). In our automatic
clustering, users do not specify arguments. We believe any parameters must
be encoded in the proximity structures of the Voronoi modelling, and thus,
our algorithm, AUTOCLUST, calculates them from the Delaunay Diagram. This
not only removes human-generated bias, but also reduces exploration time.
The intuition behind boundary finding is that, in the Delaunay Diagram,
points in the border of a cluster tend to have greater standard deviation
of length of their incident edges since they have both short edges and
long edges. The former connect points within a cluster and the latter straddle
between clusters or between a cluster and noise. This characteristic of
border points is the essence for formulating a dynamic edge-elimination
criterion. The criterion has the flavour of a statistical test, labelling
edges whose lengths lie some number of standard deviations away from the mean. However,
the principles behind our methods go far beyond this initial intuition
and at least two justifications appear in Section 4.
Removal of edges labelled as too long from the Delaunay Diagram is the
first phase that extracts initial rough boundaries. This phase takes into
account local and global variations to ensure that relatively homogeneous
clusters are not broken up into tiny uninteresting sub-sets and relatively
heterogeneous sub-sets are segmented into meaningful sub-clusters. Section
4 details this process as well as two subsequent
analysis phases in order to eliminate some difficulties with graph-based
clustering. This process removes bridges (line-like paths with a small
number of points) that connect clusters.
The effectiveness of our approach allows us to detect not only clusters
of different densities, but sparse clusters near to high-density clusters.
Multiple bridges linking clusters are identified and removed. All this is
performed within O(n log n) expected time, where n is
the number of data points. We confirm AUTOCLUST's time efficiency and clustering
quality experimentally in Section 5. These
performance evaluations use both synthetic data sets and real data sets.
We contrast AUTOCLUST with other algorithms for clustering large geo-referenced
sets of points. Detailed information can be found in the companion technical
report [8].
It categorises clustering methods into 6 groups on the basis of their working
principles. We discuss their major problems in the light of AUTOCLUST.
Finally, we point out some other areas of impact of our approach with concluding
remarks in Section 6.
2 Mere density fails
Spatial clustering for massive data sets has become central in exploratory
data analysis. Early efforts like GAM [22]
and CLARANS [20]
detect clusters within huge data sets but demand high CPU effort (and thus,
are costly with respect to computational time). Later, BIRCH [30]
was able to find clusters in the presence of noise. DBSCAN [5]
and others [4,12,15]
detect clusters of arbitrary shapes. Recently, CHAMELEON [16]
and AMOEBA [7]
report clusters of different densities when data sets exhibit bridges and
high discrepancy in density and noise. Despite a certain level of maturity
in spatial clustering, methods still demand parameter tuning from the users
and fail to find some interesting groups.
2.1 Sparse clusters adjacent to high-density clusters
Clustering approaches focus on finding relatively high-density areas (density-based
and grid-based), merging closer groups (hierarchical and model-based),
assigning to closer prototypical representatives (partitioning and model-based)
or detecting shorter or stronger interactions (graph-based). All typically
miss relatively sparse clusters adjacent to high-density clusters. Consider
the data set shown in Figure 1(a).
Figure 1: A sparse cluster adjacent to two high-density clusters: (a) a
data set (n = 600); (b) three high-density clusters; (c) one big
cluster and one high-density cluster; (d) one sparse cluster; (e) four
clusters.
Most clustering algorithms report either
-
three very dense clusters (C1, C2 and C3 in Figure
1(b)) when user-supplied arguments combine to reveal
very high-density clusters, or
-
one big cluster (C23 in Figure 1(c)) and
one dense cluster (C1), when user-supplied arguments are tuned to
relax the criterion function in order to detect the sparse cluster (C4
in Figure 1(d)).
These algorithms are not able to detect the three high-density clusters
and the sparse cluster (Figure 1(e)) simultaneously.
Clustering methods are inclined to report the big cluster (C23
in Figure 1(c)) as one cluster because the intra-cluster
distance of cluster C4 (sparse cluster) is greater than the inter-cluster
distance between the high-density clusters (C2 or C3) and
the sparse cluster (C4). Apart from the three high-density clusters,
the sparse cluster is also important for further correlation investigation
between themes. Spotting and locating interesting groups in a particular
theme is the starting point of exploratory data analysis. If we assume
that points represent crime incidents around cities and surrounding suburbs,
then mere density fails to reveal the correlation between crime incidents
and suburbs, because it misses the distinction between the sparse cluster and the
high-density clusters. Often, this phenomenon is found in real data, such
as big cities having more crime incidents surrounded by suburbs recording
relatively less. Moreover, detection of a boundary allows for exploration
of a geographical feature that can explain the sharp decrease in density
and frequency (the crimes reported are for an interval of time).
2.2 More than one line-like bridge between two
clusters
Graph-based clustering [4,7,15,16,29]
methods have gained popularity as alternatives to density methods since
they work from an underlying graph model. The observations (points) become
vertices and a discrete relation is encoded by connecting two related
vertices with an edge. In this abstract structure, a cut-point
is a vertex whose removal disconnects the graph and a bridge is
an edge whose removal disconnects the graph [14].
Connected components are the most elementary clusters. Edges have a weight
corresponding to the geographical distance between the associated end-vertices.
Hierarchical clustering with single linkage corresponds to computing
the Minimum Spanning Tree of the weighted graph. However, it is well known
that single linkage suffers from bridges. Problems also arise for other
strategies when clusters are connected only at bridges or chains. Well-known
solutions for this find all bridges and cut-points [29].
This works when only one bridge connects two clusters.
Figure 2: Multiple bridges: (a) a data set (n = 555); (b) three
clusters; (c) three clusters and bridges; (d) one big cluster and one small
cluster.
Consider the data in Figure 2(a). Figure 2(b)
shows three clusters (C1, C2 and
C3). Figure 2(c)
shows that three path-like bridges connect clusters C1 and C2, while one
connects clusters C2 and C3. Consequently, there are no graph-theoretic
bridges or cut-points between clusters C1 and C2, while there are between
clusters C2 and C3. Therefore, the approach using bridges and cut-points fails
to separate two clusters linked by multiple bridges, and thus generates the two clusters
C12
and C3 as shown in Figure 2(d).
2.3 Closely located high-density clusters
Clusters emerge as a combination of global effects and local effects. Clusters
may share global first-order effects, but have their own local second-order
effects [7].
Global user-supplied arguments are to be used only when the set of points
is under the same global circumstance. Thus, the use of global user-supplied
arguments should be minimised, not only to allow for local variation, but
also to spot high-quality clusters.
Figure 3: Closely located high-density clusters: (a) a data set (n
= 570); (b) five clusters; (c) four high-density clusters; (d) three high-density
clusters and one sparse cluster.
Visual inspection of Figure 3(a) reveals four
high-density clusters and one sparse cluster. Figure 3(b)
highlights the four high-density clusters with labels C1 to C4
and one sparse cluster with label C5. However, using only global
arguments it is very difficult to obtain a reply from a clustering system
that suggests these five clusters simultaneously. Depending on the settings
of user-supplied arguments, the reply alternates between one that
-
detects four high-density clusters (C1, C2, C3 and
C4) leaving the sparse cluster (C5) out, perhaps as noise
(refer to Figure 3(c)), or
-
detects three high-density clusters (C1, C2 and C34)
and one sparse cluster (C5) as illustrated in Figure 3(d).
The former is the result of arguments that are slightly out of tune; for example, a shorter
cut-off in the criterion function (graph-based), a higher density parameter (density-based)
or a smaller kernel, normally a circle or rectangle (density-based
or grid-based). By relaxing the global arguments, the sparse cluster now
can be identified. But typically, because the separation between C3
and C4 is smaller than the intra-cluster distance in cluster
C5,
the two are combined into one high-density cluster C34 (refer to
Figure 3(d)).
2.4 Argument free
Minimising the need to adjust arguments reduces the risk of tampering with
the analysis tools. It also promises higher user friendliness. With our
approach, non-experts can now expect good quality clusters without assistance
from experts towards argument tuning. In addition, it reduces the experimentation
time spent searching for best-fit arguments. Clustering for knowledge discovery is
for revealing implicitly hidden information and patterns in data. Thus,
our approach supports automatic exploratory analysis [23],
rather than human-driven confirmatory analysis.
3 Using Delaunay Diagrams
3.1 Modelling proximity with Delaunay Diagrams
Two-dimensional clustering is the non-trivial process of grouping geographically
closer points into the same cluster. Therefore, a model of spatial proximity
for a discrete point-data set
P = {p1,p2,...,pn}
must provide robust answers to two questions: which points are the neighbours of a point pi,
and how far away are those neighbours relative to the context of the entire data
set P. Once spatial proximity is properly modelled, clustering becomes
a matter of finding relatively sudden changes.
This is similar to finding sharp gradient change, the boundaries of
clusters, in point density. Note that, we do not have a continuous density
but a discrete point set. The model that does not extrapolate or interpolate
is the discrete graph. One possibility is to use the complete graph, with
all edges weighted by the geographical distance. Here, the strength of interaction
between points is inversely proportional to the length of the edge (recall
the distance-decay effect [2]).
However, if we have n data points, there is no need to have a model
of size O(n^2). Moreover, clustering
is a summarisation operation and the goal is to obtain the discrete relation
pi is in the same cluster as pj for
all pi, pj ∈ P (a sub-relation of the complete graph). Thus, clustering is to
detect and remove inconsistent interactions (edges) in the graph modelling
proximity.
We already mentioned that, although raster and vector representations
are widely used in modelling geographical databases, they do not consistently
capture spatial proximity for point-data sets [10,11].
The Voronoi diagram offers an alternative to overcome the drawbacks of
conventional data models [10].
The Voronoi diagram uniquely captures spatial proximity and represents
the topology explicitly with the dual graph known as the Delaunay Triangulation.
Each Delaunay edge is an explicit representation of a neighbourhood relation
between points.
Because Delaunay Triangulations are not unique when co-circularity occurs,
we use Delaunay Diagrams. These are the result of removing all edges e
in the Delaunay Triangulations that are sides of a triangle whose vertices
are co-circular with a fourth point. Even in the presence of co-circularity,
Delaunay Diagrams guarantee a unique topology. They can be constructed
effectively and efficiently (only O(n log n) time).
The structure is succinct [21];
the number of edges is linear in the number of points and the expected
number of edges incident to a point is a constant. In addition, the
effect of long edges is minimised since the Delaunay Diagram is as equilateral
as possible. Further, it is a hybrid of the two widely adopted
models: it is geographically comprehensive like the raster model and holds
explicit adjacency like the vector model.
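As a concrete illustration of this proximity model, the following is a minimal sketch, assuming scipy and numpy are available; it builds the Delaunay edge set for a point array but omits, for brevity, the co-circularity clean-up that turns the Triangulation into the unique Delaunay Diagram.

# Minimal sketch, assuming scipy and numpy: build the Delaunay edge set used
# as the proximity model (co-circularity clean-up omitted).
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the set of undirected Delaunay edges for a 2-D point array."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:          # each simplex is a triangle (i, j, k)
        for a in range(3):
            for b in range(a + 1, 3):
                i, j = sorted((int(simplex[a]), int(simplex[b])))
                edges.add((i, j))
    return edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((500, 2))               # 500 random points in the unit square
    E = delaunay_edges(P)
    # The edge count is linear in n (at most 3n - 6 for a planar graph).
    print(len(P), "points,", len(E), "Delaunay edges")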
3.2 Previous clustering methods using Delaunay
Triangulations
Delaunay Triangulations represent proximity relationships so succinctly and powerfully
that several clustering methods have been based on them.
The simplest example is, again, the Minimum Spanning Tree, or single-linkage
hierarchical clustering.
More robust alternatives have been proposed along the paradigm of edge-removal
from the Delaunay Triangulation to obtain the clusters. One alternative
looks at the perimeter (sum of length of edges) of the triangles (or facets)
of the Delaunay Diagram. The intuition is that, given the nature of Delaunay
Diagrams (where a circumcircle of a Delaunay Triangle does not include
any other data point), triangles with large perimeter must have vertices
that are not in the same cluster. Estivill-Castro and Houle [6]
successfully derive the number of clusters using a one-dimensional classification
of perimeter values. Other approaches are those of Kang
et al. [15]
and Eldershaw and Hegland [4].
They analyse the length of Delaunay edges to find clusters of arbitrary
shapes. The intuition is similar, and edges that are too long are classified
as inter-cluster edges and removed. The relative success of these previous
approaches demonstrates that Delaunay Triangulations contain much implicit
proximity information.
However, these previous ideas use Delaunay Triangulations directly and
return inconsistent results in the presence of co-circularity. More seriously,
they use a global argument as a threshold to discriminate perimeter values
or edge lengths. This kind of static approach is not able to detect local
variations, and again, fails to identify clusters in situations like those
illustrated in Section 2. Recently, Estivill-Castro
and Lee [7]
propose a dynamic approach which utilises both global and local variations.
The idea behind this method is to find relatively high-density clusters
or localised excesses based on the Delaunay Diagram. This approach overcomes
some problems of static approaches and produces localised multi-level clusters.
But, it still fails to find relatively sparse clusters in situations like
those in Section 2.1 and Section 2.3.
4 AUTOCLUST
4.1 Intuition and definitions
The previous approaches justify the analysis of Delaunay Triangulations
or Delaunay Diagrams to detect inter-cluster edges (Delaunay edges whose
endpoints are in different clusters). We postulate that, in Delaunay Diagrams,
border points of clusters exhibit greater variability (dispersion or spread)
of length of their incident edges since they have both inter-cluster edges
and intra-cluster edges (Delaunay edges whose endpoints are in the same
cluster).
There are several measures for describing dispersion in statistical
spatial analysis. Undoubtedly, the standard deviation and its square, the
variance, are the most valuable and popular [27].
Thus, for the moment, we may say that AUTOCLUST detects points whose incident
edges exhibit lengths that have unusually large standard deviation. In
order to precisely define this notion we introduce the following notation.
In what follows, let P = {p1,p2,...,
pn} denote a set of n points in 2 dimensions and
let DD(P) denote the corresponding Delaunay Diagram of P.
The graph DD(P) is a planar map that contains vertices and
edges. Edges of DD(P) will be called Delaunay edges. For
a point pi ∈ P (a vertex
of DD(P)), the neighbourhood N(pi)
is the set of Delaunay edges incident to point
pi.
Definition 4.1.1 We denote by Local_Mean(pi) the mean length of the edges in N(pi). That is,

$\mathit{Local\_Mean}(p_i) = \frac{1}{d(p_i)} \sum_{e_j \in N(p_i)} |e_j|,$

where d(pi) denotes the number of Delaunay edges incident to pi (the degree of pi in the graph-theory sense), and |ej| denotes the length of the Delaunay edge ej.
Definition 4.1.2 We denote by Local_St_Dev(pi) the standard deviation of the length of the edges in N(pi). That is,

$\mathit{Local\_St\_Dev}(p_i) = \sqrt{\frac{1}{d(p_i)} \sum_{e_j \in N(p_i)} \big( \mathit{Local\_Mean}(p_i) - |e_j| \big)^2 }.$
Thus, an edge ej incident to a point pi could potentially be labelled as inter-cluster (too long) if

$|e_j| > \mathit{Local\_Mean}(p_i) + \mathit{Local\_St\_Dev}(p_i).$

However, this would only involve local effects. More seriously, Local_Mean(pi)
and Local_St_Dev(pi) are statistical indicators based on only d(pi) length values.
This is a very small sample, because E[d(pi)] = 6 in Delaunay Diagrams.
Thus, we now introduce measures that incorporate global and local effects;
these measures result in tests that make a decision on each edge using the
data of all edges in DD(P).
Definition 4.1.3 We denote by Mean_St_Dev(P) the average of the Local_St_Dev(pi) values as we let pi be each point in P. That is,

$\mathit{Mean\_St\_Dev}(P) = \frac{1}{n} \sum_{i=1}^{n} \mathit{Local\_St\_Dev}(p_i).$
Note that, a value of edge-length spread like Local_St_Dev(pi)
is large or small only in relation to the global value reflected in Mean_St_Dev(P).
Thus, we introduce an indicator of the relative size of Local_St_Dev(pi)
with respect to Mean_St_Dev(P).
Definition 4.1.4 We let Relative_St_Dev(pi) denote the ratio of Local_St_Dev(pi) and Mean_St_Dev(P). That is,

$\mathit{Relative\_St\_Dev}(p_i) = \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Mean\_St\_Dev}(P)}.$
With this tool we propose to discriminate edges into those unusually
short, those unusually long and those that seem to be proper intra-cluster
edges.
Definition 4.1.5 An edge ej ∈ N(pi) belongs to Short_Edges(pi) if and only if the length of ej is less than Local_Mean(pi) - Mean_St_Dev(P). That is,

$\mathit{Short\_Edges}(p_i) = \{\, e_j \in N(p_i) : |e_j| < \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \,\}.$
Definition 4.1.6 An edge ej ∈ N(pi) belongs to Long_Edges(pi) if and only if the length of ej is greater than Local_Mean(pi) + Mean_St_Dev(P). That is,

$\mathit{Long\_Edges}(p_i) = \{\, e_j \in N(p_i) : |e_j| > \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P) \,\}.$
Definition 4.1.7 Other_Edges(pi) denotes the subset of N(pi) whose edges are greater than or equal to Local_Mean(pi) - Mean_St_Dev(P), and less than or equal to Local_Mean(pi) + Mean_St_Dev(P). That is,

$\mathit{Other\_Edges}(p_i) = \{\, e_j \in N(p_i) : \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \le |e_j| \le \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P) \,\}.$
Note that, for each pi, the neighbourhood N(pi)
is partitioned into exactly 3 subsets: Short_Edges(pi),
Long_Edges(pi)
and Other_Edges(pi). However, an edge linking
pi with pj may belong, for example,
to both
Short_Edges(pi) and Other_Edges(pj).
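As an illustration of Definitions 4.1.1 to 4.1.7, the following sketch computes the local and global indicators and classifies every incident edge. The function and variable names are ours, not part of AUTOCLUST's published implementation, and the sketch assumes the point array and edge set produced by the earlier Delaunay sketch.

# Illustrative sketch of Definitions 4.1.1-4.1.7 (names are illustrative, not
# the authors' code); reuses the points and edges from the previous sketch.
import numpy as np

def classify_edges(points, edges):
    n = len(points)
    incident = [[] for _ in range(n)]          # N(p_i): edges incident to p_i
    for (i, j) in edges:
        length = float(np.linalg.norm(points[i] - points[j]))
        incident[i].append(((i, j), length))
        incident[j].append(((i, j), length))

    # Local_Mean and Local_St_Dev (population standard deviation, 1/d(p_i))
    local_mean = np.array([np.mean([l for _, l in inc]) for inc in incident])
    local_std = np.array([np.std([l for _, l in inc]) for inc in incident])
    mean_st_dev = float(local_std.mean())      # Mean_St_Dev(P)

    short = [set() for _ in range(n)]
    long_ = [set() for _ in range(n)]
    other = [set() for _ in range(n)]
    for i in range(n):
        lo = local_mean[i] - mean_st_dev       # acceptance interval bounds
        hi = local_mean[i] + mean_st_dev
        for (e, length) in incident[i]:
            if length < lo:
                short[i].add(e)                # Short_Edges(p_i)
            elif length > hi:
                long_[i].add(e)                # Long_Edges(p_i)
            else:
                other[i].add(e)                # Other_Edges(p_i)
    return short, long_, other, mean_st_dev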
4.2 Working principle
The first phase of our algorithm involves the identification, for each
pi, of Long_Edges(pi), Short_Edges(pi)
and
Other_Edges(pi). While it is now possible
to declare
Long_Edges as links between points in different clusters,
simply because they clearly indicate that the points are too far away,
it is not as simple to assume that Short_Edges are links for points
within a cluster. In some cases, these very short links also correspond
to points that align themselves on bridges between clusters. For example,
consider the data in Figure 4(a). The Delaunay
Diagram of these 200 points is shown in Figure 4(b).
Some Short_Edges of the data set are illustrated in Figure 4(c).
They correspond to two bridges between the two clusters. This is because
other Delaunay edges incident to points in the bridges are long, so the
edges on the bridges are exceptionally short. Note that edges inside any
of the two clusters are of approximately the same length as edges on the
bridge. But, edges incident to a point pi inside a cluster
are almost all of equal length. Thus, removing all edges e ∈
Short_Edges(pi) eliminates some bridges,
while if such an edge e happens to lie inside a cluster, it can later
be recuperated using the analysis of connected components. Later phases of
our algorithm handle other types of bridges and recuperate
Short_Edges
that are intra-cluster links.
Figure 4: Two clusters connected by two bridges: (a) a data set (n
= 200); (b) its Delaunay Diagram; (c) the
Short_Edges on bridges.
4.2.1 First justification
While we have the intuitively supported rationale in AUTOCLUST to remove
all edges that are in Long_Edges(pi)
Short_Edges(pi) for some pi
P, we now give another justifications. Note that the analysis of
edges in DD(P) is the means towards grouping points, towards
clustering. We noted that the relation pi is in the same
cluster as pj is a subset of the complete graph and encodes
the clustering. This is what establishes the connection between grouping
points and the analysis of edges. Previous clustering algorithms using
Delaunay Triangulation [4,6,15]
focus much more on edges because all operations and cut-off criterion functions
are performed and formulated based on edge analysis. AUTOCLUST is closer
to the objective of grouping points because
Local_Mean(pi)
and Local_St_Dev(pi) report on local behaviour
around pi. These values represent behaviour of points
more than that of edges. Global effects are incorporated in discriminating
Long_Edges(pi), Short_Edges(pi)
and
Other_Edges(pi), with the use of Mean_St_Dev(P).
This makes all the n points influence the decisions for any other
pi. Perturbations of a point pj away
from a point pi do not affect the Delaunay Diagram in
the vicinity of pi. However, the relative configuration
of other points pj away from pi does
influence Mean_St_Dev(P).
4.2.2 Second justification
As a second justification to AUTOCLUST's removal of all edges that are
in Long_Edges(pi) ∪ Short_Edges(pi) for some pi ∈ P,
first recall that border points pi of a cluster
C are expected to have Local_Mean(pi) and
Local_St_Dev(pi)
larger than points totally inside the cluster C. Thus, if Local_Mean(pi)
and
Local_St_Dev(pi) were statistically meaningful, in the sense of being supported
by a large sample, a statistical test to remove unusually long edges would have the form

$|e_j| > \mathit{Local\_Mean}(p_i) + k \cdot \mathit{Local\_St\_Dev}(p_i),$

where k would be a constant to evaluate how unusual an edge is.
However, for a point pi inside a cluster C,
there is uniformity around it, so all its incident edges have similar lengths
and Local_St_Dev(pi) is relatively small. In this
case, we would set the value of k to be larger than 1 in order to
widen the acceptance interval and make sure we preserve edges inside a
cluster. On the other hand, for a point pi that does
not belong to any cluster (an outlier or noise), there are long edges around
it to reach other points. This makes
Local_Mean(pi) large, and because Local_St_Dev(pi) is in the same units as
Local_Mean(pi), Local_St_Dev(pi) is also relatively large. Thus, in this case,
we would set the value of k to be less than one in order to narrow the acceptance interval and
ensure we remove more edges incident to an outlier.
The inverse of Relative_St_Dev(pi) fulfils
perfectly the role required for the factor k. It is less than one
for those points
pi that locally exhibit more spread
(of length of their incident edges) than the generality of all points in
P. It is larger than one for those points pi that
locally exhibit less spread than the generality of all points in P.
Incorporating k = Relative_St_Dev(pi)^{-1} into the statistical test, we have the following derivation:

$\mathit{Local\_Mean}(p_i) - k \cdot \mathit{Local\_St\_Dev}(p_i) \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + k \cdot \mathit{Local\_St\_Dev}(p_i)$
$\Longleftrightarrow\; \mathit{Local\_Mean}(p_i) - \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Relative\_St\_Dev}(p_i)} \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Relative\_St\_Dev}(p_i)}$
$\Longleftrightarrow\; \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P).$

This is the acceptance interval for Other_Edges(pi).
4.3 A three-phase clustering algorithm
AUTOCLUST consists of three phases. Phase
I classifies incident edges into Short_Edges(pi),
Long_Edges(pi) and
Other_Edges(pi)
for each pi. Phase I also removes all edges in Short_Edges(pi)
and Long_Edges(pi) for all points pi.
This produces the boundaries of clusters, perhaps with some bridges and
perhaps isolating some points. For each point pi, Phase
II recovers edges in Short_Edges(pi) as long as
they do not merge non-trivial connected components; if this happens, it
may re-establish the connected components by removing some Other_Edges(pi).
Finally, Phase III looks for bridges in a local region of each pi.
Figure 5 shows the operation of these phases
in a data set with 4 clusters. The Delaunay Diagram is presented in Figure
5(a), while Figure 5(b)
shows the conclusion of Phase I. Application of the second phase results
in Figure 5(c). Isolated data points are attached
back. Phase III breaks the bridges between the two bottom clusters.
Figure 5: Cleaning procedures: (a) Delaunay Diagram; (b) after Phase I
(initial 3 clusters); (c) after Phase II (3 clusters with refinement);
(d) after Phase III (4 clusters).
4.3.1 Phase I: Finding boundaries
Our earlier discussion justifies in detail the removal of all edges in
all Long_Edges(pi) from the Delaunay Diagram.
They are simply too long and are merging two distinct clusters. We stress
that edges e = (pi,pj) in Short_Edges(pi)
where pj is in the same cluster are temporarily removed
and will be recuperated in the next phase because pi and
pj will be in the same connected component.
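A minimal sketch of this removal step, reusing the classification from the earlier sketch (the function name is illustrative), is the following.

# Sketch of the Phase I removal, reusing classify_edges from the earlier sketch:
# drop every edge that is in Long_Edges(p_i) or Short_Edges(p_i) for some p_i.
def phase_one(edges, short, long_):
    removed = set()
    for s in short:
        removed |= s        # Short_Edges may be recuperated in Phase II
    for l in long_:
        removed |= l        # Long_Edges are permanently removed
    return [e for e in edges if e not in removed]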
4.3.2 Phase II: Restoring and re-attaching
The goal of this phase is to restore the edges e = (pi,pj)
in Short_Edges(pi) where pi
and pj are in the same cluster. Also, the goal is to
remove some Other_Edges(pi) that are creating
paths that may result in bridges or links between border points and noise
points. Thus, the first step of Phase II is to compute connected components
and label each point
pi with its connected component
CC[pi]. If a connected component has more than
one element, it is called non-trivial. If
pi is the only
point in a connected component, then the point pi is
isolated. In that case CC[pi] is not a cluster.
To introduce the processing of Phase II, we bring attention to the following
observation.
Observation 4.3.1 Let pi ∈ P be such that
Other_Edges(pi) ≠ ∅.
Then, all edges e ∈ Other_Edges(pi) connect
pi to the same non-trivial connected component.
Using our notation, this means that if two edges
ej
= (pi,pj) and ek
= (pi,pk) are in
Other_Edges(pi),
then CC[pj] = CC[pk].
The observation follows trivially from the fact that pj
and pk are connected by the path ej,ek.
Now, we recuperate
Short_Edges, and in some cases, revise the connected
component for pi. If all edges in Short_Edges(pi)
suggest unanimously a connected component, pi is integrated
to that suggestion. If there are two edges in Short_Edges(pi)
suggesting different connected components, the largest is adopted. While
Other_Edges(pi) participate in the creation
of the connected components, they now yield precedence to the suggestion
by Short_Edges(pi) (which in the large
majority of the cases is the same suggestion). The component swap typically
occurs on isolated boundary points that have no Other_Edges (for
example, the isolated points on the left of the bottom cluster in Figure
5(b)). Details of this process are as follows.
-
For each point pi, we perform the following steps if Short_Edges(pi) ≠ ∅ and,
for some edge e = (pi,pj) ∈ Short_Edges(pi), CC[pj]
is a non-trivial connected component.
-
Determine a candidate connected component C for pi.
-
If there are two edges ej = (pi,pj) and ek = (pi,pk) in Short_Edges(pi)
with CC[pj] ≠ CC[pk], then
-
Compute, for each edge e = (pi,pj) ∈ Short_Edges(pi), the size |CC[pj]|,
and let M = max_j |CC[pj]|.
-
Let C be the class label of the largest connected component (if
there are two different connected components with cardinality M,
we let C be the one with the shortest edge to pi).
-
Otherwise, let C be the label of the connected component all edges
e ∈ Short_Edges(pi)
connect pi to.
-
If the edges in Other_Edges(pi) connect
to a connected component different from C, remove them. Note that
-
because of Observation 4.3.1 above, when this
case applies, all edges in Other_Edges(pi)
are removed, and
-
only in this case will pi swap non-trivial connected
components.
-
Restore all edges e ∈ Short_Edges(pi)
that connect to C.
-
Re-compute connected components.
-
For each singleton pi (|CC[pi]| = 1), we restore all Short_Edges(pi) if,
for every edge e = (pi,pj) ∈ Short_Edges(pi), CC[pj]
is the same connected component C.
Thus, Phase II incorporates points that are isolated in singleton connected
components to non-trivial connected components if and only if they are
very close (they have at least one Short_Edge to a non-trivial connected
component) in step 1. Singletons whose incident edges connect to other
singletons are re-attached in step 3. Phase II also re-assigns a point
pi linked to a cluster C by Other_Edges(pi)
if there are Short_Edges(pi) that indicate membership to a different cluster.
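The following is a simplified sketch of the main Phase II step, assuming the networkx library for connected components; it illustrates the restore/remove logic but omits the re-computation of components and the final singleton re-attachment of the procedure described above.

# Simplified sketch of Phase II, assuming networkx. Only the main
# restore/remove step is shown; component re-computation and singleton
# re-attachment are omitted.
import networkx as nx

def phase_two(n, kept_edges, short, other):
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from(kept_edges)               # graph after the Phase I removals
    comp = {v: c for c, nodes in enumerate(nx.connected_components(G)) for v in nodes}
    size = {}
    for v, c in comp.items():
        size[c] = size.get(c, 0) + 1

    for i in range(n):
        # components suggested by Short_Edges(p_i), non-trivial ones only
        cands = {comp[j] for (a, b) in short[i] for j in (a, b)
                 if j != i and size[comp[j]] > 1}
        if not cands:
            continue
        C = max(cands, key=lambda c: size[c])  # adopt the largest suggestion
        for (a, b) in list(other[i]):          # drop conflicting Other_Edges(p_i)
            j = b if a == i else a
            if comp[j] != C and G.has_edge(a, b):
                G.remove_edge(a, b)
        for (a, b) in short[i]:                # restore Short_Edges(p_i) towards C
            j = b if a == i else a
            if comp[j] == C:
                G.add_edge(a, b)
    return G                                   # connected components are the clusters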
4.3.3 Phase III: Detecting second-order inconsistency
After the first two phases, the current subgraph of
DD(P)
is the result of local analysis with regard to global effects. However,
to complete the analysis, the immediate neighbourhood is extended slightly.
Thus, we extend the notion of neighbourhood of a vertex pi
in a graph G as follows.
Definition 4.3.1 The k-neighbourhood N_{k,G}(pi)
of a vertex pi in a graph G is the set of edges of G that belong
to a path of k or fewer edges starting at pi.
The third phase now computes Local_Mean for 2-neighbourhoods.
Definition 4.3.2 We denote by Local_Mean_{k,G}(pi)
the mean length of edges in N_{k,G}(pi). That is,

$\mathit{Local\_Mean}_{k,G}(p_i) = \frac{1}{|N_{k,G}(p_i)|} \sum_{e_j \in N_{k,G}(p_i)} |e_j|.$

For each point pi, this phase simply removes all edges in N_{2,G}(pi) that are
long edges. That is, all e ∈ N_{2,G}(pi) such that
|e| > Local_Mean_{2,G}(pi) + Mean_St_Dev(P).
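A sketch of this 2-neighbourhood test, again assuming networkx and reusing the graph from the Phase II sketch and Mean_St_Dev(P) from Phase I (the names are illustrative), follows.

# Sketch of the Phase III test over 2-neighbourhoods; G is the graph from the
# Phase II sketch and mean_st_dev is Mean_St_Dev(P) computed earlier.
import numpy as np

def phase_three(G, points, mean_st_dev):
    def length(u, v):
        return float(np.linalg.norm(points[u] - points[v]))

    to_remove = set()
    for i in G.nodes():
        nbhd = set()                           # edges on paths of <= 2 edges from p_i
        for j in G.neighbors(i):
            nbhd.add(tuple(sorted((i, j))))
            for k in G.neighbors(j):
                if k != i:
                    nbhd.add(tuple(sorted((j, k))))
        if not nbhd:
            continue
        local_mean_2 = np.mean([length(u, v) for (u, v) in nbhd])
        for (u, v) in nbhd:
            if length(u, v) > local_mean_2 + mean_st_dev:
                to_remove.add((u, v))          # long relative to the 2-neighbourhood
    G.remove_edges_from(to_remove)
    return G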
4.4 AUTOCLUST only requires O(n
log n) time
Sub-quadratic time complexity is central for clustering algorithms in spatial
data mining. AUTOCLUST is applicable to clustering of very large geo-referenced
data since it only requires O(n log n) time. AUTOCLUST
requires O(n log n) time to construct the Delaunay
Diagram. Once this model of proximity is available, AUTOCLUST performs
the remaining work in linear time. The calculation of
Mean_St_Dev(P)
is bounded by twice the number of edges, and the edge elimination in Phase
I requires O(n) time. To compute connected components only
requires O(n) time, since this is linear in the sum of both
number of edges and number of points. Each test to restore Short_Edges(pi)
in Phase II is bounded by d(pi) and thus, the
total time for Phase II is also linear. Similarly, the local subgraph around
pi in Phase III has size O(d(pi)^2).
Because in the Delaunay Diagram we have
E[d(pi)]
= 6, the total execution in this phase requires time proportional to the
sum of the degrees of all vertices, which is
O(n). Consequently,
the time complexity of AUTOCLUST is dominated by computing Delaunay Diagram
in O(n log n) time.
5 Performance evaluation
This section presents results of experiments with synthetic data sets and
real data sets that clearly illustrate the robustness of AUTOCLUST.
5.1 Clustering quality with synthetic data sets
Figure 6: Synthetic data sets: (a) Dataset I (n = 8000 and 8 clusters);
(b) Dataset II (n = 8000 and 8 clusters); (c) Dataset III (n
= 8000 and 6 clusters); (d) Dataset IV ( n = 3250 and 12 clusters).
The data in Figure 6(a) is the result of
synthetic data generation aimed at creating situations where clustering
methods typically fail. Section 2 explains
these types of situations. The clustering obtained with AUTOCLUST demonstrates
its robustness. AUTOCLUST correctly identifies two clusters connected by
multiple bridges in the top-left corner of Figure 6(a).
AUTOCLUST has no difficulty in detecting a cross-shaped, but sparse cluster
around a high-density cluster at the top-right corner of that data set.
Again, this mimics real-world cases like a densely populated city surrounded
by slightly less dense suburbs. AUTOCLUST successfully reports the two
high-density clusters separated by a twisting narrow gap (at the bottom-left
of the data set). This mimics real-world situations where a river separates
two highly populated areas. AUTOCLUST correctly reports the doughnut-shaped
cluster and the surrounded island-shaped cluster (at the bottom-right of
that data set).
The data sets in Figure 6(b) and Figure
6(c) belong to CHAMELEON's benchmark [16].
Regardless of the challenge posed by different sizes, different densities,
and different separations (inter-cluster distances), AUTOCLUST correctly
identifies 8 clusters and 6 clusters, respectively. Some of these clusters
have very little space between them, but AUTOCLUST still detects these
boundaries.
The data in Figure 6(d) consists of 12
clusters of various shapes and different densities. Note that many are
not convex. AUTOCLUST successfully detects all of them.
5.2 Clustering quality with real data sets
The data sets displayed in Figure 7 are real data
sets of Queensland in Australia. Figure 7(a) displays
centres for living areas in Queensland and AUTOCLUST clustering results.
In the bottom-right of the data set, Brisbane, the capital city of Queensland,
creates one large cluster (incorporating 70 percent of the living areas).
AUTOCLUST detects this and, simultaneously, finds three relatively smaller
clusters along the coast which correspond to big cities such as Cairns,
Atherton and Townsville. Inland areas have no clusters. This highlights
the pattern of urban settlement in coastal areas and around capital cities.
Figure 7: Real data sets: (a) living areas (n = 1,837) and AUTOCLUST
grouping into 4 clusters; (b) landing grounds (n = 1,579) and AUTOCLUST
grouping into 3 clusters (greater than 1 % of n); (c) national parks
(n = 227) and AUTOCLUST grouping into 4 clusters (greater than 2%
of n).
The data set displayed in Figure 7(b) represents
landing grounds including aerodromes and airports. Visual inspection does
not suggest any significant cluster. Landing grounds seem to be scattered
all over Queensland. AUTOCLUST groups 90 percent of them into a large cluster
that covers almost all Queensland. AUTOCLUST's results are consistent with
the fact that there is no special pattern in landing grounds. AUTOCLUST
also detects two small clusters in the southern portion of Brisbane. Note
that visual inspection reveals a lower density of landing
grounds surrounding the concentrated living areas near Brisbane, creating
the two other clusters south and east of Brisbane. Densely populated urban
areas are not likely to have enough space for many airports. Conversely,
sparsely populated inland areas are likely to have enough space for small
aerodromes.
Figure 7(c) illustrates a data set for national
parks in Queensland. National parks are generally located along the coast
and the Great Barrier Reef. AUTOCLUST successfully derives clusters
directly overlying the major cities, urban areas and coastal areas. It
finds two high-density clusters including more than half of national parks
on the outer margin of Australia's continental shelf. Two less dense clusters
are detected around urban areas.
5.3 CPU time comparison
While AUTOCLUST requires O(n log n) time, we are interested
in verifying experimentally that the constants under the O notation
are small. This is to be expected, since the dominating
O(n
log n) time term in AUTOCLUST is the computation of the Delaunay
Diagram. Everything else requires O(n) time.
We compare AUTOCLUST with AMOEBA [7].
AMOEBA also requires O(n log n) time and has been
shown to be very efficient (with a small constant under the O notation).
We generated 7 types of data sets based on a simple mixture model. The
number of points varies from 1,000 to 100,000. Each data set has 10 randomly
distributed cluster centres within the unit square [0, 1] × [0, 1], and
10 percent of the total number of points are just noise points
(selected at random within the unit square [0, 1] × [0, 1]).
Since AUTOCLUST and AMOEBA require spatial proximity modelling
(Delaunay Diagram), the run time comparison shown in Figure 8
includes the time for constructing their respective data structures. All
the experiments were performed on a 550MHz Pentium III workstation with
128MB memory.
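A data generator along the lines described above can be sketched as follows; the per-cluster spread sigma is an assumption, since the exact mixture parameters are not stated here.

# Sketch of a generator matching the description: 10 random cluster centres in
# the unit square plus 10 percent uniform noise. sigma is an assumed spread.
import numpy as np

def make_dataset(n, n_centres=10, noise_fraction=0.10, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    centres = rng.random((n_centres, 2))           # centres in [0, 1] x [0, 1]
    n_noise = int(round(noise_fraction * n))
    labels = rng.integers(0, n_centres, size=n - n_noise)
    clustered = centres[labels] + rng.normal(0.0, sigma, size=(n - n_noise, 2))
    noise = rng.random((n_noise, 2))               # uniform noise points
    return np.clip(np.vstack([clustered, noise]), 0.0, 1.0)

points = make_dataset(10_000)                      # e.g. one of the tested sizes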
Figure 8: Comparison of CPU times in seconds.
Figure 8 demonstrates that AUTOCLUST outperforms
AMOEBA with respect to CPU-time by a small margin. Note that, however,
the exploration time to find best-fit arguments for AMOEBA is not included.
AUTOCLUST is argument free. Thus, in exploratory data analysis exercises
AUTOCLUST will easily outperform the use of AMOEBA or other clustering
methods that need parameter tuning. The experimental results shown in Figure
8 include the time requirements for I/O and the clustering
process. These experimental results illustrate that AUTOCLUST is efficient
in terms of time complexity and scalable to large data sets.
6 Final remarks
Spatial data mining is the discovery of interesting patterns, structures,
relationships or characteristics that may exist implicitly in spatial databases
[20].
This information is hidden and unknown. Thus, we do not know what kinds
of patterns exist, what causes them, how many clusters are embedded and
where they are. Only the data can genuinely reveal this information. Hypotheses
about the patterns are suggested from the data in exploratory data analysis.
This is very different from the confirmation of human-generated hypotheses
in confirmatory data analysis. Hence, for exploratory data analysis, the
need to supply prior knowledge should be minimised, removing human-generated
biases. AUTOCLUST is a clustering tool for massive geo-referenced data
sets free of user-supplied arguments. Moreover, this freedom eliminates
expensive human-based exploration time for finding best-fit arguments needed
for alternative approaches. AUTOCLUST incorporates global variations and
local variations. Hence, it is robust to noise, outliers, bridges and the
type of distribution. It is able to detect clusters with arbitrary shapes,
different sizes and different densities. Clusters with multiple bridges
are also correctly separated. It only requires O(n log n)
time, where
n is the size of the data set. A series of comparisons
confirms the efficiency and effectiveness of AUTOCLUST.
We expect that our methods will be extended in two directions. First, the
approach can be extended to 3 dimensions by computing a spanner graph of
linear size and obtaining local and global information efficiently [3]
from this graph. Second, the approach can naturally be extended to other
metrics with well-defined Delaunay Diagrams, such as the Manhattan distance or
other L_p distances. Further research will explore these directions.
References
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 94-105, 1998.
[2] T. C. Bailey and A. C. Gatrell. Interactive Spatial Analysis. Wiley, New York, 1995.
[3] M. T. Dickerson and D. Eppstein. Algorithms for proximity problems in higher dimensions. Computational Geometry - Theory and Applications, 5:277-291, 1996.
[4] C. Eldershaw and M. Hegland. Cluster Analysis using Triangulation. In B. J. Noye, M. D. Teubner, and A. W. Gill, editors, Computational Techniques and Applications: CTAC97, pages 201-208. World Scientific, Singapore, 1997.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
[6] V. Estivill-Castro and M. E. Houle. Robust Clustering of Large Geo-referenced Data Sets. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 327-337, 1999.
[7] V. Estivill-Castro and I. Lee. AMOEBA: Hierarchical Clustering Based on Spatial Proximity Using Delaunay Diagram. In Proceedings of the 9th International Symposium on Spatial Data Handling, 2000. To appear.
[8] V. Estivill-Castro and I. Lee. AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Technical Report 2000-03, Department of Computer Science and Software Engineering, University of Newcastle, 2000.
[9] C. Fraley. Algorithms for Model-Based Gaussian Hierarchical Clustering. SIAM Journal on Scientific Computing, 20(1):270-281, 1998.
[10] C. M. Gold. Problems with handling spatial data - The Voronoi approach. CISM Journal ACSGC, 45(1):65-80, 1991.
[11] C. M. Gold. The meaning of Neighbour. In G. Tinhofer and G. Schmidt, editors, Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, pages 220-235, Berlin, 1992. Springer-Verlag Lecture Notes in Computer Science 639.
[12] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 73-84, 1998.
[13] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the International Conference on Data Engineering, pages 512-521, 1999.
[14] F. Harary. Graph Theory. Addison-Wesley, Reading, MA, 1969.
[15] I. Kang, T. Kim, and K. Li. A Spatial Data Mining Method by Delaunay Triangulation. In Proceedings of the 5th International Workshop on Advances in Geographic Information Systems (GIS-97), pages 35-39, 1997.
[16] G. Karypis, E. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer: Special Issue on Data Analysis and Mining, 32(8):68-75, 1999.
[17] K. Koperski, J. Han, and J. Adhikari. Mining Knowledge in Geographical Data. Communications of the ACM. To appear.
[18] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26:354-359, 1983.
[19] F. Murtagh. Cluster analysis using proximities. In J. Hampton, R. Michalski, P. Theuns, and I. V. Mechelen, editors, Categories and Concepts: Theoretical Views and Inductive Data Analysis, pages 225-245. Academic Press, New York, 1993.
[20] R. T. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 144-155, 1994.
[21] A. Okabe, B. N. Boots, and K. Sugihara. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley & Sons, West Sussex, 1992.
[22] S. Openshaw. A Mark 1 Geographical Analysis Machine for the automated analysis of point data sets. International Journal of GIS, 1(4):335-358, 1987.
[23] S. Openshaw. Two exploratory space-time-attribute pattern analysers relevant to GIS. In S. Fotheringham and P. Rogerson, editors, Spatial Analysis and GIS, pages 83-104. Taylor and Francis, London, 1994.
[24] S. Openshaw. Geographical data mining: key design issues. In Proceedings of GeoComputation 99, 1999.
[25] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pages 144-155, 1997.
[26] W. Wang, J. Yang, and R. Muntz. STING+: An Approach to Active Spatial Data Mining. In Proceedings of the International Conference on Data Engineering, pages 116-125, 1999.
[27] R. Webster and M. A. Oliver. Statistical Methods in Soil and Land Resource Survey. Oxford University Press, New York, 1990.
[28] C. S. Wallace. Intrinsic Classification of Spatially Correlated Data. The Computer Journal, 41(8):602-611, 1998.
[29] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, 20(1):68-86, 1971.
[30] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.