AUTOCLUST: Automatic clustering via boundary extraction for mining massive
point-data sets
Vladimir Estivill-Castro and Ickjai Lee
Department of Computer Science & Software Engineering, The University
of Newcastle, Callaghan, NSW 2308, Australia
E-mail: vlad@cs.newcastle.edu.au,
ijlee@cs.newcastle.edu.au
Abstract
Widespread clustering methods require user-specified arguments and prior
knowledge to produce their best results. This demands pre-processing and/or
several trial and error steps. Both are extremely expensive and inefficient
for massive data sets. The need to find best-fit arguments in semi-automatic
clustering is not the only concern: the manipulation of data to find the
arguments opposes the philosophy of ''let the data speak for themselves''
that underpins exploratory data analysis. Our new approach consists of
effective and efficient methods for discovering cluster boundaries in point-data
sets. The approach automatically extracts boundaries based on Voronoi modelling
and Delaunay Diagrams. Parameters are not specified by users in our automatic
clustering. Rather, values for parameters are revealed from the proximity
structures of the Voronoi modelling, and thus, an algorithm, AUTOCLUST,
calculates them from the Delaunay Diagram. This not only removes human-generated
bias, but also reduces exploration time. The effectiveness of our approach
allows us to detect not only clusters of different densities, but also sparse
clusters adjacent to high-density clusters. Multiple bridges linking clusters
are identified and removed. All this is performed within O(n
log n) expected time, where n is the number of data points.
We evaluate AUTOCLUST's time efficiency and clustering quality. We compare
and contrast AUTOCLUST with other algorithms for clustering large geo-referenced
sets of points. A series of detailed performance comparisons with both
synthetic data sets and real data sets confirms the virtues of our approach.
1 Introduction
As the size of spatial data increases exponentially
and the structure of data becomes more complex, data mining and knowledge
discovery techniques become essential tools for successful analysis of
large spatial data sets [17].
Clustering is central to Spatial Data Mining [7,20]
or Geographical Data Mining [24].
It partitions point-data sets into smaller groups according to proximity. The resulting
groups represent geographically interesting concentrations in which further
investigations should be undertaken to find possible causal factors. Numerous
clustering approaches have been suggested within data mining [1,12,13],
spatial databases [4,5,16,20,25,26,30],
statistics [9,18,19,28]
and GIS communities [6,7,15,22].
Different clustering methods have been proven to offer different capabilities,
different domains of applicability and different computational requirements.
Nevertheless, clustering methods share their needs for user-specified
arguments and prior knowledge to produce their best results. Such information
needs are supplied as density threshold values, merge/split conditions,
number of parts, prior probabilities, assumptions about the distribution
of continuous attributes within classes, and/or kernel size for intensity
testing (for example, grid size for raster-like clustering and radius for
vector-like clustering). This parameter tuning is extremely expensive and
inefficient for huge data sets because it demands pre-processing and/or
several trial and error steps.
The need to find best-fit arguments in wide-spread semi-automatic clustering
is not the only concern: the manipulation of data to find the arguments
opposes Openshaw's famous postulation [23]
to ''Let the data speak for themselves''. This postulation underpins exploratory
spatial data analysis. Hypotheses should be uncovered from the data, not
from the user's prior knowledge and/or assumptions. So far, clustering
approaches detect relatively high-density areas (localised excesses) within
the study region. However, this clustering principle alone has difficulty
in determining exactly what is high density and what is relatively low density.
Thus, it frequently fails to find slightly sparse groups when they are
adjacent to high-density groups, despite their great importance for further
investigations. We present specific illustrations of these problems in
Section 2.
We overcome these problems by precise modelling of spatial proximity
in association with density. Density is a natural principle, but it should
be combined with relative proximity. Precise modelling of spatial proximity
means a capability to obtain unambiguous information about global effects
and local effects since geographical phenomena are the result of global
first-order effects and local second-order effects [2].
Global effects indicate the degree of excess in a global sense, while local
effects represent local variations. Combining both effects reveals localised groups.
Gold [10]
argues that the traditional modelling approaches are not able to capture
spatial proximity uniquely or consistently for point-data sets. As an alternative,
Voronoi modelling has several advantages for representing proximity relations of points
over conventional vector-like or raster-like modelling. In particular,
Voronoi models uniquely represent discrete point-data
sets and consistently capture spatial proximity [10].
We show that, as a consequence, automatic clustering is feasible.
Delaunay Diagrams (these remove ambiguities from Delaunay Triangulations
when co-circular quadruples are present), as duals of Voronoi Diagrams,
are used as the source of our approach. Section 3
discusses the use of Delaunay Triangulation in modelling spatial proximity
and in clustering. With this structure, we find clusters by locating sharp
gradient changes in point density (cluster boundaries). In our automatic
clustering, users do not specify arguments. We believe any parameters must
be encoded in the proximity structures of the Voronoi modelling, and thus,
our algorithm, AUTOCLUST, calculates them from the Delaunay Diagram. This
not only removes human-generated bias, but also reduces exploration time.
The intuition behind boundary finding is that, in the Delaunay Diagram,
points in the border of a cluster tend to have greater standard deviation
of length of their incident edges since they have both short edges and
long edges. The former connect points within a cluster and the latter straddle
between clusters or between a cluster and noise. This characteristic of
border points is the essence for formulating a dynamic edge-elimination
criterion. The criterion has the flavour of a statistical test, labelling
edges whose lengths lie some number of standard deviations away from the mean. However,
the principles behind our methods go far beyond this initial intuition
and at least two justifications appear in Section 4.
Removal of edges labelled as too long from the Delaunay Diagram is the
first phase that extracts initial rough boundaries. This phase takes into
account local and global variations to ensure that relatively homogeneous
clusters are not broken up into tiny uninteresting sub-sets and relatively
heterogeneous sub-sets are segmented into meaningful sub-clusters. Section
4 details this process as well as two subsequent
analysis phases in order to eliminate some difficulties with graph-based
clustering. This process removes bridges (line-like paths with a small
number of points) that connect clusters.
The effectiveness of our approach allows us to detect not only clusters
of different densities, but sparse clusters near to high-density clusters.
Multiple bridges linking clusters are identified and removed. All this is
performed within O(n log n) expected time, where n is
the number of data points. We confirm AUTOCLUST's time efficiency and clustering
quality experimentally in Section 5. These
performance evaluations use both synthetic data sets and real data sets.
We contrast AUTOCLUST with other algorithms for clustering large geo-referenced
sets of points. Detailed information can be found in the companion technical
report [8].
It categorises clustering methods into 6 groups on the basis of their working
principles. We discuss their major problems in the light of AUTOCLUST.
Finally, we point out some other areas of impact of our approach with concluding
remarks in Section 6.
2 Mere density fails
Spatial clustering for massive data sets has become central in exploratory
data analysis. Early efforts like GAM [22]
and CLARANS [20]
detect clusters within huge data sets but demand high CPU effort (and thus,
are costly with respect to computational time). Later, BIRCH [30]
was able to find clusters in the presence of noise. DBSCAN [5]
and others [4,12,15]
detect clusters of arbitrary shapes. Recently, CHAMELEON [16]
and AMOEBA [7]
report clusters of different densities when data sets exhibit bridges and
high discrepancy in density and noise. Despite a certain level of maturity
in spatial clustering, methods still demand parameter tuning from the users
and fail to find some interesting groups.
2.1 Sparse clusters adjacent to high-density clusters
Clustering approaches focus on finding relatively high-density areas (density-based
and grid-based), merging closer groups (hierarchical and model-based),
assigning to closer prototypical representatives (partitioning and model-based)
or detecting shorter or stronger interactions (graph-based). All typically
miss relatively sparse clusters adjacent to high-density clusters. Consider
the data set shown in Figure 1(a).
Figure 1: A sparse cluster adjacent to two high-density clusters: (a) a
data set (n = 600); (b) three high-density clusters; (c) one big
cluster and one high-density cluster; (d) one sparse cluster; (e) four
clusters.
Most clustering algorithms report either
-
three very dense clusters (C1, C2 and C3 in Figure
1(b)) when user-supplied arguments combine to reveal
very high-density clusters, or
-
one big cluster (C23 in Figure 1(c)) and
one dense cluster (C1), when user-supplied arguments are tuned to
relax the criterion function in order to detect the sparse cluster (C4
in Figure 1(d)).
These algorithms are not able to detect the three high-density clusters
and the sparse cluster (Figure 1(e)) simultaneously.
Clustering methods are inclined to report the big cluster (C23
in Figure 1(c)) as one cluster because the intra-cluster
distance of cluster C4 (sparse cluster) is greater than the inter-cluster
distance between the high-density clusters (C2 or C3) and
the sparse cluster (C4). Apart from the three high-density clusters,
the sparse cluster is also important for further correlation investigation
between themes. Spotting and locating interesting groups in a particular
theme is the starting point of exploratory data analysis. If we assume
that points represent crime incidents around cities and surrounding suburbs,
then mere density fails to reveal the correlation between crime incidents
and suburbs, because it misses the distinction between the sparse cluster and the
high-density clusters. Often, this phenomenon is found in real data, such
as big cities having more crime incidents surrounded by suburbs recording
relatively less. Moreover, detection of a boundary allows for exploration
of a geographical feature that can explain the sharp decrease in density
and frequency (the crimes reported are for an interval of time).
2.2 More than one line-like bridge between two
clusters
Graph-based clustering [4,7,15,16,29]
methods have gained popularity as alternatives to density methods since
they work from an underlying graph model. The observations (points) become
vertices and a discrete relation is encoded by connecting two related
vertices with an edge. In this abstract structure, a cut-point
is a vertex whose removal disconnects the graph and a bridge is
an edge whose removal disconnects the graph [14].
Connected components are the most elementary clusters. Edges have a weight
corresponding to the geographical distance between the associated end-vertices.
Hierarchical clustering with single linkage corresponds to computing
the Minimum Spanning Tree of the weighted graph. However, it is well known
that single linkage suffers from bridges. Problems also arise for other
strategies when clusters are connected only at bridges or chains. Well-known
solutions for this find all bridges and cut-points [29].
This works when only one bridge connects two clusters.
Figure 2: Multiple bridges: (a) a data set (n = 555); (b) three
clusters; (c) three clusters and bridges; (d) one big cluster and one small
cluster.
Consider the data in Figure 2(a). Figure 2(b)
shows three clusters (C1, C2 and
C3). Figure 2(c)
shows that three path-like bridges connect clusters C1 and C2, while one
connects clusters C2 and C3. Consequently, there are no graph-theoretic
bridges or cut-points between clusters C1 and C2, while there are between
clusters C2 and C3. Therefore, the approach using bridges and cut-points fails
to separate two clusters linked by multiple bridges, and thus generates the two clusters
C12
and C3 as shown in Figure 2(d).
2.3 Closely located high-density clusters
Clusters emerge as a combination of global effects and local effects. Clusters
may share global first-order effects, but have their own local second-order
effects [7].
Global user-supplied arguments are to be used only when the set of points
is under the same global circumstance. Thus, the use of global user-supplied
arguments should be minimised, not only to allow for local variation, but
also to spot high-quality clusters.
Figure 3: Closely located high-density clusters: (a) a data set (n
= 570); (b) five clusters; (c) four high-density clusters; (d) three high-density
clusters and one sparse cluster.
Visual inspection of Figure 3(a) reveals four
high-density clusters and one sparse cluster. Figure 3(b)
highlights the four high-density clusters with labels C1 to C4
and one sparse cluster with label C5. However, using only global
arguments it is very difficult to obtain a reply from a clustering system
that suggests these five clusters simultaneously. Depending on the settings
of user-supplied arguments, the reply alternates between one that
-
detects four high-density clusters (C1, C2, C3 and
C4) leaving the sparse cluster (C5) out, perhaps as noise
(refer to Figure 3(c)), or
-
detects three high-density clusters (C1, C2 and C34)
and one sparse cluster (C5) as illustrated in Figure 3(d).
The former is the result of arguments that are slightly out of tune; for example, a shorter
cut-off in the criterion function (graph-based), a higher density parameter (density-based)
or a smaller kernel, normally a circle or rectangle (density-based
or grid-based). By relaxing the global arguments, the sparse cluster now
can be identified. But typically, because the separation between C3
and C4 is smaller than the intra-cluster distance in cluster
C5,
the two are combined into one high-density cluster C34 (refer to
Figure 3(d)).
2.4 Argument free
Minimising the need to adjust arguments reduces the risk of tampering with
the analysis tools. It also promises higher user friendliness. With our
approach, non-experts can now expect good quality clusters without assistance
from experts towards argument tuning. In addition, it reduces the experimentation
time spent searching for best-fit arguments. Clustering for knowledge discovery is
for revealing implicitly hidden information and patterns in data. Thus,
our approach supports automatic exploratory analysis [23],
rather than human-driven confirmatory analysis.
3 Using Delaunay Diagrams
3.1 Modelling proximity with Delaunay Diagrams
Two-dimensional clustering is the non-trivial process of grouping geographically
closer points into the same cluster. Therefore, a model of spatial proximity
for a discrete point-data set
P = {p1,p2,...,pn}
must provide robust answers to two questions: which points are the neighbours of a point pi,
and how far away are those neighbours relative to the context of the entire data
set P. Once spatial proximity is properly modelled, clustering becomes
a matter of finding relatively sudden changes.
This is similar to finding sharp gradient change, the boundaries of
clusters, in point density. Note that, we do not have a continuous density
but a discrete point set. The model that does not extrapolate or interpolate
is the discrete graph. One possibility is to use the complete graph, with
all edges weighted by the geographical distance. Here, the strength of interaction
between points is inversely proportional to the length of the edge (recall
the distance-decay effect [2]).
However, if we have n data points, there is no need to have a model
of size O(n^2). Moreover, clustering
is a summarisation operation and the goal is to obtain the discrete relation
pi is in the same cluster as pj for
all pi, pj ∈ P (a sub-relation of the complete graph). Thus, clustering is to
detect and remove inconsistent interactions (edges) in the graph modelling
proximity.
We already mentioned that, although raster and vector representations
are widely used in modelling geographical databases, they do not consistently
capture spatial proximity for point-data sets [10,11].
The Voronoi diagram offers an alternative to overcome the drawbacks of
conventional data models [10].
The Voronoi diagram uniquely captures spatial proximity and represents
the topology explicitly with the dual graph known as the Delaunay Triangulation.
Each Delaunay edge is an explicit representation of a neighbourhood relation
between points.
Because Delaunay Triangulations are not unique when co-circularity occurs,
we use Delaunay Diagrams. These are the result of removing all edges e
in the Delaunay Triangulations that are sides of a triangle whose vertices
are co-circular with a fourth point. Even in the presence of co-circularity,
Delaunay Diagrams guarantee a unique topology. They can be constructed
effectively and efficiently (only O(n log n) time).
The structure is succinct [21];
the number of edges is linear in the number of points and the expected
number of edges incident to a point is a constant. In addition, the
effect of long edges is minimised since the Delaunay Diagram is as equilateral
as possible. Further, it is a hybrid of the two widely adopted
models: it is geographically comprehensive like the raster model and holds
explicit adjacency like the vector model.
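As a concrete illustration of this proximity model, the following is a minimal sketch, assuming scipy and numpy are available; it builds the Delaunay edge set for a point array but omits, for brevity, the co-circularity clean-up that turns the Triangulation into the unique Delaunay Diagram.

# Minimal sketch, assuming scipy and numpy: build the Delaunay edge set used
# as the proximity model (co-circularity clean-up omitted).
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the set of undirected Delaunay edges for a 2-D point array."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:          # each simplex is a triangle (i, j, k)
        for a in range(3):
            for b in range(a + 1, 3):
                i, j = sorted((int(simplex[a]), int(simplex[b])))
                edges.add((i, j))
    return edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((500, 2))               # 500 random points in the unit square
    E = delaunay_edges(P)
    # The edge count is linear in n (at most 3n - 6 for a planar graph).
    print(len(P), "points,", len(E), "Delaunay edges")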
3.2 Previous clustering methods using Delaunay
Triangulations
Delaunay Triangulations represent proximity relationships so succinctly and powerfully
that several clustering methods have been based on them.
The simplest example is, again, the Minimum Spanning Tree, or single-linkage
hierarchical clustering.
More robust alternatives have been proposed along the paradigm of edge-removal
from the Delaunay Triangulation to obtain the clusters. One alternative
looks at the perimeter (sum of length of edges) of the triangles (or facets)
of the Delaunay Diagram. The intuition is that, given the nature of Delaunay
Diagrams (where a circumcircle of a Delaunay Triangle does not include
any other data point), triangles with large perimeter must have vertices
that are not in the same cluster. Estivill-Castro and Houle [6]
successfully derive the number of clusters using a one-dimensional classification
of perimeter values. Other approaches are those of Kang
et al. [15]
and Eldershaw and Hegland [4].
They analyse the length of Delaunay edges to find clusters of arbitrary
shapes. The intuition is similar, and edges that are too long are classified
as inter-cluster edges and removed. The relative success of these previous
approaches demonstrates that Delaunay Triangulations contain much implicit
proximity information.
However, these previous ideas use Delaunay Triangulations directly and
return inconsistent results in the presence of co-circularity. More seriously,
they use a global argument as a threshold to discriminate perimeter values
or edge lengths. This kind of static approach is not able to detect local
variations, and again, fails to identify clusters in situations like those
illustrated in Section 2. Recently, Estivill-Castro
and Lee [7]
propose a dynamic approach which utilises both global and local variations.
The idea behind this method is to find relatively high-density clusters
or localised excesses based on the Delaunay Diagram. This approach overcomes
some problems of static approaches and produces localised multi-level clusters.
But, it still fails to find relatively sparse clusters in situations like
those in Section 2.1 and Section 2.3.
4 AUTOCLUST
4.1 Intuition and definitions
The previous approaches justify the analysis of Delaunay Triangulations
or Delaunay Diagrams to detect inter-cluster edges (Delaunay edges whose
endpoints are in different clusters). We postulate that, in Delaunay Diagrams,
border points of clusters exhibit greater variability (dispersion or spread)
of length of their incident edges since they have both inter-cluster edges
and intra-cluster edges (Delaunay edges whose endpoints are in the same
cluster).
There are several measures for describing dispersion in statistical
spatial analysis. Undoubtedly, the standard deviation and its square, the
variance, are the most valuable and popular [27].
Thus, for the moment, we may say that AUTOCLUST detects points whose incident
edges exhibit lengths that have unusually large standard deviation. In
order to precisely define this notion we introduce the following notation.
In what follows, let P = {p1,p2,...,
pn} denote a set of n points in 2 dimensions and
let DD(P) denote the corresponding Delaunay Diagram of P.
The graph DD(P) is a planar map that contains vertices and
edges. Edges of DD(P) will be called Delaunay edges. For
a point pi ∈ P (a vertex
of DD(P)), the neighbourhood N(pi)
is the set of Delaunay edges incident to point
pi.
Definition 4.1.1 We denote by Local_Mean(pi) the mean length of the edges in N(pi). That is,

$\mathit{Local\_Mean}(p_i) = \frac{1}{d(p_i)} \sum_{e_j \in N(p_i)} |e_j|,$

where d(pi) denotes the number of Delaunay edges incident to pi (the degree of pi in the graph-theory sense), and |ej| denotes the length of the Delaunay edge ej.
Definition 4.1.2 We denote by Local_St_Dev(pi) the standard deviation of the length of the edges in N(pi). That is,

$\mathit{Local\_St\_Dev}(p_i) = \sqrt{\frac{1}{d(p_i)} \sum_{e_j \in N(p_i)} \big( \mathit{Local\_Mean}(p_i) - |e_j| \big)^2 }.$
Thus, an edge ej incident to a point pi could potentially be labelled as inter-cluster (too long) if

$|e_j| > \mathit{Local\_Mean}(p_i) + \mathit{Local\_St\_Dev}(p_i).$

However, this would only involve local effects. More seriously, Local_Mean(pi)
and Local_St_Dev(pi) are statistical indicators based on only d(pi) length values.
This is a very small sample, because E[d(pi)] = 6 in Delaunay Diagrams.
Thus, we now introduce measures that incorporate global and local effects;
these measures result in tests that make a decision on each edge using the
data of all edges in DD(P).
Definition 4.1.3 We denote by Mean_St_Dev(P) the average of the Local_St_Dev(pi) values as we let pi be each point in P. That is,

$\mathit{Mean\_St\_Dev}(P) = \frac{1}{n} \sum_{i=1}^{n} \mathit{Local\_St\_Dev}(p_i).$
Note that, a value of edge-length spread like Local_St_Dev(pi)
is large or small only in relation to the global value reflected in Mean_St_Dev(P).
Thus, we introduce an indicator of the relative size of Local_St_Dev(pi)
with respect to Mean_St_Dev(P).
Definition 4.1.4 We let Relative_St_Dev(pi) denote the ratio of Local_St_Dev(pi) and Mean_St_Dev(P). That is,

$\mathit{Relative\_St\_Dev}(p_i) = \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Mean\_St\_Dev}(P)}.$
With this tool we propose to discriminate edges into those unusually
short, those unusually long and those that seem to be proper intra-cluster
edges.
Definition 4.1.5 An edge ej ∈ N(pi) belongs to Short_Edges(pi) if and only if the length of ej is less than Local_Mean(pi) - Mean_St_Dev(P). That is,

$\mathit{Short\_Edges}(p_i) = \{\, e_j \in N(p_i) : |e_j| < \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \,\}.$
Definition 4.1.6 An edge ej ∈ N(pi) belongs to Long_Edges(pi) if and only if the length of ej is greater than Local_Mean(pi) + Mean_St_Dev(P). That is,

$\mathit{Long\_Edges}(p_i) = \{\, e_j \in N(p_i) : |e_j| > \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P) \,\}.$
Definition 4.1.7 Other_Edges(pi) denotes the subset of N(pi) whose edges are greater than or equal to Local_Mean(pi) - Mean_St_Dev(P), and less than or equal to Local_Mean(pi) + Mean_St_Dev(P). That is,

$\mathit{Other\_Edges}(p_i) = \{\, e_j \in N(p_i) : \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \le |e_j| \le \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P) \,\}.$
Note that, for each pi, the neighbourhood N(pi)
is partitioned into exactly 3 subsets: Short_Edges(pi),
Long_Edges(pi)
and Other_Edges(pi). However, an edge linking
pi with pj may belong, for example,
to both
Short_Edges(pi) and Other_Edges(pj).
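As an illustration of Definitions 4.1.1 to 4.1.7, the following sketch computes the local and global indicators and classifies every incident edge. The function and variable names are ours, not part of AUTOCLUST's published implementation, and the sketch assumes the point array and edge set produced by the earlier Delaunay sketch.

# Illustrative sketch of Definitions 4.1.1-4.1.7 (names are illustrative, not
# the authors' code); reuses the points and edges from the previous sketch.
import numpy as np

def classify_edges(points, edges):
    n = len(points)
    incident = [[] for _ in range(n)]          # N(p_i): edges incident to p_i
    for (i, j) in edges:
        length = float(np.linalg.norm(points[i] - points[j]))
        incident[i].append(((i, j), length))
        incident[j].append(((i, j), length))

    # Local_Mean and Local_St_Dev (population standard deviation, 1/d(p_i))
    local_mean = np.array([np.mean([l for _, l in inc]) for inc in incident])
    local_std = np.array([np.std([l for _, l in inc]) for inc in incident])
    mean_st_dev = float(local_std.mean())      # Mean_St_Dev(P)

    short = [set() for _ in range(n)]
    long_ = [set() for _ in range(n)]
    other = [set() for _ in range(n)]
    for i in range(n):
        lo = local_mean[i] - mean_st_dev       # acceptance interval bounds
        hi = local_mean[i] + mean_st_dev
        for (e, length) in incident[i]:
            if length < lo:
                short[i].add(e)                # Short_Edges(p_i)
            elif length > hi:
                long_[i].add(e)                # Long_Edges(p_i)
            else:
                other[i].add(e)                # Other_Edges(p_i)
    return short, long_, other, mean_st_dev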
4.2 Working principle
The first phase of our algorithm involves the identification, for each
pi, of Long_Edges(pi), Short_Edges(pi)
and
Other_Edges(pi). While it is now possible
to declare
Long_Edges as links between points in different clusters,
simply because they clearly indicate that the points are too far away,
it is not as simple to assume that Short_Edges are links for points
within a cluster. In some cases, these very short links also correspond
to points that align themselves on bridges between clusters. For example,
consider the data in Figure 4(a). The Delaunay
Diagram of these 200 points is shown in Figure 4(b).
Some Short_Edges of the data set are illustrated in Figure 4(c).
They correspond to two bridges between the two clusters. This is because
other Delaunay edges incident to points in the bridges are long, so the
edges on the bridges are exceptionally short. Note that edges inside any
of the two clusters are of approximately the same length as edges on the
bridge. But, edges incident to a point pi inside a cluster
are almost all of equal length. Thus, removing all edges e ∈
Short_Edges(pi) eliminates some bridges,
while if such an edge e happens to lie inside a cluster, it can later
be recuperated using the analysis of connected components. Later phases of
our algorithm handle other types of bridges and recuperate
Short_Edges
that are intra-cluster links.
Figure 4: Two clusters connected by two bridges: (a) a data set (n
= 200); (b) its Delaunay Diagram; (c) the
Short_Edges on bridges.
4.2.1 First justification
While we have the intuitively supported rationale in AUTOCLUST to remove
all edges that are in Long_Edges(pi)
Short_Edges(pi) for some pi
P, we now give another justifications. Note that the analysis of
edges in DD(P) is the means towards grouping points, towards
clustering. We noted that the relation pi is in the same
cluster as pj is a subset of the complete graph and encodes
the clustering. This is what establishes the connection between grouping
points and the analysis of edges. Previous clustering algorithms using
Delaunay Triangulation [4,6,15]
focus much more on edges because all operations and cut-off criterion functions
are performed and formulated based on edge analysis. AUTOCLUST is closer
to the objective of grouping points because
Local_Mean(pi)
and Local_St_Dev(pi) report on local behaviour
around pi. These values represent behaviour of points
more than that of edges. Global effects are incorporated in discriminating
Long_Edges(pi), Short_Edges(pi)
and
Other_Edges(pi), with the use of Mean_St_Dev(P).
This makes all the n points influence the decisions for any other
pi. Perturbations of a point pj away
from a point pi do not affect the Delaunay Diagram in
the vicinity of pi. However, the relative configuration
of other points pj away from pi does
influence Mean_St_Dev(P).
4.2.2 Second justification
As a second justification to AUTOCLUST's removal of all edges that are
in Long_Edges(pi) ∪ Short_Edges(pi) for some pi ∈ P,
first recall that border points pi of a cluster
C are expected to have Local_Mean(pi) and
Local_St_Dev(pi)
larger than points totally inside the cluster C. Thus, if Local_Mean(pi)
and
Local_St_Dev(pi) were statistically meaningful, in the sense of being supported
by a large sample, a statistical test to remove unusually long edges would have the form

$|e_j| > \mathit{Local\_Mean}(p_i) + k \cdot \mathit{Local\_St\_Dev}(p_i),$

where k would be a constant to evaluate how unusual an edge is.
However, for a point pi inside a cluster C,
there is uniformity around it, so all its incident edges have similar lengths
and Local_St_Dev(pi) is relatively small. In this
case, we would set the value of k to be larger than 1 in order to
widen the acceptance interval and make sure we preserve edges inside a
cluster. On the other hand, for a point pi that does
not belong to any cluster (an outlier or noise), there are long edges around
it to reach other points. This makes
Local_Mean(pi) large, and because Local_St_Dev(pi) is in the same units as
Local_Mean(pi), Local_St_Dev(pi) is also relatively large. Thus, in this case,
we would set the value of k to be less than one in order to narrow the acceptance interval and
ensure we remove more edges incident to an outlier.
The inverse of Relative_St_Dev(pi) fulfils
perfectly the role required for the factor k. It is less than one
for those points
pi that locally exhibit more spread
(of length of their incident edges) than the generality of all points in
P. It is larger than one for those points pi that
locally exhibit less spread than the generality of all points in P.
Incorporating k = Relative_St_Dev(pi)^{-1} into the statistical test, we have the following derivation:

$\mathit{Local\_Mean}(p_i) - k \cdot \mathit{Local\_St\_Dev}(p_i) \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + k \cdot \mathit{Local\_St\_Dev}(p_i)$
$\Longleftrightarrow\; \mathit{Local\_Mean}(p_i) - \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Relative\_St\_Dev}(p_i)} \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + \frac{\mathit{Local\_St\_Dev}(p_i)}{\mathit{Relative\_St\_Dev}(p_i)}$
$\Longleftrightarrow\; \mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) \;\le\; |e_j| \;\le\; \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P).$

This is the acceptance interval for Other_Edges(pi).
4.3 A three-phase clustering algorithm
AUTOCLUST consists of three phases. Phase
I classifies incident edges into Short_Edges(pi),
Long_Edges(pi) and
Other_Edges(pi)
for each pi. Phase I also removes all edges in Short_Edges(pi)
and Long_Edges(pi) for all points pi.
This produces the boundaries of clusters, perhaps with some bridges and
perhaps isolating some points. For each point pi, Phase
II recovers edges in Short_Edges(pi) as long as
they do not merge non-trivial connected components; if this happens, it
may re-establish the connected components by removing some Other_Edges(pi).
Finally, Phase III looks for bridges in a local region of each pi.
Figure 5 shows the operation of these phases
in a data set with 4 clusters. The Delaunay Diagram is presented in Figure
5(a), while Figure 5(b)
shows the conclusion of Phase I. Application of the second phase results
in Figure 5(c). Isolated data points are attached
back. Phase III breaks the bridges between the two bottom clusters.
Figure 5: Cleaning procedures: (a) Delaunay Diagram; (b) after Phase I
(initial 3 clusters); (c) after Phase II (3 clusters with refinement);
(d) after Phase III (4 clusters).
4.3.1 Phase I: Finding boundaries
Our earlier discussion justifies in detail the removal of all edges in
all Long_Edges(pi) from the Delaunay Diagram.
They are simply too long and are merging two distinct clusters. We stress
that edges e = (pi,pj) in Short_Edges(pi)
where pj is in the same cluster are temporarily removed
and will be recuperated in the next phase because pi and
pj will be in the same connected component.
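A minimal sketch of this removal step, reusing the classification from the earlier sketch (the function name is illustrative), is the following.

# Sketch of the Phase I removal, reusing classify_edges from the earlier sketch:
# drop every edge that is in Long_Edges(p_i) or Short_Edges(p_i) for some p_i.
def phase_one(edges, short, long_):
    removed = set()
    for s in short:
        removed |= s        # Short_Edges may be recuperated in Phase II
    for l in long_:
        removed |= l        # Long_Edges are permanently removed
    return [e for e in edges if e not in removed]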
4.3.2 Phase II: Restoring and re-attaching
The goal of this phase is to restore the edges e = (pi,pj)
in Short_Edges(pi) where pi
and pj are in the same cluster. Also, the goal is to
remove some Other_Edges(pi) that are creating
paths that may result in bridges or links between border points and noise
points. Thus, the first step of Phase II is to compute connected components
and label each point
pi with its connected component
CC[pi]. If a connected component has more than
one element, it is called non-trivial. If
pi is the only
point in a connected component, then the point pi is
isolated. In that case CC[pi] is not a cluster.
To introduce the processing of Phase II, we bring attention to the following
observation.
Observation 4.3.1 Let pi ∈ P be such that
Other_Edges(pi) ≠ ∅.
Then, all edges e ∈ Other_Edges(pi) connect
pi to the same non-trivial connected component.
Using our notation, this means that if two edges
ej
= (pi,pj) and ek
= (pi,pk) are in
Other_Edges(pi),
then CC[pj] = CC[pk].
The observation follows trivially from the fact that pj
and pk are connected by the path ej,ek.
Now, we recuperate
Short_Edges, and in some cases, revise the connected
component for pi. If all edges in Short_Edges(pi)
suggest unanimously a connected component, pi is integrated
to that suggestion. If there are two edges in Short_Edges(pi)
suggesting different connected components, the largest is adopted. While
Other_Edges(pi) participate in the creation
of the connected components, they now yield precedence to the suggestion
by Short_Edges(pi) (which in the large
majority of the cases is the same suggestion). The component swap typically
occurs on isolated boundary points that have no Other_Edges (for
example, the isolated points on the left of the bottom cluster in Figure
5(b)). Details of this process are as follows.
-
For each point pi, we perform the following steps if Short_Edges(pi) ≠ ∅ and,
for some edge e = (pi,pj) ∈ Short_Edges(pi), CC[pj]
is a non-trivial connected component.
-
Determine a candidate connected component C for pi.
-
If there are two edges ej = (pi,pj) and ek = (pi,pk) in Short_Edges(pi)
with CC[pj] ≠ CC[pk], then
-
Compute, for each edge e = (pi,pj) ∈ Short_Edges(pi), the size |CC[pj]|,
and let M = max_j |CC[pj]|.
-
Let C be the class label of the largest connected component (if
there are two different connected components with cardinality M,
we let C be the one with the shortest edge to pi).
-
Otherwise, let C be the label of the connected component all edges
e ∈ Short_Edges(pi)
connect pi to.
-
If the edges in Other_Edges(pi) connect
to a connected component different from C, remove them. Note that
-
because of Observation 4.3.1 above, when this
case applies, all edges in Other_Edges(pi)
are removed, and
-
only in this case will pi swap non-trivial connected
components.
-
Restore all edges e ∈ Short_Edges(pi)
that connect to C.
-
Re-compute connected components.
-
For each singleton pi (|CC[pi]| = 1), we restore all Short_Edges(pi) if,
for every edge e = (pi,pj) ∈ Short_Edges(pi), CC[pj]
is the same connected component C.
Thus, Phase II incorporates points that are isolated in singleton connected
components to non-trivial connected components if and only if they are
very close (they have at least one Short_Edge to a non-trivial connected
component) in step 1. Singletons whose incident edges connect to other
singletons are re-attached in step 3. Phase II also re-assigns a point
pi linked to a cluster C by Other_Edges(pi)
if there are Short_Edges(pi) that indicate membership to a different cluster.
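The following is a simplified sketch of the main Phase II step, assuming the networkx library for connected components; it illustrates the restore/remove logic but omits the re-computation of components and the final singleton re-attachment of the procedure described above.

# Simplified sketch of Phase II, assuming networkx. Only the main
# restore/remove step is shown; component re-computation and singleton
# re-attachment are omitted.
import networkx as nx

def phase_two(n, kept_edges, short, other):
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from(kept_edges)               # graph after the Phase I removals
    comp = {v: c for c, nodes in enumerate(nx.connected_components(G)) for v in nodes}
    size = {}
    for v, c in comp.items():
        size[c] = size.get(c, 0) + 1

    for i in range(n):
        # components suggested by Short_Edges(p_i), non-trivial ones only
        cands = {comp[j] for (a, b) in short[i] for j in (a, b)
                 if j != i and size[comp[j]] > 1}
        if not cands:
            continue
        C = max(cands, key=lambda c: size[c])  # adopt the largest suggestion
        for (a, b) in list(other[i]):          # drop conflicting Other_Edges(p_i)
            j = b if a == i else a
            if comp[j] != C and G.has_edge(a, b):
                G.remove_edge(a, b)
        for (a, b) in short[i]:                # restore Short_Edges(p_i) towards C
            j = b if a == i else a
            if comp[j] == C:
                G.add_edge(a, b)
    return G                                   # connected components are the clusters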
4.3.3 Phase III: Detecting second-order inconsistency
After the first two phases, the current subgraph of
DD(P)
is the result of local analysis with regard to global effects. However,
to complete the analysis, the immediate neighbourhood is extended slightly.
Thus, we extend the notion of neighbourhood of a vertex pi
in a graph G as follows.
Definition 4.3.1 The k-neighbourhood N_{k,G}(pi)
of a vertex pi in a graph G is the set of edges of G that belong
to a path of k or fewer edges starting at pi.
The third phase now computes Local_Mean for 2-neighbourhoods.
Definition 4.3.2 We denote by Local_Mean_{k,G}(pi)
the mean length of edges in N_{k,G}(pi). That is,

$\mathit{Local\_Mean}_{k,G}(p_i) = \frac{1}{|N_{k,G}(p_i)|} \sum_{e_j \in N_{k,G}(p_i)} |e_j|.$

For each point pi, this phase simply removes all edges in N_{2,G}(pi) that are
long edges. That is, all e ∈ N_{2,G}(pi) such that
|e| > Local_Mean_{2,G}(pi) + Mean_St_Dev(P).
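A sketch of this 2-neighbourhood test, again assuming networkx and reusing the graph from the Phase II sketch and Mean_St_Dev(P) from Phase I (the names are illustrative), follows.

# Sketch of the Phase III test over 2-neighbourhoods; G is the graph from the
# Phase II sketch and mean_st_dev is Mean_St_Dev(P) computed earlier.
import numpy as np

def phase_three(G, points, mean_st_dev):
    def length(u, v):
        return float(np.linalg.norm(points[u] - points[v]))

    to_remove = set()
    for i in G.nodes():
        nbhd = set()                           # edges on paths of <= 2 edges from p_i
        for j in G.neighbors(i):
            nbhd.add(tuple(sorted((i, j))))
            for k in G.neighbors(j):
                if k != i:
                    nbhd.add(tuple(sorted((j, k))))
        if not nbhd:
            continue
        local_mean_2 = np.mean([length(u, v) for (u, v) in nbhd])
        for (u, v) in nbhd:
            if length(u, v) > local_mean_2 + mean_st_dev:
                to_remove.add((u, v))          # long relative to the 2-neighbourhood
    G.remove_edges_from(to_remove)
    return G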
4.4 AUTOCLUST only requires O(n
log n) time
Sub-quadratic time complexity is central for clustering algorithms in spatial
data mining. AUTOCLUST is applicable to clustering of very large geo-referenced
data since it only requires O(n log n) time. AUTOCLUST
requires O(n log n) time to construct the Delaunay
Diagram. Once this model of proximity is available, AUTOCLUST performs
the remaining work in linear time. The calculation of
Mean_St_Dev(P)
is bounded by twice the number of edges, and the edge elimination in Phase
I requires O(n) time. To compute connected components only
requires O(n) time, since this is linear in the sum of both
number of edges and number of points. Each test to restore Short_Edges(pi)
in Phase II is bounded by d(pi) and thus, the
total time for Phase II is also linear. Similarly, the local subgraph around
pi in Phase III has size O(d(pi)^2).
Because in the Delaunay Diagram we have
E[d(pi)]
= 6, the total execution in this phase requires time proportional to the
sum of the degrees of all vertices, which is
O(n). Consequently,
the time complexity of AUTOCLUST is dominated by computing Delaunay Diagram
in O(n log n) time.
5 Performance evaluation
This section presents results of experiments with synthetic data sets and
real data sets that clearly illustrate the robustness of AUTOCLUST.
5.1 Clustering quality with synthetic data sets
Figure 6: Synthetic data sets: (a) Dataset I (n = 8000 and 8 clusters);
(b) Dataset II (n = 8000 and 8 clusters); (c) Dataset III (n
= 8000 and 6 clusters); (d) Dataset IV ( n = 3250 and 12 clusters).
The data in Figure 6(a) is the result of
synthetic data generation aimed at creating situations where clustering
methods typically fail. Section 2 explains
these types of situations. The clustering obtained with AUTOCLUST demonstrates
its robustness. AUTOCLUST correctly identifies two clusters connected by
multiple bridges in the top-left corner of Figure 6(a).
AUTOCLUST has no difficulty in detecting a cross-shaped, but sparse cluster
around a high-density cluster at the top-right corner of that data set.
Again, this mimics real-world cases like a densely populated city surrounded
by slightly less dense suburbs. AUTOCLUST successfully reports the two
high-density clusters separated by a twisting narrow gap (at the bottom-left
of the data set). This mimics real-world situations where a river separates
two highly populated areas. AUTOCLUST correctly reports the doughnut-shaped
cluster and the surrounded island-shaped cluster (at the bottom-right of
that data set).
The data sets in Figure 6(b) and Figure
6(c) belong to CHAMELEON's benchmark [16].
Regardless of the challenge posed by different sizes, different densities,
and different separations (inter-cluster distances), AUTOCLUST correctly
identifies 8 clusters and 6 clusters, respectively. Some of these clusters
have very little space between them, but AUTOCLUST still detects these
boundaries.
The data in Figure 6(d) consists of 12
clusters of various shapes and different densities. Note that many are
not convex. AUTOCLUST successfully detects all of them.
5.2 Clustering quality with real data sets
The data sets displayed in Figure 7 are real data
sets of Queensland in Australia. Figure 7(a) displays
centres for living areas in Queensland and AUTOCLUST clustering results.
In the bottom-right of the data set, Brisbane, the capital city of Queensland,
creates one large cluster (incorporating 70 percent of the living areas).
AUTOCLUST detects this and, simultaneously, finds three relatively smaller
clusters along the coast which correspond to big cities such as Cairns,
Atherton and Townsville. Inland areas have no clusters. This highlights
the pattern of urban settlement in coastal areas and around capital cities.
Figure 7: Real data sets: (a) living areas (n = 1,837) and AUTOCLUST
grouping into 4 clusters; (b) landing grounds (n = 1,579) and AUTOCLUST
grouping into 3 clusters (greater than 1 % of n); (c) national parks
(n = 227) and AUTOCLUST grouping into 4 clusters (greater than 2%
of n).
The data set displayed in Figure 7(b) represents
landing grounds including aerodromes and airports. Visual inspection does
not suggest any significant cluster. Landing grounds seem to be scattered
all over Queensland. AUTOCLUST groups 90 percent of them into a large cluster
that covers almost all Queensland. AUTOCLUST's results are consistent with
the fact that there is no special pattern in landing grounds. AUTOCLUST
also detects two small clusters in the southern portion of Brisbane. Note
that visual inspection reveals a lower density of landing
grounds surrounding the concentrated living areas near Brisbane, creating
the two other clusters south and east of Brisbane. Densely populated urban
areas are not likely to have enough space for many airports. Conversely,
sparsely populated inland areas are likely to have enough space for small
aerodromes.
Figure 7(c) illustrates a data set for national
parks in Queensland. National parks are generally located along the coast
and the Great Barrier Reef. AUTOCLUST successfully derives clusters
directly overlying the major cities, urban areas and coastal areas. It
finds two high-density clusters including more than half of national parks
on the outer margin of Australia's continental shelf. Two less dense clusters
are detected around urban areas.
5.3 CPU time comparison
While AUTOCLUST requires O(n log n) time, we are interested
in verifying experimentally that the constants under the O notation
are small. This is to be expected, since the dominating
O(n
log n) time term in AUTOCLUST is the computation of the Delaunay
Diagram. Everything else requires O(n) time.
We compare AUTOCLUST with AMOEBA [7].
AMOEBA also requires O(n log n) time and has been
shown to be very efficient (with a small constant under the O notation).
We generated 7 types of data sets based on a simple mixture model. The
number of points varies from 1,000 to 100,000. Each data set has 10 randomly
distributed cluster centres within the unit square [0, 1] × [0, 1], and
10 percent of the total number of points are just noise points
(selected at random within the unit square [0, 1] × [0, 1]).
Since AUTOCLUST and AMOEBA require spatial proximity modelling
(Delaunay Diagram), the run time comparison shown in Figure 8
includes the time for constructing their respective data structures. All
the experiments were performed on a 550MHz Pentium III workstation with
128MB memory.
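A data generator along the lines described above can be sketched as follows; the per-cluster spread sigma is an assumption, since the exact mixture parameters are not stated here.

# Sketch of a generator matching the description: 10 random cluster centres in
# the unit square plus 10 percent uniform noise. sigma is an assumed spread.
import numpy as np

def make_dataset(n, n_centres=10, noise_fraction=0.10, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    centres = rng.random((n_centres, 2))           # centres in [0, 1] x [0, 1]
    n_noise = int(round(noise_fraction * n))
    labels = rng.integers(0, n_centres, size=n - n_noise)
    clustered = centres[labels] + rng.normal(0.0, sigma, size=(n - n_noise, 2))
    noise = rng.random((n_noise, 2))               # uniform noise points
    return np.clip(np.vstack([clustered, noise]), 0.0, 1.0)

points = make_dataset(10_000)                      # e.g. one of the tested sizes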
Figure 8: Comparison of CPU times in seconds.
Figure 8 demonstrates that AUTOCLUST outperforms
AMOEBA with respect to CPU-time by a small margin. Note that, however,
the exploration time to find best-fit arguments for AMOEBA is not included.
AUTOCLUST is argument free. Thus, in exploratory data analysis exercises
AUTOCLUST will easily outperform the use of AMOEBA or other clustering
methods that need parameter tuning. The experimental results shown in Figure
8 include the time requirements for I/O and the clustering
process. These experimental results illustrate that AUTOCLUST is efficient
in terms of time complexity and scalable to large data sets.
6 Final remarks
Spatial data mining is the discovery of interesting patterns, structures,
relationships or characteristics that may exist implicitly in spatial databases
[20].
This information is hidden and unknown. Thus, we do not know what kinds
of patterns exist, what causes them, how many clusters are embedded and
where they are. Only the data can genuinely reveal this information. Hypotheses
about the patterns are suggested from the data in exploratory data analysis.
This is very different from the confirmation of human-generated hypotheses
in confirmatory data analysis. Hence, for exploratory data analysis, the
need to supply prior knowledge should be minimised, removing human-generated
biases. AUTOCLUST is a clustering tool for massive geo-referenced data
sets free of user-supplied arguments. Moreover, this freedom eliminates
expensive human-based exploration time for finding best-fit arguments needed
for alternative approaches. AUTOCLUST incorporates global variations and
local variations. Hence, it is robust to noise, outliers, bridges and the
type of distribution. It is able to detect clusters with arbitrary shapes,
different sizes and different densities. Clusters with multiple bridges
are also correctly separated. It only requires O(n log n)
time, where
n is the size of the data set. A series of comparisons
confirms the efficiency and effectiveness of AUTOCLUST.
We expect that our methods will be extended in two directions. First, the
approach can be extended to 3 dimensions by computing a spanner graph of
linear size and obtaining local and global information efficiently [3]
from this graph. Second, the approach can naturally be extended to other
metrics with well-defined Delaunay Diagrams, such as the Manhattan distance or
other L_p distances. Further research will explore these directions.
References
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 94-105, 1998.
[2] T. C. Bailey and A. C. Gatrell. Interactive Spatial Analysis. Wiley, New York, 1995.
[3] M. T. Dickerson and D. Eppstein. Algorithms for proximity problems in higher dimensions. Computational Geometry - Theory and Applications, 5:277-291, 1996.
[4] C. Eldershaw and M. Hegland. Cluster Analysis using Triangulation. In B. J. Noye, M. D. Teubner, and A. W. Gill, editors, Computational Techniques and Applications: CTAC97, pages 201-208. World Scientific, Singapore, 1997.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
[6] V. Estivill-Castro and M. E. Houle. Robust Clustering of Large Geo-referenced Data Sets. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 327-337, 1999.
[7] V. Estivill-Castro and I. Lee. AMOEBA: Hierarchical Clustering Based on Spatial Proximity Using Delaunay Diagram. In Proceedings of the 9th International Symposium on Spatial Data Handling, 2000. To appear.
[8] V. Estivill-Castro and I. Lee. AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Technical Report 2000-03, Department of Computer Science and Software Engineering, University of Newcastle, 2000.
[9] C. Fraley. Algorithms for Model-Based Gaussian Hierarchical Clustering. SIAM Journal on Scientific Computing, 20(1):270-281, 1998.
[10] C. M. Gold. Problems with handling spatial data - The Voronoi approach. CISM Journal ACSGC, 45(1):65-80, 1991.
[11] C. M. Gold. The meaning of Neighbour. In G. Tinhofer and G. Schmidt, editors, Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, pages 220-235, Berlin, 1992. Springer-Verlag Lecture Notes in Computer Science 639.
[12] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 73-84, 1998.
[13] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the International Conference on Data Engineering, pages 512-521, 1999.
[14] F. Harary. Graph Theory. Addison-Wesley, Reading, MA, 1969.
[15] I. Kang, T. Kim, and K. Li. A Spatial Data Mining Method by Delaunay Triangulation. In Proceedings of the 5th International Workshop on Advances in Geographic Information Systems (GIS-97), pages 35-39, 1997.
[16] G. Karypis, E. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer: Special Issue on Data Analysis and Mining, 32(8):68-75, 1999.
[17] K. Koperski, J. Han, and J. Adhikari. Mining Knowledge in Geographical Data. Communications of the ACM. To appear.
[18] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26:354-359, 1983.
[19] F. Murtagh. Cluster analysis using proximities. In J. Hampton, R. Michalski, P. Theuns, and I. V. Mechelen, editors, Categories and Concepts: Theoretical Views and Inductive Data Analysis, pages 225-245. Academic Press, New York, 1993.
[20] R. T. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 144-155, 1994.
[21] A. Okabe, B. N. Boots, and K. Sugihara. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley & Sons, West Sussex, 1992.
[22] S. Openshaw. A Mark 1 Geographical Analysis Machine for the automated analysis of point data sets. International Journal of GIS, 1(4):335-358, 1987.
[23] S. Openshaw. Two exploratory space-time-attribute pattern analysers relevant to GIS. In S. Fotheringham and P. Rogerson, editors, Spatial Analysis and GIS, pages 83-104. Taylor and Francis, London, 1994.
[24] S. Openshaw. Geographical data mining: key design issues. In Proceedings of GeoComputation 99, 1999.
[25] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pages 144-155, 1997.
[26] W. Wang, J. Yang, and R. Muntz. STING+: An Approach to Active Spatial Data Mining. In Proceedings of the International Conference on Data Engineering, pages 116-125, 1999.
[27] R. Webster and M. A. Oliver. Statistical Methods in Soil and Land Resource Survey. Oxford University Press, New York, 1990.
[28] C. S. Wallace. Intrinsic Classification of Spatially Correlated Data. The Computer Journal, 41(8):602-611, 1998.
[29] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, 20(1):68-86, 1971.
[30] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.