ESTIVILL-CASTRO, V. and LEE, I.

AUTOCLUST: Automatic Clustering via Boundary Extraction for Massive Point Data Sets

ESTIVILL-CASTRO, V. and LEE, I.
University of Newcastle, New South Wales, Australia
Email: vlad@cs.newcastle.edu.au

Key words: Clustering, Boundary Extraction, Geographical Data Mining, Delaunay Diagram, Voronoi Modelling

Clustering is the partition of point data sets into smaller groups due to proximity. Clustering is central to spatial data mining and it is used for analysing correlation between themes. As the amount of spatial data grows explosively and the structure of data becomes more complex, both efficiency (fast enough) and effectiveness (quality of clustering) become central aspects of successful clustering. Traditional clustering methods require user-specified arguments and prior knowledge to produce their best results. This demands pre-processing or/and several trial and error steps. Both are extremely expensive and inefficient for massive data sets. The need to find best-fit arguments in traditional semi-automatic clustering is not the only concern; manipulation of data to find the appropriate arguments opposes the philosophy of "let the data speak" that underpins exploratory data analysis.

So far, efficient clustering approaches have been based on detecting relatively denser areas within the study region. However, this clustering principle alone has difficulty in determining exactly what has high density and what is relatively sparser. Thus, it frequently fails to find slightly sparser groups, when they are adjacent to dense groups, despite their great importance for further investigations.

Density is a natural principle, but it should be combined with relative proximity. Clusters can then be found by locating sharp gradient changes in point density (cluster boundaries). Our new approach consists of effective and efficient methods for discovering such cluster boundaries in point data sets. We propose a new and fully automatic clustering algorithm. The approach extracts boundaries based on Voronoi modelling and Delaunay Diagrams (these provide proximity information and eliminate the ambiguities in Delaunay triangulations when point cocircularity occurs). Voronoi modelling has several advantages for proximity relations of points over the conventional modelling based on either vector-like or raster-like modelling. In particular, Voronoi models uniquely represent discrete point data sets and consistently capture spatial proximity. We show that, as a consequence, automatic clustering is feasible. Parameters are not specified by users in our automatic clustering, they are encoded in the proximity structures of the Voronoi modelling, and thus, our algorithm, AUTOCLUST, calculates them from the Delaunay Diagram. This not only removes human-generated bias, but also reduces exploration time.

In the Delaunay Diagram, points in the border of a cluster tend to have greater standard deviation of length of their edges since they have both short edges and long edges. The former connect points within a cluster and the latter straddle between clusters or between a cluster and noise. This characteristic of border points is the essence for formulating a dynamic edge-elimination criterion. The criterion has the flavour of a statistical test, labelling edges that are units of the standard deviation away from the mean. Removal of labelled edges from the Delaunay Diagram extracts initial rough boundaries. Local and global variations are taken into account to ensure that relatively homogeneous clusters are not broken up into tiny, uninteresting sub-sets and relatively heterogeneous sub-sets are segmented into meaningful sub-clusters. To overcome problems of graph-based clustering, a three-phase edge-elimination process is applied. This process removes bridges that connect clusters with line-like paths and a small number of points. The effectiveness of our approach allows us to detect not only clusters of different density, but sparser clusters near to high density clusters. Multiple bridges linking clusters are identified and removed, all this within O (n log n) expected time, where n is the number of data points. We evaluate AUTOCLUST's time efficiency and clustering quality. We contrast AUTOCLUST with other algorithms for clustering large georeferenced sets of points. A series of detailed performance comparisons with both synthetic data sets and real data sets confirm the virtues of our approach.