The Genetic Algorithm is tested against the traditional k-means method and an unsupervised neural network (Kohonen's Self-Organising Map). First, a particular environment is set up using artificially generated data; then a real-world example, involving electoral results in Portugal, is used. Another important aspect is the possibility of using GA in geographical clustering problems. The flexibility offered by GA should encourage researchers to develop new and improved ways of tackling clustering tasks in Geography.
Cluster analysis can be defined as "a wide variety of procedures that can be used to create a classification. These procedures empirically form 'clusters' or groups of highly similar entities" (Aldenderfer and Blashfield 1984). In other words, cluster analysis defines groups of cases, through a number of procedures, whose members are more similar to one another than to the cases in other groups.
Widely used, cluster analysis has merited the attention of a large number of academic disciplines, from marketing to geography, where it has been extensively applied and tested. Urban researchers are among those who have relied heavily on clustering algorithms in their research (Openshaw and Wymer 1994, Plane and Rogerson 1994, Painho, Vasconcelos, Fonseca and Geirinhas 1997). Most of the work done on the internal spatial and social structure of cities has in some way used classification as a basis for analysis.
Census data is surely the most fundamental source of information in urban research. The identification of deprived areas, geodemographics and the provision of social services are among the processes in urban research that rely on census variables and classification techniques. Any improvement in classification methods is therefore bound to have an impact in this area, especially considering the specific problems usually associated with census data (Openshaw and Wymer 1994).
Before proposing new and improved ways of dealing with clustering tasks in a geographic information systems environment, one should consider the alternatives in terms of classifiers. This is especially important given that flexibility is one of the major issues that can ease the work of adapting these classifiers to incorporate geographic aspects.
This paper starts with a discussion of some relevant aspects of clustering and geographic issues. The clustering methods used are then briefly described, followed by the presentation of the empirical setting and its results; finally, some conclusions are presented.
Another questionable aspect of these classifications is related to quality. In part, this issue can be addressed on the grounds that it is better to have geodemographics, in spite of its limitations, than nothing at all. This seems a reasonable argument, especially if privacy issues limit the development of other approaches. Nevertheless, it should not be used to prevent improvement of the existing methodologies.
It is also interesting to note the persistence of a number of geographers in criticising all the work in geodemographics. It is obvious that a number of commercial interests have promised what they cannot deliver, and that the academic community should not endorse these unrealistic expectations. But why concentrate on destroying what can be a very good opportunity to develop new fields of expertise and keep Geography in touch with the real world? It seems much more useful to propose improvements or new methodologies, and to contribute to the development of promising new fields in Geography.
The performance of the classification methods in itself is not enough to solve all the problems related with the quality of census based classifications, but it is definitely an important issue. Openshaw, Blake and Wymer (1995) argue "in general, the quality of any area-based census classification reflects three major factors. First, the classification algorithm that is used, second, the way and extent to which knowledge about socio-spatial structure is used and is represented in the classification, and third, the sensitivity of the technology to what can be condensed to be the geographical realities of the spatial census data classification problem".
For some time now, work has been done on the use of unsupervised neural networks, especially the Kohonen Self-Organising Map (SOM), in clustering procedures. A more flexible approach seems to be the main promise; this is particularly relevant when classifying census data (Openshaw and Wymer 1994), and also if the introduction of some kind of geographical knowledge is considered.
On the other hand, the optimisation problem posed by clustering can be approached using Genetic Algorithms (GA) as the optimisation tool. GA have long been used on different kinds of complex problems, usually with encouraging results. In this paper a genetic algorithm is used to optimise the objective function of the k-means algorithm: the within-cluster variance is used as the fitness function.
In spite of the value that new and improved classifiers can add to clustering tasks in Geography, probably the quantum leap that Openshaw and Wymer (1994) talk about will come not from the improvement of classifiers but from the introduction of some spatial reasoning in the process as they propose. Nevertheless the classifier used is of major significance, if not because of the performance then because it will condition the way in which geographical knowledge will be introduced in the clustering process.
The main objective of this work is to evaluate the GA approach to the clustering problem. It is well known that data quality is usually worse than it should be, and this problem is even more difficult to deal with in multidimensional data, which varies in precision and resolution. Clustering methods should be capable of providing satisfactory classifications and of tackling the challenges prompted by some fuzziness in the data. That is what the GA should be able to do, preferably better than traditional methods like k-means. Having established this, one can think of other ways of using GA in cluster analysis that are more adequate for treating spatial data and easier to couple with GIS.
The central idea in k-means is to start with an initial partition and then improve it by changing cluster membership. As Anderberg (1973) notes, "The broad concept for this methods (non-hierarchical clustering methods) is very similar to that underlying the steepest descendent algorithms used for unconstrained optimisation in nonlinear programming".
Typically the k-means algorithm starts with an initialisation process in which seed positions are defined. This initial step can have a significant impact on the performance of the method (Bradley and Fayyad 1998) and can be done in a number of ways (Anderberg 1973, Bradley and Fayyad 1998). After the initial seeds have been defined, each data element is assigned to the nearest seed. The next step consists of repositioning the seeds; this can be done after all elements are assigned to the nearest seed, or as each element is assigned. A new assignment step then follows, and the process continues until no further improvement can be made, in other words until a local optimum has been found. Considering that assignments are made on the basis of the distance to the nearest seed, this process implicitly minimizes the "sum of the L2 distance squared between each data point and its nearest cluster center" (Bradley and Fayyad 1998).
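The loop described above can be sketched as follows; this is a minimal, illustrative implementation (the function and variable names are ours, not from any particular package), using randomly chosen data points as initial seeds and batch reassignment:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: random initial seeds, batch reassignment to the
    nearest centre, centroid update, until assignments stop changing."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # initialisation: k data points as seeds
    assign = [None] * len(points)
    for _ in range(iters):
        # assignment step: each element goes to its nearest centre
        new_assign = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(x, centres[j])))
            for x in points
        ]
        if new_assign == assign:             # local optimum reached
            break
        assign = new_assign
        # update step: move each centre to the mean of its members
        for j in range(k):
            members = [x for x, a in zip(points, assign) if a == j]
            if members:
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assign
```

This minimizes the sum of squared L2 distances quoted above; the sensitivity to the initial seeds is visible in the very first line of the loop.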
In the self-organisation process the input vectors are presented to the network, and the cluster unit whose weight vector is closest (usually in terms of Euclidean distance) is chosen as the winner. The next step is to update the values of the winning unit and its neighbouring units, bringing them closer to the input vector. This can be viewed as a motion of the units in the direction of the input vector; the magnitude of this movement depends on the learning rate, which decreases along the process in order to obtain convergence.
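A single training step of this process can be sketched as follows, for a one-dimensional map; the function name and signature are illustrative, not taken from any SOM package:

```python
def som_step(weights, x, lr, radius):
    """One SOM update on a 1-D map: find the winning unit (closest
    weight vector to input x) and move it and its map neighbours
    towards x by a fraction lr of the remaining distance."""
    dist2 = lambda w: sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    winner = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    for i, w in enumerate(weights):
        if abs(i - winner) <= radius:        # units inside the neighbourhood
            weights[i] = tuple(wi + lr * (xi - wi) for wi, xi in zip(w, x))
    return winner
```

In full training, both `lr` and `radius` are decreased over time; with `radius` fixed at zero the procedure reduces to a sequential form of k-means, as discussed below.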
Bearing in mind that Vector Quantization (VQ) is essentially the same as the k-means algorithm, and that the VQ is a special case of the SOM, in which the neighbourhood size is zero, one can say that there is a close relation between SOM and k-means. Openshaw and Wymer (1994) go further and say that the basic SOM algorithm "…is essentially the same as a K means classifier; with a few differences due to neighbouring training which might well be regarded as a form of simulated annealing and it may provide better results and avoid some local optima."
In spite of the large number of applications of GA to different types of optimisation problems, there is very little research on using this kind of approach for the clustering problem. In fact, bearing in mind the quality of the solutions that this technology has shown in different fields and problems (Beasley, Bull and Martin 1993a, Mitchell 1996), it makes perfect sense to try to use it in clustering problems.
The flexibility associated with GA is one important aspect to bear in mind. With the same genome representation and just by changing the fitness function one can have a different algorithm. In the case of spatial analysis this is particularly important since one can try different fitness functions in an exploratory phase.
The GA minimizes the squared error of the cluster dispersion:

E = Σ (k=1..K) Σ (xi ∈ Ck) || xi − mk ||²

where K is the number of clusters and mk is the centre of cluster Ck. This makes the objective similar to that of the k-means algorithm.
In the genome each gene represents a data point and defines its cluster membership. All the necessary evolution operators can be implemented with this scheme. As pointed out by Demiriz et al. (1999), the major problem with this representation scheme is that it is not scalable; on the other hand, it is computationally efficient when the number of data points is not too large. The GA was implemented through the customisation of GAlib (Wall 1996). The population size used was 100, with a replacement percentage of 50%, a crossover probability of 0.9 and a mutation probability of 0.01; uniform crossover was applied and 15000 generations were evaluated.
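Under this representation, evaluating a genome reduces to computing the within-cluster squared error for the labels it stores. A minimal sketch (the function name is ours; the actual implementation relies on GAlib, and lower values are better since the GA minimizes this quantity):

```python
def fitness(genome, points, k):
    """Within-cluster squared error for a genome that stores one
    integer cluster label per data point (lower is better)."""
    total = 0.0
    for j in range(k):
        members = [p for p, g in zip(points, genome) if g == j]
        if not members:
            continue                      # empty cluster contributes nothing
        # cluster centre is the mean of its members, coordinate by coordinate
        centre = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(sum((pi - ci) ** 2 for pi, ci in zip(p, centre))
                     for p in members)
    return total
```

Crossover and mutation then operate directly on the label strings, with this function driving selection.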
The data generated has the following characteristics: 750 observations, 16 variables, which make up 5 clusters.
After defining the centroid for each variable in the 5 clusters, the program generated 150 elements around each class centroid, based on an error parameter, which is the "size" of the standard deviation of the (normally distributed) error on a 7-point scale. The "error" is really a noise parameter introduced in each element: it simply increases or decreases the "tightness" of the clusters.
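The generation scheme just described can be sketched as follows; this is an illustrative reconstruction with names of our choosing, not the program actually used (class centroids plus normally distributed noise whose standard deviation is the error parameter):

```python
import random

def make_clusters(centroids, n_per_cluster, error, seed=0):
    """Generate noisy points around given class centroids: each
    coordinate is the centroid value plus Gaussian noise with
    standard deviation equal to the 'error' parameter."""
    rng = random.Random(seed)
    data, labels = [], []
    for label, c in enumerate(centroids):
        for _ in range(n_per_cluster):
            data.append(tuple(ci + rng.gauss(0.0, error) for ci in c))
            labels.append(label)
    return data, labels
```

Raising `error` loosens the clusters and produces more outliers, which is exactly how the difficulty of the test cases below is controlled.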
The real data come from the 1995 legislative elections in Portugal and consist of the voting percentages of the 5 major Portuguese parties, as well as abstention. The objective is to give an image of the behaviour of the different regions in terms of their voting patterns.
In the next step the error parameter was increased to 1.5. The "tightness" of the clusters was thus decreased, generating a greater number of outliers and shortening the distance between clusters. The classification is more challenging, due to the difficulties that arise from "less clear" cluster boundaries. In this case the k-means method performed rather poorly. In the clustering produced by the GA only 18 elements are misclassified: the original clusters are very clearly reconstructed, and it is possible to detect the "interaction" between elements of the 301-450 class and the 601-750 class, which deeply affected the k-means method. The clustering made by the SOM produced the best results, although it is difficult to draw strong conclusions on the superiority of the SOM over the GA, given how close the results are. Only 13 elements are misclassified (5 fewer than with the GA), and it is still possible to detect the "interaction" between elements of the 301-450 class and the 601-750 class. Structurally there seem to be no important differences between the two approaches.
Finally the error parameter was increased to 2. The results are very clear, supporting the idea that the GA and the SOM misclassify more elements as the error parameter increases, while the fundamental structure of the data is still respected. Another important aspect is that the SOM and the GA perform better in terms of optimising the within-cluster variability.
In general terms one can say that there are no major differences between them. Table 2 presents the values of the squared error of the cluster dispersion:
In this case the GA obtained the lowest value, which means that it was the best method in reducing the cluster dispersion, indicating a good performance in terms of optimisation.
It is important to note that this was not a completely fair trial: for the k-means and the Self-Organising Map, commercial software with very sophisticated algorithms was used, whereas the Genetic Algorithm was implemented with an algorithm still in early development, which can surely be improved.
The future challenge consists of using different representation schemes for the GA and comparing the qualities and shortcomings of the different representations. But the major challenge will be to use GA as the optimisation method while devising new and more spatially relevant objective functions, in order to develop geographic clustering methods.
However, it is too early to draw conclusions on which method is the best; further and extensive testing would have to be done. It is known that the initialisation process can have a major impact on k-means performance, and with a different, more sophisticated initialisation procedure the k-means results would probably be better. Nevertheless, a commercial statistical package was used, and the results should be enough to warn users of the problems that can occur when using standard k-means algorithms.
Aldenderfer, M., Blashfield, R. 1984. Cluster analysis, Series: Quantitative Applications in the Social Sciences, Sage Publications
Anderberg, M. R. 1973. Cluster Analysis for Applications, Academic Press
Beasley, D., Bull, D.R., & Martin, R.R. 1993a. "An Overview of Genetic Algorithms: Part 1, Fundamentals", University Computing, 15(2) 58-69. Available by ftp from ENCORE (See Q15.3) in file: GA/papers/over93.ps.gz or from ftp://ralph.cs.cf.ac.uk/pub/papers/GAs/ga_overview1.ps
Beasley, D., Bull, D.R., & Martin, R.R. 1993b. "An Overview of Genetic Algorithms: Part 2, Research Topics", University Computing, 15(4) 170-181. Available by ftp from ENCORE (See Q15.3) in file: GA/papers/over93-2.ps.gz or from ftp://ralph.cs.cf.ac.uk/pub/papers/GAs/ga_overview2.ps
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford University Press
Bradley, P. S., Fayyad, U. M. 1998. Refining Initial Points for K-Means Clustering, In J. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco
Churchill, G. 1995. Marketing Research: Methodological Foundations, 6th edition. Fort Worth, Texas: Dryden
Demiriz, A, Bennett, K. P., Embrechts, M. J. 1999. Semi-Supervised Clustering using Genetic Algorithms. Proceedings of Artificial Neural Networks in Engineering
Fausett, L. 1994. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Englewood Cliffs, NJ: Prentice Hall
Fischer, M., & Gopal, S. 1993. Neurocomputing and Spatial Information Processing From General Considerations to a Low Dimensional Real World Application. In M. Painho (Ed.), Proceedings of the workshop on New Tools for Spatial Analysis (pp. 55-68). Lisboa: Eurostat.
Goldberg, D. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading (MA): Addison-Wesley.
Gopal S. 1998. Artificial Neural Networks for Spatial Data Analysis, NCGIA Core Curriculum in GIScience, http://www.ncgia.ucsb.edu/giscc/units/u188/u188.html
Hartigan, J. 1975. Clustering Algorithms, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons.
Kohonen, T. 1988. Self-organization and Associative Memory (2nd Edition). Springer-Verlag: New York
Mitchell, M. 1996. An Introduction to Genetic Algorithms. MIT Press, Cambridge, Mass.
Openshaw S., & Openshaw C. 1997. Artificial Intelligence in Geography, Chichester, England, John Wiley & Sons
Openshaw, S., Blake, M., & Wymer, C. 1995. Using neurocomputing methods to classify Britain's residential areas. In Peter Fisher, Innovations in GIS 2 (pp. 97-111). Taylor and Francis.
Openshaw, S. & Wymer, C. 1994. Classifying and regionalizing census data. In S. Openshaw, Census Users Handbook (pp. 239-270). Cambridge, UK: Geo Information International.
Painho, M., Vasconcelos, L., Fonseca, J., & Geirinhas, J. 1997. The Clustering of Urban Environment: the Case of Lisbon. In João Reis Machado & Jack Ahern, Environmental Challenges in an Expanding Urban World and the Role of Information Technologies. CNIG.
Plane, D. & Rogerson, P. 1994. The Geographical Analysis of Population with Application to Planning and Business, John Wiley & Sons, Inc.
Sato, M., Sato, Y., & Jain, L. C. 1997. Fuzzy Clustering Models and Applications. Springer-Verlag: New York.
Schaffer, C., & Green, P. 1998. Cluster-based market segmentation: some further comparisons of alternative approaches. Journal of the Market Research Society, Volume 40 Number 2 April, 155-163.
Wall, M. 1996. Galib: A C++ Library of Genetic Algorithms Components. MIT, http://lancet.mit.edu/ga/