Data Mining and Knowledge Discovery 3 (3): 263-290, September 1999  Copyright © 1999 Kluwer Academic Publishers All rights reserved
 

A Fast Parallel Clustering Algorithm for Large Spatial Databases


Xiaowei Xu
Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, D-81730 München, Germany. Xiaowei.Xu@mchp.siemens.de

Jochen Jäger
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 München, Germany.
jaeger@informatik.uni-muenchen.de

Hans-Peter Kriegel
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 München, Germany.
kriegel@informatik.uni-muenchen.de

Abstract
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as
to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with
multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure.
We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the
data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A
performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.

Keywords
clustering algorithms, parallel algorithms, distributed algorithms, scalable data mining, distributed index structures, spatial databases

Article ID: 241179

You may send me an email to ask for a repint.