Xiaowei Xu
Professor
Phone: (501) 683-7266
Teaching
· Spring 2008
o IFSC 2315: Information Systems Software
July 1998 | Ph.D. (Dr. rer. nat.) in computer science, University of Munich (LMU),
Dec. 1987 | M.Sc. in computer science, Shenyang Institute for Computing Technology, Chinese Academy of Sciences, P. R. China
July 1983 | B.Sc. in computer science, Nankai University,
2007 - present | Professor, Department of Information Science, University of Arkansas at Little Rock
2002 - 2007 | Associate Professor, Department of Information Science, University of Arkansas at Little Rock
2003 | Visiting Professor, Microsoft Research China, hosted by Dr. Wei-Yin Ma
2003 | Visiting Research Scientist and Consultant, Siemens AG, Corporate Technology, hosted by Dr. Volker Tresp
1998 - 2002 | Senior Research Scientist, Siemens AG, Corporate Technology, Information and Communications
1993 - 1998 | Teaching and Research Assistant, University of
1992 - 1993 | Visiting Scholar,
1983 - 1992 | Research Scientist,
Research Interests
· Knowledge Discovery in Databases and Data Mining: Clustering algorithms, classification, trend detection, feature and instance selection, parallel and distributed data mining algorithms, interfaces to database management systems, scalable algorithms for large databases, web/text mining, collaborative filtering, data mining in biological/medical databases
· Spatial Database Systems: Efficient query processing, spatial access methods, applications in geographic information systems
· Multimedia Database Systems: Similarity search, index structures, query processing, applications in biological, medical and CAD databases
· OLAP and Data Warehousing: Index structures, data reduction, data mining and knowledge discovery in data warehousing environments
Summary of previous Work (references refer to the publication list)
With my students and colleagues, I developed a set of scalable data mining methods for the (semi-)automatic extraction and analysis of "patterns" from spatial databases as well as web logs and customer databases. The specific topics to which I have been able to contribute can be broadly categorized into the following areas:
· Knowledge Discovery and Data Mining
o Clustering Algorithms
DBSCAN [25] is a density-based clustering method designed to detect clusters of arbitrary shape and to distinguish noise in spatial and multi-dimensional databases. Technically, the algorithm is based on region queries, which can be supported efficiently by spatial index structures such as R-trees (at least if the dimension of the data space is not too high). PDBSCAN [4] is a parallel version of DBSCAN for the "shared-nothing" architecture, in which multiple computers are interconnected through a network. PDBSCAN offers linear speedup and has excellent scaleup and sizeup behavior. For clustering in dynamic databases (i.e., when the database changes through insertions and deletions over time), an efficient incremental version of DBSCAN was developed [21]. Determining "natural" parameters for a density-based clustering of a data set may be difficult. This problem is solved by the clustering algorithm DBCLASD, which is based on the assumption that the points inside a cluster are uniformly distributed [22]. BRIDGE [16] efficiently merges K-means and DBSCAN, exploiting the advantages of each to counter the limitations of the other. One problem with DBSCAN is its tendency to merge many slightly connected clusters together. This problem is addressed by the RDBC algorithm [3].
DBSCAN (Windows/NT version) can be downloaded from http://ifsc.ualr.edu/xwxu/Software/dbscan.zip. After unzipping the file, run dbscan.exe.
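The region-query idea behind density-based clustering can be illustrated with a minimal Python sketch. It is not the downloadable implementation: the parameter names eps (neighborhood radius) and min_pts (density threshold) are assumptions chosen for this example, and a brute-force scan stands in for the spatial index (such as an R-tree) that would answer region queries efficiently in practice.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (brute force)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is dense, so keep expanding
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

Each call to region_query here costs a full scan of the data; replacing it with an index lookup is what makes the approach practical for large spatial databases.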
o Collaborative Filtering
Collaborative filtering uses a database of consumers' preferences to make personal product recommendations and has achieved widespread success in E-Commerce. However, traditional collaborative filtering algorithms do not scale well to the ever-growing number of consumers. The quality of the recommendations also needs to be improved in order to gain more trust from consumers. To improve both efficiency and accuracy, feature weighting and instance selection were studied from a unified information-theoretic perspective [2]. Two feature-weighting methods to improve the accuracy of collaborative filtering algorithms were proposed in [15]. Furthermore, we introduced an information-theoretic approach to measure the relevance of a consumer (instance) to the given product (target concept) and proposed to reduce the training data set by selecting only highly relevant instances [14]. A further significant performance improvement can be achieved by data reduction techniques [13].
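As a point of reference, the kind of memory-based baseline that such refinements start from can be sketched as follows. This is a textbook Pearson-weighted neighbor scheme, not the feature-weighting or instance-selection methods of the cited papers. The data layout (a dictionary mapping each consumer to a dictionary of product ratings) and the function names are assumptions made for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two consumers over the products both rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, item):
    """Predict user's rating of item from positively correlated consumers."""
    target = ratings[user]
    user_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(target, theirs)
        if w <= 0:
            continue  # ignore uncorrelated and negatively correlated consumers
        other_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - other_mean)
        den += w
    # Fall back to the consumer's own mean when no neighbor can vote.
    return user_mean if den == 0 else user_mean + num / den


if __name__ == "__main__":
    ratings = {
        "ann": {"a": 5, "b": 3, "c": 4},
        "bob": {"a": 5, "b": 3, "c": 4, "d": 5},
        "cat": {"a": 1, "b": 5, "c": 2, "d": 1},
    }
    print(predict(ratings, "ann", "d"))
```

Because predict scans every consumer for every prediction, its cost grows with the size of the preference database, which is the scalability problem that selecting only relevant instances is meant to reduce.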
o Spatial Data Mining
The effectiveness of spatial clustering algorithms is somewhat limited because they do not fully exploit the richness of the different types of data contained in a spatial database. We proposed the concept of density-connected sets and presented GDBSCAN, a significantly generalized version of DBSCAN ([5] and [23]). The major properties of this algorithm are as follows: (1) any symmetric predicate can be used to define the neighborhood of an object, allowing a natural definition in the case of spatially extended objects such as polygons, and (2) the cardinality function for a set of neighboring objects may take into account the non-spatial attributes of the objects as a means of assigning application-specific weights. Density-connected sets can be used as a basis to discover trends in a spatial database [23]. We defined trends in spatial databases and showed how to apply the GDBSCAN algorithm to the task of discovering such knowledge. An application of this technique in the area of economic geography can be found in [23].
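The two generalizations described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the names is_neighbor, weight and min_weight are invented for this example, the objects are one-dimensional intervals carrying a non-spatial "population" attribute, and neighborhoods are found by a brute-force scan.

```python
NOISE = -1  # label for objects that belong to no density-connected set


def gdbscan(objects, is_neighbor, weight, min_weight):
    """Cluster objects using a symmetric neighborhood predicate and a
    weighted cardinality instead of a distance radius and a point count."""
    labels = [None] * len(objects)

    def neighborhood(i):
        return [j for j, o in enumerate(objects) if is_neighbor(objects[i], o)]

    def is_core(neigh):
        return sum(weight(objects[j]) for j in neigh) >= min_weight

    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not None:
            continue
        neigh = neighborhood(i)
        if not is_core(neigh):
            labels[i] = NOISE  # may still be absorbed as a border object
            continue
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = neighborhood(j)
            if is_core(j_neigh):
                seeds.extend(j_neigh)
        cluster += 1
    return labels


if __name__ == "__main__":
    # Each object is (low, high, population): an extended object plus a
    # non-spatial attribute used as its weight.
    regions = [(0, 2, 40), (2, 4, 30), (4, 5, 50),
               (10, 11, 5), (11, 12, 5), (20, 21, 200)]
    touches = lambda a, b: a[0] <= b[1] and b[0] <= a[1]  # intervals intersect
    population = lambda o: o[2]
    print(gdbscan(regions, touches, population, min_weight=100))
```

With a distance-threshold predicate and a weight of 1 per object, the same function reduces to ordinary density-based clustering; in the example, the isolated but heavily populated last region forms a cluster by itself because its weight alone reaches the threshold.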
o Web Mining
A great challenge for web site designers is how to ensure that users can reach important web pages easily and efficiently. We developed a clustering-based approach to address this problem [3]. Our approach is to perform efficient and effective correlation analysis based on web logs and to construct clusters of web pages that reflect the co-visit behavior of web site users. We presented a novel approach for adapting DBSCAN to the problem domain of web page clustering [17], and showed that our new methods can generate high-quality clusters for very large web logs where previous methods fail. Based on the high-quality clustering results, we then applied the mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We developed an automatic method for web-interface adaptation by introducing index pages that minimize overall user browsing costs [3]. The index pages provide shortcuts so that users reach their target web pages quickly, and we solved a previously open problem of how to determine an optimal number of index pages. Experiments on several realistic web-log files showed empirically that our approach performs better than many of the previous algorithms.
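The notion of co-visit behavior can be illustrated with a short sketch that counts how often two pages occur in the same user session. It assumes that the web log has already been split into sessions (lists of page URLs), and it uses a raw co-occurrence count as a stand-in for the correlation measure; both are assumptions made for this example rather than details of the published method.

```python
from collections import Counter
from itertools import combinations


def covisit_counts(sessions):
    """Count, for each unordered page pair, the sessions that visit both."""
    counts = Counter()
    for pages in sessions:
        # Deduplicate within a session so repeated visits count only once.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    sessions = [
        ["/home", "/a", "/b"],
        ["/a", "/b"],
        ["/home", "/c"],
        ["/a", "/b", "/a"],
    ]
    for pair, n in covisit_counts(sessions).most_common():
        print(pair, n)
```

Page pairs with high counts are candidates for the same cluster, and a cluster of frequently co-visited pages is a natural candidate for a shared index page.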
· Similarity Search in Spatial and Multimedia Databases
Proteins play an important role in every living organism, since they are the active agents in all fundamental processes of life, such as digestion, metabolism, and the immune system. Proteins perform their function by interacting with other molecules, a process called docking. An important heuristic for the prediction of molecular interaction is the "key-and-lock" principle: the docking sites of the partner molecules show strong complementarity, especially in their geometry. Many docking sites may be determined solely by this geometric complementarity. Thus, the docking problem may be transformed into a search problem for complementary surface segments. In the project BIOWEPRO [29], funded by the German Ministry for Education, Science, Research, and Technology (BMBF), we developed new database techniques to effectively and efficiently support 1:n docking prediction for proteins [28]. Our approach includes new representation and storage methods for molecular surfaces as well as new methods for similarity query processing for 3D surface segments with respect to shape similarity. Selecting segments from the database that have a similar (or complementary) 3D shape yields a set of potential docking candidates for the query protein. Following our segmentation approach, we computed the molecular surface and extracted potential docking segments for all the proteins in our database. For each segment, various shape representations are computed that are appropriate for supporting a complementarity search in the database.
· Multidimensional Query Processing
We developed a new technique for multidimensional query processing which can be widely applied in database systems [18]. The new technique, called tree striping, generalizes the well-known inverted lists and multidimensional indexing approaches. A theoretical analysis of this generalized technique shows that both inverted lists and multidimensional indexing are far from optimal. A consequence of the analysis is that using a set of lower-dimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree striping is to use the optimal number k of lower-dimensional indexes, as determined by the theoretical analysis, for efficient query processing. We confirmed our theoretical results by an experimental evaluation on large amounts of real and synthetic data. The results show a speed-up of up to 310% over the multidimensional indexing approach and a speed-up factor of up to 123 (12,300%) over the inverted-lists approach.
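The way tree striping sits between the two classical approaches can be shown with a small sketch. It is a simplification under stated assumptions: the dimensions are split into k contiguous groups of nearly equal size, a linear scan stands in for each lower-dimensional index, and the partial results are combined by set intersection. The function names are invented for this example, and the choice of the optimal k is not modeled.

```python
def make_stripes(d, k):
    """Split dimensions 0..d-1 into k contiguous groups of near-equal size."""
    base, extra = divmod(d, k)
    stripes, start = [], 0
    for s in range(k):
        size = base + (1 if s < extra else 0)
        stripes.append(list(range(start, start + size)))
        start += size
    return stripes


def stripe_query(points, dims, lo, hi):
    """IDs of points inside the query box on the given dimensions only.
    A linear scan stands in for a real lower-dimensional index."""
    return {i for i, p in enumerate(points)
            if all(lo[x] <= p[x] <= hi[x] for x in dims)}


def range_query(points, stripes, lo, hi):
    """Answer a d-dimensional range query by intersecting stripe results."""
    result = None
    for dims in stripes:
        ids = stripe_query(points, dims, lo, hi)
        result = ids if result is None else result & ids
    return result if result is not None else set()


if __name__ == "__main__":
    points = [(1, 1, 1, 1), (2, 2, 2, 9), (5, 5, 5, 5), (2, 3, 2, 3)]
    lo, hi = (0, 0, 0, 0), (3, 3, 3, 3)
    for k in (1, 2, 4):  # k=1: one 4-d index; k=4: inverted lists
        print(k, make_stripes(4, k), sorted(range_query(points, make_stripes(4, k), lo, hi)))
```

All values of k return the same answer set; what changes in a real system is the cost of the index lookups and of merging their results, which is the trade-off the theoretical analysis optimizes.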
Current and future Work
The areas where I am particularly interested in making further progress can be roughly categorized as follows:
· Text Mining
Unstructured text databases are common in many manufacturing and service business operations. Examples include service reports for automobiles, descriptions of claims in the insurance industry, and medical records. Over time, such databases continue to grow and become a huge and unwieldy source of information. This information can be used to make business operations more efficient and to save unnecessary expenses. For example, an automobile manufacturer may have a database containing records of customer service performed by its dealers; such information may be used to decide on the future direction of research and development based on the problems reported for a product. Such information is also very valuable for marketing decisions. However, data mining from large text/hypertext databases is especially challenging because of the extremely high dimensionality of the data and its distributed storage. I am currently working on a new hierarchical clustering algorithm to construct a concept hierarchy automatically from a large text corpus [10].
· Spatial-temporal Data Mining
Spatio-temporal data mining is the non-trivial extraction of implicit, potentially useful and novel knowledge with an implicit or explicit spatio-temporal content from large spatio-temporal databases. It is a very promising subfield of data mining because increasingly large volumes of spatio-temporal data are collected and need to be analyzed. It is also a challenging research area because spatio-temporal data and knowledge are much more complex than non-spatial and non-temporal data. I am working on a spatio-temporal data mining method for personalized location-dependent information filtering. An information filtering algorithm that explores the content of the information, the usage of the information, and the location and time of the information will be developed for mobile business.
· Data Mining in biological data / Bioinformatics
I am also very interested in the area of Bioinformatics. Many data analysis tasks in Biology can be approached from a data mining perspective. I have experience in the development of a database management system to support protein-protein docking prediction [28, 29]. In the future, I want to work on the following problems: 1. Clustering gene expression data of different tissues, which requires the development of clustering techniques for ultra-high dimensional data (about 200,000 dimensions). 2. Finding suppression relations between genes using the same gene expression database as in the first project. 3. Predicting the structure of proteins.
Professional Activities
· Invited Talks
o International Workshop on Management of Information on the Web - Web Data and Text Mining (MIW'01), in conjunction with the 12th International Conference on Database and Expert Systems Applications (DEXA'2001), September, 2001, Munich, Germany
o Dagstuhl Symposium on Declarative Database on the Web, September, 1999
o Microsoft Research
o ABB Corporate Research Ltd., ABB Industry AG, April, 1998
o Ubilab, Information Technology Laboratory of UBS AG, June, 1998
o European Science Foundation Workshop on "From Information Fusion to Data Mining",
· Program Committee (PC) Members
o Program Committee Member of the IEEE International Conference on Data Mining (ICDM 06)
o Program Committee Member of the IEEE International Conference on Data Mining (ICDM 05)
o Program Committee Member of the Annual ACM Symposium on Applied Computing (SAC 05)
o Session Chair for the Second
o International Workshop on Management of Information on the Web - Web Data and Text Mining (MIW'01), in conjunction with the 12th International Conference on Database and Expert Systems Applications (DEXA'2001), September, 2001, Munich, Germany
o International Workshop on Data Models and Databases on Clusters and the Grid (DataGrid 2001), in conjunction with IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2001), May 2001, Brisbane, Australia
· Refereeing and Reviewing of Journal and Conference Submissions
o Distributed and Parallel Databases, An International Journal, Kluwer Academic Publishers
o Knowledge and Information Systems, An International Journal, Springer-Verlag
o ACM SIGMOD International Conference on Management of Data (SIGMOD'95),
o Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'97) in cooperation with ACM-SIGMOD'97 Tucson, Arizona, May, 1997
Publications
A list of my publications from the DBLP Bibliography Server
Refereed Journals
1. K. Yu, A. Schwaighofer, V. Tresp, X. Xu and H.-P. Kriegel: "Probabilistic Memory-based Collaborative Filtering", IEEE Transactions on Knowledge & Data Engineering, special issue on "Mining and Searching the Web", Volume 16, Number 1, January 2004, pp. 56-69.
2. K. Yu, X. Xu, M. Ester and H.-P. Kriegel: "Feature Weighting and Instance Selection for Collaborative Filtering: An Information-Theoretic Approach", Knowledge and Information Systems, Volume 5, Number 2, April 2003, pp. 201-224, Springer-Verlag London Ltd.
3. Z. Su,
4. X. Xu, J. Jäger and H.-P. Kriegel: "A Fast Parallel Clustering Algorithm for Large Spatial Databases", Data Mining and Knowledge Discovery, an International Journal, Volume 3, Issue 3, September 1999, pp. 263-290, Kluwer Academic Publishers.
5. J. Sander, M. Ester, H.-P. Kriegel, X. Xu: "Density-Based Clustering in Spatial Databases: A New Algorithm and its Applications", Data Mining and Knowledge Discovery, an International Journal, Volume 2, Issue 2, June 1998, pp. 169-194, Kluwer Academic Publishers.
6. M. Ester, H.-P. Kriegel, J. Sander, X. Xu: "Clustering for Mining in Large Spatial Databases", KI Künstliche Intelligenz (Journal of Artificial Intelligence), 1, 1998, ScienTec Publishing, pp. 18-24.
7. X. Xu, M. Ester, H.-P. Kriegel, J. Sander: "Clustering and Knowledge Discovery in Spatial Databases", Vistas in Astronomy, 1997, (Special issue, proceedings of European Science Foundation workshop on "From Information Fusion to Data Mining", Granada, April 1997, editors R. Molina, F. Murtagh and A. Heck).
Refereed Conference Proceedings
8. Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger: "SCAN: A Structural Clustering Algorithm for Networks", The Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, Aug. 12-15, 2007.
9. Zhidan Feng, Xiaowei Xu, Nurcan Yuruk, Thomas Schweiger: "A Novel Similarity-based Modularity Function for Graph Partitioning", 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2007), Regensburg, Germany, 3-7 September 2007.
10. Nurcan Yuruk, Mutlu Mete, Xiaowei Xu, and Thomas Schweiger: "A Divisive Hierarchical Structural Clustering Algorithm for Networks", IEEE ICDM Workshop on Mining Graphs and Complex Structures (MGCS2007), in conjunction with the Seventh IEEE Int. Conf. on Data Mining (ICDM 2007), October 28, 2007, Embassy Suites Hotel, Omaha, NE, USA.
11. X. Xu, Z. Feng and T. Schweiger: "Fast and Effective Clustering Very Large Networks Using Density-Based Clustering Algorithm", DIMACS Workshop on Clustering Problems in Biological Networks, May 9 - 11, 2006, DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ.
12. X. Xu, M. Mete and N. Yuruk: "Mining Concept Associations for Knowledge Discovery from Large Textual Databases", 20th Annual ACM Symposium on Applied Computing,
13. Z. Xu, X. Xu, K. Yu and V. Tresp: "A Hybrid Relevance-Feedback Approach to Text Retrieval", 25th European Conference on Information Retrieval Research (ECIR'03),
14. Z. Xu, K. Yu, V. Tresp and X. Xu: "Representative Sampling for Text Classification using Support Vector Machines", 25th European Conference on Information Retrieval Research (ECIR'03), Pisa, Italy, April 14-16, 2003.
15. K. Yu, X. Xu, A. Schwaighofer and H.-P. Kriegel: "A Likelihood-Based Approach to Data Selection for Collaborative Filtering", ACM 11th International Conference on Information and Knowledge Management (CIKM'02), November 2002,
16. F. Beil, M. Ester and X. Xu: "Frequent Term-Based Text Clustering", 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002,
17. K. Yu, X. Xu, J. Tao, M. Ester and H.-P. Kriegel: "Efficient Collaborative Filtering: Reduction Techniques for Large Preference Database with Many Missing Values", Second SIAM International Conference on Data Mining (SDM'02), April 2002, Arlington, VA.
18. K. Yu, Z. Wen, X. Xu, M. Ester and H.-P. Kriegel: "Selecting Relevant Instances for Efficient and Accurate Collaborative Filtering", ACM 10th International Conference on Information and Knowledge Management (CIKM'01), November 2001,
19. K. Yu, Z. Wen, X. Xu and M. Ester: "Feature Weighting and Instance Selection for Collaborative Filtering", 2nd International Workshop on Management of Information on the Web - Web Data and Text Mining (MIW'01), in conjunction with the 12th International Conference on Database and Expert Systems Applications (DEXA'01), September 2001.
20. Manoranjan Dash, Huan Liu, Xiaowei Xu: "'1+1>2': Merging Distance and Density Based Clustering", 7th International Conference on Database Systems for Advanced Applications (DASFAA'01), April 18-20, 2001,
21. Zhong Su, Qiang Yang, Hong-Jiang Zhang, Xiaowei Xu and Yu-Hen Hu: "Correlation-based Document Clustering using Web Logs", 34th Hawaii International Conference on System Sciences (HICSS-34), January 3-6, 2001, IEEE Computer Society.
22. S. Berchtold, C. Böhm, D. A. Keim, H.-P. Kriegel, X. Xu: "Optimal Multidimensional Query Processing Using Tree Striping", Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'00),
23. X. Xu: "Web Mining for E-Commerce", Dagstuhl Symposium on Declarative Database on the Web, September 1999,
24. A. Lazarevic, X. Xu, T. Fietz and Z. Obradovic: "Clustering-Regression-Ordering Steps for Knowledge Discovery in Spatial Databases", International Joint Conference on Neural Networks (IJCNN'99), July 10-16, 1999,
25. M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu: "Incremental Clustering for Mining in a Data Warehousing Environment", 24th International Conference on Very Large Databases (VLDB'98), August 24 - 27, 1998,
26. X. Xu, M. Ester, H.-P. Kriegel, J. Sander: "A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases", 14th Int. Conf. on Data Engineering (ICDE'98),
27. M. Ester, H.-P. Kriegel, J. Sander, X. Xu: "Density-Connected Sets and their Application for Trend Detection in Spatial Databases", 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), AAAI Press, 1997.
28. X. Xu, M. Ester, H.-P. Kriegel and J. Sander: "Efficient Clustering for Knowledge Discovery in Spatial Databases", European Science Foundation Workshop on "From Information Fusion to Data Mining",
29. M. Ester, H.-P. Kriegel, J. Sander, X. Xu: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96),
30. M. Ester, H.-P. Kriegel, X. Xu: "Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification", 4th Int. Symp. on Large Spatial Databases (SSD'95),
31. M. Ester, H.-P. Kriegel, X. Xu: "A Database Interface for Clustering in Large Spatial Databases", 1st Int. Conf. on Knowledge Discovery and Data Mining (KDD'95),
32. M. Ester, H.-P. Kriegel, T. Seidl, X. Xu: "Formbasierte Suche nach komplementaeren 3D-Oberflaechen in einer Protein-Datenbank", GI-Fachtagung Datenbanken in Buero, Technik und Wissenschaft (BTW '95), Informatik aktuell, Springer, 1995, pp. 373-382. (in German).
33. D. Schomburg, U. Jakob, M. Meyer, P. Wilson, G. Sagerer, F. Ackermann, G. Herrmann, S. Posch, M. Soumpasis, G. Grimm, B. Ihmels, M. Strahm, H.-P. Kriegel, T. Seidl, T. Schmidt, M. Ester, X. Xu: 'BIOWEPRO' - Wechselwirkungen von Proteinen, BMBF (Hrsg.): Tagungsband BMBF-Statusseminar Bioinformatik "Molekulare Bioinformatik und Evolutionäre Algorithmen", Braunschweig, 07.-08.10.1995, pp.125-153.
Book
31. Xiaowei Xu: "Efficient Clustering for Knowledge Discovery in Spatial Databases",
Last update 1/14/2008