Document Clustering

Contents

  1. Overview
  2. Applications
  3. Clustering API

1. Overview

Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document. The clustering algorithms implemented for LEMUR are described in "A Comparison of Document Clustering Techniques", Michael Steinbach, George Karypis and Vipin Kumar. TextMining Workshop. KDD. 2000. With the exception of Probabilistic Latent Semantic Analysis (PLSA), all use cosine similarity in the vector space model as their metric.

The LEMUR clustering support provides two principle APIs, the Cluster API, which defines the clusters themselves, and the ClusterDB API, which defines how Clusters are persistently stored. Similarity is scored via a SimilarityMethod object. Currently there is a single SimilarityMethod, CosSim, defined.

2. Applications

Cluster

Performs the basic online clustering task. In conjunction with an incremental indexer (such as using a KeyfileIncIndex), it could be used for the TDT topic detection task. It iterates over the documents in the index, assigning each document that is not in any cluster to a cluster. The document id, cluster id, and score are printed to the standard output. The parameters accepted by Cluster are:

OfflineCluster

The example application that demonstrates the basic offline clustering task. Provides k-means and bisecting k-means partitional clustering. It will run each algorithm on the first 100 documents in the index (or all of them if less than 100) and print out the results. The parameters accepted by OfflineCluster are:

PLSA

Perform Probabilistic Latent Semantic Analysis (PLSA) on a collection, building three probability tables: P(z), P(d|z), and P(w|z) where z ∈ Z are the latent variables (categories), d ∈ D are the documents in the collection, and w ∈ W are the terms in the vocabulary over the collection, or open those tables and read them into memory to illustrate their potential use. The implementation (the PLSA class) is based on the Java reference implementation from Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. "Methods and Metrics for Cold-Start Recommendations." in Proceedings of the 25'th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). See http://www.cis.upenn.edu/datamining/software_dist/PennAspect/. Note that training takes a long time.

The parameter doTrain (true|false) determines whether the tables are constructed or read. The default value is true. The other parameters accepted by PLSA are:

3. Clustering API

Cluster

The cluster API provides an abstraction over a collection of cluster elements (ClusterElt), enabling the addition or removal of elements. It provides a score method for that uses the similarity between the object and another Cluster (see SimilarityMethod below). It also provides read and write methods for use by the ClusterDB. The two concrete subclasses of Cluster are AgglomCluster and CentroidCluster. Cluster instances are created via the ClusterFactory::allocateCluster method.

ClusterDB

The ClusterDB API provides for interactions with persistent collections of Cluster objects. There are two concrete implementors of this API. FlatfileClusterDB, which stores the Cluster objects in a flat file, similar to those used by the Inv(FP)Index class, and KeyfileClusterDB, which stores the Cluster objects in Keyfiles. ClusterDB implementors must provide methods to add or remove elements from a Cluster, merge two Clusters, or split a Cluster into multiple clusters. They must provide a way to retrieve a Cluster given its id number or given the document id of an element within the cluster. Finally they need to provide a Factory method for creating a new, empty Cluster.

SimilarityMethod

SimilarityMethod is an abstraction over comparing two ClusterRep (vector space representation) Cluster objects. The CosSim method is the only concrete implementation, it computes the cosine similarity. SimilarityMethods must provide two methods, one to weigh a vector (such as normalizing) and the similiarity function itself. SimilarityMethod objects are created by the SimFactory::makeSim method, using parameters defined in the ClusterParam namespace. To add a new SimilarityMethod, one needs to do the following:

  1. In ClusterParam.hpp add a symbol for the new method in the simTypes enum.
  2. In ClusterParam.hpp add an else if to test the simTypeString for equality with the new method type parameter.
  3. In SimFactory.hpp add an include statement for the new method's header file.
  4. In SimFactory.hpp add a case for the new method symbol to makeSim that makes an instance of the new method.
  5. Recompile your lemur library.

OfflineCluster

OfflineCluster provides k-means and bisecting k-means clustering of a set of documents, returning the k clusters in a vector. The clusters are not persistent. The Cluster class uses k-means to implement its split method.