This section describes the design of some of the most useful classes in the toolkit. A complete list of all classes and their documentation can be found in the Lemur source documentation
The classes DocStream, Document, DocumentProps, and TokenTerm are a group of abstract classes that specify an abstract interface for streaming a text collection. By subclassing these classes, a user may define a customized document stream handler.
The interface is expected to be used in the following way. First, iteration can be performed on DocStream to obtain each instance of Document sequentially. However, the call of the method nextDoc is expected to be followed by another iteration call startTermIteration before the next call of nextDoc can be performed. This is because the method nextDoc does NOT advance the file pointer to the beginning of the next document; it is when we iterate over terms (i.e., getting instances of TokenTerm) that the file position pointer gets advanced. Note that you can call startTermIteration multiple times to repeat reading the terms in a document, and you can also call skipToEnd to "fast" forward to the end of the current document without reading all the terms, which then would allow you to call nextDoc to get the next document. An instance of TokenTerm provides the spelling of a term and possibly other information. The obtained document instance also supplies information about document ID and possibly other information in DocumentProps.
The Lemur toolkit has an implementation of DocStream called BasicDocStream. It supports the following simple format:
<DOC docid1> term1 term2 ... termN </DOC> <DOC docid2> ... </DOC>
BasicDocStream can be used to access a document stream for any purpose (not just indexing).
Note that as long as the interface of DocStream is respected, the indexer will not care about where the terms or documents are actually from. This means, we can "fake" a collection of documents as one single document (e.g., by pooling all terms together), and this would allow us to support indexing of collections for distributed IR.
2.2 Text handlers (utility level)
The abstract class TextHandler allows for the chaining or pipelining of common parser components. It provides a general and flexible support for preprocessing documents. This interface is extended to provide a Parser, Stemmer, and Stopper. See Parsing in Lemur for more details.
2.3 Building an index (Indexer Level)
The "push" method of building an index is to use the PushIndex interface to add documents and terms; thus, the indexing process is actually controlled by the application, which allows more than one index to be built at the same time. The application can use any kind of text handling, such as the ones provided in the utility level before passing the tokens to the PushIndex. There are three implemented PushIndex classes: InvIndex, InvFPIndex, which stores position information, and KeyfileIncIndex. Both Inv indexes use delta encoding and RVL compression and are not inhibited by system file size limits.
Here is an example of how the PushIndex interface can be used to build one regular index and another index with one very large document at the same time:
InvFPPushIndex *regIndex = new InvFPPushIndex(indexname, memorysize, maxfilesize);
InvFPPushIndex *onedocIndex = new InvFPPushIndex(indexname, memorysize, maxfilesize);
...
onedocIndex->beginDoc(documentprops);
while (moreTokens()) {
switch (token) {
case DOCID:
// close out the previous
if (!firstDocument()) {
setEndDocumentProps(documentprops)
regIndex->endDoc(documentprops);
}
// start a new document
setNewDocumentProps(documentprops)
regIndex->beginDoc(documentprops);
break;
case TERM:
setTermInformation(invfpterm);
regIndex->addTerm(invfpterm);
onedocIndex->addTerm(invfpterm);
break;
}
regIndex->endCollection(collectionprops);
onedocIndex->endDoc(documentprops);
onedocIndex->endCollection(collectionprops);
2.4 Access to index information (Indexer Level)
The most important Lemur abstract classes at the indexer level are Index, TermInfoList, DocInfoList, TermInfo, and DocInfo. These classes provide a uniform interface for access to an indexed document collection. They are usually the only classes that an application program needs to deal with when accessing an index. The functions supported by Index can be grouped in the following way:
These two ways of access are supported by two functions on Index: termInfoList and docInfoList. Function termInfoList returns an instance of TermInfoList which supports iterations over all the counts along with their corresponding term ID. Similarly, docInfoList returns an instance of DocInfoList, which supports iterations over all the counts along with the corresponding document ID. The instances returned by functions termInfoList and docInfoList must be deleted later. However, when iterating over entries of either a DocInfoList or a TermInfoList, you will get a pointer to a static instance in the indexer, so there is no need to, and you must not, delete the instance for each entry. (See the example below.)
InvIndex is a basic inverted index. InvFPIndex stores word positions in a document, in addition to the basic frequency matrix information, and can return a TermInfoList with the terms in sequence order, according to the position information, by extending the Index interface to include a termInfoListSeq method. The position information can be useful for evaluating proximity operators or algorithms that analyze word sequences, such as Hidden Markov Models (HMM). Position information can be accessed by using the extended classes InvFPDocInfo (from DocInfo) and InvFPTerm (from TermInfo). InvIndex and InvFPIndex are not inhibited by a system file size limit. Also, they use delta encoding and RVL compression.
BuildIndex builds an InvIndex or InvFPIndex . Usually, an application programmer would only need to construct an instance of the indexer that is actually used, and then will access information through the abstract interface Index. The class IndexManager has an openIndex method that can create the proper index using its file suffix.
The following is an example of how to use the index interface:
Index *ind;
ind = IndexManager::openIndex("index-toc-file.ifp");
// open the index specified by the table-of-content (toc) file "index-toc-file.ifp"
// IndexManager recognizes the suffix of the toc file (in this case ".ifp") and uses it to
// infer the actual type of the index to be opened (InvFPIndex).
int t1;
...
// now fetch the doc info list for term t1
DocInfoList *dList = ind->docInfoList(t1);
dList->startIteration();
while (dList->hasMore()) {
DocInfo * entry = dList->nextEntry(); // obtains a pointer to a static instance
cout << "entry doc id: " << entry->docID() << endl;
cout << "entry term count: " << entry->termCount() << endl;
// note that you MUST NOT delete entry!
}
delete dList; // you MUST delete dList!
// TermInfoList can be accessed in exactly the same way
delete ind; // The IndexManager creates a dynamic instance.
2.5 Counts and language models (Indexer Level)
The classes UnigramLM, Counter, DocUnigramCounter, and SmoothedMLEstimator provide support for building language models on top of an Index The class UnigramLM is an abstract representation of a unigram language model. It supports looking up a probability for a given word and iterating over all the non-zero probabilities. Since each word is referred to by its integer index, a UnigramLM is also required to hold a lexicon ID that provides an interpretation of a word index. Typically, this is the file name of the term lexicon associated with an Index. This id is unused.
The class Counter is an abstract representation of a counter that provides an interface for access to the counts it holds. The concept of "count" is fairly general, and could be an integer, real, or user-defined value. Such generality can be useful for algorithms such as Expectation Maximization (EM), where the counts are fractional. The only implemented counter is based on an array (ArrayCounter). Since an array must be held in the memory, such a counter has a limit of how many entries can be stored, as well as a limit of how many such counters can be active at the same time. (This limitation can be broken by implementing a disk-based counter.) To facilitate language model construction based on an Index, we implemented a DocUnigramCounter which is a subclass of ArrayCounter. A DocUnigramCounter can be constructed in three different ways, corresponding to counting the words in a single document, in a set of documents, or in the entire collection.
The class SmoothedMLEstimator is a special UnigramLM that is estimated based on smoothing the maximum likelihood estimate. This is very common when we want to estimate a language model based on a piece of observed text (e.g., a document, or a set of documents). From this viewpoint, perhaps, an alternative (better?) name for this class would be TextUnigramLM. SmoothedMLEstimator holds a counter, which is assumed to hold the counts needed by the maximum likelihood estimator. The reason for the existence of this class is that several implementations use a smoothed maximum likelihood estimator based on a counter, and require iteration over non-zero probability entries. Without SmoothedMLEstimator, the implementation of such iterations would have to be duplicated in each of the specific smoothing methods. In addition to the maximum likelihood estimator, three specific smoothing methods are implemented: fixed coefficient linear smoothing, a Bayesian method using a Dirichlet prior, and the popular Laplace smoothing which is a special case of the Dirichlet method with a uniform prior.
The separation of the counter from the language model interface makes it easy to change the storage-level representation of counts without affecting the smoothing algorithm. For example, the Dirichlet prior smoothing method can take any concrete counter, as long as it conforms to the interface defined by the abstract class Counter. The general way of constructing a unigram language model is to first construct a counter and then use the counter to construct a (possibly smoothed) language model. For example, to construct a non-smoothed document language model, one could construct a DocUnigramCounter based on a document ID and an Index, and then construct a maximum likelihood unigram language model (MLUnigramLM) using the DocUnigramCounter as the counter.
2.6 General Retrieval Architecture (Retrieval Level)
Classes Query, QueryRep, and RetrievalMethod provide a very general, yet flexible support for different types of retrieval methods. Query represents any query, including a structured query. QueryRep represents any (internal) query representation determined by a specific retrieval method. It may also be used to hold any document-independent, query-specific values precomputed before scoring each document to make scoring process more efficient. The major abstract class for retrieval is RetrievalMethod which is on top of an Index and supports four methods:
It is reasonable to require any retrieval method to support a method for scoring one single document. Given that we have scoreDoc, it may seem redundant to have scoreCollection. However, scoring a collection based on scoreDoc would mean to score each document one by one, which is inefficient. Indeed, the scoring of the whole collection can often be much more efficient for many practical retrieval methods, if done on the inverted index. The existence of scoreCollection makes it possible for those retrieval methods to override it with a more efficient scoring strategy. For example, TextQueryRetMethod implements a generic scoring procedure based on the inverted index.
Now, one might also wonder why we also need scoreDoc now that we have scoreCollection. scoreDoc is needed for two main reasons. First, it may be impractical to score the whole collection with some complex retrieval algorithms, or you may only be interested in re-ranking some subset of documents. In that case, scoreCollection would be inappropriate, and being able to score a single document will be necessary. Second, sometimes, it may be desirable to compute the similarity between two documents using a retrieval function (e.g., treating one document as a query), and in that case, scoreDoc is also useful.
The support of feedback through modifying a QueryRep has three advantages:
In most cases, a query can be represented by a set of weighted terms, or a term vector. This is often true when the query is a piece of text (i.e., a text query). But, it may not be true for a Boolean query. When a query can be represented as a set of weighted terms, the retrieval function can often be computed efficiently via an inverted index. The abstract class TextQueryRetMethod is designed to support such retrieval functions. It inherits from RetrievalMethod and implements a generic inverted index scoring procedure, which involves iterating over all the query terms, for each query term, looking up all the entries in the inverted index to find the information about all the documents containing this term, and then, computing a sum of weights of each matched term for all the documents that match at least one query term. A ScoreAccumulator serves for accumulating the score for a document incrementally. A TextQueryRetMethod may use any ScoreAccumulator that is passed in through the constructor. This provides the flexibility of using different kinds of accumulators. For example, if you can afford to store an entry for each document in memory, then using the ArrayAccumulator may be appropriate. But, if you have a huge collection, you may want to use a ScoreAccumulator that sits on the disk. A subclass of TextQueryRetMethod overrides appropriate functions to provide a specific term weighting strategy, a specific function for combining the weights, and any needed term-independent transformation or adjustment of the score.
Formally, given a query q =(q1,q2,...,qN) and a document d=(d1,d2,...,dN), where q1,...,qN and d1,...,dN are terms, TextQueryRetMethod assumes the following general scoring function:
That is, the score of a document d against a query q is a function g of the accumulated weight w for each matched term. The score is thus determined by two functions g and w; both may depend on the whole query or document. The function w gives the weight of each matched term, while the function g makes it possible to perform any further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length).s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
Three abstract classes TextQueryRep, DocumentRep, and ScoreFunction are designed to support this general scoring function in the following way:
The weighting function w and score adjustment function g typically assume and depend on some particular information and representation of the query and document, so a specific ScoreFunction (for a specific retrieval method) only works for some specific TextQueryRep and DocumentRep that are appropriate for the specific retrieval method. Such coordination is implemented by grouping compatible TextQueryRep, DocumentRep, and ScoreFunction methods together underneath a TextQueryRetMethod class. Different document/query representations are possible by subclassing a DocumentRep or TextQueryRep. Similarly, alternative scoring functions can be achieved by subclassing ScoreFunction.
The generic scoring procedure is implemented by the function scoreInvertedIndex in TextQueryRetMethod, and the procedure is sketched as follows:
Normally, the documents that did not match any query term are ignored from scoring, but these documents do not necessarily have a zero score. It depends on the retrieval method, they may have a non-zero score. Although the scores of such documents are generally lower than any of the document that matches at least one query term, it is sometimes useful to have the actual scores of those documents. For this purpose, the function scoreInvertedIndex has an option to allow you to score "all" the documents, including the one that has not matched any query term.
The general scoring function given above looks more restrictive than it is. It may not be obvious, but many retrieval functions can be re-written in this form after some algebra transformation. A typical transformation is to convert the sum over the query terms that have not been matched by a document into the sum over all the query terms minus the sum over all the matched ones. A similar transformation can be done for a sum over the document terms that have not been matched by a query. As a result, the scoring formula often just involves the sum over all the matched terms and some other query-dependent or document-dependent constants.
TextQueryRetMethod also implements scoreDoc by iterating over query terms and checking each query term against a hash table created for the document to be scored.
2.8 Specific retrieval methods for text queries (Retrieval Level)
A specific inverted-index scorable retrieval method can be implemented by sub-classing TextQueryRetMethod. Since TextQueryRetMethod already implemented scoreDoc and scoreCollection, a specific retrieval method only needs to override the following four virtual functions and define its "own version" of TextQueryRep, DocumentRep, and ScoreFunction by subclassing these three abstract classes.
The instance returned by the computeTextQueryRep implemented in a specific retrieval method will actually be a specific TextQueryRep that depends on the specific TextQueryRetMethod class. The same is true for computeDocRep. For example, in the KL-divergence retrieval method, the actual DocumentRep is SimpleKLDocModel, which is, of course, a subclass of DocumentRep. In order to support different document model smoothing methods, we created different subclasses of SimpleKLDocModel (i.e., JelinekMercerDocModel, DirichletPriorDocModel, and AbosluteDiscoutDocModel).
In Version 1.1, three retrieval methods were implemented: TFIDFRetMethod, OkapiRetMethod, and SimpleKLRetMethod Each of these classes, along with its corresponding TextQueryRep, DocumentRep, and ScoreFunction, implements a different retrieval method. They are all subclasses of the abstract class TextQueryRetMethod. These classes are to be used in the following way:
The feedback method implemented in TFIDFRetMethod is a modified and simplified Rocchio. It basically constructs the centroid vector of the set of relevant documents (or the top-N documents in the case of pseudo feedback), and adds the vector to the original query vector with some weighting coefficient. It does not use any non-relevant (negative) documents. See details of the implementation of the Lemur TFIDF for more information.
OkapiRetMethod is intended to be an exact implementation of the BM25 retrieval function as described in [1]. However, the feedback part as implemented appears to have some problems, since it does not give reasonable feedback performance. The BM25 retrieval formula has a query TF for each term. In case the term is a new term extracted from the feedback documents, it does not have a natural "query TF". It is unclear from the original paper how this is set, so we implemented it as an additional parameter.
SimpleKLRetMethod implements the KL-divergence retrieval model, which is an extension of the query-likelihood approach [4]. It essentially scores a document by computing the KL-divergence between the query language model and the document language model. Since we are only interested in ranking documents, we rewrote the KL-divergence formula and only computes the part that affects ranking, which is essentially the cross entropy of the query model with respect to the document model.
2.9 Document and query models (Retrieval Level)
The two classes SimpleKLDocModel and SimpleKLQueryModel are the abstract interface for the document language model and query language model to be used in a KL-divergence retrieval model. This means that to add a new query or document model estimation method, one generally creates a subclass of these abstract classes or provides a way to construct an instance of these classes. For example, JelinekMercerDocModel, DirichletPriorDocModel, and AbosluteDiscoutDocModel are three subclasses of SimpleKLDocModel, corresponding to three different document model smoothing methods evaluated in [2]. (The backoff version is also implemented.)
The SimpleKLDocModel class assumes the following general smoothing strategy:
p(w|d) = pseen(w|d) if w is seen in document dwhere pseen(w|d) gives the discounted maximum likelihood estimate of p(w|d), p(w|C) is the collection language model, and a(d) is a document-dependent constant that controls the amount of smoothing. Given a collection language model (computed based on an Index), a smoothing method is essentially characterized by pseenand the backoff weight a(d). Accordingly, SimpleKLDocModel defines an interface for them.
p(w|d) = a(d) p(w|C) otherwise
SimpleKLQueryModel supports iteration over non-zero probabilities of a unigram query language model for scoring. It also supports query model updating through a method that allows linear interpolation of the existing query model with any given unigram language model (see below). Thus, it is possible to estimate a (feedback) unigram language model in any way you like and then evaluate its retrieval effectiveness by using this interpolation function to update an existing query model. A special case is to interpolate with a zero weight for the original model, which will then be equivalent to using the estimated feedback model alone.
2.10 Feedback as Query Language Model Updating (Retrieval Level)
With the KL-divergence retrieval formula, it is possible to view feedback as re-estimating or updating a query language model. SimpleKLRetMethod implemented three different methods for estimating a query model based on a subset of (feedback) documents, all by interpolating the newly estimated model with the original query model with some empirically set coefficient. The three feedback methods are the collection mixture model method, the divergence minimization method, and the Markov chain translation model. The first two are proposed and evaluated in [3] and the third in [4]. The three methods are implemented as three functions in SimpleKLRetMethod. A special parameter controls which method to use when the function updateQuery is called. A new feedback method can be easily added by subclassing a RetrievalMethod and overrides the method updateQuery with the new method.
2. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval, In 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), 2001.
3. C. Zhai and J. Lafferty, Model-based feedback in the language modeling approach to information retrieval, Tenth International Conference on Information and Knowledge Management (CIKM 2001), 2001.
4. J. Lafferty and C. Zhai. Risk minimization and language modeling in information retrieval. In 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), 2001.
The Lemur Project
Last modified: Monday, 13-Jun-2005 13:15:03 EDT