Lemur API Programming Examples

The following are some simple examples of using the Lemur API in your own programs. More examples can be found in the Lemur Wiki Pages.

Contents

  1. Browsing an index
  2. Building a collection language model
  3. Building a smoothed language model based on a set of documents
  4. Making a simple retrieval application
  5. Adding a new document model smoothing method
  6. Adding a new query model updating method
  7. Implementing a completely new retrieval method

1. Browsing an index

This example prints out all entries in the term-to-document index and document-to-term index of a collection whose table-of-contents (toc) file, named "index-toc-file.key", was built using the basic indexer.



Index *ind;  

ind = IndexManager::openIndex("index-toc-file.key");  
 // open the index specified by the table-of-content (toc) file "index-toc-file.key"
// IndexManager recognizes the suffix of the toc file (in this case ".key") and uses it to 
// infer the actual type of the index to be opened (KeyfileIncIndex).


// first, browse through the term->document index (i.e., the inverted index)

int termID;

// iterate over all possible termID's, the termCountUnique() function
// gives the total count of unique terms, i.e., the vocabulary size.
// Note that the term index 0 is reserved for out-of-vocabulary
// terms, so we start from 1.

for (termID = 1; termID <= ind->termCountUnique(); termID++) {

   cout << "term->document index entries for term : "
        << ind->term(termID) << endl;
   // The function call term(termID) returns the string form of the term.

  
   // now fetch doc info list for each term, this creates an
   // instance of DocInfoList, which needs to be deleted later! 
   DocInfoList *docList = ind->docInfoList(termID);
   
   // iterate over entries in docList
   docList->startIteration();   
   DocInfo *dEntry;
   while (docList->hasMore()) {
      dEntry = docList->nextEntry(); 
      // note that nextEntry() does *not* return a new instance;
      // instead, it passes out a pointer to a local static variable,
      // so no "delete" is needed. 

      // print out this entry
      cout << "-> " << dEntry->termCount() << " times in doc " 
           << ind->document(dEntry->docID()) << endl;
   }
   delete docList; // note that you MUST delete docList!
}

// second, browse through the document->term index 

int docID;

// iterate over all possible docID's, the docCount() function
// gives the total count of documents, i.e., the collection size.
// Note that the doc ID index 0 is reserved for out-of-vocabulary
// doc ID's, so we start from 1.
for (docID = 1; docID <= ind->docCount(); docID++) {

   cout << "document->term index entries for doc : "
        << ind->document(docID) << endl;
   // The function call document(docID) returns the string form of the ID.

  
   // now fetch term info list for each document, this creates an
   // instance of TermInfoList, which needs to be deleted later! 
   TermInfoList *termList = ind->termInfoList(docID);
   
   // iterate over entries in termList, i.e., all words in the document
   termList->startIteration();   
   TermInfo *tEntry;
   while (termList->hasMore()) {
      tEntry = termList->nextEntry(); 
      // note that nextEntry() does *not* return a new instance;
      // instead, it passes out a pointer to a local static variable,
      // so no "delete" is needed. 
      // print out this entry
      cout << "-> word " << ind->term(tEntry->termID()) << " occurs "
           << tEntry->count() << " times" << endl;
   }
   delete termList; // don't forget to delete termList!
}
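For readers without a Lemur index at hand, the relationship between the two views browsed above can be sketched with plain STL containers. Everything here (the `TermDocTable` alias, the `invert` function) is illustrative and not part of the Lemur API; the point is only that the document->term index is exactly the inversion of the term->document index:

```cpp
#include <map>
#include <string>

// A term -> (doc -> count) table; purely illustrative, not Lemur API.
typedef std::map<std::string, std::map<std::string, int> > TermDocTable;

// Invert a term->document table into a document->term table -- the same
// relationship that holds between docInfoList() and termInfoList() entries.
TermDocTable invert(const TermDocTable &postings) {
    TermDocTable docVectors;
    for (TermDocTable::const_iterator t = postings.begin();
         t != postings.end(); ++t)
        for (std::map<std::string, int>::const_iterator e = t->second.begin();
             e != t->second.end(); ++e)
            // term t occurs e->second times in document e->first
            docVectors[e->first][t->first] = e->second;
    return docVectors;
}
```

For example, inverting `{"retrieval": {"d1": 2, "d3": 1}}` yields `{"d1": {"retrieval": 2}, "d3": {"retrieval": 1}}`.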

2. Building a collection language model

A collection unigram language model can be built by counting the unigrams in a collection with a DocUnigramCounter and then constructing (estimating) a language model.

Index *myCollection; // assume this index has been opened as in Example 1

// construct a counter (counting all unigrams in a collection)
DocUnigramCounter *counter = new DocUnigramCounter(*myCollection);

// construct a maximum likelihood collection language model
MLUnigramLM * maxLikelihoodCollectionLM 
       = new MLUnigramLM ( *counter, myCollection->termLexiconID());

// construct a Laplace-smoothed collection language model 
LaplaceUnigramLM *laplaceCollectionLM =
        new LaplaceUnigramLM (*counter, myCollection->termLexiconID(), 
                              myCollection->termCountUnique());

// access the probability of word "retrieval"
cout << "p(retrieval)=" << laplaceCollectionLM->prob(myCollection->term("retrieval")) << endl;

3. Building a smoothed language model based on a set of documents

Assume that the (integer) IDs of the documents are stored in a vector of integers (i.e., an instance of the STL template vector<int>). We will again first construct a DocUnigramCounter, and then construct a language model. This time, we will use the Dirichlet prior smoothing method, with the collection language model as the reference model. As can be seen, this just involves instantiating a different unigram language model class based on the counter.

Index *myCollection; // assume this index has been opened as in Example 1

UnigramLM *refLangMod = laplaceCollectionLM; 
// this is our reference model

vector <int> docIDs; 
// the ID's of the documents that we want to use to build the language model 

// now construct the counter
DocUnigramCounter * counter = new DocUnigramCounter(docIDs, *myCollection);

double priorSampleSize = 1000; // the Dirichlet prior smoothing parameter (an illustrative value)

DirichletUnigramLM *dirDocSetLM = 
        new DirichletUnigramLM(*counter, myCollection->termLexiconID(),
                               *refLangMod, priorSampleSize);


4. Making a simple retrieval application

This example illustrates how a particular retrieval method can be used to run a set of text queries on an indexed collection.

// Step 1. Open the index
Index *dbIndex = IndexManager::openIndex("my-index-file.key");
// The IndexManager determines which concrete index class to instantiate
// based on the ".key" extension.


// Step 2. Open a text query stream
DocStream *queryStream = new BasicDocStream("my-text-query-file");
// The text query is assumed to conform to the format required by BasicDocStream

// Step 3. Create a score accumulator; we'll use ArrayAccumulator

ArrayAccumulator accumulator(dbIndex->docCount());

// Step 4. Instantiate RetrievalMethod, we'll use TFIDF
RetrievalMethod *myMethod = new TFIDFRetMethod(*dbIndex, accumulator); 



// Define a variable to hold results.

IndexedRealVector results; 

// Step 5. Execute each query

queryStream->startDocIteration();

TextQuery *q;
while (queryStream->hasMore()) {

    Document *d = queryStream->nextDoc(); // fetch the query document
    q = new TextQuery(*d); // construct a TextQuery
    cout << "query : "<< q->id() << endl;
    QueryRep * qr = myMethod->computeQueryRep(*q); // compute the query representation 
    // now score all documents    
    myMethod->scoreCollection(*qr, results);

    results.Sort(); // sorting results, assume a higher score means more relevant

    // now print out top 10 results
    cout << "Top 10 results: " << endl;
    int resCount = 0;
    IndexedRealVector::iterator it;
    it = results.begin();
    while ((it != results.end()) && (resCount++ < 10)) {
         cout << dbIndex->document((*it).ind)  // this is the document ID 
              << " " << (*it).val              // this is the score of this document
              << endl;
         it++;
    }
    delete qr; // clean up the query representations created in this iteration
    delete q;
}
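The "score everything, sort descending, keep the top k" pattern in the loop above can be sketched without Lemur at all. Here `std::pair<int, double>` stands in for an IndexedRealVector entry (document index, score); the function name `topK` is ours:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sort results by score, highest first, and keep only the top k --
// the same effect as results.Sort() followed by printing 10 entries.
std::vector<std::pair<int, double> >
topK(std::vector<std::pair<int, double> > results, std::size_t k) {
    std::sort(results.begin(), results.end(),
              [](const std::pair<int, double> &a,
                 const std::pair<int, double> &b) {
                  return a.second > b.second; // higher score = more relevant
              });
    if (results.size() > k)
        results.resize(k);
    return results;
}
```

For large collections, `std::partial_sort` would avoid fully sorting scores that are about to be discarded; the fully-sorted version above matches what the Lemur loop does.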

5. Adding a new document model smoothing method

In order to add a new document model smoothing method for the KL-divergence retrieval method, we need to create a subclass of SimpleKLDocModel. This class should have an interface similar to the following:

class MyNewSimpleKLDocModel : public SimpleKLDocModel {
public:

   // You need only implement the following two
   // functions, based on your new smoothing algorithm:
   double seenProb(double termFreq, int termID); // this gives p_seen(w|d)
   double unseenCoeff();   // this gives a(d)
};

There are at least two different ways to use this MyNewSimpleKLDocModel in the KL-divergence retrieval model:

  1. Change the toolkit source file SimpleKLRetMethod: Add a parameter value (to param.smthMethod) to indicate the new smoothing algorithm, and in the computeDocRep function, instantiate MyNewSimpleKLDocModel when the parameter says so. This requires recompiling the toolkit, and is not the preferred way from the viewpoint of object-oriented design.
  2. Subclass SimpleKLRetMethod and override the implementation of the virtual function computeDocRep:
    class MyNewKLRetMethod : public SimpleKLRetMethod {
    public:
       // You should re-implement this function by using your new smoothing method
       virtual DocumentRep *computeDocRep(int docID);
    };
    

    In the implementation file, you may have something like:

    DocumentRep *MyNewKLRetMethod::computeDocRep(int docID)
    {
       return (new MyNewSimpleKLDocModel(...));
    }
    

6. Adding a new query model updating method

Just like the addition of a new document smoothing method, there are also at least two ways for doing this.

  1. Change SimpleKLRetMethod: Add the implementation of your new method and modify the updateQuery method in SimpleKLRetMethod to use your new method when the parameter says to. (You'll also need to add a parameter value to indicate when your method should be used.) Again, this is not the preferred way from the viewpoint of object-oriented design.
  2. Subclass SimpleKLRetMethod: override the implementation of the virtual function updateQuery with that of your new implementation.

7. Implementing a completely new retrieval method

We need to distinguish two different types of retrieval methods, depending on whether the retrieval function can be rewritten in terms of functions g and w as explained in Section 2.7 of the API Guide: General retrieval methods for text queries.

If the method can be rewritten in such a form, then it is relatively easy to implement by subclassing TextQueryRetMethod. (Most practical retrieval algorithms can be transformed into such a form.) In general, you will need to define all four of the following elements:

  1. Query representation: create a subclass of TextQueryRep.
  2. Document representation: create a subclass of DocumentRep.
  3. Score function: create a subclass of ScoreFunction.
  4. Your retrieval method: create a subclass of TextQueryRetMethod that includes your way of updating a query (for feedback) and supports generation of your particular TextQueryRep and DocumentRep.
If your new retrieval method is close enough to one of the existing methods (i.e., TFIDF, Okapi, or KL-divergence), you may be able to reuse some of the existing subclasses of TextQueryRep or DocumentRep. Sometimes, even if you cannot reuse any of the existing concrete TextQueryRep or DocumentRep classes, it may still be easier to modify an existing one than to write your own class from scratch.
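The g-and-w decomposition is easy to see in miniature. Below, `w` and `g` are arbitrary placeholders (query weight times document term frequency, followed by length normalization); what matters is the shape s(q,d) = g(sum over matched terms of w(...)), which is the form TextQueryRetMethod assumes:

```cpp
#include <utility>
#include <vector>

// Per-term weighting function w; here a placeholder
// (query term weight times document term frequency).
double w(double queryTermWeight, double docTermFreq) {
    return queryTermWeight * docTermFreq;
}

// Query/document-level adjustment function g; here a simple
// document length normalization.
double g(double rawSum, double docLength) {
    return rawSum / docLength;
}

// s(q,d) = g( sum over matched terms of w(...) ).
// Each pair is (query term weight, document term frequency).
double score(const std::vector<std::pair<double, double> > &matchedTerms,
             double docLength) {
    double sum = 0.0;
    for (std::size_t i = 0; i < matchedTerms.size(); ++i)
        sum += w(matchedTerms[i].first, matchedTerms[i].second);
    return g(sum, docLength);
}
```

In the real framework, ScoreFunction::matchedTermWeight plays the role of `w` and ScoreFunction::adjustedScore plays the role of `g`, with the sum over matched terms handled by TextQueryRetMethod's generic scoring loop.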

If your scoring function cannot be transformed into this form, you will need to subclass the general RetrievalMethod. You will need to do all of the following:

  1. Query representation: create a subclass of QueryRep.
  2. Your retrieval method: create a subclass of RetrievalMethod that implements the following methods according to your retrieval algorithm:
    • computeQueryRep computes your own QueryRep
    • scoreDoc computes the score of a single document with respect to a QueryRep
    • updateQuery implements your way of updating a query (for feedback)
    You may also override the generic (inefficient) implementation of scoreCollection in RetrievalMethod.
For these methods, it is usually impractical (very time-consuming) to score the whole collection, unless it is a very small one. Generally, you may consider scoring only a pre-selected subset of documents (i.e., a working set); that is, treat the task as re-ranking a given subset of documents. The application RetEval.cpp supports such evaluation: set the parameter useWorkingSet to true (a non-zero integer) and set the parameter workSetFile to the file with the IDs of the selected subset of documents.

Version 1.1 of Lemur does not have any support for structured queries, but it is possible to add a retrieval method working on structured queries by subclassing RetrievalMethod. You will need to modify the current retrieval evaluation program RetEval.cpp to take care of the loading of structured queries; as is, it only recognizes text queries.

The following is a sketch of a possible implementation of a retrieval method that can be rewritten in terms of w and g.

// define your own query representation to support your scoring function
class MyQueryRep : public TextQueryRep {
       ...
};

// define your own doc representation to support your scoring function
class MyDocRep : public DocumentRep {
       ...
};

class MyScoreFunc : public ScoreFunction {
public:
   double matchedTermWeight (QueryTerm *qTerm, TextQueryRep *qRep, DocInfo *info, DocumentRep *dRep) {
        // downcast all objects (when necessary) to get access to
        // all the information required for scoring.
        // Such downcasting is safe, because the objects that are passed in
        // must have been generated originally by MyRetMethod.

        MyQueryTerm *trueQueryTerm = (MyQueryTerm *)qTerm; 
        // Do this if you have a MyQueryTerm class and you need the extra
        // information provided by it for scoring.

        MyQueryRep *trueQueryRep = (MyQueryRep *)qRep;
        MyDocRep *trueDocRep = (MyDocRep *)dRep;

        // now calculate and return the weight
   }

   double adjustedScore (double origScore, TextQueryRep *qRep, DocumentRep *dRep);
   // This can be implemented similarly.
};

class MyRetMethod :  public TextQueryRetMethod {
public:
   
   MyRetMethod(Index &dbIndex, ScoreAccumulator &accumulator) {
      ...
      scFunc = new MyScoreFunc(...);
   }

   TextQueryRep *computeTextQueryRep (TextQuery &qry) {
       return (new MyQueryRep(qry, ...));
   }


   DocumentRep *computeDocRep (int docID) {
       return (new MyDocRep(docID, ...));
   }

   ScoreFunction *scoreFunc() {
       return (scFunc);  
   }

   void updateQuery(TextQueryRep &origRep, DocIDSet &relDocs) {
      MyQueryRep *trueQueryRep = dynamic_cast<MyQueryRep *>(&origRep);
      // implement your favorite feedback algorithm here.
   }
   
protected:
   MyScoreFunc *scFunc;
};
As mentioned above, you may be able to reuse one or more of the existing classes; in that case, you may not need all the subclasses shown in this example. For example, you may be able to reuse TFIDFDocRep as your DocumentRep, so you only need to subclass TextQueryRep and ScoreFunction. It is also possible to partially reuse an existing concrete representation, so that you do not need to subclass the abstract interface; for example, you may want to subclass TFIDFDocRep, rather than DocumentRep, if you can partially reuse some of its components.