
Retrieval Evaluation Application

This application runs retrieval experiments (with/without feedback) to evaluate different retrieval models as well as different parameter settings for those models.

Scoring is done either over a working set of documents (essentially re-ranking) or over the whole collection; this is controlled by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value or the value true, scoring is restricted to the working set specified in the file named by "workingSetFile". That file should have three columns: the first is the query id, the second the document id, and the last a numerical value, which is ignored. The third column exists so that any retrieval result in the simple (non-TREC) format generated by Lemur can be used directly as a "workingSetFile" for re-ranking; it can also be used to supply a prior probability for each document, which may be useful for some algorithms. By default, scoring is over the whole collection.
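For illustration, a working-set run could be configured with something like the sketch below. This is only a sketch: it assumes the XML-style parameter file format used by Lemur applications, and all file paths are placeholders.

    <parameters>
      <useWorkingSet>true</useWorkingSet>
      <workingSetFile>/path/to/workingset.txt</workingSetFile>
    </parameters>

Here workingset.txt holds one line per query-document pair in the three-column format described above, for example:

    Q1 DOC101 0
    Q1 DOC205 0
    Q2 DOC017 0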

It currently supports six different models:

  1. The popular TFIDF retrieval model
  2. The Okapi BM25 retrieval function
  3. The KL-divergence language model based retrieval method
  4. The InQuery (CORI) retrieval model
  5. Cosine similarity model
  6. Indri structured query language

The retrieval model is selected with the parameter retModel; its value names one of the six models listed above.

A bug is suspected in the feedback implementation for the Okapi BM25 retrieval function, because its performance is not as good as expected.

Other common parameters (for all retrieval methods) are listed below; a sample parameter file follows the list:

  1. index: The complete name of the index table-of-content file for the database index.

  2. textQuery: the query text stream

  3. resultFile: the result file

  4. resultFormat: whether the result should be written in the TREC format (i.e., six columns) or in a simple three-column format <queryID, docID, score>. String value: either trec for TREC format or 3col for the three-column format. The integer values used in previous versions of Lemur (zero for non-TREC format, non-zero for TREC format) are also accepted. Default: TREC format.

  5. resultCount: the number of documents to return as the result for each query

  6. feedbackDocCount: the number of documents to use for pseudo-feedback (0 means no feedback)

  7. feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters. (See below.)
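As an illustration, a minimal parameter file setting these common options might look like the sketch below. It assumes the XML-style parameter file format used by Lemur applications; the index, query, and result file names are placeholders, and the value kl for retModel is assumed here to select the KL-divergence model.

    <parameters>
      <index>/path/to/index.key</index>
      <retModel>kl</retModel>
      <textQuery>/path/to/queries.txt</textQuery>
      <resultFile>results.trec</resultFile>
      <resultFormat>trec</resultFormat>
      <resultCount>1000</resultCount>
      <feedbackDocCount>0</feedbackDocCount>
      <feedbackTermCount>20</feedbackTermCount>
    </parameters>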

Model-specific parameters are:

Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation; that is, the truncated feedback model must satisfy all three constraints.
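As a hedged example (the values below are hypothetical, not recommendations), settings such as

    <feedbackTermCount>20</feedbackTermCount>
    <feedbackProbThresh>0.001</feedbackProbThresh>
    <feedbackProbSumThresh>1</feedbackProbSumThresh>

would keep at most 20 feedback terms, discard any term whose probability in the feedback model falls below 0.001, and cap the total probability mass of the retained terms at 1.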

The first three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details. )
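For the collection mixture model specifically, a parameter sketch combining these two settings (the values here are illustrative) might be:

    <feedbackMixtureNoise>0.5</feedbackMixtureNoise>
    <emIterations>50</emIterations>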

