Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Retrieval: Batch Retrieval
IndriRunQuery
IndriRunQuery is a command-line application for batch retrieval through the Indri query interface. To issue a query with IndriRunQuery, create a parameter file, much like the one used to build an index, and run "IndriRunQuery <parameter_file>".
At its most basic, an IndriRunQuery parameter file consists of a memory parameter, an index path, and a query. As an example:
<parameters>
  <memory>256M</memory>
  <index>/path/to/the/index</index>
  <query>the query to issue</query>
</parameters>
If you do not want to create a parameter file, the parameters can be given on the command line instead. To pass the same parameters as in the example above:
IndriRunQuery -memory=256M -index=/path/to/the/index -query="the query to issue"
For a full listing of available parameters to use with IndriRunQuery, see the API documentation.
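As a rough illustration of how the parameter file and the command-line form relate, the Python sketch below builds both from one dictionary. The helper names (to_param_xml, to_command_line) are hypothetical and not part of Indri; only the parameter names and the -name=value form are taken from the examples above.

```python
import xml.etree.ElementTree as ET

def to_param_xml(params):
    """Render a flat dict of parameters as a <parameters> XML string."""
    root = ET.Element("parameters")
    for name, value in params.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

def to_command_line(params, program="IndriRunQuery"):
    """Build the equivalent argument list, one -name=value per parameter."""
    return [program] + [f"-{name}={value}" for name, value in params.items()]

params = {
    "memory": "256M",
    "index": "/path/to/the/index",
    "query": "the query to issue",
}

print(to_param_xml(params))     # text to save as the parameter file
print(to_command_line(params))  # argument list for the same run
```

On a machine where IndriRunQuery is installed, the argument list could be passed to subprocess.run. Because each argument is a separate list element, the query needs no extra quoting there.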
RetEval
Much like IndriRunQuery above, the RetEval application is a command-line interface for batch retrieval, but it also supports other retrieval methods such as TF-IDF, Okapi, and KL-divergence.
To run RetEval, issue the command "RetEval <parameter_file>", where <parameter_file> is the path to the parameter file used for your queries. The parameter file's structure and options are described below:
retModel
This is the retrieval model to use. It currently supports the following models:
- tfidf or 0 for TFIDF
- okapi or 1 for Okapi
- kl or 2 for Simple KL
- inquery or 3 for InQuery
- cori_cs or 4 for CORI collection selection
- cos or 5 for cosine similarity
- indri or 7 for the Indri structured query language
Other common parameters (for all retrieval methods) are:
- index: The complete name of the index table-of-content file for the database index.
- textQuery: the query text stream
- resultFile: the result file
- TRECResultFormat: whether the result format is the TREC format (i.e., six-column) or just a simple three-column format. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
- resultCount: the number of documents to return as result for each query
- feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
- feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters (see below).
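To make the two result layouts concrete, here is a minimal Python sketch that reads a line in either one. The six-column layout follows the standard TREC convention (query ID, the literal Q0, document ID, rank, score, run ID). The column order assumed for the three-column layout (query ID, document ID, score) is an assumption and should be checked against an actual result file. The sample lines are made up for illustration.

```python
from typing import NamedTuple, Optional

class Result(NamedTuple):
    query_id: str
    doc_id: str
    score: float
    rank: Optional[int] = None    # only present in the TREC layout
    run_id: Optional[str] = None  # only present in the TREC layout

def parse_result_line(line):
    """Parse one result line in either the six- or three-column layout."""
    fields = line.split()
    if len(fields) == 6:
        # TREC layout: query_id Q0 doc_id rank score run_id
        query_id, _q0, doc_id, rank, score, run_id = fields
        return Result(query_id, doc_id, float(score), int(rank), run_id)
    if len(fields) == 3:
        # Assumed simple layout: query_id doc_id score
        query_id, doc_id, score = fields
        return Result(query_id, doc_id, float(score))
    raise ValueError(f"expected 3 or 6 columns, got {len(fields)}")

trec_line = "301 Q0 DOC-0042 1 -4.5213 Exp"   # made-up sample
simple_line = "301 DOC-0042 -4.5213"          # made-up sample
print(parse_result_line(trec_line))
print(parse_result_line(simple_line))
```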
Model-specific parameters are:
- For TFIDF:
  - feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback. Only the positive part is implemented; non-relevant documents are ignored.
  - doc.tfMethod: document term TF weighting method: rawtf for RawTF, logf for log-TF, and bm25 for BM25TF
  - doc.bm25K1: BM25 k1 for doc term TF
  - doc.bm25B: BM25 b for doc term TF
  - query.tfMethod: query term TF weighting method: rawtf for RawTF, logf for log-TF, and bm25 for BM25TF
  - query.bm25K1: BM25 k1 for query term TF. bm25B is set to zero for query terms
- For Okapi:
  - BM25K1: BM25 K1
  - BM25B: BM25 B
  - BM25K3: BM25 K3
  - BM25QTF: The TF for expanded terms in feedback (the original paper about the Okapi system is not clear about how this is set, so it's implemented as a parameter.)
- For KL-divergence:
  - smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
  - smoothMethod: One of: Jelinek-Mercer (jm), Dirichlet prior (dir), Absolute discounting (ad), or Two stage (twostage)
  - smoothStrategy: Either interpolate (interpolate) or backoff (backoff)
  - JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
  - DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
  - discountDelta: The delta (discounting constant) in the absolute discounting method. Default: 0.7
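To show what the three single-parameter smoothing settings control, the Python sketch below uses the textbook forms of Jelinek-Mercer, Dirichlet prior, and absolute discounting smoothing under the interpolate strategy. It is an illustration under that assumption, not the Lemur implementation; the function names and the sample numbers are hypothetical, and the two-stage method and the backoff strategy are not shown.

```python
def jelinek_mercer(count, doc_len, p_coll, lam=0.5):
    """(1 - lam) * ML estimate + lam * collection model (cf. JelinekMercerLambda)."""
    return (1.0 - lam) * count / doc_len + lam * p_coll

def dirichlet(count, doc_len, p_coll, mu=1000.0):
    """Add mu pseudo-counts distributed by the collection model (cf. DirichletPrior)."""
    return (count + mu * p_coll) / (doc_len + mu)

def absolute_discount(count, doc_len, unique_terms, p_coll, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to the collection model."""
    discounted = max(count - delta, 0.0) / doc_len
    return discounted + (delta * unique_terms / doc_len) * p_coll

# Hypothetical document: 100 tokens, 60 distinct terms; the term occurs 3 times
# in the document and has collection probability 0.001.
print(jelinek_mercer(3, 100, 0.001))
print(dirichlet(3, 100, 0.001))
print(absolute_discount(3, 100, 60, 0.001))
```

In all three cases a term that does not occur in the document (count 0) still receives a non-zero probability proportional to its collection probability, which is the purpose of smoothing.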
Query model updating method (i.e., pseudo feedback):
- queryUpdateMethod: the feedback method: mixture model (mixture), divergence minimization (divmin), Markov chain (mc), relevance model 1 (rm1), or relevance model 2 (rm2).
- Method-specific feedback parameters:
For all interpolation-based approaches (i.e., those in which the new query model is an interpolation of the original model with a feedback model computed from the feedback documents), the following four parameters apply:
- feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
- feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
- feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
- feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.
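The four parameters above can be sketched in Python as a truncation step followed by an interpolation step. This is a minimal sketch under stated assumptions: the text does not say in what order Lemur applies the three truncation limits or whether the truncated model is renormalized, so here the limits are checked together while walking the terms in decreasing probability, and no renormalization is done. The function names and the sample models are hypothetical.

```python
def truncate_model(model, term_count, prob_thresh=0.001, prob_sum_thresh=1.0):
    """Keep the highest-probability terms until any of the three limits is hit."""
    kept = {}
    total = 0.0
    ranked = sorted(model.items(), key=lambda kv: kv[1], reverse=True)
    for term, prob in ranked:
        if len(kept) >= term_count:      # cf. feedbackTermCount
            break
        if prob <= prob_thresh:          # cf. feedbackProbThresh
            break
        if total >= prob_sum_thresh:     # cf. feedbackProbSumThresh
            break
        kept[term] = prob
        total += prob
    return kept

def interpolate(original, feedback, coefficient):
    """New model = (1 - coefficient) * original + coefficient * feedback."""
    terms = set(original) | set(feedback)
    return {
        t: (1.0 - coefficient) * original.get(t, 0.0)
           + coefficient * feedback.get(t, 0.0)
        for t in terms
    }

original = {"a": 0.5, "b": 0.5}
feedback = {"b": 0.6, "c": 0.3, "d": 0.0995, "e": 0.0005}

truncated = truncate_model(feedback, term_count=10)
print(truncated)                               # "e" falls below the threshold
print(interpolate(original, truncated, 0.5))   # cf. feedbackCoefficient
```

With a coefficient of 0 the result gives the original terms their original probabilities and every feedback-only term a probability of zero; with a coefficient of 1 only the feedback model remains.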
The mixture model, divergence minimization, and Markov chain methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but each interprets it differently.
- For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
- For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
- For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.
In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by a hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
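The roles of feedbackMixtureNoise and emIterations in the collection mixture model can be illustrated with the textbook EM procedure for a two-component mixture, sketched below in Python. This is not the code in SimpleKLRetMethod.cpp: the function name, the tol value standing in for the hard-coded convergence criterion, and the sample counts are all assumptions made for illustration. Every term in counts is assumed to have a non-zero collection probability.

```python
import math

def estimate_feedback_model(counts, p_coll, noise=0.5, max_iterations=50, tol=1e-6):
    """Estimate a feedback model theta from pooled feedback-document term counts.

    Each term occurrence is assumed to be drawn from the collection model
    with probability `noise`, and from theta otherwise.
    """
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start at the ML estimate
    prev_ll = -math.inf
    for _ in range(max_iterations):
        # E-step: probability that an occurrence of w came from theta
        resp = {
            w: (1 - noise) * theta[w] / ((1 - noise) * theta[w] + noise * p_coll[w])
            for w in counts
        }
        # M-step: re-estimate theta from the counts credited to it
        weighted = {w: counts[w] * resp[w] for w in counts}
        norm = sum(weighted.values())
        theta = {w: weighted[w] / norm for w in counts}
        # Stop early once the log-likelihood no longer improves noticeably
        ll = sum(
            c * math.log((1 - noise) * theta[w] + noise * p_coll[w])
            for w, c in counts.items()
        )
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta

# Hypothetical counts pooled over the feedback documents, and collection probabilities.
counts = {"the": 50, "retrieval": 30, "indri": 20}
p_coll = {"the": 0.6, "retrieval": 0.01, "indri": 0.001}
theta = estimate_feedback_model(counts, p_coll)
print(theta)
```

In this example the common word "the" ends up with a lower probability than its raw relative frequency of 0.5, because much of its count is explained by the collection model; the rarer terms gain probability in turn.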


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am

