Lemur Indexing Applications

Contents

  1. BuildIndex
  2. BuildDocMgr
  3. BuildPropIndex
  4. IndriBuildIndex
  5. PassageIndexer
  6. IncIndexer
  7. IncPassageIndexer

1. BuildIndex

This application builds an Inv(FP)Index, KeyfileIncIndex, or IndriIndex for a collection of documents.

To use it, follow the general steps for running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of the index you want to build
  3. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  4. position: store position information (def = 1), applicable only for inv indexes. Keyfile and Indri always store positions.
  5. stopwords: name of file containing the stopword list.
  6. acronyms: name of file containing the acronym list. Not currently supported by IndriIndex, which still indexes these acronyms in lowercase.
  7. countStopWords: If true, count stopwords in document length.
  8. docFormat:
  9. stemmer:
  10. dataFiles: name of file containing list of datafiles to index.

2. BuildDocMgr

BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. If an index name is provided, BuildDocMgr builds an inverted index at the same time.

Summary of required parameters:

  1. manager: (required) name of the document manager (without extension)
  2. managerType: (required) name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
  3. docFormat:
  4. dataFiles: name of file containing the list of datafile names (one datafile name per line; use full paths)
The following parameters are optional and are used for building an index:
  1. index: name of the index table-of-content file without any extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
  3. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  4. position: store position information (def = 1).
  5. stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
  6. acronyms: name of file containing the acronym list.
  7. countStopWords: If true, count stopwords in document length.
  8. stemmer:
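
The dataFiles parameter expects a plain text file with one datafile name per line, using full paths. A minimal sketch of generating such a file follows. The helper name write_datafile_list is hypothetical and is not part of Lemur; it assumes that every regular file under the given directory should be indexed.

```python
import os
import tempfile


def write_datafile_list(source_dir, list_path):
    """Write the full path of every file under source_dir, one per line.

    Hypothetical helper, not part of Lemur. The one-name-per-line,
    full-path layout follows the description of the dataFiles parameter.
    """
    paths = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()  # keep the order stable across runs
    with open(list_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths


if __name__ == "__main__":
    # Demonstration with throwaway files in temporary directories.
    with tempfile.TemporaryDirectory() as corpus_dir, \
            tempfile.TemporaryDirectory() as work_dir:
        for name in ("a.sgml", "b.sgml"):
            with open(os.path.join(corpus_dir, name), "w") as f:
                f.write("<DOC></DOC>\n")
        list_file = os.path.join(work_dir, "datafiles.list")
        for path in write_datafile_list(corpus_dir, list_file):
            print(path)
```

The name of the resulting list file is what would be given as the value of dataFiles.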

3. BuildPropIndex

This application builds an InvFPIndex for a collection of documents with properties associated with terms.

Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...

* Data files can be specified on the command line OR in a metafile named by the dataFiles parameter.

The parameters are:

  1. index: name of the index to create (don't include extension)
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
  3. memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
  8. stemmer:
  9. dataFiles: name of file containing list of datafiles to index.

4. IndriBuildIndex

This application builds an Indri Repository for a collection of documents. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format, and its top-level element is named parameters. The command line uses dotted path notation.
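
To illustrate how the dotted path notation relates to the XML format, here is a minimal sketch that converts command-line style arguments into the equivalent parameters element. The function name dotted_args_to_xml is hypothetical; this is only an illustration of the notation, not how the Indri applications parse their arguments. As a simplification, repeated containers such as corpus are merged into one element.

```python
import xml.etree.ElementTree as ET


def dotted_args_to_xml(args):
    """Convert arguments such as '-corpus.path=/data' into a <parameters> tree.

    Illustration only (hypothetical helper): it mirrors the notation
    described in the text. Repeated containers are merged into one.
    """
    root = ET.Element("parameters")  # top-level element named in the text
    for arg in args:
        key, _, value = arg.lstrip("-").partition("=")
        parts = key.split(".")
        node = root
        for part in parts[:-1]:
            # Reuse an existing container (e.g. <corpus>) when present.
            child = node.find(part)
            if child is None:
                child = ET.SubElement(node, part)
            node = child
        ET.SubElement(node, parts[-1]).text = value
    return ET.tostring(root, encoding="unicode")


print(dotted_args_to_xml([
    "-memory=100M",
    "-corpus.path=/path/to/file_or_directory",
    "-corpus.class=trecweb",
]))
```

Each dot in an argument name corresponds to one level of element nesting, so -corpus.path becomes a path element inside a corpus element.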

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor, given as a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
corpus
a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are:
path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • trecweb -- TREC web format, e.g. terabyte track.
  • trectext -- TREC format, e.g. TREC-3 onward.
  • doc -- Microsoft Word format (windows platform only).
  • ppt -- Microsoft Powerpoint format (windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
metadata
a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:
  1. field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
  2. forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line.
  3. backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line.
field
a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:
name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
stemmer
a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter; the default is no stemming.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter; the default is no stopping.
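
The scaling suffixes accepted by the memory parameter can be worked through in a short sketch. The function parse_memory is a hypothetical helper written for illustration and is not taken from the Indri sources; it applies the rules stated above: decimal digits only, followed by an optional case-insensitive suffix K, M, or G.

```python
def parse_memory(value):
    """Return the number of bytes denoted by a value such as '100M'.

    Hypothetical helper following the rules in the text:
    K = 1000, M = 1000000, G = 1000000000 (case insensitive).
    """
    scale = {"K": 1000, "M": 1000000, "G": 1000000000}
    text = value.strip()
    factor = 1
    if text and text[-1].upper() in scale:
        factor = scale[text[-1].upper()]
        text = text[:-1]
    # Only decimal digits may remain once the optional suffix is removed.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("invalid memory value: %r" % value)
    return int(text) * factor


print(parse_memory("100M"))
print(parse_memory("96000000"))
```

Note that the factors are powers of ten, so 100M is 100000000 bytes rather than 100 * 1024 * 1024.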

5. PassageIndexer

This application builds an FP passage index for a collection of documents. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.
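
The segmentation can be sketched as a sliding window over the terms of a document. This is a minimal illustration under stated assumptions, not the PassageIndexer implementation: the function name segment_passages is hypothetical, passageSize/2 is assumed to use integer division, and the final passage is assumed to be allowed to be shorter than passageSize.

```python
def segment_passages(terms, passage_size):
    """Split terms into passages of passage_size with passage_size/2 overlap.

    Hypothetical sketch. Assumptions: integer division for the overlap,
    and a possibly shorter final passage.
    """
    if passage_size < 1:
        raise ValueError("passage_size must be at least 1")
    overlap = passage_size // 2
    step = passage_size - overlap  # how far the window advances each time
    passages = []
    start = 0
    while start < len(terms):
        passages.append(terms[start:start + passage_size])
        if start + passage_size >= len(terms):
            break  # this window already reaches the end of the document
        start += step
    return passages


for passage in segment_passages("a b c d e f g h i j".split(), 4):
    print(" ".join(passage))
```

With passageSize set to 4, each passage shares its last two terms with the start of the next one.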

To use it, follow the general steps for running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
  7. stemmer:
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

6. IncIndexer

This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index; otherwise, a new index is created.

To use it, follow the general steps for running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
  7. stemmer:
  8. dataFiles: name of file containing list of datafiles to index.

7. IncPassageIndexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index; otherwise, a new index is created. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps for running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
  7. stemmer:
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

The Lemur Project
  Last modified: Monday, 13-Jun-2005 13:09:54 EDT