Lemur Indexing Applications

Contents

  1. BuildIndex
  2. IndriBuildIndex
  3. BuildDocMgr
  4. BuildPropIndex

1. BuildIndex

This application builds a KeyfileIncIndex or an IndriIndex for a collection of documents.

To use it, follow the general steps for running a Lemur application.

The parameters are:

  1. index: name of the index table-of-contents file, without the extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of the index you want to build
    • key for KeyfileIncIndex (.key)
    • indri for IndriIndex (.ind)
  3. memory: memory (in bytes) to pre-allocate (def = 96000000).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list. Acronym lists are currently not supported by IndriIndex, which still indexes these acronyms, but in lowercase.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer.
    • arabic Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.
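
The dataFiles parameter names a file that lists the data files to index. The following is a minimal sketch of producing such a list, assuming one full path per line (the layout the BuildDocMgr section gives for its own dataFiles parameter). The function name write_data_files_list and the choice to list only the files directly inside one directory, in sorted order, are illustrative and not part of Lemur.

```python
import os


def write_data_files_list(corpus_dir, list_path):
    """Write the absolute path of each regular file directly inside
    corpus_dir to list_path, one path per line, and return the paths."""
    paths = sorted(
        os.path.abspath(os.path.join(corpus_dir, name))
        for name in os.listdir(corpus_dir)
        if os.path.isfile(os.path.join(corpus_dir, name))
    )
    with open(list_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

The path passed as list_path is the value to give to dataFiles. Writing the list outside corpus_dir keeps it from being listed as a data file on a later run.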

2. IndriBuildIndex

This application builds an Indri Repository for a collection of documents. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format whose top-level element is named parameters. The command line uses dotted path notation, in which the names of nested elements are joined with dots.
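
The relationship between the two notations can be illustrated with a short sketch. It assumes the parameters are held in a nested Python dict that mirrors the XML nesting. That representation and the function name to_command_line are hypothetical; only the -name=value and -outer.inner=value forms come from this document.

```python
def to_command_line(params, prefix=""):
    """Flatten a nested dict of parameters into dotted-path arguments.

    A nested dict stands for a complex element, so
    {"corpus": {"path": "/data"}} becomes "-corpus.path=/data".
    """
    args = []
    for key, value in params.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the complex element, extending the dotted path.
            args.extend(to_command_line(value, path))
        else:
            args.append(f"-{path}={value}")
    return args


example = {
    "memory": "100M",
    "corpus": {"path": "/path/to/file_or_directory", "class": "trecweb"},
}
print(" ".join(to_command_line(example)))
```

The sketch handles only single-valued elements. Elements that can repeat in a parameter file, such as corpus or field, are outside its scope.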

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor, given as a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
corpus
a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are:
path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • trecweb -- TREC web format, e.g. terabyte track.
  • trectext -- TREC format, e.g. TREC-3 onward.
  • trecalt -- TREC format, e.g. TREC-3 onward, with only the TEXT field included.
  • doc -- Microsoft Word format (Windows platform only).
  • ppt -- Microsoft PowerPoint format (Windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
annotations
The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
metadata
The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.

Combining the first two of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
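
The suffix rule given above for the memory parameter can be sketched in the same spirit. This is an illustration of the documented rule (decimal digits followed by an optional K, M, or G, case insensitive), not Indri's own parsing code. The function name parse_memory is hypothetical.

```python
import re

# Decimal scaling factors as documented: K = 1000, M = 1000000, G = 1000000000.
_SCALE = {"": 1, "K": 1000, "M": 1000000, "G": 1000000000}


def parse_memory(value):
    """Convert a memory value such as "100M" into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KkMmGg]?)", value)
    if match is None:
        raise ValueError(f"not a valid memory value: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.upper()]


print(parse_memory("100M"))
```

Values with a fractional part, such as 1.5G, are rejected because the documented format allows only decimal digits and the optional suffix.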

metadata
a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:
  1. field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.

  2. forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.

  3. backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.

field
a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:

name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
stemmer
a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping. Indri's standard stopword list is available in the IndriBuildIndex parameter file format.
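
To show how the elements described in this section fit together under the top-level parameters element, here is a minimal sketch that assembles a parameter file with Python's standard xml.etree.ElementTree module. The function build_parameters and its argument names are hypothetical. Only the element names are taken from this section, and any parameters this section does not describe are left out.

```python
import xml.etree.ElementTree as ET


def build_parameters(memory, corpora, forward_fields=(), fields=(),
                     stemmer=None, stopwords=()):
    """Return IndriBuildIndex parameters as an XML string.

    corpora is a sequence of (path, class) pairs; fields is a sequence
    of (name, is_numeric) pairs.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    for path, file_class in corpora:
        corpus = ET.SubElement(root, "corpus")  # one element per corpus
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = file_class
    if forward_fields:
        metadata = ET.SubElement(root, "metadata")
        for name in forward_fields:
            ET.SubElement(metadata, "forward").text = name
    for name, is_numeric in fields:
        field = ET.SubElement(root, "field")  # one element per field
        ET.SubElement(field, "name").text = name
        ET.SubElement(field, "numeric").text = "true" if is_numeric else "false"
    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word
    return ET.tostring(root, encoding="unicode")


print(build_parameters("100M", [("/path/to/file_or_directory", "trecweb")],
                       fields=[("title", False)], stemmer="krovetz",
                       stopwords=["the", "of"]))
```

The result is a single line of XML with no whitespace between elements. Writing it to a file gives a parameter file containing the settings shown.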

3. BuildDocMgr

BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. If an index name is provided, BuildDocMgr also builds an inverted index at the same time.

Summary of required parameters:

  1. manager: required; name of the document manager (without extension)
  2. managerType: required; name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
  3. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  4. dataFiles: name of the file containing the list of datafile names (one datafile name per line; use full paths)

The following parameters are optional and are used when building an index:

  1. index: name of the index table-of-contents file, without any extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported
  3. memory: memory (in bytes) to pre-allocate (def = 96000000).
  4. position: store position information (def = 1).
  5. stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
  6. acronyms: name of file containing the acronym list.
  7. countStopWords: If true, count stopwords in document length.
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer.
    • arabic Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
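
The constraints listed above can be checked before BuildDocMgr is run. The sketch below assumes the parameters have already been read into a plain Python dict; how they are read from a parameter file is not covered here. The function name check_docmgr_params and the wording of the messages are hypothetical. The allowed values are the ones listed in this section.

```python
REQUIRED = ("manager", "managerType", "docFormat", "dataFiles")
MANAGER_TYPES = ("flat", "bdm", "elem")
DOC_FORMATS = ("trec", "web", "chinese", "chinesechar", "arabic")
STEMMERS = ("porter", "krovetz", "arabic")


def check_docmgr_params(params):
    """Return a list of problems found in a BuildDocMgr parameter dict."""
    problems = [f"missing required parameter: {name}"
                for name in REQUIRED if not params.get(name)]
    if params.get("managerType") and params["managerType"] not in MANAGER_TYPES:
        problems.append("managerType must be one of flat, bdm, elem")
    if params.get("docFormat") and params["docFormat"] not in DOC_FORMATS:
        problems.append("docFormat is not one of the listed formats")
    # Only the "key" index type is documented as supported here.
    if "indexType" in params and params["indexType"] != "key":
        problems.append('indexType must be "key"')
    stemmer = params.get("stemmer")
    if stemmer is not None and stemmer not in STEMMERS:
        problems.append("stemmer must be one of porter, krovetz, arabic")
    if stemmer == "arabic":
        # The Arabic stemmer needs its two additional parameters.
        for name in ("arabicStemDir", "arabicStemFunc"):
            if not params.get(name):
                problems.append(f"arabic stemmer requires {name}")
    return problems
```

An empty list means that no problem was found among the constraints checked. It does not guarantee that BuildDocMgr will accept the parameters.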

4. BuildPropIndex

This application builds an index for a collection of documents with properties associated with terms.

Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...

* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter
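
The usage line can be turned into an argument list with a short sketch such as the one below. The function name build_prop_index_argv is hypothetical. Treating the case where no data files are given in either place as an error is an assumption of this sketch, not documented behavior.

```python
def build_prop_index_argv(param_file, data_files=(), params=None):
    """Build the argument list for a BuildPropIndex run.

    data_files are passed on the command line; params is an optional
    dict of parameter-file settings, consulted only for dataFiles.
    """
    params = params or {}
    data_files = list(data_files)
    # Assumption: the data files must be named in at least one place.
    if not data_files and not params.get("dataFiles"):
        raise ValueError("no data files on the command line or in dataFiles")
    return ["BuildPropIndex", param_file, *data_files]
```

The returned list can be passed to subprocess.run from Python's standard library on a system where the BuildPropIndex executable is on the search path.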

The parameters are:

  1. index: name of the index to create (don't include extension)
  2. indexType: the type of index to create. Currently only "key" (KeyfileIncIndex) is supported
  3. memory: memory (in bytes) to pre-allocate (def = 96000000).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
    • "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
  8. stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer.
    • "arabic" Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.