Lemur Indexing Applications
Contents
- BuildIndex
- BuildDocMgr
- BuildPropIndex
- IndriBuildIndex
- PassageIndexer
- IncIndexer
- IncPassageIndexer
1. BuildIndex
This application builds an Inv(FP)Index, KeyfileIncIndex, or IndriIndex for a collection of documents.
To use it, follow the general steps of running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file without the extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of the index you want to build
- inv for Inv (.inv) or InvFP (.ifp)
- key for KeyfileIncIndex (.key)
- indri for IndriIndex (.ind)
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1), applicable only for inv indexes. Keyfile and Indri always store positions.
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list. Acronym lists are currently not supported by IndriIndex, which still indexes these acronyms in lowercase.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
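The indexType values and file extensions listed above can be summarized as a small lookup. This is a minimal Python sketch and not part of Lemur: the function name toc_filename is hypothetical, and the rule that position = 1 selects InvFP (.ifp) while position = 0 selects Inv (.inv) is an assumption inferred from the parameter descriptions.

```python
def toc_filename(index, index_type, position=1):
    """Return the table-of-contents filename implied by the parameters."""
    if index_type == "inv":
        # Assumption: storing positions yields InvFP (.ifp), otherwise Inv (.inv).
        ext = ".ifp" if position else ".inv"
    elif index_type == "key":
        ext = ".key"   # KeyfileIncIndex
    elif index_type == "indri":
        ext = ".ind"   # IndriIndex
    else:
        raise ValueError(f"unknown indexType: {index_type!r}")
    return index + ext
```

For example, toc_filename("/lemur/indexes/myindex", "key") gives the name that later applications would be pointed at.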
2. BuildDocMgr
BuildDocMgr builds a document manager. A DocumentManager is necessary for
later retrieval of the original documents in an index. It also builds an inverted index at the same time if an index name is provided.
Summary of required parameters:
- manager: required; name of the document manager (without extension)
- managerType: required; name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- dataFiles: name of file containing the list of datafile names
(one line per datafile name, use full path)
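The dataFiles list format described above (one datafile name per line, full paths) can be produced with a few lines of Python. This is only an illustrative sketch; the helper name format_data_files_list is hypothetical and not part of Lemur.

```python
import os


def format_data_files_list(data_files):
    """Return the text of a dataFiles list: one absolute path per line."""
    # Relative names are made absolute so the list works from any directory.
    lines = [os.path.abspath(name) for name in data_files]
    return "\n".join(lines) + "\n"
```

The returned text can then be written to the file named by the dataFiles parameter.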
The following parameters are optional, for building an index:
- index: name of the index table-of-contents file without any extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1).
- stopwords: name of file containing the stopword list.
Words in this file should be one per line.
If this parameter is not specified, all words
are indexed.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
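The interaction between stopwords and countStopWords described above can be sketched as follows. This is a hedged Python illustration, not Lemur's code: the function names are hypothetical, and exact matching of already-tokenized terms is an assumption.

```python
def load_stopwords(lines):
    """Parse stopword-file lines (one word per line) into a set."""
    return {line.strip() for line in lines if line.strip()}


def index_document(terms, stopwords=None, count_stop_words=False):
    """Return (indexed_terms, document_length) for one tokenized document."""
    stopwords = stopwords or set()  # no stopword list: all words are indexed
    indexed = [t for t in terms if t not in stopwords]
    # countStopWords decides whether removed stopwords still add to the length.
    length = len(terms) if count_stop_words else len(indexed)
    return indexed, length
```

With countStopWords false, the document length equals the number of indexed terms; with it true, the length is the original term count.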
3. BuildPropIndex
This application builds an InvFPIndex for a collection of documents with
properties associated with terms.
Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...
* Data files can be specified on the command line OR in a metafile named by
the dataFiles parameter.
The parameters are:
- index: name of the index to create (don't include extension)
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
- memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
- "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
- stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer.
- "arabic" Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
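The usage line above allows data files to come from the command line or from the dataFiles metafile. A minimal Python sketch of that choice follows; the function name is hypothetical, and giving command-line files precedence when both are present is an assumption, since the text only says "OR".

```python
def resolve_data_files(cli_files, metafile_lines=None):
    """Pick datafiles from the command line, else from the dataFiles metafile."""
    if cli_files:
        # Assumption: files named on the command line take precedence.
        return list(cli_files)
    if metafile_lines is not None:
        # One datafile name per line; blank lines are ignored.
        return [line.strip() for line in metafile_lines if line.strip()]
    raise ValueError("no data files given on the command line or in dataFiles")
```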
4. IndriBuildIndex
This application builds an Indri Repository for a collection of documents. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or
a file. The parameter file uses an XML format. The command line uses dotted
path notation. The top-level element in the parameter file is named parameters.
Repository construction parameters
- memory
- an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid suffixes are (case insensitive) K = 1000, M = 1000000, G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
- corpus
- a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
- path
- The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
- class
- The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, e.g. terabyte track.
- trectext -- TREC format, e.g. TREC-3 onward.
- doc -- Microsoft Word format (Windows platform only).
- ppt -- Microsoft PowerPoint format (Windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
- annotations
- The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
More information on using offset annotations can be found here.
- metadata
- The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.
Combining the first two of these elements, the parameter file would contain:
<corpus>
<path>/path/to/file_or_directory</path>
<class>trecweb</class>
</corpus>
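The correspondence between the XML parameter file and the dotted command-line notation can be illustrated by flattening a parameters document into flags. This is a minimal Python sketch, not part of Indri; the function name to_command_line is hypothetical, and it assumes, as in the examples above, that the top-level parameters element does not appear in the dotted path.

```python
import xml.etree.ElementTree as ET


def to_command_line(xml_text):
    """Flatten a <parameters> document into -dotted.path=value arguments."""
    root = ET.fromstring(xml_text)
    args = []

    def walk(element, prefix):
        children = list(element)
        if not children:
            # Leaf element: emit one flag with its text as the value.
            args.append(f"-{prefix}={(element.text or '').strip()}")
            return
        for child in children:
            walk(child, f"{prefix}.{child.tag}" if prefix else child.tag)

    walk(root, "")
    return args
```

Applied to a parameters element containing <memory>100M</memory> and the corpus element shown above, this yields -memory=100M, -corpus.path=/path/to/file_or_directory, and -corpus.class=trecweb.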
- metadata
- a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:
- field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
- forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line.
- backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line.
- field
- a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:
- name
- the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
- numeric
- the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
- stemmer
- a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
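The scaling rule for the memory parameter (decimal digits plus an optional case-insensitive K, M, or G suffix) can be written out as a small parser. This is a minimal Python sketch of the rule as described here, not Indri's own implementation; the name parse_memory is hypothetical.

```python
import re

# Decimal scaling factors as listed for the memory parameter.
_SCALE = {"": 1, "k": 1000, "m": 1000000, "g": 1000000000}


def parse_memory(value):
    """Convert a memory setting such as '100M' into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([kKmMgG]?)", value.strip())
    if match is None:
        raise ValueError(f"invalid memory value: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.lower()]
```

Values with a decimal point or an unknown suffix are rejected, matching the requirement that the value contain only decimal digits and the optional suffix.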
5. PassageIndexer
This application builds an FP passage index for a collection of documents.
Documents are segmented into passages of size passageSize with an
overlap of passageSize/2 terms per passage.
To use it, follow the general steps of running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
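The passage segmentation described for this application (passages of passageSize terms, each overlapping the next by passageSize/2 terms) can be sketched in Python as follows. This is an illustration under stated assumptions, not Lemur's code: the function name is hypothetical, integer division is assumed for odd sizes, and the handling of a short final passage is a guess.

```python
def segment_passages(terms, passage_size):
    """Split a term list into passages overlapping by passage_size // 2 terms."""
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    if not terms:
        return []
    overlap = passage_size // 2
    step = passage_size - overlap  # how far each new passage advances
    passages = []
    start = 0
    while True:
        passages.append(terms[start:start + passage_size])
        if start + passage_size >= len(terms):
            break  # this passage reached the end of the document
        start += step
    return passages
```

With passageSize = 4, a ten-term document yields four passages, each sharing two terms with its neighbor.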
6. IncIndexer
This application builds an FP index for a collection of documents. If the
index already exists, new documents are added to that index; otherwise a
new index is created.
To use it, follow the general steps of running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
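The add-or-create behavior described for this application can be pictured as a check on the index's table-of-contents file. This is a hedged Python sketch: the function name is hypothetical, and detecting an existing index by the presence of the .ifp file is an assumption rather than documented behavior.

```python
import os


def choose_index_action(index, exists=os.path.exists):
    """Return ("add" or "create", toc_path) for the given index name."""
    toc = index + ".ifp"  # the index parameter is given without the extension
    # Assumption: an existing .ifp table of contents means the index exists.
    action = "add" if exists(toc) else "create"
    return action, toc
```

The exists argument is injectable so the decision can be exercised without touching the file system.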
7. IncPassageIndexer
This application builds an FP passage index for a collection of documents.
If the index already exists, new documents are added to that index;
otherwise a new index is created. Documents are segmented into passages of
size passageSize with an overlap of passageSize/2 terms
per passage.
To use it, follow the general steps of running a Lemur application.
The parameters are:
- index: name of the index table-of-contents file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer.
- arabic Arabic stemmer, requires additional parameters:
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
The Lemur Project
Last modified: Wednesday, 01-Nov-2006 14:39:02 EST