Lemur Indexing Applications
Contents
- BuildIndex
- BuildDocMgr
- BuildPropIndex
- IndriBuildIndex
- PassageIndexer
- IncIndexer
- IncPassageIndexer
1. BuildIndex
This application builds an Inv(FP)Index, KeyfileIncIndex, or IndriIndex for a collection of documents.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the
extension. Use a full path here so that the index can be used later from
other directories, e.g. /lemur/indexes/myindex
- indexType: the type of the index you want to build
- inv for Inv (.inv) or InvFP (.ifp)
- key for KeyfileIncIndex (.key)
- indri for IndriIndex (.ind)
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1), applicable only for inv indexes. Keyfile and Indri always store positions.
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list, currently not supported by IndriIndex. These acronyms will still be indexed in lowercase by IndriIndex.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
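The constraints in the parameter list above can be checked before running BuildIndex. The following is a minimal sketch, not part of Lemur: the function name check_build_index_params and the dict used to hold the parameters are hypothetical, and only the values documented above are checked.

```python
# Allowed values, copied from the BuildIndex parameter list.
INDEX_TYPES = {"inv", "key", "indri"}
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz", "arabic"}
ARABIC_STEM_FUNCS = {
    "arabic_stop", "arabic_norm2", "arabic_norm2_stop",
    "arabic_light10", "arabic_light10_stop",
}


def check_build_index_params(params):
    """Return a list of problems found in a parameter dict.

    `params` is a hypothetical name -> value mapping, not a Lemur type.
    Parameters that are absent are not reported; only documented
    value sets and stemmer dependencies are checked.
    """
    problems = []
    index_type = params.get("indexType")
    if index_type is not None and index_type not in INDEX_TYPES:
        problems.append("unknown indexType: " + index_type)
    doc_format = params.get("docFormat")
    if doc_format is not None and doc_format not in DOC_FORMATS:
        problems.append("unknown docFormat: " + doc_format)
    stemmer = params.get("stemmer")
    if stemmer is not None and stemmer not in STEMMERS:
        problems.append("unknown stemmer: " + stemmer)
    # The Krovetz and Arabic stemmers need additional parameters.
    if stemmer == "krovetz" and "KstemmerDir" not in params:
        problems.append("krovetz stemmer requires KstemmerDir")
    if stemmer == "arabic":
        if "arabicStemDir" not in params:
            problems.append("arabic stemmer requires arabicStemDir")
        if params.get("arabicStemFunc") not in ARABIC_STEM_FUNCS:
            problems.append("arabic stemmer requires a valid arabicStemFunc")
    return problems
```

An empty list means that none of the documented constraints were violated; it does not guarantee that BuildIndex will accept the file.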
2. BuildDocMgr
BuildDocMgr builds a document manager. A DocumentManager is necessary for
later retrieval of the original documents in an index. An inverted index is built at the same time if an index name is provided.
Summary of required parameters:
- manager: required; name of the document manager (without extension)
- managerType: required; the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- dataFiles: name of file containing the list of datafile names
(one datafile name per line, use full paths)
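The dataFiles list is a plain text file with one full path per line. A minimal sketch of producing such a file follows; the helper name write_data_file_list and the file names in the usage note are hypothetical, and UTF-8 is an assumed encoding for the list file.

```python
import os


def write_data_file_list(list_path, data_files):
    """Write one absolute datafile path per line to `list_path`."""
    with open(list_path, "w", encoding="utf-8") as out:
        for name in data_files:
            # Relative names are resolved against the current directory
            # so that the list can be used from other directories later.
            out.write(os.path.abspath(name) + "\n")
    return list_path
```

For example, write_data_file_list("files.lst", ["a.trec", "b.trec"]) writes two lines, each holding the absolute path of one datafile.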
The following parameters are optional and are used when building an index:
- index: name of the index table-of-content file without any extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). The default is inv.
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1).
- stopwords: name of file containing the stopword list.
Words in this file should be one per line.
If this parameter is not specified, all words
are indexed.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
3. BuildPropIndex
This application builds an InvFPIndex for a collection of documents with
properties associated with terms.
Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...
* data files can be specified on the command line OR in a metafile specified as
the dataFiles parameter
The parameters are:
- index: name of the index to create (don't include extension)
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). The default is inv.
- memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
- "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
- stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- "arabic" arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
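Data files for BuildPropIndex may come from the command line or from the file named by dataFiles. The sketch below shows one way a wrapper script might resolve the two sources. It is an illustration only: the function resolve_data_files is hypothetical, giving the command line precedence is an assumption, and the one-name-per-line layout of the list file is assumed to match the BuildDocMgr description.

```python
def resolve_data_files(command_line_files, params):
    """Return the list of datafiles to index.

    Files named on the command line are used when present (assumed
    precedence); otherwise the list file named by the dataFiles
    parameter is read, one name per line, skipping blank lines.
    """
    if command_line_files:
        return list(command_line_files)
    list_file = params.get("dataFiles")
    if list_file is None:
        return []
    with open(list_file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```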
4. IndriBuildIndex
This application builds an Indri Repository for a collection of documents. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or from
a file. The parameter file uses an XML format. The command line uses dotted
path notation. The top level element in the parameters file is named parameters.
Repository construction parameters
- memory
- an integer value specifying the number of bytes to use for the
indexing process. The value can include a scaling factor by adding a
suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G =
1000000000. So 100M would be equivalent to 100000000. The value should
contain only decimal digits and the optional suffix. Specified as
<memory>100M</memory> in the parameter file and as
-memory=100M on the command line.
- corpus
- a complex element containing parameters related to a corpus. This
element can be specified multiple times. The parameters are
- path
- The pathname of the file or directory containing documents to
index. Specified as
<corpus><path>/path/to/file_or_directory</path></corpus>
in the parameter file and as
-corpus.path=/path/to/file_or_directory on the command
line.
- class
- The FileClassEnvironment of the file or directory containing
documents to index. Specified as
<corpus><class>trecweb</class></corpus> in the
parameter file and as -corpus.class=trecweb on the command
line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, eg terabyte track.
- trectext -- TREC format, eg TREC-3 onward.
- doc -- Microsoft Word format (windows platform only).
- ppt -- Microsoft Powerpoint format (windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
<path>/path/to/file_or_directory</path>
<class>trecweb</class>
</corpus>
- metadata
- a complex element containing one or more entries
specifying the metadata fields to index, e.g. title, headline.
There are three options:
- field -- Make the named field available for retrieval as
metadata. Specified as
<metadata><field>fieldname</field></metadata>
in the parameter file and as -metadata.field=fieldname on the
command line.
- forward -- Make the named field available for retrieval as
metadata and build a lookup table to make retrieving the value more
efficient. Specified as
<metadata><forward>fieldname</forward></metadata>
in the parameter file and as -metadata.forward=fieldname on the
command line.
- backward -- Make the named field available for retrieval
as metadata and build a lookup table for inverse lookup of documents
based on the value of the field. Specified as
<metadata><backward>fieldname</backward></metadata>
in the parameter file and as -metadata.backward=fieldname on
the command line.
- field
- a complex element specifying the fields to index as data, e.g.
TITLE. This parameter can appear multiple times in a parameter file.
If provided on the command line, only the first field specified will
be indexed. The subelements are:
- name
- the field name, specified as
<field><name>fieldname</name></field> in the
parameter file and as -field.name=fieldname on the command
line.
- numeric
- the symbol true if the field contains
numeric data, otherwise the symbol false, specified as
<field><numeric>true</numeric></field> in the
parameter file and as -field.numeric=true on the command
line. This is an optional parameter, defaulting to false. Note that 0
can be used for false and 1 can be used for true.
- stemmer
- a complex element specifying the stemming algorithm to use in the
subelement name. Valid options are Porter or Krovetz (case
insensitive). Specified as
<stemmer><name>stemmername</name></stemmer> in the parameter file and
as -stemmer.name=stemmername on the command line. This is an
optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word,
specifying the stopword list to use. Specified as
<stopper><word>stopword</word></stopper> in the parameter file and
as -stopper.word=stopword on the command line. This is an
optional parameter with the default of no stopping.
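The parameter file layout and the memory suffix rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not Indri code: parse_memory and build_parameters are hypothetical helper names, the paths and field names in the usage note are placeholders, and only parameters described in this section are emitted, so a real parameter file may need others.

```python
import xml.etree.ElementTree as ET

# Scaling factors from the memory parameter description (case insensitive).
_SCALE = {"K": 1000, "M": 1000000, "G": 1000000000}


def parse_memory(value):
    """Convert a memory value such as '100M' to a number of bytes."""
    text = value.strip().upper()
    factor = 1
    if text and text[-1] in _SCALE:
        factor = _SCALE[text[-1]]
        text = text[:-1]
    # Only decimal digits are allowed besides the optional suffix.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("invalid memory value: " + value)
    return int(text) * factor


def build_parameters(memory, corpora, fields=(), stemmer=None, stopwords=()):
    """Return parameter file text for the elements described above.

    `corpora` is a sequence of (path, class) pairs; each pair becomes
    one <corpus> element, since corpus may be specified multiple times.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    for path, file_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = file_class
    for name in fields:
        field = ET.SubElement(root, "field")
        ET.SubElement(field, "name").text = name
    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word
    return ET.tostring(root, encoding="unicode")
```

For example, parse_memory("100M") returns 100000000, and build_parameters("100M", [("/path/to/file_or_directory", "trecweb")], stemmer="krovetz") returns a parameters element containing one memory, one corpus, and one stemmer element.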
5. PassageIndexer
This application builds an FP passage index for a collection of documents.
Documents are segmented into passages of size passageSize with an
overlap of passageSize/2 terms per passage.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the
.ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
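The segmentation rule, passages of passageSize terms with an overlap of passageSize/2, can be illustrated as follows. This is a minimal sketch, not the PassageIndexer implementation: the function split_into_passages is hypothetical, integer division is assumed for the overlap, and letting the final passage be shorter than passageSize is an assumption about boundary handling.

```python
def split_into_passages(terms, passage_size):
    """Split a term list into passages overlapping by passage_size // 2."""
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    step = passage_size // 2  # each new passage starts half a passage later
    passages = []
    start = 0
    while start < len(terms):
        passages.append(terms[start:start + passage_size])
        # Stop once a passage has reached the end of the document.
        if start + passage_size >= len(terms):
            break
        start += step
    return passages
```

With passage_size set to 4, the six terms a b c d e f yield two passages, a b c d and c d e f, which share the two terms c d.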
6. IncIndexer
This application builds an FP index for a collection of documents. If the
index already exists, new documents are added to that index, otherwise a
new index is created.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the
.ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
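Because the same parameter file either creates or extends an index, a wrapper script may want to know in advance which will happen. The sketch below is an assumption-laden illustration: the helper index_build_mode is hypothetical, and testing for the table-of-content file (the index name plus the .ifp extension) is assumed to be a reasonable proxy for "the index already exists"; it is not necessarily the check that IncIndexer performs.

```python
import os


def index_build_mode(index_name):
    """Return 'append' if the table-of-content file exists, else 'create'.

    `index_name` is the value of the index parameter, i.e. the
    table-of-content file name without the .ifp extension.
    """
    toc_file = index_name + ".ifp"
    return "append" if os.path.exists(toc_file) else "create"
```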
7. IncPassageIndexer
This application builds an FP passage index for a collection of documents.
If the index already exists, new documents are added to that index,
otherwise a new index is created. Documents are segmented into passages of
size passageSize with an overlap of passageSize/2 terms
per passage.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the
.ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : stopword removal only
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
The Lemur Project
Last modified: Monday, 13-Jun-2005 13:09:54 EDT