Specifically, we can collect term frequency, term position, and document length statistics, because those are the ones most commonly needed for information retrieval. For example, from the index you can find out how many times a certain term occurred in the whole collection of documents, or how many times it occurred in one specific document. Retrieval algorithms that decide which documents to return for a given query use this information from the index in their scoring calculations.
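To make these statistics concrete, here is a minimal sketch of a positional inverted index in Python. It is not Lemur's implementation or API; the names `build_index`, `doc_term_count`, and `collection_term_count` are hypothetical, and tokenization is assumed to be simple lowercasing and whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index from a dict of doc_id -> text.

    Returns (postings, doc_lengths), where postings maps
    term -> doc_id -> list of token positions.
    """
    postings = defaultdict(lambda: defaultdict(list))
    doc_lengths = {}
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for a real parser.
        tokens = text.lower().split()
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            postings[term][doc_id].append(position)
    return postings, doc_lengths


def doc_term_count(postings, term, doc_id):
    """How many times `term` occurs in one specific document."""
    return len(postings.get(term, {}).get(doc_id, []))


def collection_term_count(postings, term):
    """How many times `term` occurs in the whole collection."""
    return sum(len(positions) for positions in postings.get(term, {}).values())


docs = {"d1": "the cat sat on the mat", "d2": "the dog"}
postings, doc_lengths = build_index(docs)
```

With this toy collection, `collection_term_count(postings, "the")` counts occurrences across both documents, while `doc_term_count(postings, "the", "d1")` counts only those in `d1`. A scoring function would typically combine such counts with `doc_lengths`.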
Lemur is primarily a research system, so the included parsers were designed to facilitate indexing many documents that are in the same file. For the index to know where the document boundaries are within files, each document must have begin-document and end-document tags. These tags are similar to HTML or XML tags and are the format used for NIST's Text REtrieval Conference (TREC) documents.
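As an illustration of tag-delimited document boundaries, the following Python sketch wraps plain text in DOC and DOCNO tags and splits a multi-document string back into individual documents. It is only a sketch under stated assumptions, not how Lemur's parsers work internally: the helper names `to_trec` and `split_trec` are hypothetical, the TEXT wrapper is assumed as the field holding the body, and the regular expressions assume well-formed, non-nested tags.

```python
import re

# Assumption: tags are uppercase, well formed, and never nested.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def to_trec(doc_id, text):
    """Wrap one document's text in begin/end document tags."""
    return f"<DOC>\n<DOCNO> {doc_id} </DOCNO>\n<TEXT>\n{text}\n</TEXT>\n</DOC>\n"


def split_trec(data):
    """Split a string holding many tagged documents into (doc_id, text) pairs."""
    docs = []
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without an identifier
        # Drop the DOCNO element and the TEXT tags, keep the remaining text.
        text = DOCNO_RE.sub("", body, count=1)
        text = re.sub(r"</?TEXT>", "", text).strip()
        docs.append((docno.group(1), text))
    return docs


data = to_trec("A1", "first doc") + to_trec("A2", "second doc")
pairs = split_trec(data)
```

Writing many documents into one file this way keeps the number of files small while still letting a parser recover each document and its identifier.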
The 2 most frequently used parsers are the TrecParser and WebParser.
TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, as well as text in HTML comments. Document boundaries are specified with NIST style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
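The kind of cleanup described for WebParser (dropping HTML tags, SCRIPT contents, and HTML comments) can be approximated with Python's standard `html.parser` module. This is a rough sketch of the idea only, not WebParser's actual behavior; `TextExtractor` and `strip_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags, script bodies, and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_script = 0  # depth counter for open <script> elements

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1

    def handle_endtag(self, tag):
        if tag == "script" and self.in_script:
            self.in_script -= 1

    def handle_data(self, data):
        # Comments never reach handle_data; script text is filtered here.
        if not self.in_script:
            self.parts.append(data)


def strip_html(html):
    """Return the text of `html` with whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample = "<p>Hello <b>world</b></p><script>var x = 1;</script><!-- hidden -->"
cleaned = strip_html(sample)
```

Text fragments are joined with spaces, so a word split by inline markup (for example `wor<b>ld</b>`) would come out as two tokens; a real parser may handle that case differently.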
If your documents are not from NIST, these are the methods you can take to parse and index your documents:
| Index Name | Extension | File Limit | Stores positions | Loads fast | Disk space usage | Application | Add* documents to Index |
| InvIndex | .inv | no | no | no | less | BuildIndex | no |
| InvFPIndex | .ifp | no | yes | no | more | BuildIndex | yes, use IncIndexer |
| KeyfileIncIndex | .key | no | yes | yes | even more | BuildIndex | yes, use BuildIndex |
| IndriIndex | | no | yes | yes | most (automatically stores compressed version of original documents) | BuildIndex or IndriBuildIndex | yes |
The Lemur Project
Last modified: Monday, 13-Jun-2005 12:52:55 EDT