Note: These tutorials are out of date, please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Indexing: Creating a Simple Index
The three most common document formats used by Lemur are standard TREC text format, TREC web format, and HTML.
TREC Text Documents
The most common document type is a plain-text document in TREC format. Each document in TREC format must be enclosed in a <DOC> tagset. At a minimum, it must also include a <DOCNO> tagset enclosing the document ID and a <TEXT> tagset enclosing the text to be indexed. As an example:
<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
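The structure above can also be generated programmatically. The following is a minimal Python sketch, not part of Lemur itself. The function names `to_trec_text` and `write_trec_file` are hypothetical, and the sketch assumes the text does not itself contain TREC tags.

```python
def to_trec_text(doc_id: str, text: str) -> str:
    """Wrap one document in the <DOC>, <DOCNO> and <TEXT> tagsets."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def write_trec_file(path, docs):
    """Write (doc_id, text) pairs to a single file, one <DOC> block each."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in docs:
            out.write(to_trec_text(doc_id, text))


if __name__ == "__main__":
    print(to_trec_text("document_id", "Index this document text."), end="")
```

Because every document carries its own <DOC> wrapper, several documents can be written back to back into one file.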
TREC Web format
The "trecweb" format is similar to TREC text, with the caveat that the main body contains HTML-formatted text. It may also contain an optional <DOCHDR> tagset at the beginning that holds the header information from the HTTP request. This information may include the original URL, the date the page was gathered, the server it was gathered from, and other details.
Also note that if you are using the IndriBuildIndex application with "trecweb" text, the URL of the page, taken from the header information, will automatically be stored in a field named "url" in the document.
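As an illustration, the Python sketch below builds a "trecweb" record and reads the URL back out of its <DOCHDR> block. It is not Indri's parser. The helper names `to_trec_web` and `url_from_dochdr` are hypothetical, and the layout inside <DOCHDR> (URL on the first line, then one HTTP header per line) is an assumption based on common TREC web collections rather than something this tutorial specifies.

```python
import re


def to_trec_web(doc_id, url, html, headers=None):
    """Build one trecweb-style record.

    Assumed layout: the URL is the first line inside <DOCHDR>, followed by
    one "Name: value" HTTP header per line.
    """
    lines = ["<DOC>", f"<DOCNO>{doc_id}</DOCNO>", "<DOCHDR>", url]
    lines += [f"{name}: {value}" for name, value in (headers or {}).items()]
    lines += ["</DOCHDR>", html.strip(), "</DOC>"]
    return "\n".join(lines) + "\n"


def url_from_dochdr(record):
    """Return the first token of the first non-empty <DOCHDR> line, or None."""
    match = re.search(r"<DOCHDR>(.*?)</DOCHDR>", record, re.DOTALL)
    if match is None:
        return None
    for line in match.group(1).splitlines():
        if line.strip():
            return line.split()[0]
    return None


if __name__ == "__main__":
    record = to_trec_web(
        "web_doc_1",
        "http://example.com/",
        "<html><body>Index this page.</body></html>",
        {"Content-Type": "text/html"},
    )
    print(record, end="")
    print(url_from_dochdr(record))
```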
HTML Format
HTML documents are standard, well-formed HTML pages. They need no pre-processing before indexing, and each file is assumed to contain exactly one document.
Other document types
Currently, the Indri index builder knows of the following other document types:
| Type | Description |
| xml | XML-formatted data, one document per file (same as html, but without link processing) |
| mbox | Unix mailbox files |
| doc | Microsoft Word documents (Windows only, requires Microsoft Office) |
| ppt | Microsoft PowerPoint documents (Windows only, requires Microsoft Office) |
| pdf | Adobe PDF documents |
| txt | Text documents |
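These type names are what an index build is told to expect for each corpus. The Python sketch below generates an IndriBuildIndex-style parameter file that pairs corpus paths with document classes. It is offered under assumptions: the function name `build_parameters` and the example paths are hypothetical, and the class names "trectext" and "pdf" and the <index>, <corpus>, <path> and <class> elements should be checked against the IndriBuildIndex documentation for your version.

```python
import xml.etree.ElementTree as ET

# Document classes discussed in this tutorial. "trectext" and "pdf" are
# assumed spellings; verify them against your Indri release.
KNOWN_CLASSES = {
    "trectext", "trecweb", "html", "xml", "mbox", "doc", "ppt", "pdf", "txt",
}


def build_parameters(index_path, corpora):
    """Return parameter-file XML for (path, document_class) pairs."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    for path, doc_class in corpora:
        if doc_class not in KNOWN_CLASSES:
            raise ValueError(f"unknown document class: {doc_class}")
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(build_parameters("/tmp/my_index", [("/data/mail", "mbox"),
                                             ("/data/pages", "html")]))
```

Rejecting unknown class names up front is a design choice in this sketch; it catches typos before a long indexing run starts.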
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am




