Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Offset Annotations: Overview
In-line Annotations
Most people are familiar with in-line annotations, also called in-line field definitions. These appear in tagged text such as HTML or XML. For example, consider the following HTML snippet:

<h1>The Lemur Toolkit</h1>
<h2>for Language Modeling and Information Retrieval</h2>
<p>Language modeling has recently emerged as an attractive new framework for text information retrieval, leveraging work on language modeling from other areas such as speech recognition and statistical natural language processing.</p>

When this document is parsed and readied for indexing, the text enclosed by the HTML markup tags is indexed, but we can also index the markup tags themselves. For instance, in HTML, the <h1> tag is typically used for title text. If we tell the indexer to mark where each <h1> field occurs, then we can run queries restricted to that field.
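
To make the idea of an in-line field concrete, here is a minimal sketch in Python that collects the text found inside each tag of a snippet like the one above. It uses only the standard library's html.parser module and is not how the Lemur or Indri parsers are implemented; the names FieldExtractor and extract_fields are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text found inside each tag, keyed by tag name.

    Illustrative only: a real indexer would record term positions,
    not just the raw text of each field.
    """

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.fields = {}     # tag name -> list of text strings

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace and any text that sits outside all tags.
        if text and self.open_tags:
            self.fields.setdefault(self.open_tags[-1], []).append(text)


def extract_fields(html):
    """Return a dict mapping each tag name to the text it encloses."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


snippet = (
    "<h1>The Lemur Toolkit</h1> "
    "<h2>for Language Modeling and Information Retrieval</h2> "
    "<p>Language modeling has recently emerged.</p>"
)
fields = extract_fields(snippet)
print(fields["h1"])
```

Once the field extents are known, a query restricted to the h1 field only needs to look at the text collected under that key.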
Offset Annotations
In-line annotations are useful, but what if you have a document with no markup in it and wish to run it through, say, a named entity tagger or a part-of-speech tagger? In this case, you can certainly have the tagger edit and mark up the original document, and then index that. Alternatively, you can use an "offset annotation" file to tell the indexer where the fields would exist if they were in the source text.
An offset annotation file contains annotations to be used with a document, but the annotations are not in-lined with the document itself. That is to say, the original document does not have to be modified. Instead, you create an "offset annotation" file that tells the indexer which tag and attribute annotations to add to the document while indexing.
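
The sketch below illustrates the general idea of keeping annotations apart from the text: a toy "tagger" looks up known phrases in an unmodified document and records each match as a field name plus a start offset and a length. It does not produce the actual offset annotation file format, and the names Annotation, annotate, and lexicon are hypothetical. Counting offsets in bytes of the UTF-8 encoded text is also an assumption made for this example.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    name: str    # field name, e.g. "product"
    start: int   # offset of the span in the UTF-8 encoded document (assumed bytes)
    length: int  # span length, in the same units as start


def annotate(text: str, lexicon: Dict[str, str]) -> List[Annotation]:
    """Find every lexicon phrase in text and return its offsets.

    The document itself is never modified; only offsets are recorded.
    """
    data = text.encode("utf-8")
    annotations = []
    for phrase, name in lexicon.items():
        needle = phrase.encode("utf-8")
        pos = data.find(needle)
        while pos != -1:
            annotations.append(Annotation(name, pos, len(needle)))
            pos = data.find(needle, pos + len(needle))
    # Report annotations in document order.
    return sorted(annotations, key=lambda a: a.start)


document = "The Lemur Toolkit supports language modeling."
lexicon = {"Lemur Toolkit": "product", "language modeling": "topic"}
annotations = annotate(document, lexicon)

raw = document.encode("utf-8")
for a in annotations:
    # Recover the annotated span from the untouched document.
    span = raw[a.start:a.start + a.length].decode("utf-8")
    print(a.name, a.start, a.length, span)
```

Because each record carries only a name and a position, the same source document can be paired with several independent annotation sets (for example, one from a named entity tagger and one from a part-of-speech tagger) without ever being rewritten.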
Note: Currently, storing and indexing annotations and fields only works with Indri-style indexes. For a more detailed comparison of Indri indexes versus KeyFile indexes, please see this page.
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am



