Offset annotation support

IndriBuildIndex accepts the parameter annotations, which specifies a file containing offset annotations for the documents in a collection. It is specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file. There may be only one annotations file per corpus entry.
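
As a rough illustration, the Python sketch below writes out a parameter file with one corpus entry that carries an annotations element. Only the <corpus><annotations> nesting comes from the text above; the parameters root element, the index and path elements, the function name build_parameters, and all paths are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_dir, corpus_path, annotations_file):
    """Return an XML string for a parameter file with one corpus entry."""
    root = ET.Element("parameters")                     # assumed root element
    ET.SubElement(root, "index").text = index_dir       # assumed element
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path    # assumed element
    # Only one annotations file may be given per corpus entry.
    ET.SubElement(corpus, "annotations").text = annotations_file
    return ET.tostring(root, encoding="unicode")


# Placeholder paths; substitute real locations before use.
params_xml = build_parameters(
    "/path/to/index", "/path/to/collection", "/path/to/file"
)
print(params_xml)
```

A collection with several corpus entries would repeat the corpus element, each with at most one annotations child.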

Offset Annotation File Format

The offset annotation file is tab-delimited, with eight data columns followed by a debug column. From left to right, the columns are:

docno
external document id corresponding to the document in which the annotation occurs.
type
TAG or ATTRIBUTE
id
an id number for the annotation; each line should have a unique id >= 1.
name
for TAG, the name or type of the annotation; for ATTRIBUTE, the attribute name, or key.
start
start and length define the annotation's extent. The values should be in token position offsets.
length
the number of tokens the annotation spans; meaningless for an ATTRIBUTE.
value
for TAG, an INT64; for ATTRIBUTE, a string that is the attribute's value.
parentid
for TAG, refers to the id number of another TAG to be considered the parent of this one; this is how hierarchical annotations can be expressed. A TAG that has no parent has parentid = 0. For ATTRIBUTE, refers to the id number of the TAG to which it belongs and from which it inherits its start and length. NOTE: the file must be sorted so that any line that uses a given id in this column comes *after* the line that uses that id in the id column.
debug
ignored by the OffsetAnnotator; can contain any information that is beneficial to a human reading the file
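
The column rules above can be checked mechanically before indexing. The following Python sketch is not part of Indri: the names Annotation, parse_line and check_order, and the sample rows, are hypothetical. It parses each line into the columns listed above and reports any line whose parentid appears before the line that defines that id. It treats ids as unique across the whole file, following the wording of the id column literally.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    docno: str
    type: str
    id: int
    name: str
    start: int
    length: int
    value: str
    parentid: int
    debug: str


def parse_line(line: str) -> Annotation:
    """Split one tab-delimited line into its columns and validate them."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    docno, kind, id_, name, start, length, value, parentid = fields[:8]
    debug = "\t".join(fields[8:])  # free-form text, kept only for humans
    if kind not in ("TAG", "ATTRIBUTE"):
        raise ValueError(f"type must be TAG or ATTRIBUTE, got {kind!r}")
    if int(id_) < 1:
        raise ValueError("id must be >= 1")
    if kind == "TAG" and not (-2**63 <= int(value) < 2**63):
        raise ValueError("TAG value must fit in an INT64")
    return Annotation(docno, kind, int(id_), name, int(start),
                      int(length), value, int(parentid), debug)


def check_order(annotations: List[Annotation]) -> List[str]:
    """Report duplicate ids and parentids used before they are defined."""
    seen = set()
    problems = []
    for ann in annotations:
        if ann.id in seen:
            problems.append(f"duplicate id {ann.id}")
        if ann.parentid != 0 and ann.parentid not in seen:
            problems.append(f"id {ann.id}: parent {ann.parentid} not yet defined")
        seen.add(ann.id)
    return problems


# Made-up sample: a TAG, a nested TAG, and an ATTRIBUTE on the nested TAG.
SAMPLE = (
    "doc1\tTAG\t1\tsentence\t0\t12\t0\t0\touter span\n"
    "doc1\tTAG\t2\tperson\t3\t2\t0\t1\tnested in 1\n"
    "doc1\tATTRIBUTE\t3\tgender\t0\t0\tfemale\t2\tbelongs to 2\n"
)
annotations = [parse_line(line) for line in SAMPLE.splitlines()]
problems = check_order(annotations)
print(problems)
```

Reversing the sample lines makes check_order report the ATTRIBUTE and the nested TAG, since both would then precede the lines that define their parents.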


The Lemur Project
  Last modified: Thursday, 06-Oct-2005 09:14:45 EDT