Offset annotation support
IndriBuildIndex accepts the annotations parameter, which names
a file containing offset annotations for the documents in a
collection. It is specified as
<corpus><annotations>/path/to/file</annotations></corpus>
in the parameter file. There may be only one annotations file per
corpus entry.
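As a minimal sketch of where the annotations element sits, the following Python snippet builds a parameter file with a single corpus entry. Only the corpus and annotations elements come from the description above. The index, path, and class elements follow the usual IndriBuildIndex parameter layout and should be checked against your version. The paths, the trectext class value, and the helper name build_parameters are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_path, corpus_path, corpus_class, annotations_path):
    """Return a parameter tree with one corpus entry (hypothetical helper)."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    # Only one annotations file may be given per corpus entry.
    ET.SubElement(corpus, "annotations").text = annotations_path
    return root


# Placeholder paths; replace them with real locations.
params = build_parameters(
    "/path/to/index", "/path/to/collection", "trectext", "/path/to/file"
)
xml_text = ET.tostring(params, encoding="unicode")
print(xml_text)
```

A collection that needs several annotation files would need one corpus entry per file, since each entry accepts only one annotations element.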
Offset Annotation File Format
The offset annotation file is tab-delimited. Each line has 8 data
columns followed by a debug column. From left to right, the columns are:
- docno
- external document id corresponding to the document in
which the annotation occurs.
- type
- TAG or ATTRIBUTE
- id
- an id number for the annotation; each line should have a
unique id >= 1.
- name
- for TAG, the name or type of the annotation;
for ATTRIBUTE, the attribute name (its key)
- start
- start and length together define the annotation's
extent. The values should be token position offsets.
- length
- the number of tokens the annotation spans. Both start and
length are meaningless for an ATTRIBUTE.
- value
- for TAG, an INT64
for ATTRIBUTE, a string that is the attribute's value
- parentid
- for TAG, the id number of another TAG to be considered
the parent of this one; this is how hierarchical
annotations are expressed.
A TAG that has no parent has parentid = 0.
For ATTRIBUTE, the id number of the TAG to which
it belongs and from which it inherits its start and length.
NOTE: the file must be sorted so that any line that uses
a given id in this column comes after the line that
uses that id in the id column.
- debug
- ignored by the OffsetAnnotator; can contain any information
that is beneficial to a human reading the file
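The column rules above can be checked mechanically before indexing. The following Python sketch parses annotation lines and enforces the constraints listed: a valid type, unique ids >= 1, an INT64 value for a TAG, and a parentid that refers to a TAG on an earlier line. It is not the indexer's own parser. The names Annotation and parse_annotations, the sample data, and the choice to treat the debug column as optional are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    docno: str
    kind: str       # the "type" column: TAG or ATTRIBUTE
    ann_id: int     # the "id" column
    name: str
    start: int
    length: int
    value: str      # INT64 text for a TAG, free string for an ATTRIBUTE
    parentid: int
    debug: str


def parse_annotations(lines: Iterable[str]) -> List[Annotation]:
    """Parse tab-delimited annotation lines; raise ValueError on a bad line."""
    tags = {}        # id -> Annotation, for TAG lines read so far
    seen_ids = set()
    result = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"line {lineno}: expected at least 8 columns")
        docno, kind, id_text, name, start, length, value, parent_text = fields[:8]
        debug = "\t".join(fields[8:])  # assumption: debug may be absent
        if kind not in ("TAG", "ATTRIBUTE"):
            raise ValueError(f"line {lineno}: type must be TAG or ATTRIBUTE")
        ann_id, parent = int(id_text), int(parent_text)
        if ann_id < 1 or ann_id in seen_ids:
            raise ValueError(f"line {lineno}: id must be unique and >= 1")
        if kind == "TAG":
            if not -2**63 <= int(value) < 2**63:
                raise ValueError(f"line {lineno}: TAG value must fit in INT64")
            if parent != 0 and parent not in tags:
                raise ValueError(f"line {lineno}: parent TAG {parent} not yet seen")
            ann = Annotation(docno, kind, ann_id, name, int(start), int(length),
                             value, parent, debug)
            tags[ann_id] = ann
        else:
            if parent not in tags:
                raise ValueError(f"line {lineno}: parent TAG {parent} not yet seen")
            owner = tags[parent]
            # An ATTRIBUTE inherits start and length from its TAG.
            ann = Annotation(docno, kind, ann_id, name, owner.start, owner.length,
                             value, parent, debug)
        seen_ids.add(ann_id)
        result.append(ann)
    return result


# Hypothetical sample: a sentence TAG, a nested TAG, and an ATTRIBUTE on it.
sample = [
    "DOC-001\tTAG\t1\tsentence\t0\t12\t0\t0\tfirst sentence",
    "DOC-001\tTAG\t2\tperson\t3\t2\t0\t1\tnested in tag 1",
    "DOC-001\tATTRIBUTE\t3\tgender\t0\t0\tfemale\t2\tattribute of tag 2",
]
for annotation in parse_annotations(sample):
    print(annotation)
```

Reading the file in a single forward pass mirrors the sort requirement: a parentid can only be resolved if the line defining that id has already been read.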
The Lemur Project
Last modified: Thursday, 06-Oct-2005 09:14:45 EDT