Offset annotation support

IndriBuildIndex accepts the annotations parameter, which specifies a file containing offset annotations for the documents in a collection. It is specified as:

<corpus>
  <annotations>
    /path/to/file
  </annotations>
</corpus>

in the parameter file. The value may be either a single annotations file or the name of a directory containing a separate annotations file for each input file in the corpus path entry. For numeric fields given in offset annotations, the field parameter for that field must specify a different parserName, e.g.:

<parserName>OffsetAnnotationAnnotator</parserName>

Offset Annotation File Format

The offset annotation file is a 9-column, tab-delimited text file. From left to right, the columns are:

docno
external document id corresponding to the document in which the annotation occurs.
type
TAG or ATTRIBUTE
id
an id number for the annotation; each line should have a unique id >= 1.
name
for TAG, the name or type of the annotation; for ATTRIBUTE, the attribute name, or key.
start
start and length define the annotation's extent. The values should be byte offsets relative to the start of the document.
length
the number of bytes the annotation spans; meaningless for an ATTRIBUTE.
value
for TAG, an optional INT64 (for numeric values); for ATTRIBUTE, a string that is the attribute's value.
parentid
for TAG, the id number of another TAG to be considered the parent of this one; this is how hierarchical annotations are expressed. A TAG that has no parent has parentid = 0. For ATTRIBUTE, the id number of the TAG to which it belongs and from which it inherits its start and length. NOTE: the file must be sorted so that any line that uses a given id in this column comes after the line that uses that id in the id column.
debug
ignored by the OffsetAnnotationAnnotator; can contain any information that is beneficial to a human reading the file.
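
The column rules above can be checked before a file is handed to IndriBuildIndex. The following Python sketch is not part of Indri; the names Annotation and parse_annotations are hypothetical, and it makes a few assumptions that the format description does not settle: ids are treated as unique across the whole file, blank lines are skipped, start and length must be integers on every line (including ATTRIBUTE lines), and every ATTRIBUTE must name a parent TAG.

```python
from typing import NamedTuple

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


class Annotation(NamedTuple):
    """One line of the annotation file (hypothetical helper type)."""
    docno: str
    type: str       # "TAG" or "ATTRIBUTE"
    id: int
    name: str
    start: int
    length: int
    value: str      # kept as text; optional INT64 for TAG, string for ATTRIBUTE
    parentid: int
    debug: str


def parse_annotations(text: str) -> list[Annotation]:
    """Parse tab-delimited annotation lines and enforce the documented rules."""
    seen_ids: set[int] = set()   # every id used so far (assumed file-wide)
    tag_ids: set[int] = set()    # ids of TAG lines, the only valid parents
    annotations: list[Annotation] = []

    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"line {lineno}: expected 9 columns, got {len(cols)}")
        docno, kind, id_, name, start, length, value, parentid, debug = cols
        ann = Annotation(docno, kind, int(id_), name,
                         int(start), int(length), value, int(parentid), debug)

        if ann.type not in ("TAG", "ATTRIBUTE"):
            raise ValueError(f"line {lineno}: type must be TAG or ATTRIBUTE")
        if ann.id < 1 or ann.id in seen_ids:
            raise ValueError(f"line {lineno}: id must be unique and >= 1")
        if ann.type == "TAG" and ann.value != "":
            # A TAG value is optional, but if present it must fit in an INT64.
            if not INT64_MIN <= int(ann.value) <= INT64_MAX:
                raise ValueError(f"line {lineno}: TAG value out of INT64 range")
        if ann.type == "ATTRIBUTE" and ann.parentid == 0:
            raise ValueError(f"line {lineno}: ATTRIBUTE needs a parent TAG")
        if ann.parentid != 0 and ann.parentid not in tag_ids:
            # Sort rule: the parent TAG line must appear earlier in the file.
            raise ValueError(f"line {lineno}: parentid {ann.parentid} not yet defined")

        seen_ids.add(ann.id)
        if ann.type == "TAG":
            tag_ids.add(ann.id)
        annotations.append(ann)
    return annotations


# Illustrative data only: a sentence TAG, a nested person TAG, and an
# ATTRIBUTE attached to the person TAG.
sample = "\n".join([
    "doc1\tTAG\t1\tsentence\t0\t40\t\t0\tfirst sentence",
    "doc1\tTAG\t2\tperson\t0\t5\t\t1\tAlice",
    "doc1\tATTRIBUTE\t3\tgender\t0\t0\tfemale\t2\t",
])
annotations = parse_annotations(sample)
```

Tracking TAG ids separately from all ids is a deliberate choice: it rejects a parentid that points at an ATTRIBUTE line, since the description above only allows a TAG as a parent.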