Using and Implementing Sifaka Document Parsers

Sifaka comes with two document parsers: a plain text file parser and a simplified TREC parser. It is also easy to implement a document parser for additional document types. To build a Sifaka index with any of these parsers, follow the instructions in Quick start.

A sample pre-built Sifaka index of Reuters data (sampleReutersIndex.zip) is also available on SourceForge: SourceForge Lemur Project Page

  1. How to use an existing document parser.
  2. How to implement a new document parser.
    1. Create a new class in the org.lemurproject.sifaka.buildindex.lucene.documentparser package that implements DocumentParser.
    2. Implement the required methods.
    3. Add the new parser class to the docParserMap in the constructor of org.lemurproject.sifaka.buildindex.lucene.factory.DocumentParserFactory with a short but descriptive key value.
    4. To use the new parser with sifakaBuildIndex, define documentType=[NEW_KEY_VALUE] in index.properties. The NEW_KEY_VALUE should match the key value that was added to the docParserMap in the DocumentParserFactory.
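
The implementation steps above can be sketched roughly as follows. This is a self-contained illustration of the pattern only, not Sifaka's actual code: the DocumentParser interface and DocumentParserFactory shown here are simplified stand-ins, and the parse method, the LinePerDocParser class, and the "lineperdoc" key are all hypothetical names chosen for the example. Check the real DocumentParser interface in the Sifaka source for the methods it actually requires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParserSketch {

    // Hypothetical stand-in for Sifaka's DocumentParser interface.
    // The real interface's required methods may differ.
    interface DocumentParser {
        List<Map<String, String>> parse(String rawInput);
    }

    // Steps 1 and 2: a new parser class that implements the interface.
    // This hypothetical parser treats each non-blank line as one document.
    static class LinePerDocParser implements DocumentParser {
        @Override
        public List<Map<String, String>> parse(String rawInput) {
            List<Map<String, String>> docs = new ArrayList<>();
            int nextId = 0;
            for (String line : rawInput.split("\\R")) {
                String body = line.trim();
                if (body.isEmpty()) {
                    continue; // skip blank lines
                }
                Map<String, String> doc = new HashMap<>();
                doc.put("id", "doc-" + nextId++);
                doc.put("body", body);
                docs.add(doc);
            }
            return docs;
        }
    }

    // Step 3: a stand-in factory that registers the parser in docParserMap
    // under a short, descriptive key (assumed here to be "lineperdoc").
    static class DocumentParserFactory {
        private final Map<String, DocumentParser> docParserMap = new HashMap<>();

        DocumentParserFactory() {
            docParserMap.put("lineperdoc", new LinePerDocParser());
        }

        DocumentParser getParser(String documentType) {
            DocumentParser parser = docParserMap.get(documentType);
            if (parser == null) {
                throw new IllegalArgumentException(
                        "Unknown documentType: " + documentType);
            }
            return parser;
        }
    }

    public static void main(String[] args) {
        // Step 4: the key looked up here plays the role of the
        // documentType value set in index.properties.
        DocumentParser parser = new DocumentParserFactory().getParser("lineperdoc");
        for (Map<String, String> doc : parser.parse("first document\n\nsecond document\n")) {
            System.out.println(doc.get("id") + ": " + doc.get("body"));
        }
    }
}
```

With a parser registered under that key, the matching line in index.properties would be documentType=lineperdoc. The key in the properties file and the key in docParserMap must be identical, since the factory looks the parser up by that string.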