Word Entity Duet: Indexing Documents

The Word Entity Duet project provides an indexing application that parses ClueWeb or Wall Street Journal documents into an Elasticsearch index. Entities tagged in each document, as described in Tagging Documents, are indexed in that document's entity field. Follow these steps to index documents.

Indexing Steps

  1. Increase the heap space used by Elasticsearch. In elasticsearch-6.1.2/config/jvm.options, set both -Xmx and -Xms to at least 2g (preferably 4g to 16g, if possible).
  2. Start Elasticsearch.
  3. (Optional, ClueWeb only) Prepare a spam filter. Spam scores for ClueWeb can be downloaded from the Waterloo Spam Rankings for the ClueWeb09 Dataset page. Filter the downloaded spam file so that it contains only the documents that you want treated as spam.
  4. Create a properties file for indexing.
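  5. Start indexing with the following command, where index.properties is the properties file from step 4: java -jar -Xmx4G indexer-1.0-jar-with-dependencies.jar index.properties (the -Xmx option sets the indexer's heap space; use at least 2G, preferably 4G to 8G).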
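
The spam filtering in step 3 can be sketched as a short Python script. This is a minimal sketch, not part of the Word Entity Duet project. It assumes that each line of the downloaded rankings file holds a percentile score followed by a ClueWeb document ID, that lower scores indicate spammier documents, and that the indexer accepts the filtered file in the same line format. The function name filter_spam and the default threshold of 70 are illustrative choices; adjust them to your data.

```python
import sys
from typing import Iterable, Iterator


def filter_spam(lines: Iterable[str], threshold: int = 70) -> Iterator[str]:
    """Yield only the lines whose percentile score is below `threshold`.

    Assumed line format (an assumption, check your download):
        <percentile-score> <clueweb-docid>
    Lower scores are assumed to mean "more likely spam".
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        score, doc_id = parts
        try:
            is_spam = int(score) < threshold
        except ValueError:
            continue  # skip lines whose score is not an integer
        if is_spam:
            yield f"{score} {doc_id}"


if __name__ == "__main__":
    # Hypothetical usage: python filter_spam.py rankings.txt spam.txt [threshold]
    if len(sys.argv) < 3:
        sys.exit("usage: filter_spam.py INPUT OUTPUT [THRESHOLD]")
    limit = int(sys.argv[3]) if len(sys.argv) > 3 else 70
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for kept in filter_spam(src, limit):
            dst.write(kept + "\n")
```

The output file keeps only the documents to be treated as spam, which is what step 3 asks for. Raising the threshold marks more documents as spam; lowering it marks fewer.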