Word Entity Duet: Tagging Documents

To create learning-to-rank features with entity information, documents must be tagged with entities before they are indexed. The Word Entity Duet project used TagMe for document and query tagging, but any entity tagger can be used as long as its output is in a format the indexer can read. To use the Freebase API entity information (entity names, descriptions, and aliases), the tags must be Wikipedia IDs.

Tagging Output

Tagged files must have the same names and directory structure as the document files. For example, if the input file is /ClueWeb09/en0000/00.warc.gz, the output file should be /ClueWeb09_TaggingOutput/en0000/00.warc.gz. The top-level directory of the tagged output can differ, but all directories and files beneath it must have the same names. Each file can contain one or several documents. The tagged output for each document must be JSON with this structure:
{
  "docno": "[DOCUMENT_ID]",
  "tagme": "[WIKIPEDIA_ID_1] [WIKIPEDIA_ID_2]..."
}
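
The naming rule and record layout above can be sketched in Python as follows. This is a minimal illustration, not code from the Word Entity Duet project: the helper names mirror_path and make_record are hypothetical, and the sample document ID and Wikipedia IDs are made up. The sketch does not cover how several records are separated within one file, since that is not specified here.

```python
import json
from pathlib import PurePosixPath


def mirror_path(input_file, input_root, output_root):
    """Map a document file to its tagged-output path.

    Only the top-level directory changes; everything beneath it is kept.
    (Hypothetical helper, not part of the project.)
    """
    relative = PurePosixPath(input_file).relative_to(input_root)
    return str(PurePosixPath(output_root) / relative)


def make_record(docno, wikipedia_ids):
    """Build the JSON text for one tagged document.

    The "tagme" value is a single string of space-separated Wikipedia IDs.
    """
    tags = " ".join(str(wiki_id) for wiki_id in wikipedia_ids)
    return json.dumps({"docno": docno, "tagme": tags})


out_path = mirror_path(
    "/ClueWeb09/en0000/00.warc.gz", "/ClueWeb09", "/ClueWeb09_TaggingOutput"
)
print(out_path)

# Illustrative document ID and Wikipedia IDs, not real tagger output.
record = make_record("clueweb09-en0000-00-00000", [12345, 67890])
print(record)
```

Using json.dumps rather than string concatenation keeps the separators and quoting valid even if a document ID contains characters that need escaping.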

TagMe Sample Scripts

If you are interested in using TagMe, the source code can be downloaded from GitHub. Java samples for tagging ClueWeb and Wall Street Journal documents, as well as a list of queries, are provided with the Word Entity Duet project on SourceForge.

To use the sample tagging code, copy the TagMe*.java files to the samples directory in the downloaded TagMe project.

Compile the samples using javac.

javac -cp lib/*:libgg/*:ext_lib/*:bin/:samples/ samples/TagMe[SCRIPT_NAME].java
TagMe runs faster with more memory, so give it as large a heap as the machine allows (the command here uses -Xmx128G). Run TagMe with this command:
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml \
  TagMe[SCRIPT_NAME] [DATA_DIRECTORY] [RESULTS_DIRECTORY]
Tagging every ClueWeb document with TagMe would take several months, so the ClueWeb sample can be limited to selected documents. We suggest tagging only the top N documents for each query. To do so, create a text file with one document ID per line, then pass that file to the TagMe application as a third argument:
java -cp lib/*:ext_lib/*:libgg/*:bin/:samples/ -Xmx128G -Dtagme.config=config.full.xml \
  TagMeClueWeb [DATA_DIRECTORY] [RESULTS_DIRECTORY] [TOP_N_FILES.txt]
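
One way to build the document ID file is to take the top N documents per query from an initial ranking. The sketch below is only an illustration under stated assumptions: it assumes the ranking is in the standard TREC run format (query_id Q0 doc_id rank score run_tag), which this page does not specify, and the function names top_n_docids and write_docid_file are hypothetical. It also drops repeated document IDs, on the assumption that each document needs to be listed only once.

```python
from collections import defaultdict


def top_n_docids(run_lines, n):
    """Return the document IDs of the n best-ranked documents per query.

    Assumes each line looks like: query_id Q0 doc_id rank score run_tag
    (standard TREC run format; an assumption about the ranking file).
    """
    per_query = defaultdict(list)
    for line in run_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        query_id, doc_id, rank = fields[0], fields[2], int(fields[3])
        per_query[query_id].append((rank, doc_id))

    selected = []
    seen = set()
    for query_id in per_query:
        # Sort by rank and keep the n best documents for this query.
        for _, doc_id in sorted(per_query[query_id])[:n]:
            if doc_id not in seen:  # one document may be retrieved for several queries
                seen.add(doc_id)
                selected.append(doc_id)
    return selected


def write_docid_file(doc_ids, path):
    """Write one document ID per line, the layout the text file needs."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc_id in doc_ids:
            handle.write(doc_id + "\n")


# Illustrative run lines; the IDs and scores are made up.
sample_run = [
    "1 Q0 clueweb09-en0000-00-00001 1 12.5 demo",
    "1 Q0 clueweb09-en0000-00-00002 2 11.0 demo",
    "1 Q0 clueweb09-en0000-00-00003 3 9.7 demo",
    "2 Q0 clueweb09-en0000-00-00002 1 8.1 demo",
    "2 Q0 clueweb09-en0000-00-00004 2 7.9 demo",
]
doc_ids = top_n_docids(sample_run, n=2)
print(doc_ids)
```

Calling write_docid_file(doc_ids, "top_n_docs.txt") would then produce a file that can be passed in the [TOP_N_FILES.txt] position of the command above; the file name is arbitrary.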