
IdentifinderParser.hpp File Reference

#include "Parser.hpp"
#include "TextHandler.hpp"
#include "LinkedPropertyList.hpp"



namespace  lemur
namespace  lemur::parse


#define BEGIN_PREFIX   "B_"
#define END_PREFIX   "E_"
#define PREFIX_LEN   2

Define Documentation

#define BEGIN_PREFIX   "B_"

Parses documents that use document separation tags similar to NIST's Web format: <DOC></DOC> around documents and <DOCNO></DOCNO> around docids. This parser recognizes named entity tags from the Identifinder tagger and passes them along as properties. For each tag X, it also adds b_X and e_X to the first and last token of each entity. For example, if "Carnegie Mellon University" were identified as a place, it would be parsed with the following properties:

    Carnegie [b_place] [place]
    Mellon [place]
    University [e_place] [place]

A single-token entity, such as Madonna, would be parsed as:

    Madonna [b_person] [person] [e_person]

Does case folding for words that are not in the acronym list. Contraction suffixes and possessive suffixes are stripped.
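The begin/end tagging rule described above can be sketched as follows. This is a minimal illustration, not the parser's actual implementation: `TaggedToken`, `tagEntity`, `kBeginPrefix`, and `kEndPrefix` are hypothetical names that do not appear in the Lemur API. The prefix constants mirror the values of BEGIN_PREFIX and END_PREFIX defined in this file; the description above writes the prefixes in lowercase, and the order of properties on each token here is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names for illustration only; not part of the Lemur API.
const std::string kBeginPrefix = "B_";  // mirrors BEGIN_PREFIX
const std::string kEndPrefix = "E_";    // mirrors END_PREFIX

// A token together with the properties attached to it.
struct TaggedToken {
    std::string text;
    std::vector<std::string> properties;
};

// Attach the entity tag to every token, plus the begin-prefixed tag on the
// first token and the end-prefixed tag on the last token. A single-token
// entity receives all three properties.
std::vector<TaggedToken> tagEntity(const std::vector<std::string>& tokens,
                                   const std::string& tag) {
    std::vector<TaggedToken> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        TaggedToken t{tokens[i], {}};
        if (i == 0) {
            t.properties.push_back(kBeginPrefix + tag);
        }
        t.properties.push_back(tag);
        if (i + 1 == tokens.size()) {
            t.properties.push_back(kEndPrefix + tag);
        }
        out.push_back(t);
    }
    return out;
}
```

Under these assumptions, calling `tagEntity({"Madonna"}, "person")` yields one token carrying the begin-prefixed tag, the plain tag, and the end-prefixed tag, which matches the single-token case described above.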

U.S.A., USA's, and USAs are converted to USA. Does not recognize acronyms with numbers.
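The acronym conversions listed above can be illustrated with a small sketch. It is an assumption-laden approximation: `normalizeAcronym` is a hypothetical helper, and it decides purely from the shape of the token (periods, a possessive or plural suffix, all uppercase letters), whereas the parser itself is described as consulting an acronym list. Tokens that contain digits are rejected, reflecting the note that acronyms with numbers are not recognized.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper for illustration only; not part of the Lemur API.
// Returns the normalized acronym, or an empty string when the token does
// not look like an acronym under this sketch's shape-based rule.
std::string normalizeAcronym(const std::string& token) {
    // Drop periods: "U.S.A." becomes "USA".
    std::string s;
    for (char c : token) {
        if (c != '.') {
            s += c;
        }
    }
    // Strip a possessive "'s" ("USA's") or a plural "s" ("USAs").
    if (s.size() > 2 && s.compare(s.size() - 2, 2, "'s") == 0) {
        s.erase(s.size() - 2);
    } else if (s.size() > 1 && s.back() == 's') {
        s.pop_back();
    }
    if (s.empty()) {
        return "";
    }
    // Only uppercase letters remain in an acronym; digits and lowercase
    // letters cause the token to be rejected.
    for (char c : s) {
        if (!std::isupper(static_cast<unsigned char>(c))) {
            return "";
        }
    }
    return s;
}
```

With this sketch, "U.S.A.", "USA's", and "USAs" all normalize to "USA", while a token such as "B2B" is rejected because it contains a digit.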

#define END_PREFIX   "E_"

#define PREFIX_LEN   2

Generated on Tue Jun 15 11:02:56 2010 for Lemur by doxygen 1.3.4