Parsing in Lemur
Contents
- Overview
- The Parser Architecture
- Supporting classes
- TextHandler classes
- The Parser Applications
This document discusses the parsing utilities provided by
the Lemur toolkit. They have been
designed with flexibility and extensibility in mind. If the functionality
required is not currently implemented by the toolkit, it should be easy to add
the functionality and plug it into the parser framework.
The first sections describe the parser architecture and API for
developers. The
final section describes the parser applications and their options.
The Lemur parser architecture revolves around one class,
TextHandler, that allows for the chaining or pipelining of
common parser components. A TextHandler may be a stop-word
list, stemmer, indexer, or parser. Information is passed from a
source, through TextHandlers that modify information and pass
it on, to a destination TextHandler. An example of a source
TextHandler would be a parser. A stemmer would modify text and
pass the information on to other TextHandlers. A destination
TextHandler might write parsed data to a file or build
an index.
The TextHandler class enforces chaining through its
interface. Functions of the
TextHandler class are described below. The next TextHandler
in a chain is set using the setTextHandler function. For
example, calling the Parser's setTextHandler function with an
argument of the Stop-word list would cause information to be passed from
the Parser to the Stop-word list. A TextHandler may modify the
information it receives before passing the information on to the next
TextHandler.
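The chaining described above can be sketched with a small stand-in class. This is not Lemur's TextHandler: the names Handler, Lowercaser, and Collector are hypothetical, and the interface is reduced to a single word-passing call so the pattern is easy to see.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for a chainable handler; not Lemur's actual class.
class Handler {
public:
  virtual ~Handler() = default;
  // Set the next handler in the chain.
  void setTextHandler(Handler* next) { next_ = next; }
  // Receive a word, optionally modify it, and pass it on.
  void foundWord(const std::string& word) {
    std::string out = handleWord(word);
    if (next_ != nullptr) next_->foundWord(out);
  }
protected:
  // Base implementation: pass the word through unchanged.
  virtual std::string handleWord(const std::string& word) { return word; }
private:
  Handler* next_ = nullptr;
};

// Modifies the information it receives: converts words to lowercase.
class Lowercaser : public Handler {
protected:
  std::string handleWord(const std::string& word) override {
    std::string out = word;
    for (char& c : out)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return out;
  }
};

// Destination handler: records every word that reaches the end of the chain.
class Collector : public Handler {
public:
  std::vector<std::string> words;
protected:
  std::string handleWord(const std::string& word) override {
    words.push_back(word);
    return word;
  }
};
```

With a chain built as source.setTextHandler(&lower) and lower.setTextHandler(&sink), every word given to source.foundWord reaches the Collector in lowercase. Subclasses override only the one function they need, which mirrors the base-implementation design of the real class.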
Although a TextHandler may modify the information it receives, it also
passes along the original information. It can also pass a list of Property objects
associated with each token. Base implementations of all functions are provided by the
TextHandler class; a subclass will only need to override the
functions that it needs.
The TextHandler class provides the basis for most of the
classes used by Lemur for parsing. The hope is that this class will
provide a flexible base for extending parser functionality.
The following subsections discuss important members and supporting classes related to the
TextHandler.
TextHandler::TokenType used with
void foundToken(TextHandler::TokenType type, char * token, char * orig, PropertyList * properties);
TokenType is an enumeration including words, tags, and document boundary markers.
You may add to this list of types for your own tools. For example,
you may wish to use a parser that identifies sentence boundaries. An appropriate way to
pass this information along the TextHandler chain would be to add types for
beginning of sentence and end of sentence boundaries. Here's a list of the current
types:
- WORDSTR
Calling foundToken with TextHandler::WORDSTR as the token type is
equivalent to the foundWord call of the old TextHandler class.
- BEGINDOC
The BEGINDOC type is reserved for signaling the beginning of a document.
The token and orig arguments to foundToken should contain the document number.
(This call is equivalent to the old foundDoc function.)
- ENDDOC
The ENDDOC type is used to signal the end of a document. Classes using the
TextHandler expect this call; make sure your parsers produce it.
- BEGINTAG
This type has been added for support of XML. This could also be
used for HTML or SGML parsers. Or even more generally, it could be
used to represent hierarchical structure boundaries.
The token and orig arguments should contain only the type of the
tag. If the tag is <h3 align="center">, then token and orig
should contain "h3". The properties argument to the foundToken call
should contain the align information.
- ENDTAG
This type has also been added for support of XML.
The token argument should contain just the type of the tag (e.g.
"h3").
Property
A Property object will generally have a name (so it can be retrieved from
a list), a data type, a data size, and the data value. Any data type can be added to
a Property through the use of the overloaded setValue() method.
However, you have to modify the class if you want your own type to be recognized and not
be returned as a Property::UNKNOWN type (when getType() is called).
Name and values are copied when set so the Property has its own memory
management.
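A minimal sketch of the idea, assuming a simplified class named SimpleProperty (hypothetical, not Lemur's Property). It recognizes two types through overloads of setValue, reports anything else as UNKNOWN, and copies both the name and the value so the object manages its own memory.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Simplified, hypothetical model of a named, typed property.
class SimpleProperty {
public:
  enum DataType { INT, STRING, UNKNOWN };

  void setName(const char* name) { name_ = name; }  // copies the name
  const std::string& getName() const { return name_; }

  // Overloads for the types this sketch recognizes.
  void setValue(int v) { store(&v, sizeof v, INT); }
  void setValue(const char* s) { store(s, std::strlen(s) + 1, STRING); }
  // Any other data is still stored, but its type is reported as UNKNOWN.
  void setValue(const void* data, std::size_t size) { store(data, size, UNKNOWN); }

  DataType getType() const { return type_; }
  std::size_t getSize() const { return data_.size(); }
  const void* getValue() const { return data_.data(); }

private:
  void store(const void* p, std::size_t n, DataType t) {
    const char* bytes = static_cast<const char*>(p);
    data_.assign(bytes, bytes + n);  // private copy of the value
    type_ = t;
  }
  std::string name_;
  std::vector<char> data_;
  DataType type_ = UNKNOWN;
};
```

Because the name and value are copied, the caller can reuse or free its own buffers immediately after calling setName or setValue.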
PropertyList
A PropertyList is a container for properties of tokens. Example
properties may be the byte offset of the token in the file, attributes associated with a
tag, document properties, and so on. Items in the property list are
(name, value) pairs.
A PropertyList object is owned by its creator. That is, you should not
assume that the properties in it will be the same in subsequent calls
to TextHandler::foundToken. The creator is also responsible for
freeing the memory associated with the list.
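As an illustration of tag attributes carried as (name, value) pairs, the sketch below splits an opening tag into the tag type, which is what the token argument of a BEGINTAG call would hold, and its attributes. The helper splitTag and the struct TagInfo are hypothetical, and a std::map stands in for a PropertyList. The code assumes well-formed tags with double-quoted attribute values.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical result type: the tag type plus its attributes.
struct TagInfo {
  std::string type;                          // e.g. "h3"
  std::map<std::string, std::string> attrs;  // e.g. align -> center
};

// Split an opening tag into its type and its name="value" attributes.
TagInfo splitTag(const std::string& tag) {
  TagInfo info;
  std::size_t i = 1;  // skip '<'
  while (i < tag.size() && tag[i] != '>' &&
         !std::isspace(static_cast<unsigned char>(tag[i])))
    info.type += tag[i++];
  while (i < tag.size()) {
    while (i < tag.size() && std::isspace(static_cast<unsigned char>(tag[i]))) ++i;
    if (i >= tag.size() || tag[i] == '>') break;
    std::string name;
    while (i < tag.size() && tag[i] != '=' && tag[i] != '>') name += tag[i++];
    if (i >= tag.size() || tag[i] != '=') break;  // attribute without a value
    ++i;                                          // skip '='
    if (i >= tag.size() || tag[i] != '"') break;  // only quoted values handled
    ++i;                                          // skip opening quote
    std::string value;
    while (i < tag.size() && tag[i] != '"') value += tag[i++];
    ++i;                                          // skip closing quote
    info.attrs[name] = value;
  }
  return info;
}
```

For the tag <h3 align="center">, the type is "h3" and the attribute map holds align mapped to center. Whoever builds such a list owns it, so a handler that needs the values later should copy them rather than keep a pointer.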
TextHandlerManager
This class facilitates the creation of Parser, Stemmer, and Stopper objects. A new TextHandler class only needs to be added to the TextHandlerManager to become available to all existing applications that use the TextHandlerManager. It accepts the type to create as a parameter, but will check the parameter stack if nothing is specified.
The
following subsections discuss TextHandler classes used by the Lemur
applications. The only one of the following classes that does not extend the
TextHandler class is the WordSet class.
WordSet
The WordSet class is a simple wrapper to a set. It is useful
for stop-word lists or acronym lists. It can load a list from a
file. The file format is one word per line. WordSet does NOT
remove white space on either side of the word, so be careful when editing
these files. The contains function is used to check the presence of a
word in the set.
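The white space caveat is easy to see in a sketch. The class below, SimpleWordSet, is a hypothetical stand-in rather than Lemur's WordSet. It reads one word per line from any input stream and stores each line exactly as written.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for a word set; not Lemur's WordSet class.
class SimpleWordSet {
public:
  // Read one word per line. Lines are stored exactly as written,
  // so surrounding white space becomes part of the stored word.
  void load(std::istream& in) {
    std::string line;
    while (std::getline(in, line)) {
      if (!line.empty()) words_.insert(line);
    }
  }
  // Check the presence of a word in the set.
  bool contains(const std::string& word) const {
    return words_.count(word) > 0;
  }
private:
  std::set<std::string> words_;
};
```

Loading the three lines "the", "of " (with a trailing space), and "and" from a std::istringstream gives a set in which contains("the") is true but contains("of") is false, because the stored entry is "of ".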
Parser
The Parser class is a generic interface for the parsers in
the toolkit. It assumes subclasses implement a parse function,
which takes a filename. The acronym list is a WordSet, and some
of the toolkit parsers check uppercase words recognized as acronyms
against this list. If the word is in the acronym list, it is left
uppercase. Otherwise, the word is converted to lowercase. If you do not
wish to support the acronym list when you design your parser, that is
fine. You can simply ignore the acronym list.
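The case rule can be sketched as a small function. The names isUppercaseWord and normalizeCase are hypothetical, a std::set stands in for the acronym WordSet, and the sketch does not attempt the acronym recognition that the toolkit parsers perform (for example, mapping U.S.A. to USA).

```cpp
#include <cctype>
#include <set>
#include <string>

// True if the word has at least one letter and all of its letters are uppercase.
bool isUppercaseWord(const std::string& w) {
  bool sawLetter = false;
  for (unsigned char c : w) {
    if (std::isalpha(c)) {
      if (!std::isupper(c)) return false;
      sawLetter = true;
    }
  }
  return sawLetter;
}

// Keep an uppercase word as is when it is in the acronym list;
// otherwise convert the word to lowercase.
std::string normalizeCase(const std::string& word,
                          const std::set<std::string>& acronyms) {
  if (isUppercaseWord(word) && acronyms.count(word) > 0) return word;
  std::string out = word;
  for (char& c : out)
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  return out;
}
```

With an acronym list containing only USA, the word USA stays uppercase while NASA and Lemur are both lowercased. A parser that ignores the acronym list would simply lowercase everything.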
Both the TrecParser and the WebParser remove
contractions and possessives, have a simple acronym recognizer, and
convert words to lowercase.
The parsers assume that there is some SGML style markup separating documents and specifying the document number. The format for web documents is
<DOC>
<DOCNO> document_number </DOCNO>
document text
</DOC>
and the format for trec formatted documents is
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
document text
</TEXT>
</DOC>
These document formats allow the inclusion of multiple documents in
the same text file.
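To make the layout concrete, here is a minimal sketch that splits a buffer holding several documents into (document number, body) pairs. The function splitDocs and the struct RawDoc are hypothetical. The toolkit's parsers work differently and also tokenize the text, which this sketch does not attempt.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one extracted document.
struct RawDoc {
  std::string docno;
  std::string body;
};

// Remove leading and trailing white space.
std::string trim(const std::string& s) {
  const char* ws = " \t\r\n";
  std::size_t b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  std::size_t e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Split text into documents delimited by <DOC> ... </DOC>.
std::vector<RawDoc> splitDocs(const std::string& text) {
  const std::string open = "<DOC>", close = "</DOC>";
  const std::string noOpen = "<DOCNO>", noClose = "</DOCNO>";
  std::vector<RawDoc> docs;
  std::size_t pos = 0;
  while ((pos = text.find(open, pos)) != std::string::npos) {
    std::size_t end = text.find(close, pos);
    if (end == std::string::npos) break;  // unterminated document
    std::string doc = text.substr(pos + open.size(), end - pos - open.size());
    RawDoc d;
    std::size_t a = doc.find(noOpen);
    std::size_t b = doc.find(noClose);
    if (a != std::string::npos && b != std::string::npos && a < b) {
      d.docno = trim(doc.substr(a + noOpen.size(), b - a - noOpen.size()));
      d.body = trim(doc.substr(b + noClose.size()));
    } else {
      d.body = trim(doc);  // no document number found
    }
    docs.push_back(d);
    pos = end + close.size();
  }
  return docs;
}
```

For TREC formatted input, the body returned here still contains the TEXT tags. A real parser would go on to decide which fields to read.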
TrecParser
The TrecParser provides a simple but effective parser for
NIST's TREC document format. It recognizes text in the TEXT,
HL, HEAD, HEADLINE, TTL, and
LP fields.
WebParser
The WebParser behaves very similarly to the
TrecParser. It parses HTML documents in the NIST TREC format
used for the Web Tracks. The parser removes HTML tags. Text within SCRIPT
tags is removed, as is text in HTML comments.
ReutersParser
The ReutersParser extracts the TEXT,
HEADLINE, and TITLE fields and removes other
tags.
BrillPOSParser
Similar to WebParser in the tags used to separate documents, but recognizes
terms with "/" slashes in them. This is the usual output from a Brill
part-of-speech tagger: term/pos. Use in combination with a BrillPOSTokenizer, which
tokenizes at the separator and passes the part of speech along as a Property.
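The term/pos split can be sketched in a few lines. The function splitTagged is hypothetical. Splitting at the last separator, so that a term which itself contains a slash stays intact, is an assumption of this sketch and not documented behavior of BrillPOSTokenizer.

```cpp
#include <string>
#include <utility>

// Split "term/pos" into (term, pos). Assumption: the LAST separator
// divides the term from its part-of-speech tag.
std::pair<std::string, std::string> splitTagged(const std::string& token,
                                                char sep = '/') {
  std::size_t i = token.rfind(sep);
  if (i == std::string::npos) return {token, ""};  // no tag present
  return {token.substr(0, i), token.substr(i + 1)};
}
```

The first element would be passed along as the token and the second as a property of that token.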
IdentifinderParser
Similar to WebParser in the tags used to separate documents, but
extracts named entities from tags output by Identifinder and passes them along as
Property objects. Prefixes are added to the tags to indicate the beginning and
end of multi-token entities.
For example, if "Carnegie Mellon University" was identified as a place,
it would be parsed with the following properties:
Carnegie [place] [b_place] Mellon [place] University [place] [e_place]
A single-token entity, like Madonna, would be
Madonna [person] [b_person] [e_person]
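The prefix scheme in these two examples can be reproduced with a short sketch. The names TaggedToken and tagEntity are hypothetical, and plain strings stand in for Property objects.

```cpp
#include <string>
#include <vector>

// Hypothetical pairing of a token with the names of its properties.
struct TaggedToken {
  std::string token;
  std::vector<std::string> props;
};

// Attach the entity type to every token, plus a b_ marker on the first
// token and an e_ marker on the last token of the entity.
std::vector<TaggedToken> tagEntity(const std::vector<std::string>& tokens,
                                   const std::string& type) {
  std::vector<TaggedToken> out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    TaggedToken t;
    t.token = tokens[i];
    t.props.push_back(type);
    if (i == 0) t.props.push_back("b_" + type);
    if (i + 1 == tokens.size()) t.props.push_back("e_" + type);
    out.push_back(t);
  }
  return out;
}
```

A single-token entity is both first and last, so it receives all three properties, which matches the Madonna example.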
InQueryOpParser
The InQueryOpParser provides parsing for queries written in the
InQuery structured query language.
ArabicParser
The ArabicParser provides a simple but effective parser for
NIST's TREC document format for Arabic documents encoded in Windows
CodePage 1256 encoding (CP1256). It recognizes text in the TEXT,
HL, HEAD, HEADLINE, TTL, and
LP fields.
InqArabicParser
The InqArabicParser provides parsing for queries written in the InQuery
structured query language in Arabic, encoded in
Windows CodePage 1256 (CP1256).
ChineseParser
The ChineseParser provides a simple but effective parser for
NIST's TREC document format for Chinese documents encoded in GB encoding
(GB2312). It recognizes text in the TEXT, HL,
HEAD, HEADLINE, TTL, and LP
fields. This parser is suitable for parsing segmented (tokenized)
documents.
ChineseCharParser
The ChineseCharParser provides a simple but effective parser for
NIST's TREC document format for Chinese documents encoded in GB encoding
(GB2312). It recognizes text in the TEXT, HL,
HEAD, HEADLINE, TTL, and LP
fields. This parser is suitable for parsing unsegmented documents,
producing one token per Chinese character.
Stemmer
The Stemmer class provides an interface for stemmers. All
that is required of a subclass is that it implement the
stemWord function. The stemWord function may overwrite
the current word, but should return the stem as its return
value. Currently, the toolkit provides three subclasses:
PorterStemmer, KStemmer, and ArabicStemmer.
PorterStemmer
PorterStemmer uses Porter's official stemmer (in C) to stem
words. The PorterStemmer class does not stem words beginning
with an uppercase letter. This is to prevent stemming of acronyms or
names.
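A minimal sketch of the stemmer interface together with the uppercase rule. The classes SimpleStemmer and ToyStemmer are hypothetical, and the suffix rule is a toy (it strips one trailing "s"). It is not the Porter algorithm.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical interface modeled on the description of the Stemmer class.
class SimpleStemmer {
public:
  virtual ~SimpleStemmer() = default;
  // May overwrite the word in place; returns the stem.
  virtual char* stemWord(char* word) = 0;
};

// Toy stemmer: strips one trailing 's'. NOT the Porter algorithm.
class ToyStemmer : public SimpleStemmer {
public:
  char* stemWord(char* word) override {
    // Words beginning with an uppercase letter are left untouched,
    // which protects acronyms and names.
    if (word == nullptr || std::isupper(static_cast<unsigned char>(word[0])))
      return word;
    std::size_t n = std::strlen(word);
    if (n > 2 && word[n - 1] == 's') word[n - 1] = '\0';  // overwrite in place
    return word;
  }
};
```

Here "parsers" becomes "parser", while "Mars" is returned unchanged because it begins with an uppercase letter.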
KStemmer
KStemmer uses Krovetz's stemmer (in C) to stem
words. This is a less aggressive stemmer than the Porter stemmer.
ArabicStemmer
ArabicStemmer uses one of Larkey's Arabic stemmers (in C) to
stem Arabic words. It provides five different stemming functions:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light stemming
- arabic_light10_stop : light stemming with stopping
Stopper
The Stopper
class is a subclass of the WordSet
class and the TextHandler
class. It replaces words in the
stop-word list with a NULL pointer.
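The replacement rule can be sketched as a single function. The name stopWord is hypothetical and a std::set stands in for the stop-word list.

```cpp
#include <set>
#include <string>

// Return a null pointer for stop words and the word itself otherwise.
const char* stopWord(const char* word, const std::set<std::string>& stopwords) {
  if (word == nullptr) return nullptr;  // already removed upstream
  return stopwords.count(word) > 0 ? nullptr : word;
}
```

Any handler placed after a stopper in a chain therefore has to check for a null word before using it.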
QueryTextHandler
The QueryTextHandler
checks to see if a word in the query occurs more often in uppercase than
in its original form in an Index. If the
uppercase form is more common than the original form, the uppercase form is added to the
query. This is to handle cases where
acronyms are not capitalized in the query.
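The frequency comparison can be sketched as follows. A std::map of term counts stands in for an Index, and the names TermCounts, countOf, and preferUppercase are hypothetical. How the real class queries the index is not shown here.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical term-count table standing in for an Index.
using TermCounts = std::map<std::string, long>;

// Count of a term, or 0 if the term is not in the table.
long countOf(const TermCounts& idx, const std::string& term) {
  TermCounts::const_iterator it = idx.find(term);
  return it == idx.end() ? 0 : it->second;
}

// True if the uppercase form of the word occurs more often than the
// word as it was written in the query.
bool preferUppercase(const TermCounts& idx, const std::string& word) {
  std::string upper = word;
  for (char& c : upper)
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return countOf(idx, upper) > countOf(idx, word);
}
```

With counts of 3 for "usa" and 120 for "USA", a query word "usa" would be treated as an uncapitalized acronym.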
WriterTextHandler
The WriterTextHandler
class writes information from a TextHandler
chain to a file. This file is in a
format compatible with BuildBasicIndex.
WriterInQueryHandler
The WriterInQueryHandler
class writes information from a TextHandler
chain processing the InQuery structured query language to a file. This
file is in a format compatible with BuildBasicIndex.
InvFPTextHandler
The InvFPTextHandler takes information from a
TextHandler chain and uses InvFPPushIndex to build an
InvFPIndex. Stop-words are not counted in the document
length.
There are some parser applications provided in the
toolkit. ParseToFile
writes parsed text to a file, ParseQuery
parses queries and writes output to a file, and ParseInQueryOp
parses InQuery structured query language queries and writes output to a file.
ParseToFile
ParseToFile
parses documents and writes output in BasicDoc format.
The program uses one of the toolkit's Parser classes to parse.
Usage: ParseToFile paramfile datfile1 datfile2 ...
Summary of parameters in paramfile:
- outputFile Name of file to output parsed documents to.
- stopwords Name of file containing stopword list.
Words in this file should be one per line. If this parameter is not
specified, all words are output to the file.
- acronyms Name of file containing acronym list (one word
per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs
USA's U.S.A.) are left uppercase if in the acronym list. If no acronym
list is specified, acronyms will not be recognized.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer, requires additional parameters
    - KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
    - arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    - arabicStemFunc: Which stemming algorithm to apply, one of:
      - arabic_stop : arabic_stop
      - arabic_norm2 : table normalization
      - arabic_norm2_stop : table normalization with stopping
      - arabic_light10 : light9 plus ll prefix
      - arabic_light10_stop : light10 and remove stop words
ParseQuery
ParseQuery
parses queries using one of the toolkit's Parser classes and an
Index.
Usage: ParseQuery paramfile datfile1 datfile2 ...
Summary of parameters in paramfile:
- qryOutFile The name of the file to write the parsed queries to.
- index Name of the index (with the extension).
- stopwords Name of file containing stopword list. Words in this file should be one
per line. If this parameter is not specified, all words are left in the
query.
- acronyms Name of file containing acronym list (one word per line). Uppercase words
recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase as
USA if USA is in the acronym list. If
no acronym list is specified, acronyms will not be recognized.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer, requires additional parameters
    - KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
    - arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    - arabicStemFunc: Which stemming algorithm to apply, one of:
      - arabic_stop : arabic_stop
      - arabic_norm2 : table normalization
      - arabic_norm2_stop : table normalization with stopping
      - arabic_light10 : light9 plus ll prefix
      - arabic_light10_stop : light10 and remove stop words
ParseInQueryOp
ParseInQueryOp
parses queries using the InQueryOpParser class.
Usage: ParseInQueryOp paramfile datfile1 datfile2 ...
The parameters are:
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- docFormat:
  - "trec" for standard TREC formatted documents
  - "web" for web TREC formatted documents
  - "chinese" for segmented Chinese text (TREC format, GB encoding)
  - "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  - "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
  - "porter" Porter stemmer.
  - "krovetz" Krovetz stemmer, requires additional parameters
    - KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
  - "arabic" Arabic stemmer, requires additional parameters
    - arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    - arabicStemFunc: Which stemming algorithm to apply, one of:
      - arabic_stop : arabic_stop
      - arabic_norm2 : table normalization
      - arabic_norm2_stop : table normalization with stopping
      - arabic_light10 : light9 plus ll prefix
      - arabic_light10_stop : light10 and remove stop words
- outputFile: name of the output file.
The Lemur Project
Last modified: Friday, 17-Jun-2005 10:53:14 EDT