Overview of the Lemur Toolkit
Contents
- What is Lemur?
- What kinds of things can Lemur do?
- How can it be useful?
- What have people used Lemur for?
- How can I use Lemur?
- What does Lemur come with?
- What was Lemur written in, and what platforms does it work on?
Lemur is a toolkit designed to facilitate
research in language modeling and information retrieval (IR), where IR
is broadly interpreted to include such technologies as ad hoc and
distributed retrieval, retrieval with structured queries, cross-language IR,
summarization, filtering, and categorization. The system's
underlying architecture was built to support these
technologies. We provide many useful sample applications, but have
designed the toolkit to allow you to easily program your own
customizations and applications.
The Lemur toolkit supports the construction of basic text retrieval
systems using language modeling methods, as well as traditional methods
such as those based on the vector space model and Okapi. As the toolkit
evolves, we expect it to support research in a broader range
of information technologies, such as filtering and even question
answering.
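To make the Okapi family of methods mentioned above more concrete, the following sketch computes a BM25 term weight, one common formulation of Okapi-style scoring. It is an illustration under stated assumptions, not Lemur's implementation: the function name `bm25TermWeight` and the default values `k1 = 1.2` and `b = 0.75` are choices made for this example, and Lemur's own Okapi code may differ in its details.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: a common BM25 formulation, not taken from Lemur's source.
// tf        - occurrences of the query term in the document
// docLen    - length of the document in tokens
// avgDocLen - average document length in the collection
// docFreq   - number of documents containing the term
// numDocs   - total number of documents in the collection
// k1, b     - tuning parameters (the defaults here are assumed, commonly used values)
double bm25TermWeight(double tf, double docLen, double avgDocLen,
                      double docFreq, double numDocs,
                      double k1 = 1.2, double b = 0.75) {
    assert(avgDocLen > 0.0 && docFreq > 0.0 && numDocs >= docFreq);

    // Inverse document frequency; it becomes negative for terms that
    // occur in more than half of the documents.
    const double idf = std::log((numDocs - docFreq + 0.5) / (docFreq + 0.5));

    // Length normalization: longer-than-average documents are penalized.
    const double norm = k1 * (1.0 - b + b * docLen / avgDocLen);

    // Saturating term-frequency component scaled by the idf.
    return idf * tf * (k1 + 1.0) / (tf + norm);
}
```

A document's score for a query would then be the sum of these weights over the query terms that occur in the document.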
Lemur is particularly useful for researchers in language modeling and
information retrieval who do not want to write their own indexers but
would rather focus on developing new techniques and algorithms. In
addition to indexing, we provide some baseline retrieval algorithms,
such as Okapi and KL-divergence, for direct use and for comparison.
You can use Lemur to build your own search systems. We have
implemented and included basic ad hoc IR, distributed IR, IR using
structured queries, IR using distributed indexes, document clustering,
and summarization. Others have used Lemur for filtering tasks, web page
finding, passage finding, and web search engines.
The toolkit has been used to carry out experiments on several different
aspects of language modeling for ad hoc retrieval. For example, it has
been used to compare smoothing strategies for document models and query
expansion methods for estimating query models on standard TREC collections;
for an example of its use, see the SIGIR 2001 paper "A study of smoothing
methods for language models applied to ad hoc information
retrieval."
The toolkit has also been used for tasks at TREC, including filtering
and web page finding. It has been used in classrooms for instruction
about information retrieval and web search engines. It also supports
research projects in various other aspects of IR, such as question
answering and distributed networks.
Lemur has many applications for indexing and retrieval that are fully
functional for many purposes, so you can use them "out of the box". In
addition, since Lemur was written to facilitate research on language modeling and IR,
the design allows you to try out new retrieval methods by subclassing
abstract interfaces, or write new applications based on existing
methods.
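As a rough illustration of the subclassing idea, the sketch below defines an abstract scoring interface and one concrete retrieval method derived from it. All names here (`TermStats`, `ScoringFunction`, `SimpleTfIdf`) are invented for this example and are not Lemur's actual classes; consult the toolkit's API documentation for the real interfaces.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical statistics for one query term matched in one document.
// These names are illustrative only and do not come from Lemur.
struct TermStats {
    double termFreq;  // occurrences of the term in the document
    double docFreq;   // number of documents containing the term
    double numDocs;   // total number of documents in the collection
};

// Abstract interface: a retrieval method only has to define how a single
// matched term contributes to the document score.
class ScoringFunction {
public:
    virtual ~ScoringFunction() = default;
    virtual double scoreTerm(const TermStats& s) const = 0;

    // Shared driver: sum the per-term scores over all matched query terms.
    double scoreDocument(const std::vector<TermStats>& matched) const {
        double total = 0.0;
        for (const TermStats& s : matched) {
            total += scoreTerm(s);
        }
        return total;
    }
};

// One concrete method: raw term frequency times log inverse document frequency.
class SimpleTfIdf : public ScoringFunction {
public:
    double scoreTerm(const TermStats& s) const override {
        assert(s.docFreq > 0.0 && s.numDocs >= s.docFreq);
        return s.termFreq * std::log(s.numDocs / s.docFreq);
    }
};
```

Under this design, trying a new method means writing another subclass that overrides `scoreTerm`; the driver code that walks the matched terms stays unchanged.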
The source code is provided to encourage users to modify the
toolkit in support of their own research, development, or teaching
activities. All are welcome and encouraged to submit their
modifications to the Lemur project developers, so that they can
be considered for inclusion in subsequent versions of the toolkit.
Lemur comes with all the source code and makefiles necessary to build
the libraries for indexing and retrieval (under a CMU and UMass licensing agreement). For Windows,
you can download the pre-compiled libraries and executables.
Lemur currently supports the following features:
- Indexing:
  - English, Chinese and Arabic text
  - word stemming (Porter and Krovetz stemmers)
  - omitting stopwords
  - recognizing acronyms
  - token level properties, like part of speech and named entities
  - passage indexing
  - incremental indexing
- Retrieval:
  - ad hoc retrieval (TFIDF, Okapi, and InQuery)
  - passage retrieval
  - cross-lingual retrieval
  - language modeling (KL-divergence)
  - query model updating for pseudo feedback
  - two-stage smoothing
  - smoothing with Dirichlet prior or Markov chain
  - relevance feedback
  - structured query language
- Distributed IR:
  - query-based sampling
  - database ranking (CORI)
  - results merging (CORI, single regression and multi-regression merge)
- Document Clustering
- Summarization
- Simple text processing
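To give a feel for the language modeling features in the list above, here is a minimal sketch of Dirichlet prior smoothing used for query-likelihood scoring. It uses the standard formula p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu), where c(w;d) is the term's count in the document, |d| is the document length, and p(w|C) is the term's probability in the whole collection. The names `dirichletProb`, `QueryTerm`, and `queryLogLikelihood` are assumptions made for this example and are not Lemur's API, and the value of `mu` is left to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative only: Dirichlet-prior smoothed probability of a term given
// a document. Not taken from Lemur's source.
//   p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
double dirichletProb(double termCount, double docLength,
                     double collectionProb, double mu) {
    assert(termCount >= 0.0 && docLength >= termCount);
    assert(collectionProb > 0.0 && mu > 0.0);
    return (termCount + mu * collectionProb) / (docLength + mu);
}

// Hypothetical per-query-term input: how often the term occurs in the
// document being scored, and its probability in the whole collection.
struct QueryTerm {
    double countInDoc;
    double collectionProb;
};

// Log-likelihood of the query under the smoothed document model.
// Terms absent from the document still get non-zero probability from the
// collection model, so the logarithm is always defined.
double queryLogLikelihood(const std::vector<QueryTerm>& query,
                          double docLength, double mu) {
    double logProb = 0.0;
    for (const QueryTerm& t : query) {
        logProb += std::log(
            dirichletProb(t.countInDoc, docLength, t.collectionProb, mu));
    }
    return logProb;
}
```

Documents would be ranked by this log-likelihood, highest first. A larger `mu` pulls every document model closer to the collection model, which matters most for short documents.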
CGI code that uses Lemur indexes and a stand-alone GUI that does retrieval
using methods included with Lemur are available as separate downloads.
For a full list of applications that come with Lemur, see the Lemur Applications page.
The Lemur toolkit download also includes a small sample data file
with test scripts that use our applications. Expected results from these scripts are available on our
website.
Lemur was written primarily in C++. (The GUI is written with Java/Swing.)
It is compatible with UNIX (Linux and Solaris) and Windows
XP. Although we don't currently support them officially, people also run
it on Cygwin, Windows 2000, and Windows NT.
The Lemur Project
Last modified: Wednesday, 14-Dec-2005 09:06:21 EST