Installing and Running Lemur (Version 3.1)


Contents

  1. Installation on Unix

  2. Installation on Windows (NT and XP)

  3. Running Applications

  4. Testing the Toolkit on Sample Data

  5. Using the Lemur API to Write Your Own Application

  6. Modifying the Toolkit Libraries



1. Installation on Unix

After downloading the Unix Lemur package, follow these steps to install it:

  1. Unpack the source

    On the command line, type in the following commands to unpack the package. This should create a directory named lemur-3.1.

    > gunzip lemur-3.1.tar.gz 
    > tar -xvf lemur-3.1.tar

  2. Configure the makefiles

    Go to directory lemur-3.1 and run the configuration script configure. This will generate a file named "MakeDefns", which holds customized definitions used by the makefiles.

    configure accepts the following arguments:

    --enable-distrib Compiles and installs the distributed retrieval components. Default is disabled.
    --enable-summarization Compiles and installs the summarization components. Default is disabled.
    --enable-cluster Compiles and installs the clustering components. Default is disabled.
    --enable-assert Enables assert statements in the code. Default is disabled.
    --prefix= Specifies the directory for the installed toolkit. Default is /usr/local.

    For example, to configure Lemur with the default libraries:

    lemur-3.1>./configure 

    Or to configure Lemur with some of the optional modules:

    lemur-3.1>./configure --enable-distrib --enable-summarization

  3. Compile Lemur

    With directory lemur-3.1 as the current working directory, type in "make". This will compile the whole Lemur toolkit and link all the Lemur applications.

    lemur-3.1> make 

  4. Install the Lemur library

    After compiling Lemur, type in "make install". This will install the Lemur library and include files in the directory specified by the prefix option of the configure script.

    For example, with C-shell,

    lemur-3.1> ./configure --prefix=/usr0/mydir-for-lemur
    lemur-3.1> make install

    will create /usr0/mydir-for-lemur/lib/liblemur.a and a number of ".hpp" (C++ header) and ".h" (C header) files in /usr0/mydir-for-lemur/include/. The application executables will all be in /usr0/mydir-for-lemur/bin.

    For users who are only interested in using Lemur as a library and application suite, the original source tree (i.e., the lemur-3.1 directory) can be removed after this step.

  5. Problems with installation

    Support for versions of gcc older than 3.0 has been dropped. Solutions to some problems with installing Lemur have been posted on the Lemur Forum.


2. Installation on Windows (NT and XP)

After downloading the source, follow these instructions to build the toolkit using Visual Studio .NET.

  1. Extracting toolkit files

    After downloading the toolkit, uncompress it. Even though it is a .tar.gz file, WinZip can open it if you simply double-click on the file. Choose a directory to extract the files into. The extraction process will create the lemur directory and all folder structures for you.

  2. Building the libraries and applications

    There are a number of .vcproj files in the lemur directories and a solution (.sln) file in the main Lemur directory. Opening the Lemur.sln file will open all projects for building Lemur. There is a separate project file for each library and for each application in Lemur.

    By default the project configurations are built in "Debug" mode. To build with fewer warnings and get faster executables, open the "Build" menu, choose "Configuration Manager", and select "Release" in the menu for "Active Solution Configuration".

Alternatively, you can download the precompiled Windows executables, which do not require Visual Studio. Simply extract the files to the directory of your choice. All applications are in the lemur-v#.#\bin directory.

3. Running Applications

Most Lemur Applications

The executables for Lemur applications are generated in the directory app/obj; they will be copied to LEMUR_INSTALL_PATH/bin after running "make install".

The usage for different applications may vary, but most applications tend to have the following general usage.

Create a parameter file with value definitions for all the input variables of an application. Terminate each line with a semicolon; the semicolon is mandatory. For example,

dataFiles = /usr3/web/sourcelist;
index = /usr3/web/myindex;
indexType = inv;
memory = 128000000;
docFormat = web;
position = 1;

In general, all file paths must be absolute paths in the form used by your operating system. Lemur does not search for files along different paths.
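
As a quick sanity check before running an application, the following sketch writes the example parameter file shown above to a scratch file and counts the lines that lack the terminating semicolon. The scratch file location is an assumption made for illustration, and the check is not part of the toolkit.

```shell
#!/bin/sh
# Scratch location for the example; hypothetical, adjust as needed.
param_file="${TMPDIR:-/tmp}/lemur_buildparam_example"

# Write the parameter file from the example above.
cat > "$param_file" <<'EOF'
dataFiles = /usr3/web/sourcelist;
index = /usr3/web/myindex;
indexType = inv;
memory = 128000000;
docFormat = web;
position = 1;
EOF

# Count non-blank lines that do not end in a semicolon.
# grep -c exits non-zero when the count is 0, hence the "|| true".
missing=$(grep -v '^[[:space:]]*$' "$param_file" \
    | grep -c -v ';[[:space:]]*$' || true)
echo "lines missing a semicolon: $missing"
```

For the file above the count is 0. Any other value points at a line that would need fixing before the file is passed to an application.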

Run the application program with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line. Most applications only recognize parameters defined in the parameter file, but there are some exceptions.

For example, if the parameter file above is named buildparam in the directory /usr3/web, then just do:

/usr3/web> BuildIndex buildparam

With newer versions of gcc, you need a shared library to run an application, so you will need to set the environment variable "LD_LIBRARY_PATH" to the directory containing that shared library. See the gcc documentation for more details.
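
A minimal sketch of setting the variable in a Bourne-style shell follows. The directory /usr/local/lib is only a placeholder; the actual location of the shared library depends on how gcc was installed on your system. C-shell users would use setenv instead.

```shell
#!/bin/sh
# Hypothetical location of the compiler's shared libraries; substitute your own.
gcc_lib_dir=/usr/local/lib

# Prepend the directory. The ${VAR:+...} form avoids a stray trailing colon
# when LD_LIBRARY_PATH was previously unset or empty.
LD_LIBRARY_PATH="$gcc_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

echo "$LD_LIBRARY_PATH"
```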

Most applications will display a usage message or a list of required input variables if you run them with the "--help" option. For details about how to use each of the applications in the Lemur toolkit, see Lemur Modules and Applications.


Indri Applications

Some applications that make use of the IndriIndex use a different parameter file format. These are the applications whose names are prefixed with "Indri", i.e., IndriBuildIndex and IndriRunQuery.

These applications accept parameter files in XML format. The top-level element in the parameter file is named parameters. Parameters may also be specified on the command line using dotted path notation.

For example, to specify the index parameter for the index name, a parameter file would contain:

<parameters>
<index> /usr3/web/myindex </index>
</parameters>

Or on the command line, specify -index=/usr3/web/myindex
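
To illustrate how a single-level parameter file maps onto the command-line form, the sketch below writes the XML example to a scratch file and derives the equivalent -name=value arguments with sed. The scratch path is hypothetical, and the sed expression only handles one element per line directly under parameters. It is an illustration of the correspondence, not how the Indri applications parse their input.

```shell
#!/bin/sh
# Scratch location for the example; hypothetical, adjust as needed.
xml_file="${TMPDIR:-/tmp}/lemur_indri_param_example.xml"

# Write the XML parameter file from the example above.
cat > "$xml_file" <<'EOF'
<parameters>
<index> /usr3/web/myindex </index>
</parameters>
EOF

# Turn each "<name> value </name>" line into "-name=value".
# Lines without an open/close pair on the same line (such as <parameters>)
# are skipped because of sed -n.
args=$(sed -n 's|^<\([A-Za-z]*\)> *\(.*[^ ]\) *</\1>$|-\1=\2|p' "$xml_file")
echo "$args"
```

For the file above this prints -index=/usr3/web/myindex, matching the command-line form given in the text.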

For more information about the specific applications and their parameters, please see Lemur Modules and Applications.

4. Testing the Toolkit on Sample Data

The Lemur Toolkit comes with a sample data directory which includes a small public information retrieval testing collection (i.e., the CACM collection available from the Cornell ftp site ftp://ftp.cs.cornell.edu/pub/smart/). The sample data lets you try the toolkit easily and helps you understand its capabilities and how to use them.

The directory has several test scripts, including test_indri_index.sh, test_pos_index.sh, test_key_index.sh, and test_struct_query.sh. The test index scripts use the specified indexes and demonstrate most of the functionality of Lemur, from formatting a database and building an index to running various kinds of retrieval experiments. clean.sh cleans up any files generated by any of the testing scripts. For more information about the indexes and how they differ, please see the indexing guide.

Your output should not be too different from the output contained in the sample output files listed here. Roundoff error should only lead to minor deviations from these results.

The scripts start from a source database file and a query file in a simple SGML format, build an index of the database along with a support file that some retrieval algorithms need in order to run fast, and then run different retrieval experiments with different parameter files. The retrieval results are evaluated with a Perl script (ireval.pl) in the app/src directory. A precision-recall summary file is generated for each experiment.

You can change some of the settings in the parameter files and see how they affect retrieval performance.

Windows users might be able to run these scripts under Cygwin. Even if they cannot run the scripts directly, they should still be able to repeat the commands in these shell scripts manually or in some other automated way.


5. Using the Lemur API to Write Your Own Application

To use the Lemur API on Unix, you will need both the Lemur library file and all the header files. The Lemur installation generally puts the library file in LEMUR_INSTALL_PATH/lib/liblemur.a and all the header files in LEMUR_INSTALL_PATH/include/. C header files have the extension .h, while C++ header files have the extension .hpp. You use the Lemur library in exactly the same way as you would use any other C++ library.

An application-level Makefile that you can use for your own applications has been included. To use it:

  1. Copy Makefile.app from the top level lemur directory to the directory with your application's source code. Edit the file and fill in values for the following:

    OBJS -- the list of object files needed to build your application.
    PROG -- the name of your application.

  2. Use make -f Makefile.app to build your application.

6. Modifying the Toolkit Libraries

  1. The directory structure and makefiles

    The toolkit has two types of directories: (1) module directories, also called "code" directories (e.g., index, retrieval, and app), and (2) others (e.g., data). All code lives in the module directories. The data directory stores a sample testing collection and testing scripts.

    Each module directory has four subdirectories: include, src, depend, and obj. "include" has all the header files, while "src" has all the implementation files. "depend" stores a depend file for each source file. "obj" holds the compiled object code. obj and depend are created automatically by the makefiles.

    There are two types of modules: library modules and application modules. A library module is compiled and linked as a module library, whereas an application module contains a number of application programs, each with its own main() function, so all the applications must be linked individually. For example, "utility", "index", "langmod", and "retrieval" are library modules; only "app" is an application module.

  2. To modify an existing file or add a file to an existing library module:

    1. Make the changes
    2. Go to the Lemur root directory
    3. Type in "make" to update the whole toolkit.

  3. To add a new (library) module to the toolkit:

    1. Create the module subdirectory in the lemur root directory.
    2. Put all include files in a subdirectory named "include" under the new module directory
    3. Put all implementation files in a subdirectory named "src" under the new module directory.
    4. Add the module directory name to the Makefile variable LIBDIRS. Note: If you rerun configure, you will have to make this change again. Advanced users should edit configure.ac and add an AC_ARG_ENABLE for the new module (see the distrib entry in configure.ac) and then use autoconf to generate a new configure script.
    5. Copy a Makefile from an existing module directory (e.g., index/src/Makefile) to the src subdirectory of the new module directory, and change the variable MODULE to the name of the new module.
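
The directory setup in the steps above can be sketched as a short shell script. The module name mymodule and the default LEMUR_ROOT scratch directory are hypothetical, and the sed edit assumes the MODULE variable is assigned on a line of the form "MODULE = name", which may not match the actual Makefile. Check the copied file by hand.

```shell
#!/bin/sh
# LEMUR_ROOT and the module name are hypothetical; point LEMUR_ROOT at your
# lemur root directory. The default is a scratch directory for trying this out.
LEMUR_ROOT="${LEMUR_ROOT:-${TMPDIR:-/tmp}/lemur-3.1-sketch}"
module=mymodule

# Steps 1-3: create the module directory with its include and src
# subdirectories (header and implementation files go there).
mkdir -p "$LEMUR_ROOT/$module/include" "$LEMUR_ROOT/$module/src"

# Step 5: reuse an existing module Makefile if the source tree is present,
# rewriting the MODULE variable (assumed form: "MODULE = name").
template="$LEMUR_ROOT/index/src/Makefile"
if [ -f "$template" ]; then
    sed "s/^MODULE[[:space:]]*=.*/MODULE = $module/" "$template" \
        > "$LEMUR_ROOT/$module/src/Makefile"
else
    echo "no template Makefile at $template; copy one by hand" >&2
fi

# Step 4 (adding the module name to LIBDIRS) is left as a manual edit.
ls "$LEMUR_ROOT/$module"
```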

The Lemur Project
Last modified: Tue Nov 2 14:22:01 EST 2004