
 
CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Installing & Compiling: Creating a Site-Search for Your Website


Contents

  1. Gathering your Web Pages
  2. Create a Parameter File
  3. Build the Index
  4. Install and Configure the Lemur CGI

Gathering your web pages

You can use any method of crawling to gather your web pages, but they should be mirrored somewhere locally for indexing and retrieval.
Some publicly available web spidering tools include:

Name / Homepage Description
Heritrix Crawler
http://crawler.archive.org/
Heritrix is the crawler used by the Internet Archive (www.archive.org). It is a Java-based, highly extensible system that can be configured to handle just about any crawling situation.
WebSPHINX
http://www.cs.cmu.edu/~rcm/websphinx/
WebSPHINX is a Java-based development environment for creating web crawlers.
wget
http://www.gnu.org/software/wget/
GNU WGet is typically bundled with many Linux distributions. Although not quite as flexible as many other software packages, WGet can still spider a small site with no problems. To spider a site into the current directory while preserving its structure, you can issue the command:
 
wget -r -w1 -Q1000m -v -t3 -nH -np -l10 -Dwww.yoursite.com http://www.yoursite.com/
 
Here, www.yoursite.com is the domain and URL of your site. The other command-line parameters tell wget to: recurse into directories (-r); wait 1 second between pages (-w1); download at most 1,000 MB of data in total (-Q1000m); print verbose output (-v); try each page at most 3 times (-t3); omit the host name from the local directory structure (-nH); not ascend to the parent directory (-np); descend at most 10 levels (-l10); and finally stay within the "www.yoursite.com" domain (-Dwww.yoursite.com).
 
For more features of wget, see the wget homepage.
 
A GUI for launching WGet on Windows-based platforms is also available from http://www.jensroesner.de/wgetgui/index.php
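
The flag-by-flag description of the wget command can also be written as a small script. The following is a minimal Python sketch and is not part of the Lemur Toolkit; the function name build_wget_command and its parameter names are hypothetical, chosen here only for illustration. It assembles the argument list and leaves running the crawl to the caller.

```python
def build_wget_command(host, wait_seconds=1, quota_mb=1000, tries=3, max_depth=10):
    """Assemble the wget argument list described in the tutorial text.

    `host` is the site's host name, e.g. "www.yoursite.com".
    The function and parameter names are illustrative, not part of wget or Lemur.
    """
    return [
        "wget",
        "-r",                 # recurse into directories
        f"-w{wait_seconds}",  # seconds to wait between pages
        f"-Q{quota_mb}m",     # total download quota in megabytes
        "-v",                 # verbose output
        f"-t{tries}",         # maximum attempts per page
        "-nH",                # no host name in the local directory tree
        "-np",                # do not ascend to the parent directory
        f"-l{max_depth}",     # maximum recursion depth
        f"-D{host}",          # stay within this domain
        f"http://{host}/",    # start URL
    ]


# Show the command as a single shell-style line.
print(" ".join(build_wget_command("www.yoursite.com")))
```

The returned list can be handed to subprocess.run from the directory where the mirror should be created.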
 
Note: The Lemur Project is neither affiliated with nor endorses any of the above software packages, but the Lemur Toolkit source package does come bundled with a customized version of Heritrix.

A word of caution: when spidering web sites, follow proper web spider etiquette.

Create a Parameter File

After gathering the web pages you wish to index, you need to create a parameter file for the indexer. A basic parameter file looks like the following:

<parameters>
  <corpus>
    <path>/path/to/mirrored/documents/</path>
  </corpus>
  <memory>256m</memory>
  <index>/path/to/index</index>
</parameters>

The parameter file tells the indexer where your corpus of mirrored documents is located, how much memory to use, and where to place the index. For more information, see the section on creating a simple index.
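
If the parameter file needs to be generated rather than written by hand, something like the following Python sketch may help. It reproduces only the three settings shown in the basic parameter file; the function name build_parameter_xml and its arguments are hypothetical, and the paths are the same placeholders used in the example.

```python
import xml.etree.ElementTree as ET


def build_parameter_xml(corpus_path, index_path, memory="256m"):
    """Return the text of a basic indexer parameter file.

    Only the elements shown in the tutorial's example are emitted.
    """
    root = ET.Element("parameters")
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "index").text = index_path
    ET.indent(root, space="  ")  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode") + "\n"


# Print a parameter file using the placeholder paths from the tutorial.
print(build_parameter_xml("/path/to/mirrored/documents/", "/path/to/index"), end="")
```

Writing the returned string to a file gives the [parameter_file] passed to the indexer.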

Build the Index

This is the easy part. From a command line, run "IndriBuildIndex [parameter_file]", where [parameter_file] is the parameter file you created in the previous step. Let the indexer run. If no errors are encountered, you can move on to the next step.
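
When indexing is part of a scripted rebuild, the same command can be wrapped in a short Python sketch such as the one below. The function name build_index is hypothetical, and the sketch assumes that IndriBuildIndex is on the PATH and that a non-zero exit status indicates an indexing error; neither assumption is stated in this tutorial.

```python
import os
import shutil
import subprocess


def build_index(parameter_file, executable="IndriBuildIndex"):
    """Run the indexer on `parameter_file`; return True if it exits with status 0."""
    if not os.path.isfile(parameter_file):
        raise FileNotFoundError(f"parameter file not found: {parameter_file}")
    program = shutil.which(executable)
    if program is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    # Assumption: a non-zero exit status signals that indexing failed.
    completed = subprocess.run([program, parameter_file])
    return completed.returncode == 0
```

A caller would move on to the LemurCGI configuration only when build_index returns True.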

Install and Configure the Lemur CGI

To install the LemurCGI, see the Installing and Compiling the LemurCGI page. Once the LemurCGI is installed and in place, you can edit the LemurCGI configuration to point to your newly created index. That's it!

 


[Previous: LemurCGI and the Lemur GUI] [Back to TOC]  

 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am