Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Installing & Compiling: Creating a Site-Search for Your Website
Contents
- Gathering your Web Pages
- Create a Parameter File
- Build the Index
- Install and Configure the Lemur CGI
Gathering your Web Pages
You can use any method of crawling to gather your web pages, but they should be mirrored somewhere locally for indexing and retrieval.
Some publicly available web spidering tools include:
| Name / Homepage | Description |
|---|---|
| Heritrix Crawler http://crawler.archive.org/ | Heritrix is the crawler that the Internet Archive (www.archive.org) uses. It is a Java-based system that is extremely extensible and can be configured to handle just about any crawling situation. |
| WebSPHINX http://www.cs.cmu.edu/~rcm/websphinx/ | WebSPHINX is a Java-based development environment for creating web crawlers. |
| wget http://www.gnu.org/software/wget/ | GNU Wget is typically bundled with many Linux distributions. Although not quite as flexible as many other software packages, Wget can still spider a small site with no problems. |

To spider a site with wget into the current directory, preserving the site's structure, you can issue the command:

wget -r -w1 -Q1000m -v -t3 -nH -np -l10 -Dwww.yoursite.com http://www.yoursite.com/

where www.yoursite.com is the domain of your site and http://www.yoursite.com/ is its URL. The other command line parameters tell wget to:
- recurse into directories (-r);
- wait 1 second between pages (-w1);
- allow a maximum of 1,000 MB of data to be downloaded in total (-Q1000m);
- report progress verbosely (-v);
- try each page at most 3 times (-t3);
- leave the host name out of the local directory structure (-nH);
- never ascend to the parent directory (-np);
- descend at most 10 levels (-l10);
- stay within the "www.yoursite.com" domain (-Dwww.yoursite.com).

For more features of wget, see the wget homepage. A GUI for launching Wget on Windows-based platforms is also available from http://www.jensroesner.de/wgetgui/index.php
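
The wget invocation above can also be assembled programmatically, which makes each option explicit and easy to change. The following is a minimal sketch and is not part of the Lemur Toolkit; the function name `build_wget_command` and its parameter names are hypothetical, and only the wget flags themselves come from the command shown above.

```python
import shlex


def build_wget_command(domain, wait_seconds=1, quota_mb=1000, tries=3, max_depth=10):
    """Return the wget argument list for mirroring one site (hypothetical helper)."""
    return [
        "wget",
        "-r",                  # recurse into directories
        f"-w{wait_seconds}",   # seconds to wait between page requests
        f"-Q{quota_mb}m",      # cap on total data downloaded, in MB
        "-v",                  # verbose output
        f"-t{tries}",          # maximum attempts per page
        "-nH",                 # no host name in the local directory structure
        "-np",                 # never ascend to the parent directory
        f"-l{max_depth}",      # maximum recursion depth
        f"-D{domain}",         # stay within this domain
        f"http://{domain}/",   # start URL
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(build_wget_command("www.yoursite.com")))
```

With the default arguments, the list reproduces the command given above. To actually run it, the list could be passed to `subprocess.run()`.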
Note: The Lemur Project is not affiliated with, and does not endorse, any of the above software packages, but the Lemur Toolkit source package does come bundled with a customized version of Heritrix.
A word of caution: when spidering web sites, proper web spider etiquette should be followed. A simple list of rules includes:
- Follow the Robots Exclusion Protocol.
- Do not overwhelm the web server. Leave a delay between page requests; rapid-fire requests to a web server may cause a denial of service.
- Watch out for infinite loops - the crawler should recognize when it loops back upon itself.
- The crawler should always identify itself and a point of contact in case there are any problems.
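
For readers writing their own spider, the rules above can be sketched in a few lines using Python's standard `urllib.robotparser` module. This is an illustrative sketch only, not how Heritrix or wget implement these rules; the class name `PoliteFetchPolicy`, the user-agent string, and the contact address are all hypothetical. The class only decides whether and when a URL may be fetched; it performs no network requests itself.

```python
import time
import urllib.robotparser

# Hypothetical crawler identity; it names the crawler and a point of contact.
USER_AGENT = "ExampleSiteSearchBot/0.1 (+mailto:webmaster@example.com)"


class PoliteFetchPolicy:
    """Decides whether and when a URL may be fetched (illustrative only)."""

    def __init__(self, robots_lines, delay_seconds=1.0):
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_lines)  # lines of the site's robots.txt
        self.delay_seconds = delay_seconds
        self.visited = set()             # URLs already accepted; guards against loops
        self._last_request = None

    def should_fetch(self, url):
        """Return True if robots.txt allows the URL and it was not seen before."""
        if url in self.visited:
            return False
        if not self.robots.can_fetch(USER_AGENT, url):
            return False
        self.visited.add(url)
        return True

    def wait_turn(self):
        """Sleep so that successive requests are at least delay_seconds apart."""
        if self._last_request is not None:
            remaining = self.delay_seconds - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

A crawler loop would call `should_fetch(url)` before each request and `wait_turn()` immediately before sending it, and would send `USER_AGENT` as the HTTP User-Agent header so that site operators can identify the crawler.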
Create a Parameter File
After gathering the web pages you wish to index, you need to create a parameter file for the indexer. A basic parameter file looks like the following:
<parameters>
  <corpus>
    <path>/path/to/mirrored/documents/</path>
  </corpus>
  <memory>256m</memory>
  <index>/path/to/index</index>
</parameters>
Basically, the parameter file tells the indexer where your corpus of mirrored documents is located, the amount of memory to use, and where to place the index. For more information see the creating a simple index section.
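
If you maintain several mirrors, the parameter file can be generated rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree` module to produce the same three settings shown above; the function name `build_parameters` is hypothetical, and the paths are the placeholders from the example, not real locations.

```python
import xml.etree.ElementTree as ET


def build_parameters(corpus_path, index_path, memory="256m"):
    """Return the parameter file contents as an XML string (hypothetical helper)."""
    root = ET.Element("parameters")
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path  # where the mirrored pages live
    ET.SubElement(root, "memory").text = memory       # memory the indexer may use
    ET.SubElement(root, "index").text = index_path    # where the index is written
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder paths from the example above; replace with your own.
    print(build_parameters("/path/to/mirrored/documents/", "/path/to/index"))
```

The output is a single line of XML with the same elements as the hand-written file; save it to a file and pass that file to the indexer.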
Build the Index
This is the easy part. From a command line, run "IndriBuildIndex [parameter_file]", where [parameter_file] is the parameter file you created in the last step. Let the indexer run. If no errors are encountered, you can move on to the next step.
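
If the index is rebuilt regularly (for example, after each crawl), the command above can be wrapped in a small script. This is a sketch under two assumptions that this tutorial does not confirm: that IndriBuildIndex is on your PATH, and that it signals failure through a nonzero exit status. The function name `run_indexer` is hypothetical.

```python
import shutil
import subprocess


def run_indexer(parameter_file, executable="IndriBuildIndex"):
    """Run the indexer on a parameter file.

    Returns the process exit code, or None if the executable is not on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    completed = subprocess.run([path, parameter_file])
    return completed.returncode


if __name__ == "__main__":
    code = run_indexer("build_index.xml")  # hypothetical parameter file name
    if code is None:
        print("IndriBuildIndex was not found on PATH")
    else:
        print(f"IndriBuildIndex exited with status {code}")
```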
Install and Configure the Lemur CGI
To install the LemurCGI, see the Installing and Compiling the LemurCGI page. Once the LemurCGI is installed and in place, you can edit the LemurCGI configuration to point to your newly created index. That's it!
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am




