The Lemur CGI Application
The Lemur CGI is a CGI executable that runs under an HTTP server (web server) and provides access to indexes along with general search capabilities.
Compilation and Installation
Beginning with version 4.3 of the Lemur Toolkit, the Lemur CGI is included as part of the site search package, and is built and installed by default on Unix-like systems (Linux, Solaris, OS X).
The CGI files will be installed in ${prefix}/share/cgi. Copy the contents of this folder to a location accessible via your webserver. Be sure that your webserver configuration allows executables to be run. Consult your webserver documentation or system administrator if you are uncertain how to ensure this.
Using the sources, you can also build the CGI from Microsoft Visual Studio:
1. Install the lemur source code when running the lemur installer (choose custom).
2. Open the Lemur.sln solution in the source directory.
3. Select the LemurCGI project.
4. Select either Debug or Release mode.
5. Right click the LemurCGI project and choose build.
Copy the created executable, LemurCGI.exe, along with the entire contents of the C:\Program Files\Lemur\Lemur 4.4\src\lemur-4.4\site-search\cgi\bin folder to a location accessible via your webserver. Be sure that your webserver configuration allows executables to be run. Consult your webserver documentation or system administrator if you are uncertain how to ensure this.
Before the initial execution, edit the "lemur.config" file (which should stay in the same directory as lemur.cgi) to reflect your configuration.
Configuration Elements
The configuration file is a well-formed XML file with the opening tag <lemurconfig>. There are two required elements within the configuration file:
- templatepath: this should reflect the path (either relative or absolute) to the template files.
- indexes: this section describes which indexes are available, and can contain as many indexes as needed. Each <index> item takes two elements: a required <path> element giving the location of the index, and an optional <description> element describing that index. The path should be the full path to the index constructed by the crawl-index script.
There are also some optional elements:
- rootpaths: this element lists path prefixes that the CGI strips from the URL of a search result whenever the original path starts with one of them. This is most useful for enabling a site-search capability where there are locally mirrored versions of the indexed web pages. For example, if the local cache of your website is at "/var/cache/mirrored_site/" and the LemurCGI is not set to strip paths, every displayed URL would begin with the prefix "/var/cache/mirrored_site/". This option is not necessary for indexes built with the crawl-index script.
- supportanchortext: if this element is set to true, the CGI includes support for retrieving inlinks, provided you have used the harvestlinks program to gather them from your corpus. This is the default setting for indexes built with crawl-index.
- querylog: this tells the CGI to log every query it receives. In the sample configuration, the element's value is the path of the log file.
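To make the effect of rootpaths concrete, here is a minimal Python sketch of prefix stripping as described above. It is only an illustration of the behaviour, not the CGI's actual code; the function name `strip_root` and the example document path are hypothetical.

```python
def strip_root(result_path, root_paths):
    """Remove the first matching root prefix from a result path.

    Illustrative only: this mimics the documented effect of <rootpaths>,
    it is not taken from the Lemur CGI sources.
    """
    for root in root_paths:
        if result_path.startswith(root):
            # Drop the local mirror prefix, keep the rest of the path.
            return result_path[len(root):]
    # No configured prefix matched: leave the path untouched.
    return result_path


roots = ["/var/cache/mirrored_site/"]
stripped = strip_root("/var/cache/mirrored_site/docs/index.html", roots)
untouched = strip_root("/srv/other/docs/index.html", roots)
```

With the prefix from the example above, the mirrored path is reduced to its site-relative part, while a path outside the mirror is returned unchanged.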
A sample configuration containing the above elements might look like:
<lemurconfig>
  <templatepath>./templates/</templatepath>
  <rootpaths strippath="true">
    <path>/home/lemur/data/</path>
  </rootpaths>
  <supportanchortext>true</supportanchortext>
  <querylog>./logging/lemurlog.txt</querylog>
  <indexes>
    <index>
      <path>/home/lemur/indexes/sampleIndex</path>
      <description>Sample Lemur Index</description>
    </index>
    <index>
      <path>/home/lemur/indexes/sampleIndex_2</path>
      <description>A Second Sample Lemur Index</description>
    </index>
  </indexes>
</lemurconfig>
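Before deploying, it can help to check that a configuration contains the two required elements. The following Python sketch does this with the standard library's XML parser. The helper name `check_config` and the inline sample are assumptions made for illustration; the Lemur distribution does not ship such a checker as far as this page states.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed from the configuration shown above.
SAMPLE_CONFIG = """<lemurconfig>
  <templatepath>./templates/</templatepath>
  <indexes>
    <index>
      <path>/home/lemur/indexes/sampleIndex</path>
      <description>Sample Lemur Index</description>
    </index>
    <index>
      <path>/home/lemur/indexes/sampleIndex_2</path>
    </index>
  </indexes>
</lemurconfig>"""


def check_config(xml_text):
    """Return (templatepath, [(index_path, description), ...]).

    Raises ValueError if a required element is missing.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "lemurconfig":
        raise ValueError("root element must be <lemurconfig>")

    template = root.findtext("templatepath")
    if not template:
        raise ValueError("missing required <templatepath>")

    indexes_el = root.find("indexes")
    if indexes_el is None:
        raise ValueError("missing required <indexes>")

    indexes = []
    for index in indexes_el.findall("index"):
        path = index.findtext("path")
        if not path:
            raise ValueError("<index> without <path>")
        # <description> is optional, so fall back to an empty string.
        description = index.findtext("description") or ""
        indexes.append((path.strip(), description.strip()))
    return template.strip(), indexes


template_path, index_list = check_config(SAMPLE_CONFIG)
```

To check a real file, read its contents and pass the text to `check_config`; the optional elements (rootpaths, supportanchortext, querylog) are deliberately ignored here.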
Edit the file help-db.html to describe the contents of the text database(s) being searched by the Lemur search engine. You can describe the documents in whatever way you feel is most helpful to your users.
If you wish to use the default HTML templates, no modifications are necessary, but if you want to modify the HTML templates for your own uses, be sure to read the "README_Templates.txt" file for instructions on available commands that you can use within the templates.
Using the LemurCGI program
The LemurCGI has several classes of functions that allow interactive access to an index. To see the list of functions and a description of what they do, open "http://[your_path]/lemur.cgi?h=?" in your web browser, where [your_path] is the path (via http) to your lemur.cgi installation.
Programmatically, you can invoke the CGI interface by calling "lemur.cgi?name=value&name=value&...". The name/value pairs are processed in order, from left to right.
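As a rough illustration of such a call, the Python sketch below assembles a request URL from an ordered list of name/value pairs. The host and CGI path are placeholders, and the parameter names (datasource, maxresults, query) are taken from the parameter table on this page. Because pairs are processed left to right, the sketch places the settings before the query; that ordering is an inference from the processing rule, not something this page states outright.

```python
from urllib.parse import urlencode


def build_cgi_url(base_url, params):
    """Build a lemur.cgi URL from (name, value) pairs.

    A list of pairs is used instead of a dict to make the
    left-to-right order of the parameters explicit.
    """
    return base_url + "?" + urlencode(params)


# Placeholder installation URL; replace with your own.
BASE = "http://localhost/cgi-bin/lemur.cgi"

url = build_cgi_url(
    BASE,
    [
        ("datasource", 0),                  # choose the index first
        ("maxresults", 10),                 # then the result count
        ("query", "information retrieval"), # finally run the query
    ],
)
```

`urlencode` also takes care of escaping spaces and other reserved characters in the query string.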
A list of the various parameters is as follows:
| Parameter | Function | Description |
|---|---|---|
| termstats=<term> | prints corpus statistics for term | This command will return the total number of times the term is used in the corpus and the total number of documents that the term occurs in. |
| datasource=n | sets the database to the n'th database (index) | This command sets the database to use for the current call |
| listdatasources | lists the available databases (indexes) | This will return a listing of the valid index IDs and the descriptions |
| datasourcestats=n | displays the statistics for the database ID | This will display statistics for the given database such as the number of documents, the number of words, the number of unique terms, and the average document length |
| getdocext=<string> | fetches the document with external id <string> | This will return the unparsed document from the external ID given |
| setoutput=debug | sets the CGI interface to Diagnostic mode | This causes all output to be in plaintext. |
| setoutput=interactive | sets the CGI interface to Interactive mode | This is the default mode which allows interactivity |
| setoutput=program | sets the CGI interface to Program mode | This causes all output to stream back without regard to formatting |
| help | prints the help message | |
| getdoc=<integer> | fetches the document with internal id <integer> | Returns the unparsed document to the user |
| getparseddoc=<integer> | fetches the parsed form of the document with internal id <integer> | Returns the parsed document (generally a bag of words) to the user |
| getterm=<string> | shows the lexicalized (stopped and stemmed) form of <string> | |
| maxresults=x | sets the number of documents to retrieve to x | |
| query=<string> | uses the query <string> to search the database | The query can use the Indri or InQuery command language (see parameter "querytype" below to set the query language type) |
| start=n | starts the query results at rank n | |
| querytype=<query_type> | sets the query type to use. | Can be one of "indri" (default) or "inquery" |
| invlist=<term> | returns the inverted list for term | Use term.field for a field-specific list |
| invposlist=<term> | returns the inverted list for term with positions | Much like invlist, but also returns the positions at which the term occurs within each document. Use term.field for a field-specific list |
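The start and maxresults parameters can be combined to page through a result list. The sketch below generates one URL per page. The function name `page_urls` and the base URL are hypothetical, and the table does not say whether ranks begin at 0 or 1, so the first rank is left as an argument (defaulting to 0 purely as an assumption).

```python
from urllib.parse import urlencode


def page_urls(base_url, query, page_size, pages, first_rank=0, datasource=0):
    """Return one lemur.cgi URL per result page.

    first_rank is an assumption: set it to whatever rank your
    installation treats as the first result.
    """
    urls = []
    for page in range(pages):
        params = [
            ("datasource", datasource),
            ("maxresults", page_size),
            ("start", first_rank + page * page_size),
            ("query", query),  # settings come before the query
        ]
        urls.append(base_url + "?" + urlencode(params))
    return urls


pages = page_urls("http://localhost/lemur.cgi", "lemur", page_size=10, pages=2)
```

Each URL can then be fetched in turn, for example with setoutput=program added to the parameter list when the results are consumed by another program.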
