Indri Retrieval Model
From LemurWiki
Taken from Don Metzler's original overview at: http://ciir.cs.umass.edu/~metzler/indriretmodel.html
Contents |
[edit] Introduction
This document is meant to give a broad, yet detailed, overview of the retrieval model that Indri implements. The model is based on a combination of the language modeling [1] and inference network[2] retrieval frameworks. Both frameworks, on their own, have been widely studied, applied, and found to be very effective for a wide range of retrieval tasks. Therefore, it is only natural to combine the benefits of these two frameworks into a unified model. The combined model was first introduced in [3], but has been enhanced and expanded since then to handle a wide range of applications, such as the TREC Terabyte Track[4]. For a less technical overview of Indri, please refer to [5].
Here are some of the features of the model:
- Easily handles phrases (ordered and unordered)
- Can make use of multiple document representations
- Explicit term weighting
- Robust query language
- Formally well-grounded
- Highly effective
- Can be efficiently implemented
The following figure is a graphical model depiction a typical instantiation of the model. The graph represents an inference network (also known as a Bayesian network), which is a succinct way of defining a joint probaility distribution over a collection of random variables. Each node in the graph represents a random variable, with the shaded nodes being 'observed' and the unshaded nodes being 'hidden'. The edges in the graph define a set of independence assumptions over the random variables. There are many other interesting properties of inference networks that are out of the scope of this overview.
The network consists of the following types of nodes:
- Document node (D)
- Smoothing parameter nodes (alpha, beta)
- Model nodes (M)
- Representation concept nodes (r)
- Belief nodes (q)
- Information need node (I)
We first briefly describe each of the node types and it's meaning within the network and then detail how the model is used to rank documents.
[edit] Document node
The document node is a random variable over document representations. A document may be represented in may different ways. However, following the work laid out in [6] (Model B), we represent documents as multisets of binary vectors, where each entry in the binary vector represents the presense or absence of some feature of the text. One such vector is 'extracted' for every term position within the document. This treats the document as a sample from a Multiple-Bernoulli distribution.
An easier way to think of this, is to consider running some indicator function over the sequence of text contained in the document for each feature. Whenever the feature is present, the indicator emits a 1, otherwise it emits a 0. This is done for every position within the document and for every feature under consideration. This results in a binary vector, with an entry for every feature, at every position within the document. Features can be ANY binary, extractable concept, such as the occurrence of a term, phrase, named entity, etc.
The following illustrates a simple example that considers 1- and 2-gram features. There are a total of 12 features, with 5 feature vectors extracted, one for each text position.
Document: A B C A B
assume feature order = [ A B C AA AB AC BA BB BC CA CB CC ]
D = { [ 1 0 0 0 1 0 0 0 0 0 0 0 ],
[ 0 1 0 0 0 0 0 0 1 0 0 0 ],
[ 0 0 1 0 0 0 0 0 0 1 0 0 ],
[ 1 0 0 0 1 0 0 0 0 0 0 0 ],
[ 0 0 1 0 0 0 0 0 0 0 0 0 ] }
[edit] Smoothing parameter nodes
These nodes represent model hyperparameters, which correspond to smoothing parameters in our framework. There is one set of smoothing parameter nodes for each model node. More discussion about smoothing is given below in the estimation and smoothing sections.
[edit] Model nodes
The model nodes represent so-called feature language models. In our framework, they are smoothed multiple-Bernoulli distributions that model document representations. There may be more than one model node in the network. In the example above, there are three model nodes, corresponding to different representations of the same document. For example, the title model corresponds to the pseudo-document made up of all of the title text from the original document, whereas the body model corresponds to all of the text within the main body of the original document. This allows models to be estimated across various document representations and evidence from the different representations to be combined.
[edit] Representation concept nodes
The representation concept nodes are binary random variables that are directly related to the features extracted in the document representation. Examples events associated with these nodes are:
- "the term dog occurred"
- "the term Mike appeared within a title"
- "the exact phrase 'white house' occurred"
See the query language overview for a complete list of the features Indri handles.
The likelihood of an event occurring depends entirely on the parent model node. The same representation concept node may appear multiple times in the network, each with a different parent model node. This allows us to distinguish between the events "the term dog occurred in the title document representation" and "the term dog occurred in the main body document representation".
[edit] Belief nodes
Belief nodes are binary random variables that are used to combine beliefs (probabilities) within the network. Each belief node corresponds to a different conditional probability table, allowing beliefs to be combined in a number of ways. The belief nodes may be used to combine beliefs about representation concept nodes or other belief nodes (as long as the graph remains acyclic). For example, a belief node could be used to combine our beliefs that the events "the term dog occurred" and "the term leash occurred". Weighted versions of some belief operators are available, as well, allowing explicit weighting of beliefs.
The belief nodes are dynamically added to the network based on a structured Indri query. Therefore, the network topology / structure changes for every query. This makes the framework very powerful, as it allows a wide range of scoring functions, all based on different query formulations.
See the combining beliefs section below for more information on how the beliefs are computed and the query language overview for a full list of belief operators supported in Indri.
[edit] Information need node
Finally, the information need node is simply a belief node that combines all of the evidence within the network into a single belief. This belief is used as the basis for ranking documents. That is, documents are ranked according to P(I = 1 | D, alpha, beta).
[edit] Summary of Indri retrieval process:
- Given a structured query, construct inference network.
- For each document in the collection, compute P( I = 1 | D )
- Rank documents according to P( I = 1 | D )
[edit] Example Networks
It is often easier to understand the inference network framework by looking at a few example networks produced from different types of query formulations. In this section we give three example structured Indri queries and the inference network that corresponds to each.
Query:
#combine( #1( abraham lincoln ) gettysburg )
Inference network:
Query:
#weight( 2.0 #or( #1( north korea ) iraq ) 1.0 policy )
Inference network:
Query:
#combine( #uw8( hurricane wind ).(title) damage )
Inference network:
[edit] Estimation
This section is highly technical and only needs to be read if you wish to understand how representation node beliefs, P(r | D), are computed within the network.
First, me must compute P( M | D ) (for notational convenience we drop the dependence on alpha and beta), which is the model posterior distribution. This is computed as:
where P( D | M ) is the likelihood of generating D from model M and P( M ) is the model prior. In our framework, we follow [6] and assume that P( . | M ) is distributed according to multiple-Bernoulli( theta_r ) and P( M ) is distributed according to multiple-Beta( alpha_r, beta_r ). Since the Beta distribution is the Bernoulli's conjugate prior, it directly follows that P( M | D ) is also a multiple-Beta. P( D | M ) is thus distributed according to multiple-Beta( alpha_r + tf_r, beta_b + |D| - tf_r )
To get an estimate of P( r | D ) we must marginalize out our model node. This can be expressed as:
This expression is simply the expectation over the posterior P( M | D ). The expectation of a Beta distribution, given in terms of it'ss parameters, is alpha / ( alpha + beta ). Therefore, given that P( D | M ) is distributed according to multiple-Beta( alpha_r + tf_r, beta_b + |D| - tf_r ), it is easy to see that computing the expectation gives us:
As we see, our representation concept beliefs depend on alpha_r and beta_r. Although these values can be set in many ways, we choose to select values for them in a manner similar to previous work in the area. One popular way to set these smoothing parameters in information retrieval is to assume that the expected value of the model prior, P( M ), is equal to P( r | C ), the collection model. This leads us to seek alpha_r and beta_r values that satisfy alpha_r / ( alpha_r + beta_r ) = P( r | C ). We choose to use the following values:
where mu is free parameter. This ultimately leads to the following form for our term representation beliefs:
This belief, denoted in the next section by b, can then be combined with the beliefs of other representation concept nodes or belief nodes. This is discussed in the next section.
[edit] Combining Beliefs
The belief nodes in the network, dynamically generated based on a query, are used to combine beliefs from representation concept nodes and other belief nodes. This allows arbitrarily complex scoring functions to be constructed via complex structured query formulations.
The belief nodes have special conditional probability tables that allow efficient marginalization. This allows the beliefs within the network to be propagated very efficiently. The query language overview contains a full list of belief operators supported, but the following, especially #combine and #weight, are the most useful and important. Care should be taken when using the #wsum operator, since it behaves very differently using language modeling estimates instead of the tf.idf estimates that were used in InQuery.
The following formulas show how the beliefs are computed at belief nodes corresponding to various query operators. This assumes that there are n parent nodes, each with belief b_i and, for weighted operators, weight w_i. The #not operator only handles a single, unweighted parent node. Given a structured query, these formulas, and the representation concept estimates from above, it is easy to construct the exact ranking function used.
[edit] Smoothing
Smoothing is a method used to overcome both the 'zero probability' and data sparseness problem. For a good overview of smoothing applied to language modeling see [7]. Within our framework, it is used when estimating P(r | D), the beliefs of the representation concept nodes.
The following types of smoothing are available in Indri:
- Jelinek-Mercer
- Dirichlet
- Two-Stage[8]
We advocate using Dirichlet smoothing (it is the default), as it is the formally 'correct' model to use within our framework (see estimation above). However, the other options do work, and can be considered on a case-by-case basis. In all cases, the smoothing parameters must be specified, as there is currently no automatic tuning available.
Here are the forms of P(r | D) for the different types of smoothing:
Indri also supports the ability to do field- and operator- specific smoothing. For field-specific smoothing, it is possible to smooth each model node within the network differently. This is useful when making use of multiple document representations, each of which needs to be smoothed differently. In addition, operator-specific smoothing allows 'term' and 'window' representation concept nodes to be smoothed differently. Experiments have shown that applying different smoothing to 'window' nodes can be beneficial.
Here is the smoothing documentation for runquery from Indri's README:
rule: specifies the smoothing rule (TermScoreFunction) to apply. Format of the rule is: ( key ":" value ) [ "," key ":" value ]*
For example: -rule=method:linear,collectionLambda:0.2,field:title This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.
If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.
Valid keys:
method: smoothing method (text)
field: field to apply this rule to
operator: type of item in query to apply to { term, window }
Valid methods:
dirichlet: (also 'd', 'dir') (default mu=2500)
jelinek-mercer: (also 'jm', 'linear') (default collectionLambda=0.4,
documentLambda=0.0), collectionLambda is also known as just "lambda",
either will work
twostage: (also 'two-stage', 'two') (default mu=2500, lambda=0.4)
NOTE: If the rule doesn't parse correctly, the default is Dirichlet, mu=2500.
[edit] Pseudo-Relevance Feedback
Indri's pseudo-relevance feedback mechanism is an adaptation of Lavrenko's relevance models[9] .
The following is a basic summary of the process:
- Retrieve documents using original query, which results in a ranked list ordered by P( I | D )
- Compute relevance model, P(r | I), over representation concepts (features) using top fbDocs documents from original ranked list
- Sort representation concepts by P(r | I) and keep top fbTerms
- Construct query Q_RM as: #weight( P(r_1 | I) r_1 ... P(r_fbTerms | I ) r_fbTerms )
- Construct expanded query as: #weight( fbOrigWeight Q 1-fbOrigWeight Q_RM )
- Retrieve documents based on expanded query
The relevance model, P(r | I), is computed as:
where the summation goes over the top fbDocs documents and P( D ) is assumed to be uniform. In this case, P(r | D) is always estimated using Dirichlet smoothing, with fbMu used to specify the smoothing paramater, mu. We recommend using fbMu = 0, which is the default behavior.
Finally, we note that the current implementation of Indri only computes the relevance model over term representation concepts. However, given the general framework, it is possible to compute a relevance model over any possible representation concept, such as phrases. Support for such relevance models should be available in a future release.
Here is the pseudo-relevance feedback documentation for runquery from Indri's README:
Pseudo-Relevance Feedback Parameters
fbDocs: an integer specifying the number of documents to use for feedback.
fbTerms: an integer specifying the number of terms to use for feedback.
fbMu: a floating point value specifying the value of mu to use for feedback.
fbOrigWeight: a floating point value in the range [0.0..1.0] specifying the
weight for the original query in the expanded query.
[edit] References
- ↑ Ponte, J. M. and Croft, W. B., "A language modeling approach to information retrieval," Proceedings of the 21st Annual international ACM SIGIR Conference on Research and Development in information Retrieval (SIGIR '98), 275-281, 1998. [pdf]
- ↑ Turtle, H. and Croft, W.B.,"Evaluation of an Inference Network-Based Retrieval Model," ACM Transactions on Information System, in 9(3),187-222, 1991.
- ↑ Metzler, D. and Croft, W.B., "Combining the Language Model and Inference Network Approaches to Retrieval," Information Processing and Management Special Issue on Bayesian Networks and Information Retrieval, 40(5), 735-750, 2004. [pdf]
- ↑ Metzler, D., Strohman T., Turtle H., and Croft, W.B., "Indri at TREC 2004: Terabyte Track," Online Proceedings of 2004 Text REtrieval Conference (TREC 2004), 2004. [pdf]
- ↑ Strohman, T., Metzler, D., Turtle, H., and Croft, W.B., "Indri: A language-model based search engine for complex queries (extended version)" CIIR Technical Report, 2005. [pdf]
- ↑ 6.0 6.1 Metzler, D., Lavrenko, V., and Croft, W. B., "Formal Multiple-Bernoulli Models for Language Modeling," Proceedings of ACM SIGIR 2004, 540-541, 2004. [pdf]
- ↑ Zhai, C. and Lafferty, J., "A study of smoothing methods for language models applied to Ad Hoc information retrieval," Proceedings of the 24th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (SIGIR '01), 334-342, 2001. [pdf]
- ↑ Zhai, C. and Lafferty, J., "Two-stage language models for information retrieval," Proceedings of the 25th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (SIGIR '02), 49-56, 2002. [pdf]
- ↑ Lavrenko, V. and Croft, W.B., "Relevance-Based Language Models," Proceedings of the 24th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (SIGIR '01), 120-127, 2001.[pdf]











