Where do good query terms come from ?

 


Gheorghe Muresan

School of Communications, Information and Library Studies

Rutgers University

New Brunswick, NJ 08901, USA

muresan@scils.rutgers.edu

 

Dmitri Roussinov

W.P. Carey School of Business

Arizona State University

Tempe, AZ 85287, USA

dmitri.roussinov@asu.edu


 

 


This paper describes a framework for investigating the quality of different query expansion approaches, and applies it in the HARD TREC experimental setting. The intuition behind our approach is that each topic has an optimal term-based representation, i.e. a set of terms that best describe it, and that the effectiveness of any other representation is correlated with the overlap that it has with the optimal representation. Indeed, we find that, for a wide number of candidate topic representations, obtained through various query-expansion approaches, there is a high correlation between standard effectiveness measures (R-P, P@10, MAP) and term overlap with what is estimated to be the optimal representation.

An important conclusion of comparing different query expansion approaches is that machines are better than humans at doing statistical calculations and at estimating which query terms are more likely to discriminate documents relevant for a given topic. This explains why, in the HARD track of TREC 2005, the overall conclusion was that interaction with the searcher and elicitation of additional information could not over-perform automatic procedures for query improvement. However, the best results are obtained from hybrid approaches, in which human relevance judgments are used by algorithms for deriving terms representations. This result suggest that the best approach in improving retrieval performance is probably to focus on implicit relevance feedback and novel interaction models based on ostention or mediation, which have shown great potential.

Introduction

A key issue in current information retrieval (IR) research is how to take account of the searcher’s profile and context in order to personalize the list of documents estimated by the system to be relevant (Ingwersen and Jarvelin, 2004). The research work reported here was triggered by our interest in investigating how implicit or explicit forms of relevance feedback can generate extra information, and how such information can be employed to alter the search results and boost search performance.

The High Accuracy Retrieval from Documents (HARD) track of the Text Retrieval Conference (TREC[1]), organized by the National Institute for Standards and Technology (NIST[2]), was introduced with the aim of exploring methods for improving the accuracy of document retrieval systems based on various forms of personalization (Allen 2004, 2005), so it provides an appropriate framework for conducting such work. The 2003-2004 runs of HARD TREC supported a simulation of implicit relevance feedback by including information which could be, in principle, obtained through logs or observations of previous user interactions with a retrieval system. This was implemented via metadata which specified the searcher’s familiarity with a topic, as well as her preference with regard to document genre, geographic coverage or granularity. A simulation of explicit relevance feedback was also supported via clarification forms, through which the retrieval systems could get extra information via a brief interaction with the human searcher.

In 2005, clarification forms were the only means to obtain personalization information: for each topic submitted by an information seeker, simulated by a NIST assessor[3], participating research groups were allowed to generate a clarification form and to try to leverage the additional information elicited, in order to improve retrieval effectiveness (compared to a baseline obtained without such information). Typical questions submitted by HARD TREC participants in these forms were aimed at reducing query ambiguity (they asked whether some document titles seemed relevant, which of a number of cluster labels seemed a better match for the topic, or which of a number of candidate terms were more related to the topic) or at query expansion (they asked for additional descriptions of the information need).

Like most other participants in HARD TREC 2005, our group used a range of query expansion approaches, some automatic (based on query clarity, or on mining the web), some interactive (by asking the searcher, in the clarification forms, to provide extra information) and some mixed (we asked the searcher to filter expansion terms coming from automatic methods). The relevance judgments were made available only at the end of the TREC experiment, so it was not possible for participants to analyze the results and estimate which approaches worked better before submitting the personalized search runs to NIST. We submitted the runs that our informed guess estimated to be significant improvements to our baseline. When the relevance judgments were made available and we were able to compute effectiveness measures, we were unpleasantly surprised to realize that, while our sophisticated approaches did better than the baseline (simple search based on the standard topic representation), they were not better than pseudo relevance feedback (PRF). PRF is a simple automatic procedure implemented as standard functionality in most IR toolkits, which assumes that the top ranking documents returned from a search are relevant, extracts the most representative terms from these documents and uses them for expanding the original query, and re-runs the search with the expanded query. At the TREC conference[4] it became rather clear than other participants in the HARD TREC had a similar experience: sophisticated expansion techniques and simulations of interactions with the searcher (which are expensive in terms of time spent and cognitive effort) did not show a significant improvement over the standard PRF.

Those results triggered the work described here. We are systematically investigating sets of expansion terms coming from different sources of evidence and to assess their quality and potential to improve retrieval effectiveness. We must clarify that, while we are employing the HARD TREC experimental setting (document collection, topics, relevance judgments, clarification forms), the kind of investigation described here was not possible during the TREC experiment, when relevance judgments were not available. Based on these judgments, we can now establish upper-bounds of performance and compare them with a number of approaches to query improvement.

The objective of this work is two-fold. Firstly, and more importantly, we are interested in comparing methods for evaluating the quality of expansion term sets. Specifically, we investigate whether the level of term overlap with the estimated optimal set correlates with, and can predict, higher retrieval effectiveness. Secondly, we are comparing sets of expansion terms from different sources in order to estimate the quality of these sources and to inform future design of IR systems.

The rest of the paper is structured as follows: after presenting the experimental setting, we describe the methodology for our investigation, build optimal upper-bounds, analyze the quality of the original topic representations, analyze a number of query expansion term sets, from various sources, and discuss the results.

Experimental Setting

The corpus used in HARD TREC 2005 and in our experiment is the AQUAINT Corpus, produced by the Linguistic Data Consortium (LDC), catalog number LDC2002T31 and ISBN 1-58563-240-6[5]. It consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People’s Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. The corpus is roughly 3GB of text and includes 1,033,461 documents (about 375 million words of text). All documents in the collection were used for the HARD evaluation.

The 50 topics were selected from among existing TREC topics on which automatic approaches had produced low retrieval effectiveness in previous years; the intention was to verify if the simulated interaction with the human searcher could improve performance. Because those old topics were to be judged on a new corpus they were manually vetted to ensure that at least three relevant documents existed in the AQUAINT corpus for each of them. As the original authors of the adopted topics were not available to perform relevance judgments, at least some degree of consistency in judging was insured by having the same NIST assessor answer clarification forms and judge the relevance of the submitted documents for each topic. No attempt was made to ensure that the assessor’s notion of relevance matched that of the original topic author.

Documents were judged as either not relevant, somewhat relevant, or highly relevant. For consistency, we adopted the same interpretation for the judgments as used officially in HARD TREC: for evaluation purposes of this experiment, judgments of somewhat relevant and highly relevant were both treated as relevant.

While R-precision (R-P) was the official effectiveness measure of the track, the participants were also encouraged to report precision at 10 retrieved documents (P@10) and mean average precision (MAP), in order to provide a better picture of the effect of the techniques employed (Buckley and Voorhees, 2005). For computing these effectiveness measures, we employed the standard trec_eval tool provided by NIST[6]. It is worth noting that this was a precision-oriented experiment, with particular focus on top-ranking documents.

In terms of software tools, we used the Lemur open source IR toolkit[7], widely used in TREC, which provides functionality for indexing and searching, as well as additional functions for computing query clarity, clustering, etc. Based on preliminary tests to compare the effect of various parameters, we chose a combination of indexing parameters and pre-processing tools that tend to be effective, including the Krovetz stemmer[8] (Krovetz, 1993) and the SMART stopword list[9]. As retrieval model we adopted the popular TfIdf model. It tends not to yield best effectiveness, but it is the only model implemented in Lemur’s modules for retrieval based on both flat, unweighted queries (RetEval, RelFBEval) and structured, weighted queries (StructQueryEval). Therefore, we were able to consistently use the same underlying retrieval model throughout our experiments, and thus avoid the potential effect of the model compounding the results

We obtained our baseline run by running Lemur on queries that comprised the topic titles and descriptions (see the Appendix). For comparing the effectiveness of different sets of results (called “runs” in TREC), and thus the quality of the expansion terms used to produce them, we used matched-pairs Wilcoxon tests[10]. This is justified on grounds that the difference in effectiveness scores may be larger between individual topics than the difference that we want to observe, between methods or sources of query expansion terms. Also, this non-parametric test is appropriate because many of the effectiveness scores are not normally distributed.

Methodology

First, we use relevance judgments to build optimal expansion term sets; details are given in the next section. We then build candidate expansion term sets, based on a variety of sources and with a variety of methods and parameters. By evaluating the quality of these sets, we will be able to conclude which sources and which methods of query expansion work better.

We propose two ways to estimate the quality of the expansion sets or, more generally, of queries as representations of topics. The first is simply to compare them with the optimal sets by applying set operations and looking at the overlap. Because the sizes of the expansion sets vary, measuring the overlap is justified only if some form of normalization is applied. Considering that some of our sets were generated by the human searchers during the search interaction, and thus relatively short, we measure the overlap at cut-offs of 10, 20 and 30 terms. Note that virtually all automatic approaches to generating candidate terms for query expansion assign a weight to each term. We use the weights to rank the terms before applying the cutoff and computing the overlap. In the future we plan to actually consider correlation of weights rather than overlap, hoping for more accuracy in estimating the quality of topic representations.  

 Text Box:  
Figure 1. Standard query expansion procedure
 Text Box:  
Figure 2. Expansion based on score combination

The second approach looks at the actual effectiveness improvement effected by the expansion terms. There are two approaches to using expansion terms. The standard approach, depicted in Figure 1, is to combine the original query terms with the expansion terms either by a simple concatenation, or by applying a weighting scheme such as the Rocchio formula (Rocchio, 1971). In practice negative feedback is rarely used, so the formula becomes:

expandedQuery = originalQuery + w * expansionTerms,

where expandedQuery, originalQuery and expansionTerms are vectors of term weights, and w is a weighting coefficient that controls the contribution of the new expansion terms, typically based on the confidence assigned to a certain source of evidence. In practice it is common that w = 1, i.e. the expansion terms are simply appended to the original query.

The alternative approach, depicted in Figure 2, uses the expansion terms as a query and it generates an “expansion run", a list of documents that match the “expansion query”. If a retrieval model with a linear weighting formula is employed, then the same effect as the standard approach can be obtained by combining the scores in the baselines with those in the expansion run, and using the same weighting coefficient:

experimentalRun = baselineRun + w * expansionRun.

The standard approach is implemented in most experimental IR systems and is therefore very easy to use. In Lemur, for example, the researcher can specify a file with relevance judgments, together with the weighting coefficient for relevance feedback (the function that implements this is RelFBEval). Alternatively, the researcher can use the standard retrieval function (RetEval), but set the flag for the optional pseudo relevance feedback effect and specify the number of top ranked documents and the number of top weighting terms to be considered.

However, this approach is not necessarily convenient for IR experiments, as it confounds the quality of the expansion terms with the effect of the weighting scheme and with the quality of the original query. The alternative approach allows one to look not only at the effectiveness of the final experimental run and to compare it to the baseline, but also to isolate the effectiveness of the expansion run, which directly reflects the quality of the expansion terms.

Lemur does not provide functionality for combining runs, based on a linear combination of their scores, so we have to write scripts to that effect. In fact, combining the baselineRun with the expansionRun is only necessary if we want to compare the two combinations of evidence approaches. For evaluating the quality of expansion term sets, it is sufficient to compute the expansion run scores.

Optimal Upper-Bounds

Choosing Upper-Bounds

 

An ideal query would be one which, for a certain information need, would retrieve all the relevant documents and no non-relevant documents. Although such an ideal query is probably impossible to create, we attempt to use the existing relevance judgments to build “optimal queries”, which achieve upper-bounds of performance. The reasoning is as follows: if an IR system with relevance feedback (RF) capability is given a set of documents judged relevant, it performs a statistical analysis of the documents and it produces a weighted terms representation of those documents. The terms can be ranked based on their weights, and the top ranking terms can be used for query expansion. If all and only those documents that are known to be relevant for a topic are used in this relevance feedback process, the obtained representation is optimal in terms of retrieval effectiveness.

Therefore, we use Lemur’s relevance feedback module, RelFBEval, to compute weighted terms representations for the sets of documents judged relevant for each of the 50 test topics, using as input empty queries and the set of all documents judged relevant by the NIST assessors. In NIST terminology, these judgments are called the “qrels”; therefore, expansion term sets obtained based on them will be labeled as “qrels” in some of the tables included in the paper. The execution of RelFBEval produces two outcomes: (i) the optimal term representation of each topic; and (ii) the results of searching the corpus based on the optimal representations, i.e. our “optimal run”, which constitutes our upper-bound of performance.

The term “optimal” is relative, as the outcome of executing RelFBEval dependends on a set of parameters, the most important of which being: (1) feedbackDocCount, which specifies how many documents judged relevant should be considered; and (2) feedbackTermCount, which specifies how many query terms should be generated, and used for searching. Exploring the effect of those two parameters is made more difficult by the distribution of the number of documents judged relevant over the topics: this number varies between 9 (for topic 345) and 376 (for topic 354). Therefore, in choosing different values for feedbackDocCount, we considered different percentages of the number of relevant judgments for each topic: all 100%, 75%, 50%, 25% and 10%. For feedbackTermCount we took absolute values: 10, 20, 30, 50, 75, 100, 150, 200, 10000 (the intent for the last value was to cover all the terms that may be extracted from the documents used for feedback).

25 %

 

75 %

 

10 %

 

50 %

 

100 %

 

Figure 3. The effect of relevance judgment parameters on retrieval effectiveness

 

Figure 3 depicts the retrieval effectiveness, as measured by MAP, P@10 and R-P, of different candidates to the title of optimal run. The five clearly visible groups of results correspond to the distinct values for feedbackDocCount, decreasing from 100% to 10%; within each group, feedbackTermCount increases from 10 to 10,000. First of all, as each group ends with a dramatic drop in performance, it is clear that representing each topic by all the terms that appear in a sample of relevant documents is rather disastrous; representing each topic by a “reasonable” number of terms gives much better performance. For P@10, it appears that performance peaks at 50-75 terms, while for MAP and R-P (which are highly correlated) it is best to consider just the best 10-30 terms.

While Figure 3 shows only summaries and not the variability of the data, it does suggest choosing as optimal run the one obtained based on all relevant judgments, and the most representative 30 terms. This result is encouraging for our experiment, because it makes it justified to compare manual representations with the optimal representation and to use term overlap at cutoffs of 10, 20 and 30 as measures of query quality: human searchers typically submit short queries and, even when prompted to give more details, they cannot be expected to generate more than 10-30 query terms.

Table 1 shows more details: for each of the 9 runs based on the full set of judgments, it shows the mean and standard deviation of the derived run, as well as the result of a Wilcoxon test that compares that run with the candidate optimal run, obtained with 30 terms.

Table 1. Comparing effectiveness of qrels runs

Expansion term count

R-P

P@10

MAP

10

0.415 (0.149)

0.716 (0.266)

0.387 (0.185)

 

F = 492, p = 0.446

F = 43, p = 0.000

F = 462, p = 0.090

20

0.427 (0.137)

0.778 (0.245)

0.409 (0.172)

 

F = 470, p = 0.971

F = 51.5, p = 0.075

F = 590, p = 0.647

30

0.428 (0.132)

0.798 (0.242)

0.411 (0.168)

50

0.427 (0.132)

0.798 (0.239)

0.409 (0.167)

 

F = 450.5, p = 0.604

F = 83.5, p = 0.642

F = 591.5, p = 0.657

75

0.422 (0.135)

0.816 (0.231)

0.404 (0.168)

 

F = 435, p = 0.484

F = 169, p = 0.583

F = 526.5, p = 0.284

100

0.415 (0.138)

0.810 (0.235)

0.395 (0.168)

 

F = 413, p = 0.164

F = 124.5, p = 0.753

F = 474, p = 0.114

150

0.406 (0.141)

0.810 (0.220)

0.381 (0.172)

 

F = 375, p = 0.045

F = 192, p = 0.942

F = 395, p = 0.019

200

0.398 (0.145)

0.806 (0.227)

0.369 (0.177)

 

F = 355, p = 0.027

F = 208, p = 0.909

F = 345, p = 0.005

10000

0.238 (0.157)

0.550 (0.298)

0.180 (0.159)

 

F = 34, p = 0.000

F = 136.5, p = 0.000

F = 24, p = 0.000

 

It is clear that our candidate run is the best in terms of MAP and R-P and close to the top (compared to the best run, the difference is not statistically significant) in terms of P@10. Therefore, we adopted this run as the optimal run, and refer to it as such in the rest of the paper.

Table 2. Comparison between the baseline and the optimal run

 

R-P

M (sd)

P@10

M (sd)

MAP

M (sd)

Baseline run

0.222 (0.158)

0.338 (0.281)

0.156 (0.150)

Optimal run

0.428 (0.132)

0.798 (0.242)

0.411 (0.168)

 

F = 35, p < 0.001

F = 19, p < 0.001

F = 14, p < 0.001

 

The optimal run is significantly better than our baseline run on all three measures of effectiveness used in HARD TREC, as depicted in Table 2. It is particularly relevant to observe the high precision of the optimal run, beneficial for interactive retrieval sessions, simulated in our experiment.

Optimal Run Versus Approximations

Another interesting conclusion that can be drawn from Figure 3 is that, while effectiveness decreases with the size of the relevance judgment sample, the decrease is not substantial: even when only 10% of the relevant documents are used for feedback, we obtain close to optimal retrieval effectiveness. In other words, good performance can be obtained “on the cheap”, by using just a relatively low number of positive judgments. A natural question that arises is whether it is possible to manage with no judgments at all, simply relying on the pool of top ranking documents known to be retrieved by a number of search engines. We can simulate that approach by considering NIST’s file of relevance judgments, but ignoring the actual judgments and simply using all the documents in the file for relevance feedback.

Figure 4. Comparing the baseline, the optimal run, and the cheap run

 

Figure 4 depicts the results, which are extremely interesting. The baseline represents the retrieval effectiveness based on the human searcher articulating an information need. It can be argued that this is a rather generous baseline: in practice searchers submit queries much shorter than the topic title and description generated by NIST assessors for the TREC experiment. The results labeled “All” simulate a cheap form of implicit relevance feedback in which the human user’s searches are logged and all the top ranking documents are recorded and analyzed statistically, whether the searcher opens them and uses them or not. In other words, this is a very naïve form of implicit relevance feedback, in which no attempt is made to interpret user’s actions such as opening, bookmarking, saving or printing documents; a document is viewed as possibly relevant based on the fact that it was ranked highly in a search performed by the user. The results labeled “Rel” simulates an explicit form of relevance feedback, in which the user has marked each document retrieved in the past as relevant or non-relevant, so that a more precise models of the topics of interest to the use can be built.

Not surprisingly, “Rel” performs much better than the baseline and than “All”. What is very interesting and extremely promising for research in personalization of IR, is the fact that “All” is better than the baseline. It means that an intelligent agent that logs the user’s searches and records the retrieved documents, can generate better topic representations and obtain better results than the user submitting queries, after a reasonable amount of training (the TREC qrels file contains several hundred documents for each topic). If the agent learns to interpret user actions that indicate document relevance, the potential for improving performance is enormous.

Table 3 gives a more refined view of the same result: it captures the means and standard deviations of the three measures of effectiveness (R-P, P@10 and MAP) for runs obtained based on queries consisting of the best 10, 20, 30 , …, 200 terms representing the documents judged relevant (label “Rel”), respectively all the high-ranking documents retrieved, whether they are relevant or not (label “All”). The means of the effectiveness values and the output of the matched-pairs Wilcoxon tests indicate that the Rel runs constitute a substantial improvement, the All runs are better overall better than the baseline, but this is not a statistically significant conclusion.

 

Table 3. Comparing the baseline, the optimal run, and the cheap run

Feedback term count

All

Rel

R-P

P@10

MAP

R-P

P@10

MAP

10

0.263 (0.171)

0.422 (0.331)

0.215 (0.189)

0.415 (0.149)

0.716 (0.266)

0.387 (0.185)

 

F = 775, p = 0.055

F = 529, p = 0.052

F = 913, p = 0.008

F = 1219, p = 0.000

F = 1010, p = 0.000

F = 1251, p = 0.000

20

0.251 (0.167)

0.450 (0.319)

0.207 (0.184)

0.427 (0.137)

0.778 (0.245)

0.409 (0.172)

 

F = 701, p = 0.246

F = 645, p = 0.038

F = 867.5, p = 0.026

F = 1234, p = 0.000

F = 1026.5, p = 0.000

F = 1261, p = 0.000

30

0.247 (0.166)

0.454 (0.347)

0.198 (0.186)

0.428 (0.132)

0.798 (0.242)

0.411 (0.168)

 

F = 639, p = 0.282

F = 696, p = 0.019

F = 837, p = 0.054

F = 1190, p = 0.000

F = 1157, p = 0.000

F = 1261, p = 0.000

50

0.228 (0.168)

0.420 (0.353)

0.184 (0.189)

0.427 (0.132)

0.798 (0.239)

0.409 (0.167)

 

F = 572.5, p = 0.928

F = 639, p = 0.093

F = 756, p = 0.253

F = 1182, p = 0.000

F = 1022, p = 0.000

F = 1266, p = 0.000

75

0.214 (0.163)

0.394 (0.354)

0.168 (0.182)

0.422 (0.135)

0.816 (0.231)

0.404 (0.168)

 

F = 513, p = 0.442

F = 512, p = 0.291

F = 664, p = 0.798

F = 1215, p = 0.000

F = 1023.5, p = 0.000

F = 1263, p = 0.000

100

0.202 (0.164)

0.370 (0.348)

0.156 (0.175)

0.415 (0.138)

0.810 (0.235)

0.395 (0.168)

 

F = 430, p = 0.156

F = 490.5, p = 0.625

F = 587, p = 0.626

F = 1157, p = 0.000

F = 1063, p = 0.000

F = 1256, p = 0.000

150

0.188 (0.159)

0.346 (0.343)

0.140 (0.168)

0.406 (0.141)

0.810 (0.220)

0.381 (0.172)

 

F = 355, p = 0.027

F = 422, p = 0.872

F = 462, p = 0.134