Semantic Dimensionality and Effective Data Fusion in Information Retrieval
Funded by the Information and Data Management Program of the National Science Foundation
Grant Number IIS-9812086
PI: Paul Kantor
4 Huntington Street, SCILS, New Brunswick, NJ 08901
Phone: (732) 932-1359; Fax: (732) 932-1504
Email: kantor@scils.rutgers.edu; URL: scils.rutgers.edu/~kantor
Co-PI: Kwong Bor Ng
Graduate School of Library and Information Studies, Queens College, CUNY
65-30 Kissena Blvd. Flushing, New York 11367
Phone: (718) 997-3613; Fax: (718) 997-3797
Email: kbng@qc.edu; URL: qcunix1.qc.edu/~kbng
Project Summary
The proposed research will investigate the effectiveness of data fusion schemes for information retrieval. Data fusion techniques combine the estimates of relevance or usefulness provided by several different schemes, to produce a richer and more refined set of documents for examination by the human seeking information. This problem will become ever more important as information must be found in complex networked environments, by scientists, business people, students, and ordinary citizens. It is unlikely that any one system will solve the problem of finding the best and most useful sources and documents. Data fusion, widely used in image processing and signal detection, has been shown in other settings, where the noise is largely random, to give substantial performance improvements. The proposed work will be an empirical study, based on theories developed by the investigators, and using a large collection of existing data developed at the Text Retrieval Conferences at NIST, to find laws or rules which predict which retrieval schemes should be combined, and how, to provide improved performance.
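As a minimal illustration of what "combining the estimates of relevance provided by several different schemes" can look like, the sketch below sums min-max normalized scores per document. This summation rule is a common baseline chosen only for illustration; it is not stated to be the investigators' method. The names fuse_by_sum and min_max_normalize, and the sample scores, are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one scheme's scores to [0, 1] so schemes are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a scheme that scores every document identically
        # contributes no ranking evidence, so map its scores to 0.0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_by_sum(runs):
    """Fuse several {document: score} mappings by summing normalized scores.

    Returns document ids ordered from highest to lowest fused score;
    ties are broken by document id so the output is deterministic.
    """
    fused = defaultdict(float)
    for scores in runs:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += s
    return sorted(fused, key=lambda d: (-fused[d], d))


if __name__ == "__main__":
    # Two hypothetical schemes scoring overlapping sets of documents.
    scheme_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
    scheme_b = {"d2": 10.0, "d3": 30.0, "d4": 20.0}
    print(fuse_by_sum([scheme_a, scheme_b]))  # ['d3', 'd1', 'd4', 'd2']
```

In this toy case d3 rises to the top because both schemes rate it well, even though neither scheme alone is decisive about it.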
Goals, Objectives, and Targeted Activities
The raw material consists of ranked lists of documents L(t,s), prepared for each of more than 250 topics t by each of the schemes or systems s participating in TREC in a given year. For each year, and for each topic t, we will compute a generalization of Kendall's tau coefficient that is appropriate for lists which may not contain the same entities. The resulting measure z(s,s') of agreement between two schemes s and s' can be used as one of the predictive variables. In addition, individual performance measures w(t,s) are available for every topic-system combination. The results of this research will be predictive models for deciding when two (or more) schemes can be effectively combined by data fusion (DF) to improve information retrieval (IR). Research will be conducted by applying discriminant analysis, non-linear clustering techniques, and other statistical methods to identify the form of the predictive function f and the most powerful predictive variables. Extensive use is made of the Receiver Operating Characteristic (ROC) concept to determine whether one model is absolutely, or only conditionally, more powerful than another.
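The text does not specify how Kendall's tau is generalized to lists that may not contain the same entities. The sketch below shows one possible convention, stated here as an assumption: every document missing from a list is treated as tied just below that list's last entry, and tau is then computed over all pairs in the union of the two lists, with tied pairs contributing zero. The function names are hypothetical, and each list is assumed to contain no duplicate documents.

```python
from itertools import combinations


def rank_map(ranked, universe):
    """Map each document in the universe to its position in one ranked list.

    Assumption: documents absent from the list share a single rank just
    below the list's last entry (they are tied with one another).
    """
    position = {doc: i for i, doc in enumerate(ranked)}
    bottom = len(ranked)
    return {doc: position.get(doc, bottom) for doc in universe}


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def generalized_tau(list_a, list_b):
    """Kendall-style rank agreement for two lists with different members.

    Each pair of documents from the union scores +1 if both lists order
    it the same way, -1 if they disagree, and 0 if either list ties it.
    The result is the mean pair score, which lies in [-1, 1].
    """
    universe = sorted(set(list_a) | set(list_b))
    if len(universe) < 2:
        return 0.0  # no pairs to compare
    ra = rank_map(list_a, universe)
    rb = rank_map(list_b, universe)
    total = 0
    pairs = 0
    for x, y in combinations(universe, 2):
        total += sign(ra[x] - ra[y]) * sign(rb[x] - rb[y])
        pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Two hypothetical ranked lists for one topic, differing in the last item.
    print(generalized_tau(["a", "b", "c"], ["a", "b", "d"]))
```

Under this convention two identical lists score 1.0 and a list compared with its reverse scores -1.0; lists that share only some documents fall in between. Other treatments of missing documents are equally defensible and would give different values.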
Mid Term Report
Mid Term Report of Research Project APLab/RP-98/01
Grant Report 2000
Grant Report of Research Project APLab/RP-00/01
Technical Reports
Four related writings available online:
Kantor, P.B. (1995). Decision level data fusion for routing of documents in TREC3 context: A best case analysis of worst case results. In D. Harman (ed.) Proceedings of the 3rd Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B. and Kantor, P.B. (1996). Two experiments on retrieval with corrupted data and clean queries in TREC 4 adhoc task environment: Data fusion and pattern scanning. In D. Harman (ed.) Proceedings of the 4th Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B., Loewenstern, D., Basu, C., Hirsh, H. and Kantor, P. (1997). Data fusion of machine learning methods for the TREC-5 routing task (and other works). In D. Harman (ed.) Proceedings of the 5th Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B. and Kantor, P.B. (1998). An Investigation of the Conditions for Effective Data Fusion in IR: A Pilot Study. Proceedings of the 61st Annual Meeting of the American Society for Information Science.

Last Revision: 4/12/01