Information Retrieval
[17:610:551]
Pre-requisites: 17:610:510 and 550.
Description: Theory, design, use, and evaluation of information retrieval (IR) systems. Design principles for IR systems and their implementation, characteristics of operational and experimental retrieval systems, and evaluation of information retrieval systems.
Synopsis: Course Objectives
To give students a solid understanding of:
- the genesis and variety of information retrieval situations;
- the variety of information retrieval models and techniques;
- design principles for information retrieval systems;
- methods for implementing information retrieval systems;
- characteristics of operational and experimental information retrieval systems;
- methods and principles for the evaluation of information retrieval systems.
Organization of the Course
1. "Textbook IR" covers the accepted current wisdom in IR research and practice, i.e. topics that have been thoroughly studied and are relatively well understood. The purpose of this part is to give all students a grasp of what IR is, what problems it tries to solve, and which approaches have proved successful and which have not.
Each meeting day, there will be a lecture and discussions on the scheduled topic. Students are expected to read the relevant chapters or sections from the recommended books before the class and to participate in discussions on that topic. In each class, a couple of students will give a short presentation, doing a critical analysis of a paper or topic. Other students are expected to critically discuss the presentations.
2. "Advanced IR" covers current research topics in IR. According to their preferences, students will choose topics of interest from a list of proposed topics, investigate them more thoroughly, give a longer presentation and write a report. The number of topics that each student has to present and the depth and coverage of the analysis will depend on the number of students taking the course.
Websites. All the lecture notes will be available online. The course website will contain links to students' websites. Students are expected to maintain websites for the course on their eden accounts, in a subfolder of public_html called 551. They will use the websites to publish their presentations and reports, and to upload their homework.
Major Assignments
Students will be graded based on reports, presentations, term project, homework and class participation. The final grade will be a weighted average of the partial grades.
- The report and presentation of a paper (or topic) should make clear: (1) what problem the paper addresses; (2) what relation it has to prior cited literature and to the current topic discussed in class; (3) what idea it proposes to solve or improve the problem; (4) what was done to implement that idea; (5) what results were found and how well the results were interpreted; and (6) what suggestions were made for further work.
The report should have a clear and logical structure, both of the topic and of the argument, and an appropriate title or indication of the topic (this can help focus and structure the argument), and it should display critical analysis. In addition, the presentation should make good use of visual aids to help the listeners understand the topic and the argument. Critical analysis is essential. Over the semester, each student is expected to give at least 3 presentations: (i) one on a classic paper; (ii) one on a recent paper; (iii) one on an IR topic, based on at least 2 papers (one classic, one recent); this presentation should be accompanied by a report.
- Practical project work - is expected to display:
- Good understanding of the topic chosen and of the IR context
- Creativity
- Substantial work
Below are suggestions of possible projects, but you can propose your own if you have a particular interest. Students are expected to choose a project soon after mid-term, to discuss their progress with the instructor, and to give a presentation at the end of the course.
The project may be real research (when some research hypothesis is proposed and tested), or mock research (when some of the data, such as subjects' responses to questionnaires, are generated automatically rather than collected from real subjects), or a literature review, or a practical implementation of a system.
Possible projects:
- Conduct a user experiment on a research problem and analyze the results. Examples:
- Compare the effectiveness of two systems or user interfaces for a certain task
- Test the usability of a system
- Repeat an experiment described in a TREC or SIGIR paper.
Required: a report discussing the research problem and/or the purpose of the experiment, the experimental design and setting, the results, conclusions and implications or recommendations for designing IR systems.
For a realistic experiment that involves 6-16 subjects, the use of questionnaires or interviews, and/or the statistical analysis of the experimental results, teamwork is not only acceptable, but encouraged. On the other hand, a project suitable for an individual student would be, for example, to compare two Web search engines based on functionality ("what we expect from an IR system") and the support that their user interfaces offer. Note that the functionality of commercial systems is not fully documented (order and weighting of query terms, for example), so an informed guess, based on exploration such as observing the output for various inputs, is necessary.
- Give a detailed description and critique of some operational information retrieval system.
Discuss functionality, usability, and advantages and disadvantages over similar systems. Draw conclusions and formulate recommendations with regard to the use of the system. Examples:
- AntWorld
- Onix
- Inquery, from CIIR, University of Massachusetts, Amherst
- MG
I.H. Witten, A. Moffat and T.C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd ed, 1999. Other bibliography indicated in the MG webpage.
- Lemur, at Carnegie Mellon University
- Cheshire, at UC Berkeley
- Okapi
More info on the Okapi project at City University.
- Lucene
Required: Cover the installation of the system, and the usefulness and usability of the system in a range of tasks.
- Constructing and documenting (a significant part of) an information retrieval system, or adding new modules to an existing system, and evaluating it by demonstration (this could be an individual or a group project for students with programming skills).
- Class participation - apart from active, thoughtful and creative participation in class discussions, critical discussion of presentations and most homework are seen as class participation. Although not graded as such, class participation can determine the adjustment of the final grade by one position.
- Graded homework may be assigned occasionally in the form of exercises designed to reinforce learning.
Methods of Assessment
Class Participation | 10%
Presentations | 30%
Homework | 30%
Project | 30%
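As a rough illustration of how the weights above could combine into a final grade, here is a minimal sketch. The component scores are hypothetical, the 0-100 scale is an assumption, and the function name is made up for this example; the syllabus does not say how a numeric average maps to a letter grade.

```python
# Weights taken from the Methods of Assessment table.
WEIGHTS = {
    "class_participation": 0.10,
    "presentations": 0.30,
    "homework": 0.30,
    "project": 0.30,
}


def final_grade(scores):
    """Weighted average of partial grades (each assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical partial grades for one student (illustrative, not real data).
example_scores = {
    "class_participation": 90,
    "presentations": 80,
    "homework": 70,
    "project": 100,
}
print(final_grade(example_scores))
```

With these made-up scores the weighted sum is 9 + 24 + 21 + 30 = 84.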
Bibliography
Recommended texts
- Sparck Jones, Karen & Willett, Peter eds. (1997) Readings in Information Retrieval. San Francisco: Morgan Kaufmann.
A collection of fundamental research papers in information retrieval, plus very good introductory material to the book as a whole, and to each section, by the editors. A basic resource for understanding information retrieval.
- Richard K. Belew (2000) Finding Out About - A cognitive perspective on search engine technology and the WWW.
Although more targeted at a Computer Science audience, it does a good job of introducing IR concepts and principles. It comes with a book website, which contains an HTML copy of the book with figures and tables missing. Here's a local copy, including figures and tables.
- Baeza-Yates, R. & Ribeiro-Neto, B. (1999) Modern information retrieval. New York: ACM Press.
Although it has a slight Computer Science slant, the text does not go into the details of data structures and algorithms but instead addresses the concepts, principles, and mathematical models underlying Information Retrieval. My only gripe is that the chapters have different authors, so the level of detail and the notation vary a lot, and there is some overlap between chapters.
- Charles T. Meadow, Bert R. Boyce and Donald H. Kraft: Text Information Retrieval , 2nd edition - Academic Press, 2000.
This book won the ASIS Best Information Science Book Award for 2000. It offers a good overview of IR, maybe with too much technical detail.
- Ingwersen, P. (1992) Information retrieval interaction. London: Taylor Graham. Reprinted in electronic form in 2002. Here's a local copy.
The overview of the "cognitive approach" to Information Retrieval.
- Korfhage, R. (1998) Information storage and retrieval. New York: Wiley, ISBN: 0-471-14338-3.
This book won the ASIS Best Information Science Book Award for 1998. It is a general overview of IR, and offers good references.
- Hersh, William R. (2002) Information retrieval: A health care perspective, 2nd ed. New York: Springer Verlag, ISBN: 0-387-95522-4 .
Good general introductory text to all of information retrieval, well written and at an appropriate level for this course. The examples are from health care applications, but the book covers general principles and pays particular attention to the problem of designing good evaluation experiments.
- David Grossman and Ophir Frieder (2004) Information Retrieval - Algorithms and Heuristics, 2nd ed, Springer.
Although the subtitle suggests that this is a textbook targeted at CS students, the algorithms are explained in plain English and with many examples, so they are relatively easy for anyone to understand.
- Gerald J. Kowalski and Mark T. Maybury (2000) Information Storage and Retrieval. Boston: Kluwer.
A quite recent book that seems to cover well most aspects of building IR systems.
- Maristella Agosti, Fabio Crestani and Gabriella Pasi (2001) Lectures on Information Retrieval. Berlin: Springer.
A collection of lectures on IR given at the Third European Summer-School, ESSIR 2000.
- Anderson, James D. and Perez-Carballo, Jose (2003) Information Retrieval Design: Principles and Options for Information Description, Organization, Display, and Access in Information Retrieval Databases, Digital Libraries, Catalogs, and Indexes, Ometeca Institute.
It discusses the decisions that need to be taken when building an Information Retrieval system, manual or automatic. Used by the first author as textbook for his course on "Organizing Information". A book webpage is under construction.
- Belkin, N.J. & Vickery, A. (1985) Interaction in information systems. London: The British Library.
Not a text on IR as a whole, but useful for several of its chapters. Provides a general review of the topic.
- Frakes, W.B. & Baeza-Yates, R., eds. (1992) Information retrieval: data structures and algorithms. Englewood Cliffs, NJ: Prentice Hall.
A very technical book on data structures and algorithms for IR. Much of the code (in C) is available at ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/. Recommended if you want to build an IR system. The chapters are written by various authors, so the level of detail and the notation vary widely.
- Marchionini, Gary (1997) Information seeking in electronic environments, Cambridge University Press, ISBN: 0-521-58674-7.
The book discusses how electronic technology has changed the skills and strategies used for manipulating, storing, and retrieving information.
- Kuhlthau, Carol Collier (2004) Seeking meaning - A process approach to Library and Information Services, 2nd ed, Libraries Unlimited, ISBN: 1-59158-094-3.
The recent edition of a classic book on the Information Search Process.
- Soergel, Dagobert (1985) - Organizing Information: Principles of Data Base and Retrieval Systems, Morgan Kaufmann, 0-126-54261-9.
ASIS Best Information Science Book Award for 1998. A bit old, but highly cited.
IR classics
- van Rijsbergen, C.J. (1979) Information retrieval, 2nd ed. London: Butterworths.
A standard, good text on the topic, somewhat technically and theoretically oriented, but with excellent introductions to some essential IR concepts. An electronic version is available on the Web. Here's a local copy.
- Salton, G. (1989) Automatic text processing: The transformation, analysis and retrieval of information by computer. Reading, MA: Addison-Wesley.
Has some useful sections on automatic IR systems, and integrates IR within an overall text-processing framework.
- Salton, G. & McGill, M. (1983) Introduction to modern information retrieval. New York: McGraw-Hill.
Quite old now, but still a very standard text for IR, heavily focused on technical issues of representation and retrieval techniques. It introduces many of the ideas that are otherwise only to be found in the papers and technical reports of Salton's group at Cornell.
- Sparck Jones, K. ed. (1981) Information retrieval experiment. London: Butterworths.
The classic work on this topic. Very good chapters on different aspects of experimentation in IR, by very good people.
Background reading
Information Retrieval is not an isolated field, but relies on knowledge and research results from different domains such as Human-Computer Interaction, Artificial Intelligence, Computational Linguistics, Statistics and Probability Theory. You are encouraged to investigate at least the domains relevant to your project.
- Preece, Rogers and Sharp (2002) Interaction design - Beyond human-computer interaction, ISBN: 0-471-49278-7, Wiley.
An IR system is an interactive system, so a book on the design and evaluation of interactive systems is essential background reading. Also see the complementary website.
- Manning, Christopher D. and Schutze, Hinrich (2001) Foundations of statistical natural language processing, Cambridge: MIT Press.
It explains the mathematical foundations of the statistical approach to Information Retrieval.
- Jurafsky, Daniel and Martin, James H. (2000) Speech and language processing, Prentice-Hall, ISBN: 0-13-095069-6.
As the subtitle says, it's an introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
- Meziane, Farid; Métais, Elizabeth (Eds.) (2004) Natural Language Processing and Information Systems.
- Witten, Ian; Frank, Eibe (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann/Elsevier, ISBN: 0-12-088407-0.
- Gravetter, Frederick J. and Wallnau, Larry B. (1996) Statistics for the Behavioral Sciences, West Publishing Company.
Actually, any Statistics book will do. Note that the ones for Social Sciences typically give clearer examples and rely more on intuition, while the ones for Engineering use more math and more obscure examples. In addition, if you're going to do a user study, you can use a book on "Experimental Design".
- Babbie, E. (2004). The practice of social research (10th ed.). CA: Wadsworth.
- Creswell, J. W. (2003). Research design: Qualitative, quantitative, and mixed methods approaches (2nd ed.). CA: Sage.
- Norman, Donald A. (1988) The Psychology of Everyday Things, BasicBooks.
Reprinted as "The Design of Everyday Things". A wonderful classic on designing "things". (An IR system is a thing :-).)
- Galvan, Jose L. (2004) Writing Literature Reviews - A Guide for Students of the Social and Behavioral Sciences, 2nd ed, Pyrczak Publishing
Journals
Note. You get free access to most of these via the Rutgers library.
- ACM Transactions on Information Systems.
This is a standard journal for substantial, archival work in IR, with a computer science emphasis. Experiment and research in IR. Quarterly.
- Information Processing and Management.
A standard international journal, with significant work in IR in every issue. Experiment and research in IR. Bi-monthly.
- Information Retrieval.
A new journal, edited by Paul Kantor and Stephen Robertson, whose stated aims are to publish high quality technical work in IR. Quarterly.
- Journal of the American Society for Information Science.
A good, standard publication source for information science, with much good work in IR. Experiment and research in IR. About 14 issues/year.
- Journal of Documentation.
Published by Aslib. Long one of the standard and most important journals in IR, which avoids the US bias of JASIS. Experiment and research in IR. Quarterly.
- Information Research
Conferences
- ACM SIGIR International Conference on Research and Development in Information Retrieval.
This is the standard place for publication of research papers with a computer science orientation toward IR. Very strictly refereed. Proceedings available at the ACM Digital Library (via Rutgers library webpage).
- Annual Conference of the American Society for Information Science.
The quality of the papers in this meeting is highly variable, but it is always worth looking at it. It is just about the only place where person-oriented research in IR is reported.
- Digital Libraries `nn. The Nth ACM Conference on Digital Libraries.
The first of this series was in 1996. Although the standard of papers is variable, this meeting is developing into the standard source for papers on digital libraries (which often means papers on IR). There is some emphasis on reporting on systems and prototypes, which makes it different from the SIGIR Conferences.
- IEEE ADL `nn. IEEE Forum on Research and Technology Advances in Digital Libraries.
This is another conference on digital libraries, but with fewer high-quality papers than in the ACM DL `nn series. More focus on policy and economic issues, and also on database and other technical issues. Some IR content.
- TREC-n. Proceedings of the nth Text REtrieval Conference. (Here's a local copy)
The eighth in this series will be published later this spring. Although not refereed, it has become a standard place for publication of high-quality IR evaluation papers. The most important new results in computer-oriented IR are now first published in this forum.
Reviews and Indexes
Annual Review of Information Science and Technology.
The standard review source in the field.
Perspectives on...: Journal of the American Society for Information Science.
An irregular series of grouped articles on special topics within the Journal. A number of articles on some topic of current interest are put together by a special editor for that topic.
Progress in Documentation: Journal of Documentation.
An irregular series of review articles published in this journal. Very high quality.
Trends in ...: Information Processing and Management.
A good, irregular series of review articles.
The home page for ACM SIGIR has a great deal of information on it, with links to many other resources in information retrieval.
Web resources
Semantic Web
Statistics / Research methods
Information Visualization
You can get more information about this course at http://scils.rutgers.edu/~muresan/551_IR .