UMBC CMSC 491/691-I Spring '01
Last updated: 7 Mar 2001

Syllabus - Information Retrieval

Week      Topic                                          Readings                  Assignment
 1: 1/29  Introduction to IR, course overview;           Ch. 1 and papers          HW 1 out (due week 2)
          user- vs. system-centered IR
 2: 2/5   Text processing                                Sec. 7.1-2 and papers     Project Phase I out
 3: 2/12  Inverted indices                               Sec. 8.1-2 and papers
 4: 2/19  Compression of indices and text                Rest of Ch. 7 and papers
 5: 2/26  IR Models: Boolean, Vector space               Ch. 2 and papers
 6: 3/5   IR Models: Vector space (cont.), Probabilistic Ch. 2 and papers          Project Phase II out
 7: 3/12  Probabilistic (cont.), Inference nets
 8: 3/19  SPRING BREAK
 9: 3/26  Evaluation, Test Collections                   Ch. 3, TREC web site      HW 2 out, Project Phase III out
10: 4/2   Relevance feedback                             Ch. 5, papers
11: 4/9   Software Agents and IR                         Papers
12: 4/16  Information filtering; collaborative filtering Papers
13: 4/23  Web Search                                     Ch. 13, papers
14: 4/30  User interfaces for IR                         Ch. 10, papers
15: 5/7   Presentations                                                            Project due
16: 5/14  Presentations

Readings

The primary text for the course is Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. This new text covers a lot of ground and is quite up-to-date. To fill in the details, a number of papers will also be assigned; they will be made available either from this web page (accessible from UMBC only) or handed out in class. These papers are either seminal readings on their topics or cutting-edge results from recent conferences and journals. They are required reading for graduate students, and recommended reading for undergraduates.

Homework

Homework assignments will be few, and will mostly ask you to write a short program that demonstrates something discussed in class, run it on some sample data, and look at the results. Each homework should require an afternoon or less to complete.

Homework 1 is to generate some corpus statistics. The solutions are here.
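To give a flavor of what "corpus statistics" might mean (the exact statistics HW 1 asks for are in the assignment itself; this is just an illustrative guess), here is a small sketch that counts tokens, measures vocabulary size, and lists the most frequent terms:

```python
from collections import Counter

def corpus_stats(text):
    # An illustrative guess at basic corpus statistics: token count,
    # vocabulary size, and the most frequent terms (the frequency list
    # is handy for eyeballing Zipf's law).
    tokens = text.lower().split()
    freq = Counter(tokens)
    return {"tokens": len(tokens),
            "vocabulary": len(freq),
            "top": freq.most_common(3)}
```

On a real corpus you would read files rather than a string, and tokenize more carefully than str.split().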

Homework 2 is to evaluate the retrieval effectiveness of two web search engines.
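One simple effectiveness measure you might use for such a comparison (the assignment itself specifies what to measure; this is only an example metric) is precision at rank k:

```python
def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k returned documents that are judged relevant.
    # `ranked` is the engine's result list; `relevant` is the set of
    # document IDs you judged relevant for the query.
    return sum(1 for d in ranked[:k] if d in relevant) / k
```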

Project

In-depth information on the course project is here. A brief description follows.

For the course project you will design and implement your own information retrieval system. The project will have three phases. In Phase I, you will build the indexing component, which will take a large collection of text and produce a searchable, persistent data structure. In Phase II, you will add the searching component, according to one of the models discussed in class. In Phase III, you will add some advanced functionality of your choice. The project is due at week 15.
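In spirit, Phase I boils down to something like the following sketch (whitespace tokenization and an in-memory dictionary are simplifying assumptions; a real submission needs real tokenization and a persistent, disk-based structure):

```python
from collections import defaultdict

def build_index(docs):
    # In-memory inverted index: term -> sorted list of document IDs.
    # A real Phase I index must also be made persistent (e.g. written
    # to disk) and should tokenize more carefully than str.split().
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_index(docs)
```

Phase II then answers queries by looking up each query term's posting list in this structure and combining the lists according to the retrieval model.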

Phases I and II are required. Each will consist of a small set of milestones which you should tackle in order. To encourage you not to put off the project until the last week of class (and who would do that?), there will be periodic benchmarks where you can test your system against everyone else's on some standard data sets. A specific project specification will be given out at each phase.

Phase III is required of graduate students, and "encouraged", but not strictly required, for undergraduates. (You'll see why once you get started: Phases I and II are a lot of work.)

You will be graded on what you complete, given that Phases I and II are required.

Grading

Project              70%  (each phase in equal proportion)
Homework             20%
Class Participation  10%

Late Policy

The project is due on May 9th (the second session in the second-to-last week of class). That doesn't leave much time for lateness; if you're planning an incomplete, please see me before this point or I may not allow it.

Homework is generally due on the Monday after it is assigned. Late homework will not be accepted without prior arrangement. If you know you will miss a class, check the web page to make sure nothing is due. If you are unexpectedly absent, email me.

Papers to Read

Week 1

The Seven Ages of Information Retrieval, by Michael Lesk (1995). A brief, systems-oriented history of IR.

As We May Think, by Vannevar Bush (1945). A visionary article that makes a lot of interesting predictions.

Information Retrieval, 2nd Edition, Chapter 1, by Keith van Rijsbergen (1979). Along with Salton's 1983 book, this was the textbook on IR for a long time. Dr. van Rijsbergen has made it accessible on the web since it's out of print. Chapter 1 gives an interesting counterpoint to our textbook's introduction.

Week 2

Chapters 7 and 8 from Information Retrieval: Data Structures and Algorithms, by W. Frakes and R. Baeza-Yates, were handed out in class to supplement the material in chapters 7 and 8 of our textbook. The link above is the book homepage, which has all the source code from the book. For example, you can download Frakes' DFA generator from there and use it in your project.

"An algorithm for suffix stripping", by M. F. Porter, is the original paper on the eponymous Porter Stemmer. You can implement the stemmer from his paper, if you like, or you can download some code (in C, Java, Perl, and BCPL!) from Porter's own page. I will make a version available that fixes several known idiosyncracies.

Week 3

Chapter 3 from Frakes and Baeza-Yates. This chapter, by Harman et al., discusses B-trees, inverted index construction, and a sophisticated inversion algorithm.

Ternary Search Trees (Dr. Dobb's Journal, April 1998) by Bentley and Sedgewick, is a very nice article on ternary digital search trees with code examples. It has a link to their homepage where you can find the longer symposium paper, which will be of interest to graduate students. TSTs are a compromise between (n-ary) tries and binary search trees.
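The core idea translates directly into code. This is a minimal sketch of TST insert and search (Bentley and Sedgewick's article gives C versions; this Python rendering is mine, and omits niceties like handling empty keys):

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None  # <, ==, > subtrees
        self.value = None                   # set when a key ends here

def tst_insert(node, key, value, i=0):
    # Compare key[i] to this node's character; descend lo/hi on
    # mismatch, and advance to the next character only on a match.
    ch = key[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, key, value, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, key, value, i)
    elif i < len(key) - 1:
        node.eq = tst_insert(node.eq, key, value, i + 1)
    else:
        node.value = value
    return node

def tst_search(node, key, i=0):
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i < len(key) - 1:
            node, i = node.eq, i + 1
        else:
            return node.value
    return None

root = None
for i, w in enumerate(["cat", "cap", "dog"]):
    root = tst_insert(root, w, i)
```

Note how only the middle (eq) branch consumes a character, which is what makes a TST cheaper in space than an n-ary trie while staying faster than a plain binary search tree on string keys.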

"Prefix B-Trees", by Bayer and Unterauer.

Week 4

A section from Managing Gigabytes, Chapter 3, on Huffman coding and computing canonical Huffman codes.
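The canonical-code construction from that section is short enough to sketch: given each symbol's code length (from the Huffman tree), sort symbols by length, hand out consecutive integer codewords, and shift left whenever the length increases. This rendering is mine, not the book's code:

```python
def canonical_codes(lengths):
    # lengths: symbol -> Huffman code length (must satisfy Kraft's
    # inequality). Assigns canonical codewords: symbols sorted by
    # (length, symbol) get consecutive integers, shifted left each
    # time the code length grows.
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in syms:
        code <<= lengths[s] - prev_len
        codes[s] = format(code, "0{}b".format(lengths[s]))
        prev_len = lengths[s]
        code += 1
    return codes

codes = canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3})
```

Canonical codes matter for IR because the decoder only needs the code lengths, not the whole tree, which keeps the compressed-index overhead small.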

Week 5

"Ranking Algorithms" (first half), by Donna Harman, is chapter 14 in the Frakes and Baeza-Yates book. It gives a good discussion of the tradeoffs and choices among different term-weighting strategies.

Week 6

Please read Okapi at TREC-3, by Robertson et al. This presents the "BM25" weighting scheme, which is currently the best-performing "classical" probabilistic ranking algorithm. Their TREC-5 paper gives some later perspective on the parameter settings. These papers give concrete performance numbers for the "2-Poisson" paper handed out in class.
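The BM25 term-weighting function itself is compact. A sketch of the basic form (query-term weighting and the k3 query component are omitted; k1 and b are the tuning parameters the TREC-5 paper revisits, shown here at commonly used values):

```python
import math

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    # Basic BM25: sum over query terms of
    #   idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    # where idf(t) = log((N - df(t) + 0.5) / (df(t) + 0.5)),
    # dl is the document length and avgdl the average length.
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

The b parameter controls how strongly long documents are penalized (b = 0 turns length normalization off), and k1 controls how quickly repeated occurrences of a term saturate.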


Ian Soboroff
Last modified: Wed Mar 28 15:27:44 EST 2001