UMBC CMSC 491/691-I Spring '01

Information Retrieval

Note: Information on the course during Spring '01 was disseminated both here and on BlackBoard. The information on BlackBoard is no longer available. Also, some links may be broken.

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Ever wondered how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts. Undergraduates will be expected to cover the basic material in the textbook and the programming project. Grad students will also be expected to read additional papers (indicated in class), and implement something in their project from at least one of them.

News

28 Mar	Homework 2 has been released and is due Monday. In it, you will search for documents on the web to evaluate the performance of two search engines.
15 Mar	I have written some example code for computing Okapi weights for the examples done in class (although the code can certainly be used to compute arbitrary weights... just change the constants!). The code is in Emacs Lisp; load the file into Emacs and follow the directions in the file.
7 Mar	Phase II of the project is now "out" (the benchmarks have been defined, so the information on the Project page is complete). Also, I have made some slight changes to the phase I benchmarks. Web-browsable versions of the course slides are now available from the Syllabus. They aren't as beautiful as they are during class, but they're certainly usable if you miss a lecture.
20 Feb	The Project Specification has been updated in two ways. First, the milestones now include suggested target dates for you to aim for. Second, some resources have been made available for Phase I: a stemmer and several stoplists.
14 Feb	The syllabus has changed slightly to reflect where we are in the course. This week we will finish discussing inverted index construction, and begin retrieval models next week.
7 Feb	The readings handed out on Monday (three chapters from Frakes and Baeza-Yates and Porter's 1980 paper) are now listed on the syllabus. Please read them before Monday. My solutions to Homework 1 have been posted. The project spec (v1.0) has been released. We will discuss it on Monday.
1 Feb	Attention GL users: Apparently AFS wasn't a big fan of my directory in `~ian/ir`. I've moved the data for HW1 to `~ian/../pub/ir`.
31 Jan	Homework 1 has been released. It's due on Monday, but shouldn't take more than an afternoon. I've also fixed the grading scheme in the syllabus, which didn't take into account the homework, and added some historical material to the syllabus. Please take the time to read the paper by Lesk; the chapter from van Rijsbergen and the Vannevar Bush article are recommended.

Vital statistics:

Instructor: Dr. Ian Soboroff
Time: MW 5:30-6:45pm
Room: LH1
Office Hours: ECS 214, MW 4:00-5:00pm or by appointment
Texts:
- Required: Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Required for grad students: papers as assigned, see syllabus
- Not required but a Darn Fine (and quite possibly Useful) Book: Managing Gigabytes, by Witten, Moffat, and Bell

Teaching Assistant: Fang Huang
Office Hours: Wednesdays, 3-5pm in ECS 334.

Is there still room for me?
