CMSC 476/676

Information Retrieval

Spring 2009

For lots of up-to-date information, see the course blog http://cmsc476sp09.blogspot.com/
You'll need permission to post to the blog, but anybody can read it. I've invited all students in the class as of Thursday 1/29/09 to join the blog as authors, using their UMBC computer accounts.

Charles Nicholas nicholas@umbc.edu

ITE 325G
Office Hours: Tuesdays and Thursdays, 4:30-5:30.
410-455-2594
Contact Ms. Gethmann at gethmann@umbc.edu, or x52713, to make an appointment at other times.
TA: Mr. Don Dimitroff
Don's URL: http://www.csee.umbc.edu/~dondim1/
ITE 351
Don's office hours are 2:30-5:30 on Tuesdays and Thursdays.

The class meets in ACIV 014

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Ever wondered how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and it has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, and searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (CMSC 441 or 641) is recommended. Linear algebra (MATH 221) and statistics (STAT 355) are recommended but not required; they give background that will be helpful in understanding many IR concepts.

Text and Handouts

The text will be Manning et al., available at the UMBC bookstore (at least it's been ordered) as well as from Amazon. Details about which chapters will be covered, and when, will follow. The text is also available online:
http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
The slides to be used in class will be based on those provided by the authors of the textbook. You can see those original slides at http://www-csli.stanford.edu/~hinrich/newslides.html
In class I will use a local version of these slides, which I will modify as needed! It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available.

Grading

There will be a multi-phase programming project, details to be announced, worth 50% of the grade. The programming assignments can be found in this term project directory. The corpus of text documents is available as this compressed tarfile, and in this directory.

Exams and quizzes will be another 30%. There will also be a writing project, worth 20%. Presentations on the programming and/or writing projects will take the place of the final exam. No final exam is planned.

Academic Integrity

"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."

Schedule

We will follow the text fairly closely. Each of the topics listed below will require a lecture or two, depending on how much time I want to devote to that topic and on the availability of additional readings. This course was last taught two years ago. Here's the web site from that semester. Topics to be covered include most if not all of the following:

What Happens Day by Day

Tuesday 1/27
first day of class
Chapter 1 slides
Thursday 1/29
discuss first phase of programming project
Thursday 2/5
Chapter 2 slides
Tuesday 2/10
Release phase 2 of project
Thursday 2/12
Levenshtein distance worksheet (pdf)
tf.idf spreadsheet (xls), UPDATED 2/23
tolerant retrieval Chapter 3 slides
Tuesday 2/17
index construction Chapter 4 slides
Thursday 2/19
the marked-up MapReduce paper (pdf)
Tuesday 2/24
Phase 2 of project is due
compression Chapter 5 slides
Thursday 2/26
Release Phase 3 of project
tf.idf Chapter 6 slides
Tuesday 3/3
more on Phase 3 and tf.idf
Thursday 3/5
cover Chapter 7 slides
list some exercises that will help with the midterm, scheduled for March 26
Tuesday 3/10
begin Chapter 8 slides
Thursday 3/12
finish the slides from Chapter 8. I recommend the TREC overview paper (pdf)
Tuesday 3/17 and Thursday 3/19
Spring Break
Tuesday 3/24
More on Chapter 8, discuss the writing project
Thursday 3/26
Midterm exam
Tuesday 3/31
More on writing project, and start Chapter 9 slides.
Thursday 4/2
Finish Chapter 9, return exams
Tuesday 4/7
White paper topics due
Thursday 4/9
Begin discussion of text classification. The example involving Naive Bayes and email.
Tuesday 4/14
Class canceled due to power problems at UMBC
Thursday 4/16
Discuss assignment 5, and talk about clustering. The clustering tutorial. Start some slides from Chapter 17.
Tuesday 4/21
More about text classification and clustering. Use some slides from Chapter 10.
Thursday 4/23
Converse with Mike Wiacek at Google. He mentions the use of SQL in FriendFeed, and this paper.
Tuesday 4/28
Briefly discuss vector classification from Chapter 11. Maybe more slides from Chapter 17. Discuss Damashek's paper.
Thursday 4/30
Discuss McNamee and Mayfield on n-grams for cross-language IR. Their paper, and a copy with my notes. Start on latent semantic analysis. The slides from Chapter 18. The JASIS paper by Deerwester et al. My example. As a source for information on authorship, plagiarism, and duplicate detection, see SEPLN'09 Workshop PAN: Uncovering Plagiarism, Authorship and Social Software Misuse.
Tuesday 5/5
More on LSA and cross-language IR. The Dumais paper on LSA and CLIR. I mentioned the use of IR and visualization techniques in information assurance, and recommend the botnet paper from Kemmer et al. at UCSB.
Thursday 5/7
Crawling and link analysis. Slides from Chapter 13, Chapter 14, and Chapter 15.
Tuesday 5/12
You may find this video called Life at the Googleplex interesting. Last day of class, white papers due.