CMSC 476/676

Information Retrieval

Spring 2014

Charles Nicholas nicholas@umbc.edu

ITE 356
Office Hours: Monday and Wednesday 1:30-2:30, or by appointment.
410-455-2594  
TA: Prachi Bora
 
TA office: ITE 340
TA office hours: Tuesday and Thursday noon-1pm

The class meets in Sherman 151

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Did you ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.

Text and Handouts

The text will be Modern Information Retrieval, second edition, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. The book should be available at the UMBC bookstore (at least it's been ordered) as well as from Amazon. Details about which chapters will be covered, and when, will follow. The slides to be used in class will be based on those provided by the authors of the textbook. You can see those original slides at http://www.mir2ed.org. It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available. Suggestions to add to this list are welcome.

Grading

There will be a multi-phase programming project, details to be announced, worth 50% of the grade. There will be a mid-term exam, worth 25% of the grade. There will also be a writing project, worth 25%.

Graduate students (i.e., students enrolled in CMSC 676) will be expected to write a paper of the depth that might lead to a Master's Writing Project or Thesis. Graduate students will also be expected to present their writing projects at the end of the semester, and undergraduates are welcome to do so. These presentations will take the place of the final exam, and no final exam as such is planned.

Academic Integrity

"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."

What Happens Day by Day in the Spring 2014 Semester

As you can see, we will follow the textbook closely. I reserve the right to make minor changes along the way, but the basic structure will be as follows. Some chapters are long enough or important enough to warrant coverage over two lectures.

Tuesday 1/28
Chapter 1 Introduction
Thursday 1/30
discuss writing project: a topic that interests you, which you can describe in a few sentences, and 3-4 sources of information
students in 676 are expected to write a paper of the depth that could be expanded into a M.S. Writing Project or Thesis
students in 676 are also expected to present their work to the class at the end of the semester
the writing project takes the place of the final exam
Chapter 2 User Interfaces for Search
we covered the first 1/3 of these slides
Appendix A Search Engine Comparison
divided the class into small groups to look at Indri, MG4J, Terrier, and Zettair.
Tuesday 2/4
discuss search engine experience
not much luck! so try again, with a different engine this time
finish Chapter 2 slides
Thursday 2/6
begin the slides from Chapter 3 Modeling
I was able to build Zettair using VirtualBox and Ubuntu 11.04. Note that you'll need to install zlib before Zettair will compile to completion.
I gave a short demo of Zettair's indexing and retrieval operations
discuss first phase of programming project
Tuesday 2/11
continue the slides from Chapter 3 Modeling
Hints for the term project: 1. pick a topic that interests you; 2. describe it in a few sentences; 3. list some references, including the textbook (MIR2e) and seminal papers
Please send your term project idea to me in an email to nicholas@umbc.edu by Thursday of next week, which is February 20.
finding information: Google, Google Scholar or its Bing-workalike, find seminal papers, don't forget to look at patents
You may find it helpful to look at this spreadsheet, which demonstrates some tf.idf concepts (xls)
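For those who would rather experiment in code than in a spreadsheet, here is a minimal sketch of one common tf.idf weighting, raw term frequency times log2(N/df). This is an assumption on my part about the variant: the spreadsheet and the textbook discuss several weighting schemes, and this sketch is not necessarily the one used there. The function name and the whitespace tokenization are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log2(N / df).

    Assumptions (not from the course materials): raw term counts for tf,
    lowercase whitespace tokenization, and no stopword removal.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weights
```

Under this weighting, a term that appears in every document gets weight 0, because log2(N/N) = 0, while a term found in only one document gets the largest idf.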
Thursday 2/13
This is a SNOW DAY, so we'll push these topics into next week.
Finish Chapter 3. We'll skip several topics.
The coverage of Latent Semantic Analysis is a little thin. So add Ian's LSI slides (pdf). The seminal paper on LSI is Deerwester et al. My example.
Tuesday 2/18
Release Phase 2 of Project
Thursday 2/20
Continue Chapter 3
Tuesday 2/25
Continue Chapter 3
Thursday 2/27
Begin Chapter 4 Retrieval Evaluation
Tuesday 3/4
Phase 2 of the project is due, but the deadline is extended until noon on Wednesday 3/5.
Finish Chapter 4
Thursday 3/6
Chapter 5 Relevance Feedback and Query Expansion
Release phase 3 of project
Tuesday 3/11
Chapter 6 Documents: Languages and Properties
Thursday 3/13
UMBC is closed due to water main problems!
The mid-term exam I gave in 2009 (pdf). This is an example of the kind of questions to expect. We have not covered exactly the same topics this semester.
The exam on April 1 will be open book and open notes
As you study, be advised that I'll be asking questions drawn from only part of the textbook, as follows: Chapter 1, sections 1.1-1.4; Chapter 2, sections 2.1-2.3; Chapter 3, sections 3.1-3.2, 3.4.2, and 3.5.1; Chapter 4, sections 4.1-4.3.2, and 4.4; Chapter 5, sections 5.1-5.3; Chapter 6, sections 6.1-6.4.

Spring Break 3/16-3/23

Tuesday 3/25
More Chapter 6, and discuss the upcoming exam, now scheduled for April 1
Project 4 is now available
Levenshtein distance worksheet (pdf)
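As a companion to the worksheet, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance, assuming unit cost for each insertion, deletion, and substitution. The worksheet may lay out the full table; this sketch keeps only two rows of it at a time. The function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b (unit costs assumed)."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]
```

For example, levenshtein("kitten", "sitting") is 3: substitute k with s, substitute e with i, and insert g at the end.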
Thursday 3/27
Project 3 is due
More Chapter 6
Tuesday 4/1
Midterm exam, open book and open notes
Thursday 4/3
Discuss Project 4
Brief discussion of clustering
Tuesday 4/8
Term paper update: typical structure, formatting guidelines
Format for student presentations: You can use your own, but I can suggest:
Chapter 7 Queries: Languages and Properties
Thursday 4/10
return exams
Tuesday 4/15
Chapter 8 Text Classification
Thursday 4/17
Project 5 is available
The handout illustrating Naive-Bayes classification and email
Chapter 9 Indexing and Searching
Exercise: Implement the Naive-Bayes computation from the handout, using the spreadsheet package of your choice. See if my arithmetic is correct! Feel free to make up your own test documents and classify them as "ham" or "spam". Due Tuesday, April 29.
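If you want to cross-check your spreadsheet against code, here is a minimal sketch of a multinomial Naive-Bayes classifier with add-one (Laplace) smoothing, working in log space. Treat the details as assumptions: the handout may use a different event model or smoothing, and the function names and the whitespace tokenization are made up for illustration.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs is a list of (text, label) pairs.
    Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()  # number of training documents per label
    word_counts = {}          # label -> Counter of word occurrences
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return (best_label, scores), where scores maps label -> log score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior P(label).
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothed estimate of P(tok | label).
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores
```

To use it, call train on a handful of made-up (text, "ham") and (text, "spam") pairs, then pass a new message and the returned model to classify. Summing logs rather than multiplying probabilities avoids underflow on long messages; exponentiate the scores if you want to compare them with products computed in a spreadsheet.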
Tuesday 4/22
Chapter 10 Parallel and Distributed IR
Thursday 4/24
Chapter 11 Web Retrieval
Chapter 12 Web Crawling
You may find this video called Life at the Googleplex interesting.
Tuesday 4/29
A special topic: authorship attribution. The talk Who Wrote This Document?
Thursday 5/1
To give people extra time to work on their term papers and presentations, NO CLASS TODAY
Topics for remaining lectures may include:
Tuesday 5/6
Student Presentations: Abishek Sethi (pdf), Mihir Kelkar (pdf), Jacob Rettig (pdf), Primal Pappachan (pdf)
Thursday 5/8
Student Presentations: Jihad Ashkar (pdf), David Harris (pdf), Kaavya Srinivasan (pdf), Shrinivas Kane (pdf)
Tuesday 5/13
CLASS WILL START AT 1PM TODAY
Student Presentations: Hang Gao (pdf), Babur Khan (pdf), Ryan Murphy (pdf), John Seymour (pdf), Zhiguang Wang (pdf)
Tuesday 5/20
Term Papers are due. I prefer PDF submitted by email. Don't send .doc files.
 
No final exam is planned.