CMSC 476/676
Information Retrieval
Spring 2014
Charles Nicholas
nicholas@umbc.edu
ITE 356
Office Hours: Monday and Wednesday 1:30-2:30, or by appointment.
410-455-2594
TA: Prachi Bora
TA office: ITE 340
TA office hours: Tuesday and Thursday noon-1pm
The class meets in Sherman 151
This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Did you ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.
This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.
You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.
Text and Handouts
The text will be Modern Information Retrieval, second edition, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. The book should be available at the UMBC bookstore (at least it's been ordered) as well as from Amazon. Details about which chapters will be covered, and when, will follow. The slides to be used in class will be based on those provided by the authors of the textbook. You can see those original slides at http://www.mir2ed.org. It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available. Suggestions to add to this list are welcome.
Grading
There will be a multi-phase programming project, details to be announced, worth 50% of the grade. There will be a mid-term exam, worth 25% of the grade. There will also be a writing project, worth 25%.
Graduate students (i.e. students enrolled in CMSC 676) will be expected to write a paper of the depth that might lead to a Master's Writing Project or Thesis. Graduate students will also be expected to present their writing projects at the end of the semester, and undergraduates are welcome to do so. These presentations will take the place of the final exam, and no final exam as such is planned.
Academic Integrity
"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."
What Happens Day by Day in the Spring 2014 Semester
As you can see, we will follow the textbook closely. I reserve the right to make minor changes along the way, but the basic structure will be as follows.
Some chapters are long enough or important enough to warrant coverage over two lectures.
- Tuesday 1/28
- Chapter 1 Introduction
- Thursday 1/30
- discuss writing project: a topic that interests you, which you can describe in a few sentences, and 3-4 sources of information
students in 676 are expected to write a paper of the depth that could be expanded into an M.S. Writing Project or Thesis
students in 676 are also expected to present their work to the class at the end of the semester
the writing project takes the place of the final exam
Chapter 2 User Interfaces for Search
we covered the first 1/3 of these slides
Appendix A Search Engine Comparison
divided the class into small groups to look at Indri, MG4J, Terrier, and Zettair.
- Tuesday 2/4
- discuss search engine experience
not much luck! so try again, with a different engine this time
finish Chapter 2 slides
- Thursday 2/6
- begin the slides from Chapter 3 Modeling
I was able to build Zettair using VirtualBox and Ubuntu 11.04. Note that you'll need to install zlib before Zettair will compile to completion.
I gave a short demo of Zettair's indexing and retrieval operations
discuss first phase of programming project
- Tuesday 2/11
- continue the slides from Chapter 3 Modeling
Hints for choosing a term project topic: 1. it interests you; 2. you can describe it in a few sentences; 3. you have some references, including the textbook (mir2e) and seminal papers
Please send your term project idea to me in an email to nicholas@umbc.edu by Thursday of next week, which is February 20.
finding information: Google, Google Scholar, or its Bing workalike
find seminal papers, and don't forget to look at patents
You may find it helpful to look at this spreadsheet, which demonstrates some tf.idf concepts (xls)
- Thursday 2/13
- This is a SNOW DAY, so we'll push these topics into next week.
- Finish Chapter 3. We'll skip several topics.
- The coverage of Latent Semantic Analysis is a little thin. So add Ian's LSI slides (pdf). The seminal paper on LSI is Deerwester et al. My example.
- Tuesday 2/18
- Release Phase 2 of Project
- Thursday 2/20
- Continue Chapter 3
- Tuesday 2/25
- Continue Chapter 3
- Thursday 2/27
- Begin Chapter 4 Retrieval Evaluation
- Tuesday 3/4
- Phase 2 of project is due, but an extension until noon on Wednesday 3/5 has been granted.
Finish Chapter 4
- Thursday 3/6
- Chapter 5 Relevance Feedback and Query Expansion
Release phase 3 of project
- Tuesday 3/11
- Chapter 6 Documents: Languages and Properties
- Thursday 3/13
- UMBC is closed due to water main problems!
The mid-term exam I gave in 2009 (pdf). This is an example of the kinds of questions to expect. We have not covered exactly the same topics this semester.
The exam on April 1 will be open book and open notes
As you study, be advised that I'll be asking questions drawn from only part of the textbook, as follows: Chapter 1, sections 1.1-1.4; Chapter 2, sections 2.1-2.3; Chapter 3, sections 3.1-3.2, 3.4.2, and 3.5.1; Chapter 4, sections 4.1-4.3.2, and 4.4; Chapter 5, sections 5.1-5.3; Chapter 6, sections 6.1-6.4.
Spring Break 3/16-3/23
- Tuesday 3/25
- More Chapter 6, and discuss the upcoming exam, now scheduled for April 1
Project 4 is now available
- Levenshtein distance worksheet (pdf)
- Thursday 3/27
- Project 3 is due
More Chapter 6
- Tuesday 4/1
- Midterm exam, open book and open notes
- Thursday 4/3
- Discuss Project 4
Brief discussion of clustering
- Tuesday 4/8
- Term paper update: typical structure, formatting guidelines
Format for student presentations: You can use your own, but I can suggest:
- Brief introduction, and how you got interested
- Survey of related work, "concept by concept" better than "paper by paper"
- Discussion of what still needs to be done in this area.
- IF you use slides, get the PDF file to me in advance.
- Presentations will be limited to ten minutes.
Chapter 7 Queries: Languages and Properties
- Thursday 4/10
- return exams
- Tuesday 4/15
- Chapter 8 Text Classification
- Thursday 4/17
- Project 5 is available
The handout illustrating Naive-Bayes classification and email
Chapter 9 Indexing and Searching
- Exercise: Implement the Naive-Bayes computation from the handout, using the spreadsheet package of your choice. See if my arithmetic is correct! Feel free to make up your own test documents and classify them as "ham" or "spam". Due Tuesday, April 29.
- Tuesday 4/22
- Chapter 10 Parallel and Distributed IR
- Thursday 4/24
- Chapter 11 Web Retrieval
Chapter 12 Web Crawling
You may find this video called Life at the Googleplex interesting.
- Tuesday 4/29
- A special topic: authorship attribution. The talk Who Wrote This Document?
- Thursday 5/1
- To give people extra time to work on their term papers and presentations, NO CLASS TODAY
- Topics for remaining lectures may include:
- Tuesday 5/6
- Student Presentations: Abishek Sethi (pdf), Mihir Kelkar (pdf), Jacob Rettig (pdf), Primal Pappachan (pdf)
- Thursday 5/8
- Student Presentations: Jihad Ashkar (pdf), David Harris (pdf), Kaavya Srinivasan (pdf), Shrinivas Kane (pdf)
- Tuesday 5/13
- CLASS WILL START AT 1PM TODAY
Student Presentations: Hang Gao (pdf), Babur Khan (pdf), Ryan Murphy (pdf), John Seymour (pdf), Zhiguang Wang (pdf)
- Tuesday 5/20
- Term Papers are due. I prefer PDF submitted by email. Don't send .doc files.
- No final exam is planned.