CMSC 476/676

Information Retrieval

Spring 2015

Charles Nicholas nicholas@umbc.edu

ITE 356
Office Hours: Monday and Wednesday 2-3:30 or by appointment.
410-455-2594  
TA:
Chi Zhang (chzhang1@umbc.edu)
TA office: ITE 353 TA office hours: Tuesdays and Thursdays 2:30-3:30

The class meets in ITE 231

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Did you ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.

Text and Handouts

The text is Search Engines: Information Retrieval in Practice, by Croft, Metzler and Strohman. If the book is available in electronic format, that should be fine. The slides provided by the authors are available. The book should available at the UMBC bookstore (at least it's been ordered) as well as Amazon. Details about which chapters will be covered, and when, will follow. The slides to be used in class will be based on those provided by the authors of the textbook, but I may modify them from time to time. It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available. Suggestions to add to this list are welcome.

The text from last year, Modern Information Retrieval, second edition, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto., may be useful as a reference. You can see the slides for that book at http://www.mir2ed.org. The web page for last year's course is also available.

Grading

There will be a multi-phase programming project, details to be announced, worth 50% of the grade. There will be a mid-term exam, worth 25% of the grade. There will also be a writing project, worth 25%.

Graduate students (more precisely, students enrolled in CMSC 676) will be expected to write a paper of the depth that might lead to a Master's Writing Project or Thesis. Graduate stduents will also be expected to present their writing projects at the end of the semester, and undergraduates are welcome to do so. These presentations will take the place of the final exam, and no final exam as such is planned.

Academic Integrity

"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."

 

What Happens in the Spring 2015 Semester

We will follow the textbook closely. I reserve the right to make minor changes along the way, but the basic structure will be as follows. Some chapters are long enough or important enough to warrant coverage over two lectures.

We will cover the chapters in the text in order, at the rate of approximately one chapter per week. The 676 presentations will take place in early May. The following is subject to change as progress of the class warrants.

Week Dates in 2015 Topics/Activities
1 1/27, 1/29 Introduction
Chapter 1 (ppt)
discuss writing project: a topic that interests you, which you can describe in a few sentences, and 3-4 sources of information
students in 676 are expected to write a paper of the depth that could be expanded into a M.S. Writing Project or Thesis
students in 676 are also expected to present their work to the class at the end of the semester
2 2/3, 2/5

Chapter 2 (ppt)
try out a search engine, such as Galago (used as an example in the textbook). or swish, or Terrier, or zettair.
Short demo of the search engine at uspto.gov
discuss first phase of programming project

3 2/10, 2/12 Chapter 3 (ppt)
Hints for term project:
1. pick a topics in which you are interested
2. describe it in a few sentences
3. find 3-5 references, including the course textbook(s) (mir2e covers some additional material), and seminal papers
Please send your term project idea to me in an email to nicholas@umbc.edu by Thursday of next week, 2/19/2015
finding information: Google, Google Scholar or its Bing-workalike, find seminal papers, don't forget to look at patents
4 2/17, 2/19 Chapter 4 (ppt)
Release Phase 2 of Project
5 2/24,2/26

Chapter 5 (ppt)
Slides from Modern Information Retrieval (Baeza-Yates) Chapter 3 (pdf)
You may find it helpful to look at this spreadsheet, which demonstrates some tf.idf concepts (xls)

6 3/3, 3/5 a few topics from Chapter 6 (ppt)
A snow day on Thursday!
7 3/10, 3/12 Phase 2 of project is due
Release phase 3 of project
Levenshtein distance worksheet (pdf)
  3/17. 3/19 Spring Break
8 3/24, 3/26 a few more topics from Chapter 6 (ppt)
Chapter 7 (ppt)
The midterm exam will really be on Thursday of this week
Topics include material from the slides presented in class and textbook Chapters 1-5, PLUS the Levenshtein distance.
The mid-term exam I gave in 2009 (pdf)
The mid-term exam I gave in 2014 (pdf)
The midterm exam will be open book and open notes
9 3/31, 4/2

Finish Chapter 7
Begin Chapter 8 (ppt)
Project 4 is now available,and is due April 16

10 4/7, 4/9
Finish Chapter 8
11 4/14, 4/16

Chapter 9 (ppt)
The handout illustrating Naive-Bayes classification and email
Project 5 is now available.

12 4/21, 4/23

Chapter 10 (ppt) as time permits
Chapter 11 (ppt) as time permits

On Tuesday, a special topic: authorship attribution. The talk Who Wrote This Document

No class on April 23, CCDC since Dr. Nicholas will be in San Antonio as the Cyberdawgs compete at the National CCDC!

Schedule your presentations using this Google Spreadsheet

Format for student presentations: You can use your own, but I can suggest:

  • Title and your name
  • Brief introduction, and how you got interested (1-2 slides)
  • Discussion of basic definitions and concepts (1-2 slides)
  • Survey of related work, "concept by concept" better than "paper by paper" (1-2 slides)
  • Some basic example (2-3 slides, combine with next bullet)
  • Discussion of what still needs to be done in this area.
  • Last slides for references
  • Allow time for Q&A
  • Assuming you use slides, you WILL get the PDF file to me in advance.
  • Presentations will be limited to 10-12 minutes.
  • No more than ten slides

13 4/28, 4/30

Student presentations for Tuesday

Student presentations for Thursday

For each speaker, fill in this presentation feedback form.

14 5/5, 5/7

Project 5 is due on May 5. Projects submitted on or before May 1 are eligible for extra credit.

Student presentations for Tuesday

Student presentations for Thursday

For each speaker, fill in this presentation feedback form.

White papers are now due Monday May 11.

Papers emailed to me on or before May 4 are eligible for extra credit.
I will (for once) accept both PDFs and .doc files.
Approximately ten pages, double-space, judicious use of figures and tables

15 5/12 Student presentations - reserve day
classes ends May 12
    NO FINAL EXAM, the writing project takes the place of the final exam
    The coverage of Latent Semantic Analysis is a little thin. So add Ian's LSI slides (pdf). The seminal paper on LSI is Deerwester et al . My example.
   

Topics for remaining lectures may include:
Cross-Language IR (See McNamee and Mayfield on n-grams for cross-language IR. Their paper, and with my notes.
Duplicate Document Detection. As a source for information on authorship, plagarism, and duplicate detection, see SEPLNĀ“09 Workshop PAN. Uncovering Plagiarism, Authorship and Social Software Misuse.