CMSC 676

Spring 2007


1/29 class meets in AC IV 015
This course was taught some years ago by Dr. Ian Soboroff. Here's the web site from that semester.
2/5 The Reuters 21578 corpus is available as a compressed file, reuters.tar.gz, or uncompressed in a directory.
Slides from Chapter 1

More on vector space model
Some interesting papers: Salton '75, Singhal '96, Singhal '97, Harman '95
A PDF file with Salton and Buckley's 1988 IPM article. This PostScript file is an earlier version of that paper, and some PostScript viewers have problems with it.
Version 2.0.0 of Lucene has been installed. A gzipped tarfile is available.

2/12 Lucene demo, Slides from Chapter 2 on the vector space model.
Homework 1 assigned, due in a week
2/14 class cancelled due to snow
2/19 More on Lucene, Homework 1 was due.
2/21 Talked about writing project topics.
2/26 More from Chapter 2 on the probabilistic model of retrieval.
2/28 A few pages taken from Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto. My handwritten notes on this excerpt. Homework 2 is assigned.
3/5 For an introduction to inference networks in IR, we'll look at Turtle's thesis. For an introduction to language models in IR, look at Ponte and Croft from SIGIR'98. Writing project proposals are due.
3/7 More about language models. My handwritten notes have been corrected.
3/12 Start covering latent semantic analysis. The seminal paper is Deerwester et al., from 1990. I have some PowerPoint slides on LSI from Ian, and some notes.
3/14 More on LSI, and an introduction to n-gram analysis. A seminal paper on n-grams is Damashek.
3/26 Slides on relevance feedback. The Salton and Buckley paper from 1990 is still cited, and may be the most lucid explanation. The gigablast search engine uses a (primitive?) RF approach.
3/28 More about n-grams. We have a paper by McNamee and Mayfield, as published and with my written notes.
4/2 Use the clustering tutorial found at
4/4 More on clustering. For an excellent description of k-means, see Andrew Moore's tutorials. For a somewhat less accessible treatment of k-means and its variants, see Kogan, Teboulle and Nicholas.
4/9 Slides on Passage Based Retrieval (ppt). See text pages 113-115, and a recent paper by Liu and Croft.
4/11 Thesaurus Processing. Slides from Grossman and Frieder. Thesaurus processing is a form of query expansion, and an article by Susann Gauch is well-cited. A recent article by Abdelali, Cowie and Soliman on use of semantic expansion appeared in the May 2007 issue of IP&M.

Cross-language IR. Oard's tech report is an excellent overview, now somewhat dated. Dumais et al. discuss the use of LSI in CLIR. Another version of Dumais with my notes. An earlier version of their paper has color illustrations.
4/18 Writing Projects are due. Discuss programming project.
4/23 Web search. Draft of Chapter 19, Web Search Basics, from Manning, Raghavan and Schutze and their slides, parts 1 and 2
4/25 Web crawling. Draft of Chapter 20, Web crawling and indexes, from Manning, Raghavan and Schutze and their slides
4/30 Distributed IR. The survey paper by Jamie Callan. Google's cluster architecture is described in Barroso, where the focus is more on computer architecture and performance than IR per se.
5/2 Four short talks on writing projects. Speaker list and format posted to the blog. Joel Goldfinger (pdf) JC Montminy (pdf) Ron Roff (pdf) Mike Wilson (pdf)
5/7 Four short talks on writing projects. Beenish Bhatia (pdf) Luke Georgalas (pdf) Chris Morris (pdf) Mansi Radke (pdf)
5/9 Four short talks on writing projects. Jason Nappier (pdf) Jinny Nguyen (pdf) Stephen Rook (pdf) Zareen Syed (pdf)
5/14 Five short talks on writing projects. Programming project due. Sandor Dornbush (pdf) Marcin Kaminski (pdf) Justin Martineau (pdf of talk, pdf of poster) Mike Michniewski (pdf) Aparna Subramanian (pdf)
