For lots of up-to-date information, see the course blog at http://cmsc476sp09.blogspot.com/
You'll need permission to post to the blog, but anybody can read it. I've invited all students enrolled in the class as of Thursday 1/29/09 to join the blog as authors, using their UMBC computer accounts.
| Charles Nicholas | nicholas@umbc.edu |
| ITE 325G | Office hours: Tuesdays and Thursdays, 4:30-5:30. |
| 410-455-2594 | Contact Ms. Gethmann at gethmann@umbc.edu, or x52713, to make an appointment at other times. |
| TA: Mr. Don Dimitroff | Don's URL: http://www.csee.umbc.edu/~dondim1/ |
| ITE 351 | Don's office hours are 2:30-5:30 on Tuesdays and Thursdays. |
This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Ever wonder how World-Wide Web search engines work? Ever wonder why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, and searching the Web.
This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.
You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.
The text will be Manning et al., available at the UMBC bookstore (at least it's been ordered) as well as from Amazon. Details about which chapters will be covered, and when, will follow. The text is also available online:
http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
The slides to be used in class will be based on those provided by the authors of the textbook. You can see those original slides at http://www-csli.stanford.edu/~hinrich/newslides.html
In class I will use a local version of these slides, which I will modify as needed! It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available.
There will be a multi-phase programming project, details to be announced, worth 50% of the grade. The programming assignments can be found in this term project directory. The corpus of text documents is available as this compressed tarfile, and this directory.
Exams and quizzes will be another 30%. There will also be a writing project, worth 20%. No final exam is planned; presentations on the programming and/or writing projects will take its place.
"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."
We will follow the text fairly closely. Each of the topics listed below will require a lecture or two, depending on how much time I want to devote to that topic and on the availability of additional readings. This course was last taught two years ago. Here's the web site from that semester. Topics to be covered include most if not all of the following: