Prerequisites: CMSC 341 or equivalent. Java and Linux experience is a plus.

Course Objectives

The goal of this course is to give students practical, hands-on experience developing distributed, data-centric applications using projects within the Hadoop ecosystem. Topics include a primer on distributed computing, instruction on the architecture and usage of the Hadoop Core, in-depth analysis of various MapReduce Design Patterns, and an introduction to over a dozen Hadoop Ecosystem projects. Students will put these ecosystem projects into practice through homework assignments that build on one another, resulting in a complete end-to-end data solution using Hadoop projects. Students are free to use any programming language they are comfortable with that is supported by the project or available on GitHub, though knowledge of Java will be very useful.

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates and me. I encourage you to post your questions on Piazza rather than emailing them to me. If you have any problems or feedback for the developers, email team@piazza.com.

Find our class page at: https://piazza.com/umbc/spring2016/cmsc49105/home

Topics

  • Distributed Computing Primer
  • Hadoop Core
  • Data Formats
  • Distributed Message Queues
  • High-Level MapReduce APIs
  • MapReduce Design Patterns
  • Scalable Machine Learning
  • Bloom Filters
  • Key/Value Stores
  • Distributed Application Coordination
  • Real-Time Stream Processing
  • Workflow Management
  • SQL on Hadoop

Grading

Homework 50%
Midterm Exam 25%
Final Exam 25%

Homework

Students will have a handful of homework assignments that build on one another to create a full end-to-end data pipeline using Hadoop. Activities include:

Readings

Students will have a handful of readings to complete to strengthen their understanding of the corresponding topic. Students are expected to complete each reading and respond with one comment or question on the associated Piazza post. Cumulatively, the readings count as a single homework assignment (100 pts). Failing to post on a reading deducts 100/x points from the overall grade, where x is the total number of readings.

Late Policy

Assignments are to be submitted electronically by 11:59 PM on the due date. Assignments submitted up to two days late will be penalized 15 percent of the possible score. Assignments more than two days late will receive a score of 0. Each student gets one free "late" (i.e. up to two days late without penalty, but still zero if later than two days) to apply to any of the assignments. Your free late must be claimed in writing via email.
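As an illustration only, the late policy above can be sketched as a small function. This is not an official grading script. The helper name, its parameters, and the sample due date are hypothetical, and two details are assumptions rather than stated policy: "two days" is read as 48 hours past the deadline, and a penalized score is not allowed to drop below zero.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(days=2)   # assumption: "two days" means 48 hours past the deadline
LATE_PENALTY_PERCENT = 15         # percent of the possible score, per the policy


def score_after_late_policy(earned, possible, due, submitted, use_free_late=False):
    """Return the recorded score for one assignment (hypothetical helper)."""
    if submitted <= due:
        return earned                      # on time: score is unchanged
    if submitted - due > LATE_WINDOW:
        return 0                           # more than two days late: zero, even with a free late
    if use_free_late:
        return earned                      # free late waives the penalty inside the window
    penalty = possible * LATE_PENALTY_PERCENT / 100
    return max(earned - penalty, 0)        # assumption: scores do not go below zero


due = datetime(2016, 3, 1, 23, 59)         # hypothetical due date
print(score_after_late_policy(90, 100, due, datetime(2016, 3, 2, 9, 0)))
```

Under these assumptions, an assignment worth 100 points that earns 90 and is submitted the morning after the deadline is recorded as 75, or as 90 if the free late is applied to it.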

UMBC Academic Integrity Policy

By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, the UMBC Integrity web page www.umbc.edu/integrity, or the Graduate School website www.umbc.edu/gradschool.