Course Homepage




Course Description

Data science is a field that involves data manipulation, analysis, and presentation, all at scale. It's typical for an organization to have a few terabytes of data, maintained for different purposes by different business units and stored in different formats, and for someone to have an idea about how that data might bring significant additional value. Data scientists are the bridge between the idea and the data, helping to extract latent value and often uncovering novel insights and new, beneficial uses for the data in the process.

The goal of this class is to give students hands-on experience with all phases of the data science process using real data and modern tools. Topics covered include data formats, loading, and cleaning; data storage in relational and non-relational stores; data analysis using supervised and unsupervised learning, with sound evaluation methods; data visualization; and scaling up with cloud computing, MapReduce, Hadoop, and Spark.

Tools

The core concepts of data science are programming-language independent, but Python has a powerful set of open-source tools for doing data science at scale, which we will leverage, as do many organizations both large and small. Specifically, we'll use Anaconda, which bundles "over 100 of the most popular Python, R and Scala packages for data science" and provides easy access to hundreds more through the conda package manager.

The elements of Anaconda that are most relevant to the tripartite structure of this course are (1) pandas (the Python Data Analysis Library), which provides ways to load data into a dataframe for easy manipulation and analysis, (2) scikit-learn, which is a set of "simple and efficient tools for data mining and data analysis", and (3) matplotlib, which is "a python 2D plotting library [that] produces publication quality figures in a variety of hardcopy formats and interactive environments".
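As a tiny taste of the workflow these libraries support, the sketch below loads a small table into a pandas dataframe and fits a scikit-learn decision tree. The column names and data are invented for illustration; in a real project you would read your own CSV file instead.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; in practice you'd call pd.read_csv("yourfile.csv").
csv = io.StringIO(
    "hours_studied,attended_review,passed\n"
    "2,0,0\n"
    "9,1,1\n"
    "5,1,1\n"
    "1,0,0\n"
)
df = pd.read_csv(csv)

# pandas: quick inspection of the loaded dataframe
print(df.describe())

# scikit-learn: fit a classifier on the dataframe's columns
X, y = df[["hours_studied", "attended_review"]], df["passed"]
model = DecisionTreeClassifier().fit(X, y)
print(model.predict(X))
```

The same dataframe feeds directly into scikit-learn, which is exactly the hand-off between parts (1) and (2) of the course's tripartite structure.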

Two other tools that will figure prominently are Spark, an extremely powerful framework for data manipulation in cluster computing environments, and Jupyter Notebook, a "web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text". We'll use the former to explore the power of cloud computing with Amazon's EC2, and the latter to interactively explore data and present results.

Please note that you will get your hands dirty in this class! You will be required to install software, read and use online documentation, solve problems by googling for answers, read posts on Stack Overflow, and so on. Data science is a broad and rapidly changing field, so one of the most valuable skills you can cultivate is the ability to dive in and solve problems, either your own or the client's. You will by no means be on your own: you'll have support from me, the TA, and your classmates. But the first thing I will ask when you come to me with a question is "what have you already tried?", and the list of things you've tried must have length ≥ k, where k is at least 2.

Grading

Grading is on a standard 10-point scale: you will get an A for 90.0 or more total points, a B for at least 80.0 but less than 90.0 points, and so on.
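The scale above maps mechanically to letter grades; here is a small sketch. The function name `letter_grade` is mine, and the D/F cutoff below 60 is my reading of "and so on", not something the syllabus states.

```python
def letter_grade(total_points: float) -> str:
    """Map total points to a letter on a standard 10-point scale.

    Hypothetical helper: the D/F cutoff below 60 is an assumption,
    not stated in the syllabus.
    """
    for cutoff, letter in [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]:
        if total_points >= cutoff:
            return letter
    return "F"
```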

Late policy: Homeworks are due at the start of class on the due date. A 10% penalty will be imposed on anything turned in after the start of class but within 24 hours, and an additional 10% for every 24 hours after that: 10% for one day late, 20% for two, 30% for three, and so on.
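The late penalty is easy to compute mechanically. The sketch below assumes 10% per started 24-hour period, capped at 100%; the function name `late_penalty` and the cap are my assumptions, not part of the course materials.

```python
import math


def late_penalty(hours_late: float) -> float:
    """Fraction of credit deducted under the late policy.

    Hypothetical helper: 10% per started 24-hour period,
    capped at 100% (the cap is an assumption).
    """
    if hours_late <= 0:
        return 0.0
    periods = math.ceil(hours_late / 24)
    return min(0.10 * periods, 1.0)
```

For example, turning in a homework one hour after class starts costs 10%, and 25 hours after costs 20%.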

The homeworks will be a blend of practical exercises and questions that cement conceptual knowledge. In the project you must choose a dataset or datasets and explore them, describe them qualitatively and quantitatively, solve some problem of value, and explain the results in ways that are precise and clear. Imagine that you're trying to get a job as a data scientist and this project will be part of your interview portfolio.

For the project you cannot use Kaggle or other competition datasets for their originally intended purpose. My experience has been that when students use, for example, a Kaggle recommender system dataset to build a recommender system, they don't have to struggle much with the data. You can mash up Kaggle datasets and do something new, but you cannot just do what everyone else is doing in a Kaggle competition.

Academic Integrity

By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC’s scholarly community, in which everyone’s academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult UMBC policies or the Faculty Handbook (Section 14.3). For graduate courses, see the Graduate School website.

Schedule

Week | Topics | Notes
01 | Course overview, introduction to data science, setting up your environment (Anaconda and Jupyter Notebook) | Class Thursday only (start of semester)
02 | Introduction to Pandas and dataframes, CSV, JSON, and minimal visualization capabilities |
03 | More visualization; data loading, cleaning, summarization, and outlier detection | Reading; Homework 1 assigned
04 | SQL, NoSQL, key/value stores, connecting to a database from Python | NoSQL reading; SQL slides
05 | Building models, trees for classification, scikit-learn | Decision tree reading; Homework 2 assigned
06 | Trees for regression, linear regression |
07 | Logistic regression, support vector machines | Project 1-pager due; logistic regression slides; logistic regression reading; SVM readings
08 | Evaluation, cross-validation, overfitting, practical concerns | Mid-term exam on Thursday, October 17th (covers weeks 1-7); slides on statistical tests; Homework 3 assigned
09 | Clustering, dimensionality reduction, practical concerns | Clustering reading; dimensionality reduction slides and PCA reading
10 | Data visualization | Slides; Homework 4 assigned
11 | Cloud computing, scaling up, Amazon EC2 |
12 | MapReduce and Hadoop | MapReduce reading; Homework 5 assigned
13 | Spark (the MapReduce killer) | Slides
14 | Spark, part 2 | Class Tuesday only (Thanksgiving)
15 | Topics that spilled over from prior weeks (e.g., EC2, a little Spark) | Bias/variance slides; Homework 6 assigned
Dec 3 | Final exam review |
Dec 5 | Project presentations |
Dec 10 | Project presentations |
Dec 12 | Final exam, 10:30am - 12:30pm |
Dec 18 | Project writeup due by 11:59pm via Slack to me (Oates) |