Homework 7

In this homework you'll write a couple of simple Spark programs to compute statistics on a graph. The first thing to do is install PySpark and get it working in a Jupyter notebook.

Here's what you need to do:

Write all three functions in a single notebook and submit it to the TA via Slack. A goal in all cases is to load the data in and keep it within Spark for as long as possible. That is, resist the temptation to, for example, do a .collect() and do the rest of the work in pure Python. My code for all three functions stayed in Spark RDDs the entire time.
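To show the style I mean, here is a minimal sketch that stays inside Spark until the final output. It assumes a placeholder edge-list file named edges.txt with one "src dst" pair per line (the file name and format are not the actual assignment data) and computes the out-degree of each node:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "hw7-sketch")

    # Load the edge list and parse each line into a (src, dst) pair.
    edges = sc.textFile("edges.txt").map(lambda line: tuple(line.split()))

    # Count edges per source node entirely inside Spark RDD operations.
    out_degrees = edges.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)

    # Only pull results back to the driver at the very end.
    print(out_degrees.take(10))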
Homework 6

In this homework you will get experience writing and running MapReduce jobs on Hadoop. Here are your tasks:

Note that you'll implement one round of k-means in MapReduce and will thus need to wrap it in a little logic that iterates until cluster centroids don't change on two successive iterations.
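The driver logic around that single MapReduce round can be very small. Here is a rough sketch of its shape, written in Python purely for brevity; your actual driver belongs in KMeans.java, and run_one_round is a hypothetical stand-in for submitting one Hadoop job and reading back the new centroids:

    def kmeans_driver(initial_centroids, run_one_round):
        """Repeat one-round k-means until the centroids stop changing."""
        centroids = initial_centroids
        while True:
            new_centroids = run_one_round(centroids)   # one full MapReduce pass
            if new_centroids == centroids:             # unchanged on two successive iterations
                return new_centroids
            centroids = new_centroids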

Turn in your code to the grader in a Slack message as a tar file. The main routine must be contained in a file named KMeans.java and all files must extract to the current working directory when you untar them.


Homework 5

Do the following problems from chapter 10 of the Data Mining book. You can find a link to a PDF of that chapter in the syllabus. Write up your answers and submit them to the TA via Slack.
Homework 4

In this homework you will implement logistic regression, and then add a term to the objective function to prevent overfitting and evaluate its impact.

The MNIST dataset is a well-studied collection of handwritten digits. It is often used to test multi-class classification algorithms, where there is one class for each of the 10 digits (0 - 9).

I've made two files available for you:

Implement 2-class logistic regression using gradient descent as outlined in the lecture notes. You can do either batch or stochastic versions of the algorithm. You will only use your algorithm for this dataset, so you can hard-wire in the number of instances and the size of each instance. The goal is not to write a generic version of the algorithm (though you can if you wish). The goal is to understand how it works on real data.
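As a rough illustration of how little code this takes (not the required solution), here is a batch version in NumPy. It assumes X is an N x 785 matrix of flattened images with a bias column appended and y is a 0/1 vector marking the positive class; both names and the learning-rate/iteration values are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, eta=0.1, iters=500):
        """Batch gradient ascent on the log-likelihood."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            p = sigmoid(X @ w)             # predicted probability of class 1
            grad = X.T @ (y - p) / len(y)  # gradient of the average log-likelihood
            w += eta * grad                # fixed iteration count as the stopping criterion
        return w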

Use your algorithm to learn a classifier that determines whether an input image is an 8 or one of the other digits and record the classification accuracy on the training set (the full dataset I provided). Note that you'll have to come up with some stopping criterion, which could be to simply run for a fixed number of iterations and then quit.

After training is complete, create a 28x28 image of the learned weights. The largest weight (most positive) should map to black, the smallest weight (most negative) to white, and the other weights should linearly interpolate between those extremes.
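One possible way to build that image, assuming the learned vector w holds the 784 pixel weights first and the bias (if any) last, and using matplotlib; the gray_r colormap maps the largest value to black and the smallest to white with a linear ramp in between:

    import matplotlib.pyplot as plt

    img = w[:784].reshape(28, 28)     # drop the bias term if you appended one at the end
    plt.imshow(img, cmap="gray_r")    # most positive weight -> black, most negative -> white
    plt.axis("off")
    plt.savefig("weights.png")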

One form of regularization (a way to prevent overfitting) is to encourage the weights to stay small. That is commonly done in logistic regression by adding a term to the objective function that is the sum of the squares of the individual weights. We'd like to minimize that term while maximizing the likelihood of the data.

Modify your code from above to include this term. Note that you'll have to account for the derivative of this term when computing gradients, and that the formulation from class maximizes likelihood whereas we want to minimize the sum of squared weights. Introduce a parameter, lambda, that is a constant used to multiply the sum of squares of the weights. If lambda = 0, then you are running logistic regression without regularization. If you increase lambda, then you are putting more emphasis on small weights.
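Concretely, if your unregularized code does gradient ascent on the log-likelihood, the only change is an extra -2 * lambda * w term in the gradient; the sign flips because the penalty is being minimized while the likelihood is maximized. A sketch reusing the names from the earlier snippet:

    def train_logistic_l2(X, y, lam, eta=0.1, iters=500):
        """Gradient ascent on (log-likelihood) - lam * sum(w**2)."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            p = sigmoid(X @ w)
            grad = X.T @ (y - p) / len(y) - 2.0 * lam * w   # penalty enters with a minus sign
            w += eta * grad
        return w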

Generate a plot of training set accuracy as a function of lambda for several values of lambda. Briefly explain what is happening.
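A simple sweep is enough for the plot. This sketch assumes the train_logistic_l2 and sigmoid helpers from above are in scope; accuracy here is just the fraction of training images classified correctly at a 0.5 threshold, and the lambda values are only examples:

    lambdas = [0.0, 0.001, 0.01, 0.1, 1.0]   # example values only
    accs = []
    for lam in lambdas:
        w = train_logistic_l2(X, y, lam)
        accs.append(np.mean((sigmoid(X @ w) >= 0.5) == y))

    plt.plot(lambdas, accs, marker="o")
    plt.xlabel("lambda")
    plt.ylabel("training accuracy")
    plt.show()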

Submit the following in a Jupyter notebook:


Homework 3

In this homework you'll gain experience working with MongoDB and solidify your understanding of decision trees. Here are your tasks. Install MongoDB: Click here for directions.

Get the books data and insert it into a database:

In a Jupyter notebook, use the Mongo Python connector to write the following queries and show their output. Note that you must do "from pprint import pprint" in your notebook and print the results of queries only with pprint.

Decision trees: Consider the following dataset with four binary attributes and one binary class label.
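A minimal pattern for running a query from the notebook, assuming you are using pymongo against a local MongoDB instance; the database and collection names below are assumptions, so adjust them to match however you loaded the books data:

    from pprint import pprint
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["books"]          # database name is an assumption
    collection = db["books"]      # collection name is an assumption

    # Sanity check: pull back one document and pretty-print it, as required.
    pprint(collection.find_one())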

Use equation 3.4 from the Mitchell chapter on decision trees to compute information gain for each attribute to choose the root split for the tree. Give the computed information gain for each attribute and indicate which attribute should be used at the root of the tree.
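If you want to sanity-check your hand computation, the standard gain formula (entropy of the whole set minus the size-weighted entropies of the subsets induced by the attribute) is easy to script. This is only a possible checking aid; it takes the class labels and one attribute's values as parallel 0/1 lists:

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(labels, attr_values):
        n = len(labels)
        gain = entropy(labels)
        for v in set(attr_values):
            subset = [lab for lab, a in zip(labels, attr_values) if a == v]
            gain -= (len(subset) / n) * entropy(subset)
        return gain

    # Example: gain of an attribute that splits labels [1, 1, 0, 0] into {1: [1, 1, 0], 0: [0]}
    print(info_gain([1, 1, 0, 0], [1, 1, 1, 0]))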

Draw the full, unpruned tree that would be learned from this dataset. There is no need to do the full information gain computation for the splits below the root. Just "eyeball" the data and the correct splits should be obvious.

What to turn in: Turn in the notebook with the Mongo queries and their output, and a written answer to the question on decision trees. That answer needs to be in a file that can be uploaded to Slack. Turn both in to the TA via a Slack direct message by the due date.


Homework 2

In this homework you'll gain experience installing and using SQL databases. Here are your tasks.

Install MySQL: Click here for directions.

Create a MySQL user for yourself:

Install the Retailer sample database: Go here for instructions on getting the sample database. The result is a .zip file that, when extracted, gives you a .sql file named mysqlsampledatabase.sql, which is nothing more than a series of SQL commands/queries. To run it, do this: mysql -u username -p < mysqlsampledatabase.sql. As usual, you'll have to enter your password.

To check that everything worked, do SHOW DATABASES in mysql and if you see one named classicmodels, then all is well. The web page that contains the link for the sample database has information on the tables and their fields.

Write each of the following queries: For each query, turn in the query and the result of running it on the Retail database that you just created.

Install the Python connector for MySQL: In this part of the homework you'll get experience running queries from Python code and write a simple program to extract the structure of a MySQL database.

Your task is to write a Python program that takes a single command line argument, which is the name of a database, and prints the names of all of the tables in that database along with the number of rows in each table. Read through the documentation on the Python connector here to see how to create a connection, issue a query, and walk over the results. For this exercise you'll submit your Python code, which should all be in one file, along with the output of running your program on the sample database you installed earlier. Hint: The SHOW TABLES query will be useful here.
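A rough outline of the program, assuming the mysql-connector-python package and a password entered interactively (the user name is a placeholder):

    import sys
    import getpass
    import mysql.connector

    db_name = sys.argv[1]                      # database name from the command line
    conn = mysql.connector.connect(user="username",
                                   password=getpass.getpass(),
                                   database=db_name)
    cursor = conn.cursor()

    cursor.execute("SHOW TABLES")
    tables = [row[0] for row in cursor.fetchall()]

    for table in tables:
        # Table names come straight from SHOW TABLES, so formatting them in is fine here.
        cursor.execute("SELECT COUNT(*) FROM {}".format(table))
        print(table, cursor.fetchone()[0])

    conn.close()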

Turn in two files, one with the queries and their output, and another with the Python program that prints out information on tables. Submit both files to the TA.


Homework 1

In this homework you'll gain experience exploring data with Jupyter notebooks and pandas. Jupyter auto-saves notebooks with some regularity, but I also tend to "Save and Checkpoint" periodically on the File menu because you can always revert to a checkpoint.

You will submit your homework as a notebook by uploading it as a file into the Slack channel for the grader (Shivani Birmal) by 11:30am the day the assignment is due. To do that, click on the + icon beside "Direct Messages" and start typing her name. At some point you'll see her name in a list of users below where you are typing. Click on her name. Once you're in a chat with Shivani, click the paperclip icon next to the space where you enter a message and choose the notebook you want to submit. The system will then allow you to add a message, which you should make "Homework 1 submission for NAME".

To add comments in your notebook, which you're asked to do to explain your thinking in a few places, you'll use markdown syntax in the cell. Look at the Basics tab on the main markdown page and it will tell you everything you need to know. Type your comments in a notebook cell and then either do "Cell" - "Cell Type" - "Markdown", or type CTRL-M M in the cell.

Choose any dataset you want. The dataset that I explored in class was from the Open Baltimore website. You cannot use that dataset. You can also use Google's dataset search to find a dataset. Choose one that allows you to perform the following tasks (you may have to look at a few of them):