Homework


Homework 6

In this homework you'll explore PCA on a real dataset.

Here's what you need to do:

Submit everything in a single Jupyter notebook via Slack to the TA.

Homework 5

In this homework you'll write a couple of simple Spark programs to compute statistics on a graph. The first thing to do is install PySpark and get it working in a Jupyter notebook.

Here's what you need to do:

Write all 3 functions in a single notebook and submit it via Slack to the TA. In all cases, a goal is to load the data and keep it within Spark for as long as possible. That is, resist the temptation to, for example, do a .collect() and do the rest of the work in pure Python. My code for all three functions stayed in Spark RDDs the entire time.

Homework 4

Do the following problems from chapter 10 of the Data Mining book. You can find a link to a PDF of that chapter in the syllabus. Write up your answers and submit them to the TA via Slack.

Homework 3

In this homework you'll use scikit-learn to solve some classification problems.

Load the breast cancer dataset using https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html.
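
A minimal sketch of loading the data, assuming scikit-learn is installed; the variable names are arbitrary.

```python
from sklearn.datasets import load_breast_cancer

# load_breast_cancer returns a Bunch with the features, labels, and metadata.
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape, y.shape)          # number of samples and features
print(list(data.target_names))   # the two class labels
```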

Here is what you need to do:

Submit everything in a single Jupyter notebook.
Homework 2

In this homework you'll gain experience installing and using SQL databases. Here are your tasks.

Install MySQL: Click here for directions.

Create a MySQL user for yourself:

Install the Retailer sample database: Go here for instructions on getting the sample database. The result is a .zip file that, when extracted, gives you a .sql file named mysqlsampledatabase.sql, which is nothing more than a series of SQL commands/queries. To run it, do this: mysql -u username -p < mysqlsampledatabase.sql. The -p flag makes mysql prompt for your password, which, as usual, you'll have to enter.

To check that everything worked, run SHOW DATABASES; in mysql. If you see a database named classicmodels, all is well. The web page that contains the link for the sample database has information on the tables and their fields.

Write each of the following queries: For each query, turn in the query and the result of running it on the Retailer sample database (classicmodels) that you just created.

Install the Python connector for MySQL: In this part of the homework you'll get experience running queries from Python code and write a simple program to extract the structure of a MySQL database.

Your task is to write a Python program that takes a single command line argument, which is the name of a database, and prints the names of all of the tables in that database along with the number of rows in each table. Read through the documentation on the Python connector here to see how to create a connection, issue a query, and walk over the results. For this exercise you'll submit your Python code, which should all be in one file, along with the output of running your program on the sample database you installed earlier. Hint: The SHOW TABLES query will be useful here.
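
The connect, query, and iterate pattern from the connector documentation can be sketched as follows. This is a minimal example, assuming the mysql-connector-python package is installed and a server is running on localhost; the helper names run_query and list_databases and the credentials passed in are made up for illustration. It lists databases rather than tables, so it is not a solution to the exercise.

```python
def run_query(conn, query):
    """Run one query on a DB-API connection and return all rows as a list of tuples."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()


def list_databases(user, password, host="localhost"):
    """Connect to MySQL and return the names of all visible databases."""
    # Imported here so run_query stays usable without the connector installed.
    import mysql.connector

    conn = mysql.connector.connect(host=host, user=user, password=password)
    try:
        # Each row is a one-element tuple such as ('classicmodels',).
        return [name for (name,) in run_query(conn, "SHOW DATABASES")]
    finally:
        conn.close()
```

Calling list_databases("username", "your_password") should return a list that includes classicmodels if the sample database was installed.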

Put all elements of the homework into a single file and submit it via Slack to the TA by the due date/time.


Homework 1

In this homework you'll gain experience with Open Baltimore data, Jupyter notebooks, and pandas. Jupyter auto-saves notebooks with some regularity, but I also tend to "Save and Checkpoint" periodically on the File menu because you can always revert to a checkpoint.

You will submit your homework as a notebook by uploading it as a file into the Slack channel for the TA (Abbasi Koohpayegani) by 11:59pm the day the assignment is due. To do that, click on the + icon beside "Direct Messages" and start typing his name. At some point you'll see his name in a list of users below where you are typing. Click on his name. Once you're in a chat with Abbasi, click the big + next to the space where you enter a message, click on "Upload File" and then choose the notebook you want to submit. The system will then allow you to add a message, which you should make "Homework 1 submission for NAME".

To add comments in your notebook, which you're asked to do to explain your thinking in a few places, you'll use markdown syntax in the cell. Look at the Basics tab on the main markdown page and it will tell you everything you need to know. Type your comments in a notebook cell and then either do "Cell" - "Cell Type" - "Markdown", or type CTRL-M M in the cell.

Choose any dataset from the Open Baltimore collection except for variations of the Victim Based Crime Data that I explored in my DataExploration notebook. Choose a dataset that allows you to perform the following tasks: