Introduction

Using the Virtual Machine you created in Homework 1, you will be serializing Twitter's stream to Avro objects and storing them in HDFS via a Kafka topic. Make sure HDFS and Kafka are running.

This homework (and all future homeworks) will require heavy internet searching and reading! Use of these tools is best learnt by trial and error, so hit up the Googles.

You're welcome to use this git repo I put together for examples of how to use Avro and Kafka.


Goals

Part 1: Write Avro schema

First, create an Avro schema from the Twitter object model. Avro's documentation can be found at http://avro.apache.org/docs/1.8.0/. At a minimum, you'll need to obtain the avro-tools jar. For Java, you can use Maven to pull in Avro or download the jars yourself. For Python, you'll have to install the Avro Python module on the VM; follow the docs there for installation instructions.

  1. Read the documentation on Twitter's Streaming API to familiarize yourself with how it differs from the REST API. We'll be using the Public API to access the filter stream.
  2. Twitter's filter API provides a stream of Tweet objects. A Tweet (in API terms) is any status update. Using the object model documentation as a reference, create an Avro schema for Twitter's "Tweet" activity. A full sample tweet can be found here. At a minimum, your Avro schema must contain the fields listed below, though it doesn't have to be identical to the Twitter object model. I would expect your full Avro schema to contain more than one record type, e.g. one for "Tweet", one for "User", and wherever else it makes sense; apply your OOP skills. A minimal schema sketch appears after this list.
  3. Compile your Avro schema to Java classes, even if you plan to use Python. This will help validate that your schema is formatted correctly.
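For reference, below is a minimal .avsc sketch showing how nested record types fit together. The namespace and the handful of fields shown (id, created_at, text, and a nested User) are illustrative placeholders drawn from the Twitter object model, not the required field list, and your real schema will be considerably larger.

    {
      "namespace": "hw2.avro",
      "type": "record",
      "name": "Tweet",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "created_at", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "user", "type": {
          "type": "record",
          "name": "User",
          "fields": [
            {"name": "id", "type": "long"},
            {"name": "screen_name", "type": "string"},
            {"name": "followers_count", "type": "int"}
          ]
        }}
      ]
    }

Once the schema parses, the avro-tools jar's "compile schema <schema file> <output dir>" command will generate the Java classes for step 3.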

Part 2: Write the Kafka producer

In this part, we'll be creating a Kafka producer to publish tweets from Twitter to a topic, serialized as Avro objects. I'd suggest an incremental approach to this problem. Kafka's documentation is located at http://kafka.apache.org/documentation.html.

The Kafka command line tools are located on your VM under the /usr/hdp/current/kafka-broker/bin folder.

  1. Using Python or Java, create a program which connects to Twitter and prints JSON status updates to the screen. We'll be using the Filter API, so specify whatever filter terms you would like; I think the limit is 400, so go nuts. Note that the stream you receive will also contain delete notices, an example of which is here. We'll be ignoring these, so make sure you validate the JSON prior to creating your Avro object. In short, if the JSON object you receive from Twitter has a "delete" entry, it is a delete notice and not a status update.

    For Java, I recommend using Twitter4J.
    For Python, I recommend tweepy.

  2. Once you can print tweets to the terminal, create a Kafka topic called "jsontweets" using the command line tool. Create a Kafka producer to take the JSON objects and publish them to that topic. Validate that the JSON messages are being written using the kafka-console-consumer.

    For Java, use the kafka-clients Maven artifact below.
    	<dependency>
    	    <groupId>org.apache.kafka</groupId>
    	    <artifactId>kafka-clients</artifactId>
    	    <version>0.9.0.0</version>
    	</dependency>
    For Python, I recommend kafka-python. You'll need to clone the master branch and install it using setup.py; pip does not have the latest version.

  3. Once you have JSON tweets writing to the Kafka topic, modify your program to convert each JSON object to an Avro object, serialize it to a byte array, and write it to a new Kafka topic called <tweets>. You shouldn't be writing the raw JSON tweets to the jsontweets topic anymore. Again, validate your producer using the console consumer. A sketch of the serialize-and-publish step appears after this list.
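To make item 3 concrete, here is a rough Java sketch of the serialize-and-publish step using the kafka-clients producer and Avro's GenericRecord API. The broker address, schema file name, topic name, and field names are all assumptions to adapt to your own schema and VM (on the HDP sandbox the broker usually listens on port 6667, but verify yours).

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TweetProducerSketch {

        public static void main(String[] args) throws Exception {
            // Broker address is an assumption; point this at your VM's Kafka broker.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:6667");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

            // Parse the schema you wrote in Part 1 (file name is an assumption).
            Schema schema = new Schema.Parser().parse(new File("tweet.avsc"));

            // In your real producer this record is built from each JSON status update
            // you receive from Twitter, after skipping any object with a "delete" entry.
            GenericRecord tweet = new GenericData.Record(schema);
            tweet.put("id", 123456789L);
            tweet.put("text", "hello world");
            // ...set every field your schema requires.

            // Serialize the record to a byte array using Avro's binary encoding.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            writer.write(tweet, encoder);
            encoder.flush();

            // Publish the serialized bytes to the tweets topic and clean up.
            producer.send(new ProducerRecord<String, byte[]>("tweets", out.toByteArray()));
            producer.close();
        }
    }

Sending plain byte arrays with the ByteArraySerializer keeps the Kafka side simple; the consumer only needs the same schema to decode them.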

Part 3: Write the Kafka consumer

In this part, we'll be creating a Kafka consumer to pull messages from the <tweets> topic and write them to files in HDFS.

  1. Create a Kafka consumer to read messages from the <tweets> topic. Write files to HDFS under /in/tweets/YYYY/MM/DD/HH/tweets-S.avro, substituting YYYY, MM, DD, HH, and S with the current year, month, day, hour, and the epoch time at which the file is created.
  2. Your consumer should roll over the file, i.e. create a new file, after n messages have been written or m seconds have passed, where n and m are values provided at the command line. See the sketch after this list for one way to structure the rollover.
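Below is a hedged Java sketch of one way to structure the consumer and the n-messages / m-seconds rollover, using the 0.9 consumer API, Hadoop's FileSystem client, and Avro's DataFileWriter so each output file is a proper Avro container. The broker address, group id, schema file name, and the assumption that "epoch time" means seconds are placeholders to adapt.

    import java.io.File;
    import java.text.SimpleDateFormat;
    import java.util.Collections;
    import java.util.Date;
    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TweetConsumerSketch {

        public static void main(String[] args) throws Exception {
            int maxMessages = Integer.parseInt(args[0]);  // n: messages per file
            long maxSeconds = Long.parseLong(args[1]);    // m: seconds per file

            // Broker address and group id are assumptions; adjust to your VM.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:6667");
            props.put("group.id", "tweet-hdfs-writer");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("tweets"));

            // Reuse the schema from Part 1 to decode the bytes (file name is an assumption).
            Schema schema = new Schema.Parser().parse(new File("tweet.avsc"));
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            FileSystem fs = FileSystem.get(new Configuration());

            DataFileWriter<GenericRecord> fileWriter = null;
            int written = 0;
            long openedAt = 0L;

            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<String, byte[]> record : records) {
                    long now = System.currentTimeMillis();

                    // Roll over when no file is open, n messages have been written,
                    // or m seconds have passed since the current file was opened.
                    if (fileWriter == null || written >= maxMessages
                            || (now - openedAt) / 1000 >= maxSeconds) {
                        if (fileWriter != null) {
                            fileWriter.close();
                        }
                        String dir = new SimpleDateFormat("yyyy/MM/dd/HH").format(new Date(now));
                        // Assuming "epoch time" means seconds; switch to milliseconds if you prefer.
                        Path path = new Path("/in/tweets/" + dir + "/tweets-" + (now / 1000) + ".avro");
                        fileWriter = new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
                        fileWriter.create(schema, fs.create(path));
                        written = 0;
                        openedAt = now;
                    }

                    // Decode the Avro bytes back into a record and append it to the container file.
                    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(record.value(), null);
                    GenericRecord tweet = datumReader.read(null, decoder);
                    fileWriter.append(tweet);
                    written++;
                }
            }
        }
    }

Note that this sketch only checks the rollover condition as messages arrive; if you need the file to close after exactly m idle seconds, add a timer or check the elapsed time after each poll as well.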

Part 4: Submit your Homework

  1. Create a "hw2" folder in your git repository. Copy your Avro schema, consumer code, and producer code to this directory.
  2. Create a README.txt file that contains the following items:
  3. Create a tarball of these files and submit only the tarball. Do not include any data sets, generated code, or output! Only source code. For example:
    	[shadam1@491vm ~]$ cd shadam1-gitrepo
    	[shadam1@491vm shadam1-gitrepo]$ git pull
    	[shadam1@491vm shadam1-gitrepo]$ mkdir hw2
    	[shadam1@491vm shadam1-gitrepo]$ git add hw2
    	[shadam1@491vm shadam1-gitrepo]$ cd hw2
    	[shadam1@491vm hw2]$ # copy files
    	[shadam1@491vm hw2]$ tar -zcvf hw2.tar.gz *
    	[shadam1@491vm hw2]$ git add hw2.tar.gz
    	[shadam1@491vm hw2]$ git commit -m "Homework 2 submission"
    	[shadam1@491vm hw2]$ git push

What to do if you want to get a 0 on this project:

Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.