Introduction

In this homework, we'll be writing some Pig analytics to analyze the tweets stored in HDFS. Make sure HDFS, Kafka, ZooKeeper, and YARN are running.

This homework (and all future homeworks) will require heavy internet searching and reading! These tools are best learned by trial and error, so hit up the Googles.

You're welcome to use this git repo I put together, which has some example Pig scripts.


Goals

Part 0: Fix your Twitter Avro Schema because I am a mean professor

Pig's AvroStorage does not support unions of 'record' types (which is what I told you to use), so you need to flatten your Avro schema into a single record with plain or array types. For most of you, this just means taking the fields in the "User" record and moving them into the root Tweet record. Those of you who created additional Avro records for hashtags, urls, and user mentions should instead use an array of strings as the type of those fields. You'll need to make these changes in both your Avro schema and your producer code before proceeding.
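As a rough sketch, a flattened schema might look something like the following. The field names here are only illustrative -- keep whatever names your schema already uses; the point is that "User" fields live directly in the root record and hashtags/urls/mentions are plain string arrays rather than nested records.

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "text", "type": "string"},
    {"name": "user_screen_name", "type": "string"},
    {"name": "user_followers_count", "type": "int"},
    {"name": "hashtags", "type": {"type": "array", "items": "string"}},
    {"name": "urls", "type": {"type": "array", "items": "string"}}
  ]
}
```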

Part 1: Start Kafka Twitter stream

Boot up your VM and start the Kafka/Twitter process. In general, this should always be running in the background during all homework assignments, continuously collecting data. Be sure to set your number-of-Avro-objects and rollover-seconds thresholds to something larger, say 1,000 messages or five minutes, and use enough search terms in your filter stream to retrieve a significant number of tweets. You may want to clean up files in HDFS at this point to remove any 'bad' files created during development. The data is expected to be "production" quality and shouldn't cause you a bunch of problems downstream.
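Cleanup is just the usual HDFS shell commands; the path below is an assumption -- substitute wherever your producer actually writes.

```
# Inspect what your producer has written so far
hdfs dfs -ls /user/<glid>/tweets

# Remove a malformed or partial file left over from development
hdfs dfs -rm /user/<glid>/tweets/<bad-file>.avro
```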

Part 2: Write Pig analytics!

  1. Read through the Getting Started documentation provided by Pig here. This covers the basics of Pig, from starting a shell and the different execution modes to tips for debugging. Note that the 'pig' command is already on your PATH.
  2. From here, begin developing analytics to produce result sets for the following use cases. Pig itself provides the best documentation on their basics page and built-in functions page. Note the section on the AvroStorage loader -- you'll need it to read your Avro files.
  3. The analytics you'll be creating are:
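To give you the general shape of an analytic, here is a hypothetical hashtag-count script. The input path, output path, and the `hashtags` field name are assumptions -- match them to your own schema and HDFS layout.

```pig
-- Hypothetical analytic: count hashtag occurrences across all collected tweets.
-- Paths and field names are assumptions; adjust to your schema.
tweets  = LOAD '/user/<glid>/tweets' USING AvroStorage();
tags    = FOREACH tweets GENERATE FLATTEN(hashtags) AS tag;
grouped = GROUP tags BY tag;
counts  = FOREACH grouped GENERATE group AS tag, COUNT(tags) AS num;
STORE counts INTO '/analytics/hashtags' USING PigStorage(',');
```

Note that AvroStorage maps an Avro array of strings to a Pig bag, which is why FLATTEN is used to get one row per hashtag.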

Part 3: Schedule your Pig analytics

In this part, we'll write a collection of bash scripts to schedule our Pig analytics. We'll update our user's crontab on the VM to execute the analytics every hour, on the hour. See the crontab docs for more info.

  1. Create a bash script, one per analytic, using the example found on the hadoop-demos Github to execute your Pig script. You'll want to "install" your scripts in some directory, e.g. /home/<glid>/analytics/hashtags, to give yourself a central place to run each analytic and capture its log file.
  2. Edit your crontab to execute the bash scripts.
  3. Validate your crontab is working and executing the bash scripts via the log files and viewing the files in HDFS.
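For reference, a minimal wrapper might look like the sketch below (modeled loosely on the hadoop-demos example; the script and log file names are assumptions):

```
#!/bin/bash
# Hypothetical wrapper: run one analytic from its install directory,
# appending stdout and stderr to a log file next to the script.
cd "$(dirname "$0")"
pig -f hashtags.pig >> hashtags.log 2>&1
```

And the matching crontab entry to fire it every hour, on the hour:

```
0 * * * * /home/<glid>/analytics/hashtags/run-hashtags.sh
```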

Part 4: Submit your Homework

  1. Create a "hw3" folder in your git repository. Copy the contents of your install directory, including ONE log file from a successful execution of each analytic, a copy of your crontab, and an HDFS listing of the /analytics directory, to this folder.
  2. Create a README.txt file that contains the following items:
  3. Create a tarball of these files and submit only the tarball. Do not include any data sets, generated code, or output! Only source code. For example:
    	[shadam1@491vm ~]$ cd shadam1-gitrepo
    	[shadam1@491vm shadam1-gitrepo]$ git pull
    	[shadam1@491vm shadam1-gitrepo]$ mkdir hw3
    	[shadam1@491vm shadam1-gitrepo]$ git add hw3
    	[shadam1@491vm shadam1-gitrepo]$ cd hw3
    	[shadam1@491vm hw3]$ # copy files
    	[shadam1@491vm hw3]$ tar -zcvf hw3.tar.gz *
    	[shadam1@491vm hw3]$ git add hw3.tar.gz
    	[shadam1@491vm hw3]$ git commit -m "Homework 3 submission"
    	[shadam1@491vm hw3]$ git push

What to do if you want to get a 0 on this project:

Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.