Introduction
In this homework, we'll be writing some Pig analytics to analyze the tweets stored in HDFS. Make sure HDFS, Kafka, ZooKeeper, and YARN are running.
This homework (and all future homeworks) will require heavy internet searching and reading! Use of these tools is best learnt by trial and error, so hit up the Googles.
You're welcome to use this git repo I put together for some example Pig scripts.
Goals
- Create a series of Pig analytics, using the Avro data stored in HDFS
- Learn to develop in local mode using small sample files
- Schedule analytics to run hourly using cron
Part 0: Fix your Twitter Avro Schema because I am a mean professor
Pig's AvroStorage does not support unions of 'record' types (which is what I told you to do), so you need to flatten your Avro schema into a single record with plain or array types. For most of you, this just means taking the fields in the "User" record and putting them in the root Tweet record. Those of you who created additional Avro records for hashtags, urls, and user mentions should instead use an array of strings as the type of those fields. You'll need to make these changes in both your Avro schema and your producer code before proceeding.
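One possible flattened shape is sketched below. The field names are illustrative only -- keep whatever fields and naming your own schema already uses, and keep any extra fields you collect:

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "text", "type": "string"},
    {"name": "user_id", "type": "long"},
    {"name": "user_screen_name", "type": "string"},
    {"name": "user_followers_count", "type": "int"},
    {"name": "hashtags", "type": {"type": "array", "items": "string"}}
  ]
}
```

Note there are no nested records and no unions of records here: user fields are hoisted into the root record, and hashtags (likewise urls and user mentions) become a plain array of strings.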
Part 1: Start Kafka Twitter stream
Boot up your VM and start the Kafka/Twitter process. In general, this should always be running in the background during all homework assignments, continuously collecting data. Be sure to set your number-of-Avro-objects and rollover seconds to something larger, say 1,000 messages or five minutes, and use enough search terms in your filter stream to retrieve a significant number of tweets. You may want to clean up HDFS at this time to remove any 'bad' files created during your development process: the data is expected to be "production" quality, and it shouldn't give your analytics a bunch of problems.
Part 2: Write Pig analytics!
- Read through the Getting Started documentation provided by Pig here. This covers the basics of Pig, from starting a shell and the different execution modes to tips for debugging. Note that the 'pig' command is already on your PATH.
- From here, begin developing analytics to produce result sets for the following use cases. Pig itself provides the best documentation on their basics page and built-in functions page. Note the section on the AvroStorage loader -- you'll need it to read your Avro files.
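A quick sanity check in local mode might look like the following sketch. Here sample.avro stands in for one of your rolled files copied out of HDFS, and depending on your Pig version AvroStorage is either built in or needs a REGISTER of the piggybank/Avro jars first:

```pig
-- Run with: pig -x local sanity.pig
-- AvroStorage reads the schema embedded in the Avro file itself.
tweets = LOAD 'sample.avro' USING AvroStorage();
DESCRIBE tweets;           -- confirm the flattened schema came through
sample = LIMIT tweets 10;
DUMP sample;               -- eyeball a few records
```

Developing against a small local file like this is much faster than re-running on the cluster every time you tweak a script.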
The analytics you'll be creating are:
- Extract User Information
- Executed: Hourly
- Input: Previous hour of data
- Output: File containing many rows of data, one row per user, containing all columns pertaining to a user:
- user: id
- user: screen_name
- user: location
- user: description
- user: followers_count
- user: statuses_count
- user: geo_enabled
- user: lang
- Any additional columns you've selected to include from the Twitter object model
- Directory: /analytics/userinfo
- Top 100 Chatty Users by Hour
- Executed: Hourly
- Input: Previous hour of data
- Output: File containing 100 rows of data, columns <user id> <screen name> <number of tweets>, ordered by count descending
- Directory: /analytics/chattyusers
- Ordering of Popular Users by Hour
- Executed: Hourly
- Input: Previous hour of data
- Output: File containing 100 rows of data, columns <user id> <screen name> <num followers> <num tweets>, ordered by follower count descending
- Directory: /analytics/popularusers
- Top 100 Hashtags by Hour
- Executed: Hourly
- Input: Previous hour of data
- Output: File containing many rows of data, columns <tweet id> <hashtag> <count>, ordered by count descending
- Directory: /analytics/tophashtags
- Reverse Index of Tweet Words to Twitter ID
- Executed: Hourly
- Input: Previous hour of data
- Output: File containing non-stop words of a reverse index, which is a mapping of a word to a tweet ID, one word/tweet ID pair per line. Columns are <word> <tweet id>.
- Directory: /analytics/tweetindex
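As a sketch, the chatty-users analytic might look like the following in Pig Latin. The field names (user_id, screen_name) are assumptions -- substitute the names from your own flattened schema -- and $INPUT/$OUTPUT are values you'd supply with -param from a wrapper script:

```pig
-- Top 100 Chatty Users (sketch); field names are assumptions.
tweets   = LOAD '$INPUT' USING AvroStorage();
by_user  = GROUP tweets BY (user_id, screen_name);
counted  = FOREACH by_user GENERATE
               FLATTEN(group) AS (user_id, screen_name),
               COUNT(tweets) AS num_tweets;
ordered  = ORDER counted BY num_tweets DESC;
top100   = LIMIT ordered 100;
STORE top100 INTO '$OUTPUT';   -- tab-separated by default
```

Most of the other analytics follow this same GROUP/ORDER/LIMIT pattern; the reverse index additionally needs TOKENIZE and FLATTEN to split the tweet text into one row per word, plus a FILTER to drop stop words.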
Part 3: Schedule your Pig analytics
In this part, we'll write a collection of bash scripts to schedule our Pig analytics to run every hour, on the hour, by updating our user's crontab on the VM. See the crontab documentation for more info.
- Create a bash script, one per analytic, using the example found on the hadoop-demos Github to execute your Pig script. You'll want to "install" your scripts in some directory, e.g. /home/<glid>/analytics/hashtags, to give yourself a central place to run each analytic and capture its log file.
- Edit your crontab to execute the bash scripts.
- Validate your crontab is working and executing the bash scripts via the log files and viewing the files in HDFS.
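As a concrete sketch, a wrapper script might compute the previous hour's paths before invoking Pig. Everything here is an assumption to adapt: the /tweets/YYYY/MM/DD/HH layout should match however your Kafka consumer writes to HDFS, and the script, output, and log names are up to you.

```shell
#!/bin/bash
# Hypothetical wrapper, e.g. /home/<glid>/analytics/chattyusers/run.sh.
HOUR=$(date -d '1 hour ago' +%Y/%m/%d/%H)    # the previous hour
INPUT="/tweets/$HOUR"
OUTPUT="/analytics/chattyusers/$(date -d '1 hour ago' +%Y-%m-%d-%H)"
echo "input=$INPUT output=$OUTPUT"

# Run the analytic and append stdout/stderr to a log (uncomment once your
# chattyusers.pig exists):
# pig -param INPUT="$INPUT" -param OUTPUT="$OUTPUT" chattyusers.pig >> chattyusers.log 2>&1
```

A matching crontab entry (edit with `crontab -e`) such as `5 * * * * /home/<glid>/analytics/chattyusers/run.sh` runs it at five past every hour, giving the previous hour's files a few minutes to finish rolling over.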
Part 4: Submit your Homework
- Create a "hw3" folder in your git repository. Copy the contents of your install directory including ONE! log file of each successful execution, a copy of your crontab, and an HDFS listing of the /analytics directory to this directory.
- Create a README.txt file that contains the following items:
- Instructions on how to execute your project for someone who has no prior knowledge about the project. This should include things like installing your scripts, setting the crontab, viewing the files in HDFS, etc. Whatever is needed for another student to check out your homework and get it started (assume all necessary software is installed -- Kafka, HDFS, Avro, your previous homework(s), etc.)
- A list of references, if any, to include Internet links and the names of classmates you worked with.
- Create a tarball of these files and submit only the tarball. Do not include any data sets, generated code, or output! Only source code. For example:
[shadam1@491vm ~]$ cd shadam1-gitrepo
[shadam1@491vm shadam1-gitrepo]$ git pull
[shadam1@491vm shadam1-gitrepo]$ mkdir hw3
[shadam1@491vm shadam1-gitrepo]$ git add hw3
[shadam1@491vm shadam1-gitrepo]$ cd hw3
[shadam1@491vm hw3]$ # copy files
[shadam1@491vm hw3]$ tar -zcvf hw3.tar.gz *
[shadam1@491vm hw3]$ git add hw3.tar.gz
[shadam1@491vm hw3]$ git commit -m "Homework 3 submission"
[shadam1@491vm hw3]$ git push
What to do if you want to get a 0 on this project:
Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.
- Missing hw3.tar.gz*
- Submissions in excess of 10MB*
- Missing any of the required files above
- Submissions that do not extract with the command "tar -zxvf hw3.tar.gz"