Introduction

In this homework, we'll be re-creating the functionality of Homework 4 (Accumulo + WebApp) using a Storm topology and Redis. This will provide a more real-time experience and make the app a little more interesting. You'll need to re-use your Avro schema as well as your Kafka producer. Make sure HDFS, ZooKeeper, Kafka, and Storm are running.

This homework (and all future homeworks) will require heavy internet searching and reading! These tools are best learned through trial and error, so hit up the Googles.


Goals

Part 1: Download and Start Storm

  1. Follow the instructions below to install and start Storm, using a pre-configured Storm tarball for this class.
    
    # Open a terminal and download the tarball
    [shadam1@cmsc491 ~]$ wget http://www.csee.umbc.edu/~shadam1/491s16/resources/hw5/apache-storm-1.0.0-for-vm.tar.gz
    
    # Move it to /opt and change your working directory
    [shadam1@cmsc491 ~]$ sudo mv apache-storm-1.0.0-for-vm.tar.gz /opt
    [shadam1@cmsc491 ~]$ cd /opt/
    
    # Unpack the tarball
    [shadam1@cmsc491 opt]$ sudo tar -xf apache-storm-1.0.0-for-vm.tar.gz 
    
    # Change ownership to your user (use your GL ID instead of shadam1)
    [shadam1@cmsc491 opt]$ sudo chown -R shadam1:shadam1 apache-storm-1.0.0
    
    # Create softlink from the real folder to /opt/storm
    [shadam1@cmsc491 opt]$ sudo ln -s /opt/apache-storm-1.0.0 /opt/storm
    
    # Create the data directory and give yourself ownership
    [shadam1@cmsc491 opt]$ sudo mkdir -p /data1/storm
    [shadam1@cmsc491 opt]$ sudo chown -R shadam1:shadam1 /data1/storm
    
    # Change back to the home directory
    [shadam1@cmsc491 opt]$ cd
    
    # Modify the contents of ~/.bashrc to add the STORM_HOME environment
    # variable and add the $STORM_HOME/bin folder to $PATH
    [shadam1@cmsc491 ~]$ vi .bashrc # make changes using whatever text editor you want
    [shadam1@cmsc491 ~]$ cat .bashrc 
    
    # omitted lines
    
    export ACCUMULO_HOME=/opt/accumulo
    export STORM_HOME=/opt/storm
    
    export PATH=$HADOOP_PREFIX/bin:$M2_HOME/bin:$KAFKA_HOME/bin:$PIG_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin:$ACCUMULO_HOME/bin:$PATH
    export PATH=$STORM_HOME/bin:$PATH
    
    # Source our changes for this window (this is done for you when you open a new terminal tab/window)
    [shadam1@cmsc491 ~]$ source .bashrc 
    
    # Start Kafka and ZooKeeper if you have not done so already
    [shadam1@cmsc491 ~]$ ~/scripts/start_kafka.sh
    
    # Start the Storm processes, ONE PER TAB. These run in the foreground and will
    # not return. Whenever you need to start Storm, open a terminal with three tabs
    # and run one of the commands below in each:
    [shadam1@cmsc491 ~]$ storm nimbus
    [shadam1@cmsc491 ~]$ storm supervisor
    [shadam1@cmsc491 ~]$ storm ui
            
  2. Navigate to http://localhost:8080 in your VM to view the Storm UI.

Part 2: Download and Start Redis

  1. Follow the instructions below to install and start Redis, using a pre-configured Redis tarball for this class.
    # Download the pre-configured redis tarball
    [shadam1@cmsc491 ~]$ wget http://www.csee.umbc.edu/~shadam1/491s16/resources/hw5/redis-stable-for-vm.tar.gz
    
    # Move and unpack Redis under /opt, changing permissions and creating the softlink
    [shadam1@cmsc491 ~]$ sudo mv redis-stable-for-vm.tar.gz /opt
    [shadam1@cmsc491 ~]$ cd /opt/
    [shadam1@cmsc491 opt]$ sudo tar -xf redis-stable-for-vm.tar.gz 
    [shadam1@cmsc491 opt]$ sudo chown -R shadam1:shadam1 redis-stable
    [shadam1@cmsc491 opt]$ sudo ln -s /opt/redis-stable /opt/redis
    
    # Create Redis data directory
    [shadam1@cmsc491 opt]$ sudo mkdir -p /data1/redis
    [shadam1@cmsc491 opt]$ sudo chown shadam1:shadam1 /data1/redis
    
    # Make the log directory
    [shadam1@cmsc491 opt]$ mkdir -p /opt/redis/logs
    
    # Start Redis
    [shadam1@cmsc491 opt]$ /opt/redis/src/redis-server /opt/redis/redis.conf 
    
    # Create the REDIS_HOME variable in .bashrc and add it to $PATH
    # NOTE we are adding the $REDIS_HOME/src directory, NOT the $REDIS_HOME/bin
    # directory (which does not exist)
    [shadam1@cmsc491 ~]$ vi .bashrc # use your preferred text editor
    [shadam1@cmsc491 ~]$ cat .bashrc 
    
    # ... omitted...
    
    export STORM_HOME=/opt/storm
    export REDIS_HOME=/opt/redis
    
    export PATH=$HADOOP_PREFIX/bin:$M2_HOME/bin:$KAFKA_HOME/bin:$PIG_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin:$ACCUMULO_HOME/bin:$PATH
    export PATH=$STORM_HOME/bin:$PATH
    export PATH=$REDIS_HOME/src:$PATH
    
    # Source it
    [shadam1@cmsc491 ~]$ source .bashrc 
            
  2. Use the Redis command line to validate that it is working as expected (a Java equivalent is sketched after this transcript):
    [shadam1@cmsc491 ~]$ redis-cli
    127.0.0.1:6379> SET foo bar
    OK
    127.0.0.1:6379> GET foo
    "bar"
    127.0.0.1:6379> DEL foo
    (integer) 1
    127.0.0.1:6379>
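
If you would rather verify connectivity from Java, the same round trip can be done with the Jedis client. This is a minimal sketch, assuming Jedis is on your classpath (e.g., as a Maven dependency); the storm-webapp project presumably talks to Redis the same way, but confirm which client it actually uses.

    import redis.clients.jedis.Jedis;
    
    public class RedisSmokeTest {
        public static void main(String[] args) {
            // Connect to the local Redis server started above (default port 6379)
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.set("foo", "bar");              // same as: SET foo bar
                System.out.println(jedis.get("foo")); // prints "bar"
                jedis.del("foo");                     // same as: DEL foo
            }
        }
    }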
    

Part 3: Download and import starter project into Eclipse

  1. Download the starter project and unpack it:
    [shadam1@cmsc491 ~]$ wget http://www.csee.umbc.edu/~shadam1/491s16/resources/hw5/twitter-storm.tar.gz
    [shadam1@cmsc491 ~]$ tar -xf twitter-storm.tar.gz
          
  2. In your VM, open Eclipse (purple button on top task bar).
  3. If you are prompted to enter a workspace, then browse to and select the twitter-storm directory. If your Eclipse opens to an existing workspace, go to File -> Switch Workspace -> Other... , then browse to and select the twitter-storm directory you unpacked.
  4. File -> Import... -> Maven -> Existing Maven projects -> Browse... -> Navigate to twitter-storm (it generally is already selected) -> OK -> You should see two projects, storm and storm-webapp, click Finish.
  5. You should now see two Maven projects in Eclipse. storm is what you will use to write your Storm topology, and storm-webapp is what you will use to extract data from Redis.

Part 4: Implement your Storm topology

  1. Using the storm project as your guide, follow the TODO statements to implement the Kafka spout, the various bolts, and the Driver that builds the topology (a rough sketch of how these pieces fit together follows this list).
  2. Delete the provided Avro twitter/Tweet.java file. It was generated from my Avro schema and likely will not line up with your own. Compile your Avro schema with the Avro tools jar, as you did for the previous homework, and move the folder containing the generated source under storm/src/main/java.
  3. Running the Driver class in Eclipse with no arguments will run the topology in local mode, allowing you to set breakpoints to debug your code and/or make sure it is working.
  4. After you have finished your topology, you can build the topology with Maven and run it, passing the name of the Storm topology as the only command line argument (otherwise it will run in local mode).
    [shadam1@cmsc491 storm]$ mvn clean package
    [shadam1@cmsc491 storm]$ storm jar target/storm-0.0.1.jar com.adamjshook.demo.storm.Driver tweettop
            
  5. To kill your topology so you can redeploy it, simply use storm kill <topology_name>.
    [shadam1@cmsc491 storm]$ storm kill tweettop
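
If you want to see how the pieces typically fit together before starting on the TODOs, here is a minimal sketch of a Driver using the storm-kafka API that ships with Storm 1.0.0. The topic name "tweets", the ZooKeeper root "/tweets", and PrinterBolt are placeholders assumed for illustration; the starter project's TODOs define the real spout, bolts, and wiring.

    import java.util.UUID;
    
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;
    
    public class DriverSketch {
    
        // Placeholder bolt; your real bolts (per the TODOs) would decode the Avro
        // tweet, count hashtags and users, and write results to Redis
        public static class PrinterBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple);
            }
    
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // This sketch emits nothing downstream
            }
        }
    
        public static void main(String[] args) throws Exception {
            // Kafka spout reading a hypothetical "tweets" topic via local ZooKeeper
            SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:2181"),
                    "tweets", "/tweets", UUID.randomUUID().toString());
    
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig));
            builder.setBolt("print-bolt", new PrinterBolt()).shuffleGrouping("kafka-spout");
    
            Config conf = new Config();
            if (args.length == 1) {
                // One argument: deploy to the running cluster under that topology name
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                // No arguments: run in local mode so you can debug in Eclipse
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("tweettop", conf, builder.createTopology());
                Thread.sleep(60000); // let it run for a minute, then shut down
                cluster.shutdown();
            }
        }
    }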
            

Part 5: Implement the RedisDataFetcher

  1. Using the storm-webapp project as your guide, follow the TODO statements to implement the RedisDataFetcher. This will retrieve the top ten hashtags, the top ten most popular users, and the last ten tweets from Redis. The web app calls these functions every five seconds, retrieving the latest data (one possible approach is sketched after this list).
  2. After you have finished your data fetcher, you can build it with Maven and run it as a Java application.
    [shadam1@cmsc491 storm-webapp]$ mvn clean package
    [shadam1@cmsc491 storm-webapp]$ java -jar target/storm-webapp-1.0.0.jar
            
  3. Press Ctrl+C to kill the Java app.
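
For orientation only, here is one plausible shape for the fetch logic using the Jedis client. It assumes (and this is an assumption, not the project's contract) that your topology stores hashtag and user counts in Redis sorted sets and recent tweets in a Redis list, under the hypothetical keys "hashtags", "users", and "tweets"; your actual keys and data structures are whatever your topology writes.

    import java.util.List;
    import java.util.Set;
    
    import redis.clients.jedis.Jedis;
    
    public class FetcherSketch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Top ten hashtags: highest-scoring members of a sorted set
                // ("hashtags" is a hypothetical key name)
                Set<String> topHashtags = jedis.zrevrange("hashtags", 0, 9);
    
                // Top ten users, stored the same way under another hypothetical key
                Set<String> topUsers = jedis.zrevrange("users", 0, 9);
    
                // Last ten tweets: the head of a list a bolt LPUSHes into
                List<String> lastTweets = jedis.lrange("tweets", 0, 9);
    
                System.out.println(topHashtags);
                System.out.println(topUsers);
                System.out.println(lastTweets);
            }
        }
    }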

Part 6: Submit your Homework

  1. Create a "hw5" folder in your git repository. Include the following items in your directory:
    1. Your source code in its entirety -- twitter-storm/storm and twitter-storm/storm-webapp
    2. Your Avro schema
    3. Screenshot of your web application with the tweets, top hashtags, and popular users
    4. Create a README.txt file that contains the following items:
      • Instructions on how to execute your project for someone who has no prior knowledge of the project. This should include building the code, deploying the Storm jar, and building, running, and viewing the web application: whatever another student needs to check out your homework and get it started (assume all necessary software is installed -- Kafka, HDFS, Avro, your previous homework(s), etc.)
      • A list of references, if any, including Internet links and the names of classmates you worked with.
  2. Create a tarball of these files and submit only the tarball. Do not include any data sets, generated code, or output! Only source code. For example:
    [shadam1@491vm ~]$ cd shadam1-gitrepo
    [shadam1@491vm shadam1-gitrepo]$ git pull
    [shadam1@491vm shadam1-gitrepo]$ mkdir hw5
    [shadam1@491vm shadam1-gitrepo]$ git add hw5
    [shadam1@491vm shadam1-gitrepo]$ cd hw5
    [shadam1@491vm hw5]$ # copy files
    [shadam1@491vm hw5]$ # Locate and delete any target directories!  They are too big!
    [shadam1@491vm hw5]$ tar -zcvf hw5.tar.gz *
    [shadam1@491vm hw5]$ git add hw5.tar.gz
    [shadam1@491vm hw5]$ git commit -m "Homework 5 submission"
    [shadam1@491vm hw5]$ git push

What to do if you want to get a 0 on this project:

Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.