In this homework, we'll load the data sets output by your Pig analytics into Accumulo, then use a provided web app to query them. Make sure HDFS, ZooKeeper, and Accumulo are running.
This homework (and all future homeworks) will require heavy internet searching and reading! Use of these tools is best learnt by trial and error, so hit up the Googles.
You're welcome to use this gitrepo that contains several Accumulo exercises in Java.
Using the previous homework's output, load the tweets, top hashtags, popular users, and reverse index data sets into Accumulo tables called tweets, hashtags, popular_users, and tweet_index.
It is your responsibility to determine the data model for this data. What is your row ID? What column families will you have? What are the column qualifiers? How are you storing the values?

Then, using Java or Python (with the pyaccumulo library), create a simple application, to be executed hourly via cron, that scans HDFS for the past hour's Avro tweets and the past hour's analytic output (top hashtags, popular users, and reverse index) and loads them into the corresponding Accumulo tables. For development, pull the files down and work locally, then switch to HDFS files and paths when it is time to deploy your application.
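To make the data model questions concrete, here is a minimal sketch of one possible layout, not the expected answer. It assumes each tweet is a dict with hypothetical fields `id`, `text`, `created_at`, and `screen_name`; your Avro schema may differ. Each function returns plain (row ID, column family, column qualifier, value) tuples, so you can experiment with a design before touching Accumulo.

```python
def tweet_to_entries(tweet):
    """Map one tweet record to (row, cf, cq, value) tuples.

    Assumed layout (hypothetical): row ID = tweet ID, one column
    family per group of fields, one qualifier per field name.
    """
    row = str(tweet["id"])
    return [
        (row, "tweet", "text", tweet["text"]),
        (row, "tweet", "created_at", tweet["created_at"]),
        (row, "user", "screen_name", tweet["screen_name"]),
    ]


def reverse_index_entries(word, tweet_ids):
    """Map one reverse-index record to (row, cf, cq, value) tuples.

    Assumed layout (hypothetical): row ID = indexed word, qualifier =
    tweet ID, empty value, so a scan of one row lists matching tweets.
    """
    return [(word, "tweet", str(tweet_id), "") for tweet_id in tweet_ids]
```

Because Accumulo keeps rows sorted by row ID, the choice of row ID decides which lookups become a cheap range scan. Keying the index by word, as assumed here, favors "which tweets contain this word" queries.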
If you're using pyaccumulo, it requires you to use Accumulo's proxy server. After starting the proxy server, you can get the Accumulo connector like so:
    # Start the proxy server using a terminal shell
    $ accumulo proxy -p /opt/accumulo/proxy/proxy.properties

    $ python
    Python 2.7.10 (default, Oct 23 2015, 18:05:06)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyaccumulo
    >>> from pyaccumulo import Accumulo
    >>> conn = Accumulo(host="localhost", port=42424, user="root", password="secret")
    >>> conn.list_tables()
    ['accumulo.root', 'trace', 'accumulo.metadata', 'foo']
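Once you have a connection, writing data follows the batch-writer pattern from the pyaccumulo README. The sketch below is one way to wrap it, assuming your records have already been turned into (row, column family, column qualifier, value) tuples. The helper names `write_entries` and `ingest_example`, and the sample row, are hypothetical. The Mutation class is passed in as an argument so the helper can be tried against stand-in objects without a running proxy.

```python
def write_entries(conn, table, entries, mutation_factory):
    """Write (row, cf, cq, value) tuples to `table`; return the count written.

    `conn` is a pyaccumulo Accumulo connection and `mutation_factory`
    is pyaccumulo's Mutation class (or a stand-in with the same shape).
    """
    if not conn.table_exists(table):
        conn.create_table(table)
    writer = conn.create_batch_writer(table)
    count = 0
    try:
        for row, cf, cq, value in entries:
            mutation = mutation_factory(row)
            mutation.put(cf=cf, cq=cq, val=value)
            writer.add_mutation(mutation)
            count += 1
    finally:
        # Closing the writer flushes any buffered mutations to the proxy.
        writer.close()
    return count


def ingest_example():
    """Hypothetical usage; needs pyaccumulo and a running proxy server."""
    from pyaccumulo import Accumulo, Mutation
    conn = Accumulo(host="localhost", port=42424, user="root", password="secret")
    # The row, family, qualifier, and value here are made-up sample data.
    write_entries(conn, "hashtags", [("2016030114", "count", "#umbc", "42")], Mutation)
    conn.close()
```

The connection settings in `ingest_example` mirror the session above; adjust the port to whatever your proxy.properties specifies.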
Create a cron job similar to the Pig jobs to execute the Accumulo ingest hourly. Install this application in your analytics directory under your home directory.
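Since cron starts the ingest at a fixed minute each hour, the application has to work out for itself which hour it is responsible for. The following is a minimal sketch of that calculation, assuming a hypothetical HDFS layout of `<base>/YYYY/MM/DD/HH`; substitute whatever directory structure your earlier jobs actually produce. The crontab line in the comment is likewise only an illustration, with a made-up script name.

```python
from datetime import datetime, timedelta

# Example crontab entry (hypothetical script name), run at minute 5 of every hour:
#   5 * * * * python $HOME/analytics/accumulo_ingest.py


def previous_hour_path(base, now):
    """Return the directory for the hour before `now`.

    Assumes a hypothetical <base>/YYYY/MM/DD/HH layout.
    """
    # Truncate to the top of the current hour, then step back one hour,
    # so a run at 14:05 processes the 13:00-13:59 data.
    previous = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return "%s/%s" % (base.rstrip("/"), previous.strftime("%Y/%m/%d/%H"))


def paths_for_run(bases, now=None):
    """Map each data set name to its past-hour directory."""
    if now is None:
        now = datetime.now()
    return dict((name, previous_hour_path(base, now)) for name, base in bases.items())
```

Subtracting a `timedelta` rather than decrementing the hour field by hand keeps the result correct across midnight and month boundaries.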
    [shook@mb:~]$ wget http://www.csee.umbc.edu/~shadam1/491s16/resources/hw4/accumulo-app.tar.gz
    [shook@mb:~]$ tar -xf accumulo-app.tar.gz
    [shook@mb:~]$ cd accumulo-app
    [shook@mb:accumulo-app]$ mvn clean package
    [shook@mb:accumulo-app]$ java -jar target/accumulo-demo-app-1.0.0.jar
    [shadam1@491vm ~]$ cd shadam1-gitrepo
    [shadam1@491vm shadam1-gitrepo]$ git pull
    [shadam1@491vm shadam1-gitrepo]$ mkdir hw4
    [shadam1@491vm shadam1-gitrepo]$ git add hw4
    [shadam1@491vm shadam1-gitrepo]$ cd hw4
    [shadam1@491vm hw4]$ # copy files
    [shadam1@491vm hw4]$ tar -zcvf hw4.tar.gz *
    [shadam1@491vm hw4]$ git add hw4.tar.gz
    [shadam1@491vm hw4]$ git commit -m "Homework 4 submission"
    [shadam1@491vm hw4]$ git push
Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.