Please read all instructions carefully! Unlike the other assignments you'll be receiving, this homework contains detailed, step-by-step instructions and is not meant to be difficult, so it shouldn't take you a significant amount of time. If you hit a snag, please post about your issue on Piazza.
Development is significantly easier when all of the Hadoop services are hosted on one machine. Using small data sets and standalone services, you can implement your applications in a development environment and then deploy them on a cluster of machines. Due to the nature of these applications, no code changes are required whether you run on a single machine or on many.
In this class, you will have access to a distributed Hadoop environment, but these environments are temporary in nature and will typically be used only for testing at a larger scale. For development, we will be using the Hortonworks HDP Sandbox, a self-contained VM image pre-loaded with all the needed Hadoop software.
For this homework, we'll download the Sandbox, obtain the Twitter API keys you'll need for this semester's assignments, access the git repository you'll use to submit your projects and homeworks, and do some brief testing to make sure everything is working.
If you intend to work in the lab, you should have a USB drive with at least 16 GB of free space. Flash drives or USB-powered hard disks are the most convenient option. If you will be working on your personal laptop or desktop, you may want one regardless for backups. Note: Ensure your drive is formatted as NTFS. FAT32's maximum file size (just under 4 GB) is unacceptable for virtual machine images.
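A quick bit of arithmetic shows why FAT32 won't do: the format caps a single file at 2^32 - 1 bytes, and sandbox VM images are larger than that.

```shell
# FAT32 caps a single file at 2^32 - 1 bytes (just under 4 GiB); sandbox VM
# images exceed this, which is why NTFS (or another modern filesystem) is required.
echo $(( (2**32 - 1) / 1024 / 1024 ))   # maximum FAT32 file size, in MiB
```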
VirtualBox
Virtualization allows us to run a virtual machine ("the guest") on a physical machine ("the host"). For the purposes of the class projects, we assume that the host machine is a modern PC with at least 4 GB of RAM, a 1.5 GHz dual-core processor, and at least 50 GB of available disk space. The host operating system can be Windows, Linux, or Mac OS X. While 64-bit host operating systems are preferred, 32-bit will be sufficient. For more details on hardware and software requirements, refer to the VirtualBox home page.
The ITE 240 Lab has VirtualBox installed on the Linux partitions of the machines in the lab.
[root@sandbox ~]$ yum -y groupinstall "Desktop"
[root@sandbox ~]$ yum -y groupinstall "Desktop Platform"
[root@sandbox ~]$ yum -y groupinstall "X Window System"
[root@sandbox ~]$ yum -y groupinstall "Fonts"
[root@sandbox ~]$ yum -y groupinstall "Graphical Administration Tools"
[root@sandbox ~]$ yum -y groupinstall "Internet Browser"
[root@sandbox ~]$ yum -y groupinstall "General Purpose Desktop"
[root@sandbox ~]$ yum -y groupinstall "Office Suite and Productivity"
[root@sandbox ~]$ yum -y groupinstall "Graphics Creation Tools"
[root@sandbox ~]$ yum -y groupinstall "Development tools"
id:5:initdefault:
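The `id:5:initdefault:` line goes in /etc/inittab; runlevel 5 boots CentOS 6 into the graphical desktop rather than the text console. A sketch of making the edit with sed, demonstrated here on a scratch copy (`inittab.demo`) rather than the real file:

```shell
# Switch the default runlevel from 3 (text console) to 5 (graphical login).
# inittab.demo stands in for /etc/inittab; on the sandbox you would edit the
# real file as root (after backing it up).
printf 'id:3:initdefault:\n' > inittab.demo
sed -i 's/^id:3:initdefault:/id:5:initdefault:/' inittab.demo
cat inittab.demo   # → id:5:initdefault:
```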
# For example:
[root@sandbox ~]$ useradd shadam1
[root@sandbox ~]$ passwd shadam1
New password:
Retype new password:
passwd: all authentication tokens updated successfully
shadam1 ALL=(ALL) ALL
[root@sandbox ~]$ poweroff
[shadam1@sandbox ~]$ sudo yum -y install kernel*
[shadam1@sandbox ~]$ sudo reboot
At this point you should have a desktop with a connection to the internet and the various Hadoop tools we'll be using throughout the course. We'll run some basic HDFS commands and a sample MapReduce job to make sure everything is working.
[shadam1@sandbox ~]$ sudo -u hdfs hdfs dfs -mkdir /user/shadam1/
[shadam1@sandbox ~]$ sudo -u hdfs hdfs dfs -chown shadam1:shadam1 /user/shadam1/
# The 'put' command copies a file from the local filesystem to a destination directory
[shadam1@sandbox ~]$ hdfs dfs -put /usr/hdp/current/pig-client/CHANGES.txt /user/shadam1/
# The 'cat' command will print the contents of a file to stdout
[shadam1@sandbox ~]$ hdfs dfs -cat CHANGES.txt
# ... a bunch of output
# The 'ls' command will list the files in a provided HDFS path. If no path is provided,
# or it is not an absolute path, it assumes you are located in your home directory.
# Note that there is no concept of a working directory when using HDFS.
[shadam1@sandbox ~]$ hdfs dfs -ls -R
-rw-r--r--   3 shadam1 shadam1     190058 2016-02-04 19:57 CHANGES.txt
[shadam1@sandbox ~]$ yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount CHANGES.txt output
# ... a bunch of output
[shadam1@sandbox ~]$ hdfs dfs -ls -R
drwx------   - shadam1 shadam1          0 2016-02-04 19:58 .staging
-rw-r--r--   3 shadam1 shadam1     190058 2016-02-04 19:57 CHANGES.txt
drwxr-xr-x   - shadam1 shadam1          0 2016-02-04 19:58 output
-rw-r--r--   3 shadam1 shadam1          0 2016-02-04 19:58 output/_SUCCESS
-rw-r--r--   3 shadam1 shadam1      83856 2016-02-04 19:58 output/part-r-00000
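If you're curious what the wordcount job actually computed, output/part-r-00000 holds tab-separated word/count pairs. The same idea can be mimicked locally with a shell pipeline (sample.txt below is a made-up stand-in, not a file from the sandbox):

```shell
# A local imitation of WordCount: split input on whitespace, then count
# how many times each word appears.
printf 'hadoop is fun\nhadoop scales\n' > sample.txt
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -rn
```

The most frequent word ("hadoop", appearing twice) sorts to the top, just as a WordCount reducer would tally it.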
The second objective of this homework is to obtain your Twitter API keys and make sure you are able to access the data.
[shadam1@sandbox ~]$ cd
[shadam1@sandbox ~]$ sudo yum erase -y ruby ruby-devel
[shadam1@sandbox ~]$ wget http://cache.ruby-lang.org/pub/ruby/2.3/ruby-2.3.0.tar.gz
[shadam1@sandbox ~]$ tar -xf ruby-2.3.0.tar.gz
[shadam1@sandbox ~]$ cd ruby-2.3.0
[shadam1@sandbox ruby-2.3.0]$ ./configure --prefix=/usr
[shadam1@sandbox ruby-2.3.0]$ make
[shadam1@sandbox ruby-2.3.0]$ sudo make install
[shadam1@sandbox ruby-2.3.0]$ sudo gem update --system
[shadam1@sandbox ruby-2.3.0]$ sudo gem install bundler
[shadam1@sandbox ~]$ cd
[shadam1@sandbox ~]$ git clone https://github.com/twitter/twurl.git
[shadam1@sandbox ~]$ cd twurl/
[shadam1@sandbox twurl]$ sudo gem install twurl
[shadam1@sandbox twurl]$ twurl authorize --consumer-key <your.key> --consumer-secret <your.secret>
Go to <link omitted> and paste in the supplied PIN
8564638
Authorization successful
[shadam1@sandbox twurl]$ twurl /1.1/statuses/user_timeline.json
# ... a bunch of json ...
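The timeline endpoint returns a JSON array of tweet objects. A rough way to sanity-check a saved response is to count the tweets by counting their "text" fields (sample.json below is a tiny fabricated stand-in; in practice you would redirect twurl's output to a file):

```shell
# Count tweet objects in a saved response by counting their "text" fields.
# sample.json is fabricated here, not real twurl output.
printf '[{"text":"hello"},{"text":"world"}]' > sample.json
grep -o '"text"' sample.json | wc -l
```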
The final objective is to learn how to submit your homeworks for this course. We will be using git to submit homeworks and projects this semester. For this project, you will submit a screenshot of your running virtual machine along with a file containing the output of a couple of commands. Let's walk through the steps of configuring git, generating the files to commit, and then pushing your project to the repository.
[shadam1@sandbox ~]$ git config --global user.name "Adam Shook"
[shadam1@sandbox ~]$ git config --global user.email "shadam1@umbc.edu"
[shadam1@sandbox ~]$ git clone ssh://<glusername>@linux.gl.umbc.edu/afs/umbc.edu/depts/cmsc/ajshook/491s16/<glusername> ~/<glusername>-gitrepo
[shadam1@sandbox ~]$ cd ~/shadam1-gitrepo
[shadam1@sandbox shadam1-gitrepo]$ mkdir hw1
[shadam1@sandbox shadam1-gitrepo]$ cd hw1/
[shadam1@sandbox hw1]$ twurl /1.1/statuses/user_timeline.json > twurl.out
[shadam1@sandbox hw1]$ hdfs dfs -ls -R &> script.out
[shadam1@sandbox hw1]$ cat script.out
drwx------   - shadam1 shadam1          0 2016-02-04 19:58 .staging
-rw-r--r--   3 shadam1 shadam1     190058 2016-02-04 19:57 CHANGES.txt
drwxr-xr-x   - shadam1 shadam1          0 2016-02-04 19:58 output
-rw-r--r--   3 shadam1 shadam1          0 2016-02-04 19:58 output/_SUCCESS
-rw-r--r--   3 shadam1 shadam1      83856 2016-02-04 19:58 output/part-r-00000
[shadam1@sandbox hw1]$ ll
total 156
-rw-rw-r-- 1 shadam1 shadam1 155551 2016-02-04 20:04 hw1-ss.png
-rw-rw-r-- 1 shadam1 shadam1    359 2016-02-04 20:04 script.out
-rw-rw-r-- 1 shadam1 shadam1  52183 2016-02-04 20:04 twurl.out
[shadam1@sandbox hw1]$ tar -zcvf hw1.tar.gz hw1-ss.png twurl.out script.out
hw1-ss.png
script.out
twurl.out
[shadam1@sandbox hw1]$ rm hw1-ss.png script.out twurl.out
[shadam1@sandbox hw1]$ git add hw1.tar.gz
[shadam1@sandbox hw1]$ git commit -m "Submit homework one files"
create mode 100644 hw1/hw1.tar.gz
[shadam1@sandbox hw1]$ git push
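If you haven't used git before, it's worth rehearsing the add/commit sequence once in a throwaway local repository before touching your course repo. Everything below is scratch data, not your real submission, and the push step is omitted because the scratch repo has no remote:

```shell
# Rehearse the submission flow in a scratch repository (not the course repo).
git init -q demo-repo && cd demo-repo
git config user.name "Demo User"
git config user.email "demo@example.com"
mkdir hw1 && echo placeholder > hw1/hw1.tar.gz   # stand-in for the real tarball
git add hw1/hw1.tar.gz
git commit -q -m "Submit homework one files"
git log --oneline | wc -l   # the repository now holds one commit
```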
Any of the following items will cause significant point loss. Any item below with an asterisk (*) will give that submission a 0.