R (DUE: 4/9 11:59 PM EDT)

This assignment requires you to write several R functions and classes. These problems are designed to give you some hands on experience in the following areas:

  • Statiscal Computing
  • Objects in R
  • Plotting using ggplot
  • Text processing in R
  • Working with DataFrames

The design and implementation (unless specified) of each problem is completely up to you. Suggested approaches may be mentioned in class, but they are just that, suggestions. You are welcome to implement down any path you see fit.

Some guidelines for the problems:

  • All problems should handle errors in a reasonable way. Your solutions will be tested for invalid input or usage. Give the user a meaningful message if they misuse the script or if an error occurs.
  • Projects will be graded electronically on one of the GL Linux servers — linux[123].gl.umbc.edu. You are welcome to develop your assignment at home, but make sure your scripts run correctly on the GL Linux servers.

The assignment will be submitted through GitHub. To get started with the assignment, go to https://classroom.github.com/a/4KGf0URx and sign in to your GitHub account. This will create a repository for you containing one file, with the names you are expected to use for this assignment. Also included are various data files. Please include a README with your name in it.

Git on GL is outdated and requires a slightly different mechanism to you. On most systems, running a git command will prompt for your GitHub username and password. On GL, it expects the user name as part of the command. One was to achieve this is to modify the clone command so that it now reads

git clone https://YOUR_GITHUB_USERNAME@github.com/YOUR_REPO_INFORMATION

Prior to doing this, you may need to run the command unset SSH_ASKPASS. This is so GL knows not to try and prompt you for your password using a GUI, and instead asks for it on the command line.

If you are using tcsh or another c-shell, you need to type unsetenv SSH_ASKPASS to prevent the GUI error.

Available R Libraries

The following libraries have been installed to my public directory, and are useable by setting your enviormental variable R_LIBS_USER to /afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. These are the only non-standard libraries you are permitted to use.

    
Packages in library '/afs/umbc.edu/users/b/w/bwilk1/pub/R_libs':
assertthat              Easy Pre and Post Assertions
BH                      Boost C++ Header Files
bindr                   Parametrized Active Bindings
bindrcpp                An 'Rcpp' Interface to Active Bindings
broom                   Convert Statistical Analysis Objects into Tidy Data Frames
cellranger              Translate Spreadsheet Cell Ranges to Rows and  Columns
codetools               Code Analysis Tools for R
colorspace              Color Space Manipulation
dichromat               Color Schemes for Dichromats
digest                  Create Compact Hash Digests of R Objects
dplyr                   A Grammar of Data Manipulation
forcats                 Tools for Working with Categorical Variables (Factors)
foreign                 Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
ggplot2                 Create Elegant Data Visualisations Using the Grammar of Graphics
glue                    Interpreted String Literals
gtable                  Arrange 'Grobs' in Tables
hms                     Pretty Time of Day
jsonlite                A Robust, High Performance JSON Parser and Generator for R
labeling                Axis Labeling
lattice                 Trellis Graphics for R
lazyeval                Lazy (Non-Standard) Evaluation
lubridate               Make Dealing with Dates a Little Easier
magrittr                A Forward-Pipe Operator for R
mime                    Map Filenames to MIME Types
mnormt                  The Multivariate Normal and t Distributions
modelr                  Modelling Functions that Work with the Pipe
munsell                 Utilities for Using Munsell Colours
nlme                    Linear and Nonlinear Mixed Effects Models
pkgconfig               Private Configuration for 'R' Packages
plogr                   The 'plog' C++ Logging Library
plyr                    Tools for Splitting, Applying and Combining Data
psych                   Procedures for Psychological, Psychometric, and Personality Research
purrr                   Functional Programming Tools
R6                      Classes with Reference Semantics
RColorBrewer            ColorBrewer Palettes
Rcpp                    Seamless R and C++ Integration
readr                   Read Rectangular Text Data
rematch                 Match Regular Expressions with a Nicer 'API'
reshape2                Flexibly Reshape Data: A Reboot of the Reshape Package
rlang                   Functions for Base Types and Core R and 'Tidyverse' Features
scales                  Scale Functions for Visualization
selectr                 Translate CSS Selectors to XPath Expressions
stringi                 Character String Processing Facilities
stringr                 Simple, Consistent Wrappers for Common String Operations
tibble                  Simple Data Frames
tidyr                   Easily Tidy Data with 'spread()' and 'gather()' Functions
tidyselect              Select from a Set of Strings
viridisLite             Default Color Maps from 'matplotlib' (Lite Version)
xml2                    Parse XML


    
    

Environmental variables are set in the shell, prior to executing your R code or starting the R interpreter. If you are using the default shell on GL, use the command setenv R_LIBS_USER /afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. If you are using bash as your shell, use the command export R_LIBS_USER=/afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. If you want to set the more permanently, place it in the appropriate ~/.cshrc or ~/.bashrc

1. degreeChange (20 points)

Task

In this problem, you are to implement a function, degreeChange, that takes in some messy data from the University of System of Maryland, cleans it, and produces a DataFrame showing the change in the number of bachelor's degrees awarded at an institution, per area.

The degreeChange function takes exactly 3 parameters

  • Two file names, where the data for two different years is stored
  • The name of the instutitution, as a string, for which to produce a data frame for

Data File

The data for this problem comes from USM IRIS system. We have provided 3 sample files for you to test and develop your code with, although additional files may be used when grading. All files have the same general set up, although some older files have different column names.

Note: If you wish to download more files to test your code, feel free to do so. The files are downloaded using the download txt button, and have the first and last lines removed.

Example Output

 
	
>source('433.R')
>print(degreeChange('USM2017',"USM1986","University of Maryland, Baltimore County"))
                         Bachelor.s
Agric. & Nat. Resources           0
Arch. & Envtl. Design             0
Area Studies                      9
Biological Sciences             421
Business & Management             8
Communications                   74
Computer & Info. Sci.            77
Education                         1
Engineering                     194
Fine & Applied Arts              96
Foreign Languages                 6
Health Professions              -15
Home Economics                    0
Interdisc. & CC Transfer         -8
Law                               0
Letters                          28
Library Science                   0
Mathematics                      72
Physical Science                 20
Psychology                      249
Public Affairs & Serv.           72
Social Sciences                 164
Total                          1468

changes <- degreeChange('USM2017',"USM2016","University of Maryland, College Park")
> print(changes)
                         Bachelor.s
Agric. & Nat. Resources          14
Arch. & Envtl. Design            10
Area Studies                    -11
Biological Sciences              -6
Business & Management           -76
Communications                   -1
Computer & Info. Sci.            63
Education                       -27
Engineering                      11
Fine & Applied Arts              -1
Foreign Languages               -12
Health Professions              130
Home Economics                  -28
Interdisc. & CC Transfer         -1
Law                               0
Letters                           5
Library Science                   0
Mathematics                      -2
Physical Science                  7
Psychology                      -55
Public Affairs & Serv.            0
Social Sciences                  19
Total                            39

	
	

How you will be graded

The points for this problem will be assigned according to the following rubric

RequirementPoints
Total20
Differences are calculated correctly10
Only the specified institutions information is returned5
Numbers with commas in them in the original data are handled correctly3
No warnings are printed when your function is called2

2. Plotting Bechdel Test Data for Movies (40 points)

Task

For this problem you will create 4 plots, using data on the Bechedel test in movies, as collected by FiveThirtyEight. This dataset is found in the repository for this assignment under the name 'movies.csv'. The Bechedel test is a way to represent the portrayal of women in film. It is an imperfect measure, but still presents an interesting way to look at the data. In order to pass the Bechedel test, the movie, or any work of fiction, must meet the following three critera:

  1. Two women characters are in the film
  2. They talk to eachother at least once
  3. About something besides a man

The movies.csv dataset contains the following fields

  • The Year the Movie was Released
  • The movies IMDB number
  • The title of the movie
  • How the movie failed the test, or if it passed (ok)
  • A clean version fo the previous column, without modifiers like "-disagree"
  • A binary decision of did the movie past the test or not
  • The budget of the movie
  • The domestic gross of the movie
  • The international gross of the movie
  • A code which is simply the concatenation of the year and PASS or FAIL
  • The budget in 2013 dollars
  • The domestic gross in 2013 dollars
  • The interational gross in 2013 dollars
  • An integer indicating what five year span the movie occured in, from 1990-2013
  • An integer indicating what decade the movie occured in, only present for movies released from 1990-2013

Using this data, and the ggplot2 library, you should produce the following plots

  • A histogram showing the number of movies that passed, or failed, by reason they failed. You should have 5 bars
  • A violin plot showing the distribution of release years by whether the movie passed or not.
  • A plot showing the average budget for movies that pass and don't pass in a given year using geom_smooth
  • A scatter plot showing the budget vs domestic gross for each movie, colored by and shape determined by if it passed the test, or how it failed if it did fail

You should submit both the plot as a png as well as the code generated to create your plots. Wrap your code to generate the plots in a function named makePlots.

Sample Output

The following plots are shown just as examples of the plot type and general direction. They are very incomplete and you should not strive to reproduce them.

Notes

  • You should not use the default colors or grayscale on any of your plots.
  • All your plots should have a meaningful title and axis labels
  • Use 2013 dollar ammount for any plot requiring budget or gross

How you will be graded

The points for this problem will be assigned according to the following rubric

Each plot is worth 10 points

RequirementPoints
Total10 x 4
The plot is the correct type3
The correct information is plotted3
The axis and title are labeled2
The color scheme is not the default scheme2

3. Corpus Statistics (40 points)

Task

For this problem you will create two R classes, corpus and document. A document holds the text of a single document, and can be queried for summary statistics about the document. A corpus is a collection of documents, and allows for summary statistics for the entire corpus to be queried.

The constructor for a document should take one argument, the name of a file. The constructor for a corpus should take one argument, the name of a directory, under which a document for each file ending in '.txt' is added to the corpus

The following methods should be implemented for each class

  1. summary(object) - Which for a document prints
    • The total number of words
    • The average number of letters in a word
    • The longest word
    And for a corpus prints
    • The total number of words in the corpus
    • The average number of words in a document in the corpus
    • The maximum and minimum number of words in a document in the corpus
    • The average word length of a word in the corpus
    • The longest word in the corpus
  2. most_common(object,N) - which returns a vector of the N most common words in the object
  3. preview(object)- which returns the first 50 words for a document, or each document if called on a corpus

Example Output

        

> source('433.R')

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> doc1 <- document('hounds.txt')
> summary(doc1)

 The document has 62245 words
 The average number of characters in a word is 4.39996786890513
 The longest word is http://www.gutenberg.org/3/0/7/3070/
> most_common(doc1,10)
 the   of  and   to    a    I that   in  was  you
3230 1694 1562 1449 1279 1256 1040  908  781  663
> most_common(doc1,20)

  the    of   and    to     a     I  that    in   was   you   his    he    is
 3230  1694  1562  1449  1279  1256  1040   908   781   663   658   657   605
   it  have   had  with which    my   for
  576   525   496   466   422   420   413
> preview(doc1)
 [1] "Project"       "Gutenberg's"   "The"           "Hound"
 [5] "of"            "the"           "Baskervilles," "by"
 [9] "Arthur"        "Conan"         "Doyle"         "This"
[13] "eBook"         "is"            "for"           "the"
[17] "use"           "of"            "anyone"        "anywhere"
[21] "at"            "no"            "cost"          "and"
[25] "with"          "almost"        "no"            "restrictions"
[29] "whatsoever."   "You"           "may"           "copy"
[33] "it,"           "give"          "it"            "away"
[37] "or"            "re-use"        "it"            "under"
[41] "the"           "terms"         "of"            "the"
[45] "Project"       "Gutenberg"     "License"       "included"
[49] "with"          "this"
> doc2 <- document('les_mis.txt')
> summary(doc2)

 The document has 568729 words
 The average number of characters in a word is 4.68505914064519
 The longest word is "Police-agent-Ja-vert-was-found-drowned-un-der-a-boat-of-the-Pont-au-
> most_common(doc2,10)

  the    of   and     a    to    in   was  that    he   his
36456 19550 13787 13371 13263 10162  8274  6858  6535  6077
> most_common(doc2,20)

  the    of   and     a    to    in   was  that    he   his   had    is which
36456 19550 13787 13371 13263 10162  8274  6858  6535  6077  6009  5814  4961
 with    on   The    it   not    at     I
 4398  3910  3676  3560  3560  3490  2905
> preview(doc2)
 [1] "The"          "Project"      "Gutenberg"    "EBook"        "of"
 [6] "Les"          "Misérables,"  "by"           "Victor"       "Hugo"
[11] "This"         "eBook"        "is"           "for"          "the"
[16] "use"          "of"           "anyone"       "anywhere"     "at"
[21] "no"           "cost"         "and"          "with"         "almost"
[26] "no"           "restrictions" "whatsoever."  "You"          "may"
[31] "copy"         "it,"          "give"         "it"           "away"
[36] "or"           "re-use"       "it"           "under"        "the"
[41] "terms"        "of"           "the"          "Project"      "Gutenberg"
[46] "License"      "included"     "with"         "this"         "eBook"
> corpus1 <- corpus(".")
> summary(corpus1)

 The corpus has  760772 words
 The average number of words in a document is 190193 words
 The maximum number of words in a document is 568729 words
 The minimum number of words in a document is 20348 words
 The average word is 4.61732292986598 characters long
 The longest word in the coprpus is "Police-agent-Ja-vert-was-found-drowned-un-der-a-boat-of-the-Pont-au-
> most_common(corpus1,10)

  the    of   and    to     a    in   was  that    he   his
45765 24414 18609 17791 17493 12977 10451  9625  8273  7963
> most_common(corpus1,20)

  the    of   and    to     a    in   was  that    he   his    is   had     I
45765 24414 18609 17791 17493 12977 10451  9625  8273  7963  7702  7346  7025
which  with    it   not    at    on   The
 6186  5854  5235  4656  4641  4522  4291
> preview(corpus1)
      [,1]            [,2]           [,3]
 [1,] "Project"       "The"          "***The"
 [2,] "Gutenberg's"   "Project"      "Project"
 [3,] "The"           "Gutenberg"    "Gutenberg's"
 [4,] "Hound"         "EBook"        "Etext"
 [5,] "of"            "of"           "of"
 [6,] "the"           "Les"          "Shakespeare's"
 [7,] "Baskervilles," "Misérables,"  "First"
 [8,] "by"            "by"           "Folio***"
 [9,] "Arthur"        "Victor"       "********************The"
[10,] "Conan"         "Hugo"         "Tragedie"
[11,] "Doyle"         "This"         "of"
[12,] "This"          "eBook"        "Macbeth*********************"
[13,] "eBook"         "is"           "This"
[14,] "is"            "for"          "is"
[15,] "for"           "the"          "our"
[16,] "the"           "use"          "3rd"
[17,] "use"           "of"           "edition"
[18,] "of"            "anyone"       "of"
[19,] "anyone"        "anywhere"     "most"
[20,] "anywhere"      "at"           "of"
[21,] "at"            "no"           "these"
[22,] "no"            "cost"         "plays."
[23,] "cost"          "and"          "See"
[24,] "and"           "with"         "the"
[25,] "with"          "almost"       "index."
[26,] "almost"        "no"           "Copyright"
[27,] "no"            "restrictions" "laws"
[28,] "restrictions"  "whatsoever."  "are"
[29,] "whatsoever."   "You"          "changing"
[30,] "You"           "may"          "all"
[31,] "may"           "copy"         "over"
[32,] "copy"          "it,"          "the"
[33,] "it,"           "give"         "world,"
[34,] "give"          "it"           "be"
[35,] "it"            "away"         "sure"
[36,] "away"          "or"           "to"
[37,] "or"            "re-use"       "check"
[38,] "re-use"        "it"           "the"
[39,] "it"            "under"        "copyright"
[40,] "under"         "the"          "laws"
[41,] "the"           "terms"        "for"
[42,] "terms"         "of"           "your"
[43,] "of"            "the"          "country"
[44,] "the"           "Project"      "before"
[45,] "Project"       "Gutenberg"    "posting"
[46,] "Gutenberg"     "License"      "these"
[47,] "License"       "included"     "files!!"
[48,] "included"      "with"         "Please"
[49,] "with"          "this"         "take"
[50,] "this"          "eBook"        "a"
      [,4]
 [1,] "Project"
 [2,] "Gutenberg's"
 [3,] "Adventures"
 [4,] "of"
 [5,] "Sherlock"
 [6,] "Holmes,"
 [7,] "by"
 [8,] "A."
 [9,] "Conan"
[10,] "Doyle"
[11,] "This"
[12,] "eBook"
[13,] "is"
[14,] "for"
[15,] "the"
[16,] "use"
[17,] "of"
[18,] "anyone"
[19,] "anywhere"
[20,] "at"
[21,] "no"
[22,] "cost"
[23,] "and"
[24,] "with"
[25,] "almost"
[26,] "no"
[27,] "restrictions"
[28,] "whatsoever."
[29,] "You"
[30,] "may"
[31,] "copy"
[32,] "it,"
[33,] "give"
[34,] "it"
[35,] "away"
[36,] "or"
[37,] "re-use"
[38,] "it"
[39,] "under"
[40,] "the"
[41,] "terms"
[42,] "of"
[43,] "the"
[44,] "Project"
[45,] "Gutenberg"
[46,] "License"
[47,] "included"
[48,] "with"
[49,] "this"
[50,] "eBook"
>

How you will be graded

The points for this problem will be assigned according to the following rubric

RequirementPoints
Total40
The document constructor prints no errors1
The corpus constructor prints no errors1
The summary statistics for the document are correct6
The summary statistics for the corpus are correct12
The most common function returns the correct words for a document3
The most common function returns the correct number of words for the argument passed for a document2
The most common function returns the correct words for a corpus3
The most common function returns the correct number of words for the argument passed for a corpus2
Preview returns the first 50 words for a single document5
Preview returns the first 50 words for each document in the corpus5

Running your code

To test your code, you can start the R interpreter and load the file using source('433.R').

Submitting your code

Your code should be committed and pushed back to GitHub before the due date. All your R code should be in a file named 433.R. The plots should be named plot1.png, plot2.png, plot3.png, and plot4.png.