This assignment requires you to write several R functions and classes. These problems are designed to give you some hands on
experience in the following areas:
Statiscal Computing
Objects in R
Plotting using ggplot
Text processing in R
Working with DataFrames
The design and implementation (unless specified) of each problem
is completely up to you. Suggested approaches may be mentioned
in class, but they are just that, suggestions. You are welcome
to implement down any path you see fit.
Some guidelines for the problems:
All problems should handle errors in a reasonable way. Your solutions
will be tested for invalid input or usage. Give the user a meaningful
message if they misuse the script or if an error occurs.
Projects will be graded electronically on one of the GL Linux
servers — linux[123].gl.umbc.edu. You are welcome to develop
your assignment at home, but make sure your scripts run correctly on
the GL Linux servers.
The assignment will be submitted through GitHub. To get started with the assignment, go to https://classroom.github.com/a/4KGf0URx and sign in to your GitHub account. This will create a repository for you containing one file, with the names you are expected to use for this assignment. Also included are various data files. Please include a README with your name in it.
Git on GL is outdated and requires a slightly different mechanism to you. On most systems, running a git command will prompt for your GitHub username and password. On GL, it expects the user name as part of the command. One was to achieve this is to modify the clone command so that it now reads
Prior to doing this, you may need to run the command unset SSH_ASKPASS. This is so GL knows not to try and prompt you for your password using a GUI, and instead asks for it on the command line.
If you are using tcsh or another c-shell, you need to type unsetenv SSH_ASKPASS to prevent the GUI error.
Available R Libraries
The following libraries have been installed to my public directory, and are useable by setting your enviormental variable R_LIBS_USER to /afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. These are the only non-standard libraries you are permitted to use.
Packages in library '/afs/umbc.edu/users/b/w/bwilk1/pub/R_libs':
assertthat Easy Pre and Post Assertions
BH Boost C++ Header Files
bindr Parametrized Active Bindings
bindrcpp An 'Rcpp' Interface to Active Bindings
broom Convert Statistical Analysis Objects into Tidy Data Frames
cellranger Translate Spreadsheet Cell Ranges to Rows and Columns
codetools Code Analysis Tools for R
colorspace Color Space Manipulation
dichromat Color Schemes for Dichromats
digest Create Compact Hash Digests of R Objects
dplyr A Grammar of Data Manipulation
forcats Tools for Working with Categorical Variables (Factors)
foreign Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
ggplot2 Create Elegant Data Visualisations Using the Grammar of Graphics
glue Interpreted String Literals
gtable Arrange 'Grobs' in Tables
hms Pretty Time of Day
jsonlite A Robust, High Performance JSON Parser and Generator for R
labeling Axis Labeling
lattice Trellis Graphics for R
lazyeval Lazy (Non-Standard) Evaluation
lubridate Make Dealing with Dates a Little Easier
magrittr A Forward-Pipe Operator for R
mime Map Filenames to MIME Types
mnormt The Multivariate Normal and t Distributions
modelr Modelling Functions that Work with the Pipe
munsell Utilities for Using Munsell Colours
nlme Linear and Nonlinear Mixed Effects Models
pkgconfig Private Configuration for 'R' Packages
plogr The 'plog' C++ Logging Library
plyr Tools for Splitting, Applying and Combining Data
psych Procedures for Psychological, Psychometric, and Personality Research
purrr Functional Programming Tools
R6 Classes with Reference Semantics
RColorBrewer ColorBrewer Palettes
Rcpp Seamless R and C++ Integration
readr Read Rectangular Text Data
rematch Match Regular Expressions with a Nicer 'API'
reshape2 Flexibly Reshape Data: A Reboot of the Reshape Package
rlang Functions for Base Types and Core R and 'Tidyverse' Features
scales Scale Functions for Visualization
selectr Translate CSS Selectors to XPath Expressions
stringi Character String Processing Facilities
stringr Simple, Consistent Wrappers for Common String Operations
tibble Simple Data Frames
tidyr Easily Tidy Data with 'spread()' and 'gather()' Functions
tidyselect Select from a Set of Strings
viridisLite Default Color Maps from 'matplotlib' (Lite Version)
xml2 Parse XML
Environmental variables are set in the shell, prior to executing your R code or starting the R interpreter. If you are using the default shell on GL, use the command setenv R_LIBS_USER /afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. If you are using bash as your shell, use the command export R_LIBS_USER=/afs/umbc.edu/users/b/w/bwilk1/pub/R_libs. If you want to set the more permanently, place it in the appropriate ~/.cshrc or ~/.bashrc
1. degreeChange (20 points)
Task
In this problem, you are to implement a function, degreeChange, that takes in some messy data from the University of System of Maryland, cleans it, and produces a DataFrame showing the change in the number of bachelor's degrees awarded at an institution, per area.
The degreeChange function takes exactly 3 parameters
Two file names, where the data for two different years is stored
The name of the instutitution, as a string, for which to produce a data frame for
Data File
The data for this problem comes from USM IRIS system. We have provided 3 sample files for you to test and develop your code with, although additional files may be used when grading. All files have the same general set up, although some older files have different column names.
Note: If you wish to download more files to test your code, feel free to do so. The files are downloaded using the download txt button, and have the first and last lines removed.
Example Output
>source('433.R')
>print(degreeChange('USM2017',"USM1986","University of Maryland, Baltimore County"))
Bachelor.s
Agric. & Nat. Resources 0
Arch. & Envtl. Design 0
Area Studies 9
Biological Sciences 421
Business & Management 8
Communications 74
Computer & Info. Sci. 77
Education 1
Engineering 194
Fine & Applied Arts 96
Foreign Languages 6
Health Professions -15
Home Economics 0
Interdisc. & CC Transfer -8
Law 0
Letters 28
Library Science 0
Mathematics 72
Physical Science 20
Psychology 249
Public Affairs & Serv. 72
Social Sciences 164
Total 1468
changes <- degreeChange('USM2017',"USM2016","University of Maryland, College Park")
> print(changes)
Bachelor.s
Agric. & Nat. Resources 14
Arch. & Envtl. Design 10
Area Studies -11
Biological Sciences -6
Business & Management -76
Communications -1
Computer & Info. Sci. 63
Education -27
Engineering 11
Fine & Applied Arts -1
Foreign Languages -12
Health Professions 130
Home Economics -28
Interdisc. & CC Transfer -1
Law 0
Letters 5
Library Science 0
Mathematics -2
Physical Science 7
Psychology -55
Public Affairs & Serv. 0
Social Sciences 19
Total 39
How you will be graded
The points for this problem will be assigned according to the following rubric
Requirement
Points
Total
20
Differences are calculated correctly
10
Only the specified institutions information is returned
5
Numbers with commas in them in the original data are handled correctly
3
No warnings are printed when your function is called
2
2. Plotting Bechdel Test Data for Movies (40 points)
Task
For this problem you will create 4 plots, using data on the Bechedel test in movies, as collected by FiveThirtyEight. This dataset is found in the repository for this assignment under the name 'movies.csv'. The Bechedel test is a way to represent the portrayal of women in film. It is an imperfect measure, but still presents an interesting way to look at the data. In order to pass the Bechedel test, the movie, or any work of fiction, must meet the following three critera:
Two women characters are in the film
They talk to eachother at least once
About something besides a man
The movies.csv dataset contains the following fields
The Year the Movie was Released
The movies IMDB number
The title of the movie
How the movie failed the test, or if it passed (ok)
A clean version fo the previous column, without modifiers like "-disagree"
A binary decision of did the movie past the test or not
The budget of the movie
The domestic gross of the movie
The international gross of the movie
A code which is simply the concatenation of the year and PASS or FAIL
The budget in 2013 dollars
The domestic gross in 2013 dollars
The interational gross in 2013 dollars
An integer indicating what five year span the movie occured in, from 1990-2013
An integer indicating what decade the movie occured in, only present for movies released from 1990-2013
Using this data, and the ggplot2 library, you should produce the following plots
A histogram showing the number of movies that passed, or failed, by reason they failed. You should have 5 bars
A violin plot showing the distribution of release years by whether the movie passed or not.
A plot showing the average budget for movies that pass and don't pass in a given year using geom_smooth
A scatter plot showing the budget vs domestic gross for each movie, colored by and shape determined by if it passed the test, or how it failed if it did fail
You should submit both the plot as a png as well as the code generated to create your plots. Wrap your code to generate the plots in a function named makePlots.
Sample Output
The following plots are shown just as examples of the plot type and general direction. They are very incomplete and you should not strive to reproduce them.
Notes
You should not use the default colors or grayscale on any of your plots.
All your plots should have a meaningful title and axis labels
Use 2013 dollar ammount for any plot requiring budget or gross
How you will be graded
The points for this problem will be assigned according to the following rubric
Each plot is worth 10 points
Requirement
Points
Total
10 x 4
The plot is the correct type
3
The correct information is plotted
3
The axis and title are labeled
2
The color scheme is not the default scheme
2
3. Corpus Statistics (40 points)
Task
For this problem you will create two R classes, corpus and document.
A document holds the text of a single document, and can be queried for summary statistics about the document. A corpus is a collection of documents, and allows for summary statistics for the entire corpus to be queried.
The constructor for a document should take one argument, the name of a file.
The constructor for a corpus should take one argument, the name of a directory, under which a document for each file ending in '.txt' is added to the corpus
The following methods should be implemented for each class
summary(object) - Which for a document prints
The total number of words
The average number of letters in a word
The longest word
And for a corpus prints
The total number of words in the corpus
The average number of words in a document in the corpus
The maximum and minimum number of words in a document in the corpus
The average word length of a word in the corpus
The longest word in the corpus
most_common(object,N) - which returns a vector of the N most common words in the object
preview(object)- which returns the first 50 words for a document, or each document if called on a corpus
Example Output
> source('433.R')
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> doc1 <- document('hounds.txt')
> summary(doc1)
The document has 62245 words
The average number of characters in a word is 4.39996786890513
The longest word is http://www.gutenberg.org/3/0/7/3070/
> most_common(doc1,10)
the of and to a I that in was you
3230 1694 1562 1449 1279 1256 1040 908 781 663
> most_common(doc1,20)
the of and to a I that in was you his he is
3230 1694 1562 1449 1279 1256 1040 908 781 663 658 657 605
it have had with which my for
576 525 496 466 422 420 413
> preview(doc1)
[1] "Project" "Gutenberg's" "The" "Hound"
[5] "of" "the" "Baskervilles," "by"
[9] "Arthur" "Conan" "Doyle" "This"
[13] "eBook" "is" "for" "the"
[17] "use" "of" "anyone" "anywhere"
[21] "at" "no" "cost" "and"
[25] "with" "almost" "no" "restrictions"
[29] "whatsoever." "You" "may" "copy"
[33] "it," "give" "it" "away"
[37] "or" "re-use" "it" "under"
[41] "the" "terms" "of" "the"
[45] "Project" "Gutenberg" "License" "included"
[49] "with" "this"
> doc2 <- document('les_mis.txt')
> summary(doc2)
The document has 568729 words
The average number of characters in a word is 4.68505914064519
The longest word is "Police-agent-Ja-vert-was-found-drowned-un-der-a-boat-of-the-Pont-au-
> most_common(doc2,10)
the of and a to in was that he his
36456 19550 13787 13371 13263 10162 8274 6858 6535 6077
> most_common(doc2,20)
the of and a to in was that he his had is which
36456 19550 13787 13371 13263 10162 8274 6858 6535 6077 6009 5814 4961
with on The it not at I
4398 3910 3676 3560 3560 3490 2905
> preview(doc2)
[1] "The" "Project" "Gutenberg" "EBook" "of"
[6] "Les" "Misérables," "by" "Victor" "Hugo"
[11] "This" "eBook" "is" "for" "the"
[16] "use" "of" "anyone" "anywhere" "at"
[21] "no" "cost" "and" "with" "almost"
[26] "no" "restrictions" "whatsoever." "You" "may"
[31] "copy" "it," "give" "it" "away"
[36] "or" "re-use" "it" "under" "the"
[41] "terms" "of" "the" "Project" "Gutenberg"
[46] "License" "included" "with" "this" "eBook"
> corpus1 <- corpus(".")
> summary(corpus1)
The corpus has 760772 words
The average number of words in a document is 190193 words
The maximum number of words in a document is 568729 words
The minimum number of words in a document is 20348 words
The average word is 4.61732292986598 characters long
The longest word in the coprpus is "Police-agent-Ja-vert-was-found-drowned-un-der-a-boat-of-the-Pont-au-
> most_common(corpus1,10)
the of and to a in was that he his
45765 24414 18609 17791 17493 12977 10451 9625 8273 7963
> most_common(corpus1,20)
the of and to a in was that he his is had I
45765 24414 18609 17791 17493 12977 10451 9625 8273 7963 7702 7346 7025
which with it not at on The
6186 5854 5235 4656 4641 4522 4291
> preview(corpus1)
[,1] [,2] [,3]
[1,] "Project" "The" "***The"
[2,] "Gutenberg's" "Project" "Project"
[3,] "The" "Gutenberg" "Gutenberg's"
[4,] "Hound" "EBook" "Etext"
[5,] "of" "of" "of"
[6,] "the" "Les" "Shakespeare's"
[7,] "Baskervilles," "Misérables," "First"
[8,] "by" "by" "Folio***"
[9,] "Arthur" "Victor" "********************The"
[10,] "Conan" "Hugo" "Tragedie"
[11,] "Doyle" "This" "of"
[12,] "This" "eBook" "Macbeth*********************"
[13,] "eBook" "is" "This"
[14,] "is" "for" "is"
[15,] "for" "the" "our"
[16,] "the" "use" "3rd"
[17,] "use" "of" "edition"
[18,] "of" "anyone" "of"
[19,] "anyone" "anywhere" "most"
[20,] "anywhere" "at" "of"
[21,] "at" "no" "these"
[22,] "no" "cost" "plays."
[23,] "cost" "and" "See"
[24,] "and" "with" "the"
[25,] "with" "almost" "index."
[26,] "almost" "no" "Copyright"
[27,] "no" "restrictions" "laws"
[28,] "restrictions" "whatsoever." "are"
[29,] "whatsoever." "You" "changing"
[30,] "You" "may" "all"
[31,] "may" "copy" "over"
[32,] "copy" "it," "the"
[33,] "it," "give" "world,"
[34,] "give" "it" "be"
[35,] "it" "away" "sure"
[36,] "away" "or" "to"
[37,] "or" "re-use" "check"
[38,] "re-use" "it" "the"
[39,] "it" "under" "copyright"
[40,] "under" "the" "laws"
[41,] "the" "terms" "for"
[42,] "terms" "of" "your"
[43,] "of" "the" "country"
[44,] "the" "Project" "before"
[45,] "Project" "Gutenberg" "posting"
[46,] "Gutenberg" "License" "these"
[47,] "License" "included" "files!!"
[48,] "included" "with" "Please"
[49,] "with" "this" "take"
[50,] "this" "eBook" "a"
[,4]
[1,] "Project"
[2,] "Gutenberg's"
[3,] "Adventures"
[4,] "of"
[5,] "Sherlock"
[6,] "Holmes,"
[7,] "by"
[8,] "A."
[9,] "Conan"
[10,] "Doyle"
[11,] "This"
[12,] "eBook"
[13,] "is"
[14,] "for"
[15,] "the"
[16,] "use"
[17,] "of"
[18,] "anyone"
[19,] "anywhere"
[20,] "at"
[21,] "no"
[22,] "cost"
[23,] "and"
[24,] "with"
[25,] "almost"
[26,] "no"
[27,] "restrictions"
[28,] "whatsoever."
[29,] "You"
[30,] "may"
[31,] "copy"
[32,] "it,"
[33,] "give"
[34,] "it"
[35,] "away"
[36,] "or"
[37,] "re-use"
[38,] "it"
[39,] "under"
[40,] "the"
[41,] "terms"
[42,] "of"
[43,] "the"
[44,] "Project"
[45,] "Gutenberg"
[46,] "License"
[47,] "included"
[48,] "with"
[49,] "this"
[50,] "eBook"
>
How you will be graded
The points for this problem will be assigned according to the following rubric
Requirement
Points
Total
40
The document constructor prints no errors
1
The corpus constructor prints no errors
1
The summary statistics for the document are correct
6
The summary statistics for the corpus are correct
12
The most common function returns the correct words for a document
3
The most common function returns the correct number of words for the argument passed for a document
2
The most common function returns the correct words for a corpus
3
The most common function returns the correct number of words for the argument passed for a corpus
2
Preview returns the first 50 words for a single document
5
Preview returns the first 50 words for each document in the corpus
5
Running your code
To test your code, you can start the R interpreter and load the file using source('433.R').
Submitting your code
Your code should be committed and pushed back to GitHub before the due date. All your R code should be in a file named 433.R. The plots should be named plot1.png, plot2.png, plot3.png, and plot4.png.