R¶

Introduction, Variables, Data Types¶

Brief History¶

R was developed initially as an alternative implementation of a language known as S
- S first came out in 1975 and was originally developed at Bell Labs
Work on R began in 1993; the first paper was published in 1996, and the language reached version 1.0 in 2000
- Originally led by a team at the University of Auckland in New Zealand
Designed originally for statisticians, not for programmers

Running R¶

R can be run
- From the command line, by using the command R
- Using the shebang line #!/usr/bin/Rscript
- In Jupyter using the IR kernel
- From inside the RStudio IDE

Limitations of R¶

Code is generally slower than other languages
- This was an acceptable trade-off given the ease of use
Uses a lot of memory
- No easy way to perform calculations in chunks, although some packages are starting to provide support for this
- Is potentially a poor choice for big data

Assignment¶

R supports two assignment operators: <- and =
Although both work, most style guides and books recommend <-
- Many people argue the exact opposite, however
<- can be reversed and written as ->, but this is not normally done

In [ ]:

a <- 1
b = 1
1 -> c

In [ ]:

a == b

In [ ]:

b == c
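
One place where the two operators behave differently is inside a function call. The following is a minimal sketch (the variable name y is illustrative and not part of the lecture): inside a call, = names an argument, while <- performs an assignment and then passes the value along.

```r
# Inside a call, = binds the value to the argument named x.
# No variable called x is created by this line.
median(x = c(1, 2, 3))
print(exists("x", inherits = FALSE))  # FALSE unless x was defined earlier

# Inside a call, <- first assigns y, then passes the value to median.
median(y <- c(1, 2, 3))
print(y)
```

This is one reason some style guides reserve = for arguments and <- for assignment.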

Variable Names¶

Variables can contain letters, numbers, underscores, and the dot symbol
- Because of some historical weirdness, dots in R are often found instead of underscores
```
a.long.name <- "String"
```
The following names should not be used
```
  c, q, s, t, C, D, F, I, T
```

In [ ]:

aLongName <- 0
a_long_name <- 0
a.long.name <- 0

In [ ]:

print(aLongName)

In [ ]:

print(a_long_name)

In [ ]:

print(a.long.name)
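
Several of the names to avoid, such as c, t, T, and F, already refer to base R functions or constants. As a small sketch of the kind of confusion this can cause: T is an ordinary variable that happens to hold TRUE, so it can be masked, whereas TRUE itself is reserved.

```r
# T is just a variable bound to TRUE in base R, so it can be masked
T <- 0
print(isTRUE(T))   # FALSE: T now refers to the number 0

# Removing the mask makes T refer to the base value again
rm(T)
print(isTRUE(T))   # TRUE

# c is the function that builds vectors; a variable named c hides that intent
print(is.function(c))
```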

Data Types and Data Structures¶

R has data types, and they are important, but they take a back seat to the data structures
- A variable cannot be scalar in R
The simplest data structure is the vector
- Every assignment that seems like a single number, string, etc. is actually a single element vector

In [ ]:

num <- 1
print(num)

In [ ]:

string <- "String"
print(string)

In [ ]:

bool <- TRUE
print(bool)
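
To see that a "single" value really is a vector of length one, it can be inspected and indexed like any other vector. The name single.value below is hypothetical.

```r
# A lone number is a vector with exactly one element
single.value <- 42
print(is.vector(single.value))  # TRUE
print(length(single.value))     # 1
print(single.value[1])          # 42
```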

Data Types¶

The data types supported by R are:
- integer
- double
- complex (Uses "i" rather than "j" as seen in Python)
- character (This can hold strings of any length)
- logical

In [ ]:

#Integers must be denoted by appending "L" to the number
#Otherwise they will be interpreted as a double by default
int <- 1L

#typeof() function returns the type as a string
print(typeof(1L))
print(typeof(1))

In [ ]:

float.a <- 1
float.b <- 1.01

print(typeof(float.a))
print(typeof(float.b))

In [ ]:

#Infinity and Not-a-Number are both represented as doubles
float.c <- NaN
float.d <- Inf
float.e <- -Inf

print(typeof(float.c))
print(typeof(float.d))
print(typeof(float.e))

In [ ]:

imaginary.a <- 1 + 1i
imaginary.b <- 1 + 0i

print(typeof(imaginary.a))
print(typeof(imaginary.b))

In [ ]:

string.example.1 <- "String"
string.example.2 <- 'String'

print(typeof(string.example.1))
print(typeof(string.example.2))

string.example.2 <- 1
print(typeof(string.example.2))

In [ ]:

#Logical values are typed in all uppercase letters
logic.t <- TRUE
logic.f <- FALSE

print(typeof(logic.t))
print(typeof(logic.f))

Testing Data Types¶

R has numerous predicate functions relating to data types
There is one for each data type
- is.DATA_TYPE_NAME(x)
- e.g. is.integer(x)
There is also a generic number predicate
- is.numeric(x)

In [ ]:

print(int)
print(is.integer(int))
print(is.double(int))
print(is.numeric(int))
print(is.numeric("1"))

Type Casting¶

While data types are coerced automatically in some situations, use a variation of the as function to cast explicitly
- as.DATA_TYPE_NAME(x)
- e.g. as.integer(1.003)
This pattern is used throughout R, not just with primitive data types

In [ ]:

print(as.character(1L))
print(as.integer(1.0004))
print(as.integer(Inf))
print(as.double(1L))
print(as.complex(1))
print(as.numeric(TRUE))
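
Two details of explicit casting are easy to miss: as.integer truncates toward zero rather than rounding, and a value that cannot be converted becomes NA with a warning. A short sketch, with the variable name bad chosen only for illustration:

```r
# as.integer drops the fractional part (truncation toward zero)
print(as.integer(2.9))    # 2
print(as.integer(-2.9))   # -2

# Strings that look like numbers convert cleanly
print(as.integer("12"))   # 12

# Strings that do not convert become NA (a warning is normally issued)
bad <- suppressWarnings(as.integer("twelve"))
print(is.na(bad))         # TRUE
```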

Data Structures¶

Basic Data Structures in R can be described by the number of dimensions supported, and the data types allowed
From "Advanced R" by Hadley Wickham

| Dimensions | Homogeneous | Heterogeneous |
|------------|-------------|---------------|
| 1-D        | Vector      | List          |
| 2-D        | Matrix      | Data Frame    |
| N-D        | Array       |               |

Vectors¶

A vector can be created by using the c function
```
a.vector <- c(1,2,3,4)
```
All elements of a vector must be of the same type. If multiple types are passed to the c function, they will be coerced to a common type

In [ ]:

a.vector <- c(1,2,3,4)
print(a.vector)

In [ ]:

a.vector <- c(1.001,2,3,4)
print(a.vector)

In [ ]:

a.vector <- c(1.01,TRUE,3,4)
print(a.vector)

In [ ]:

a.vector <- c(TRUE,"a",3,4)
print(a.vector)
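
The coercion performed by c follows a fixed order: logical, then integer, then double, then complex, then character. The most flexible type present wins. The sketch below checks this with typeof.

```r
# Each call mixes two types; the result takes the more flexible one
print(typeof(c(TRUE, 1L)))    # "integer"
print(typeof(c(1L, 2.5)))     # "double"
print(typeof(c(2.5, 1i)))     # "complex"
print(typeof(c(2.5, "a")))    # "character"

# Once a character is present, everything becomes a string
print(c(TRUE, "a", 3))        # "TRUE" "a" "3"
```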

Factors¶

Factors are vectors that are limited to certain values
- Represent categorical data
- Helpful in statistical analysis
A factor can be created using the factor function, or converting an existing vector by using as.factor

In [ ]:

factor.1 <- factor(c("UMBC","UMCP","UMUC","UMB","UB"))
print(factor.1)
cat("\n")
factor.2 <- factor(c("Senior","Junior","Senior",
                     "Junior","Sophomore"))
print(factor.2)

In [ ]:

# Can use the levels argument to specify all possible values
factor.3 <- factor(c("Senior","Junior","Senior",
                     "Junior","Sophomore"),
                    levels=c("Senior","Junior",
                             "Sophomore","Freshman"))
print(factor.3)
cat("\n")
factor.4 <- as.factor(c("Senior","Junior",
                        "Senior","Junior","Sophomore"))
print(factor.4)
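
Internally a factor stores integer codes that point into its levels. A brief sketch of how to inspect that, using a hypothetical factor named class.year:

```r
# Levels are given explicitly, so their order is Freshman < ... < Senior
class.year <- factor(c("Senior", "Junior", "Senior"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"))

print(levels(class.year))      # all four possible values
print(as.integer(class.year))  # 4 3 4: positions within the levels
print(table(class.year))       # counts, including zeros for unused levels
```

Because unused levels are kept, table reports a count of zero for Freshman and Sophomore.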

Lists¶

A list is a one dimensional (technically) data structure
- It can hold a mixture of any data types
- It can recursively hold other lists and vectors
Created using the list function
```
a.list <- list("a",2,3.14,FALSE)
```

In [ ]:

a.list <- list("a", 2, 3.14, FALSE)

#The str function will show the structure of a variable
#str DOES NOT stand for string, it stands for structure
str(a.list)
print(a.list)

In [ ]:

recursive.list <- list("a", 2, 3.14, list("re","cursive"))
str(recursive.list)

In [ ]:

# If you try to use c recursively, there is no error
# Everything is just flattened
a.vector <- c(1,2,3,c(4,5))
str(a.vector)

#Applying c to arguments that include at least one list
#coerces the entire structure to a list
coerced.list <- c(1,2,3,list(4,5),list(6,7))
str(coerced.list)

Attributes¶

Under the surface, R is a very object-oriented language
- We will talk more about creating user-defined objects in a later lecture
All data structures we will discuss today have attributes that can be assigned values

The general syntax is

attr(OBJECT, "ATTRIBUTE_NAME") <- ATTRIBUTE_VALUE

In [ ]:

obj <- c(3,4,5,6)
print(attr(obj,"time_created"))
attr(obj,"time_created") <- date()
print(attr(obj,"time_created"))
cat("\n")
print(attributes(obj))
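
Custom attributes are fragile: most operations that build a new vector, including subsetting, do not carry them over. A minimal sketch with a hypothetical "units" attribute:

```r
# Attach a custom attribute to a plain numeric vector
measurement <- c(3, 4, 5, 6)
attr(measurement, "units") <- "cm"
print(attributes(measurement))

# Subsetting creates a new vector without the custom attribute
print(attributes(measurement[1:2]))  # NULL
```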

Special Attributes¶

While an attribute name can be anything, a few special attributes exist that modify the behavior of the object
- Names
- Dimensions
- Class
These attributes are so important that they have dedicated functions to access them (names, dim, and class), which should be used instead of the attr function

Naming Indexes¶

An existing list or vector can be given named indices by setting the names attribute
Just as before, we assign into what looks like a function call
```
names(OBJECT) <- c(SERIES OF CHARACTERS)
```
A list or vector can also be created using named indices
```
VARIABLE <- c(a = 1, b = 2)
```

In [ ]:

scores <-  c(80,75,80,100,95,85)
names(scores) <- c("Regex HW","Regex Quiz",
                   "Shell HW","Shell Quiz", 
                   "R HW", "R Quiz")
print(scores)
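
Once names are set, they can be used in place of positions when subsetting. The sketch below creates a named vector directly (the name quiz.scores and its values are made up for illustration).

```r
# Names can be supplied at creation time
quiz.scores <- c(regex = 75, shell = 100, r = 85)
print(names(quiz.scores))

# Index by one name, or by a vector of names in any order
print(quiz.scores["shell"])
print(quiz.scores[c("r", "regex")])

# unname drops the names and leaves the bare values
print(unname(quiz.scores))
```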

Matrices¶

A matrix is a 2-d data structure that is homogeneous in type
- Usually numbers, but could be boolean or characters too
Can be created by
- Using the matrix function
- Adding dimensions to an already existing vector
- Using the cbind or rbind functions

In [ ]:

# Using the Matrix Function
m <- matrix( c(1,2,3,4,5,6,7,8,9,10,11,12), 
            nrow=3, ncol=4 )
print(m)
cat("\n")
m2 <- matrix(1:12,ncol=4)
print(m2)

In [ ]:

#Creating a matrix of zeros
zeros <- matrix(0,nrow=3,ncol=4)
print(zeros)
cat("\n")
print(dim(zeros))

In [ ]:

#Adding Dimensions to an existing Vector
vec <- 1:12
print(vec)
print(dim(vec))
cat("\n")
dim(vec) <- c(3,4)
print(vec)

In [ ]:

#Using cbind
m3 <- cbind(c(1,2,3),c(4,5,6),c(7,8,9),c(10,11,12))
print(m3)
cat("\n")
m4 <- rbind(c(1,4,7,10),c(2,5,8,11),c(3,6,9,12))
print(m4)
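
All of the examples above fill the matrix column by column, which is R's default. The matrix function also accepts byrow = TRUE to fill row by row instead. A small sketch with hypothetical names:

```r
# Default: values fill down each column first
by.col <- matrix(1:6, nrow = 2)
print(by.col)
print(by.col[1, ])   # 1 3 5

# byrow = TRUE: values fill across each row first
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by.row)
print(by.row[1, ])   # 1 2 3
```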

Data Frames¶

Data Frames are 2-d data structures in which a given column of the data frame must have the same type, but columns may have different types
Each row is like a record in a simple database
Is generally the most common data structure encountered in R

Creating a Data Frame¶

While Data Frames are often created by reading directly from a file, it is also possible to create them programmatically.

The general syntax is

df <- data.frame(COL1 = c(VALUES FOR COL 1),
               COL2 = c(VALUES FOR COL2), ..., 
               COL_N = c(VALUES FOR COL_N))

In [ ]:

df <- data.frame(name=c("UMBC","UMCP","Towson"),
                 zipcode=c(21250,20742,21252),
                 undergrad=c(11142,28472,19596),
                 graduate=c(2498,10611,3109))
print(df)

Common Functions on a Data Frame¶

The function nrow returns the number of rows in the data frame
The functions ncol and length both return the number of columns
The names of the rows can be accessed and changed using the row.names function

In [ ]:

print(nrow(df))
print(ncol(df))
row.names(df) <- c('A','B','C')
print(df)
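
A few other base functions are useful for a quick look at a data frame's shape and column types. This sketch uses a small made-up data frame named schools so that it does not depend on the data above.

```r
# Hypothetical data, for illustration only
schools <- data.frame(name = c("A", "B", "C"),
                      enrolled = c(100, 250, 75))

print(dim(schools))            # rows, then columns: 3 2
print(length(schools))         # number of columns: 2
print(names(schools))          # column names
print(sapply(schools, class))  # class of each column
```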

Reading Data¶

R has many built in functions to read data files into data frames
- read.table reads a whitespace-separated file by default, and is the basis for many other functions
- read.csv reads a comma-separated values file, and is actually just a call to read.table
R supports many other formats through various libraries
- One of the most common libraries is foreign, which reads data produced by other statistical software such as SPSS, SAS, and Stata

In [ ]:

usm <- read.table("data/usm.tsv",sep="\t",header=TRUE)
print(usm)

In [ ]:

usm2 <- read.csv("data/usm.csv",row.names=1)
print(usm2)

Writing Data¶

R similarly supports many different formats in which to write data to a file
- write.table
- write.csv
By default, column and row names are written to the file; to remove them, set col.names or row.names to FALSE

In [ ]:

write.csv(usm2,'data/usm2.csv')

In [ ]:

#write.csv ignores append and col.names (with a warning)
write.csv(usm2,'data/usm2.csv',append=TRUE,col.names=FALSE)

In [ ]:

write.table(usm2,'data/usm2.csv',sep=","
          ,append=TRUE,col.names=FALSE)
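
A quick way to check that a file was written as intended is to read it straight back. The sketch below does a round trip through a temporary file; the data frame scratch and the path are hypothetical and unrelated to the usm files.

```r
# Hypothetical data, written to a temporary file
scratch <- data.frame(id = 1:3, score = c(90.5, 82, 77))
path <- tempfile(fileext = ".csv")

# Leave out row names so the file holds only the two columns
write.csv(scratch, path, row.names = FALSE)

# Read the file back and compare with the original
restored <- read.csv(path)
print(restored)
print(all.equal(scratch, restored))

unlink(path)  # clean up the temporary file
```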

Math¶

Standard operations are +, -, *, /, and ^
Modulus operator is %%
Integer division is %/%
Square root and absolute value are part of R's base package

In [ ]:

#Addition
print(1 + 1)
print(1 + 1.0)
print(1 + 1i + 2)
print(2 + 1 + 3i)
print(2 + 3i + 4 + 5i)

In [ ]:

#Subtraction
print(3-2)
print(0-3)

In [ ]:

#Multiplication
print(3 * 4)
print(3 * .12)

In [ ]:

#Division
print(3/4)
print(0/4)
print(0/0)
print(3/0)
print(-3/0)

In [ ]:

# Integer Division
print(3 %/% 4)
print(12 %/% 5)
print(3 %/% 0)
print(0 %/% 0)

In [ ]:

#Modulus

print(3 %% 3)
print(10 %% 3)
print(0 %% 0)
print(3 %% 0)

In [ ]:

print(3 ^ 3)
print(9 ^ 0.5)
print(10 ^ -2)
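
The square root and absolute value functions mentioned above are sqrt and abs. The sketch below also shows how %/% and %% treat negative numbers: integer division rounds down, and the two results always recombine to the original value.

```r
print(sqrt(16))     # 4
print(abs(-3.5))    # 3.5

# Integer division rounds toward negative infinity
print((-7) %/% 3)   # -3
# The remainder takes the sign of the divisor
print((-7) %% 3)    # 2

# quotient * divisor + remainder gives back the original number
print(((-7) %/% 3) * 3 + ((-7) %% 3))  # -7
```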

High-Dimensional Math¶

Mathematical operations on higher-dimensional data structures are natively part of R
For scalar operations, like multiplying every value by 2, the dimensionality doesn't matter
- For operations involving two data frames, two matrices, etc. the sizes should match to prevent unintended outcomes
In addition, both matrices and data frames can be transposed using the t function

In [ ]:

#Vector / Scalar Math
vec <- 1:5
print(vec * 2)
print(vec / 10)
print(vec + 1)

In [ ]:

#Vector addition
vec2 <- 10:15
print(vec + vec2)
vec2 <- 11:15
print(vec + vec2)

In [ ]:

#Element-wise multiplication
print(vec * vec2)
cat("\n")
#Dot Product
print(vec %*% vec2)
#print(cvec,vec2))

In [ ]:

#Matrix / Vector Operations
mat <- matrix(1:20,nrow=5)
print(mat)
print(mat / vec)

In [ ]:

#Matrix / Vector Operations
mat2 <- matrix(1:20,nrow=4)
print(mat2)
print(mat2 / vec)

In [ ]:

#DataFrame Operations
print(usm)
cat("\n")
print(usm * 2)

In [ ]:

#Transposition
print(t(mat))
cat("\n")

In [ ]:

#What is the data structure returned by this function?
print(t(usm))
print(as.data.frame(t(usm)))
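
The "unintended outcomes" from mismatched sizes come from recycling: the shorter operand is repeated until it matches the longer one. When the longer length is a multiple of the shorter, this happens silently; otherwise R also emits a warning. A sketch with hypothetical names:

```r
short <- c(10, 20)
long <- 1:6

# short is recycled three times: 10 20 10 20 10 20
print(long + short)        # 11 22 13 24 15 26

# With a matrix, recycling runs down the columns,
# so here row 1 is multiplied by 10 and row 2 by 20
mat.small <- matrix(1:6, nrow = 2)
print(mat.small * short)
```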

Boolean Comparison¶

R supports the standard comparison operators <, >, <=, >=, ==, and !=
- The and and or operators are & and | respectively
When used between vectors or matrices, returns an object of the same size filled with boolean values

In [ ]:

##Standard Scalar Comparison
print(3 == 4)
print(3 < 4)
print(3 < 4 & 5 < 10)
print(3 == 4 | 4 != 4)

In [ ]:

## Comparing Data Structures
print(vec)
print(vec2)
cat("\n")
print(vec == vec2)
print(vec < vec2)

In [ ]:

#Vector and Matrix Comparison
print(vec)
print(mat)
cat("\n")
print(vec == mat)
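
A vector of boolean values is often reduced to a single answer or a set of positions. The base functions any, all, sum, and which cover the common cases; the data in this sketch is made up.

```r
# Hypothetical temperatures
temps <- c(68, 75, 81, 90)
hot <- temps > 80
print(hot)          # FALSE FALSE TRUE TRUE

print(any(hot))     # is at least one value TRUE?
print(all(hot))     # are all values TRUE?
print(sum(hot))     # TRUE counts as 1, so this counts matches: 2
print(which(hot))   # positions of the TRUE values: 3 4
```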

Subsetting Vectors¶

Indexing starts at 1!
Subsetting is done using square brackets ([ ])
Subsetting is most commonly done with a vector of
- Positive Integers
- Negative Integers
- Boolean Values

Positive Integer Subsetting¶

Positive integers denote which values to return

In [ ]:

print(vec)
print(vec[1])
print(vec[2:3])
print(vec[c(1,5)])
#Can repeat indices
print(vec[c(2,2)])

Negative Integer Subsetting¶

Negative integers denote which values to not return

In [ ]:

print(vec)
print(vec[-1])
print(vec[-2:-3])
print(vec[c(-1,-5)])

Boolean Value Subsetting¶

Values are returned when the subsetting vector contains TRUE
To prevent unexpected results, the vector used to subset should be the same length as the vector being indexed into
- If the index vector is shorter than the vector being indexed, its values will be recycled as many times as necessary

In [ ]:

# Explicit Boolean Subsetting
print(vec)
print(vec[c(TRUE,FALSE,TRUE,FALSE,TRUE)])
cat("\n")
#Using an expression
print(vec[vec %% 2 == 0])
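
The recycling of a short boolean index can be seen directly. In this sketch (the vector letters.five is hypothetical), a two-element index behaves as if it had been written out to full length.

```r
letters.five <- c("a", "b", "c", "d", "e")

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE
print(letters.five[c(TRUE, FALSE)])                     # "a" "c" "e"

# The fully written-out index gives the same result
print(letters.five[c(TRUE, FALSE, TRUE, FALSE, TRUE)])  # "a" "c" "e"
```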

Subsetting Lists¶

Subsetting a list with the [] operator will return another list
- To return a specific value (as a vector) use [[]]
The dollar operator is an alias for [[]], but only [[]] can use a variable to do the subsetting

In [ ]:

#Returns a list
li <- list(a=1,b=2,c=3,d=4,e=5)
print(li[2])
print(li[[2]])
print(li[['b']])
print(li$b)
idx <- 'b'
cat("\n")
print(li[[idx]])
print(li$idx)
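
The difference between [] and [[]] is easiest to see by checking the class of what comes back. The list settings and the variable key below are hypothetical.

```r
settings <- list(host = "localhost", port = 8080)

# [] keeps the list wrapper; [[]] extracts the element itself
print(class(settings["port"]))    # "list"
print(class(settings[["port"]]))  # "numeric"

# [[]] evaluates the variable; $ looks for an element literally named key
key <- "host"
print(settings[[key]])            # "localhost"
print(is.null(settings$key))      # TRUE: there is no element called key
```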

Subsetting Matrices¶

Matrices can also be subset using the [] operator
- With matrices, two indices can be provided, in the order of row,column
- If just one is provided, it treats the matrix like a vector

In [ ]:

print(mat)
cat("\n")
print(mat[5])
print(mat[5,])
print(mat[,4])
print(mat[5,4])
print(mat[c(5,4),])
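
Selecting a single row or column returns a plain vector, because R drops dimensions of length one by default. Passing drop = FALSE keeps the result as a matrix. A sketch with a hypothetical matrix named grid:

```r
grid <- matrix(1:6, nrow = 2)

# A single row comes back as a vector with no dimensions
print(grid[1, ])                     # 1 3 5
print(dim(grid[1, ]))                # NULL

# drop = FALSE keeps it as a 1 x 3 matrix
print(dim(grid[1, , drop = FALSE]))  # 1 3

# A single index counts down the columns
print(grid[4])                       # 4
```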

Subsetting Data Frames¶

Subsetting Data Frames is very similar to subsetting matrices, but a single index is treated as a column selector
- The $ operator as used with lists can also be used to refer to a specific column
Rows (or observations) are selected by adding a comma after the row indices

In [ ]:

print(usm[1])
cat("\n")
print(usm['Name'])
#This is a vector rather than a one column DF
print(usm$Name)

In [ ]:

print(usm[usm['Undergraduate.Enrollment'] > 10000,])
cat("\n")
print(usm[usm['Undergraduate.Enrollment'] > 10000,'Name'])
usm['total'] <- usm[3] + usm[4]
print(usm)
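
The same row filtering and column creation can be written with the $ operator, which some readers find easier to follow. This sketch uses a small made-up data frame named campus so that it runs without the usm file; the column names and numbers are assumptions.

```r
# Hypothetical data, for illustration only
campus <- data.frame(name = c("A", "B", "C"),
                     undergrad = c(11000, 28000, 9000),
                     graduate = c(2500, 10000, 3000))

# Keep only the rows where the condition is TRUE (note the trailing comma)
big <- campus[campus$undergrad > 10000, ]
print(big)

# Filter rows and pick one column at the same time
print(campus[campus$undergrad > 10000, "name"])

# Add a new column computed from two existing ones
campus$total <- campus$undergrad + campus$graduate
print(campus$total)
```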

R's built-in help system¶

R has excellent built-in help capabilities
- To access the documentation for a specific function, type ?FUNCTION_NAME
- To search all help files for a keyword, use the ?? operator
Typing a function's name without any arguments or parentheses will at a minimum show you the signature of the function
- If the function is written in R rather than compiled code, its source will be displayed too

In [ ]:

?read.table

In [ ]:

read.table
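
When only the signature is of interest, printing the whole function can be noisy. As a lighter-weight alternative, the base functions args and formals show just the arguments and their defaults; this sketch applies them to read.csv.

```r
# args prints the signature without the function body
print(args(read.csv))

# formals returns the arguments as a named list
csv.args <- formals(read.csv)
print(names(csv.args))
print(csv.args$sep)   # the default separator for read.csv
```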