R¶

Strings, Performance, Misc¶

Base R String Functions¶

R has limited support for text processing
- If this is the main purpose of a project, think about using another language
Just like other functions in R, the string functions operate on vectors
Common string functions
- strsplit
- grep/ grepl
- nchar
- toupper / tolower
- substr

In [ ]:

print(nchar(c("I'm a little teapot","short and stout")))
print(nchar(c("I'm a little teapot", 14)))
print(nchar("I'm the only string"))

In [ ]:

str_vector <- c("I'm a little teapot","short and stout",14,
                FALSE)
print(toupper(str_vector))
print(tolower(str_vector))

Substring in R¶

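substr and substring each take 3 arguments, any of which can be vectors

substr(x, start, stop)
substring(text, first, last)

If the arguments differ in length, the shorter ones are recycled
- substr returns one result per string, so start and stop are recycled (or cut off) to match the number of strings
- Only substring also repeats the strings to match a longer first or last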

In [ ]:

print(substr("Hello World",3,5))

In [ ]:

print(substr("Hello World",1:3,1:3))
print(substring("Hello World",1:3,1:3))
print(substring("Hello World",c(1,2,3),c(1,2,3)))
print(substring("Hello World",5:20,10))
print(substring("Hello World",4,10:15))

In [ ]:

str_vector <- c("I'm a little teapot","short and stout",14,FALSE)
print(substr(str_vector,2,1000L))
cat("\n")
print(substr(str_vector,1:5,1000L))
cat("\n")

In [ ]:

print(substring(str_vector,1:5,1000L))
cat("\n")
print(substring(str_vector,1:15,1000))
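
A compact side-by-side of the recycling difference, as a minimal sketch in base R. The variable names (x, one, many, words) are only for illustration.

```r
x <- "Hello World"

# substr() returns one result per element of x, so with a single string
# only the first start/stop pair is used
one <- substr(x, 1:3, 1:3)

# substring() repeats the string to match the longest argument
many <- substring(x, 1:3, 1:3)

print(one)   # "H"
print(many)  # "H" "e" "l"

# With several strings, substr() pairs each string with its own start value
words <- c("alpha", "beta", "gamma")
print(substr(words, 1:3, 3))  # "alp" "et" "m"
```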

Regex in R¶

strsplit, grep, and grepl all accept regular expressions
- By default, these are POSIX-style regular expressions
- Pass perl=TRUE to use PCRE
grep returns the indexes of the vector elements that matched
grepl returns a logical vector indicating whether each element of the vector matched

In [ ]:

strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
print(strsplit(strings_with_spaces,split=' '))

In [ ]:

strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
print(strsplit(strings_with_spaces,split="\\s",perl=TRUE))

In [ ]:

strings_with_spaces <- c("I am a string","I am one too","This also has spaces")
print(strsplit(strings_with_spaces,split="\\W",perl=TRUE))

In [ ]:

strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
idx <- grep('I',strings_with_spaces,perl=TRUE)
print(strings_with_spaces[idx])

In [ ]:

grep('I',strings_with_spaces,perl=TRUE,ignore.case=TRUE)

In [ ]:

grep('\\bI\\b',strings_with_spaces,perl=TRUE,ignore.case=TRUE)

In [ ]:

grepl('\\bI\\b',strings_with_spaces,perl=TRUE,ignore.case=TRUE)
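
The relationship between the two return types can be checked directly. This is a small sketch in base R; the pattern and the example strings are chosen only for illustration.

```r
strings <- c("I am a string", "I am one too", "This also has spaces")

idx  <- grep("\\bam\\b", strings, perl = TRUE)   # integer positions of matches
hits <- grepl("\\bam\\b", strings, perl = TRUE)  # one TRUE/FALSE per element

print(idx)   # 1 2
print(hits)  # TRUE TRUE FALSE

# which() turns the logical vector into the same indexes grep() returns
stopifnot(identical(idx, which(hits)))

# value = TRUE returns the matching elements instead of their positions
matched <- grep("\\bam\\b", strings, perl = TRUE, value = TRUE)
print(matched)
```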

The stringr library¶

stringr is built on top of another library, called stringi
The aim is to
- improve consistency in function calls
- make common string manipulation tasks easy
Has robust multilingual support

In [ ]:

library(stringr)

In [ ]:

print(str_length(str_vector))

In [ ]:

print(str_sort(str_vector))

In [ ]:

print(str_to_title(str_vector))

In [ ]:

print(str_pad(str_vector,40))

In [ ]:

str_vector <- c("\n\rI am a string\t\t",
                "I am one\ntoo",
                "This also has spaces")
print(str_trim(str_pad(str_vector,40)))

In [ ]:

str_c(str_vector,",")

In [ ]:

str_c(str_vector,collapse=", ")

In [ ]:

str_detect(str_vector,'o')

In [ ]:

str_count(str_vector,'o')
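
To illustrate the consistency point, here is a rough comparison of base R calls with their stringr counterparts. It assumes stringr is installed; the words vector is made up for the example.

```r
library(stringr)

words <- c("teapot", "little", "stout")

# Base R: the pattern comes first for grepl(), the strings first for nchar()
base_has_o <- grepl("o", words)
base_len   <- nchar(words)

# stringr: the strings are always the first argument, and every
# function shares the str_ prefix
str_has_o <- str_detect(words, "o")
str_len   <- str_length(words)

print(str_has_o)  # TRUE FALSE TRUE
print(str_len)    # 6 6 5
```

Having the strings first also makes the stringr functions easier to chain with pipes.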

Directory Traversal in R¶

Most scripting languages provide an easy way to iterate over the files in a directory
- This is known as globbing
- It also allows wildcards to be used
In R, the function is Sys.glob (note the capital S)
- Rather than returning an iterator, it returns a vector containing all the matching file names

In [ ]:

print(Sys.glob("*.html"))
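
A self-contained sketch that does not depend on which files happen to be in the working directory. The directory and file names (glob_demo, a.html, and so on) are made up for the example.

```r
# Work in a temporary directory so the example is reproducible
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.html", "b.html", "notes.txt")))

# The wildcard only matches the two .html files
html_files <- Sys.glob(file.path(demo_dir, "*.html"))
print(basename(html_files))

# The result is a plain character vector, so it can be looped over directly
for (path in html_files) {
  cat(basename(path), "is", file.size(path), "bytes\n")
}
```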

The readr package¶

As an alternative to the built-in data loading functions, some people use the readr package
- I usually find the built-in functions good enough
readr provides the read_file and write_file functions
- These read an entire file into a string, or write a string out to a file
- This is possible in base R, but cumbersome, because you must supply the number of characters to read up front
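
For comparison, one way the base R version might look. This is a sketch, not the only approach; the temporary file and its contents are invented for the example.

```r
# Create a small file to read back
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), path)

# readChar() needs to know how much to read; the file size in bytes
# is a safe upper bound on the number of characters
n_bytes  <- file.info(path)$size
contents <- readChar(path, nchars = n_bytes)
print(contents)

# Alternative: read the file line by line and glue the lines back together
# (this drops the final newline)
contents2 <- paste(readLines(path), collapse = "\n")
print(contents2)
```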

In [ ]:

library(readr)

In [ ]:

contents <- read_file("index.html")

In [ ]:

print(contents)

In [ ]:

print(str_extract_all(contents,'<a href=".*?">.*?</a>'))

Performance in R¶

R is commonly viewed as a slow language
- Mostly because it is
We can still optimize, and make sure to program in an idiomatic R style
- Avoid for loops if you can use a vectorized function
- S4 methods are slower than S3, which is slower than a direct function call
- Consider bytecode compilation
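
A small sketch of the first and last points, using only packages that ship with R. The function names (loop_sum_sq, vec_sum_sq) are invented for the example, and timings will vary by machine. Note that since R 3.4 the JIT compiler byte-compiles functions automatically, so an explicit cmpfun call may make little difference.

```r
library(compiler)

x <- as.numeric(1:1e5)

# Explicit for loop
loop_sum_sq <- function(v) {
  total <- 0
  for (val in v) {
    total <- total + val^2
  }
  total
}

# Vectorized equivalent
vec_sum_sq <- function(v) sum(v^2)

# Explicit bytecode compilation of the loop version
compiled_loop <- cmpfun(loop_sum_sq)

# All three compute the same value; only the speed differs
stopifnot(isTRUE(all.equal(loop_sum_sq(x), vec_sum_sq(x))))
stopifnot(isTRUE(all.equal(compiled_loop(x), vec_sum_sq(x))))

print(system.time(loop_sum_sq(x)))
print(system.time(compiled_loop(x)))
print(system.time(vec_sum_sq(x)))
```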

Profiling your code¶

The microbenchmark library provides the microbenchmark function
- Takes in several expressions, runs each of them many times, and prints timing statistics
For line-by-line profiling, use the profvis package
- Uses a web browser to show results

In [ ]:

library(microbenchmark)
nums <- matrix(c(1:5000),nrow=100)
print(
    microbenchmark(
    colMeans(nums),
    apply(nums,2,mean)
    )
)

In [ ]:

## Needs to be run in RStudio
library(profvis)
print(
    profvis({
        nums <- matrix(c(1:50000),nrow=100)
        apply(nums,2,mean)
    })
)

Parallelism¶

Because of its functional design, R is well suited to parallelization
The parallel library provides multicore versions of lapply and mapply
- mclapply
- mcmapply

mclapply(vector,function,mc.cores=N_CORES)
mcmapply(function,vector1,vector2,mc.cores=N_CORES)

In [ ]:

library(parallel)
print(detectCores())

In [ ]:

seed_strings <- c("asdf","ghhjk",'qerwet',
                  'uopi','zxcv','asdgf')
lots_of_strings <- rep(seed_strings,20000)
print(
    microbenchmark(
        lapply(lots_of_strings,str_length),
        mclapply(lots_of_strings,str_length,mc.cores=7)
    )
)

In [ ]:

cl <- makeCluster(8)
print(
    microbenchmark(
        colMeans(nums),
        apply(nums,2,mean),
        parCapply(cl,nums,mean)
        )
    )
stopCluster(cl)
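
mclapply relies on forking, so it cannot use more than one core on Windows. A cluster-based sketch that should behave the same on all platforms is below. The worker count of 2 and the words vector are assumptions made for the example.

```r
library(parallel)

# Assumption: 2 workers, which most machines can provide
cl <- makeCluster(2)

words <- rep(c("asdf", "ghhjk", "qerwet"), 100)

# Same computation, serially and spread across the cluster workers
serial_res   <- lapply(words, nchar)
parallel_res <- parLapply(cl, words, nchar)

# Always release the workers when done
stopCluster(cl)

# Both approaches return the same list
stopifnot(identical(serial_res, parallel_res))
print(head(unlist(parallel_res)))  # 4 5 6 4 5 6
```

For work this small, the cost of sending data to the workers usually outweighs any gain, so it is worth benchmarking before parallelizing.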

Presenting Data¶

R is often used in the analysis phase of research
- Especially to produce nice graphics
Packages exist that allow papers to be written in R, combined with code
- knitr is a very popular one

knitr¶

knitr allows a document to be written in
- R Markdown
- HTML
- LaTeX
R code is set off in these documents using conventions specific to each format
The code is executed and its results are displayed inline

In [ ]:

library(knitr)
knit('005-latex.Rtex')
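
To try knitr without a separate input file, a document can also be passed in as text. This is a minimal sketch assuming knitr is installed; the document content is invented, and the chunk fence is built with strrep only to avoid writing literal backticks inside the example.

```r
library(knitr)

# Three backticks, the Markdown convention for setting off an R chunk
fence <- strrep("`", 3)

doc <- c(
  "The answer is computed below.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# knit() runs the chunk and returns the finished Markdown as a string
out <- knit(text = doc, quiet = TRUE)
cat(out)
```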

Loading Libraries from Non Default Locations¶

By default, R installs packages to, and looks for them in, a location that may need sudo access to write to
You can change where libraries are installed by adding the lib parameter to install.packages
There are numerous ways to tell R where to look for libraries, including the lib.loc parameter of the library function
- The most consistent way is to set the environment variable R_LIBS_USER in your shell before starting R
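
A sketch of the in-session options. The directory name my_r_library and the package name somepackage are placeholders, and the install and load calls are wrapped in if (FALSE) so nothing is actually installed.

```r
# A user-writable directory to hold packages (placeholder location)
my_lib <- file.path(tempdir(), "my_r_library")
dir.create(my_lib, showWarnings = FALSE)

# Put the new directory first on the search path for this session
.libPaths(c(my_lib, .libPaths()))
print(.libPaths())

# Not run: "somepackage" is a placeholder, not a real package
if (FALSE) {
  install.packages("somepackage", lib = my_lib)
  library(somepackage, lib.loc = my_lib)
}

# R_LIBS_USER, if set in the shell before starting R, is visible here
print(Sys.getenv("R_LIBS_USER"))
```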