September 17, 2016

R Object Classes, Functions and Coding Patterns

About

Mike Badescu

  • PhD Economics, Lehigh University
  • Experience: Economics, Finance, Statistics, Valuation
  • Programming / main tools: R, Python, SQL, Spark, Scala

Numeract

  • Data Science and Economics / Finance consulting services

Summary

  1. R Object Classes
    • S3
    • S4
    • Reference Classes & R6
  2. R Functions and Coding Patterns
    • dplyr & ggplot2
    • immutability vs mutability
    • pure functions
  3. Open mic: how do you use R?

Object-Oriented Programming (OOP)

Main Principles:

  • Abstraction
    • identify essential characteristics that distinguish objects
  • Encapsulation
    • hide implementation
  • Modularity / Inheritance
    • promote code reuse (Don’t Repeat Yourself - DRY)
  • Polymorphism
    • single interface to entities of different data types

Motivation

Setup

library(tibble); library(dplyr) 
library(ggplot2); library(R6)

Regular polygons, first take

triangle_size <- 6
square_size   <- 4
hexagon_size  <- 5

A common distinguishing feature: size

triangle <- list(size = 6)
square   <- list(size = 4)
hexagon  <- list(size = 5)

Motivation

Size alone is not sufficient:

x <- list(size = 6)

# is x a square or a triangle?

Adding sides

triangle <- list(sides = 3, size = 6)
square   <- list(sides = 4, size = 4)
hexagon  <- list(sides = 6, size = 5)

Motivation

The objects triangle, square and hexagon:

  • have the same object attributes
  • belong to the same type of object / class: regular_polygon

We can perform similar actions: print(), resize(), plot()

  • functions associated with objects are called methods

Motivation

Let’s implement a print function:

print_rp <- function(rp) {    # hmm... the 'print' name is taken
    cat("Regular polygon with", rp$sides, 
        "sides of size", rp$size, "\n")
}

print_rp(triangle)
## Regular polygon with 3 sides of size 6

Nice!

A readable representation of the object is always welcome!

Motivation

Let’s implement resize():

resize <- function(rp, value = 1) {
    rp$size <- rp$size * value
    rp
}

print_rp(resize(triangle, 2))
## Regular polygon with 3 sides of size 12
print_rp(triangle)  # the original object is unchanged
## Regular polygon with 3 sides of size 6

Motivation

Or we can also use a replacement function:

`size<-` <- function(rp, value = 1) {   # note the name
    rp$size <- rp$size * value
    rp
}

triangle_copy <- triangle  # let's work with a copy
size(triangle_copy) <- 2 
print_rp(triangle_copy)
## Regular polygon with 3 sides of size 12

We modified the original / input object.

Motivation

Congratulations, we have a rudimentary OO model!

Anything missing?

How do I add a circle to the collection?

  • Circles are distant relatives of Regular Polygons
    • common elements like size
    • different elements such as sides and print_rp()
  • This is Inheritance

I would love to use print() instead of print_rp()!

  • This is Polymorphism

S3

  • S3 is an OO style based on generic functions
  • In Java, Python:
    • triangle.resize()
    • the object decides which method to call
  • In R:
    • resize(triangle)
    • a generic function used as dispatcher
  • Main role: specific methods for print, summary, plot, predict, etc.

S3 Classes

The class belongs to the object as an R attribute

attr(triangle, "class") <- "regular_polygon"
triangle
## $sides
## [1] 3
## 
## $size
## [1] 6
## 
## attr(,"class")
## [1] "regular_polygon"

S3 Classes

Better: use class()

class(square) <- "regular_polygon"
class(square)
## [1] "regular_polygon"
class(hexagon) <- "regular_polygon"
class(hexagon)
## [1] "regular_polygon"

S3 Example

We have seen classes before

df1 <- data.frame(a = 1:2, b = 3:4)
class(df1)
## [1] "data.frame"
str(df1)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ b: int  3 4

S3 Example

Under the hood, statistical models are lists

model <- lm(b ~ a, data = df1)
class(model)
## [1] "lm"
str(model)
## List of 12
##  $ coefficients : Named num [1:2] 2 1
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "a"
##  $ residuals    : Named num [1:2] 0 0
##   ..- attr(*, "names")= chr [1:2] "1" "2"
##  $ effects      : Named num [1:2] -4.95 0.707
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "a"
##  $ rank         : int 2
##  $ fitted.values: Named num [1:2] 3 4
##   ..- attr(*, "names")= chr [1:2] "1" "2"
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:2, 1:2] -1.414 0.707 -2.121 0.707
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "1" "2"
##   .. .. ..$ : chr [1:2] "(Intercept)" "a"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.707 0.707
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 0
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = b ~ a, data = df1)
##  $ terms        :Classes 'terms', 'formula'  language b ~ a
##   .. ..- attr(*, "variables")= language list(b, a)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "b" "a"
##   .. .. .. ..$ : chr "a"
##   .. ..- attr(*, "term.labels")= chr "a"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(b, a)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "b" "a"
##  $ model        :'data.frame':   2 obs. of  2 variables:
##   ..$ b: int [1:2] 3 4
##   ..$ a: int [1:2] 1 2
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language b ~ a
##   .. .. ..- attr(*, "variables")= language list(b, a)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "b" "a"
##   .. .. .. .. ..$ : chr "a"
##   .. .. ..- attr(*, "term.labels")= chr "a"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(b, a)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "b" "a"
##  - attr(*, "class")= chr "lm"

S3 Example

More than one class is possible

df2 <- as_data_frame(df1)
class(df2)
## [1] "tbl_df"     "tbl"        "data.frame"
str(df2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ b: int  3 4

S3 Example

Printing using the same function print()

print(df1)
##   a b
## 1 1 3
## 2 2 4
print(df2)
## # A tibble: 2 x 2
##       a     b
##   <int> <int>
## 1     1     3
## 2     2     4

S3 Example

We can print() the model, even if it is a different object

print(model)
## 
## Call:
## lm(formula = b ~ a, data = df1)
## 
## Coefficients:
## (Intercept)            a  
##           2            1

How does this magic work?

S3 Generic Functions

Let’s look at the code for print()

print
## function (x, ...) 
## UseMethod("print")
## <bytecode: 0x000000001d0d0100>
## <environment: namespace:base>

UseMethod() is the dispatcher!

From ?UseMethod:

Method dispatch takes place based on the class(es) of the first argument to the generic function or of the object supplied as an argument to UseMethod

S3 Generic Functions

  • Specific print methods are named print.class_name(), etc.
  • print() calls the right method based on the class of the first argument
  • How many “prints” are there?
methods(print)

# ...
# [70] print.data.frame                                  
# [71] print.Date                                        
# [72] print.default
# ...
# [128] print.lm*        <- * means non-visible
# ...
# [221] print.tbl_df*
# ...

S3 Example

Yeah? … I want a print() method too!

  • The generic print() takes care of the dispatch
  • All I need to do is to define print.regular_polygon()
print.regular_polygon <- function(rp) {
    cat("Regular polygon with", rp$sides, 
        "sides of size", rp$size, "\n")
}

print(triangle)
## Regular polygon with 3 sides of size 6

Success!

S3 Methods

What methods are available for my regular_polygon?

methods(class = "regular_polygon")
## [1] print
## see '?methods' for accessing help and source code

I want a resize() method!

Is there a resize() generic to take care of the dispatch?

methods(resize)
## no methods found

S3 Methods

Define the generic (it will overwrite the previous resize() function)

resize <- function(x, ...) UseMethod("resize", x)   # only one argument

resize.regular_polygon <- function(rp, value = 1) {
    rp$size <- rp$size * value
    rp
}

Testing with printing

print(resize(triangle, 2))
## Regular polygon with 3 sides of size 12

This takes care of Polymorphism. What about Inheritance?

S3 Inheritance

Inheritance requires to think about an hierarchy.

In our example, let’s say that circle has its own class:

  • inherits from regular_polygon
  • we keep size but ignore (NA) sides
  • we keep resize() but re-define print()

Alternative:

  • new class
  • define size
  • define resize()and print()

S3 Inheritance

circle <- list(sides = NA, size = 1)
class(circle) <- c('circle', 'regular_polygon')
print(circle)
## Regular polygon with NA sides of size 1

circle inherited everything from regular_polygon; define own print.

print.circle <- function(rp) {
    cat("Circle of size", rp$size, "\n")
}
print(circle)
## Circle of size 1

Better!

S3 Inheritance

Check inheritance

inherits(circle, 'circle')
## [1] TRUE
inherits(circle, 'regular_polygon')
## [1] TRUE

Resize works by calling resize of the parent class

print(resize(circle, 2))
## Circle of size 2

S3: That’s all folks!

(there are more details but we skip them today)

But wait …

You mean I can just take over objects, assign them any class?

class(model)
## [1] "lm"
class(model) <- c("Agent Smith", class(model))
class(model)
## [1] "Agent Smith" "lm"

S3 Problems

class(df1)
## [1] "data.frame"
class(df1) <- c("Agent Smith", class(df1))
class(df1)
## [1] "Agent Smith" "data.frame"

This code was responsible for Agent Smith taking over in Matrix Revolutions*

*not a true fact

S3 Problems

  • R is weakly / dynamically typed
  • Functions / Methods do not check for data type correctness
class(model) <- "data.frame"   # model was created by lm()
print(model)

# Error in `[.data.frame`(x, seq_len(n), , drop = FALSE) : 
#  attempt to set an attribute on NULL

S4

  • More formal
  • Object attributes (slots) are clearly defined
    • Slots are accessed with @ (not $)
    • Advanced: access to slots from Rcpp
  • Object validation
  • Methods have defined data types for arguments and output
    • can dispatch on multiple arguments
  • Inheritance is clearly defined

S4 Define

Use setClass()

rm(list = ls())   # clean up

setClass(
    Class = "Circle",
    slots = representation(
        size = "numeric"
    )
)

S4 New Instance

Use new(), must name the arguments

circle <- new("Circle", size = 1.5)
circle
## An object of class "Circle"
## Slot "size":
## [1] 1.5

Type checking works

circle2 <- new("Circle", size = "1.5")
# Error in validObject(.Object) :
#  invalid class “Circle” object: invalid object for slot 
# "size" in class "Circle": got class "character", 
# should be or extend class "numeric"

S4 Accessing Slots

Use @ or slot()

circle@size
## [1] 1.5
slot(circle, 'size')
## [1] 1.5

S4 Modifying Slots

circle@size <- 1.7  # you should do this only inside of a class method
circle@size
## [1] 1.7

Type checking works again

circle@size <- "1.7"
# Error in (function (cl, name, valueClass)  : 
#  assignment of an object of class “character” is not valid for 
# @‘size’ in an object of class “Circle”; 
# is(value, "numeric") is not TRUE

S4 Printing

  • S4 printing is using the show() method; we need to overwrite it
  • Use setMethod() with a function (which can be anonymous)
Circle_show <- function(object) {
    cat("Circle of size", object@size, "\n")
}

setMethod("show", "Circle", Circle_show)
## [1] "show"
print(circle)
## Circle of size 1.7

S4 Validity

  • We want additional constraints, e.g., @size should be positive
  • Additional validity is provided by a function we must supply
  • Add validity within setClass() or by using setValidity()
    • return either TRUE or an error message
Circle_validity <- function(object) {
    if (object@size < 0) return("Size must be positive")
    TRUE   # return TRUE if all tests pass
}

setValidity("Circle", Circle_validity)
circle2 <- new("Circle", size = -5)
# Error in validObject(.Object) : 
#  invalid class “Circle” object: Size must be positive

S4 Validity

Modifying an existing S4 object is still permitted!!

circle@size <- -5
circle
## Circle of size -5

Force validity check

validObject(circle)
# Error in validObject(circle) : 
#  invalid class “Circle” object: Size must be positive

Make the circle nice again

circle@size <- 1.7

S4 Accessors

  • It’s not a good idea to directly modify slots outside the object
  • We will define accessor methods (since we are formal and all)
    • need to define the generic first setGeneric()
    • standardGeneric() used for S4 dispatching
Circle_get_size <- function(object) object@size

setGeneric(name = "size", 
           def = function(object) standardGeneric("size"),
           valueClass = "numeric")
setMethod("size", "Circle", Circle_get_size)
size(circle)
## [1] 1.7

S4 Accessors

Circle_set_size <- function(object, value) {
    object@size <- value     # modifies the object in the local scope
    validObject(object)      # force validation
    object                   # the setter must return the object
}

setGeneric(name = "size<-", 
           def = function(object, value) standardGeneric("size<-"))
setMethod("size<-", "Circle", Circle_set_size)

Let’s test it:

size(circle) <- 2.5
size(circle)
## [1] 2.5

S4 Inheritance

Let’s have RegularPolygon inherit from Circle

  • need to add a slot for sides
setClass(
    Class = "RegularPolygon",
    slots = representation(sides = "integer"), # only the new slot
    contains = "Circle"     # size inherited from Circle
)
triangle <- new("RegularPolygon", sides = 3L, size = 3.5)
print(triangle)
## Circle of size 3.5

Bummer!

S4 Inheritance

Override show() for RegularPolygon

RegularPolygon_show <- function(object) {
    cat("Regular polygon with", object@sides, 
        "sides of size", object@size, "\n")    
}

setMethod("show", "RegularPolygon", RegularPolygon_show)
## [1] "show"
print(triangle)
## Regular polygon with 3 sides of size 3.5

S4 Inheritance

Test resizing

(size(triangle) <- 3.8)
## [1] 3.8
is(triangle)            # this is how we check for S4 classes
## [1] "RegularPolygon" "Circle"
is(triangle, "Circle")  # R cannot overcome poor design choices
## [1] TRUE

We could overwrite validity and create more accessors, etc.

S4 Problems

  • Possible conflicts among multiple setGeneric() functions
  • Pollution of the namespace with extra generics
  • Difficult to debug
  • Type checking
    • works well only with simple data types
      • 'numeric' means 1.5 but also c(1, 2, 3)
      • in most cases one needs to define a validity function
    • made irrelevant if the data type is list
    • for data frames, it matters what is inside the data frame

Frank Harrell (author of Hmisc and “problems with stepwise regression”):

“If you love computer science more than you value your own time, use S4”

Immutability in R

Almost all R objects are immutable => copy-on-change:

  • a new object is created every time
  • even when using replacement functions!
  • R requires more time and more memory to copy the data
  • S3 and S4 objects cannot modify themselves without a setter or re-assignment
triangle <- expand(triangle, times = 2)  # assume we created an expand method
  • Although assign() can be used as a hack
  • It’s not as bad as it seems, lists and data frames copy only the modified elements; they point to the old ones until a change is needed.

Mutability in R

Except Environments:

  • Similar to a list where each element is accessed by reference
  • No new environment will be created at copy
  • The old object is gone at re-assignment
my_env <- new.env()
my_env$x <- 42
ls(my_env)
## [1] "x"
my_env$x
## [1] 42

The Environment Trick

  • We can set a slot to be an environment
  • Need to define an initialize() method, called by new()
    • each new object initializes its own environment
setClass("Element",
         slots = representation(x = 'numeric', my_env = 'environment')
)

setMethod("initialize", "Element",
  function(.Object, ..., x=numeric(), my_env = new.env()) {
    callNextMethod(.Object, x = x, my_env = my_env, ...)
  }
)

The Environment Trick

el <- new("Element", x = 3)
print(el)
## An object of class "Element"
## Slot "x":
## [1] 3
## 
## Slot "my_env":
## <environment: 0x000000001d49e960>
el@my_env$a <- 1
print(el@my_env$a)
## [1] 1

Reference Classes (RC)

Sometimes we want mutability, e.g., for bidirectional relationships

Alternative: Reference Classes

Reference Classes == S4 + Environment Trick + syntactic sugar

  • Sometimes called R5, but this name causes confusion
  • RC objects are mutable, they don’t use copy-on-change
  • Use $ to access methods; looks like Java and Python
  • Still a hack, but a hidden hack
  • It is rather slow

Let’s skip Reference Classes! (that was fast!)

R6

  • Similar to Reference Classes, but it’s faster and uses less memory
    • does not do type checking (one reason it is faster)
  • Developed by Winston Chang at RStudio
    • Shiny (from RStudio) uses R6 => production ready
  • The R6 object is an environment, so it can store:
    • variables => object attributes
    • functions => methods
  • R6 also supports:
    • public and private methods
    • inheritance across packages
    • self$ notation as in Python

R6 Basics

rm(list = ls())

Circle <- R6Class(
    classname = "Circle",
    public = list(            # public attributes
        size = NULL,          # no type!
        initialize = function(size = NA) {
          self$size <- size   # good place to check types
        },
        print = function(...) cat("Circle of size", self$size, "\n"),
        expand = function(value) {
            self$size <- self$size * value   # self modifying
        }
    )
)

R6 Basics

circle <- Circle$new(size = 6.1)
str(circle)
## Classes 'Circle', 'R6' <Circle>
##   Public:
##     clone: function (deep = FALSE) 
##     expand: function (value) 
##     initialize: function (size = NA) 
##     print: function (...) 
##     size: 6.1
print(circle)
## Circle of size 6.1

R6 Basics

Access the object attributes

circle$size
## [1] 6.1

Access the object methods

circle$expand(2)         # look, no assignment
print(circle)
## Circle of size 12.2

R6 Private Members and Accessors

PrivateCircle <- R6Class(
    classname = "PrivateCircle",
    private = list(
        hidden_size = NULL),
    public = list(            # public attributes
        initialize = function(size = NA) {
          private$hidden_size <- size   # good place to check data types
        },
        print = function(...) cat("Circle of size", private$hidden_size, "\n")),
    active = list(
        size = function(value) {
          if (missing(value)) return(private$hidden_size)
          else private$hidden_size <- value
        }
    )
)

R6 Private Members and Accessors

private_circle <- PrivateCircle$new(6.6)
private_circle$hidden_size  # not visible
## NULL
private_circle$size         # getter
## [1] 6.6
private_circle$size <- 7.7  # setter 
private_circle
## Circle of size 7.7

R6 Inheritance

RegularPolygon <- R6Class("RegularPolygon",
  inherit = Circle,
  public = list(
      sides = NULL,
      initialize = function(size = NA, sides = NA) {
          super$initialize(size)  # call parent's initialize
          self$sides <- sides     # take care of the new argument
      },
      print = function(...) {
          cat("Regular polygon with", self$sides, 
              "sides of size", self$size, "\n")    
      }
  )
)

R6 Inheritance

triangle <- RegularPolygon$new(size = 8.2, sides = 3)
str(triangle)
## Classes 'RegularPolygon', 'Circle', 'R6' <RegularPolygon>
##   Inherits from: <Circle>
##   Public:
##     clone: function (deep = FALSE) 
##     expand: function (value) 
##     initialize: function (size = NA, sides = NA) 
##     print: function (...) 
##     sides: 3
##     size: 8.2
triangle
## Regular polygon with 3 sides of size 8.2

R6 Inheritance

triangle$size
## [1] 8.2
triangle$expand(2)
triangle
## Regular polygon with 3 sides of size 16.4

We modified the original object

R6 and the R user

As seen from Java, Python, etc. …

  • Finally, an Object Class that I understand!
  • I understand public, private, self$
  • I should use only R6!

… and that is a problem.

R users expect copy-on-change and may be confused by references.

R6 and the R user

The following statement is not clear at the first glance to the R user:

triangle$expand(2)
  • I see $, is this a list or a data frame?
    • most users are not familiar with environments
  • Are you calling a function from a list?
  • So, what if you “expand” the triangle, what do you get back?

R6 and the R user

Example from the package openxlsx (which uses Reference Classes)

workbook <- createWorkbook()
addWorksheet(workbook, "Sheet 1")
  • Does not follow the pattern workbook$addWorksheet()
  • Internally, addWorksheet() calls workbook$addWorksheet()
  • So, did you or did you not add a worksheet to workbook?
names(workbook)   # calls names.Workbook()
## [1] "Sheet 1"

R6 and the R user

Martin Morgan (Bioconductor Project Lead), in reply to a
“I’m a C++/Python developer” point of view:

“R’s copy-on-change semantics leads me to expect that

b = a
slt(a) = 2

leaves b unchanged, which S4 does (necessarily copying and thus with a time and memory performance cost)."

[…]

“You either need to change your expectations, or use reference classes (and change the expectations of your users).”

R Object Classes - Which one to use?

You are using Bioconductor (S4 based) or a similar environment:

  • Use S4

You must have reference objects due to the nature of the data

  • Use R6 and avoid Reference Classes
  • Manage user expectations: provide documentation, examples, etc.

You do not need reference objects:

  • Use S3, if you have an option between S3 or S4
    • Google’s R Style Guide advises against using S4
  • Better: use packages and a little of S3, e.g. for print()
    • packages promote modularity and encapsulation

R Functions and Coding Patterns

Experience is:

  • knowing what to do
  • knowing what not to do
  • learning from making mistakes
  • better, learning from other people’s mistakes

The R Class Objects are far from perfect. So, what really works in R?

What are some of the best practices in R?

warning: subjective views / experience follows

R Coding Style

  • Follow a style guide, the most concise is Hadley’s style guide
  • Break complex lines into components and pick meaningful names
    • one step closer to self-documenting code
df2 <- df[grepl(
    '^[0-9]{5}$', df$zip_code), ]     # usually the pattern is more complex

# vs

zip5_pattern <- '^[0-9]{5}$'
zip5_mask <- grepl(zip5_pattern, sales$zip_code)   
sales_with_zip <- sales[zip5_mask, ]
  • Brandon Rhodes: The Naming of Ducks (video and slides)
    • It’s about Python, but it applies to R as well

dplyr

dplyr Example

The code looks almost like a sentence in English

mtcars %>%
    rownames_to_column('car_name') %>%
    select(car_name, mpg, cyl) %>%
    group_by(cyl) %>%                # 3 cyl groups: 4, 6, 8
    filter(mpg > mean(mpg)) %>%      # above average of each group
    top_n(1, mpg)                    # top 1 from each group, by mpg
## # A tibble: 3 x 3
## # Groups:   cyl [3]
##   car_name           mpg   cyl
##   <chr>            <dbl> <dbl>
## 1 Hornet 4 Drive    21.4    6.
## 2 Toyota Corolla    33.9    4.
## 3 Pontiac Firebird  19.2    8.

dplyr .end of pipe

I dislike retyping %>% at the end of the line when the workflow changes

.end <- identity   # identity(x) returns x; .end is hidden

mtcars %>%
    select(mpg, cyl) %>%
    head(3) %>%    # now all the dplyr lines end with %>%
    .end
##                mpg cyl
## Mazda RX4     21.0   6
## Mazda RX4 Wag 21.0   6
## Datsun 710    22.8   4

Git and GitHub work better with fewer line changes

ggplot2

  • Start with the data and aesthetics, add geoms, then make it look nice
  • Add color dimension (continuous or discrete)
mtcars %>%
    mutate(cyl = factor(cyl)) %>%        # pipe your data into ggplot
    ggplot(aes(x = wt, y = mpg, color = cyl)) + 
    geom_point(size = 2) +
    scale_colour_brewer(palette = 'Dark2')

R Dialects

  • RStudio’s tidyverse (dplyr, ggplot2, tidyr, etc.)
  • Bioconductor
  • data.table
  • *pply functions
  • shell / full stack

“But R is not the best tool for every application. And it doesn’t try to be. One of the design goals of R is to make it easy to interact with other software to encourage the best tool being used for each task.”

from Patrick Burns: The R Inferno

  • Ideas and emerging trends appear on Twitter #rstats

Bad Functions

sapply(mtcars, is.numeric)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sapply(list(), is.numeric)
## list()

More details at Functionals

Good Functions

  • Use functions, even in Rmd and notebooks
    • they promote modularity and help with debugging
  • Use verbs for functions names
    • there exceptions, e.g. math names like mean
  • Keep functions about one page long
  • Receive all arguments in the argument list
    • do not rely on global values
  • Return only one (the same) data type
  • Avoid side effects
    • do not modify objects outside the function
    • no global assignment (<<-)

Pure Functions

A pure function is a function for which the return value is only determined by input values, without side effect.

  • Pure functions:
    • length(x)
    • exp(x)
  • Impure functions
    • runif(x)
    • print(x)

Package purrr provides functions that always return the same type: map_dbl(), map_chr(), etc.

Pure Functions and Immutability

  • Almost all R objects are immutable => copy-on-change
  • Immutability is consistent with pure functions
  • Pure functions produce reproducible results
    • Immutability => only the function result contains the output
  • Pure functions can be parallelized:
    • Distribute the inputs, collect the results
    • Immutability => do not worry about modifying the inputs
  • The output of pure functions can be cached (memoization)
    • Immutability => the cache will not be modified by other functions

Pure Functions

Functions => Pure Functions => Functional Programming

  • In Data Science, pure functions are transformations of data frame columns
  • Thinking in Excel, a pure function is a function that takes a table and returns a new column containing modified data

Writing pure functions in R makes it easier to deploy them in other production environments:

  • Easier to migrate code to Python and Scala
  • MS SQL Server 2016 allows R code in stored procedures
  • MapReduce is based on functions (map and reduce)
  • Spark is written in Scala, which has functional programming features

How do YOU use R?

  • What works consistently well?
  • What new trends have you observed?

Summary

Object Classes:

  • S3 > S4
  • R6 > Reference Classes

Modularity and focus rule:

  • packages
  • pure functions

Thank You!

References - Object Classes

References - Other

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 R6_2.2.2       ggplot2_2.2.1  dplyr_0.7.4   
## [5] tibble_1.4.2  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16       knitr_1.20         bindr_0.1.1       
##  [4] magrittr_1.5       munsell_0.4.3      colorspace_1.3-2  
##  [7] rlang_0.2.0        plyr_1.8.4         stringr_1.3.0     
## [10] tools_3.4.4        grid_3.4.4         gtable_0.2.0      
## [13] utf8_1.1.3         cli_1.0.0          htmltools_0.3.6   
## [16] lazyeval_0.2.1     yaml_2.1.18        rprojroot_1.3-2   
## [19] digest_0.6.15      assertthat_0.2.0   crayon_1.3.4      
## [22] RColorBrewer_1.1-2 glue_1.2.0         evaluate_0.10.1   
## [25] rmarkdown_1.9      labeling_0.3       stringi_1.1.7     
## [28] compiler_3.4.4     pillar_1.2.1       scales_0.5.0      
## [31] backports_1.1.2    pkgconfig_2.0.1