Mike Badescu, PhD
HTML Slides on GitHub Pages: https://numeract.github.io/dallas-roo/
Source Code on GitHub: https://github.com/numeract/dallas-roo
Dallas R Users Group
September 17, 2016
September 17, 2016
Mike Badescu, PhD
HTML Slides on GitHub Pages: https://numeract.github.io/dallas-roo/
Source Code on GitHub: https://github.com/numeract/dallas-roo
Dallas R Users Group
September 17, 2016
Main Principles:
Setup
library(tibble); library(dplyr) library(ggplot2); library(R6)
Regular polygons, first take
triangle_size <- 6 square_size <- 4 hexagon_size <- 5
A common distinguishing feature: size
triangle <- list(size = 6) square <- list(size = 4) hexagon <- list(size = 5)
Size alone is not sufficient:
x <- list(size = 6) # is x a square or a triangle?
Adding sides
triangle <- list(sides = 3, size = 6) square <- list(sides = 4, size = 4) hexagon <- list(sides = 6, size = 5)
The objects triangle
, square
and hexagon
:
regular_polygon
We can perform similar actions: print()
, resize()
, plot()
Let’s implement a print function:
print_rp <- function(rp) { # hmm... the 'print' name is taken cat("Regular polygon with", rp$sides, "sides of size", rp$size, "\n") } print_rp(triangle)
## Regular polygon with 3 sides of size 6
Nice!
A readable representation of the object is always welcome!
Let’s implement resize()
:
resize <- function(rp, value = 1) { rp$size <- rp$size * value rp } print_rp(resize(triangle, 2))
## Regular polygon with 3 sides of size 12
print_rp(triangle) # the original object is unchanged
## Regular polygon with 3 sides of size 6
Or we can also use a replacement function:
`size<-` <- function(rp, value = 1) { # note the name rp$size <- rp$size * value rp } triangle_copy <- triangle # let's work with a copy size(triangle_copy) <- 2 print_rp(triangle_copy)
## Regular polygon with 3 sides of size 12
We modified the original / input object.
Congratulations, we have a rudimentary OO model!
Anything missing?
How do I add a circle to the collection?
size
sides
and print_rp()
I would love to use print()
instead of print_rp()
!
triangle.resize()
resize(triangle)
The class belongs to the object as an R attribute
attr(triangle, "class") <- "regular_polygon" triangle
## $sides ## [1] 3 ## ## $size ## [1] 6 ## ## attr(,"class") ## [1] "regular_polygon"
Better: use class()
class(square) <- "regular_polygon" class(square)
## [1] "regular_polygon"
class(hexagon) <- "regular_polygon" class(hexagon)
## [1] "regular_polygon"
We have seen classes before
df1 <- data.frame(a = 1:2, b = 3:4) class(df1)
## [1] "data.frame"
str(df1)
## 'data.frame': 2 obs. of 2 variables: ## $ a: int 1 2 ## $ b: int 3 4
Under the hood, statistical models are lists
model <- lm(b ~ a, data = df1) class(model)
## [1] "lm"
str(model)
## List of 12 ## $ coefficients : Named num [1:2] 2 1 ## ..- attr(*, "names")= chr [1:2] "(Intercept)" "a" ## $ residuals : Named num [1:2] 0 0 ## ..- attr(*, "names")= chr [1:2] "1" "2" ## $ effects : Named num [1:2] -4.95 0.707 ## ..- attr(*, "names")= chr [1:2] "(Intercept)" "a" ## $ rank : int 2 ## $ fitted.values: Named num [1:2] 3 4 ## ..- attr(*, "names")= chr [1:2] "1" "2" ## $ assign : int [1:2] 0 1 ## $ qr :List of 5 ## ..$ qr : num [1:2, 1:2] -1.414 0.707 -2.121 0.707 ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr [1:2] "1" "2" ## .. .. ..$ : chr [1:2] "(Intercept)" "a" ## .. ..- attr(*, "assign")= int [1:2] 0 1 ## ..$ qraux: num [1:2] 1.707 0.707 ## ..$ pivot: int [1:2] 1 2 ## ..$ tol : num 1e-07 ## ..$ rank : int 2 ## ..- attr(*, "class")= chr "qr" ## $ df.residual : int 0 ## $ xlevels : Named list() ## $ call : language lm(formula = b ~ a, data = df1) ## $ terms :Classes 'terms', 'formula' language b ~ a ## .. ..- attr(*, "variables")= language list(b, a) ## .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "b" "a" ## .. .. .. ..$ : chr "a" ## .. ..- attr(*, "term.labels")= chr "a" ## .. ..- attr(*, "order")= int 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. ..- attr(*, "predvars")= language list(b, a) ## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. ..- attr(*, "names")= chr [1:2] "b" "a" ## $ model :'data.frame': 2 obs. of 2 variables: ## ..$ b: int [1:2] 3 4 ## ..$ a: int [1:2] 1 2 ## ..- attr(*, "terms")=Classes 'terms', 'formula' language b ~ a ## .. .. ..- attr(*, "variables")= language list(b, a) ## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. .. ..$ : chr [1:2] "b" "a" ## .. .. .. .. ..$ : chr "a" ## .. .. ..- attr(*, "term.labels")= chr "a" ## .. .. ..- attr(*, "order")= int 1 ## .. .. ..- attr(*, "intercept")= int 1 ## .. .. ..- attr(*, "response")= int 1 ## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. .. ..- attr(*, "predvars")= language list(b, a) ## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. .. ..- attr(*, "names")= chr [1:2] "b" "a" ## - attr(*, "class")= chr "lm"
More than one class is possible
df2 <- as_data_frame(df1) class(df2)
## [1] "tbl_df" "tbl" "data.frame"
str(df2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables: ## $ a: int 1 2 ## $ b: int 3 4
Printing using the same function print()
print(df1)
## a b ## 1 1 3 ## 2 2 4
print(df2)
## # A tibble: 2 x 2 ## a b ## <int> <int> ## 1 1 3 ## 2 2 4
We can print()
the model, even if it is a different object
print(model)
## ## Call: ## lm(formula = b ~ a, data = df1) ## ## Coefficients: ## (Intercept) a ## 2 1
How does this magic work?
Let’s look at the code for print()
## function (x, ...) ## UseMethod("print") ## <bytecode: 0x000000001d0d0100> ## <environment: namespace:base>
UseMethod()
is the dispatcher!
From ?UseMethod
:
Method dispatch takes place based on the class(es) of the first argument to the generic function or of the object supplied as an argument to
UseMethod
print.class_name()
, etc.print()
calls the right method based on the class of the first argumentmethods(print) # ... # [70] print.data.frame # [71] print.Date # [72] print.default # ... # [128] print.lm* <- * means non-visible # ... # [221] print.tbl_df* # ...
Yeah? … I want a print()
method too!
print()
takes care of the dispatchprint.regular_polygon()
print.regular_polygon <- function(rp) { cat("Regular polygon with", rp$sides, "sides of size", rp$size, "\n") } print(triangle)
## Regular polygon with 3 sides of size 6
Success!
What methods are available for my regular_polygon
?
methods(class = "regular_polygon")
## [1] print ## see '?methods' for accessing help and source code
I want a resize()
method!
Is there a resize()
generic to take care of the dispatch?
methods(resize)
## no methods found
Define the generic (it will overwrite the previous resize()
function)
resize <- function(x, ...) UseMethod("resize", x) # only one argument resize.regular_polygon <- function(rp, value = 1) { rp$size <- rp$size * value rp }
Testing with printing
print(resize(triangle, 2))
## Regular polygon with 3 sides of size 12
This takes care of Polymorphism. What about Inheritance?
Inheritance requires to think about an hierarchy.
In our example, let’s say that circle
has its own class:
regular_polygon
size
but ignore (NA) sides
resize()
but re-define print()
Alternative:
size
resize()
and print()
circle <- list(sides = NA, size = 1) class(circle) <- c('circle', 'regular_polygon') print(circle)
## Regular polygon with NA sides of size 1
circle
inherited everything from regular_polygon
; define own print.
print.circle <- function(rp) { cat("Circle of size", rp$size, "\n") } print(circle)
## Circle of size 1
Better!
Check inheritance
inherits(circle, 'circle')
## [1] TRUE
inherits(circle, 'regular_polygon')
## [1] TRUE
Resize works by calling resize
of the parent class
print(resize(circle, 2))
## Circle of size 2
(there are more details but we skip them today)
But wait …
You mean I can just take over objects, assign them any class?
class(model)
## [1] "lm"
class(model) <- c("Agent Smith", class(model)) class(model)
## [1] "Agent Smith" "lm"
class(df1)
## [1] "data.frame"
class(df1) <- c("Agent Smith", class(df1)) class(df1)
## [1] "Agent Smith" "data.frame"
This code was responsible for Agent Smith taking over in Matrix Revolutions*
*not a true fact
class(model) <- "data.frame" # model was created by lm() print(model) # Error in `[.data.frame`(x, seq_len(n), , drop = FALSE) : # attempt to set an attribute on NULL
@
(not $
)Use setClass()
rm(list = ls()) # clean up setClass( Class = "Circle", slots = representation( size = "numeric" ) )
Use new()
, must name the arguments
circle <- new("Circle", size = 1.5) circle
## An object of class "Circle" ## Slot "size": ## [1] 1.5
Type checking works
circle2 <- new("Circle", size = "1.5") # Error in validObject(.Object) : # invalid class “Circle” object: invalid object for slot # "size" in class "Circle": got class "character", # should be or extend class "numeric"
Use @
or slot()
circle@size
## [1] 1.5
slot(circle, 'size')
## [1] 1.5
circle@size <- 1.7 # you should do this only inside of a class method circle@size
## [1] 1.7
Type checking works again
circle@size <- "1.7" # Error in (function (cl, name, valueClass) : # assignment of an object of class “character” is not valid for # @‘size’ in an object of class “Circle”; # is(value, "numeric") is not TRUE
show()
method; we need to overwrite itsetMethod()
with a function (which can be anonymous)Circle_show <- function(object) { cat("Circle of size", object@size, "\n") } setMethod("show", "Circle", Circle_show)
## [1] "show"
print(circle)
## Circle of size 1.7
@size
should be positivesetClass()
or by using setValidity()
TRUE
or an error messageCircle_validity <- function(object) { if (object@size < 0) return("Size must be positive") TRUE # return TRUE if all tests pass } setValidity("Circle", Circle_validity)
circle2 <- new("Circle", size = -5) # Error in validObject(.Object) : # invalid class “Circle” object: Size must be positive
Modifying an existing S4 object is still permitted!!
circle@size <- -5 circle
## Circle of size -5
Force validity check
validObject(circle) # Error in validObject(circle) : # invalid class “Circle” object: Size must be positive
Make the circle nice again
circle@size <- 1.7
setGeneric()
standardGeneric()
used for S4 dispatchingCircle_get_size <- function(object) object@size setGeneric(name = "size", def = function(object) standardGeneric("size"), valueClass = "numeric") setMethod("size", "Circle", Circle_get_size)
size(circle)
## [1] 1.7
Circle_set_size <- function(object, value) { object@size <- value # modifies the object in the local scope validObject(object) # force validation object # the setter must return the object } setGeneric(name = "size<-", def = function(object, value) standardGeneric("size<-")) setMethod("size<-", "Circle", Circle_set_size)
Let’s test it:
size(circle) <- 2.5 size(circle)
## [1] 2.5
Let’s have RegularPolygon
inherit from Circle
sides
setClass( Class = "RegularPolygon", slots = representation(sides = "integer"), # only the new slot contains = "Circle" # size inherited from Circle )
triangle <- new("RegularPolygon", sides = 3L, size = 3.5) print(triangle)
## Circle of size 3.5
Bummer!
Override show()
for RegularPolygon
RegularPolygon_show <- function(object) { cat("Regular polygon with", object@sides, "sides of size", object@size, "\n") } setMethod("show", "RegularPolygon", RegularPolygon_show)
## [1] "show"
print(triangle)
## Regular polygon with 3 sides of size 3.5
Test resizing
(size(triangle) <- 3.8)
## [1] 3.8
is(triangle) # this is how we check for S4 classes
## [1] "RegularPolygon" "Circle"
is(triangle, "Circle") # R cannot overcome poor design choices
## [1] TRUE
We could overwrite validity and create more accessors, etc.
setGeneric()
functions'numeric'
means 1.5
but also c(1, 2, 3)
list
Frank Harrell (author of Hmisc
and “problems with stepwise regression”):
“If you love computer science more than you value your own time, use S4”
Almost all R objects are immutable => copy-on-change:
triangle <- expand(triangle, times = 2) # assume we created an expand method
assign()
can be used as a hackExcept Environments:
list
where each element is accessed by referencemy_env <- new.env() my_env$x <- 42 ls(my_env)
## [1] "x"
my_env$x
## [1] 42
initialize()
method, called by new()
setClass("Element", slots = representation(x = 'numeric', my_env = 'environment') ) setMethod("initialize", "Element", function(.Object, ..., x=numeric(), my_env = new.env()) { callNextMethod(.Object, x = x, my_env = my_env, ...) } )
el <- new("Element", x = 3) print(el)
## An object of class "Element" ## Slot "x": ## [1] 3 ## ## Slot "my_env": ## <environment: 0x000000001d49e960>
el@my_env$a <- 1 print(el@my_env$a)
## [1] 1
Sometimes we want mutability, e.g., for bidirectional relationships
Alternative: Reference Classes
Reference Classes == S4 + Environment Trick + syntactic sugar
$
to access methods; looks like Java and PythonLet’s skip Reference Classes! (that was fast!)
self$
notation as in Pythonrm(list = ls()) Circle <- R6Class( classname = "Circle", public = list( # public attributes size = NULL, # no type! initialize = function(size = NA) { self$size <- size # good place to check types }, print = function(...) cat("Circle of size", self$size, "\n"), expand = function(value) { self$size <- self$size * value # self modifying } ) )
circle <- Circle$new(size = 6.1) str(circle)
## Classes 'Circle', 'R6' <Circle> ## Public: ## clone: function (deep = FALSE) ## expand: function (value) ## initialize: function (size = NA) ## print: function (...) ## size: 6.1
print(circle)
## Circle of size 6.1
Access the object attributes
circle$size
## [1] 6.1
Access the object methods
circle$expand(2) # look, no assignment print(circle)
## Circle of size 12.2
PrivateCircle <- R6Class( classname = "PrivateCircle", private = list( hidden_size = NULL), public = list( # public attributes initialize = function(size = NA) { private$hidden_size <- size # good place to check data types }, print = function(...) cat("Circle of size", private$hidden_size, "\n")), active = list( size = function(value) { if (missing(value)) return(private$hidden_size) else private$hidden_size <- value } ) )
private_circle <- PrivateCircle$new(6.6) private_circle$hidden_size # not visible
## NULL
private_circle$size # getter
## [1] 6.6
private_circle$size <- 7.7 # setter private_circle
## Circle of size 7.7
RegularPolygon <- R6Class("RegularPolygon", inherit = Circle, public = list( sides = NULL, initialize = function(size = NA, sides = NA) { super$initialize(size) # call parent's initialize self$sides <- sides # take care of the new argument }, print = function(...) { cat("Regular polygon with", self$sides, "sides of size", self$size, "\n") } ) )
triangle <- RegularPolygon$new(size = 8.2, sides = 3) str(triangle)
## Classes 'RegularPolygon', 'Circle', 'R6' <RegularPolygon> ## Inherits from: <Circle> ## Public: ## clone: function (deep = FALSE) ## expand: function (value) ## initialize: function (size = NA, sides = NA) ## print: function (...) ## sides: 3 ## size: 8.2
triangle
## Regular polygon with 3 sides of size 8.2
triangle$size
## [1] 8.2
triangle$expand(2) triangle
## Regular polygon with 3 sides of size 16.4
We modified the original object
As seen from Java, Python, etc. …
self$
… and that is a problem.
R users expect copy-on-change and may be confused by references.
The following statement is not clear at the first glance to the R user:
triangle$expand(2)
$
, is this a list or a data frame?
Example from the package openxlsx
(which uses Reference Classes)
workbook <- createWorkbook() addWorksheet(workbook, "Sheet 1")
workbook$addWorksheet()
addWorksheet()
calls workbook$addWorksheet()
workbook
?names(workbook) # calls names.Workbook() ## [1] "Sheet 1"
Martin Morgan (Bioconductor Project Lead), in reply to a
“I’m a C++/Python developer” point of view:
“R’s copy-on-change semantics leads me to expect that
b = a
slt(a) = 2leaves b unchanged, which S4 does (necessarily copying and thus with a time and memory performance cost)."
[…]
“You either need to change your expectations, or use reference classes (and change the expectations of your users).”
You are using Bioconductor (S4 based) or a similar environment:
You must have reference objects due to the nature of the data
You do not need reference objects:
print()
Experience is:
The R Class Objects are far from perfect. So, what really works in R?
What are some of the best practices in R?
warning: subjective views / experience follows
df2 <- df[grepl( '^[0-9]{5}$', df$zip_code), ] # usually the pattern is more complex # vs zip5_pattern <- '^[0-9]{5}$' zip5_mask <- grepl(zip5_pattern, sales$zip_code) sales_with_zip <- sales[zip5_mask, ]
If you are learning R …
subset()
, plot()
, etc.dplyr
, ggplot2
and other packages from the tidyverse
%>%
from magrittr, available in dplyr
For a detailed example, have a look at Zev Ross:
A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2
The code looks almost like a sentence in English
mtcars %>% rownames_to_column('car_name') %>% select(car_name, mpg, cyl) %>% group_by(cyl) %>% # 3 cyl groups: 4, 6, 8 filter(mpg > mean(mpg)) %>% # above average of each group top_n(1, mpg) # top 1 from each group, by mpg
## # A tibble: 3 x 3 ## # Groups: cyl [3] ## car_name mpg cyl ## <chr> <dbl> <dbl> ## 1 Hornet 4 Drive 21.4 6. ## 2 Toyota Corolla 33.9 4. ## 3 Pontiac Firebird 19.2 8.
I dislike retyping %>%
at the end of the line when the workflow changes
.end <- identity # identity(x) returns x; .end is hidden mtcars %>% select(mpg, cyl) %>% head(3) %>% # now all the dplyr lines end with %>% .end
## mpg cyl ## Mazda RX4 21.0 6 ## Mazda RX4 Wag 21.0 6 ## Datsun 710 22.8 4
Git and GitHub work better with fewer line changes
mtcars %>% mutate(cyl = factor(cyl)) %>% # pipe your data into ggplot ggplot(aes(x = wt, y = mpg, color = cyl)) + geom_point(size = 2) + scale_colour_brewer(palette = 'Dark2')
tidyverse
(dplyr
, ggplot2
, tidyr
, etc.)“But R is not the best tool for every application. And it doesn’t try to be. One of the design goals of R is to make it easy to interact with other software to encourage the best tool being used for each task.”
from Patrick Burns: The R Inferno
sapply(mtcars, is.numeric)
## mpg cyl disp hp drat wt qsec vs am gear carb ## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sapply(list(), is.numeric)
## list()
More details at Functionals
Rmd
and notebooks
mean
<<-
)A pure function is a function for which the return value is only determined by input values, without side effect.
Package purrr provides functions that always return the same type: map_dbl()
, map_chr()
, etc.
Functions => Pure Functions => Functional Programming
Writing pure functions in R makes it easier to deploy them in other production environments:
Object Classes:
Modularity and focus rule:
## R version 3.4.4 (2018-03-15) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 16299) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] bindrcpp_0.2.2 R6_2.2.2 ggplot2_2.2.1 dplyr_0.7.4 ## [5] tibble_1.4.2 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.16 knitr_1.20 bindr_0.1.1 ## [4] magrittr_1.5 munsell_0.4.3 colorspace_1.3-2 ## [7] rlang_0.2.0 plyr_1.8.4 stringr_1.3.0 ## [10] tools_3.4.4 grid_3.4.4 gtable_0.2.0 ## [13] utf8_1.1.3 cli_1.0.0 htmltools_0.3.6 ## [16] lazyeval_0.2.1 yaml_2.1.18 rprojroot_1.3-2 ## [19] digest_0.6.15 assertthat_0.2.0 crayon_1.3.4 ## [22] RColorBrewer_1.1-2 glue_1.2.0 evaluate_0.10.1 ## [25] rmarkdown_1.9 labeling_0.3 stringi_1.1.7 ## [28] compiler_3.4.4 pillar_1.2.1 scales_0.5.0 ## [31] backports_1.1.2 pkgconfig_2.0.1