Flexible R Pipelines with Caching • rflow

The package is currently under active development, please expect major changes while the API stabilizes.

Motivation

A common problem when processing data as part of a pipeline is avoiding unnecessary calculations. For example, if a function is called over and over with the same arguments, it should not recalculate the result each time but it should provide the cached (pre-computed) result.

While caching of the function output resolves the first problem, a second issue occurs when large data sets are being processed. In this case, hashing of the input arguments each time might take too long. This issue can be solved by hashing the data only once (as output) and then by noticing changes in the hash received by the downstream function. In other words, it is not the data that flows through the pipeline (as is the case with standard function), but hashes of the data.

A third issue is output sub-setting. When working with a pipeline there is often the case (e.g. ETL, Machine Learning) that we need to pass the whole data frame but the function is going to use only a subset (e.g. a CV fold). Since the main data frame has changes, caching of the result is no longer efficient. The solution involves hashing of the subset of interest which can be done by introducing additional intermediate functions in the pipeline. However, there is a loss of efficiency due to excessive rehashing as the main data frame passes through many functions.

The package rflow addresses these inefficiencies and makes pipelines as easy to use as in tidyverse.

Installation

# install.packages("devtools")
devtools::install_github("numeract/rflow")

Use

Simple Example

x1 <- 10
x2 <- 0.5
x3 <- 2

f1 <- function(a, b, c = 1) {a * b + c}
f2 <- function(d, e) {d / e}

# passing the results downstream using functions
(o1 <- f1(x1, x2))  # 6
(o2 <- f2(o1, x3))  # 3


# variant 1: declaring flows for each function using default options
ff1 <- make_flow_fn(f1)
ff2 <- make_flow_fn(f2)

# passing to the downstream flow and collecting the results
r1 <- ff1(x1, x2)   # does not trigger re-calc
r2 <- ff2(r1, x3)   # does not trigger re-calc; first arg. is a flow arg.
collect(r1)         # 6
collect(r2)         # 3


# variant 2: arguments and functions withing one call
library(dplyr)                          # makes life easier 
flow_fn(x1, x2, fn = f1) %>%            # reuses cache created by ff1
  flow_fn(x3, fn = f2) %>%              # reuses cache created by ff2
  collect()                             # 3, no actual re-calc takes place

Pipelines

Create your function, e.g. f <- function(...) {...}

rflow works best with pure functions, i.e. functions that depend only on their inputs (and not on variables outside the function frame) and do not produce any side effects (e.g. printing, modifying variables in the global environment).

“flow” the function: ff <- make_flow_fn(f))
When pipelining ff into another rflow function, simply supply ff() as an argument, for example: ff(x) %>% ff2(y) %>% ff3(z)
At the end of the rflow pipeline you must use collect() to collect the actual data (and not just the cached structure). Alternatively, use flow_ns_sink() to dump the data into an environment or a Shiny::reactiveValues name space.

Shiny

Shiny from RStudio uses reactive values to know what changes took place and what to recompute. It is thus possible to use a series of reactive elements in Shiny to prevent expensive re-computations from taking place. Example:

rv1 <- reactive({ 
    ... input$x ... 
})

rv2 <- reactive({ 
    ... rv1() .... input$y ... 
})

rv3 <- reactive({ 
    ... rv2() .... input$z ... 
})

The downside is that we need one reactive element for each function in the pipeline - this makes data processing dependent on UI / Shiny. Using rflow, we can separate the UI from the data processing, maintaining the caching not only for the current state but for all previously computed states.

rv <- reactive({ 
    rf1(input$x, ...) %>%
    rf2(input$y, ...) %>%
    rf3(input$z, ...) %>%
    collect()
})

While a similar workflow can be achieved with package memoise, it suffers from several disadvantages (below).

Output Subset

(to be updated)

Other frameworks

Memoise

Package memoise by Hadley Wickham, Jim Hester and others was the main source of inspiration. Memoise is elegant, fast, simple to use, but it suffers from certain limitations that we hope to overcome in this package:

excessive rehashing of inputs
only one cache layer (although its cache framework is extensible)
no input/output sub-setting, it uses the complete set of arguments provided
no reactivity (yet to be implemented in rflow)

Drake

Package drake by Will Landau and others provides a complete framework for large data sets, including using files as inputs and outputs. The downside is that it requires additional overhead to get started and its focus is on the pipeline as a whole. If your work requires many hours of computations (which increases the value of each result), the overhead due to the setup has a relatively lower cost - in this scenario drake is an excellent choice.

Package rflow is somewhere between memoise and drake:

one can start using rflow right away, with minimal overhead
allows focusing on the data processing (e.g., EDA) and not on the framework

TODO list

reactivity
multi-layer cache (with file locking)
files sinks
parallel processing

rflow - Flexible R Pipelines with Caching