Of R and APIs: Running R in Amazon Lambda

March 24, 2018

Of R and APIs: Running R in Amazon Lambda

Slides: https://numeract.github.io/aws-lambda-r/
Code: https://github.com/numeract/aws-lambda-r

Mike Badescu, PhD

Dallas R Users Group
March 24, 2018

About

Numeract LLC

Data Science and Economics / Finance consulting services
Technology Stack: (Postgre)SQL, R, Python, Spark, Docker, AWS

Authors and Contributors

Mike Badescu
- PhD Economics, Lehigh University
- Experience: Economics, Finance, Statistics, Machine Learning
Ana Niculescu
Teodor Ciuraru

Summary

Motivation
APIs
Plumber
OpenCPU
Web Server / Docker vs FaaS
R on Amazon Lambda
Ideal Setup

Motivation

Typical Data Product Progression

Exploratory Data Analysis
- get the data
- some cleaning too
Proof of Concept
- R Markdown report
Stakeholder Engagement and Validation
- Shiny App
- iterate steps 1-3 as needed

Motivation

4. Can we use it in production?

Is Shiny scalable?
We already have a dashboard, we do not use Shiny.
Can we use the current R code but without Shiny?
How do we integrate it with our current code base?

Rewriting the code in another programming language is not efficient
Certain algorithms are not available in other programming languages

Solutions

Install R on the same machine as the production server
- Increases chances of failure, maintenance issues
- Security issues
R running inside of a VM next to other VMs
- How do we connect to it?
- Use files? Read from a Database?
API
- Can we have an API?
- Can it be a REST API?

API

Modularize the application: calls from Client to Server
Programming language independent
Common concept: programmers know how to work with APIs
Common "data language": JSON (but not the only one)
REST API
- stateless ==> pure functions
- over HTTP, browsable
Over HTTP, two main methods:
- GET: sends a request to a server and waits for data
- POST: sends data to the server and waits for an answer

API Example

HackerNews API

{
  "by" : "dhouston",
  "descendants" : 71,
  "id" : 8863,
  "kids" : [ 9224, 8952, 8917, 8884, 8887, 8943, 8869, 8940, 8958, ... ],
  "score" : 104,
  "time" : 1175714200,
  "title" : "My YC app: Dropbox - Throw away your USB drive",
  "type" : "story",
  "url" : "http://www.getdropbox.com/u/2/screencast.html"
}

API Example

Facebook Graph API

{
   "data": {
      "height": 50,
      "is_silhouette": false,
      "url": "https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/12006203_ ...",
      "width": 50
   }
}

API Calls

Main Tool: curl

Free and open source
Can handle: FTP, FTPS, HTTP, HTTPS, IMAP, SCP, etc.
Deals with SSL certificates, cookies, HTTP headers, etc.
R packages:
- curl by Jeroen Ooms
- RCurl by Duncan Temple Lang and the CRAN team

API Example

Facebook Graph API: using curl from Git Bash

R and API

Can we serve API requests from R?

Not directly, as R is not a web server
We need a wrapper / server that:
- receives requests and hands them to R
- passes any response from R to the client

Approaches

Web server / container:
- plumber and OpenCPU
Serverless: wrap R functions inside FasS (Function as a Service)
- running R on AWS Lambda

plumber

An R package that generates a web API from the R code you already have.

Main author: Jeff Allen
Package website: https://www.rplumber.io/
Usage:
- use function decorators, similar to roxygen2
- call plumb() on the .R file
- access local server, similar to shiny
Quick, interactive loop: code <==> local server => prod server
Has its own web server to deploy in production
Very good documentation and examples
- source of the following examples

plumber

Decorators

#* @get /mean
normalMean <- function(samples = 10) {
    data <- rnorm(samples)
    mean(data)
}


#* @post /sum
addTwo <- function(a, b) {
    as.numeric(a) + as.numeric(b)
}

plumber - Local Server

install.packages("plumber")

r <- plumber::plumb("docs/plumber_functions.R")

r$run(port = 8000)

GET Method

POST Method

curl --data "a=4&b=3" "http://localhost:8000/sum"

plumber - Features

Output customization, including JSON (default), HTML, PNG, JPEG

#' @get /plot
#' @png
function(){
    myData <- iris
    plot(myData$Sepal.Length, myData$Petal.Length,
         main="All Species", xlab="Sepal Length", ylab="Petal Length")
}

plumber - Features

Filters

Pipeline of handling requests
Can be used as authorizers (similar to AWS Lambda functions)

#* @filter checkAuth
function(req, res){
    if (is.null(req$username)){
        res$status <- 401 # Unauthorized
        return(list(error="Authentication required"))
    } else {
        plumber::forward()
    }
}

plumber - Features

Dynamic Routes

API endpoint: /users/13, where 13 is dynamic

#' @get /users/<id>
function(id) {
    subset(users, uid == id)
}

plumber - Deployment

Local Server

Single threaded R instance

Web Server / Docker

VMs (documentation example for Digital Ocean)
RStudio Connect
Docker: single container
Docker: multiple plumber instances
- traffic routed by nginx
- use load balancing to scale horizontally

OpenCPU

A reliable and interoperable HTTP API for data analysis based on R.

Author: Jeroen Ooms
Package website: https://www.opencpu.org/
Usage on local machine:
- write functions ==> deploy within a package
- start server, call API (tmp key for extra info)
Testing new code on the local server takes longer
JavaScript client
Web server with more features

OpenCPU

Local Server Demo

install.packages("opencpu")

library(opencpu)

ocpu_start_app("rwebapps/appdemo", port = 5656)

Local Server UI: http://localhost:5656/ocpu/test/
- test GET and POST methods

Other Apps: https://www.opencpu.org/apps.html
- example: nabel

JavaScript Client on jsfiddle: http://jsfiddle.net/user/opencpu/
- example: Posting Code Snippets

OpenCPU

Features

The tmp key exposes: the input, the R command, the code, the output
Linux OpenCPU Server for production deployment
- support for parallel/async request ==> get a large server
- AppArmor for App security within Linux
- Docker container
- authentication through apache2 / nginx authentication modules

Deployment

Similar to plumber, except no RStudio Connect

Web Server / Docker vs FaaS

Web Server / Docker

Common Linux layer
Application specific containers
- flexible memory and CPU resources
If only one container (0 to 1 issue)
- always running
- not scalable
More than one container (1 to N issue)
- minimum framework needed: cache, load balancing
- auto-scaling up to a given N containers

Web Server / Docker vs FaaS

Function as a Service

In general, a platform to build apps while minimizing infrastructure maintenance
The focus is on the code / functionality, not DevOps
Pay only for what you use
Horizontally Scalable
"Smaller Containers": usually resources are limited
Programming languages:
- initially: Java, JavaScript, C#, GO
- now Python is being added by more providers
- docker containers are becoming more popular

Web Server / Docker vs FaaS

FaaS providers (as of March 2018)

Amazon: AWS Lambda
- Node.js, Python, Java, C#, and Go
Google: Google Cloud Functions
- Node.js
Microsoft: Azure Functions
- Node.js, Java, C#, F#
- Python and others are experimental

Web Server / Docker vs FaaS

FaaS providers (cont.)

Apache: OpenWhisk
- on IBM Bluemix
- JavaScript, Swift, Python, PHP function, Java
- any binary-compatible executable including Go programs
- Docker
Oracle: Cloud Platform / Fn project
- Java, Go, Python, and Ruby
- Docker

Web Server / Docker vs FaaS

Common concerns

How to authenticate the client?
- HTTP Headers, tokens, etc.
How to encrypt communication with the client?
- HTTPS
How to limit the resources?
- Limits on CPU run time, memory and disk
- Careful not to DDOS yourself!
How fast to scale horizontally?
- How easy/fast is it to go from 0 to 1 calls?
- How easy/fast is it to go from 1 to N (small) calls?
- How easy/fast is it to go to N (large) calls?

R on FaaS

Previous Work

Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway by Anish Kejariwal, July 2016
- the inspiration for our project
- trick: use Python package rpy2 to call R from Python
Go serverless with R Scripts on Azure Function, by Thomas Delteil, July 2017
- GitHub: https://github.com/thdeltei/azure-function-r
- needs updated R extensions on Azure
- seems that it requires Windows

aws-lambda-r

Motivation

Proof of concept validated by client (Shiny rocks!)
App engine tasks:
- get a request id
- get data from a database DB
- process data for 5-20 sec
Q: How to use this code in production?
- needs to be triggered by the production server
- not possible to run R on the same server
- large processing back load
- uncertain and irregular future demand

aws-lambda-r

Solution: R on AWS Lambda
- 0 to 10,000 scalability
- very low costs
- allows us to change the implementation details (Python?) later
numeract/aws-lambda-r on GitHub
- not an R package
- a series of Bash scripts
A framework
- uses AWS CLI through (Git)Bash ==> available on all platforms
- it is flexible because AWS changes often
- easy to adapt to automatically deploy Python or js functions

aws-lambda-r - Details

Top view

Deploy R function on AWS Lambda
Configure access to AWS Lambda

How hard can it be?

aws-lambda-r - Details

Deploy R function on AWS Lambda
- needs to be called from Python
- create a temporary .zip deployment package and store it on S3
Configure access to AWS Lambda
- configure API Gateway
- configure AWS Lambda

aws-lambda-r - Details

Deploy R function on AWS Lambda
- needs to be called from Python
  - temporary EC2 instance
  - install Python, R + packages
  - copy R files to EC2
- create a temporary .zip deployment package and store it on S3
Configure access to AWS Lambda
- configure API Gateway
  - create API resources and HTTP methods
  - create another Lambda function to use as an authorizer
- configure AWS Lambda
  - size, permissions

aws-lambda-r - Details

Deploy R function on AWS Lambda
- needs to be called from Python
  - temporary EC2 instance
    - create IAM roles, VPC, security groups, etc.
  - install Python, R + packages
    - figure out what other Linux packages are needed + a Python wrapper
  - copy R files to EC2
- create a temporary .zip deployment package and store it on S3
  - figure out what files to copy
Configure access to AWS Lambda
- configure API Gateway
  - create API resources and HTTP methods
    - account for API versioning
  - create another Lambda function to use as an authorizer
- configure AWS Lambda
  - size, permissions

aws-lambda-r - Security

source: Use API Gateway Custom Authorizers

example.R

aws_lambda_r <- function(input_json) {
    output_json <- '{"message": "Cannot create output JSON"}'
    tryCatch({
        input_lst <- from_json(input_json)
        request_id <- input_lst$request_id[1]
        output_lst <- list(
            result_id = request_id,
            result_lst = list(a = 1, b = 2:4),
            result_dbl = 1:10 / 2,
            message = NULL
        )
        output_json <- to_json(output_lst)
    }, error = function(e) {
        output_json <<- paste0('{"message": "', e$message, '"}')
    })
    output_json
}

lambda_get.py

import os
os.environ["R_HOME"] = os.getcwd()
os.environ["R_LIBS"] = os.path.join(os.getcwd(), 'libraries')
import rpy2
import ctypes
import rpy2.robjects as robjects
import json

for file in os.listdir('lib/external'):
    file_name='lib/external/' + file
    ctypes.cdll.LoadLibrary(os.path.join(os.getcwd(), file_name))

# source R file
# this R file might load libraries and source other files
robjects.r['source']('example.R')

# exposing R entry point to python
aws_lambda_r = robjects.globalenv['aws_lambda_r']

def handler_get(event, context):
    input_json = json.dumps(event)
    output_json = json.loads(str(aws_lambda_r(input_json)))
    return output_json

Bash scripts

What we need

An AWS account
AWS CLI installed and configured on the local machine
- look for a ~/.aws/ directory with two files
- get a SSH key from AWS and place it in ~/.ssh/directory
- keep your secrets safe!!
Copy these directories to your app:
- lambda/ - will contain your .R and .py files before uploading to AWS
- python/ - Python scrips (one is usually sufficient)
- scripts/ - configuration and deployment
- settings/ - default, auto-config, user settings and secrets

Bash scripts - `settings/`

settings_default.sh
- for reference, list of variables that need to be populated
secrets_default.sh
- for reference, list of access keys, names, IDs to be populated
setup_auto.sh
- list of variables populated by the auto-configuration scripts
- keep this file safe - do not commit to GitHub!
setup_user.sh
- user overriding settings and secrets
- danger: you will enter your AWS ACCESS KEY here
- keep this file safe - do not commit to GitHub!

Bash scripts - `scripts/`

Auto-configuration: settings will be saved in setup_auto.sh

21_setup_vpc.sh
- setup a new VPC in a AWS zone indicated in settings
22_setup_custom_ami.sh
- new AMI = AWS Lambda Linux + updates + Python + R + packages
- calls 11_install_packages.sh
23_setup_s3.sh
- creates an AWS S3 bucket if not already present
24_setup_lambda.sh
- creates Lambda Authorizer function
- creates and configures API Gateway

Bash scripts - `scripts/`

Local Scripts

01_main.sh - calls all other local scripts
02_setup.sh - loads all settings
03_check_settings.sh - checks and print outs
04_create_ec2.sh - new EC2 instance
05_update_ec2.sh - update EC2 if no custom AMI found
06_copy_files.sh - copy files from local to EC2
- indicate which files to copy in settings/lambda_files.txt
07_deploy_lambda.sh - calls remote scripts
08_terminate_ec2.sh - terminate EC2 instance
09_test_deployment.sh - curl deployment tests

Bash scripts - `scripts/`

Remote Scripts

12_configure_ec2.sh
- remote configuration (loads all settings)
13_create_deployment_package.sh
- creates the .zip deployment package
14_create_lambda_api_method.sh
- configures API Gateway & Lambda
15_create_alias_api_method.sh
- optional, configures a mirror API (e.g. most recent version)

aws-lambda-r - Demo

Deployment to AWS region us-east-1 (N. Virginia)

Prepare
- check user settings
- create VPC
- create AMI
- setup AWS S3
- setup AWS Lambda and API Gateway
Deploy / Run the main script

aws-lambda-r - Demo

aws-lambda-r - Demo

aws-lambda-r - Evaluation

Excellent performance in production!
- running for almost 6 months without a problem
AWS Lambda Limitations
- size for the deployment package: 250MB unzipped
- problem getting tideverse, DBI packages installed
- max run time for 5 min for Lambda / 30 sec for API Gateway
Installing R on Amazon Linux
- get the right Linux packages for R packages
- get the right files to copy to the deployment package
Too many options to configure API Gateway and AWS Lambda
- alternative is to lock down (limit) the implementation

aws-lambda-r - TODO List

Update documentation
Other HTTP responses, e.g. PNG, JPEG
Use the /tmp disk space (500MB) for large packages (e.g., BH)
Create a package to simplify deployment(?)
Allow for installing R packages from GitHub
Configure in settings which Linux and Python packages to install
Option for AWS Lambda only (without API Gateway)

Ideal Setup

Docker container
- set our own limits for disk, memory and run time
- no need for a Python wrapper
- install R and packages ourselves (the alternative is a locked down R)
FaaS infrastructure
- load balancing, auto scaling
- multiple authentication methods
- VPN for increase security
- versioning and multiple stages (e.g., alpha, beta, prod)
- monitoring and notification
Easy configuration (yaml)
- reproducible configuration (not using a web console)

Conclusion

Deploying R in production is not very easy but it is a good problem to have
Diverse approaches towards creating API
FaaS Provider space is getting competitive

Thank You!

Questions?

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.4  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.4.4     htmltools_0.3.6 yaml_2.1.18     Rcpp_0.12.16   
##  [9] stringi_1.1.7   rmarkdown_1.9   knitr_1.20      stringr_1.3.0  
## [13] digest_0.6.15   evaluate_0.10.1