March 24, 2018

Of R and APIs: Running R in Amazon Lambda

About

Numeract LLC

  • Data Science and Economics / Finance consulting services
  • Technology Stack: (Postgre)SQL, R, Python, Spark, Docker, AWS


Authors and Contributors

  • Mike Badescu
    • PhD Economics, Lehigh University
    • Experience: Economics, Finance, Statistics, Machine Learning
  • Ana Niculescu
  • Teodor Ciuraru

Summary

  1. Motivation
  2. APIs
  3. Plumber
  4. OpenCPU
  5. Web Server / Docker vs FaaS
  6. R on Amazon Lambda
  7. Ideal Setup

Motivation

Typical Data Product Progression

  1. Exploratory Data Analysis
    • get the data
    • some cleaning too
  2. Proof of Concept
    • R Markdown report
  3. Stakeholder Engagement and Validation
    • Shiny App
    • iterate steps 1-3 as needed

4. Can we use it in production?

  • Is Shiny scalable?
  • We already have a dashboard, we do not use Shiny.
  • Can we use the current R code but without Shiny?
  • How do we integrate it with our current code base?


  • Rewriting the code in another programming language is not efficient
  • Certain algorithms are not available in other programming languages

Solutions

  • Install R on the same machine as the production server
    • Increases chances of failure, maintenance issues
    • Security issues
  • R running inside of a VM next to other VMs
    • How do we connect to it?
    • Use files? Read from a Database?
  • API
    • Can we have an API?
    • Can it be a REST API?

API

  • Modularize the application: calls from Client to Server
  • Programming language independent
  • Common concept: programmers know how to work with APIs
  • Common "data language": JSON (but not the only one)
  • REST API
    • stateless ==> pure functions
    • over HTTP, browsable
  • Over HTTP, two main methods:
    • GET: sends a request to a server and waits for data
    • POST: sends data to the server and waits for an answer
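
The two HTTP methods above can be tried without any external service. The following is only a minimal sketch using the Python standard library: the toy server, its `/status` and `/sum` paths, and the `demo()` helper are all made up for illustration and are not part of the talk.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class ToyHandler(BaseHTTPRequestHandler):
    """Toy server: GET returns fixed data, POST returns the sum of two posted numbers."""

    def _send_json(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET: the client only asks for data
        self._send_json({"status": "ok"})

    def do_POST(self):
        # POST: the client sends data in the request body
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length))
        self._send_json({"sum": data["a"] + data["b"]})

    def log_message(self, *args):
        pass  # keep the example quiet


def demo():
    """Start the toy server, issue one GET and one POST, return both answers."""
    server = HTTPServer(("127.0.0.1", 0), ToyHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    try:
        with opener.open(base + "/status") as resp:
            get_result = json.loads(resp.read())
        post = Request(
            base + "/sum",
            data=json.dumps({"a": 1, "b": 2}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with opener.open(post) as resp:
            post_result = json.loads(resp.read())
    finally:
        server.shutdown()
        server.server_close()
    return get_result, post_result
```

Calling `demo()` returns the parsed JSON answers of the GET and the POST call as a pair.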

API Example

API Calls

Main Tool: curl

  • Free and open source
  • Can handle: FTP, FTPS, HTTP, HTTPS, IMAP, SCP, etc.
  • Deals with SSL certificates, cookies, HTTP headers, etc.
  • R packages:
    • curl by Jeroen Ooms
    • RCurl by Duncan Temple Lang and the CRAN team

API Example

Facebook Graph API: using curl from Git Bash

R and API

Can we serve API requests from R?

  • Not directly, as R is not a web server
  • We need a wrapper / server that:
    • receives requests and hands them to R
    • passes any response from R to the client

Approaches

  • Web server / container:
    • plumber and OpenCPU
  • Serverless: wrap R functions inside FaaS (Function as a Service)
    • running R on AWS Lambda

plumber

An R package that generates a web API from the R code you already have.

  • Main author: Jeff Allen
  • Package website: https://www.rplumber.io/
  • Usage:
    • use function decorators, similar to roxygen2
    • call plumb() on the .R file
    • access local server, similar to shiny
  • Quick, interactive loop: code <==> local server => prod server
  • Has its own web server to deploy in production
  • Very good documentation and examples
    • source of the following examples

plumber

Decorators

#* @get /mean
normalMean <- function(samples = 10) {
    data <- rnorm(samples)
    mean(data)
}


#* @post /sum
addTwo <- function(a, b) {
    as.numeric(a) + as.numeric(b)
}
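
From the client side, the two endpoints above are ordinary HTTP calls. Below is a hedged sketch of how a Python client might build them; the base URL is an assumption (use whatever host and port the plumber server was started on), and the helper names are hypothetical. The requests are only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: where the plumber server listens


def mean_request(samples):
    """GET /mean with the argument passed in the query string."""
    query = urlencode({"samples": samples})
    return Request(BASE_URL + "/mean?" + query, method="GET")


def sum_request(a, b):
    """POST /sum with the arguments passed as a JSON body."""
    body = json.dumps({"a": a, "b": b}).encode("utf-8")
    return Request(
        BASE_URL + "/sum",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing either object to `urllib.request.urlopen()` would send it once a server is running.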

plumber - Local Server

plumber - Features

Output customization, including JSON (default), HTML, PNG, JPEG

#' @get /plot
#' @png
function(){
    myData <- iris
    plot(myData$Sepal.Length, myData$Petal.Length,
         main="All Species", xlab="Sepal Length", ylab="Petal Length")
}

plumber - Features

Filters

  • Pipeline of handling requests
  • Can be used as authorizers (similar to AWS Lambda functions)
#* @filter checkAuth
function(req, res){
    if (is.null(req$username)){
        res$status <- 401 # Unauthorized
        return(list(error="Authentication required"))
    } else {
        plumber::forward()
    }
}
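
The idea of a filter pipeline is independent of R. The sketch below mimics it in plain Python to show the control flow only; it is an analogy, not how plumber is implemented, and every name in it (`FORWARD`, `run_pipeline`, `whoami`) is hypothetical.

```python
FORWARD = object()  # sentinel meaning "pass the request to the next handler"


def check_auth(req, res):
    """Filter: reject requests that carry no username."""
    if req.get("username") is None:
        res["status"] = 401  # Unauthorized
        return {"error": "Authentication required"}
    return FORWARD


def run_pipeline(filters, endpoint, req):
    """Run each filter in order; call the endpoint only if all of them forward."""
    res = {"status": 200}
    for flt in filters:
        result = flt(req, res)
        if result is not FORWARD:
            # a filter answered the request itself: stop here
            return res["status"], result
    return res["status"], endpoint(req)


def whoami(req):
    """Toy endpoint reached only by authenticated requests."""
    return {"user": req["username"]}
```

A request without a username stops at the filter with status 401; a request with one reaches the endpoint.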

plumber - Features

Dynamic Routes

  • API endpoint: /users/13, where 13 is dynamic
#' @get /users/<id>
function(id) {
    subset(users, uid == id)
}

plumber - Deployment

Local Server

  • Single threaded R instance

Web Server / Docker

  • VMs (documentation example for Digital Ocean)
  • RStudio Connect
  • Docker: single container
  • Docker: multiple plumber instances
    • traffic routed by nginx
    • use load balancing to scale horizontally

OpenCPU

A reliable and interoperable HTTP API for data analysis based on R.

  • Author: Jeroen Ooms
  • Package website: https://www.opencpu.org/
  • Usage on local machine:
    • write functions ==> deploy within a package
    • start server, call API (tmp key for extra info)
  • Testing new code on the local server takes longer
  • JavaScript client
  • Web server with more features

OpenCPU

Features

  • The tmp key exposes: the input, the R command, the code, the output
  • Linux OpenCPU Server for production deployment
    • support for parallel/async requests ==> get a large server
    • AppArmor for App security within Linux
    • Docker container
    • authentication through apache2 / nginx authentication modules

Deployment

  • Similar to plumber, except no RStudio Connect

Web Server / Docker vs FaaS

Web Server / Docker

  • Common Linux layer
  • Application specific containers
    • flexible memory and CPU resources
  • If only one container (0 to 1 issue)
    • always running
    • not scalable
  • More than one container (1 to N issue)
    • minimum framework needed: cache, load balancing
    • auto-scaling up to a given N containers

Web Server / Docker vs FaaS

Function as a Service

  • In general, a platform to build apps while minimizing infrastructure maintenance
  • The focus is on the code / functionality, not DevOps
  • Pay only for what you use
  • Horizontally Scalable
  • "Smaller Containers": usually resources are limited
  • Programming languages:
    • initially: Java, JavaScript, C#, Go
    • now Python is being added by more providers
    • Docker containers are becoming more popular

Web Server / Docker vs FaaS

FaaS providers (cont.)

  • Apache: OpenWhisk
    • on IBM Bluemix
    • JavaScript, Swift, Python, PHP, Java
    • any binary-compatible executable including Go programs
    • Docker
  • Oracle: Cloud Platform / Fn project
    • Java, Go, Python, and Ruby
    • Docker

Web Server / Docker vs FaaS

Common concerns

  • How to authenticate the client?
    • HTTP Headers, tokens, etc.
  • How to encrypt communication with the client?
    • HTTPS
  • How to limit the resources?
    • Limits on CPU run time, memory and disk
    • Careful not to DDoS yourself!
  • How fast to scale horizontally?
    • How easy/fast is it to go from 0 to 1 calls?
    • How easy/fast is it to go from 1 to N (small) calls?
    • How easy/fast is it to go to N (large) calls?

R on FaaS

aws-lambda-r

Motivation

  • Proof of concept validated by client (Shiny rocks!)
  • App engine tasks:
    • get a request id
    • get data from a database
    • process data for 5-20 sec
  • Q: How to use this code in production?
    • needs to be triggered by the production server
    • not possible to run R on the same server
    • large processing backlog
    • uncertain and irregular future demand

aws-lambda-r

  • Solution: R on AWS Lambda
    • 0 to 10,000 scalability
    • very low costs
    • allows us to change the implementation details (Python?) later
  • numeract/aws-lambda-r on GitHub
    • not an R package
    • a series of Bash scripts
  • A framework
    • uses AWS CLI through (Git)Bash ==> available on all platforms
    • it is flexible because AWS changes often
    • easy to adapt to automatically deploy Python or JavaScript functions

aws-lambda-r - Details

Top view

  • Deploy R function on AWS Lambda
  • Configure access to AWS Lambda

How hard can it be?

aws-lambda-r - Details

  • Deploy R function on AWS Lambda
    • needs to be called from Python
    • create a temporary .zip deployment package and store it on S3
  • Configure access to AWS Lambda
    • configure API Gateway
    • configure AWS Lambda

aws-lambda-r - Details

  • Deploy R function on AWS Lambda
    • needs to be called from Python
      • temporary EC2 instance
      • install Python, R + packages
      • copy R files to EC2
    • create a temporary .zip deployment package and store it on S3
  • Configure access to AWS Lambda
    • configure API Gateway
      • create API resources and HTTP methods
      • create another Lambda function to use as an authorizer
    • configure AWS Lambda
      • size, permissions
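
For orientation, a token-based Lambda authorizer for API Gateway can be as small as the sketch below. It is not the authorizer shipped with aws-lambda-r: the `EXPECTED_TOKEN` value and the `principalId` are placeholders, and a real deployment would read the secret from configuration rather than hard-code it.

```python
EXPECTED_TOKEN = "change-me"  # placeholder secret, for illustration only


def authorizer_handler(event, context):
    """Return an IAM policy that allows or denies the incoming API call."""
    token = event.get("authorizationToken")
    effect = "Allow" if token == EXPECTED_TOKEN else "Deny"
    return {
        "principalId": "user",  # placeholder identifier for the caller
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": event["methodArn"],
                }
            ],
        },
    }
```

API Gateway passes the token and the ARN of the requested method in `event`, and uses the returned policy to decide whether to invoke the main function.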

aws-lambda-r - Details

  • Deploy R function on AWS Lambda
    • needs to be called from Python
      • temporary EC2 instance
        • create IAM roles, VPC, security groups, etc.
      • install Python, R + packages
        • figure out what other Linux packages are needed + a Python wrapper
      • copy R files to EC2
    • create a temporary .zip deployment package and store it on S3
      • figure out what files to copy
  • Configure access to AWS Lambda
    • configure API Gateway
      • create API resources and HTTP methods
        • account for API versioning
      • create another Lambda function to use as an authorizer
    • configure AWS Lambda
      • size, permissions

aws-lambda-r - Security

example.R

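# Entry point called from Python: takes a JSON string, returns a JSON string.
# from_json() and to_json() are helper functions assumed to be defined elsewhere.
aws_lambda_r <- function(input_json) {
    # fallback answer, overwritten on success or on a caught error
    output_json <- '{"message": "Cannot create output JSON"}'
    tryCatch({
        input_lst <- from_json(input_json)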
        request_id <- input_lst$request_id[1]
        output_lst <- list(
            result_id = request_id,
            result_lst = list(a = 1, b = 2:4),
            result_dbl = 1:10 / 2,
            message = NULL
        )
        output_json <- to_json(output_lst)
    }, error = function(e) {
        output_json <<- paste0('{"message": "', e$message, '"}')
    })
    output_json
}

lambda_get.py

import os
os.environ["R_HOME"] = os.getcwd()
os.environ["R_LIBS"] = os.path.join(os.getcwd(), 'libraries')
import rpy2
import ctypes
import rpy2.robjects as robjects
import json

# load the shared libraries bundled with the deployment package
for file_name in os.listdir('lib/external'):
    ctypes.cdll.LoadLibrary(os.path.join(os.getcwd(), 'lib/external', file_name))

# source R file
# this R file might load libraries and source other files
robjects.r['source']('example.R')

# exposing R entry point to python
aws_lambda_r = robjects.globalenv['aws_lambda_r']

def handler_get(event, context):
    input_json = json.dumps(event)
    output_json = json.loads(str(aws_lambda_r(input_json)))
    return output_json
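
The contract between the two files is "JSON string in, JSON string out". The sketch below exercises the same handler logic locally without R or rpy2 by swapping in a pure-Python stand-in; `fake_aws_lambda_r` and the extra `r_function` parameter are hypothetical and exist only for this illustration.

```python
import json


def fake_aws_lambda_r(input_json):
    """Hypothetical stand-in for the R entry point: JSON string in, JSON string out."""
    try:
        request_id = json.loads(input_json)["request_id"]
        return json.dumps({"result_id": request_id})
    except Exception as exc:
        # mirror the R code: report the error inside the JSON answer
        return json.dumps({"message": str(exc)})


def handler_get(event, context, r_function=fake_aws_lambda_r):
    """Serialize the event, hand it to the (stand-in) R function, parse the answer."""
    input_json = json.dumps(event)
    return json.loads(str(r_function(input_json)))
```

For example, `handler_get({"request_id": 7}, None)` gives back a dictionary whose `result_id` is 7, while an event without `request_id` yields a dictionary with a `message` key.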

Bash scripts

What we need

  • An AWS account
  • AWS CLI installed and configured on the local machine
    • look for a ~/.aws/ directory with two files
    • get an SSH key from AWS and place it in the ~/.ssh/ directory
    • keep your secrets safe!!
  • Copy these directories to your app:
    • lambda/ - will contain your .R and .py files before uploading to AWS
    • python/ - Python scripts (one is usually sufficient)
    • scripts/ - configuration and deployment
    • settings/ - default, auto-config, user settings and secrets

Bash scripts - settings/

  • settings_default.sh
    • for reference, list of variables that need to be populated
  • secrets_default.sh
    • for reference, list of access keys, names, IDs to be populated
  • setup_auto.sh
    • list of variables populated by the auto-configuration scripts
    • keep this file safe - do not commit to GitHub!
  • setup_user.sh
    • user overriding settings and secrets
    • danger: you will enter your AWS ACCESS KEY here
    • keep this file safe - do not commit to GitHub!

Bash scripts - scripts/

Auto-configuration: settings will be saved in setup_auto.sh

  • 21_setup_vpc.sh
    • sets up a new VPC in the AWS zone indicated in settings
  • 22_setup_custom_ami.sh
  • 23_setup_s3.sh
    • creates an AWS S3 bucket if not already present
  • 24_setup_lambda.sh
    • creates Lambda Authorizer function
    • creates and configures API Gateway

Bash scripts - scripts/

Local Scripts

  • 01_main.sh - calls all other local scripts
  • 02_setup.sh - loads all settings
  • 03_check_settings.sh - checks and prints out the settings
  • 04_create_ec2.sh - new EC2 instance
  • 05_update_ec2.sh - update EC2 if no custom AMI found
  • 06_copy_files.sh - copy files from local to EC2
    • indicate which files to copy in settings/lambda_files.txt
  • 07_deploy_lambda.sh - calls remote scripts
  • 08_terminate_ec2.sh - terminate EC2 instance
  • 09_test_deployment.sh - curl deployment tests

aws-lambda-r - Demo

Deployment to AWS region us-east-1 (N. Virginia)

  • Prepare
    • check user settings
    • create VPC
    • create AMI
    • setup AWS S3
    • setup AWS Lambda and API Gateway
  • Deploy / Run the main script

aws-lambda-r - Evaluation

  • Excellent performance in production!
    • running for almost 6 months without a problem
  • AWS Lambda Limitations
    • size for the deployment package: 250MB unzipped
    • problems getting the tidyverse and DBI packages installed
    • max run time of 5 min for Lambda / 30 sec for API Gateway
  • Installing R on Amazon Linux
    • get the right Linux packages for R packages
    • get the right files to copy to the deployment package
  • Too many options to configure API Gateway and AWS Lambda
    • alternative is to lock down (limit) the implementation
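
Because the 250MB limit applies to the unzipped package, it can be worth checking the archive before uploading it. A minimal sketch follows; the helper names are hypothetical and the limit is taken as 250 × 1024 × 1024 bytes.

```python
import os
import tempfile
import zipfile

LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024  # 250MB, in bytes


def unzipped_size(zip_path):
    """Total size in bytes of all files once the archive is extracted."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def fits_lambda(zip_path, limit=LAMBDA_UNZIPPED_LIMIT):
    """True if the extracted package stays within the given limit."""
    return unzipped_size(zip_path) <= limit


def demo():
    """Create a small archive and report its unzipped size and both checks."""
    with tempfile.TemporaryDirectory() as work:
        zip_path = os.path.join(work, "package.zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("example.R", "x" * 1000)  # 1000 bytes before compression
        return (
            unzipped_size(zip_path),
            fits_lambda(zip_path),
            fits_lambda(zip_path, limit=999),
        )
```

The check reads only the archive's file table, so it stays fast even for packages close to the limit.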

aws-lambda-r - TODO List

  • Update documentation
  • Other HTTP responses, e.g. PNG, JPEG
  • Use the /tmp disk space (500MB) for large packages (e.g., BH)
  • Create a package to simplify deployment(?)
  • Allow for installing R packages from GitHub
  • Configure in settings which Linux and Python packages to install
  • Option for AWS Lambda only (without API Gateway)

Ideal Setup

  • Docker container
    • set our own limits for disk, memory and run time
    • no need for a Python wrapper
    • install R and packages ourselves (the alternative is a locked down R)
  • FaaS infrastructure
    • load balancing, auto scaling
    • multiple authentication methods
    • VPN for increased security
    • versioning and multiple stages (e.g., alpha, beta, prod)
    • monitoring and notification
  • Easy configuration (yaml)
    • reproducible configuration (not using a web console)

Conclusion

  • Deploying R in production is not very easy but it is a good problem to have
  • Diverse approaches towards creating APIs
  • FaaS Provider space is getting competitive

Thank You!

Questions?

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.4  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.4.4     htmltools_0.3.6 yaml_2.1.18     Rcpp_0.12.16   
##  [9] stringi_1.1.7   rmarkdown_1.9   knitr_1.20      stringr_1.3.0  
## [13] digest_0.6.15   evaluate_0.10.1