Productionizing R scripts in the cloud

Gergely Daróczi

April 6, 2019

dummy slide

Intro

Every Data Science project starts with … ETL

  • Have you ever written an R script to be run in a non-interactive way?
  • Did it work?
  • Have you ever scheduled an R script to run without human intervention?
  • Do you have any R script running in production?
  • … with central logging, proper monitoring and alerting?
  • … run on cheap spot-instances with real-time performance metrics collected and feeding an AI picking the optimal instance type for the next run?

Every BI consulting firm have developed its own …

  • job scheduler
  • data repository and metadata documentation tool
  • central logging
  • and a few other things …

Every BI consulting firm have developed its own …

  • job scheduler
  • data repository and metadata documentation tool
  • central logging
  • and a few other things …

Today we can just pick the right open-source tool, such as

  • Jenkins
  • Airflow or Luigi
  • cronR and taskscheduleR R packages
  • cloud services (eg AWS Batch, Cloudwatch, Lambda)

Source: When my co-worker wants to simplify code …

Don’t get confused …

… use version control!

Source: A junior vs a senior during a huge system failure

git 101

Start from scratch:

Contribute to an existing project:

git 102

Version control all your scripts!

  • R packages
  • R scripts
  • Other scripts
  • Configuration files
  • Rmd reports
  • Ad-hoc queries

As you might will need these later!

git 201

Read more at https://happygitwithr.com

Clean code!

Source: Junior dev being awesome

Code style

It can be anything … just be consistent!

Extra hints:

DRY

Document and potentially open-source your code

  • Pick a permissive license!
  • Use roxygen2, potentially with markdown
  • Write vignettes
  • Run R CMD check frequently to check on your docs
  • Build your R package homepage with pkgdown

Read more!

Code Smells and Feels by Jenny Bryan

Advanced R by Hadley Wickham

Read even more!

Clean Code by Robert C. Martin

Refactoring by Martin Fowler et al.

When I desperatly search in the logs why the app crashed

logger: a lightweight, modern and flexibly logging utility for R

cat demo.R

Colorized output

The Anatomy of a Log Request

Custom formatter / layout / appender

More about logging

https://daroczig.github.io/logger

Database connections

Spot the issues in the above code block!

Why secure database credentials?

“When I woke up the next morning, I had four emails and a missed phone call from Amazon AWS – something about 140 servers running on my AWS account, mining Bitcoin.” – Andrew Hoffman

Loading MySQL configuration from global options

But how to set those global options on the server?

Loading MySQL configuration from the keyring

Great for the single-desktop R user, but how to make use of it on a remote server?

Using a pre-configured Data Source Name

But we still need someone to set up / deploy configuration.

Using a MySQL configuration file

But we still need to set up ~/.my.cnf:

And then .pgpass etc as well.

Loading MySQL configuration from more general, custom files

Again, how to get those unencrypted RData files to the server?

Loading MySQL configuration from encrypted custom files

But how to get the private key to a new server?

Loading MySQL configuration from environment variables

Loading MySQL configuration from config files v1

With the below YAML config:

Loading MySQL configuration from config files v2

With the below YAML config:

But again, we have to get the YAML file to the server in a secure way :/

Amazon KMS for passwords

Source: AWS Encryption SDK

Amazon KMS for 4+ kbytes data

Source: AWS Encryption SDK

Loading MySQL configuration from config files v3

With the below YAML config:

Parsing a YAML config with KMS

Using a YAML config with KMS

Reminder: motivation

Reminder: motivation

Reminder: motivation

Source: xkcd/1205

A simple wrapper to connect AND query

A simple wrapper to connect AND query v2

db_query example

Bundled db_config.yml coming with the dbr package:

db_query features: attributes

db_query features: making use of attributes

db_query features: connections

db_query features: connections

db_query features: forking

It’s not caching the connection this time, only the credentials.

db_query features: forking

> parallel::mclapply(1:16, function(i) db_query("SELECT 42", "sqlite"), mc.cores = 8)
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] Finished in 0.001053 secs returning 1 rows
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] Finished in 0.005117 secs returning 1 rows
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Finished in 0.003497 secs returning 1 rows
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Finished in 0.0008972 secs returning 1 rows
INFO [2018-07-12 03:05:04] SELECT 42
INFO [2018-07-12 03:05:04] Finished in 0.00947 secs returning 1 rows
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] ********************
INFO [2018-07-12 03:05:04] Finished in 0.001237 secs returning 1 rows
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] Finished in 0.001038 secs returning 1 rows
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] Closing connection to sqlite
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Connecting to sqlite 
INFO [2018-07-12 03:05:04] Executing:**********
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] Connecting to sqlite 
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] Finished in 0.0009949 secs returning 1 rows
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Finished in 0.001511 secs returning 1 rows
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] Connecting to sqlite 
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Finished in 0.001259 secs returning 1 rows
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] Finished in 0.001187 secs returning 1 rows
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Finished in 0.03187 secs returning 1 rows
INFO [2018-07-12 03:05:05] Finished in 0.009078 secs returning 1 rows
INFO [2018-07-12 03:05:05] Finished in 0.0009627 secs returning 1 rows
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Connecting to sqlite 
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Connecting to sqlite 
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] Executing:**********
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] SELECT 42
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] ********************
INFO [2018-07-12 03:05:05] Finished in 0.0007973 secs returning 1 rows
INFO [2018-07-12 03:05:05] Finished in 0.0007989 secs returning 1 rows
INFO [2018-07-12 03:05:05] Closing connection to sqlite
INFO [2018-07-12 03:05:05] Closing connection to sqlite

Extending DBI to be able to use db_query

Extending DBI to be able to use db_query

Extending DBI to be able to use db_query

Benchmarking!!!

Transaction Processing Performance Council (TPC) Benchmarks ™ 10 TB

Benchmarking!!!

A more complex benchmark

A more complex benchmark

There’s more!

https://github.com/daroczig/dbr/blob/master/inst/example_sql_chunks.yaml

There’s more!

https://github.com/daroczig/dbr

More on the cloud …

From Java to Python

botor: Reticulate wrapper on ‘boto3’

Loading the botor package might take a while as it will also import the boto3 Python module in the background:

Useful AWS S3 helpers

s3_read('s3://botor/example-data/mtcars.csv', read.csv)
#> DEBUG [2019-04-06 02:08:14] Downloaded 1303 bytes from s3://botor/example-data/mtcars.csv and saved at '/tmp/Rtmpme20BF/filec247798c20d'
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Useful AWS S3 helpers

Useful AWS KMS helpers

Getting secrets from the AWS System Manager’s Parameter Store:

Missing AWS helpers

Scheduling jobs

Installing Jenkins

  1. Install Jenkins from https://pkg.jenkins.io/debian-stable/
wget -q -O - https://pkg.jenkins.io/debian-stable/jenkins.io.key | sudo apt-key add -
echo "deb https://pkg.jenkins.io/debian-stable binary/" | sudo tee -a /etc/apt/sources.list
sudo apt update
sudo apt install openjdk-8-jdk-headless jenkins
sudo netstat -tapen | grep java
  1. Open up port 8080 in the related security group / firewall
  2. Access Jenkins from your browser and finish installation

    1. Read the initial admin password:
    sudo cat /var/lib/jenkins/secrets/initialAdminPassword
    1. Proceed with installing the suggested plugins
    2. Create your first user(s)

Scheduling jobs via cron(ish)

Source: crontab.guru

Jenkins output

Jenkins plugins

  • Extended email (edit subject & body, attachments, inline images, flexible triggers)
  • Slack notifications
  • Write to Cloudwatch logs
  • Start remote jobs with AWS Batch
  • Groovy Postbuild to override build status and set custom badges
  • Jenkins Job DSL Plugin to define jobs in a programmatic way

Reproducible jobs

Creating Docker images

Creating reproducible Docker images

Use littler:

Sharing and using Docker images

  • Use Jenkins or Travis etc to build (and keep up-to-date) your Docker images
  • Also check Amazon’s Elastic Container Registry (ECR) for private Docker images
  • Use EC2 or assumed roles in IAM instead of AWS credentials
  • Write a wrapper around docker run --rm ecr.eu-west-1.amazonaws.com/test:latest
  • Write a wrapper for starting temporarily cloud instances, run docker pull and your Docker runner, then shut down the node after the script finished

Thanks!

Twitter: @daroczig