Productionizing R scripts in the cloud

Gergely Daróczi

April 6, 2019

Intro

Every Data Science project starts with … ETL

  • Have you ever written an R script to be run in a non-interactive way?
  • Did it work?
  • Have you ever scheduled an R script to run without human intervention?
  • Do you have any R script running in production?
  • … with central logging, proper monitoring and alerting?
  • … running on cheap spot instances, with real-time performance metrics collected and fed to an AI that picks the optimal instance type for the next run?

Every BI consulting firm has developed its own …

  • job scheduler
  • data repository and metadata documentation tool
  • central logging
  • and a few other things …

Today we can just pick the right open-source tool, such as

  • Jenkins
  • Airflow or Luigi
  • cronR and taskscheduleR R packages
  • cloud services (e.g. AWS Batch, CloudWatch, Lambda)

Source: When my co-worker wants to simplify code …

Don’t get confused …

… use version control!

Source: A junior vs a senior during a huge system failure

git 101

Start from scratch:
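
The commands for this step did not survive on the slide, so the following is only a minimal sketch of what starting a repository from scratch might look like. The project directory, the file name job.R, and the demo identity passed with -c are hypothetical placeholders, not taken from the talk.

```shell
set -eu

# Hypothetical project location; a temp dir keeps the sketch side-effect free
project="$(mktemp -d)/my-etl"
mkdir -p "$project"
cd "$project"

# Turn the directory into a git repository
git init --quiet

# Add a first (placeholder) R script and record it in history
echo 'message("hello from a scheduled job")' > job.R
git add job.R

# -c sets a demo identity for this one command only (assumption: none is configured)
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add first job script"

# Show the resulting history
git log --oneline
```

In a real project you would typically set user.name and user.email once with git config instead of passing them on every commit.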

Contribute to an existing project:
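
The original commands are missing here as well, so this is a hedged sketch of a common clone, branch, commit, push workflow rather than the exact steps from the talk. To keep it runnable without network access, a local bare repository stands in for the existing project; the paths, the branch name add-etl-script, and the file etl.R are all hypothetical.

```shell
set -eu
work="$(mktemp -d)"

# Stand-in for the existing project (assumption): a local bare repo with one commit
git init --quiet "$work/seed"
cd "$work/seed"
echo "# demo project" > README.md
git add README.md
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Initial commit"
git clone --quiet --bare "$work/seed" "$work/upstream.git"

# 1. Clone the project (in practice this would be a remote URL)
git clone --quiet "$work/upstream.git" "$work/clone"
cd "$work/clone"

# 2. Work on a dedicated branch instead of the default one
git checkout --quiet -b add-etl-script

# 3. Make a change and commit it
echo 'message("ETL placeholder")' > etl.R
git add etl.R
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add ETL script skeleton"

# 4. Publish the branch so it can be reviewed and merged
git push --quiet origin add-etl-script
```

On a hosted service the last step is usually followed by opening a pull or merge request from the pushed branch.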

git 102

Version control all your scripts!

  • R packages
  • R scripts
  • Other scripts
  • Configuration files
  • Rmd reports
  • Ad-hoc queries

As you might need these later!

git 201

Read more at https://happygitwithr.com

Clean code!