The Data team is responsible for crunching, reporting, and serving data. The team also builds data integrations with other systems and creates machine learning and deep learning models.
With this post, we intend to share our favorite tools, which have proven to run on thousands of millions of records. Scaling processes in real-world scenarios is a hot topic among newcomers to the data field.
R or Python?
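R is a GNU project conceived as a language for statistical computing. It is similar to the S language, which was developed at Bell Laboratories, and was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s.
Python, first released in 1991 by Guido van Rossum, is a general-purpose language with a focus on code readability.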
Both R and Python are highly extensible through packages.
We mainly use R for our data processes and ML projects, and Python to do the integrations and Deep Learning projects.
RStudio is a full-featured open-source IDE that lets you browse the data and objects created during a session, view plots, and debug code, among many other features. An enterprise-ready edition is also available.
Jupyter is also an open-source environment, designed to work with Julia, Python, and R. Today it is widely used by data scientists to share their analyses. Google recently created "Colab", a Jupyter notebook environment that runs in the cloud and integrates with Google Drive.
So, Is R Capable of Running in Production?
We run several heavy data preparations and predictive models every day, every hour, and every few minutes.
How Do We Run R and Python Tasks in Production?
We use Airflow, an open-source project created by Airbnb, as our orchestrator.
Airflow is a robust project that lets us schedule processes, assign priorities, define rules, keep detailed logs, and more.
For development, we still use the form:
Airflow is a Python-based task scheduler that allows us to run chained processes with many complex dependencies, monitoring the current state of all of them and firing alerts to Slack if anything goes wrong. This is ideal for running import jobs that populate the Data Warehouse with fresh data every day.
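To make the idea of chained processes with dependencies more concrete, here is a minimal sketch in plain Python. It deliberately does not use Airflow's API: the `run_pipeline` function, the task names, and the `alert` callback (a stand-in for a Slack notification) are all hypothetical, and the sketch simply stops at the first failure, which is a simplification of what a real orchestrator does.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies, alert):
    """Run tasks in dependency order and report the first failure.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the set of tasks it waits for.
    alert: callable(name, error), a stand-in for a Slack notification.
    Returns the names of the tasks that completed successfully.
    """
    # Include tasks that have no upstream dependencies in the graph as well.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        try:
            tasks[name]()
        except Exception as error:
            alert(name, error)
            break  # simplification: stop the whole run at the first failure
        completed.append(name)
    return completed


# Hypothetical daily import job: extract, then transform, then load.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
alerts = []
completed = run_pipeline(tasks, dependencies, lambda name, err: alerts.append(name))
```

A real orchestrator adds scheduling, retries, priorities, and persistent logs on top of this ordering logic, which is why we rely on Airflow rather than on hand-written scripts.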
Do We Have a Data Warehouse?
Yes, and it's huge!
It's built on Amazon Redshift, a suitable option when scaling is a priority. Visit their website to learn more about it.
Generally, data is uploaded from R to Amazon Redshift using redshiftTools. This data can come either from plain files or from data frames created during the R session.
We use Python to import and export unstructured data, since R currently lacks convenient libraries for handling it.
We have experimented with JSON libraries in R, but the results are much worse than with Python in this scenario. For example, with RJSONIO the dataset is automatically transformed into an R data frame, with little control over how the transformation is done. This is only useful for very simple JSON structures; anything more complex is very difficult to manipulate in R, whereas in Python it is much easier and more natural.
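As an illustration of the kind of control Python gives over nested JSON, here is a small sketch using only the standard library. The `flatten` helper and the sample record are made up for this example and are not taken from our pipelines; the point is that we decide explicitly how nested objects become flat columns.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as they are.
            flat[full_key] = value
    return flat


# Hypothetical record with two levels of nesting and a list.
raw = '{"user": {"id": 7, "geo": {"country": "AR"}}, "tags": ["a", "b"]}'
row = flatten(json.loads(raw))
print(row)
```

Each flattened record maps naturally to one row of a table, and the rule for lists (kept as-is here) can be changed in one place if a different layout is needed.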
How do we deal with data preparation using R?
We have two scenarios: data preparation for data engineering, and data preparation for machine learning/AI.
One of the biggest strengths of R is the tidyverse, a set of packages developed by lots of ninja developers, some of whom work at RStudio Inc. These packages provide a common API and a shared philosophy for working with data. We will cover an example in the next section.
The tidyverse, especially the dplyr package, contains a set of functions that make exploratory data analysis and data preparation quite comfortable.
For certain data preparation and visualization tasks, we use the funModeling package. It was the seed for an open-source book I published some time ago: Data Science Live Book. It contains some good practices we follow related to deploying models in production, dealing with missing data, handling outliers, and more.
Does R Scale?
One of the key points of dplyr is that it can run on databases, thanks to another package with a pretty similar name: dbplyr.
This way, we write R syntax (dplyr), which is "automagically" converted to SQL and then runs in production.
There are some cases in which the conversion from R to SQL is not made automatically. In such cases, we can still mix SQL syntax into our R code.
For example, the following dplyr syntax:
flights %>% group_by(month, day) %>% summarise(delay = mean(dep_delay))
is translated into this SQL:
SELECT `month`, `day`, AVG(`dep_delay`) AS `delay` FROM `nycflights13::flights` GROUP BY `month`, `day`
dbplyr makes it transparent to the R user whether they are working with objects in RAM or in a remote database.
Not many people know this, but many key R packages are written in C++ (specifically, through the Rcpp package).
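To show what "running on the database" means in practice, here is a small Python sketch that executes the same kind of grouped average as the query in this section. It uses an in-memory SQLite database as a stand-in for Redshift, and the four rows of flight data are invented for the example; the point is only that the aggregation happens inside the database engine and just the summarized rows come back to the client.

```python
import sqlite3

# In-memory SQLite database standing in for the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (month INTEGER, day INTEGER, dep_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 1, 20.0), (1, 2, 5.0), (2, 1, 30.0)],  # made-up rows
)

# Same shape as the generated SQL: group by month and day, average the delay.
rows = conn.execute(
    "SELECT month, day, AVG(dep_delay) AS delay "
    "FROM flights GROUP BY month, day ORDER BY month, day"
).fetchall()

for month, day, delay in rows:
    print(month, day, delay)
conn.close()
```

With thousands of millions of rows, pushing the `GROUP BY` down to the database this way avoids loading the raw table into the client's memory.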
How do we share the results?
For ad hoc reports (HTML), we use R Markdown, which shares some functionality with Jupyter notebooks. It lets us write a script containing an analysis and render it as a dashboard, a PDF report, a web-based report, or even a book!
Machine Learning / AI
We use both R and Python.
For Machine Learning projects, we mainly use the caret package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.
For Deep Learning, we use Python, specifically the Keras library with TensorFlow as the backend. Keras is an API for building many of the most complex neural networks with just a few lines of code. It scales easily by training them in the cloud, on services like AWS.
Nowadays we are also doing some experiments with the fastai library for NLP problems.
Open-source languages are leading the way in data. R and Python have strong communities, and there are free, top-notch resources for learning them.
Here we wanted to share the not-so-common approach of using R for data engineering tasks, our favorite R and Python libraries, and how we share results, along with some of the practices we follow every day.
We think the most important stages in a data project are data analysis and data preparation. Choosing the right approach can save a lot of time and help the project scale.
We hope this post encourages you to try some of the suggested technologies and rock your data projects!