Today I'll be presenting R for Auth0, our internal R package for streamlining our workflow when working with the R language. At Auth0, we use R not only for statistical analysis but also for most of our ETL processing jobs.
Auth0's Data Technology Stack
First, some insight into our technology stack. We use:
- R for ETL, Analysis and Machine Learning
- RStudio as an interface for working with R
- Python as our importer/exporter for external APIs
- Amazon Redshift as our data warehouse
- Apache Airflow as our job scheduling system
Except for Amazon Redshift, our stats backend is all open source software.
Enter rauth0
Working with ETL processes every day, we noticed some recurring patterns (table loading, upserting, slowly changing dimensions, ggplot theming, and others) that we could simplify by centralizing them in one place. The result is rauth0, a package that centralizes the functionality we reuse across many processes.
We will walk through some of the most interesting pieces of rauth0, in the hope that it will be useful when building an internal R package for your own organization, or just for personal use. The package is MIT licensed, so you can adapt and use the code without significant restrictions.
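If you would like to try it out, a typical way to install an R package straight from GitHub is with devtools:
# Assumes you have devtools available; installs rauth0 from its GitHub repository
install.packages("devtools")
devtools::install_github("auth0/rauth0")
library(rauth0)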
Plotting Beautifully
We teamed up with Julian Leiss from Auth0's design team to work out better defaults for ggplot2, in hopes of making our R-generated charts more presentable and more consistent with the company's style. As part of this work, we defined these palettes for discrete variables:
library(ggplot2)
library(rauth0)
base_plot =
  ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme_auth0()

base_plot + scale_fill_auth0_discrete(palette='default') +
  ggtitle('Default palette')

base_plot + scale_fill_auth0_discrete(palette='sequential') +
  ggtitle('Sequential palette')

base_plot + scale_fill_auth0_discrete(palette='colorful') +
  ggtitle('Colorful palette')
For continuous variables, we also defined a gradient scale:
dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(dsamp, aes(carat, price)) +
  geom_point(aes(colour = price)) +
  theme_auth0() +
  scale_color_auth0_gradient(midpoint = 10000) +
  labs(title="Continuous scale", subtitle='Carat vs price')
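If you want to build something similar for your own organization, the pattern is simple: wrap an existing ggplot2 theme and expose your brand palette through small scale functions. A minimal sketch follows; the theme settings and hex colors are placeholders rather than Auth0's actual values:
library(ggplot2)

# Placeholder brand colors, swap in your organization's palette
myorg_palette <- c('#635dff', '#3ec6eb', '#ff4f40', '#2a2e35', '#a5a9b4')

# A theme wrapper in the spirit of theme_auth0(): start from a base theme
# and override only the pieces that matter for your brand
theme_myorg <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = 'bold'),
      panel.grid.minor = element_blank(),
      legend.position = 'bottom'
    )
}

# Discrete fill scale built on the brand palette
scale_fill_myorg_discrete <- function(...) {
  scale_fill_manual(values = myorg_palette, ...)
}

ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme_myorg() +
  scale_fill_myorg_discrete()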
Accessing the Data Warehouse
As mentioned earlier, we use Amazon Redshift as our data warehouse (DWH); to upload data into it, we rely on the open source R package redshiftTools. Uploading is just one piece of the picture though: we also need to read data back, and there are repetitive manipulations we perform along the way.
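To give a flavor of the upload side, here is a sketch based on the redshiftTools README; the exact argument names may vary between versions, and the data frame, table, and bucket names below are placeholders:
library(redshiftTools)

con <- dwh_connect()
my_data <- data.frame(id = 1:3, value = c('a', 'b', 'c'))

# Sketch only: replace a table wholesale, or upsert by key ('prod.my_table'
# and 'my-s3-bucket' are placeholders; check the package docs for exact arguments)
rs_replace_table(my_data, dbcon = con, table_name = 'prod.my_table',
                 bucket = 'my-s3-bucket')
rs_upsert_table(my_data, dbcon = con, table_name = 'prod.my_table',
                bucket = 'my-s3-bucket', keys = c('id'))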
On the reading side, we often want to connect with dbplyr and figure out, for example, which images are most commonly used as Twitter thumbnails:
library(tidyverse)
library(dbplyr)
con = dwh_connect()
page_metadata = tbl(con, in_schema('prod', 'dim_page_metadata'))
twitter_image_count = filter(page_metadata, page_meta_twitter_image != '<Blank>', scd_end_date == '9999-12-31') %>%
  group_by(page_meta_twitter_image) %>%
  count() %>%
  arrange(desc(n))
knitr::kable(collect(twitter_image_count, n=10), format='markdown') # Top 10
The beauty of this is that with dbplyr we can also inspect which query the client runs to make this work:
show_query(twitter_image_count)
## <SQL>
## SELECT "page_meta_twitter_image", COUNT(*) AS "n"
## FROM prod.dim_page_metadata
## WHERE (("page_meta_twitter_image" != '<Blank>') AND ("scd_end_date" = '9999-12-31'))
## GROUP BY "page_meta_twitter_image"
## ORDER BY "n" DESC
In the same spirit of data manipulation, the package also ships Redshift-compatible functions to run arbitrary queries and statements, create views, replace tables with data or with a SQL query, and more. I can't stress enough how painless it is to work with R and Amazon Redshift with these tools available, especially dbplyr.
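If you are assembling a package like this yourself, the helpers don't need to be complicated; the sketch below shows the general shape using plain DBI. The function names and bodies are illustrative, not the actual rauth0 implementations:
library(DBI)

# Illustrative wrappers only; see the rauth0 source for the real versions
dwh_query <- function(con, sql) {
  DBI::dbGetQuery(con, sql)   # run a SELECT and return a data.frame
}

dwh_execute <- function(con, sql) {
  DBI::dbExecute(con, sql)    # run a DDL/DML statement, return the affected row count
}

dwh_create_view <- function(con, name, select_sql) {
  dwh_execute(con, paste0('CREATE OR REPLACE VIEW ', name, ' AS ', select_sql))
}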
Kimball methods
This library also includes functions to update SCD (slowly changing dimension) type 1 and type 2 dimensions. I invite you to look at dwh_update_scd_type1, dwh_update_scd_type1_in_database and dwh_update_scd_type2; they are quite complex, but worth the effort because of the problems they solve.
They might have edge cases not covered by our current use cases and will evolve over time. With its advantages and disadvantages, this can be considered an alternative to buying a full-fledged ETL tool, without having to use a graphical interface to design the data flows.
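To make the idea concrete, here is a minimal, self-contained sketch of the core SCD type 2 logic on plain data frames; the actual rauth0 functions work against Redshift, so this only illustrates the bookkeeping. When a tracked attribute changes, the current row is closed by stamping its scd_end_date, and a fresh row is opened with the sentinel end date '9999-12-31':
library(dplyr)

# Current state of the dimension: one open row per key,
# with the sentinel end date '9999-12-31' marking the active record
dim_customer <- tibble(
  customer_key   = c(1, 2),
  plan           = c("free", "enterprise"),
  scd_start_date = as.Date(c("2018-01-01", "2018-01-01")),
  scd_end_date   = as.Date(c("9999-12-31", "9999-12-31"))
)

# Incoming snapshot: customer 1 upgraded from 'free' to 'paid'
incoming <- tibble(customer_key = c(1, 2), plan = c("paid", "enterprise"))
today <- as.Date("2019-06-01")

# Find open rows whose tracked attribute changed
changed <- dim_customer %>%
  filter(scd_end_date == as.Date("9999-12-31")) %>%
  inner_join(incoming, by = "customer_key", suffix = c("", "_new")) %>%
  filter(plan != plan_new)

# Close the changed rows by stamping today's date...
closed <- dim_customer %>%
  mutate(scd_end_date = if_else(
    customer_key %in% changed$customer_key &
      scd_end_date == as.Date("9999-12-31"),
    today, scd_end_date
  ))

# ...and open a brand new current row for each changed key
new_rows <- changed %>%
  transmute(customer_key,
            plan = plan_new,
            scd_start_date = today,
            scd_end_date = as.Date("9999-12-31"))

dim_customer <- bind_rows(closed, new_rows)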
A real clickstream data example
For example, this is the real production code for building the sessions dimension of our clickstream data. A separate process generates the staging tables: stg_session_events has one row per pageview/event, and stg_sessions has a single row per session.
library(rauth0)
library(tidyverse)
library(dbplyr)
con = dwh_connect()
all_events = tbl(con, 'events_metrics') %>% filter(received > '2012-01-01')
# Only use pageviews, tracked events and identify calls
filteredEvents = filter(all_events, type %in% c('track', 'page', 'identify'), sql("url_protocol(url)") %in% c('http', 'https'))
dim_ref = tbl(con, 'stg_sessions')
map_ref = tbl(con, 'stg_session_events')
dim_page = tbl(con, in_schema('prod', 'dim_page'))
dim_sessions = tbl(con, in_schema('prod', 'dim_session'))
dwh_set_execution_slots(con, 3)
onlyNew = select(map_ref, id=event_id, session_key) %>%
  anti_join(dim_sessions, by='session_key') %>%
  compute()
basePageviews = onlyNew %>% # Only sessions not in dimension already, recalculating is costly
  inner_join(filteredEvents, by=c('id')) %>%
  inner_join(dim_page, by=c('url'='page_full_url')) %>%
  # Remove consecutive visits/tracks to the same url
  mutate(
    prev_page=sql('lag(page_clean_url) over (partition by session_key order by received, id)')
  ) %>%
  # Keep the first event of each session too (its prev_page is NULL)
  filter(is.na(prev_page) | prev_page != page_clean_url)
actionNum = basePageviews %>%
  group_by(session_key) %>%
  tally()
actionSequence = basePageviews %>%
  # Get sequence numbers
  mutate(
    url=sql('left(coalesce(page_clean_url, url), 500)::varchar(500)'),
    seq=sql('row_number() over(partition by session_key order by received, id)'),
    revseq=sql('row_number() over(partition by session_key order by received desc, id)')
  ) %>%
  # Build string for first five and last five pageviews
  filter(seq <= 5 | revseq <= 5) %>%
  mutate(
    path_firstfive = ifelse(seq <= 5, url, NA),
    path_lastfive = ifelse(revseq <= 5 & seq > 5, url, NA)
  ) %>%
  group_by(session_key) %>%
  summarize(
    firstfive=sql("listagg(path_firstfive, ' -> ') within group (order by received)"),
    lastfive=sql("listagg(path_lastfive, ' -> ') within group (order by received)")
  ) %>%
  ungroup() %>%
  mutate(
    firstfive=sql('firstfive::varchar(2520)'),
    lastfive=sql('lastfive::varchar(2520)')
  ) %>%
  inner_join(actionNum, by='session_key') %>%
  mutate(
    action_sequence=case_when(
      n > 10 ~ concat(concat(firstfive, ' -> [...] -> '), lastfive),
      !is.na(lastfive) ~ concat(concat(firstfive, ' -> '), lastfive),
      TRUE ~ firstfive
    )
  ) %>%
  inner_join(dim_ref, by='session_key') %>%
  select(session_key, action_sequence, session_id) %>%
  compute()
dwh_upsert_table_from_temp_table(con, "prod.dim_session", remote_name(actionSequence), keys=c('session_key'))
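For readers unfamiliar with Redshift: it has no native upsert, so the pattern behind a call like dwh_upsert_table_from_temp_table is a transactional delete-then-insert. Here is a rough sketch of that idea, not the exact statements rauth0 runs:
# A sketch of the delete-then-insert pattern behind a Redshift upsert;
# function and argument names here are illustrative
upsert_sketch <- function(con, target, staging, key) {
  DBI::dbExecute(con, "BEGIN")
  # Drop target rows that are about to be replaced...
  DBI::dbExecute(con, sprintf(
    "DELETE FROM %s USING %s WHERE %s.%s = %s.%s",
    target, staging, target, key, staging, key
  ))
  # ...then insert the fresh versions from the staging table
  DBI::dbExecute(con, sprintf("INSERT INTO %s SELECT * FROM %s", target, staging))
  DBI::dbExecute(con, "COMMIT")
}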
Final Notes
We use R and the other technologies mentioned here for our ETL processing, and it has been very fruitful and smooth for us. We hope some of the ideas in this article can be useful for your own data warehousing efforts.
You can access and download our internal library here: https://github.com/auth0/rauth0