
RAuth0 - Our internal R package

Learn about the R package we use internally at Auth0 for ETL / Data analysis.

May 2, 2019 · 6 min read

Today I'll be presenting R for Auth0, our internal R package to smooth our workflow when working with the R language. At Auth0, we use R not only for statistical analysis but also for most of our ETL processing jobs.

Auth0's Data Technology Stack

First, some insight into our technology stack: we use R, together with packages like ggplot2, dbplyr, and redshiftTools, on top of Amazon Redshift as our data warehouse. Except for Amazon Redshift, our stats backend is all open source software.

Enter rauth0

Working with ETL processes every day, we noticed some recurring patterns (table loading, upserting, slowly changing dimensions, ggplot theming, and others) that we could simplify by centralizing in one place. So we built this package, which centralizes functionality that we reuse in many processes.

We will walk through some of the most interesting pieces of rauth0, in hopes it'll be useful for building out your own internal R package, whether for your organization or just for personal use. The package is MIT licensed, so you are free to adapt and use the code without significant restrictions.
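If you want to bootstrap a similar package for your own team, the usethis and devtools packages make scaffolding one straightforward. Here is a minimal sketch; the package name and file names are placeholders, and this tooling is standard R practice rather than anything rauth0-specific:

# One-time setup for a new internal package ("rmycompany" is a placeholder name)
install.packages(c("usethis", "devtools"))

usethis::create_package("~/projects/rmycompany")  # generate the package skeleton
usethis::use_mit_license()                        # the same license rauth0 uses
usethis::use_r("plotting")                        # creates R/plotting.R for your helpers
usethis::use_package("ggplot2")                   # declare runtime dependencies
usethis::use_package("dbplyr")

devtools::document()   # build NAMESPACE and docs from roxygen comments
devtools::check()      # run R CMD check before sharing with your team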

Plotting Beautifully

We teamed up with Julian Leiss from Auth0's design team to work out some better defaults for ggplot2, in hopes of making our R-generated charts better for presentation and more in line with the company's style. Working on this, we defined these palettes for discrete variables:

library(ggplot2)
library(rauth0)

base_plot = ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme_auth0()

base_plot + scale_fill_auth0_discrete(palette='default') +
  ggtitle('Default palette')

Example graph of Auth0 default palette

base_plot + scale_fill_auth0_discrete(palette='sequential') +
  ggtitle('Sequential palette')

Example graph of Auth0 sequential palette

base_plot + scale_fill_auth0_discrete(palette='colorful') +
  ggtitle('Colorful palette')

Example graph of Auth0 colorful palette

And this one for continuous variables:

dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(dsamp, aes(carat, price)) +
  geom_point(aes(colour = price)) +
  theme_auth0() +
  scale_color_auth0_gradient(midpoint = 10000) +
  labs(title="Continuous scale", subtitle='Carat vs price')

Example chart of continuous scale
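In case you're wondering how scales like scale_fill_auth0_discrete are put together, a custom discrete scale can be a thin wrapper around ggplot2's scale_fill_manual(). This is a simplified sketch with made-up hex values, not rauth0's actual palettes or implementation:

library(ggplot2)

# Hypothetical palettes; the real Auth0 hex values live inside rauth0
auth0_palettes <- list(
  default    = c("#EB5424", "#16214D", "#44C7F4", "#635DFF", "#D0D2D3"),
  sequential = c("#FDE0D2", "#FCA98C", "#F4603E", "#C81D13", "#7F0A0A")
)

# A discrete fill scale: pick a named palette and hand it to scale_fill_manual()
scale_fill_auth0_discrete <- function(palette = "default", ...) {
  scale_fill_manual(values = auth0_palettes[[palette]], ...)
}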

Accessing the Data Warehouse

As mentioned earlier, we use Amazon Redshift as our data warehouse (DWH), and we use the open source R package redshiftTools to upload data into it. That is just one piece of the picture, though: we also need to read data, and there are some repetitive manipulations we perform.
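Uploading with redshiftTools looks roughly like this (a sketch: the data frame, table, and bucket names are placeholders; the S3 bucket is needed because redshiftTools stages the data there for Redshift's bulk COPY):

library(redshiftTools)

# con: an existing connection to Redshift, e.g. from dwh_connect()
# new_rows: a data frame whose columns match the target table (placeholder)
rs_upsert_table(
  new_rows,
  dbcon = con,
  table_name = "prod.dim_example",
  keys = c("example_key"),           # existing rows matching these keys are replaced
  bucket = "my-etl-staging-bucket"   # S3 bucket used to stage files for COPY
)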

On the reading side, we often want to connect with dbplyr and figure out which images are the most common as a Twitter thumbnail:

library(rauth0)
library(tidyverse)
library(dbplyr)

con = dwh_connect()

page_metadata = tbl(con, in_schema('prod', 'dim_page_metadata'))
twitter_image_count = filter(page_metadata, page_meta_twitter_image != '<Blank>', scd_end_date == '9999-12-31') %>%
  group_by(page_meta_twitter_image) %>%
  count() %>%
  arrange(desc(n))

knitr::kable(collect(twitter_image_count, n=10), format='markdown') # Top 10
| page_meta_twitter_image                                                       |     n |
|:------------------------------------------------------------------------------|------:|
| https://cdn.auth0.com/website/authenticate/meta-twitter.png                   | 10186 |
| https://cdn.auth0.com/blog/blog/home-twitter-card.png                         |  2465 |
| /modules/twittercard/1x-new.png                                               |   687 |
| https://cdn.auth0.com/blog/illustrations/auth0.png                            |   339 |
| https://cdn.auth0.com/website/assets/modules/twittercard/1x-new-383f8d95d1.png |    67 |
| https://cdn.auth0.com/website/assets/modules/opengraph/1x-new-69e09d5054.png  |    61 |
| https://auth0.com/learn/wp-content/uploads/2016/01/1x-new.png                 |    59 |
| https://cdn.auth0.com/blog/series-c/auth0-logo.png                            |    36 |
| https://cdn.auth0.com/blog/angular/logo3.png                                  |    23 |
| https://cdn.auth0.com/blog/jwtalgos/logo.png                                  |    12 |

The beauty of this is that with dbplyr we can also inspect which query the client is running to make this work:

show_query(twitter_image_count)
## <SQL>
## SELECT "page_meta_twitter_image", COUNT(*) AS "n"
## FROM prod.dim_page_metadata
## WHERE (("page_meta_twitter_image" != '<Blank>') AND ("scd_end_date" = '9999-12-31'))
## GROUP BY "page_meta_twitter_image"
## ORDER BY "n" DESC

In the same spirit of data manipulation, we have Redshift-compatible functions to run arbitrary queries and statements, create views, replace tables with data or with a SQL query, and more. I can't stress enough how painless it is to work with R and Amazon Redshift with these tools available, especially dbplyr.
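To give a flavor of what such helpers look like, here is a rough sketch of the kind of thin wrappers we mean, built directly on DBI. The function names and bodies are illustrative, not rauth0's exact API:

library(DBI)

# Illustrative wrappers over a DBI connection to Redshift (names hypothetical)
run_query <- function(con, sql) {
  dbGetQuery(con, sql)    # run a SELECT and return the result as a data frame
}

run_statement <- function(con, sql) {
  dbExecute(con, sql)     # run DDL/DML and return the number of affected rows
}

create_view <- function(con, name, sql) {
  # Redshift supports CREATE OR REPLACE VIEW, making this idempotent
  dbExecute(con, paste("CREATE OR REPLACE VIEW", name, "AS", sql))
}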

Kimball Methods

This library also includes functions to update SCD (slowly changing dimension) type 1 and type 2 dimensions. In Kimball terms, a type 1 update overwrites an attribute in place, while a type 2 update closes out the old row and inserts a new one with validity dates, preserving history. I invite you to look at the functions dwh_update_scd_type1, dwh_update_scd_type1_in_database, and dwh_update_scd_type2; they are quite complex, but worth the effort because of the problems they solve.

They might have some edge cases not considered in our current use cases and will evolve over time. This can be considered an alternative to buying a full-fledged ETL tool, without having to use a graphical interface to design the data flows, with its advantages and disadvantages.
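To make the type 2 mechanics concrete, here is a minimal sketch of the two statements such an update boils down to, using made-up table and column names and the same '9999-12-31' end-date convention seen earlier. The rauth0 functions generate considerably more careful SQL than this:

library(DBI)

# Made-up tables: prod.dim_customer (dimension), stg_customer (staging snapshot)

# 1. Close out the current row for every entity whose tracked attribute changed
dbExecute(con, "
  UPDATE prod.dim_customer
  SET scd_end_date = CURRENT_DATE
  FROM stg_customer s
  WHERE prod.dim_customer.customer_key = s.customer_key
    AND prod.dim_customer.scd_end_date = '9999-12-31'
    AND prod.dim_customer.plan <> s.plan
")

# 2. Insert a fresh current row for every entity without one
#    (covers both brand-new customers and the rows closed out in step 1)
dbExecute(con, "
  INSERT INTO prod.dim_customer (customer_key, plan, scd_start_date, scd_end_date)
  SELECT s.customer_key, s.plan, CURRENT_DATE, '9999-12-31'
  FROM stg_customer s
  LEFT JOIN prod.dim_customer d
    ON d.customer_key = s.customer_key AND d.scd_end_date = '9999-12-31'
  WHERE d.customer_key IS NULL
")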

A Real Clickstream Data Example

As an example, this is the real production code for building the sessions dimension for clickstream data. Another process generates the staging tables: stg_session_events has a row for every pageview/event, and stg_sessions has only one row per session.

library(rauth0)
library(tidyverse)
library(dbplyr)

con = dwh_connect()
all_events = tbl(con, 'events_metrics') %>% filter(received > '2012-01-01')

# Only use pageviews, tracked events and identify calls
filteredEvents = filter(
  all_events,
  type %in% c('track', 'page', 'identify'),
  sql("url_protocol(url)") %in% c('http', 'https')
)

dim_ref = tbl(con, 'stg_sessions')
map_ref = tbl(con, 'stg_session_events')
dim_page = tbl(con, in_schema('prod', 'dim_page'))
dim_sessions = tbl(con, in_schema('prod', 'dim_session'))

# Reserve more WLM query slots so the heavy queries below get more memory
# (dwh_set_execution_slots presumably maps to Redshift's wlm_query_slot_count)
dwh_set_execution_slots(con, 3)

onlyNew = select(map_ref, id=event_id, session_key) %>%
  anti_join(dim_sessions, by='session_key') %>%
  compute() # materialize as a temp table on Redshift so it isn't recomputed

basePageviews = onlyNew %>% # Only sessions not in dimension already, recalculating is costly
  inner_join(filteredEvents, by=c('id')) %>%
  inner_join(dim_page, by=c('url'='page_full_url')) %>%
  # Remove consecutive visits/tracks to the same url
  mutate(
    prev_page=sql('lag(page_clean_url) over (partition by session_key order by received, id)')
  ) %>%
  filter(prev_page!=page_clean_url)

actionNum = basePageviews %>%
  group_by(session_key) %>%
  tally()

actionSequence = basePageviews %>%
  # Get sequence numbers
  mutate(
    url=sql('left(coalesce(page_clean_url, url), 500)::varchar(500)'),
    seq=sql('row_number() over(partition by session_key order by received, id)'),
    revseq=sql('row_number() over(partition by session_key order by received desc, id)')
  ) %>%
  # Build string for first five and last 5 pageviews
  filter(seq<=5 | revseq <=5) %>%
  mutate(
    path_firstfive = ifelse(seq<=5, url, NA),
    path_lastfive = ifelse(revseq<=5 & seq > 5, url, NA)
  ) %>%
  group_by(session_key) %>%
  summarize(
    firstfive=sql("listagg(path_firstfive, ' -> ') within group (order by received)"),
    lastfive=sql("listagg(path_lastfive, ' -> ') within group (order by received)")
  ) %>%
  ungroup() %>%
  mutate(
    firstfive=sql('firstfive::varchar(2520)'),
    lastfive=sql('lastfive::varchar(2520)')
  ) %>%
  inner_join(actionNum, by='session_key') %>%
  mutate(
    action_sequence=case_when(
      n > 10 ~ concat(concat(firstfive, ' -> [...] -> '), lastfive),
      !is.na(lastfive) ~ concat(concat(firstfive, ' -> '), lastfive),
      TRUE ~ firstfive
    )
  ) %>%
  inner_join(dim_ref, by='session_key') %>%
  select(session_key, action_sequence, session_id) %>%
  compute()

dwh_upsert_table_from_temp_table(con, "prod.dim_session", remote_name(actionSequence), keys=c('session_key'))

Final Notes

We use R and the other technologies mentioned for our ETL processing, and it has been very fruitful and smooth for us. We hope some of the ideas in this article are useful for your data warehousing efforts.

You can access and download our internal library here: https://github.com/auth0/rauth0