TL;DR: Auth0 has replatformed our private cloud offering onto Kubernetes. This post details the architecture of the new platform and all the major infrastructure and networking components.
Auth0 is excited to relaunch our private cloud offering on our new platform. Recently, Auth0 has invested in retooling the private cloud platform to support more automation, adopt current technologies, and help scale the private cloud to thousands of environments. It’s been an enormous effort across many teams, and we’re excited to share an overview of the system in this post.
This post will serve as a technical primer for the new platform, and subsequent posts in the series will go in depth on specific features from a technical perspective. In future posts, we’ll cover the architecture in more depth, release orchestration, geo-failover, security principles, and the data pipeline of the new system.
You will hear directly from engineers who built the product and learn how Auth0 was able to deliver on a multi-cloud, Kubernetes-based platform with one-click provisioning for customers.
But first, a quick refresher on our private cloud deployments. Auth0 is a software-as-a-service (SaaS) solution that has two deployment models — a public and private cloud. The public cloud is a standard multi-tenant environment where resources are shared between customers and is available in the United States (US), Europe, Australia, and Japan. Private cloud deployments, on the other hand, are single-subscriber environments that provide customers with dedicated infrastructure and can be deployed in nearly any region across Amazon Web Services (AWS) and Microsoft Azure.
So why choose a private cloud? There are a number of reasons. Private cloud deployments offer cloud and region choice, which can be important for compliance and data residency requirements. Private instances also offer greater performance guarantees above what we can offer for public cloud deployments in terms of throughput on a per-customer basis. Of course, there are the added benefits of having isolated infrastructure solely for your use case. Finally, since private clouds are single subscriber, we can take advantage of private linking back to your Virtual Private Cloud (VPC), which keeps any extensibility-related functionality that originates from Auth0 off the public internet.
Ok! With that out of the way, we can start to introduce the components of the new platform.
Central, PoPs, and Spaces
Our new architecture is composed of three tiers: Central, Points of Presence (PoPs), and Spaces. The relationship between the tiers is one-to-many: a single global Central has many PoPs, and each PoP can have many Spaces, which is our platform’s term for individual customer environments. Before going much deeper, here are some simple definitions:
- Central: A globally unique component that contains the platform control plane and other centralized global infrastructure. The control plane manages the creation of customer environments and the release pipeline to keep them updated.
- PoPs: An intermediate point of infrastructure pinned to a specific cloud region (e.g., Azure EastUS or AWS us-east-1). The PoP hosts internal visibility tools and securely proxies messages from the Central down to the customer environments.
- Space: The customer environment itself, located ‘beneath’ a PoP in a particular cloud region. Spaces align directly with the features selected by the customer. Currently, we support three different packages, with geo-failover and payment card industry (PCI) compliance as add-ons.
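To make the one-to-many relationship concrete, here is a minimal Python sketch of the three tiers. The class and field names (`Central`, `PoP`, `Space`, `spaces_in`) and the customer names are illustrative assumptions, not the platform’s actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Space:
    """A single customer environment (hypothetical fields)."""
    name: str


@dataclass
class PoP:
    """Regional tier pinned to one cloud provider and region."""
    cloud: str
    region: str
    spaces: list[Space] = field(default_factory=list)


@dataclass
class Central:
    """The single global tier that owns the control plane."""
    pops: list[PoP] = field(default_factory=list)

    def spaces_in(self, cloud: str, region: str) -> list[str]:
        """Names of all Spaces beneath the PoP for a given cloud region."""
        return [
            space.name
            for pop in self.pops
            if pop.cloud == cloud and pop.region == region
            for space in pop.spaces
        ]


# One Central, many PoPs, many Spaces per PoP (made-up example data).
central = Central(pops=[
    PoP("aws", "us-east-1", [Space("customer-a"), Space("customer-b")]),
    PoP("azure", "eastus", [Space("customer-c")]),
])
```

Calling `central.spaces_in("aws", "us-east-1")` walks down from the Central through the matching PoP to its Spaces, which mirrors how directives flow through the tiers.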
In addition to the items laid out above, the new platform also has data pipeline components within each tier of the infrastructure. The data pipeline is responsible for the treatment of sensitive data between the tiers. In a future post, we’ll cover the data pipeline in depth and describe how it removes sensitive data as needed.
We will now go into a bit more detail on each tier of the architecture.
The Central and Control Plane
At the highest level, the Central comprises the control plane for the new platform and globally centralized services. To an end user, the control plane is exposed as a set of application programming interfaces (APIs) reachable only on a trusted network behind a Virtual Private Network (VPN) and secured further by identity verification (a Zero Trust approach). The control plane APIs allow for the creation of the other architecture components, PoPs and Spaces (more on these later). Engineering operators use a Command Line Interface (CLI) to interact with these APIs, while Auth0 customer-facing teams use an internal user interface (UI) to provision customer Spaces.
In line with our overall approach to the high availability of the platform, the Central is geo-redundant and can fail over to a secondary region if required. Within both the primary and secondary regions, all Central components are deployed across three availability zones.
It is worth noting that even in a catastrophic failure of both Central regions, the individual Spaces can still function, meaning the Auth0 service would still be available. This is because the Auth0 service does not depend on connectivity to the control plane. In such a case, these ‘floating’ Spaces couldn’t receive directives from the control plane, and their configuration would be ‘frozen’ until connectivity was re-established, but authentication and authorization would continue to work for end users without issue. The control plane is therefore not a single point of failure for service delivery, which adds another layer of redundancy on top of the Central being geo-redundant.
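The ‘frozen configuration’ behavior can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `SpaceAgent`, its methods, and the `fetch_directive` callable are hypothetical names, and this post does not describe how directives are actually delivered, so the callable is only a stand-in for that channel.

```python
from typing import Callable, Optional


class SpaceAgent:
    """Hypothetical agent inside a Space that applies control-plane directives."""

    def __init__(self, initial_config: dict):
        self.config = dict(initial_config)  # last known-good configuration
        self.connected = True

    def sync(self, fetch_directive: Callable[[], Optional[dict]]) -> None:
        """Apply the next directive, or keep the current config on failure."""
        try:
            directive = fetch_directive()
        except ConnectionError:
            self.connected = False  # configuration is now 'frozen'
            return
        self.connected = True
        if directive:
            self.config.update(directive)

    def authenticate(self, user: str) -> bool:
        """Serving logins never consults the control plane (placeholder check)."""
        return bool(user)
```

The design point is that `authenticate` reads only local state, so a failed `sync` changes nothing about how end users are served.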
While we are able to deploy Spaces in either Azure or AWS, we have one global Central, hosted in the European Union (EU) in AWS. It is critical to note that communication from the Central to the Space, which again is the customer environment, is one way. That is to say, a Space will receive directives from the Central (e.g., ‘scale to a larger size’ or ‘accept the newest release manifest’) but will never send data to the Central. The Central is aware of the type of Space that is deployed (which corresponds to different service offerings) and which release the Space is running. All traffic within the internal network is encrypted using certificates issued by an internal Certificate Authority (CA) and, as you’ll see later on, stays off the public internet entirely.
We do have a globally centralized data warehouse as well, but it is hosted by a third-party SaaS provider and receives data only after personally identifiable information (PII) has been removed as part of the data pipeline process. This unlocks internal reporting and analytics features for customers. As mentioned previously, an upcoming post will detail how data moves through the new platform.
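As a rough illustration of the idea of removing PII before data leaves a tier, here is a minimal Python sketch. The field names in `PII_FIELDS` and the `scrub` function are assumptions made up for this example; the real pipeline is not described in this post and will be covered separately.

```python
# Assumed PII field names, for illustration only.
PII_FIELDS = {"email", "name", "ip_address"}


def scrub(event: dict) -> dict:
    """Return a copy of the event with the assumed PII fields removed."""
    return {key: value for key, value in event.items() if key not in PII_FIELDS}


raw_event = {"event": "login", "tenant": "t1", "email": "user@example.com"}
warehouse_event = scrub(raw_event)  # only non-PII fields move upstream
```

A production pipeline would need far more than a field denylist (nested payloads, free-text fields, and so on); the sketch only shows where the scrubbing step sits relative to the warehouse.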
The Central also hosts the release orchestrator for the new platform. At a high level, all spaces are running one of three different versions of the Auth0 service. The three versions correspond to release channels for development, staging, and production. Each Space gets a new version once per week at a day and time window selected by the customer (for some service offerings). Each week the releases all move forward in unison, and this allows a release to cycle through both development and staging environments before being promoted to production. Additionally, we are already using the container images used in the public cloud today, which further battle tests the releases as those images get heavy use before hitting the private cloud.
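The release train described above can be modeled in a short Python sketch. It assumes each channel trails the previous one by exactly one release, which matches the weekly promotion described here but is an assumption about the exact lag; `CHANNELS`, `channel_versions`, and `advance` are illustrative names.

```python
CHANNELS = ("development", "staging", "production")


def channel_versions(newest: int) -> dict[str, int]:
    """Release number on each channel, assuming a one-release lag per channel."""
    return {channel: newest - lag for lag, channel in enumerate(CHANNELS)}


def advance(versions: dict[str, int]) -> dict[str, int]:
    """Move every channel forward by one release, in unison."""
    return {channel: version + 1 for channel, version in versions.items()}


this_week = channel_versions(newest=12)
next_week = advance(this_week)
```

Under this model, whatever runs in production next week is exactly what ran in staging this week, so every production release has already spent a week in each pre-production channel.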
All releases enforce backward compatibility by one version, meaning we can roll back in an emergency. Naturally, this also means we must take a multi-week approach to rolling out database changes, but we feel the tradeoff is worth it. Traditionally, database changes are done in phases, typically with a dual-write period. The biggest tradeoff here is that the cleanup (i.e., dropping the column no longer in use) will take a little extra time. The image below gives you an idea of what this looks like across development, staging, and production environments in the new platform. You can observe how each environment advances its version each week and how pre-production environments run a newer version of the Auth0 software at all times.
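To see why cleanup takes extra time when rollback by one version must stay possible, consider this Python sketch of a column rename done in phases with a dual-write period. The `Release` type, the column names, and the specific phase plan are hypothetical; they illustrate the general phased pattern rather than Auth0’s actual migration tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: int
    reads: frozenset[str]   # columns this release reads
    writes: frozenset[str]  # columns this release writes


def safe_to_drop(column: str, current: Release, previous: Release) -> bool:
    """A column may be dropped only if neither the running release nor its
    rollback target (the previous release) still touches it."""
    return all(column not in (r.reads | r.writes) for r in (current, previous))


old, new = "old_col", "new_col"
r2 = Release(2, frozenset({old}), frozenset({old, new}))  # dual-write begins
r3 = Release(3, frozenset({new}), frozenset({old, new}))  # reads switch over
r4 = Release(4, frozenset({new}), frozenset({new}))       # stops writing old
r5 = Release(5, frozenset({new}), frozenset({new}))
```

Here `safe_to_drop(old, r4, r3)` is false because a rollback to release 3 would still write the old column, so the drop has to wait one more release, until `safe_to_drop(old, r5, r4)` holds.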
The weekly cadence was chosen as a bridge between what customers on the previous private cloud platform expected and what our product teams felt aligned with modern release principles.
Releases are a complex and interesting topic, and we will follow up with an in-depth post on the specifics of releases in the new platform, straight from the tech lead on the project.
In its simplest form the Central looks like:
A PoP, or point of presence, serves as an intermediate tier of infrastructure between the Central and the Space. Like a Space, a PoP is tied to a combination of cloud provider and region (e.g., AWS us-east-1 or Azure EastUS). PoPs communicate with the Central via Internet Protocol Security (IPsec) tunnels and, in the case of Azure PoPs, across clouds as well!
PoPs serve two primary functions. The first responsibility of the PoP is hosting observability tools. We run an OpenSearch cluster in each PoP for troubleshooting logs. Additionally, we upstream data to the data warehouse from the PoP, which unlocks customer analytics features and internal tools. As data moves between the PoP and other consumers of data, our data pipeline ensures we are treating sensitive data appropriately.
Second, PoPs host some shared services that can be used by all Spaces that fall under that PoP. As an example, we host a database in each PoP that stores signatures of bad actors (as part of our Bot Detection feature). This way, a Space can get a quick response in-line with an authentication event, with minimal latency, and immediately block a request if required. This approach requires less overhead and cost than provisioning this database in each customer Space.
You can visualize a PoP like this:
Finally, we come to the customer-facing component of the platform: the Space. The Space comprises all the components required to make the Auth0 service work in a single customer configuration. This includes a Kubernetes cluster, a variety of data stores, Apache Kafka, and an edge provider configuration. Spaces are currently available in three configurations, with two additional add-on features. All Spaces in a single cloud region connect via a private link up to the PoP for that region, which then connects via a secure tunnel to the Central.
We size Spaces based on the throughput they can handle and sell them as Basic, Performance, and Performance Plus tiers. Additionally, customers may add both a geo-failover add-on and a PCI compliance add-on if desired. In a geo-failover configuration, customers can select a secondary region (subject to any cloud provider availability), and Auth0 can fail over to the secondary region with a single CLI command. We’re extremely proud of how the new platform supports geo-failover, and readers can expect an in-depth blog post on the topic written by engineering. The feature was built with stringent Recovery Point Objective (RPO) capabilities in mind and is routinely load tested before, during, and after failover.
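The tiers and add-ons above map naturally onto a small provisioning request. The following Python sketch is illustrative only: `SpaceRequest`, its fields, and the validation rules (including the rule that the failover region must differ from the primary) are assumptions, not the control plane’s real API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

SUPPORTED_CLOUDS = ("aws", "azure")


class Tier(Enum):
    BASIC = "basic"
    PERFORMANCE = "performance"
    PERFORMANCE_PLUS = "performance_plus"


@dataclass
class SpaceRequest:
    """Hypothetical request to provision one customer Space."""
    cloud: str
    region: str
    tier: Tier
    pci: bool = False                      # PCI compliance add-on
    failover_region: Optional[str] = None  # set when geo-failover is purchased

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is valid."""
        errors = []
        if self.cloud not in SUPPORTED_CLOUDS:
            errors.append(f"unsupported cloud: {self.cloud}")
        if self.failover_region is not None and self.failover_region == self.region:
            errors.append("failover region must differ from the primary region")
        return errors


request = SpaceRequest("aws", "us-east-1", Tier.PERFORMANCE,
                       failover_region="us-west-2")
```

Modeling geo-failover as an optional secondary region keeps the add-on and its one required parameter in a single field, so a request cannot ask for failover without naming where to fail over to.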
As outlined as part of the ‘Central’ topic, Spaces are running one of three different versions of the Auth0 service and receive a new update every week.
One of the most exciting features of the new platform is the reduced onboarding time for customers. We can spin up environments now in a matter of hours, and the entire provisioning process is completely automated. Our customer-facing teams use an internal tool called the ‘Hub’, which is a UI for platform API interaction. To give you an idea of how simple we’ve made this, see the screenshot below:
If anything captures the power of this platform, it’s the picture above. A few clicks power a multi-cloud platform across any geography our customers desire. We’re incredibly proud of the work we’ve accomplished and are excited to give customers the best experience we can.
Putting It All Together
Now that we all have a common understanding of the components of the new platform, we can examine the diagram below.
As you can see, we have a Central component connected via IPsec to PoPs in both cloud providers across many different regions. Within those PoPs, multiple customer environments exist. The ‘public’ identifier in all Spaces signifies the endpoints that users will use for authentication. These are separate from the private network that connects Spaces to PoPs and PoPs to the Central.
We hope you enjoyed this post about the new Auth0 Private Cloud platform. As hinted in this post, you can expect a number of follow-up posts from engineers themselves on specific components of this platform. This post is meant to introduce the major components and give a general network overview of the entire platform. You can always refer back to this post as useful background information for future topics that dive deeper into specific components of the platform.
If you’re a prospect or customer and you’d like to learn more about Auth0’s private cloud deployments, reach out to our team here. Also, if you thought this post was interesting and want to join the team — we are hiring!