Over the last few months, there has been an increase in service degradation and outages that has shaken our customers’ confidence in Auth0. Our number one mission is to provide you with the highest level of service and reliability at all times, for the full spectrum of customers from free users to enterprise. The production environments in US-1 and EU have been impacted most frequently and, in extreme cases, these incidents have caused downtime for some of our customers. As the Auth0 CPO, I sincerely apologize for the pain this has caused you and your customers. We take ownership of these failures and recognize how disappointed you are.
We know that identity plays a critical role in your company, and it is our responsibility to make our service reliable. It is also our responsibility to provide guidance to our customers on appropriate architecture patterns. When we experience failures or degradation of our service, we correct them as swiftly as possible. Today, I want to share the active measures we are taking to prevent similar issues from occurring in the future.
First, What Caused the Outages?
Before we explore all of the actions we have taken, it’s good to understand the patterns we have observed in these environments:
Noisy neighbors in our large multi-tenant environments create throughput bottlenecks
The noisy neighbor phenomenon is a well-known issue in multi-tenant environments. In our case, one co-tenant may experience a spike in traffic which can lead them to monopolize resources available and thus reduce the throughput (requests-per-second) on the environment for the other co-tenants.
Most of our customers in the US-1 and EU environments have enjoyed many years of uninterrupted service. These environments have grown significantly and are now the longest-running and largest, which exacerbates the noisy neighbor impact.
We have not strictly enforced restrictions on the frequency of customer API calls (rate limits) and have at times allowed a 10X increase in requests-per-second (RPS) for certain tenants in a single environment. Typically, our architecture allows us to absorb these spikes, as is the case in most of our environments, but given the volume and size of the customers in US-1 and EU, these concurrent requests restrict throughput for other tenants. We do not fault our customers for their peak traffic needs; it is on us to put the right protections and guardrails in place around each tenant. Unfortunately, we are not able to actively load shed today (i.e., automatically move tenants to other environments such as US-3), because we don’t currently have the migration tooling, nor can we do it without downtime. There are some customer scenarios where a planned migration by a customer is possible, but it requires time and effort on your part. I will cover migration scenarios and timing in the ‘Looking Beyond the Next 60 Days’ section below.
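To make the guardrail concrete: per-tenant rate limiting is commonly implemented as a token bucket, which grants each tenant a sustained requests-per-second rate plus a bounded burst. The sketch below is purely illustrative (it is not Auth0’s implementation); the class name, rates, and tenant IDs are assumptions for the example.

```python
import time


class TokenBucket:
    """Illustrative per-tenant token bucket (not Auth0's actual implementation).

    Each tenant gets `rate` requests-per-second of sustained throughput
    and may burst up to `capacity` requests at once.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller would typically return HTTP 429


# One bucket per tenant keeps a traffic spike from one co-tenant
# from monopolizing the shared environment's throughput.
buckets = {"tenant-a": TokenBucket(rate=100, capacity=200)}
```

Because each tenant draws from its own bucket, a spike from one co-tenant is rejected at that tenant’s boundary instead of degrading throughput for everyone else in the environment.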
Region Mean Time To Recovery (MTTR) outside of our 15-min target
The other item we are solving is meeting our 15-minute target for region failover. The delay is caused by dependencies on an underlying managed service that doesn’t support automated failover within our target timeframe. Our teams are in the final stages of switching to a managed service that allows us to automatically fail over within 15 minutes, with completion expected in Q1 2022.
We can successfully fail over Availability Zones (AZs) today within our 1-minute target window, as validated by quarterly testing.
What Actions Have We Taken Already?
Slowing down and invoking our change freeze earlier
Effective Sept. 30, 2021, we’ve significantly reduced the number and frequency of changes that we’re introducing into production for the remainder of the year. This allows for increased testing and soak time in pre-production environments. Heading into the holiday season, our protocols may change slightly; however, we anticipate maintaining a conservative posture to preserve the stability of our production environments. We are mostly allowing changes related to our resiliency initiatives or security patches. As an authentication platform, we treat security-related changes as non-negotiable, as is our mission to protect you and your customers.
New staging environment
As of early October, we have a staging environment that has the same load as US-1, so that we can simulate test cases reproducing production traffic at large scale. The volume and high load-related issues described above can now be more easily detected.
Additional resources and re-prioritization of our roadmap
We have added team members and reprioritized resilience projects in order to complete these improvements in time for the holiday season. I am personally doing a daily stand-up with our Engineering leadership team to ensure there is an elevated review process and all blockers are actively being addressed. In addition, all new product and feature initiatives have been deferred until the new year.
Resilience and holiday readiness efforts
After our outage on April 20th, a dedicated task force of 40 engineers completed a 10-week effort to reduce the load on our environments, which resulted in:
- Reduced cache load by removing unused feature flags and adding in-memory caching of multiple per-environment feature flags.
- Removed two collections from our database, reducing total database writes by 10%.
- Reduced reads to the most-read collection in the database by 12%.
- Added limits to refresh token queries from the Management API.
- Added missing indexes for new collections in the database.
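The first item above, in-memory caching of per-environment feature flags, can be sketched as a small TTL cache in front of the backing store. This is a minimal illustration under stated assumptions: `fetch_flag`, the class name, and the TTL are hypothetical stand-ins, not Auth0 internals.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FlagCache:
    """Minimal in-memory TTL cache for per-environment feature flags.

    Illustrative sketch only; `fetch_flag` stands in for the backing
    store read (e.g. a database query) that the cache absorbs.
    """

    def __init__(self, fetch_flag: Callable[[str, str], Any], ttl: float = 30.0):
        self.fetch_flag = fetch_flag
        self.ttl = ttl
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, environment: str, flag: str) -> Any:
        key = (environment, flag)
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                  # cache hit: no backend read
        value = self.fetch_flag(environment, flag)
        self._store[key] = (now, value)      # refresh on miss or expiry
        return value
```

Every hit within the TTL is served from process memory, so repeated flag evaluations on the hot authentication path stop translating into repeated database reads.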
Some of the improvements identified were multi-quarter projects that are still underway and are expected to be completed this November. These projects will:
- Reduce the read/write workload by 10%.
- Reduce inserts/deletes from our database by 40%.
- Reduce inserts/deletes from our database in the US-1 environment by 30%. This requires deprecating opaque access tokens and will take six months to complete.
In October, we started the following remediation projects to further reduce the probability of these failures:
- Refactoring our connections schema to remove high cardinality issues.
- Addressing noisy neighbor issues in our extensibility stack (also known as Webtask).
- Implementing an additional caching layer for read requests.
- Proactively and significantly over-provisioning our stack in an effort to get ready for holiday traffic.
- Enforcing global active rate management per tenant at the edge.
The projects listed above are also scheduled to be live and in production by November 10. We know that all of the initiatives above will reduce the load on our environments and should significantly reduce the noisy neighbor issue I referenced above.
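The last remediation item, global active rate management per tenant at the edge, differs from a purely local limiter in that the counters must be shared across edge nodes. A common approach is a fixed-window counter keyed by tenant and window; the sketch below uses a local dict where production would use a replicated store. Names and limits here are assumptions for illustration, not Auth0’s design.

```python
import time
from collections import defaultdict
from typing import DefaultDict, Tuple


class GlobalRateLimiter:
    """Sketch of global per-tenant rate management at the edge.

    Fixed-window counters keyed by (tenant, window). In production the
    counter store would be shared across edge nodes (e.g. a replicated
    key-value store); a local defaultdict stands in here.
    """

    def __init__(self, limit_per_window: int, window_seconds: int = 1):
        self.limit = limit_per_window
        self.window = window_seconds
        self.counters: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, tenant: str) -> bool:
        window_id = int(time.time() // self.window)
        key = (tenant, window_id)
        if self.counters[key] >= self.limit:
            return False             # tenant exceeded its share: reject (HTTP 429)
        self.counters[key] += 1
        return True
```

Because the count is tracked per tenant rather than per environment, one tenant hitting its ceiling has no effect on its co-tenants, which is exactly the noisy-neighbor containment described above.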
We are also initiating code change freeze windows over the holidays:
🟧 Orange Change Freeze 10/12 - 11/11
🟥 Red Change Freeze 11/12 - 11/29
🟧 Orange Change Freeze 11/30 - 12/16
🟥 Red Change Freeze 12/17 - 01/03
🟧 Orange Change Freeze: During this window, all production deployments are halted, and only work related to stability for Mission Critical Services, security patches or urgent releases on other components may be deployed. Changes must meet testing and approval criteria.
🟥 Red Change Freeze: During this window, all production deployments are halted. Exceptions allowed only to resolve critical customer issues or security patches. Changes must meet testing and approval criteria.
Looking Beyond the Next 60 Days
New environments and migration capabilities
We are now actively working on creating new environments in our US and EU regions. Our target is to progressively introduce 4 new environments in the US and up to 3 new environments in the EU in CY Q1 2022. The new environments allow us to dynamically allocate load as part of our demand response strategies and significantly reduce the blast radius of disruptions and outages. Our team is also actively scoping and designing an automated tenant migration process. We expect to have an automated tenant migration tool that will enable us to work with you and migrate you over to one of our newer environments beginning late CY Q1 2022.
New platform rollout across clouds and regions
We began working on a modern platform last year that is poised to roll out across the regions and clouds in which we operate. This month we began the gradual rollout of this new platform and made it available on Microsoft's Azure cloud. This new platform architecture has been purposefully designed from the ground up to deliver a scalable, reliable, and resilient service that meets the current and future needs of our customers. We are targeting availability on AWS starting in CY 1H 2022.
Finally, we truly appreciate your trust in Auth0, and we will do everything we can to keep that trust. I will post another public update in early CY Q1 2022 to share the results of our work, including the progress of the improvement initiatives listed above.
Please note: any unreleased products, features, or functionality referenced in this disclosure may not be currently available and may not be delivered on time or at all. Product roadmaps do not represent a commitment, obligation, or promise to deliver any product, feature, or functionality, and you should not rely on them to make your purchase decisions.