Private SaaS (PSaaS) Appliance: High Availability Geo Cluster (Geo HA)
The high availability geo cluster is a PSaaS Appliance implementation that provides data center redundancy and automatic failure handling. This is the highest form of PSaaS Appliance availability offered by Auth0.
Auth0 builds on the single data center high availability solution by extending the cluster to a second, geographically distributed data center. The recommended maximum round-trip latency between the data centers is 100 ms. The result is the high availability geo cluster, an active hot standby configuration with automated failure handling that can survive a regional outage.
The standard configuration is a stretched cluster that consists of the following pieces:
- One global load balancer/DNS failover configuration;
- One primary data center with three PSaaS Appliance instances;
- One standby data center with three PSaaS Appliance instances;
- One arbiter, a seventh instance that is located in its own data center.
The standby data center instances have the same PSaaS Appliance configuration as the primary data center instances, and continuous synchronization ensures that the data in the primary and standby data centers mirror each other. The Geo HA stretched cluster should be hosted with a single provider (all nodes in AWS, all nodes in Azure, or all nodes on-premises).
The arbiter node acts as an independent witness to the primary and standby data centers. It does not store data or execute application logic. By independently verifying whether a data center is down, it prevents both data centers from becoming active at the same time (a scenario known as the "split-brain" condition).
Since the arbiter stores no data and runs no services, it can be a small instance with two cores and 4 GB of memory.
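The arithmetic behind the seventh instance can be illustrated with a minimal sketch. It assumes that every instance, including the arbiter, carries exactly one vote and that a strict majority of votes is needed to elect a primary. The function name `has_majority` is hypothetical and is not part of the PSaaS Appliance.

```python
def has_majority(reachable_votes: int, total_votes: int) -> bool:
    """Return True when the reachable members form a strict majority."""
    return 2 * reachable_votes > total_votes


# Assumption: one vote per instance, strict majority required.

# Without an arbiter there are 3 + 3 voting instances. After a data center
# outage the survivors hold exactly half of the votes, so no side can win.
without_arbiter = has_majority(reachable_votes=3, total_votes=6)

# With the arbiter as a seventh vote, the surviving data center plus the
# arbiter hold 4 of 7 votes, while an isolated data center holds only 3.
survivors_with_arbiter = has_majority(reachable_votes=4, total_votes=7)
isolated_side = has_majority(reachable_votes=3, total_votes=7)

print(without_arbiter, survivors_with_arbiter, isolated_side)
# prints: False True False
```

Under these assumptions, only the side that can still reach the arbiter is able to form a majority, which is why two isolated data centers cannot both become active.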
Global Load Balancer/DNS Failover Configuration
You will need to deploy a global load balancer that supports an active/standby configuration. This will be configured to begin using the secondary site if the primary site load balancer is unavailable.
Two examples of products that support this configuration are the F5 Global Traffic Manager and the AWS Route 53 DNS service. The global load balancer is typically positioned in front of the local load balancers in each data center.
Auth0 requires the use of a load balancer or DNS failover solution that prefers to serve application requests using the primary data center, despite the fact that the PSaaS Appliance instances in the hot standby data center are active and able to serve the requests.
The application tier remains unaware of the locality of the primary data node, whereas the data layer resolves the location of the primary node (the only node that receives application queries). The primary node serves all read and write activities. For example, if the data centers for the geo cluster are in State A and State B and the primary node is in State B, any request serviced by the State A nodes at the application level requires data to be read from or written to a node in State B. This generally results in poorer performance because of the additional cross-site requests (and resulting round-trips) required to obtain the necessary data. Using a global load balancer or DNS failover solution that prefers the primary location, unless it is unhealthy, mitigates this performance issue.
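The routing preference described above can be expressed as a small decision function. This is only an illustrative sketch: the names `choose_site`, `PRIMARY`, and `STANDBY` are hypothetical, and a real global load balancer or DNS failover service implements this through its own health check and failover policy settings rather than application code.

```python
from typing import Optional

PRIMARY = "primary"   # hypothetical site labels
STANDBY = "standby"


def choose_site(primary_healthy: bool, standby_healthy: bool) -> Optional[str]:
    """Pick the site that should receive application requests.

    The primary site is always preferred while it passes health checks,
    even though the standby instances are active and able to serve.
    """
    if primary_healthy:
        return PRIMARY
    if standby_healthy:
        return STANDBY
    return None  # neither site is passing health checks


print(choose_site(primary_healthy=True, standby_healthy=True))    # primary
print(choose_site(primary_healthy=False, standby_healthy=True))   # standby
print(choose_site(primary_healthy=False, standby_healthy=False))  # None
```

Note that the standby site is chosen only when the primary fails its health checks. Traffic is never split between the two sites, which avoids the cross-site round-trips described above.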
The data tier operates independently of the application tier as a single cluster stretched across two geographically distributed data centers. Each data center contains three nodes, which provide local data redundancy and failover.
Only one of the two data centers is designated as primary, and the instances within that data center are weighted so that they are always preferred for incoming requests. All read and write activities then pass through the single, active primary node.
As long as all instances are visible to each other, a node in the primary data center will always be elected. If the primary data center fails (that is, its instances are not visible to the arbiter and the PSaaS Appliance instances in the standby data center), then a node in the standby data center will be elected to become primary.
If the primary data center becomes available again and its instances are visible to the arbiter and the standby data center instances, the primary data center will again take precedence over the standby data center when handling incoming requests.
When the primary data center fails, failover proceeds in the following steps:
1. The primary data center fails. The nodes in the standby data center and the arbiter can no longer communicate with the nodes in the primary data center.
2. The standby data center becomes the primary data center. Because the standby data center instances and the arbiter form a majority, they elect one of the standby data center instances to become the primary.
3. The PSaaS Appliance instances associated with the primary data center begin failing, because they cannot communicate with the instances in the standby data center.
4. The global load balancer/DNS failover configuration detects via health checking that the nodes in the primary data center aren't serving. It then switches over to sending requests to the instances of the standby data center.
5. The PSaaS Appliance instances associated with the standby data center are now serving requests, and the standby data center hosts the primary data node following the election in step 2.
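The election and failover behavior described above can be sketched as a small model. It is not the PSaaS Appliance's actual election algorithm: it assumes one vote per instance, a strict-majority rule, and hypothetical numeric priorities that stand in for the weighting of the primary data center. All names (`Node`, `CLUSTER`, `elect_primary`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    site: str      # "primary", "standby", or "arbiter"
    priority: int  # hypothetical weight; 0 means never eligible as primary


# Assumed topology: three instances per data center plus one arbiter.
CLUSTER = [
    Node("p1", "primary", 2), Node("p2", "primary", 2), Node("p3", "primary", 2),
    Node("s1", "standby", 1), Node("s2", "standby", 1), Node("s3", "standby", 1),
    Node("arb", "arbiter", 0),
]


def elect_primary(visible: List[Node], total: int = len(CLUSTER)) -> Optional[Node]:
    """Elect a primary among nodes that can all see each other."""
    if 2 * len(visible) <= total:
        return None  # no strict majority: this group cannot elect a primary
    candidates = [n for n in visible if n.priority > 0]  # arbiter holds no data
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.priority)


primary_side = [n for n in CLUSTER if n.site == "primary"]
other_side = [n for n in CLUSTER if n.site != "primary"]

healthy = elect_primary(CLUSTER)        # all seven instances visible
isolated = elect_primary(primary_side)  # primary data center cut off (3 of 7)
failover = elect_primary(other_side)    # standby + arbiter (4 of 7)

print(healthy.site, isolated, failover.site)
# prints: primary None standby
```

In this model, the isolated primary data center cannot keep a primary node because it holds only three of seven votes, which corresponds to its instances failing in step 3. Once all instances are visible again, the higher priority returns the primary role to the primary data center.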