Triggering failovers

Temporal Cloud automatically initiates failovers when an incident or outage affects a replicated Namespace. Standby Namespace replicas duplicate data and prevent data loss during failover.

You can also trigger a failover yourself, either in response to your own custom alerts or for testing purposes. This section explains how to trigger a failover manually and what to expect afterward.

Check Your Replication Lag

Always check the replication lag metric before initiating a failover. A forced failover during a period of large replication lag is more likely to roll back Workflow progress.
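The check above can be expressed as a small pre-failover gate. The following is a minimal sketch, not an official tool: the function name, the 30-second threshold, and the idea of passing in recent lag samples are all assumptions for illustration. How you fetch the replication lag metric depends on your own metrics setup and is not shown.

```python
# Hypothetical threshold; choose a value that matches your own tolerance
# for rolled-back Workflow progress.
MAX_ACCEPTABLE_LAG_SECONDS = 30.0


def safe_to_fail_over(lag_samples_seconds, max_lag_seconds=MAX_ACCEPTABLE_LAG_SECONDS):
    """Return True only if every recent lag sample is at or below the threshold.

    lag_samples_seconds: recent replication lag readings, in seconds, fetched
    elsewhere from whatever metrics system you use (an assumption here).
    """
    if not lag_samples_seconds:
        # No data means the risk cannot be judged, so refuse by default.
        return False
    return max(lag_samples_seconds) <= max_lag_seconds
```

Using the worst recent sample rather than an average is a deliberately conservative choice: a single spike in lag is enough to put Workflow progress at risk during a forced failover.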

Performing manual failovers

You can trigger a failover manually using the Temporal Cloud Web UI or the tcld CLI, depending on your preference and setup. The steps for each method follow.

Temporal Cloud Web UI

1. Visit the Namespace page on the Temporal Cloud Web UI.
2. Navigate to your Namespace details page and select the Trigger a failover option from the menu.
3. After confirmation, Temporal initiates the failover.

Temporal tcld CLI

To manually trigger a failover, run the following command in your terminal:

    tcld namespace failover \
        --namespace <namespace_id>.<account_id> \
        --region <target_region>

Temporal fails over the Namespace to the target region. High availability Namespaces that use a single region fail over to the standby isolation domain.
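If you trigger failovers from your own alerting or test scripts, it can help to assemble the tcld command programmatically and review it before running it. The sketch below only builds the argument list for the command shown above. The function name and the example Namespace, account, and region values are placeholders, and the accepted region format should be confirmed against your own tcld version.

```python
import shlex


def build_failover_command(namespace_id, account_id, target_region):
    """Build the argv list for the tcld failover command shown in this section."""
    for name, value in (("namespace_id", namespace_id),
                        ("account_id", account_id),
                        ("target_region", target_region)):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{name} must be a non-empty string without whitespace")
    return [
        "tcld", "namespace", "failover",
        "--namespace", f"{namespace_id}.{account_id}",
        "--region", target_region,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own Namespace, account, and region.
    argv = build_failover_command("my-namespace", "a1b2c", "us-west-2")
    # Print the command for review instead of executing it.
    print(shlex.join(argv))
```

Returning a list rather than a single string lets you pass the result directly to a process runner without shell quoting issues.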

Disabling Temporal-initiated failovers

When you add a replica to a Namespace, Temporal Cloud automatically fails over the Namespace to its replica in the event of an incident or outage. This is the default and recommended option.

However, if you prefer to disable Temporal-initiated failovers and handle failovers yourself, navigate to the Namespace detail page in Temporal Cloud and choose the "Disable Temporal-initiated failovers" option.

Post-failover event information

After any failover, whether triggered by you or by Temporal, event information appears both in the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The audit log entry for a failover has "operation": "FailoverNamespace". After failover, the replica takes over as the active Namespace in the new isolation domain or region.
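To pick failover events out of exported audit logs, you can filter on the operation value mentioned above. This is a minimal sketch that assumes one JSON object per line, which is a hypothetical export format; only the "operation": "FailoverNamespace" field comes from this page, and any other fields in your entries are passed through untouched.

```python
import json


def failover_events(audit_log_lines):
    """Yield parsed audit log entries whose operation is FailoverNamespace.

    audit_log_lines: an iterable of strings, each assumed to hold one JSON
    object (a hypothetical format; adjust to how you receive audit logs).
    """
    for line in audit_log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("operation") == "FailoverNamespace":
            yield entry
```

For example, `list(failover_events(open("audit.log")))` would collect the matching entries from a local file named audit.log, assuming such a file exists in your environment.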

You don't need to monitor Temporal Cloud's failover response in real time. Whenever a failover event occurs, users with the Account Owner and Global Admin roles automatically receive an alert email.

Failbacks

After a Temporal-initiated failover, and once the incident is resolved, Temporal Cloud shifts Workflow Execution processing back to the region or isolation domain that was active before the incident. This is called a "failback".

Failover testing

Microservices and external dependencies will fail at some point. Testing failovers ensures your app can handle these failures effectively. Temporal recommends regular and periodic failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail.

Dive deeper: Why test?

Regular failover testing ensures your app can handle disruptions and continue running smoothly in production.

Safety First

If this is your first time performing a failover test, run it with a test-specific namespace and application. This helps you gain operational experience before applying it to your production environment. Practice runs help ensure the process runs smoothly during real incidents in production.

Failover testing can:

  • Validate replicated deployments: In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences an outage. In single-region setups, failover testing works with isolation domains instead. Either way, this maintains high availability in mission-critical deployments. Manual testing confirms that the failover mechanism works as expected, so your system handles incidents effectively.

  • Assess replication lag: In multi-region deployments, monitoring replication lag between regions is crucial. Check the lag before initiating a failover to avoid rolling back Workflow progress. This is less important when using isolation domains, where failover is usually instantaneous. Manual testing helps you practice this critical step and understand its impact. When there's no real incident, the switchover (recovery) should happen almost instantly, both across regions and within a single region.

  • Assess recovery time: Manual testing helps you measure actual recovery time. You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the High availability Namespace SLA.

  • Identify potential issues: Failover testing uncovers problems that are not visible during normal operation, such as backlogs, capacity planning gaps, and unexpected behavior of external dependencies during a failover event.

  • Validate fault-oblivious programming: Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. Testing failovers ensures that this model works as expected in your app.

  • Operational readiness: Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise.

Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails.
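The recovery-time assessment in the list above can be made concrete by recording two timestamps during a test and comparing the difference to the 20-minute RTO. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and what counts as "first successful work after failover" (for example, the first Workflow Task completed against the new active replica) is something you define for your own application.

```python
from datetime import datetime, timedelta, timezone

RTO = timedelta(minutes=20)  # objective quoted from the SLA text in this section


def recovery_time(failover_started_at, first_success_at):
    """Return the elapsed time between failover start and first successful work."""
    if first_success_at < failover_started_at:
        raise ValueError("first_success_at precedes failover_started_at")
    return first_success_at - failover_started_at


def meets_rto(failover_started_at, first_success_at, rto=RTO):
    """Return True if the measured recovery time is within the objective."""
    return recovery_time(failover_started_at, first_success_at) <= rto


if __name__ == "__main__":
    # Illustrative timestamps only; record real ones during your own test.
    started = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    recovered = started + timedelta(seconds=45)
    print(recovery_time(started, recovered), meets_rto(started, recovered))
```

Keeping the measurements from each test run lets you see whether recovery time drifts as your deployment grows.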

Preparing Worker deployment

Enabling high availability for Namespaces doesn't require specific Worker configuration. The process is invisible to the Workers. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. More details are available in the Routing section.

  • When a Namespace fails over to a replica in a different region, Workers communicate with it across regions. If your application can’t tolerate this latency, deploy a second set of Workers in the replica's region or opt for a replica in the same region.

  • In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region.

Set up secure routing for failovers

When using a high availability Namespace, the Namespace's DNS record <ns>.<acct>.<tmprl_domain> targets a regional DNS record in the format <region>.region.<tmprl_domain>. Here, <region> is the currently active region for your Namespace. Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record.

During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. Namespace DNS records are configured with a 15-second TTL, so any DNS cache should re-resolve the record within that time. As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. Clients should therefore converge on the newly targeted region within 30 seconds at most.
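During a failover test, you can check this convergence window by polling the Namespace record until it points at the expected region. The sketch below is illustrative only: `resolve_target` is a caller-supplied function (hypothetical here) that returns the current CNAME target of the Namespace record, and how you implement it depends on the DNS tooling available to you. The TTL and the 2x rule of thumb come from the paragraph above.

```python
import time

TTL_SECONDS = 15                           # TTL stated for Namespace DNS records
MAX_CONVERGENCE_SECONDS = 2 * TTL_SECONDS  # rule of thumb: twice the TTL


def wait_for_region(resolve_target, expected_target,
                    timeout=MAX_CONVERGENCE_SECONDS, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll resolve_target() until it returns expected_target or timeout elapses.

    Returns the seconds waited on success, or None if the timeout is reached.
    clock and sleep are injectable so the function can be tested without
    real delays.
    """
    start = clock()
    while True:
        if resolve_target() == expected_target:
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A result of None means the record did not converge within the expected window from where the check ran, which usually points at an intermediate DNS cache holding the old answer longer than the TTL.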

info

Some networking configuration is required for failover to be transparent to clients and Workers when using PrivateLink. This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only.

PrivateLink customers may need to change certain configurations for multi-region Namespace use. Routing configuration depends on networking setup and use of PrivateLink. You may need to:

  • override a DNS zone; and
  • ensure the network connectivity between the two regions.

Customer side solution example

When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. The region.<tmprl_domain> zone is configured in the Temporal systems as an independent zone. This allows you to override it to make sure traffic is routed internally for the regions in use. You can check the Namespace's active region using the Namespace record CNAME, which is public.

To set up the DNS override, override the records for specific regions so that they target the relevant IP addresses (for example, aws-us-west-1.region.tmprl.cloud targeting 192.168.1.2). On AWS, you can do this with a private hosted zone in Route 53 for region.<tmprl_domain>. Link that private zone to the VPCs you use for Workers. PrivateLink is not yet offered for GCP multi-region Namespaces.

When your Workers connect to the Namespace, they first resolve the <ns>.<acct>.<tmprl_domain> record. This targets <active>.region.<tmprl_domain> using a CNAME. Your private zone overrides that second DNS resolution, leading traffic to reach the internal IP you're using.
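The two-step resolution described above can be modeled with plain dictionaries to see why the override keeps traffic internal before and after a failover. This is a toy simulation, not a DNS client: the function name, the example Namespace record, and the IP addresses are illustrative assumptions.

```python
def resolve(name, public_cnames, private_zone):
    """Follow public CNAMEs, letting the private zone answer first.

    public_cnames: dict mapping a record name to its CNAME target.
    private_zone: dict mapping overridden record names to internal IPs.
    Returns the internal IP, or None if no override matches.
    """
    seen = set()
    while name not in seen:
        seen.add(name)
        if name in private_zone:
            return private_zone[name]
        if name not in public_cnames:
            return None
        name = public_cnames[name]
    return None  # a CNAME loop was detected


if __name__ == "__main__":
    # Illustrative private zone: one internal IP per region in use.
    private_zone = {
        "aws-us-west-2.region.tmprl.cloud": "192.168.1.2",
        "aws-us-east-1.region.tmprl.cloud": "192.168.2.2",
    }
    namespace_record = "my-ns.a1b2c.tmprl.cloud"  # placeholder Namespace record

    before = {namespace_record: "aws-us-west-2.region.tmprl.cloud"}
    after = {namespace_record: "aws-us-east-1.region.tmprl.cloud"}
    print(resolve(namespace_record, before, private_zone))
    print(resolve(namespace_record, after, private_zone))
```

The private zone needs an entry for every region the Namespace can be active in. Otherwise the second resolution step falls through to no internal answer after a failover.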

Consider how you'll configure Workers to run in this scenario. You might run Workers in both regions at all times. Alternatively, you could establish connectivity between the regions so that Workers can be redirected once failover occurs.

The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. The sa-east-1 region listed here is not yet available for use with multi-region Namespaces.

| Region | PrivateLink Service Name | DNS Record Override |
|---|---|---|
| ap-northeast-1 | com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48a | aws-ap-northeast-1.region.tmprl.cloud |
| ap-northeast-2 | com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308 | aws-ap-northeast-2.region.tmprl.cloud |
| ap-south-1 | com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662 | aws-ap-south-1.region.tmprl.cloud |
| ap-south-2 | com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1 | aws-ap-south-2.region.tmprl.cloud |
| ap-southeast-1 | com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccd | aws-ap-southeast-1.region.tmprl.cloud |
| ap-southeast-2 | com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08 | aws-ap-southeast-2.region.tmprl.cloud |
| ca-central-1 | com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9d | aws-ca-central-1.region.tmprl.cloud |
| eu-central-1 | com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3 | aws-eu-central-1.region.tmprl.cloud |
| eu-west-1 | com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739 | aws-eu-west-1.region.tmprl.cloud |
| eu-west-2 | com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695 | aws-eu-west-2.region.tmprl.cloud |
| sa-east-1 | com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525a | aws-sa-east-1.region.tmprl.cloud |
| us-east-1 | com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37f | aws-us-east-1.region.tmprl.cloud |
| us-east-2 | com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4 | aws-us-east-2.region.tmprl.cloud |
| us-west-2 | com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94 | aws-us-west-2.region.tmprl.cloud |