AWS for Backend Engineers
March 31, 2026 | 9 min read
Lesson 14 / 15

14. Multi-Region Architecture on AWS

TL;DR

Multi-region is about latency, compliance, or disaster recovery. Active-active gives the best availability but is hardest to build. Start with active-passive and promote when ready. DynamoDB Global Tables and Aurora Global Database handle replication. Route 53 routing policies control traffic distribution. Know your RPO and RTO before picking a DR tier.

Running in a single AWS region is fine until it is not. A region-wide outage, a compliance requirement that mandates data residency, or users on the other side of the planet experiencing 300ms latency — these are the reasons teams go multi-region.

Multi-region architecture is one of the most complex things you can build on AWS. It touches networking, data replication, DNS, deployment, and operational processes. This lesson covers the patterns, services, and tradeoffs you need to make informed decisions.

[Diagram: AWS multi-region active-active architecture]

Why Multi-Region

There are exactly three reasons to go multi-region. If none apply, stay single-region.

Latency. Physics is undefeated. Light in fiber takes roughly 35ms to cross the Atlantic one way, so a round trip costs about 70ms. If your users are in Europe and your servers are in us-east-1, every API call starts with that much network latency before your code even runs. For real-time applications, this is unacceptable.

Compliance. GDPR, data residency laws, and industry regulations may require data to stay within specific geographic boundaries. A healthcare application serving EU patients may need EU data to remain in eu-west-1.

Disaster Recovery. AWS regions are independent infrastructure. A region-wide outage — while rare — can happen. If your business cannot tolerate hours of downtime, you need a second region ready to take traffic.

Route 53 Routing Policies

Route 53 is the front door of multi-region. It decides which region receives each DNS query. Understanding routing policies is fundamental.

Simple Routing

One record with one or more values. Route 53 returns all of them and the client picks one at random. No health checks, no intelligence. Not useful for multi-region.

Weighted Routing

Distributes traffic by percentage across multiple targets.

# 80% to us-east-1, 20% to eu-west-1
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.myapp.com",
          "Type": "A",
          "SetIdentifier": "us-east-1",
          "Weight": 80,
          "AliasTarget": {
            "HostedZoneId": "Z35SXDOTRQ7X7K",
            "DNSName": "us-east-1-alb.myapp.com",
            "EvaluateTargetHealth": true
          }
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.myapp.com",
          "Type": "A",
          "SetIdentifier": "eu-west-1",
          "Weight": 20,
          "AliasTarget": {
            "HostedZoneId": "Z32O12XQLNTSW2",
            "DNSName": "eu-west-1-alb.myapp.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'

Use case: Gradually migrating traffic to a new region. Start at 5 percent, validate, increase.

Latency-Based Routing

Routes each user to the region with the lowest latency from their location. Route 53 maintains a latency database.

aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.myapp.com",
          "Type": "A",
          "SetIdentifier": "us-east-1",
          "Region": "us-east-1",
          "AliasTarget": {
            "HostedZoneId": "Z35SXDOTRQ7X7K",
            "DNSName": "us-east-1-alb.myapp.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'

Use case: Active-active deployments where you want each user routed to the nearest healthy region.

Failover Routing

Primary and secondary. Traffic goes to the primary unless its health check fails, then Route 53 switches to secondary.

# Primary record
{
  "Name": "api.myapp.com",
  "Type": "A",
  "SetIdentifier": "primary",
  "Failover": "PRIMARY",
  "HealthCheckId": "hc-us-east-1",
  "AliasTarget": {
    "DNSName": "us-east-1-alb.myapp.com",
    "EvaluateTargetHealth": true
  }
}

# Secondary record
{
  "Name": "api.myapp.com",
  "Type": "A",
  "SetIdentifier": "secondary",
  "Failover": "SECONDARY",
  "AliasTarget": {
    "DNSName": "eu-west-1-alb.myapp.com",
    "EvaluateTargetHealth": true
  }
}

Use case: Active-passive DR setups. Cheapest multi-region option.

Geolocation Routing

Routes based on the user’s geographic location — continent, country, or US state. Unlike latency-based routing, this is explicit geographic control.

Use case: Compliance. EU users must hit EU infrastructure. US users must hit US infrastructure.
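A geolocation record looks like the failover records above; here is a sketch in the same style (names and hosted zone IDs are the placeholders used earlier in this lesson):

```
{
  "Name": "api.myapp.com",
  "Type": "A",
  "SetIdentifier": "eu-users",
  "GeoLocation": { "ContinentCode": "EU" },
  "AliasTarget": {
    "HostedZoneId": "Z32O12XQLNTSW2",
    "DNSName": "eu-west-1-alb.myapp.com",
    "EvaluateTargetHealth": true
  }
}
```

Pair it with a default record ("GeoLocation": { "CountryCode": "*" }) so users whose location matches no rule still get an answer; without it, Route 53 returns nothing for them.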

Active-Active vs Active-Passive

This is the most important architectural decision in multi-region.

Active-Passive

One region handles all traffic. The second region is on standby, ready to take over if the primary fails.

Data flow: All writes go to the primary region. Data replicates asynchronously to the secondary.

Failover: When the primary region fails, Route 53 health checks detect it and redirect DNS to the secondary. Failover typically takes 60 to 120 seconds (DNS TTL dependent).

Tradeoffs:

  • Simpler to build and reason about.
  • Lower cost — the standby region runs minimal infrastructure.
  • Some data loss during failover (whatever had not replicated yet).
  • Failover is not instant — DNS propagation takes time.
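The failover delay is roughly health-check detection time plus DNS TTL. A back-of-envelope sketch (the interval, threshold, and TTL values are example inputs, not a promise about your setup):

```javascript
// Rough DNS failover-time estimate: detection is the health-check
// interval times the failure threshold; clients may also keep serving
// the old answer for up to one TTL after Route 53 switches.
function estimateFailoverSeconds({ checkIntervalSec, failureThreshold, dnsTtlSec }) {
  const detection = checkIntervalSec * failureThreshold;
  return { detection, worstCase: detection + dnsTtlSec };
}

// Route 53 standard checks: 30s interval, failure threshold of 3, 60s TTL
const est = estimateFailoverSeconds({
  checkIntervalSec: 30,
  failureThreshold: 3,
  dnsTtlSec: 60,
});
console.log(est); // { detection: 90, worstCase: 150 }
```

Fast health checks (10-second interval) cut detection to about 30 seconds, which is how well-tuned setups land in the 60 to 120 second range.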

Active-Active

Both regions handle traffic simultaneously. Users are routed to the nearest region. Both regions can process reads and writes.

Data flow: Writes in any region replicate to all other regions. This is where complexity explodes.

Tradeoffs:

  • Best latency for global users.
  • No failover delay — if one region dies, others keep serving.
  • Write conflicts are possible — you need conflict resolution strategies.
  • Significantly more expensive — full infrastructure in every region.
  • Much harder to debug — a request might touch data replicated from another region.

My recommendation: Start with active-passive. Only move to active-active when you have a clear latency or availability requirement that justifies the complexity.

Data Replication Strategies

Data is the hardest part of multi-region. Compute is stateless and easy to duplicate. Data has consistency requirements.

DynamoDB Global Tables

DynamoDB Global Tables provide active-active replication across regions with sub-second replication latency.

# Create a global table spanning two regions
aws dynamodb create-table \
  --table-name Users \
  --attribute-definitions AttributeName=PK,AttributeType=S AttributeName=SK,AttributeType=S \
  --key-schema AttributeName=PK,KeyType=HASH AttributeName=SK,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

# Add a replica in eu-west-1 (2019.11.21 global tables)
aws dynamodb update-table \
  --table-name Users \
  --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'

Conflict resolution: Last-writer-wins based on timestamps. If two regions write the same item concurrently, the write with the later timestamp wins. That is fine for most applications but dangerous for counters or financial data, where a silently discarded write changes the result.
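You can see why counters break with a toy merge function. This mimics last-writer-wins resolution; it is a sketch of the behavior, not DynamoDB's actual implementation:

```javascript
// Toy last-writer-wins merge: the write with the later timestamp wins.
function lwwMerge(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

// Two regions each increment the same counter (base value 10) concurrently.
const usWrite = { value: 11, timestamp: 1700000000500 }; // 10 + 1 in us-east-1
const euWrite = { value: 11, timestamp: 1700000000900 }; // 10 + 1 in eu-west-1

const merged = lwwMerge(usWrite, euWrite);
console.log(merged.value); // 11, but the correct total is 12: one increment is silently lost
```

This is exactly the class of bug the conditional-write pattern below is meant to catch.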

Key design consideration: Add a region attribute to items that are region-specific. Use conditional writes to prevent conflicts on critical updates:

const { DynamoDBClient, UpdateItemCommand } = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({ region: process.env.AWS_REGION });

// Conditional write to prevent conflicts
const command = new UpdateItemCommand({
  TableName: 'Users',
  Key: { PK: { S: 'USER#123' }, SK: { S: 'PROFILE' } },
  UpdateExpression: 'SET #email = :email, #version = #version + :one',
  ConditionExpression: '#version = :expectedVersion',
  ExpressionAttributeNames: {
    '#email': 'email',
    '#version': 'version',
  },
  ExpressionAttributeValues: {
    ':email': { S: '[email protected]' },
    ':one': { N: '1' },
    ':expectedVersion': { N: '5' },
  },
});

// Send inside an async function; a ConditionalCheckFailedException
// means another writer changed the item first, so re-read and retry.
await client.send(command);

Aurora Global Database

Aurora Global Database replicates from a primary cluster to up to five secondary regions with typical replication lag under one second.

  • Primary cluster handles all writes.
  • Secondary clusters serve read traffic with near-real-time data.
  • Failover promotes a secondary to primary in under a minute.
# Create global cluster
aws rds create-global-cluster \
  --global-cluster-identifier myapp-global \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789:cluster:myapp-primary \
  --engine aurora-postgresql

# Add secondary region
aws rds create-db-cluster \
  --db-cluster-identifier myapp-secondary \
  --engine aurora-postgresql \
  --global-cluster-identifier myapp-global \
  --region eu-west-1

Write forwarding is a game-changing feature. When enabled on a secondary cluster, it accepts write queries and forwards them to the primary, so your application code does not need to know which region is primary:

// Application code is identical in both regions
// Aurora handles write forwarding automatically
const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.AURORA_ENDPOINT, // Regional endpoint
  database: 'myapp',
  ssl: true,
});

// This works in any region — writes are forwarded to primary
await pool.query('INSERT INTO orders (user_id, amount) VALUES ($1, $2)', [userId, amount]);

S3 Cross-Region Replication (CRR)

S3 replication copies objects between buckets in different regions.

{
  "Role": "arn:aws:iam::123456789:role/S3ReplicationRole",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "" },
      "Destination": {
        "Bucket": "arn:aws:s3:::myapp-assets-eu-west-1",
        "StorageClass": "STANDARD"
      },
      "DeleteMarkerReplication": { "Status": "Enabled" }
    }
  ]
}

Replication is asynchronous. Most objects replicate within minutes, but large objects can take 15 minutes or longer. S3 Replication Time Control (RTC) adds an SLA: 99.99 percent of objects replicate within 15 minutes.

ElastiCache Global Datastore

Redis Global Datastore replicates across regions with sub-second latency. The primary cluster accepts writes, and secondary clusters serve reads.

aws elasticache create-global-replication-group \
  --global-replication-group-id-suffix myapp-cache \
  --primary-replication-group-id myapp-redis-us-east-1

Stateless Service Design

Multi-region only works if your services are stateless. Every request must be self-contained — no local session state, no local file storage, no in-memory caches that cannot be lost.

Rules for stateless multi-region services:

  • Sessions go in DynamoDB or ElastiCache, never in local memory.
  • File uploads go directly to S3, never to local disk.
  • Configuration comes from SSM Parameter Store or Secrets Manager, not environment-specific config files.
  • Every service instance must be replaceable without data loss.

Global Accelerator and CloudFront

AWS Global Accelerator

Global Accelerator provides static anycast IP addresses that route to the nearest healthy endpoint. Unlike Route 53 (DNS-based), Global Accelerator works at the TCP/UDP level.

Advantages over Route 53 failover:

  • No DNS TTL caching issues — failover is near-instant.
  • Static IPs — no DNS resolution needed.
  • TCP termination at the nearest AWS edge location, then private AWS backbone to your region.
aws globalaccelerator create-accelerator \
  --name myapp-global \
  --ip-address-type IPV4

# A listener must exist before you can attach an endpoint group
aws globalaccelerator create-listener \
  --accelerator-arn arn:aws:globalaccelerator::123456789:accelerator/abc123 \
  --protocol TCP \
  --port-ranges FromPort=443,ToPort=443

aws globalaccelerator create-endpoint-group \
  --listener-arn arn:aws:globalaccelerator::123456789:accelerator/abc123/listener/def456 \
  --endpoint-group-region us-east-1 \
  --endpoint-configurations "EndpointId=arn:aws:elasticloadbalancing:us-east-1:123456789:loadbalancer/app/myapp-alb/abc123,Weight=100"

Cost: $0.025/hour per accelerator plus data transfer. Not cheap, but worth it for latency-critical applications.

CloudFront for Edge Caching

CloudFront caches content at 400+ edge locations worldwide. For multi-region APIs, it reduces latency for cacheable responses:

{
  "Origins": [
    {
      "Id": "us-east-1-origin",
      "DomainName": "us-east-1-alb.myapp.com",
      "OriginPath": "",
      "CustomOriginConfig": {
        "HTTPPort": 80,
        "HTTPSPort": 443,
        "OriginProtocolPolicy": "https-only"
      }
    }
  ],
  "DefaultCacheBehavior": {
    "TargetOriginId": "us-east-1-origin",
    "CachePolicyId": "658327ea-f89d-4fab-a63d-7e88639e58f6",
    "ViewerProtocolPolicy": "redirect-to-https"
  }
}

Use CloudFront for static assets and cacheable API responses. Do not use it for real-time or user-specific data.

RPO, RTO, and DR Tiers

Two metrics define your disaster recovery requirements:

RPO (Recovery Point Objective) — How much data can you afford to lose? An RPO of 1 hour means you accept losing up to 1 hour of data.

RTO (Recovery Time Objective) — How long can your service be down? An RTO of 15 minutes means you must be back online within 15 minutes of an outage.
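A useful exercise is to compare your stated objectives against measured reality, since your worst observed replication lag bounds your effective RPO and your last failover drill bounds your effective RTO. A sketch with made-up example figures:

```javascript
// Compare stated objectives with measured behavior (all figures are examples).
function meetsObjectives({ rpoSec, rtoSec }, { maxReplicationLagSec, lastDrillRecoverySec }) {
  return {
    rpoMet: maxReplicationLagSec <= rpoSec,
    rtoMet: lastDrillRecoverySec <= rtoSec,
  };
}

// Stated: RPO 5 minutes, RTO 15 minutes.
// Measured: 90s worst replication lag, 20-minute recovery in the last drill.
const result = meetsObjectives(
  { rpoSec: 300, rtoSec: 900 },
  { maxReplicationLagSec: 90, lastDrillRecoverySec: 1200 }
);
console.log(result); // { rpoMet: true, rtoMet: false }
```

Here replication comfortably meets the RPO, but the drill shows the RTO is fiction until the recovery process gets faster, which is the kind of gap that pushes you up a DR tier.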

These metrics drive which DR tier you need:

Tier 1: Backup and Restore

  • RPO: Hours to days
  • RTO: Hours to days
  • Cost: Very low
  • How: Regular backups to S3 (or S3 Cross-Region Replication). On failure, launch infrastructure in the DR region and restore from backups.
  • Use when: Non-critical systems where hours of downtime are acceptable.

Tier 2: Pilot Light

  • RPO: Minutes to hours
  • RTO: Minutes to hours
  • Cost: Low to moderate
  • How: Core infrastructure runs in DR region (databases replicating, minimal compute). On failure, scale up compute and switch DNS.
  • Use when: Important systems that need faster recovery but do not justify full redundancy.
# Pilot light: DB replica running, compute scaled to zero
# On failover, scale up the ASG
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-asg \
  --min-size 2 \
  --desired-capacity 4

# Promote the read replica
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier dr-cluster

Tier 3: Warm Standby

  • RPO: Seconds to minutes
  • RTO: Minutes
  • Cost: Moderate to high
  • How: Full infrastructure in both regions, but DR runs at reduced capacity. On failure, scale DR to full capacity and switch traffic.
  • Use when: Business-critical systems where downtime costs real money.

Tier 4: Multi-Site Active-Active

  • RPO: Near zero
  • RTO: Near zero
  • Cost: High (double or more)
  • How: Both regions handle production traffic. If one fails, the other absorbs the load.
  • Use when: Zero-tolerance systems — payments, real-time trading, critical SaaS.

Cost Implications

Multi-region is expensive. Here is what adds up:

  • Compute (EC2/Fargate): 2x for active-active, roughly 1.3x for warm standby.
  • Database replication: Aurora Global adds reader instances in each secondary region; DynamoDB Global Tables bill replicated write capacity in every region.
  • Data transfer: cross-region transfer runs about $0.02/GB, which adds up fast with replication.
  • Load balancers: one ALB per region (about $16/month plus LCU charges).
  • Route 53 health checks: $0.50/month per health check for AWS endpoints, plus about $1/month per optional feature such as HTTPS or string matching.
  • Global Accelerator: $0.025/hour per accelerator plus data transfer.
  • Operational overhead: more regions mean more monitoring and more on-call complexity.

Rough estimate: Active-active multi-region typically costs 2 to 2.5 times what a single-region deployment costs.
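To make that multiplier concrete, here is a back-of-envelope calculator. Every input is an assumption; real pricing varies by region, usage, and discounts:

```javascript
// Back-of-envelope monthly estimate for going active-active.
function multiRegionEstimate({ singleRegionMonthly, computeMultiplier, replicatedGb, transferPerGb }) {
  const compute = singleRegionMonthly * computeMultiplier; // duplicated infrastructure
  const transfer = replicatedGb * transferPerGb;           // cross-region replication traffic
  return compute + transfer;
}

// $10k/month single-region, 2x compute, 50 TB/month replicated at $0.02/GB
const monthly = multiRegionEstimate({
  singleRegionMonthly: 10000,
  computeMultiplier: 2,
  replicatedGb: 50000,
  transferPerGb: 0.02,
});
console.log(monthly); // 21000, roughly 2.1x the single-region bill
```

The point of a model this crude is to expose which term dominates: for most workloads it is duplicated compute and databases, not the per-GB transfer.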

Testing Failover

Multi-region is useless if you never test failover. Schedule regular Game Day exercises:

  1. DNS failover test — Fail the Route 53 health check for the primary region. Verify traffic shifts to secondary within the expected RTO.
  2. Database failover test — Promote the Aurora secondary to primary. Verify the application reconnects and resumes operations.
  3. Full region evacuation — Simulate a complete region failure. Verify all services recover in the DR region.
  4. Failback test — After failover, test returning to the original region. This is often harder than the initial failover.
# Simulate failover by updating Route 53 health check
aws route53 update-health-check \
  --health-check-id hc-us-east-1 \
  --inverted

# Monitor failover
watch -n 5 'dig +short api.myapp.com'

Critical rule: If you have not tested failover in the last 90 days, you do not have a DR plan. You have a DR wish.

What You Should Remember

Multi-region exists for latency, compliance, or DR — pick the reason that applies. Start with active-passive before attempting active-active. DynamoDB Global Tables and Aurora Global Database handle replication but have different consistency models. Route 53 routing policies control traffic, but Global Accelerator gives faster failover. Know your RPO and RTO numbers before choosing a DR tier. And test your failover regularly — an untested DR plan is not a plan at all.