
Design a highly available, multi-region Kubernetes architecture for a financial services application.

High Availability & Architecture Interactive Quiz

The Scenario

You’re the Lead Cloud Architect at a Fortune 500 financial services company. The business is launching a new real-time trading platform that must meet these requirements:

  • 99.99% uptime SLA (less than 53 minutes of downtime per year)
  • Handles 50,000 transactions per second during market hours
  • Regulatory requirement: Data must remain in specific regions (US data in US, EU data in EU)
  • RTO (Recovery Time Objective): 5 minutes
  • RPO (Recovery Point Objective): Zero data loss
  • Global user base: Users from North America, Europe, and Asia-Pacific

The CTO asks you: “Design our Kubernetes infrastructure to meet these requirements. We have budget for AWS, GCP, or Azure—your choice.”

The Challenge

Design a comprehensive, production-ready multi-region Kubernetes architecture. Your design must address:

  1. Cluster topology: How many clusters? Where? Why?
  2. High availability: How do you ensure 99.99% uptime?
  3. Data residency: How do you meet regulatory requirements?
  4. Disaster recovery: What happens if an entire region fails?
  5. Traffic routing: How do users reach the right region?

Draw the architecture and explain your design decisions.

How Different Experience Levels Approach This
Junior Engineer
Surface Level

A junior architect might propose one large Kubernetes cluster stretched across multiple regions, fronted by a simple global load balancer, with databases replicated everywhere and no thought given to data residency. This fails on several fronts: cross-region latency between nodes and the control plane kills performance, replicating data everywhere violates residency regulations such as GDPR, cross-region network costs are enormous, the single control plane is a single point of failure, and the whole setup becomes a compliance nightmare.

Senior Engineer
Production Ready

A senior architect designs a multi-cluster, multi-region, active-active architecture: three separate regional clusters (US-East, EU-West, AP-Southeast) for data residency and fault isolation, multi-AZ node groups within each region for high availability, active-active traffic routing through AWS Global Accelerator for fast failover, and RDS with cross-region async replication for disaster recovery. This achieves the 99.99% uptime target with an RTO under 2 minutes.
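
In concrete terms, each regional cluster can be stamped out from the same template. A minimal sketch using an eksctl config for the US-EAST cluster (cluster and node group names are illustrative; the EU-WEST and AP-SOUTHEAST clusters repeat the pattern with their own regions and AZs):

# Sketch: eksctl config for the US-EAST regional cluster (one such config per region)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: trading-us-east        # illustrative name
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
managedNodeGroups:
  - name: trading-workers
    instanceType: m5.2xlarge
    desiredCapacity: 9         # 9 nodes spread across the 3 AZs (3 per AZ)
    minSize: 9
    maxSize: 18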

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                      Global Load Balancer                        │
│       (AWS Global Accelerator / GCP Cloud Load Balancing)        │
│             Routes users to nearest healthy region               │
└──────┬──────────────────────┬──────────────────────┬────────────┘
       │                      │                      │
┌──────▼────────────┐  ┌──────▼────────────┐  ┌──────▼────────────┐
│ US-EAST (Primary) │  │ EU-WEST           │  │ AP-SOUTHEAST      │
│ ┌───────────────┐ │  │ ┌───────────────┐ │  │ ┌───────────────┐ │
│ │  EKS Cluster  │ │  │ │  EKS Cluster  │ │  │ │  EKS Cluster  │ │
│ │  Multi-AZ     │ │  │ │  Multi-AZ     │ │  │ │  Multi-AZ     │ │
│ │  3 Nodes/AZ   │ │  │ │  3 Nodes/AZ   │ │  │ │  3 Nodes/AZ   │ │
│ └───────────────┘ │  │ └───────────────┘ │  │ └───────────────┘ │
│                   │  │                   │  │                   │
│  RDS Multi-AZ     │  │  RDS Multi-AZ     │  │  RDS Multi-AZ     │
└─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
          │                      │                      │
          └──────────────────────┴──────────────────────┘
                           Cross-Region
                         Async Replication

Key Design Decisions

1. Three Separate Regional Clusters (Not One Global Cluster)

Why:

  • Data Residency: GDPR requires EU customer data to stay in EU
  • Fault Isolation: If one region fails, others continue operating
  • Latency: Users connect to geographically closest cluster (50ms vs 200ms)
  • Compliance: Financial regulations require data sovereignty

2. Multi-AZ Within Each Region

Each cluster (e.g., us-east-1) spans three availability zones, with three worker nodes per AZ and control-plane replicas distributed across the AZs (EKS manages this for you). If one AZ fails, pods reschedule onto the healthy AZs; spreading across AZs is what lifts availability from roughly 99.9% for a single AZ toward the 99.99% target; and a single AZ can be drained for maintenance while the other two keep serving traffic, enabling zero-downtime deployments.
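
Draining an AZ safely also relies on capping voluntary disruptions. A minimal PodDisruptionBudget sketch, assuming the app: trading-api label used by the Deployment shown later:

# Keep at least two-thirds of the API pods (6 of 9) available during node drains and upgrades
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-api-pdb
spec:
  minAvailable: 6
  selector:
    matchLabels:
      app: trading-api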

3. Active-Active Traffic Routing

Users are routed to their nearest region: New York users land in US-EAST, London users in EU-WEST, and Tokyo users in AP-SOUTHEAST. AWS Global Accelerator fails over far faster than DNS-based Route 53 routing: its health checks detect failures in seconds rather than minutes, its static anycast IPs eliminate DNS propagation and client-caching delays, and traffic shifts to healthy regions automatically, typically within about 30 seconds.
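
As a rough CloudFormation sketch (resource names and the endpoint ARN are placeholders), the accelerator, a TCP listener, and one endpoint group per region look like this:

# Global Accelerator with per-region endpoint groups; EU-WEST and AP-SOUTHEAST
# endpoint groups follow the same pattern as US-EAST.
Resources:
  TradingAccelerator:
    Type: AWS::GlobalAccelerator::Accelerator
    Properties:
      Name: trading-platform
      Enabled: true
  TradingListener:
    Type: AWS::GlobalAccelerator::Listener
    Properties:
      AcceleratorArn: !Ref TradingAccelerator
      Protocol: TCP
      PortRanges:
        - FromPort: 443
          ToPort: 443
  UsEastEndpointGroup:
    Type: AWS::GlobalAccelerator::EndpointGroup
    Properties:
      ListenerArn: !Ref TradingListener
      EndpointGroupRegion: us-east-1
      HealthCheckProtocol: TCP
      HealthCheckIntervalSeconds: 10   # with ThresholdCount 3, failover in roughly 30 seconds
      ThresholdCount: 3
      EndpointConfigurations:
        - EndpointId: arn:aws:elasticloadbalancing:us-east-1:...  # placeholder: the region's NLB ARN
          Weight: 100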

4. Database Strategy

US-EAST runs RDS Multi-AZ as the primary (a synchronous standby within the region) with async replication to EU-WEST. EU-WEST holds an RDS read replica that can be promoted to primary and replicates asynchronously onward to AP-SOUTHEAST. The automated failover path: detect the failure within about 10 seconds, shift traffic to EU-WEST in about 20 seconds, and promote the EU-WEST replica to primary in about a minute, for a total RTO under 2 minutes, comfortably inside the 5-minute requirement.
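
A compressed CloudFormation sketch of the database tier (identifiers and sizes are illustrative; in practice the primary lives in the us-east-1 stack and the replica in a separate eu-west-1 stack):

# RDS Multi-AZ primary in US-EAST plus a cross-region read replica in EU-WEST
Resources:
  TradingDbPrimary:                              # us-east-1
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: trading-db-us-east   # illustrative name
      Engine: postgres
      DBInstanceClass: db.r5.2xlarge
      MultiAZ: true                              # synchronous standby in a second AZ
      AllocatedStorage: "500"
      MasterUsername: trading_admin
      ManageMasterUserPassword: true             # RDS-managed secret instead of a literal password
  TradingDbEuReplica:                            # eu-west-1 (separate stack in practice)
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: trading-db-eu-west
      DBInstanceClass: db.r5.2xlarge
      # Cross-region replica: reference the primary by ARN; promote it on regional failover
      SourceDBInstanceIdentifier: arn:aws:rds:us-east-1:...:db:trading-db-us-east  # placeholder ARN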

What Makes the Difference?
  • Context over facts: Explains when and why, not just what
  • Real examples: Provides specific use cases from production experience
  • Trade-offs: Acknowledges pros, cons, and decision factors

Complete Deployment with Multi-AZ

# Deployment spread evenly across AZs (3 pods per zone) via topology spread constraints
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-api
spec:
  replicas: 9  # 3 per AZ
  selector:
    matchLabels:
      app: trading-api
  template:
    metadata:
      labels:
        app: trading-api
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: trading-api
      containers:
      - name: api
        image: trading-api:1.0.0  # placeholder image
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"

Cost Optimization

Estimated Monthly Cost (AWS):

  • EKS Clusters (3): $219/month
  • EC2 Nodes (27 instances, m5.2xlarge): ~$7,000/month
  • RDS Multi-AZ (3 db.r5.2xlarge): ~$3,000/month
  • Global Accelerator: ~$500/month
  • Data Transfer: ~$2,000/month
  • Total: ~$12,700/month (handles 50K TPS with 99.99% uptime)

Practice Question

Your financial services application must guarantee 99.99% uptime. You're deciding between running a single-AZ cluster with 10 nodes vs a multi-AZ cluster with 9 nodes (3 per AZ). Which should you choose and why?