Kubernetes High Availability: Your Cluster's Safety Net 🛡️
The Story of the Unbreakable Restaurant
Imagine you run the busiest restaurant in town. Customers line up every day. But what happens if:
- Your one chef gets sick?
- Your one cash register breaks?
- Your one recipe book catches fire?
Disaster! The restaurant closes. Customers leave hungry.
Now imagine a smarter restaurant:
- Three chefs who can cover for each other
- Multiple cash registers that sync automatically
- Three copies of the recipe book in different rooms
One chef sick? No problem. The other two keep cooking.
This is High Availability (HA) for Kubernetes!
What is High Availability?
High Availability means your system keeps running even when parts break.
Think of it like a stool:
- If one leg breaks on a regular stool → you fall
- If one leg breaks on a stool with three backup legs → you stay seated
Simple Rule: No single point of failure.
Real Life Example
WITHOUT HA:
┌─────────────┐
│ One Server  │ ← Server dies = EVERYTHING DIES
└─────────────┘
WITH HA:
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Server 1   │  │  Server 2   │  │  Server 3   │
└─────────────┘  └─────────────┘  └─────────────┘
       ✓                ✓                ✓
One dies? → Others keep working!
The Control Plane: Your Clusterβs Brain
The Control Plane is like the manager's office in our restaurant.
It makes all the important decisions:
- Where to run your apps (scheduling)
- Watching over everything (monitoring)
- Keeping track of what's running (state management)
Control Plane Components
```mermaid
graph TD
    A["API Server"] --> B["Scheduler"]
    A --> C["Controller Manager"]
    A --> D["etcd Database"]
    B --> E["Decides where pods run"]
    C --> F["Ensures desired state"]
    D --> G["Stores all cluster data"]
```
| Component | Job | Restaurant Analogy |
|---|---|---|
| API Server | Front door for all requests | Reception desk |
| Scheduler | Assigns work to nodes | Seating host |
| Controller Manager | Makes sure things match desired state | Quality manager |
| etcd | Stores all cluster data | Recipe book |
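Want to see these components on a real cluster? Here is a minimal Python sketch, assuming a kubeadm-style cluster (where the control plane runs as pods in the kube-system namespace), a working kubeconfig on your machine, and the official `kubernetes` Python client installed:

```python
# Minimal sketch: list the control plane pods of a kubeadm-style cluster.
# Assumes `pip install kubernetes` and a kubeconfig that points at the cluster.
from kubernetes import client, config

config.load_kube_config()   # reads ~/.kube/config by default
v1 = client.CoreV1Api()

# On kubeadm clusters the components are conventionally named
# kube-apiserver-*, kube-scheduler-*, kube-controller-manager-*, and etcd-*.
for pod in v1.list_namespaced_pod(namespace="kube-system").items:
    name = pod.metadata.name
    if name.startswith(("kube-apiserver", "kube-scheduler",
                        "kube-controller-manager", "etcd")):
        print(f"{name:50s} {pod.status.phase}")
```

On a managed cluster (EKS, GKE, AKS) you won't see these pods at all: the provider runs the control plane for you.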
Control Plane HA: Multiple Brains Working Together
The Problem with One Brain
One control plane = One point of failure
SINGLE CONTROL PLANE:
┌──────────────────────┐
│    Control Plane     │ ← Dies = Cluster blind!
│ ┌────┐ ┌─────┐ ┌────┐│   - No new pods
│ │API │ │Sched│ │etcd││   - No healing
│ └────┘ └─────┘ └────┘│   - No updates
└──────────────────────┘
The HA Solution: Multiple Control Planes
HA CONTROL PLANE:
┌──────────────────────┐
│   Control Plane 1    │
│ ┌────┐ ┌─────┐ ┌────┐│
│ │API │ │Sched│ │etcd││
│ └────┘ └─────┘ └────┘│
└──────────────────────┘
          ↕ Sync
┌──────────────────────┐
│   Control Plane 2    │
│ ┌────┐ ┌─────┐ ┌────┐│
│ │API │ │Sched│ │etcd││
│ └────┘ └─────┘ └────┘│
└──────────────────────┘
          ↕ Sync
┌──────────────────────┐
│   Control Plane 3    │
│ ┌────┐ ┌─────┐ ┌────┐│
│ │API │ │Sched│ │etcd││
│ └────┘ └─────┘ └────┘│
└──────────────────────┘
If one dies → The other two take over instantly!
How Requests Reach Control Planes
A Load Balancer sits in front of all control planes:
```mermaid
graph TD
    U["User Request"] --> LB["Load Balancer"]
    LB --> CP1["Control Plane 1"]
    LB --> CP2["Control Plane 2"]
    LB --> CP3["Control Plane 3"]
```
The load balancer:
- Sends requests to healthy control planes
- Skips broken ones automatically
- Users never notice failures!
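To make that idea concrete, here is a toy Python sketch of the health-check-and-forward logic. It is only an illustration: real clusters use something like HAProxy, keepalived, kube-vip, or a cloud load balancer, and the cp1/cp2/cp3 host names below are made up for the example.

```python
# Toy load balancer logic: probe each API server's health endpoint and only
# forward traffic to members that answer. Hostnames are placeholders.
import random
import requests

CONTROL_PLANES = ["cp1.example.com", "cp2.example.com", "cp3.example.com"]

def healthy_endpoints():
    """Return the API servers that currently pass a health probe."""
    alive = []
    for host in CONTROL_PLANES:
        try:
            # kube-apiserver exposes /readyz; verify=False only because this
            # sketch skips certificate setup.
            r = requests.get(f"https://{host}:6443/readyz", timeout=2, verify=False)
            if r.status_code == 200:
                alive.append(host)
        except requests.RequestException:
            pass  # unreachable: skip it, just like a real load balancer would
    return alive

def pick_backend():
    """Send the next request to any healthy control plane."""
    candidates = healthy_endpoints()
    if not candidates:
        raise RuntimeError("No healthy control planes: cluster API is down")
    return random.choice(candidates)

print("Forwarding request to:", pick_backend())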
etcd: The Memory of Your Cluster
What is etcd?
etcd is a distributed key-value database. Think of it as your clusterβs permanent memory.
Everything Kubernetes knows is stored here:
- Pod configurations
- Service definitions
- Secrets and ConfigMaps
- Node information
If etcd dies without backup → You lose EVERYTHING.
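If you want to poke at etcd directly, here is a small sketch using the third-party python-etcd3 client (`pip install etcd3`) against a local etcd listening on port 2379. Kubernetes itself stores its objects under the `/registry/` prefix, encoded as protobuf rather than plain strings; the key below is just for illustration.

```python
# Sketch: etcd as a plain key-value store, via the third-party python-etcd3
# client. Assumes an etcd member reachable on 127.0.0.1:2379 without TLS.
import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)

# Write and read back a key, the same basic operation Kubernetes uses to
# persist cluster state (its real keys live under /registry/...).
etcd.put("/demo/restaurant/recipe-book", "three synced copies")
value, metadata = etcd.get("/demo/restaurant/recipe-book")
print(value.decode())   # -> "three synced copies"
```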
Why etcd is Special
etcd uses the Raft consensus algorithm.
Think of it like this:
Three friends deciding where to eat:
- Friend 1: "Pizza!"
- Friend 2: "Pizza!"
- Friend 3: "Sushi!"
Result: Pizza wins! (2 out of 3 agree)
This is how etcd makes decisions. The majority must agree before any change is saved.
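Here is a toy Python version of that majority rule. It is a big simplification of Raft (no leader election, no log replication, no terms), but it shows the core idea: a change only counts once more than half of the members acknowledge it.

```python
# Toy "majority wins" rule, the heart of what Raft gives etcd.
def committed(acks: int, cluster_size: int) -> bool:
    """A write is committed once a strict majority has acknowledged it."""
    return acks >= cluster_size // 2 + 1

votes = {"friend-1": "pizza", "friend-2": "pizza", "friend-3": "sushi"}
pizza_votes = sum(1 for choice in votes.values() if choice == "pizza")

print(committed(pizza_votes, len(votes)))  # True: 2 of 3 agree, pizza wins
print(committed(1, 3))                     # False: one vote alone is not enough
```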
etcd HA Patterns: Two Ways to Deploy
Pattern 1: Stacked etcd (Simple)
etcd runs on the same machines as control plane components.
```mermaid
graph TD
    subgraph Node1["Node 1"]
        CP1["Control Plane"]
        E1["etcd"]
    end
    subgraph Node2["Node 2"]
        CP2["Control Plane"]
        E2["etcd"]
    end
    subgraph Node3["Node 3"]
        CP3["Control Plane"]
        E3["etcd"]
    end
    E1 <--> E2
    E2 <--> E3
    E1 <--> E3
```
Pros:
- Simpler setup
- Fewer machines needed
- Easier to manage
Cons:
- Node failure = lose both control plane AND etcd member
- Resources shared between components
Best For: Smaller clusters, cost-conscious setups
Pattern 2: External etcd (Robust)
etcd runs on separate dedicated machines.
```mermaid
graph TD
    subgraph CP["Control Plane Nodes"]
        CP1["Control Plane 1"]
        CP2["Control Plane 2"]
        CP3["Control Plane 3"]
    end
    subgraph ETCD["etcd Cluster"]
        E1["etcd 1"]
        E2["etcd 2"]
        E3["etcd 3"]
    end
    CP1 --> E1
    CP1 --> E2
    CP1 --> E3
    CP2 --> E1
    CP2 --> E2
    CP2 --> E3
    CP3 --> E1
    CP3 --> E2
    CP3 --> E3
    E1 <--> E2
    E2 <--> E3
    E1 <--> E3
```
Pros:
- More resilient (failures are isolated)
- Better performance (dedicated resources)
- Easier to scale etcd independently
Cons:
- More machines needed (6+ total)
- More complex setup
Best For: Production clusters, large-scale deployments
etcd Quorum: The Voting System
What is Quorum?
Quorum = The minimum number of members that must agree.
Formula: quorum = floor(n / 2) + 1 (divide the member count by 2, round down, then add 1)
| Cluster Size | Quorum Needed | Can Lose |
|---|---|---|
| 3 nodes | 2 must agree | 1 node |
| 5 nodes | 3 must agree | 2 nodes |
| 7 nodes | 4 must agree | 3 nodes |
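You can reproduce this table (and preview the next section about odd numbers) with a few lines of Python:

```python
# Quorum math: quorum is floor(n / 2) + 1, and the cluster tolerates
# n - quorum failed members before it can no longer accept writes.
def quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 4, 5, 6, 7):
    q = quorum(n)
    print(f"{n} members: quorum = {q}, can lose {n - q}")

# Output:
# 3 members: quorum = 2, can lose 1
# 4 members: quorum = 3, can lose 1   <- no better than 3 members
# 5 members: quorum = 3, can lose 2
# 6 members: quorum = 4, can lose 2   <- no better than 5 members
# 7 members: quorum = 4, can lose 3
```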
Why Odd Numbers?
Odd numbers are always better!
3 nodes vs 4 nodes:
3 NODES:          4 NODES:
Quorum = 2        Quorum = 3
Can lose = 1      Can lose = 1
Same fault tolerance, but the 4-node cluster costs more!
Adding that 4th node doesn't help. It just costs more money and adds complexity.
The Split Brain Problem
What happens when network splits the cluster?
NETWORK PARTITION:
┌─────────────────┐    ╳    ┌─────────────────┐
│ Node 1   Node 2 │  SPLIT  │     Node 3      │
│   ✓        ✓    │         │        ✗        │
│  Can talk to    │         │  Alone, can't   │
│  each other     │         │  reach quorum   │
└─────────────────┘         └─────────────────┘
      2 nodes                     1 node
    HAS QUORUM!                NO QUORUM 😢
With 3 nodes and quorum of 2:
- Side A (2 nodes): Has quorum → Can make changes
- Side B (1 node): No quorum → Read-only mode
This prevents "split brain" where both sides make conflicting changes!
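Here is a tiny Python sketch of that rule, assuming a 3-member cluster split into a 2-node side and a 1-node side:

```python
# Sketch of quorum during a network partition: each side counts its own
# members, and only a side with a strict majority of the ORIGINAL cluster
# keeps accepting writes.
CLUSTER = {"node-1", "node-2", "node-3"}
QUORUM = len(CLUSTER) // 2 + 1          # 2 for a 3-member cluster

def side_status(members: set) -> str:
    return "read-write (has quorum)" if len(members) >= QUORUM else "read-only (no quorum)"

# The network splits the cluster into two isolated groups.
side_a = {"node-1", "node-2"}
side_b = {"node-3"}

print("Side A:", side_status(side_a))   # read-write (has quorum)
print("Side B:", side_status(side_b))   # read-only (no quorum)
```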
Putting It All Together
HA Architecture Checklist
✅ Multiple control plane nodes (3, 5, or 7)
✅ Load balancer in front of API servers
✅ etcd cluster with odd number of members
✅ Nodes spread across failure zones
✅ Regular etcd backups
Example: 3-Node HA Cluster
```mermaid
graph TD
    LB["Load Balancer"] --> N1
    LB --> N2
    LB --> N3
    subgraph N1["Node 1 - Zone A"]
        API1["API Server"]
        SCHED1["Scheduler"]
        CM1["Controller"]
        ETCD1["etcd"]
    end
    subgraph N2["Node 2 - Zone B"]
        API2["API Server"]
        SCHED2["Scheduler"]
        CM2["Controller"]
        ETCD2["etcd"]
    end
    subgraph N3["Node 3 - Zone C"]
        API3["API Server"]
        SCHED3["Scheduler"]
        CM3["Controller"]
        ETCD3["etcd"]
    end
    ETCD1 <-.-> ETCD2
    ETCD2 <-.-> ETCD3
    ETCD1 <-.-> ETCD3
```
Key Takeaways 🎯
- High Availability = No single point of failure
- Control Plane HA = Multiple control planes behind a load balancer
- Stacked etcd = etcd on same nodes (simpler)
- External etcd = etcd on separate nodes (more robust)
- Quorum = Majority must agree (floor(n/2) + 1)
- Always use odd numbers for etcd cluster size
Your Next Step
You now understand why Kubernetes needs multiple brains and how they work together.
The best way to remember this?
Think of the restaurant:
- 3 chefs (control planes)
- 3 recipe books that sync (etcd with quorum)
- Never let one failure close the kitchen!
You've got this! 🚀
