🏥 Kubernetes Cluster Health: Be the Doctor for Your Computer City!
The Story: Your Computer City Needs a Health Checkup!
Imagine you run a magical city made of computers. This city has many buildings (called Nodes), and in each building, little workers (called Pods) do important jobs. There’s also a Mayor’s Office (the API Server) that tells everyone what to do.
But what happens when a building gets sick? Or when the Mayor’s Office phone stops working? The whole city could stop!
That’s why YOU need to be the City Doctor — checking that everything is healthy, finding problems early, and fixing them before anyone notices!
🏢 Node Health Monitoring: Are Your Buildings Healthy?
What is a Node?
A Node is like a building in your city. Each building has:
- Electricity (CPU power)
- Storage rooms (Memory)
- Loading docks (Network)
Simple Example: Checking if a Building is OK
kubectl get nodes
This is like walking past each building and asking: “Are you open today?”
You’ll see something like:
NAME STATUS ROLES AGE VERSION
node-1 Ready worker 5d v1.28.0
node-2 Ready worker 5d v1.28.0
node-3 NotReady worker 5d v1.28.0
- ✅ Ready = Building is open and working!
- ❌ NotReady = Building is closed — something is wrong!
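If your city has many buildings, reading the list by eye gets tiring. Here is a minimal sketch of a helper that prints only the buildings that are not open. The function name `not_ready_nodes` is made up for this example, and it assumes the STATUS value is the second column, as in the output above. Note that a cordoned node (shown as `Ready,SchedulingDisabled`) would also be listed, which may or may not be what you want.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every node whose STATUS column
# is not exactly "Ready". Reads `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (needs access to a cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

With the three example nodes above, this would print only `node-3`.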
Real Life Example
Think of it like checking if a shop is open:
- You walk by and the lights are ON → Ready ✅
- The lights are OFF and door is locked → NotReady ❌
🌡️ Node Status and Conditions: What’s Wrong with This Building?
When a building says “I’m not feeling well,” you need to know exactly what hurts. Kubernetes tells you with Conditions.
The 4 Important Health Checks
graph TD
  A["Node Health Check"] --> B["Ready?"]
  A --> C["Memory OK?"]
  A --> D["Disk OK?"]
  A --> E["Network OK?"]
  B -->|True| F["✅ Can do work"]
  C -->|False| G["🧠 MemoryPressure"]
  D -->|False| H["💾 DiskPressure"]
  E -->|False| I["🌐 NetworkUnavailable"]
| Condition | What It Means | Like This in Real Life |
|---|---|---|
| Ready | Node can accept work | Shop is open for business |
| MemoryPressure | Running out of memory | Storage room is too full |
| DiskPressure | Running out of disk space | File cabinets are overflowing |
| NetworkUnavailable | Can’t talk to network | Phone lines are cut |
How to Check a Node’s Health Details
kubectl describe node node-1
This shows you everything about that building — like reading its full medical report!
Look for the Conditions section:
Conditions:
Type Status
---- ------
MemoryPressure False
DiskPressure False
NetworkUnavailable False
Ready True
Good news! All conditions are healthy:
- MemoryPressure = False (memory is fine!)
- DiskPressure = False (disk is fine!)
- NetworkUnavailable = False (network is fine!)
- Ready = True (everything works!)
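The rule above (Ready should be True, every other condition should be False) can be sketched as a small filter. This is only an illustration: the function name `unhealthy_conditions` is made up, and it assumes the conditions arrive as `Type=Status` lines, which the `jsonpath` query in the usage comment should produce.

```shell
#!/bin/sh
# Hypothetical helper: read "Type=Status" lines on stdin and print the
# conditions that look unhealthy. "Ready" is healthy when True; every
# other condition is treated as healthy when False.
unhealthy_conditions() {
  awk -F= '
    NF == 0 { next }                                  # skip blank lines
    $1 == "Ready" && $2 != "True"  { print $1 " is " $2 }
    $1 != "Ready" && $2 != "False" { print $1 " is " $2 }
  '
}

# Usage (needs access to a cluster; node-1 is the example node name):
#   kubectl get node node-1 \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

An empty result means the building passed its checkup. A status of `Unknown` is reported too, since it usually means the node has stopped sending updates.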
🏙️ Cluster Health Monitoring: Is the Whole City OK?
Now let’s zoom out! Instead of checking one building, let’s check the entire city.
Quick City Overview
kubectl get nodes -o wide
This shows ALL buildings with extra details like their addresses (internal and external IP), operating system, kernel version, and container runtime.
Counting Healthy vs Sick Buildings
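# -w matches "Ready" as a whole word, so "NotReady" lines are not counted as Ready
kubectl get nodes --no-headers | grep -cw Ready
kubectl get nodes --no-headers | grep -c NotReady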
Simple Example:
- If you have 10 buildings and 2 say “NotReady”…
- That’s like 2 shops closed on Main Street — you need to investigate!
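For a one-line city report, the counting can be done in a single pass. The sketch below compares the STATUS column exactly instead of searching the whole line, so nothing is counted twice. The function name `node_summary` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: summarize `kubectl get nodes --no-headers` output.
# Any node whose STATUS is not exactly "Ready" is counted as not ready.
node_summary() {
  awk '
    NF == 0 { next }              # ignore blank lines
    { total++ }
    $2 == "Ready" { ready++ }
    END { printf "ready=%d not_ready=%d total=%d\n", ready, total - ready, total }
  '
}

# Usage (needs access to a cluster):
#   kubectl get nodes --no-headers | node_summary
```

For a city of 10 buildings with 2 closed, the report would read `ready=8 not_ready=2 total=10`.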
Using Metrics Server: The City Thermometer 🌡️
Want to see how hard each building is working? This only works if the Metrics Server add-on is installed in your cluster.
kubectl top nodes
Output (simplified; the real output also shows CPU% and MEMORY% columns):
NAME CPU(cores) MEMORY
node-1 500m 1200Mi
node-2 200m 800Mi
This is like checking:
- How many lights are on in each building (CPU)
- How full are the storage rooms (Memory)
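To spot overworked buildings automatically, you could filter on the percentage columns. This is a sketch under assumptions: the function name `busy_nodes` and the default limit of 80 are made up, and it assumes `kubectl top nodes --no-headers` prints the name, CPU cores, CPU%, memory, and MEMORY% in that order.

```shell
#!/bin/sh
# Hypothetical helper: print nodes whose CPU% or MEMORY% is at or above
# a limit (default 80). Reads `kubectl top nodes --no-headers` on stdin.
# Assumed columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
busy_nodes() {
  limit="${1:-80}"
  awk -v limit="$limit" '
    # Adding 0 turns a value like "85%" into the number 85.
    ($3 + 0) >= limit || ($5 + 0) >= limit { print $1, "cpu=" $3, "mem=" $5 }
  '
}

# Usage (needs access to a cluster with Metrics Server):
#   kubectl top nodes --no-headers | busy_nodes 90
```

The right limit depends on your workloads, so treat 80 as a starting point, not a rule.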
📞 API Server Health Checks: Is the Mayor’s Office Working?
The API Server is like the Mayor’s Office. Every order, every request, every question goes through it. If it stops working, your whole city is stuck!
Simple Test: Can We Call the Mayor?
kubectl cluster-info
If it responds, the Mayor is answering! You’ll see:
Kubernetes control plane is running at https://10.0.0.1:6443
Health Endpoint: The Mayor’s Direct Line
The API Server has special “health phones” you can call:
# Is the API server alive?
kubectl get --raw='/livez'
# Is it ready to work?
kubectl get --raw='/readyz'
# Legacy overall check (deprecated since Kubernetes v1.16 in favor of /livez and /readyz)
kubectl get --raw='/healthz'
What they mean:
| Endpoint | Question | Good Answer |
|---|---|---|
| /livez | “Are you alive?” | ok |
| /readyz | “Ready for work?” | ok |
| /healthz | “Overall health?” | ok |
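Calling each health phone by hand works, but a tiny loop can do it for you. Here is a minimal sketch, assuming `kubectl` is already configured for your cluster. The function name `check_api_endpoints` is made up, and `/healthz` could be added to the list the same way.

```shell
#!/bin/sh
# Hypothetical helper: call each health endpoint and report whether the
# answer was exactly "ok". Returns non-zero if any endpoint failed.
check_api_endpoints() {
  failed=0
  for endpoint in /livez /readyz; do
    if answer=$(kubectl get --raw="$endpoint" 2>/dev/null) && [ "$answer" = "ok" ]; then
      echo "$endpoint: ok"
    else
      echo "$endpoint: FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_api_endpoints || echo "The Mayor's Office needs attention!"
```

Returning a non-zero status makes the check easy to reuse in other scripts or scheduled jobs.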
Real Example: Checking Everything
kubectl get --raw='/readyz?verbose'
This gives you a detailed report of every part:
[+]ping ok
[+]etcd ok
[+]poststarthook/start ok
readyz check passed
Each [+] is a healthy system. If you see [-], something needs attention!
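In a long report, the one `[-]` line is easy to miss. The sketch below pulls out just the names of the failing checks. It is an illustration only: the function name `failed_checks` is made up, and it assumes each check sits at the start of its own line. When the API server is unhealthy, `kubectl` may print the report as part of an error message instead, so the usage comment merges stderr into the pipe as a precaution.

```shell
#!/bin/sh
# Hypothetical helper: read a verbose health report on stdin and print
# the name of every check marked with [-].
failed_checks() {
  sed -n 's/^\[-\] *\([^ ]*\).*/\1/p'
}

# Usage (needs access to a cluster):
#   kubectl get --raw='/readyz?verbose' 2>&1 | failed_checks
```

No output means every check in the report carried a `[+]`.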
🔍 Quick Troubleshooting Flow
graph TD
  A["🏥 Problem Detected!"] --> B{Can I reach<br>API Server?}
  B -->|No| C["Check API Server<br>/livez /readyz"]
  B -->|Yes| D{Are Nodes Ready?}
  D -->|No| E["Run: kubectl<br>describe node"]
  D -->|Yes| F{Are Pods<br>Running?}
  E --> G["Check Conditions:<br>Memory/Disk/Network"]
  F -->|No| H["Check Pod logs<br>and events"]
  F -->|Yes| I["✅ Cluster is<br>Healthy!"]
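The same flow can be sketched as a script that stops at the first problem it finds. Treat it as a starting point under assumptions: the function name `diagnose` is made up, `kubectl` must already be configured, and a Pod in the Running phase can still be unhealthy (for example, one that keeps restarting), so a clean result is not a guarantee.

```shell
#!/bin/sh
# Hypothetical walk through the troubleshooting flow:
# API server first, then nodes, then pods.
diagnose() {
  if ! kubectl get --raw='/readyz' >/dev/null 2>&1; then
    echo "API server is unreachable or not ready: check /livez and /readyz"
    return 1
  fi

  bad_nodes=$(kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }')
  if [ -n "$bad_nodes" ]; then
    # Unquoted on purpose so the node names end up on one line.
    echo "Nodes not Ready, run kubectl describe node on:" $bad_nodes
    return 1
  fi

  bad_pods=$(kubectl get pods --all-namespaces --no-headers \
    --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null)
  if [ -n "$bad_pods" ]; then
    echo "Pods not Running, check their logs and events:"
    echo "$bad_pods"
    return 1
  fi

  echo "Cluster looks healthy"
}

# Usage:
#   diagnose
```

Checking in this order mirrors the diagram: there is little point in inspecting Pods if the Mayor's Office cannot answer the phone.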
🎯 The Doctor’s Checklist
Every good City Doctor checks these things:
- 🏢 Node Health → kubectl get nodes
- 🔍 Node Details → kubectl describe node <name>
- 📊 Resource Usage → kubectl top nodes
- 📞 API Server → kubectl get --raw='/healthz'
💡 Remember This!
| What to Check | Command | What You’re Looking For |
|---|---|---|
| All buildings | kubectl get nodes | All should be “Ready” |
| Building health | kubectl describe node | Conditions all healthy |
| City overview | kubectl top nodes | CPU/Memory not maxed |
| Mayor’s office | kubectl get --raw='/healthz' | Should say “ok” |
🌟 You’re Now a Cluster Doctor!
You learned how to:
- ✅ Check if individual Nodes (buildings) are healthy
- ✅ Understand Node Conditions (memory, disk, network)
- ✅ Monitor the whole Cluster (city) at once
- ✅ Verify the API Server (Mayor’s office) is working
Next time your Kubernetes city feels sick, you know exactly where to look! 🏥🔍
