🔧 Kubernetes Troubleshooting: When Pods Go Wrong
The Story of a Pod Hospital 🏥
Imagine Kubernetes is a hospital for little robot workers called Pods. These robots want to do their jobs—running your apps—but sometimes they get sick! When a robot (Pod) gets sick, it shows error symptoms. Your job? Be the Pod Doctor and heal them!
Let’s meet the six most common sicknesses that happen to Pods:
🔄 CrashLoopBackOff: The Robot That Keeps Falling Down
What’s Happening?
Think of a toy robot that tries to stand up, falls down, tries again, falls down again… over and over. That’s CrashLoopBackOff!
Your Pod starts, crashes, Kubernetes restarts it, and it crashes again. This loop keeps going with longer and longer waits between restarts.
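That growing wait is an exponential back-off: roughly 10 seconds at first, doubling after each crash, capped at five minutes. A small sketch of the pattern (the exact values are approximate):

```python
# Sketch of the restart back-off: the delay roughly doubles after
# each crash and is capped at five minutes (300 seconds).
def backoff_delays(crashes, base=10, cap=300):
    delays = []
    delay = base
    for _ in range(crashes):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

So a Pod that keeps crashing ends up being restarted only once every five minutes, which is why `CrashLoopBackOff` can look "stuck" between attempts.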
Why Does This Happen?
```mermaid
graph TD
    A["Pod Starts"] --> B["App Crashes"]
    B --> C["Kubernetes Waits"]
    C --> D["Kubernetes Restarts Pod"]
    D --> A
    style A fill:#4ECDC4
    style B fill:#FF6B6B
    style C fill:#FFE66D
    style D fill:#4ECDC4
```
Common Causes:
- 🐛 Bug in your code - The app has an error and exits
- 📦 Missing files - App can’t find something it needs
- 🔑 Wrong secrets - Database password is incorrect
- 💾 Can’t connect - Database or service unreachable
How to Fix It
Step 1: Check the logs
```shell
kubectl logs pod-name
kubectl logs pod-name --previous
```
Step 2: Look at events
```shell
kubectl describe pod pod-name
```
Step 3: Find the real error
- Read the last lines before the crash
- Fix the bug in your application
- Make sure all secrets and configs are correct
Real Example
```yaml
# Pod keeps crashing because
# it can't find DATABASE_URL
containers:
  - name: my-app
    env:
      - name: DATABASE_URL
        value: "" # Empty! That's the problem!
```
The Fix: Add the correct database URL!
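One way to sketch that fix is to pull the URL from a Secret instead of hard-coding it (the names `my-app` and `db-credentials` here are hypothetical placeholders):

```yaml
# Hypothetical fix: read DATABASE_URL from a Secret
# instead of leaving the value empty.
containers:
  - name: my-app
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: database-url
```

This keeps the credential out of the Pod spec, and a missing Secret then fails loudly at startup instead of silently passing an empty string.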
📦 ImagePullBackOff: Can’t Get the Robot Parts
What’s Happening?
Imagine ordering robot parts from a store, but:
- The store doesn’t exist
- You gave the wrong address
- You don’t have permission to buy
ImagePullBackOff means Kubernetes can’t download the container image your Pod needs!
Why Does This Happen?
```mermaid
graph TD
    A["Pod Needs Image"] --> B{Can Find Image?}
    B -->|No| C["ImagePullBackOff"]
    B -->|Yes| D{Has Permission?}
    D -->|No| C
    D -->|Yes| E["Pod Runs!"]
    style C fill:#FF6B6B
    style E fill:#4ECDC4
```
Common Causes:
- ❌ Typo in image name - `nignx` instead of `nginx`
- 🏷️ Wrong tag - `nginx:v999` doesn’t exist
- 🔐 Private registry - Need login credentials
- 🌐 Network problems - Can’t reach the registry
How to Fix It
Step 1: Check the image name
```shell
kubectl describe pod pod-name | grep Image
```
Step 2: Test manually
```shell
docker pull your-image:tag
```
Step 3: Check for typos
```yaml
# Wrong:
image: nignx:latest

# Correct:
image: nginx:latest
```
Step 4: Add image pull secrets (for private registries)
```yaml
spec:
  imagePullSecrets:
    - name: my-registry-secret
```
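For that to work, the secret itself has to exist first. A minimal sketch of creating it (the server, username, and password here are placeholders for your registry's real credentials):

```shell
# Hypothetical example: create the registry secret
# referenced by imagePullSecrets above.
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypassword
```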
⏳ Pending Pod Issues: The Robot Waiting in Line
What’s Happening?
Your robot is ready to work but there’s no desk available! The Pod is created but stays in “Pending” state—waiting, waiting, waiting…
Why Does This Happen?
```mermaid
graph TD
    A["Pod Created"] --> B{Resources Available?}
    B -->|No CPU/Memory| C["Pending - No Resources"]
    B -->|No Matching Node| D["Pending - No Node"]
    B -->|PVC Not Ready| E["Pending - Volume Issue"]
    C --> F["Pod Waits..."]
    D --> F
    E --> F
    style F fill:#FFE66D
```
Common Causes:
- 💻 Not enough CPU or memory - Cluster is full
- 🏷️ Node selector mismatch - No node has the required label
- 💾 Volume not ready - PersistentVolumeClaim pending
- 🚫 Taints and tolerations - Pod not allowed on available nodes
How to Fix It
Step 1: See why it’s pending
```shell
kubectl describe pod pod-name
```
Look at the Events section at the bottom!
Step 2: Check resources
```shell
kubectl describe nodes | grep -A 5 "Allocated"
```
Step 3: Solutions
For no resources:
```yaml
# Reduce your requests
resources:
  requests:
    memory: "64Mi" # Ask for less
    cpu: "100m"
```
For node selector issues:
```shell
# Check available labels
kubectl get nodes --show-labels
```
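If the Pod really does need a specific node, or the only free nodes are tainted, the spec can say so. A sketch of both (the `disktype` label and `dedicated` taint are hypothetical names, not something your cluster necessarily has):

```yaml
# Hypothetical example: run only on nodes labeled disktype=ssd,
# and tolerate a "dedicated=batch" taint on those nodes.
spec:
  nodeSelector:
    disktype: ssd
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
```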
💥 OOMKilled: The Robot Ate Too Much Memory
What’s Happening?
Imagine giving a robot a small backpack, but it tries to stuff a giant teddy bear inside. The backpack explodes!
OOMKilled = Out Of Memory Killed. Your app used more memory than allowed, so Kubernetes stopped it.
Why Does This Happen?
```mermaid
graph TD
    A["App Uses Memory"] --> B{Under Limit?}
    B -->|Yes| C["App Runs Happy"]
    B -->|No - Over Limit| D["OOMKilled!"]
    D --> E["Pod Restarts"]
    style C fill:#4ECDC4
    style D fill:#FF6B6B
```
Common Causes:
- 📊 Memory limit too low - App needs more than you allowed
- 🐛 Memory leak - App keeps using more and more memory
- 📈 Traffic spike - Sudden load uses extra memory
How to Fix It
Step 1: Confirm the problem
```shell
kubectl describe pod pod-name | grep OOMKilled
kubectl get pod pod-name -o yaml | grep -A3 lastState
```
Step 2: Check current limits
```yaml
resources:
  limits:
    memory: "128Mi" # Too small?
  requests:
    memory: "64Mi"
```
Step 3: Increase memory (if needed)
```yaml
resources:
  limits:
    memory: "512Mi" # Give more room
  requests:
    memory: "256Mi"
```
Step 4: Fix memory leaks
- Profile your application
- Check for objects that never get cleaned up
🔧 CreateContainerConfigError: Wrong Robot Instructions
What’s Happening?
You’re giving the robot assembly instructions, but some parts are missing or the instructions have errors. The robot can’t even start!
CreateContainerConfigError means Kubernetes can’t configure the container properly before starting it.
Why Does This Happen?
```mermaid
graph TD
    A["Pod Starting"] --> B{Config Valid?}
    B -->|Secret Missing| C["ConfigError"]
    B -->|ConfigMap Missing| C
    B -->|Mount Error| C
    B -->|All Good| D["Container Starts"]
    style C fill:#FF6B6B
    style D fill:#4ECDC4
```
Common Causes:
- 🔑 Secret doesn’t exist - Referenced secret not found
- 📄 ConfigMap missing - Referenced ConfigMap not found
- 📁 Key not found - Secret/ConfigMap exists but key is wrong
- 🔒 Wrong permissions - Can’t access the resource
How to Fix It
Step 1: Find the exact error
```shell
kubectl describe pod pod-name
```
Look for messages like:
- `secret "my-secret" not found`
- `configmap "my-config" not found`
Step 2: Check if resources exist
```shell
kubectl get secrets
kubectl get configmaps
```
Step 3: Create missing resources
```shell
# Create a secret
kubectl create secret generic my-secret \
  --from-literal=password=mypassword

# Create a ConfigMap
kubectl create configmap my-config \
  --from-literal=key=value
```
Step 4: Verify the key names
```yaml
# Make sure this key actually exists
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: my-secret
        key: password # Does this key exist?
```
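To see which keys the Secret really holds, you can inspect it directly (assuming the `my-secret` name from the example above):

```shell
# List the key names stored in the Secret
kubectl get secret my-secret -o jsonpath='{.data}'
kubectl describe secret my-secret   # shows key names and sizes
```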
🚪 Pod Stuck Terminating: The Robot Won’t Leave
What’s Happening?
It’s closing time, but one robot refuses to leave the building! You told the Pod to stop, but it’s stuck in “Terminating” state forever.
Why Does This Happen?
```mermaid
graph TD
    A["Delete Pod"] --> B["Send SIGTERM"]
    B --> C{App Responds?}
    C -->|Yes| D["Pod Stops"]
    C -->|No| E["Wait Grace Period"]
    E --> F["Send SIGKILL"]
    F --> G{Still Stuck?}
    G -->|Finalizers| H["Terminating Forever"]
    G -->|Volume Issues| H
    style D fill:#4ECDC4
    style H fill:#FF6B6B
```
Common Causes:
- ⏰ App ignores shutdown signal - Doesn’t handle SIGTERM
- 🔗 Finalizers blocking - Cleanup tasks stuck
- 💾 Volume unmount issues - Can’t detach storage
- 🌐 Network problems - Webhook timeouts
How to Fix It
Step 1: Check what’s blocking
```shell
kubectl describe pod pod-name
kubectl get pod pod-name -o yaml | grep finalizers
```
Step 2: Wait for grace period (default 30 seconds)
Step 3: Force delete (use carefully!)
```shell
kubectl delete pod pod-name --grace-period=0 --force
```
Step 4: Remove finalizers (last resort)
```shell
kubectl patch pod pod-name \
  -p '{"metadata":{"finalizers":null}}'
```
⚠️ Warning: Force deleting can leave resources behind. Always try to fix the root cause first!
🎯 Quick Diagnosis Flowchart
```mermaid
graph TD
    A["Pod Not Working"] --> B{What's the Status?}
    B -->|CrashLoopBackOff| C["Check logs for app errors"]
    B -->|ImagePullBackOff| D["Verify image name & auth"]
    B -->|Pending| E["Check resources & node selectors"]
    B -->|OOMKilled| F["Increase memory limits"]
    B -->|CreateContainerConfigError| G["Check secrets & configmaps"]
    B -->|Terminating| H["Check finalizers & volumes"]
    style A fill:#667eea
    style C fill:#4ECDC4
    style D fill:#4ECDC4
    style E fill:#4ECDC4
    style F fill:#4ECDC4
    style G fill:#4ECDC4
    style H fill:#4ECDC4
```
🏆 You’re Now a Pod Doctor!
Remember these golden rules:
- Always start with `kubectl describe pod` - It tells the story
- Check logs with `kubectl logs` - See what your app says
- Don’t panic! - Every error has a solution
- Learn the patterns - Most issues fall into these 6 categories
You’ve got this! Every Kubernetes expert started by fixing these same errors. Each bug you fix makes you stronger! 💪
