🔧 Kubernetes Troubleshooting: When Pods Go Wrong
The Story of a Pod Hospital 🏥
Imagine Kubernetes is a hospital for little robot workers called Pods. These robots want to do their jobs (running your apps), but sometimes they get sick! When a robot (Pod) gets sick, it shows error symptoms. Your job? Be the Pod Doctor and heal them!
Let's meet the six most common sicknesses that happen to Pods:
🔄 CrashLoopBackOff: The Robot That Keeps Falling Down
What's Happening?
Think of a toy robot that tries to stand up, falls down, tries again, falls down again… That's CrashLoopBackOff!
Your Pod starts, crashes, Kubernetes restarts it, and it crashes again. The loop keeps going with exponentially longer waits between restarts, capped at five minutes.
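You can spot this state straight from the Pod list; the RESTARTS count keeps climbing. A sketch of what you might see (the Pod name, counts, and timings are illustrative):
kubectl get pods
# NAME     READY   STATUS             RESTARTS   AGE
# my-app   0/1     CrashLoopBackOff   5          3m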
Why Does This Happen?
graph TD A["Pod Starts"] --> B["App Crashes"] B --> C["Kubernetes Waits"] C --> D["Kubernetes Restarts Pod"] D --> A style A fill:#4ECDC4 style B fill:#FF6B6B style C fill:#FFE66D style D fill:#4ECDC4
Common Causes:
- 🐛 Bug in your code - The app has an error and exits
- 📦 Missing files - App can't find something it needs
- 🔑 Wrong secrets - Database password is incorrect
- 💾 Can't connect - Database or service unreachable
How to Fix It
Step 1: Check the logs
kubectl logs pod-name              # what the app printed in the current container
kubectl logs pod-name --previous   # logs from the previous, crashed container
Step 2: Look at events
kubectl describe pod pod-name   # check the Events section at the bottom
Step 3: Find the real error
- Read the last lines before the crash
- Fix the bug in your application
- Make sure all secrets and configs are correct
Real Example
# Pod keeps crashing because
# it can't find DATABASE_URL
containers:
- name: my-app
  env:
  - name: DATABASE_URL
    value: ""  # Empty! That's the problem!
The Fix: Add the correct database URL!
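A minimal sketch of the fix, assuming the connection string lives in a hypothetical Secret named db-credentials under a key called url:
containers:
- name: my-app
  env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: db-credentials  # hypothetical Secret holding the connection string
        key: url              # hypothetical key name
This way the URL never sits in your manifest as plain text, and a missing Secret fails loudly instead of silently passing an empty value.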
📦 ImagePullBackOff: Can't Get the Robot Parts
What's Happening?
Imagine ordering robot parts from a store, but:
- The store doesn't exist
- You gave the wrong address
- You don't have permission to buy
ImagePullBackOff means Kubernetes can't download the container image your Pod needs!
Why Does This Happen?
graph TD A["Pod Needs Image"] --> B{Can Find Image?} B -->|No| C["ImagePullBackOff"] B -->|Yes| D{Has Permission?} D -->|No| C D -->|Yes| E["Pod Runs!"] style C fill:#FF6B6B style E fill:#4ECDC4
Common Causes:
- ❌ Typo in image name - nignx instead of nginx
- 🏷️ Wrong tag - nginx:v999 doesn't exist
- 🔒 Private registry - Need login credentials
- 🌐 Network problems - Can't reach the registry
How to Fix It
Step 1: Check the image name
kubectl describe pod pod-name | grep Image
Step 2: Test manually
docker pull your-image:tag
Step 3: Check for typos
# Wrong:
image: nignx:latest
# Correct:
image: nginx:latest
Step 4: Add image pull secrets (for private registries)
spec:
  imagePullSecrets:
  - name: my-registry-secret
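If that secret doesn't exist yet, you can create it with kubectl; the server, username, and password below are placeholders for your registry's real values:
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=my-user \
  --docker-password=my-password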
⏳ Pending Pod Issues: The Robot Waiting in Line
What's Happening?
Your robot is ready to work, but there's no desk available! The Pod is created but stays in the "Pending" state, waiting, waiting, waiting…
Why Does This Happen?
graph TD A["Pod Created"] --> B{Resources Available?} B -->|No CPU/Memory| C["Pending - No Resources"] B -->|No Matching Node| D["Pending - No Node"] B -->|PVC Not Ready| E["Pending - Volume Issue"] C --> F["Pod Waits..."] D --> F E --> F style F fill:#FFE66D
Common Causes:
- 💻 Not enough CPU or memory - Cluster is full
- 🏷️ Node selector mismatch - No node has the required label
- 💾 Volume not ready - PersistentVolumeClaim pending
- 🚫 Taints and tolerations - Pod not allowed on available nodes
How to Fix It
Step 1: See why it's pending
kubectl describe pod pod-name
Look at the Events section at the bottom!
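You can also pull just this Pod's events, which is handy on a busy cluster; a sketch, assuming the Pod is named my-pod:
kubectl get events --field-selector involvedObject.name=my-pod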
Step 2: Check resources
kubectl describe nodes | grep -A 5 "Allocated"
Step 3: Solutions
For no resources:
# Reduce your requests
resources:
  requests:
    memory: "64Mi"  # Ask for less
    cpu: "100m"
For node selector issues:
# Check available labels
kubectl get nodes --show-labels
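If no node carries the label your Pod asks for, either fix the selector or label a node. A sketch, using a hypothetical node node-1 and label disktype=ssd:
# Add the label the Pod's nodeSelector expects
kubectl label nodes node-1 disktype=ssd
# ...so that this selector in the Pod spec can match:
spec:
  nodeSelector:
    disktype: ssd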
🔥 OOMKilled: The Robot Ate Too Much Memory
What's Happening?
Imagine giving a robot a small backpack, but it tries to stuff a giant teddy bear inside. The backpack explodes!
OOMKilled = Out Of Memory Killed. Your app used more memory than allowed, so Kubernetes stopped it.
Why Does This Happen?
graph TD A["App Uses Memory"] --> B{Under Limit?} B -->|Yes| C["App Runs Happy"] B -->|No - Over Limit| D["OOMKilled!"] D --> E["Pod Restarts"] style C fill:#4ECDC4 style D fill:#FF6B6B
Common Causes:
- 📏 Memory limit too low - App needs more than you allowed
- 🐛 Memory leak - App keeps using more and more memory
- 📈 Traffic spike - Sudden load uses extra memory
How to Fix It
Step 1: Confirm the problem
kubectl describe pod pod-name | grep OOMKilled
kubectl get pod pod-name -o yaml | grep -A3 lastState
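When the container really was OOM-killed, the lastState block in that output typically reads like this (exit code 137 means the process was killed by SIGKILL, 128 + 9):
lastState:
  terminated:
    reason: OOMKilled
    exitCode: 137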
Step 2: Check current limits
resources:
  limits:
    memory: "128Mi"  # Too small?
  requests:
    memory: "64Mi"
Step 3: Increase memory (if needed)
resources:
  limits:
    memory: "512Mi"  # Give more room
  requests:
    memory: "256Mi"
Step 4: Fix memory leaks
- Profile your application
- Check for objects that never get cleaned up
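To tell a too-small limit from a genuine leak, watch usage over time; memory that climbs steadily under steady traffic suggests a leak. This sketch assumes the metrics-server add-on is installed in your cluster:
kubectl top pod pod-name
# Run it repeatedly (or via watch) and compare readings:
watch kubectl top pod pod-name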
🔧 CreateContainerConfigError: Wrong Robot Instructions
What's Happening?
You're giving the robot assembly instructions, but some parts are missing or the instructions have errors. The robot can't even start!
CreateContainerConfigError means Kubernetes can't configure the container properly before starting it.
Why Does This Happen?
graph TD A["Pod Starting"] --> B{Config Valid?} B -->|Secret Missing| C["ConfigError"] B -->|ConfigMap Missing| C B -->|Mount Error| C B -->|All Good| D["Container Starts"] style C fill:#FF6B6B style D fill:#4ECDC4
Common Causes:
- 🔒 Secret doesn't exist - Referenced secret not found
- 📄 ConfigMap missing - Referenced ConfigMap not found
- 🔑 Key not found - Secret/ConfigMap exists but key is wrong
- 🔒 Wrong permissions - Can't access the resource
How to Fix It
Step 1: Find the exact error
kubectl describe pod pod-name
Look for messages like:
- secret "my-secret" not found
- configmap "my-config" not found
Step 2: Check if resources exist
kubectl get secrets
kubectl get configmaps
Step 3: Create missing resources
# Create a secret
kubectl create secret generic my-secret \
--from-literal=password=mypassword
# Create a ConfigMap
kubectl create configmap my-config \
--from-literal=key=value
Step 4: Verify the key names
# Make sure this key actually exists
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: my-secret
      key: password  # Does this key exist?
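To see which keys a Secret actually contains, inspect it directly; describe lists key names and sizes without revealing values:
kubectl describe secret my-secret
# Or dump it, including base64-encoded values:
kubectl get secret my-secret -o yaml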
🚪 Pod Stuck Terminating: The Robot Won't Leave
What's Happening?
It's closing time, but one robot refuses to leave the building! You told the Pod to stop, but it's stuck in the "Terminating" state forever.
Why Does This Happen?
graph TD A["Delete Pod"] --> B["Send SIGTERM"] B --> C{App Responds?} C -->|Yes| D["Pod Stops"] C -->|No| E["Wait Grace Period"] E --> F["Send SIGKILL"] F --> G{Still Stuck?} G -->|Finalizers| H["Terminating Forever"] G -->|Volume Issues| H style D fill:#4ECDC4 style H fill:#FF6B6B
Common Causes:
- ⏰ App ignores shutdown signal - Doesn't handle SIGTERM
- 🔒 Finalizers blocking - Cleanup tasks stuck
- 💾 Volume unmount issues - Can't detach storage
- 🌐 Network problems - Webhook timeouts
How to Fix It
Step 1: Check what's blocking
kubectl describe pod pod-name
kubectl get pod pod-name -o yaml | grep finalizers
Step 2: Wait for grace period (default 30 seconds)
Step 3: Force delete (use carefully!)
kubectl delete pod pod-name --grace-period=0 --force
Step 4: Remove finalizers (last resort)
kubectl patch pod pod-name \
-p '{"metadata":{"finalizers":null}}'
⚠️ Warning: Force deleting can leave resources behind. Always try to fix the root cause first!
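Prevention beats force deletion: handle SIGTERM in your app and give it room to shut down cleanly. A minimal sketch; the 60-second grace period and the preStop sleep are illustrative values, not requirements:
spec:
  terminationGracePeriodSeconds: 60  # default is 30 seconds
  containers:
  - name: my-app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # give load balancers time to drain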
🎯 Quick Diagnosis Flowchart
graph TD A["Pod Not Working"] --> B{What's the Status?} B -->|CrashLoopBackOff| C["Check logs for app errors"] B -->|ImagePullBackOff| D["Verify image name & auth"] B -->|Pending| E["Check resources & node selectors"] B -->|OOMKilled| F["Increase memory limits"] B -->|CreateContainerConfigError| G["Check secrets & configmaps"] B -->|Terminating| H["Check finalizers & volumes"] style A fill:#667eea style C fill:#4ECDC4 style D fill:#4ECDC4 style E fill:#4ECDC4 style F fill:#4ECDC4 style G fill:#4ECDC4 style H fill:#4ECDC4
🎉 You're Now a Pod Doctor!
Remember these golden rules:
- Always start with kubectl describe pod - It tells the story
- Check logs with kubectl logs - See what your app says
- Don't panic! - Every error has a solution
- Learn the patterns - Most issues fall into these six categories
You've got this! Every Kubernetes expert started by fixing these same errors. Each bug you fix makes you stronger! 💪
