Container Lifecycle - Observability 🔍
The Story of the Magic Window
Imagine you have a giant toy factory with hundreds of tiny robot workers (containers). They’re all busy making toys, but you can’t see inside each little robot. How do you know if they’re happy? If they’re working hard? If something is broken?
You need a magic window that lets you peek inside! That’s exactly what observability is — your superpower to see what’s happening inside your Kubernetes containers.
🎭 The Analogy: Your Robot Factory
Throughout this guide, think of:
- Containers = Tiny robot workers in your factory
- Logs = The robots’ diaries (what they did)
- Metrics = Their health reports (how they feel)
- Alerts = Emergency bells when something’s wrong
📓 Container Logging
What Is It?
Every container writes a diary. Each time something happens, it writes it down. This diary is called a log.
Simple Example
When your robot worker makes a toy:
[INFO] Started making teddy bear
[INFO] Added fluffy stuffing
[INFO] Sewed the button eyes
[SUCCESS] Teddy bear complete!
When something goes wrong:
[ERROR] Oops! Ran out of stuffing!
How to See Container Logs
kubectl logs my-robot-pod
This shows you what your robot wrote in its diary!
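What does a robot that keeps a readable diary look like? Here is a minimal sketch of a Pod that simply writes to stdout, which is all Kubernetes needs to capture its logs. The name my-robot-pod matches the command above; the busybox image and the little loop are just for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-robot-pod
spec:
  containers:
    - name: robot
      image: busybox:1.36          # tiny image, used here only for illustration
      command: ["/bin/sh", "-c"]
      args:
        - |
          # Write a diary entry to stdout every 5 seconds;
          # the kubelet captures stdout/stderr automatically.
          while true; do
            echo "[INFO] Still making toys at $(date)"
            sleep 5
          done
```

Apply it, then run kubectl logs my-robot-pod (add -f to follow the diary live).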
Real Life
- Your web app crashes? Check the logs to see what went wrong
- User says they got an error? Logs tell you exactly what happened
graph TD A["Container Runs"] --> B["Writes to stdout/stderr"] B --> C["Kubelet Captures"] C --> D["You Read with kubectl logs"]
🏭 Cluster-Level Logging
The Problem
What if you have 1000 robots? You can’t read 1000 diaries one by one!
The Solution
Build a central library where ALL diaries are collected automatically.
Simple Example
Instead of going to each robot:
Robot 1: "Made 5 toys"
Robot 2: "Made 3 toys"
Robot 3: "ERROR: Out of paint!"
You go to ONE place and search:
Show me all robots that had errors today
→ Robot 3: "ERROR: Out of paint!"
Popular Tools
- Fluentd - The librarian who collects all diaries
- Elasticsearch - The giant bookshelf that stores them
- Kibana - The search tool to find entries
graph TD A["Robot 1 Logs"] --> D["Fluentd"] B["Robot 2 Logs"] --> D C["Robot 3 Logs"] --> D D --> E["Elasticsearch"] E --> F["Kibana Dashboard"]
🤝 Sidecar Logging Pattern
What Is a Sidecar?
Imagine your robot worker has a tiny helper sitting right next to it. This helper’s ONLY job is to take the robot’s diary and send it to the library.
Why Use a Sidecar?
- Your robot doesn’t need to know about the library
- The robot just writes; the helper does the rest
- If you change libraries, only update the helper!
Simple Example
Pod:
- main-robot: # Does the real work
writes: logs to /var/log/app.log
- log-helper: # The sidecar
reads: /var/log/app.log
sends: to central logging
Visual
graph LR A["Main Container"] -->|Writes logs| B["Shared Volume"] B -->|Reads logs| C["Sidecar Container"] C -->|Ships to| D["Central Logging"]
Real Life
Your app writes logs to a file. A Fluentd sidecar container reads that file and sends logs to Elasticsearch. Your app never needs to know about Elasticsearch!
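Here's a minimal sketch of that pattern as a real Pod: the main container writes its diary to a file on a shared volume, and the sidecar simply streams that file. The busybox images and the tail-based sidecar stand in for a real Fluentd shipper; all names are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: robot-with-helper
spec:
  volumes:
    - name: diary                      # shared scratch space for the log file
      emptyDir: {}
  containers:
    - name: main-robot                 # does the real work, writes to a file
      image: busybox:1.36
      command: ["/bin/sh", "-c"]
      args:
        - while true; do echo "[INFO] made a toy" >> /var/log/app.log; sleep 5; done
      volumeMounts:
        - name: diary
          mountPath: /var/log
    - name: log-helper                 # the sidecar: reads the file and ships it onward
      image: busybox:1.36
      command: ["/bin/sh", "-c"]
      args:
        - tail -n+1 -F /var/log/app.log   # a real shipper (e.g. Fluentd) would forward this to central logging
      volumeMounts:
        - name: diary
          mountPath: /var/log
          readOnly: true
```

Because the sidecar streams the file to its own stdout, you can even read it with kubectl logs robot-with-helper -c log-helper.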
📊 Metrics Overview
Logs vs Metrics
- Logs = Story of what happened (words)
- Metrics = Numbers that measure things
Simple Example
Log: “Made a teddy bear at 2:30 PM”
Metric: toys_made = 47
Why Metrics?
Numbers are easy to:
- Add up
- Make graphs
- Set alerts
Common Metrics
| What | Metric |
|---|---|
| CPU usage | cpu_percent = 75% |
| Memory | memory_mb = 512 |
| Requests | requests_per_second = 100 |
| Errors | error_count = 3 |
graph TD A["Container"] -->|Every 15 sec| B["Collect Metrics"] B --> C["CPU: 75%"] B --> D["Memory: 512MB"] B --> E["Requests: 100/sec"]
📈 Metrics Server
What Is It?
The official health reporter for Kubernetes. It's a cluster add-on (often installed separately) that checks on every robot and writes down its health numbers.
What It Tracks
- CPU usage (how hard is the brain working?)
- Memory usage (how full is the memory?)
Simple Example
kubectl top pods
Output:
NAME      CPU(cores)   MEMORY(bytes)
robot-1   100m         256Mi
robot-2   50m          128Mi
robot-3   200m         512Mi
Now you know robot-3 is working the hardest!
Why You Need It
- Auto-scaling (add more robots when busy, see the sketch below)
- Debugging (find the slow robot)
- Capacity planning (do we need more factory space?)
graph TD A["Metrics Server"] --> B["Collects from all Pods"] B --> C["kubectl top"] B --> D["Horizontal Pod Autoscaler"]
🔄 Metrics Pipeline
What Is It?
The assembly line that moves metrics from your containers to your dashboards.
The Journey
- Generate - Container creates a metric
- Collect - Something grabs it
- Store - Save it somewhere
- Query - Ask questions about it
- Visualize - Show pretty graphs
Simple Example
Container makes metric: requests_total = 1000
↓
Prometheus scrapes it
↓
Stored in time-series database
↓
Grafana asks: "Show me requests over time"
↓
Beautiful graph appears! 📈
graph TD A["Your App"] -->|Expose /metrics| B["Prometheus"] B -->|Store| C["Time Series DB"] C -->|Query| D["Grafana"] D -->|Display| E["Dashboard"]
🔥 Prometheus Integration
What Is Prometheus?
The super detective that visits every robot, asks “How are you?”, and writes down all the answers.
How It Works
- Your app exposes metrics at /metrics
- Prometheus visits and reads them (called “scraping”)
- Prometheus stores the data
- You query it later
Simple Example
Your app’s /metrics endpoint:
# HELP toys_made Total toys made
toys_made 47
# HELP errors_total Total errors
errors_total 3
Prometheus config:
scrape_configs:
  - job_name: 'toy-factory'
    static_configs:
      - targets: ['robot-1:8080']
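Listing every robot by hand doesn't scale, so in a real cluster Prometheus usually discovers Pods on its own. A minimal sketch using Kubernetes service discovery, assuming Pods opt in with the common (but purely conventional) prometheus.io/scrape annotation:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod                # discover every Pod in the cluster
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```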
Query Examples
# How many toys made?
toys_made
# Error rate last 5 minutes?
rate(errors_total[5m])
graph TD A["App /metrics"] -->|Scrape every 15s| B["Prometheus"] B --> C["Store Time Series"] C --> D["PromQL Queries"] D --> E["Grafana Dashboards"] D --> F["Alertmanager"]
🚨 Alerting Basics
What Is Alerting?
The alarm system that wakes you up when something’s wrong.
How It Works
- You set a rule: “If errors > 10, alert me!”
- Prometheus checks this rule constantly
- When the rule is true, it sends an alert
- You get a message (Slack, email, PagerDuty)
Simple Example
Alert rule:
groups:
  - name: factory-alerts
    rules:
      - alert: TooManyErrors
        expr: errors_total > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Too many errors!"
Translation: “If errors stay above 10 for 5 minutes, ring the alarm!”
Alert Lifecycle
graph TD A["Prometheus"] -->|Checks rule| B{errors > 10?} B -->|No| A B -->|Yes for 5m| C["Alert Fires!"] C --> D["Alertmanager"] D --> E["Slack Message"] D --> F["Email"] D --> G["PagerDuty"]
Best Practices
- Don’t alert on everything - Only important stuff
- Use “for” duration - Avoid false alarms
- Include helpful info - Tell people what to do
🎯 Quick Summary
| Topic | What It Does | One-Liner |
|---|---|---|
| Container Logging | Records what happens | Robot’s diary |
| Cluster Logging | Collects all logs | Central library |
| Sidecar Pattern | Helper ships logs | Robot’s assistant |
| Metrics | Numbers about health | Health report |
| Metrics Server | Built-in K8s metrics | Official reporter |
| Metrics Pipeline | Flow from app to graph | Assembly line |
| Prometheus | Scrapes & stores metrics | Super detective |
| Alerting | Notifies on problems | Alarm system |
🚀 You Did It!
Now you understand how to see inside your Kubernetes containers:
- Logs tell you the story
- Metrics give you the numbers
- Alerts wake you up when needed
You have the magic window into your robot factory. Go forth and observe! 🔍✨
