
Container Lifecycle - Observability 🔍

The Story of the Magic Window

Imagine you have a giant toy factory with hundreds of tiny robot workers (containers). They’re all busy making toys, but you can’t see inside each little robot. How do you know if they’re happy? If they’re working hard? If something is broken?

You need a magic window that lets you peek inside! That’s exactly what observability is — your superpower to see what’s happening inside your Kubernetes containers.


🎭 The Analogy: Your Robot Factory

Throughout this guide, think of:

  • Containers = Tiny robot workers in your factory
  • Logs = The robots’ diaries (what they did)
  • Metrics = Their health reports (how they feel)
  • Alerts = Emergency bells when something’s wrong

📓 Container Logging

What Is It?

Every container writes a diary. Each time something happens, it writes it down. This diary is called a log.

Simple Example

When your robot worker makes a toy:

[INFO] Started making teddy bear
[INFO] Added fluffy stuffing
[INFO] Sewed the button eyes
[SUCCESS] Teddy bear complete!

When something goes wrong:

[ERROR] Oops! Ran out of stuffing!

How to See Container Logs

kubectl logs my-robot-pod

This shows you what your robot wrote in its diary!
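
A few handy variations on that command (the pod and container names are placeholders):

# Follow the diary live as new lines are written
kubectl logs -f my-robot-pod

# Read the diary of the previous container, e.g. after a crash and restart
kubectl logs my-robot-pod --previous

# Pick one container when the pod has several (such as a sidecar)
kubectl logs my-robot-pod -c log-helper

# Only the last 50 lines
kubectl logs my-robot-pod --tail=50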

Real Life

  • Your web app crashes? Check the logs to see what went wrong
  • User says they got an error? Logs tell you exactly what happened

graph TD
  A["Container Runs"] --> B["Writes to stdout/stderr"]
  B --> C["Kubelet Captures"]
  C --> D["You Read with kubectl logs"]

🏭 Cluster-Level Logging

The Problem

What if you have 1000 robots? You can’t read 1000 diaries one by one!

The Solution

Build a central library where ALL diaries are collected automatically.

Simple Example

Instead of going to each robot:

Robot 1: "Made 5 toys"
Robot 2: "Made 3 toys"
Robot 3: "ERROR: Out of paint!"

You go to ONE place and search:

Show me all robots that had errors today
→ Robot 3: "ERROR: Out of paint!"

Popular Tools

  • Fluentd - The librarian who collects all diaries
  • Elasticsearch - The giant bookshelf that stores them
  • Kibana - The search tool to find entries

graph TD
  A["Robot 1 Logs"] --> D["Fluentd"]
  B["Robot 2 Logs"] --> D
  C["Robot 3 Logs"] --> D
  D --> E["Elasticsearch"]
  E --> F["Kibana Dashboard"]
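
In Kubernetes, the “librarian” is usually run as a node-level agent so one copy sits on every factory floor. A minimal sketch, assuming Fluentd as the collector; the namespace, labels, and image tag are illustrative, and a real setup would also need Fluentd configuration telling it where to ship the logs:

apiVersion: apps/v1
kind: DaemonSet                       # runs one collector pod on every node
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd:v1.16-1   # image tag is an assumption
          volumeMounts:
            - name: varlog
              mountPath: /var/log         # where the node keeps container logs
      volumes:
        - name: varlog
          hostPath:
            path: /var/log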

🤝 Sidecar Logging Pattern

What Is a Sidecar?

Imagine your robot worker has a tiny helper sitting right next to it. This helper’s ONLY job is to take the robot’s diary and send it to the library.

Why Use a Sidecar?

  • Your robot doesn’t need to know about the library
  • The robot just writes; the helper does the rest
  • If you change libraries, only update the helper!

Simple Example

Pod:
  - main-robot:      # Does the real work
      writes: logs to /var/log/app.log
  - log-helper:      # The sidecar
      reads: /var/log/app.log
      sends: to central logging

Visual

graph LR
  A["Main Container"] -->|Writes logs| B["Shared Volume"]
  B -->|Reads logs| C["Sidecar Container"]
  C -->|Ships to| D["Central Logging"]

Real Life

Your app writes logs to a file. A Fluentd sidecar container reads that file and sends logs to Elasticsearch. Your app never needs to know about Elasticsearch!
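
Here is a minimal sketch of that pattern as an actual Pod spec. The image names, file path, and commands are illustrative assumptions; a real sidecar would run a log agent such as Fluentd and ship the file to your logging backend instead of just printing it:

apiVersion: v1
kind: Pod
metadata:
  name: robot-with-log-helper
spec:
  volumes:
    - name: app-logs
      emptyDir: {}                    # shared scratch space for the diary
  containers:
    - name: main-robot                # does the real work
      image: busybox:1.36             # placeholder image
      command: ["sh", "-c", "while true; do echo 'made a toy' >> /var/log/app.log; sleep 5; done"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log
    - name: log-helper                # the sidecar: its only job is shipping the diary
      image: busybox:1.36             # stand-in for a real agent like Fluentd
      command: ["sh", "-c", "touch /var/log/app.log; tail -f /var/log/app.log"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log

Because the helper copies the file to its own stdout here, kubectl logs robot-with-log-helper -c log-helper shows the app’s log lines even though the main container never writes to stdout itself.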


📊 Metrics Overview

Logs vs Metrics

  • Logs = Story of what happened (words)
  • Metrics = Numbers that measure things

Simple Example

Log: “Made a teddy bear at 2:30 PM”
Metric: toys_made = 47

Why Metrics?

Numbers are easy to:

  • Add up
  • Make graphs
  • Set alerts

Common Metrics

What        Metric
CPU usage   cpu_percent = 75%
Memory      memory_mb = 512
Requests    requests_per_second = 100
Errors      error_count = 3

graph TD
  A["Container"] -->|Every 15 sec| B["Collect Metrics"]
  B --> C["CPU: 75%"]
  B --> D["Memory: 512MB"]
  B --> E["Requests: 100/sec"]

📈 Metrics Server

What Is It?

The official reporter that Kubernetes provides. It checks on every robot and writes down their health numbers.

What It Tracks

  • CPU usage (how hard is the brain working?)
  • Memory usage (how full is the memory?)

Simple Example

kubectl top pods

Output:

NAME          CPU    MEMORY
robot-1       100m   256Mi
robot-2       50m    128Mi
robot-3       200m   512Mi

Now you know robot-3 is working the hardest!
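
The same trick works for whole machines, and if the command complains that metrics are not available, the Metrics Server add-on is probably not installed in your cluster yet:

kubectl top nodes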

Why You Need It

  • Auto-scaling (add more robots when busy; see the sketch below)
  • Debugging (find the slow robot)
  • Capacity planning (do we need more factory space?)

graph TD
  A["Metrics Server"] --> B["Collects from all Pods"]
  B --> C["kubectl top"]
  B --> D["Horizontal Pod Autoscaler"]
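
Auto-scaling is the big payoff: the Horizontal Pod Autoscaler reads exactly these CPU and memory numbers from the Metrics Server. A minimal sketch, assuming a Deployment named robot-factory (the name and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: robot-factory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: robot-factory             # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75    # add more robots when average CPU passes 75%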

🔄 Metrics Pipeline

What Is It?

The assembly line that moves metrics from your containers to your dashboards.

The Journey

  1. Generate - Container creates a metric
  2. Collect - Something grabs it
  3. Store - Save it somewhere
  4. Query - Ask questions about it
  5. Visualize - Show pretty graphs

Simple Example

Container makes metric: requests_total = 1000
       ↓
Prometheus scrapes it
       ↓
Stored in time-series database
       ↓
Grafana asks: "Show me requests over time"
       ↓
Beautiful graph appears! 📈

graph TD
  A["Your App"] -->|Expose /metrics| B["Prometheus"]
  B -->|Store| C["Time Series DB"]
  C -->|Query| D["Grafana"]
  D -->|Display| E["Dashboard"]
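
Steps 4 and 5 are really just queries: the “requests over time” panel in Grafana is typically a PromQL expression evaluated against Prometheus. A small sketch, assuming the counter is called requests_total and each series carries a pod label:

# Requests per second, averaged over the last 5 minutes
rate(requests_total[5m])

# The same, broken down per robot pod
sum by (pod) (rate(requests_total[5m]))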

🔥 Prometheus Integration

What Is Prometheus?

The super detective that visits every robot, asks “How are you?”, and writes down all the answers.

How It Works

  1. Your app exposes metrics at /metrics
  2. Prometheus visits and reads them (called “scraping”)
  3. Prometheus stores the data
  4. You query it later

Simple Example

Your app’s /metrics endpoint:

# HELP toys_made Total toys made
toys_made 47

# HELP errors_total Total errors
errors_total 3

Prometheus config:

scrape_configs:
  - job_name: 'toy-factory'
    static_configs:
      - targets: ['robot-1:8080']
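
Listing targets by hand doesn’t scale to hundreds of robots, so real clusters usually let Prometheus discover pods itself. A minimal sketch using Kubernetes service discovery (the job name is illustrative, and production setups add relabeling rules to choose which pods to scrape):

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod        # ask the Kubernetes API for every pod as a scrape target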

Query Examples

# How many toys made?
toys_made

# Error rate last 5 minutes?
rate(errors_total[5m])

graph TD
  A["App /metrics"] -->|Scrape every 15s| B["Prometheus"]
  B --> C["Store Time Series"]
  C --> D["PromQL Queries"]
  D --> E["Grafana Dashboards"]
  D --> F["Alertmanager"]

🚨 Alerting Basics

What Is Alerting?

The alarm system that wakes you up when something’s wrong.

How It Works

  1. You set a rule: “If errors > 10, alert me!”
  2. Prometheus checks this rule constantly
  3. When the rule is true, it sends an alert
  4. You get a message (Slack, email, PagerDuty)

Simple Example

Alert rule:

groups:
  - name: factory-alerts
    rules:
      - alert: TooManyErrors
        expr: errors_total > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Too many errors!"

Translation: “If errors stay above 10 for 5 minutes, ring the alarm!”

Alert Lifecycle

graph TD
  A["Prometheus"] -->|Checks rule| B{errors > 10?}
  B -->|No| A
  B -->|Yes for 5m| C["Alert Fires!"]
  C --> D["Alertmanager"]
  D --> E["Slack Message"]
  D --> F["Email"]
  D --> G["PagerDuty"]
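
On the Alertmanager side, the route from “alert fires” to “Slack message” is plain configuration too. A minimal sketch, assuming a Slack incoming webhook (the receiver name, channel, and URL are placeholders):

route:
  receiver: factory-oncall
  group_by: ['alertname']

receivers:
  - name: factory-oncall
    slack_configs:
      - channel: '#factory-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'   # placeholder webhook URL
        send_resolved: true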

Best Practices

  • Don’t alert on everything - Only important stuff
  • Use “for” duration - Avoid false alarms
  • Include helpful info - Tell people what to do

🎯 Quick Summary

Topic               What It Does               One-Liner
Container Logging   Records what happens       Robot’s diary
Cluster Logging     Collects all logs          Central library
Sidecar Pattern     Helper ships logs          Robot’s assistant
Metrics             Numbers about health       Health report
Metrics Server      Built-in K8s metrics       Official reporter
Metrics Pipeline    Flow from app to graph     Assembly line
Prometheus          Scrapes & stores metrics   Super detective
Alerting            Notifies on problems       Alarm system

🚀 You Did It!

Now you understand how to see inside your Kubernetes containers:

  • Logs tell you the story
  • Metrics give you the numbers
  • Alerts wake you up when needed

You have the magic window into your robot factory. Go forth and observe! 🔍✨
