
Container Lifecycle - Observability 🔍

The Story of the Magic Window

Imagine you have a giant toy factory with hundreds of tiny robot workers (containers). They’re all busy making toys, but you can’t see inside each little robot. How do you know if they’re happy? If they’re working hard? If something is broken?

You need a magic window that lets you peek inside! That’s exactly what observability is — your superpower to see what’s happening inside your Kubernetes containers.


🎭 The Analogy: Your Robot Factory

Throughout this guide, think of:

  • Containers = Tiny robot workers in your factory
  • Logs = The robots’ diaries (what they did)
  • Metrics = Their health reports (how they feel)
  • Alerts = Emergency bells when something’s wrong

📓 Container Logging

What Is It?

Every container writes a diary. Each time something happens, it writes it down. This diary is called a log.

Simple Example

When your robot worker makes a toy:

[INFO] Started making teddy bear
[INFO] Added fluffy stuffing
[INFO] Sewed the button eyes
[SUCCESS] Teddy bear complete!

When something goes wrong:

[ERROR] Oops! Ran out of stuffing!

How to See Container Logs

kubectl logs my-robot-pod

This shows you what your robot wrote in its diary!
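
A few handy variations on that command (the pod and container names are placeholders):

# Follow the diary live as new lines are written
kubectl logs -f my-robot-pod

# Read the diary of the previous container, e.g. after a crash and restart
kubectl logs my-robot-pod --previous

# Pick one container when the pod has several (such as a sidecar)
kubectl logs my-robot-pod -c log-helper

# Only the last 50 lines
kubectl logs my-robot-pod --tail=50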

Real Life

  • Your web app crashes? Check the logs to see what went wrong
  • User says they got an error? Logs tell you exactly what happened

graph TD
  A["Container Runs"] --> B["Writes to stdout/stderr"]
  B --> C["Kubelet Captures"]
  C --> D["You Read with kubectl logs"]

🏭 Cluster-Level Logging

The Problem

What if you have 1000 robots? You can’t read 1000 diaries one by one!

The Solution

Build a central library where ALL diaries are collected automatically.

Simple Example

Instead of going to each robot:

Robot 1: "Made 5 toys"
Robot 2: "Made 3 toys"
Robot 3: "ERROR: Out of paint!"

You go to ONE place and search:

Show me all robots that had errors today
→ Robot 3: "ERROR: Out of paint!"

Popular Tools

  • Fluentd - The librarian who collects all diaries
  • Elasticsearch - The giant bookshelf that stores them
  • Kibana - The search tool to find entries

graph TD
  A["Robot 1 Logs"] --> D["Fluentd"]
  B["Robot 2 Logs"] --> D
  C["Robot 3 Logs"] --> D
  D --> E["Elasticsearch"]
  E --> F["Kibana Dashboard"]
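
In Kubernetes, the “librarian” is usually run as a node-level agent so one copy sits on every factory floor. A minimal sketch, assuming Fluentd as the collector; the namespace, labels, and image tag are illustrative, and a real setup would also need Fluentd configuration telling it where to ship the logs:

apiVersion: apps/v1
kind: DaemonSet                       # runs one collector pod on every node
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd:v1.16-1   # image tag is an assumption
          volumeMounts:
            - name: varlog
              mountPath: /var/log         # where the node keeps container logs
      volumes:
        - name: varlog
          hostPath:
            path: /var/log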

🤝 Sidecar Logging Pattern

What Is a Sidecar?

Imagine your robot worker has a tiny helper sitting right next to it. This helper’s ONLY job is to take the robot’s diary and send it to the library.

Why Use a Sidecar?

  • Your robot doesn’t need to know about the library
  • The robot just writes; the helper does the rest
  • If you change libraries, only update the helper!

Simple Example

Pod:
  - main-robot:      # Does the real work
      writes: logs to /var/log/app.log
  - log-helper:      # The sidecar
      reads: /var/log/app.log
      sends: to central logging

Visual

graph LR
  A["Main Container"] -->|Writes logs| B["Shared Volume"]
  B -->|Reads logs| C["Sidecar Container"]
  C -->|Ships to| D["Central Logging"]

Real Life

Your app writes logs to a file. A Fluentd sidecar container reads that file and sends logs to Elasticsearch. Your app never needs to know about Elasticsearch!
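
Here is a minimal sketch of that pattern as an actual Pod spec. The image names, file path, and commands are illustrative assumptions; a real sidecar would run a log agent such as Fluentd and ship the file to your logging backend instead of just printing it:

apiVersion: v1
kind: Pod
metadata:
  name: robot-with-log-helper
spec:
  volumes:
    - name: app-logs
      emptyDir: {}                    # shared scratch space for the diary
  containers:
    - name: main-robot                # does the real work
      image: busybox:1.36             # placeholder image
      command: ["sh", "-c", "while true; do echo 'made a toy' >> /var/log/app.log; sleep 5; done"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log
    - name: log-helper                # the sidecar: its only job is shipping the diary
      image: busybox:1.36             # stand-in for a real agent like Fluentd
      command: ["sh", "-c", "touch /var/log/app.log; tail -f /var/log/app.log"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log

Because the helper copies the file to its own stdout here, kubectl logs robot-with-log-helper -c log-helper shows the app’s log lines even though the main container never writes to stdout itself.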


📊 Metrics Overview

Logs vs Metrics

  • Logs = Story of what happened (words)
  • Metrics = Numbers that measure things

Simple Example

Log: “Made a teddy bear at 2:30 PM”
Metric: toys_made = 47

Why Metrics?

Numbers are easy to:

  • Add up
  • Make graphs
  • Set alerts

Common Metrics

What        Metric
CPU usage   cpu_percent = 75%
Memory      memory_mb = 512
Requests    requests_per_second = 100
Errors      error_count = 3

graph TD
  A["Container"] -->|Every 15 sec| B["Collect Metrics"]
  B --> C["CPU: 75%"]
  B --> D["Memory: 512MB"]
  B --> E["Requests: 100/sec"]

📈 Metrics Server

What Is It?

The official reporter that Kubernetes provides. It checks on every robot and writes down their health numbers.

What It Tracks

  • CPU usage (how hard is the brain working?)
  • Memory usage (how full is the memory?)

Simple Example

kubectl top pods

Output:

NAME          CPU    MEMORY
robot-1       100m   256Mi
robot-2       50m    128Mi
robot-3       200m   512Mi

Now you know robot-3 is working the hardest!
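
The same trick works for whole machines, and if the command complains that metrics are not available, the Metrics Server add-on is probably not installed in your cluster yet:

kubectl top nodes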

Why You Need It

  • Auto-scaling (add more robots when busy; see the sketch below)
  • Debugging (find the slow robot)
  • Capacity planning (do we need more factory space?)

graph TD
  A["Metrics Server"] --> B["Collects from all Pods"]
  B --> C["kubectl top"]
  B --> D["Horizontal Pod Autoscaler"]
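
Auto-scaling is the big payoff: the Horizontal Pod Autoscaler reads exactly these CPU and memory numbers from the Metrics Server. A minimal sketch, assuming a Deployment named robot-factory (the name and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: robot-factory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: robot-factory             # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75    # add more robots when average CPU passes 75%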

🔄 Metrics Pipeline

What Is It?

The assembly line that moves metrics from your containers to your dashboards.

The Journey

  1. Generate - Container creates a metric
  2. Collect - Something grabs it
  3. Store - Save it somewhere
  4. Query - Ask questions about it
  5. Visualize - Show pretty graphs

Simple Example

Container makes metric: requests_total = 1000
       ↓
Prometheus scrapes it
       ↓
Stored in time-series database
       ↓
Grafana asks: "Show me requests over time"
       ↓
Beautiful graph appears! 📈

graph TD
  A["Your App"] -->|Expose /metrics| B["Prometheus"]
  B -->|Store| C["Time Series DB"]
  C -->|Query| D["Grafana"]
  D -->|Display| E["Dashboard"]
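
Steps 4 and 5 are really just queries: the “requests over time” panel in Grafana is typically a PromQL expression evaluated against Prometheus. A small sketch, assuming the counter is called requests_total and each series carries a pod label:

# Requests per second, averaged over the last 5 minutes
rate(requests_total[5m])

# The same, broken down per robot pod
sum by (pod) (rate(requests_total[5m]))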

🔥 Prometheus Integration

What Is Prometheus?

The super detective that visits every robot, asks “How are you?”, and writes down all the answers.

How It Works

  1. Your app exposes metrics at /metrics
  2. Prometheus visits and reads them (called “scraping”)
  3. Prometheus stores the data
  4. You query it later

Simple Example

Your app’s /metrics endpoint:

# HELP toys_made Total toys made
toys_made 47

# HELP errors_total Total errors
errors_total 3

Prometheus config:

scrape_configs:
  - job_name: 'toy-factory'
    static_configs:
      - targets: ['robot-1:8080']
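
Listing targets by hand doesn’t scale to hundreds of robots, so real clusters usually let Prometheus discover pods itself. A minimal sketch using Kubernetes service discovery (the job name is illustrative, and production setups add relabeling rules to choose which pods to scrape):

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod        # ask the Kubernetes API for every pod as a scrape target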

Query Examples

# How many toys made?
toys_made

# Error rate last 5 minutes?
rate(errors_total[5m])

graph TD
  A["App /metrics"] -->|Scrape every 15s| B["Prometheus"]
  B --> C["Store Time Series"]
  C --> D["PromQL Queries"]
  D --> E["Grafana Dashboards"]
  D --> F["Alertmanager"]

🚨 Alerting Basics

What Is Alerting?

The alarm system that wakes you up when something’s wrong.

How It Works

  1. You set a rule: “If errors > 10, alert me!”
  2. Prometheus checks this rule constantly
  3. When the rule is true, it sends an alert
  4. You get a message (Slack, email, PagerDuty)

Simple Example

Alert rule:

groups:
  - name: factory-alerts
    rules:
      - alert: TooManyErrors
        expr: errors_total > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Too many errors!"

Translation: “If errors stay above 10 for 5 minutes, ring the alarm!”

Alert Lifecycle

graph TD
  A["Prometheus"] -->|Checks rule| B{errors > 10?}
  B -->|No| A
  B -->|Yes for 5m| C["Alert Fires!"]
  C --> D["Alertmanager"]
  D --> E["Slack Message"]
  D --> F["Email"]
  D --> G["PagerDuty"]
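
On the Alertmanager side, the route from “alert fires” to “Slack message” is plain configuration too. A minimal sketch, assuming a Slack incoming webhook (the receiver name, channel, and URL are placeholders):

route:
  receiver: factory-oncall
  group_by: ['alertname']

receivers:
  - name: factory-oncall
    slack_configs:
      - channel: '#factory-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'   # placeholder webhook URL
        send_resolved: true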

Best Practices

  • Don’t alert on everything - Only important stuff
  • Use “for” duration - Avoid false alarms
  • Include helpful info - Tell people what to do

🎯 Quick Summary

Topic               What It Does               One-Liner
Container Logging   Records what happens       Robot’s diary
Cluster Logging     Collects all logs          Central library
Sidecar Pattern     Helper ships logs          Robot’s assistant
Metrics             Numbers about health       Health report
Metrics Server      Built-in K8s metrics       Official reporter
Metrics Pipeline    Flow from app to graph     Assembly line
Prometheus          Scrapes & stores metrics   Super detective
Alerting            Notifies on problems       Alarm system

🚀 You Did It!

Now you understand how to see inside your Kubernetes containers:

  • Logs tell you the story
  • Metrics give you the numbers
  • Alerts wake you up when needed

You have the magic window into your robot factory. Go forth and observe! 🔍✨
