Reinforcement Learning Fundamentals 🤖
The Story of Max the Robot Dog
Imagine you have a smart robot dog named Max. Max doesn’t know anything when you first turn him on. But here’s the magic—Max can learn from experience!
Every time Max does something good (like sitting when you say “sit”), you give him a treat. Every time he does something wrong (like chewing your shoes), you say “No!” Max remembers what works and what doesn’t. Over time, Max becomes the best-behaved robot dog ever!
This is exactly how Reinforcement Learning works.
What is Reinforcement Learning?
Reinforcement Learning (RL) means teaching a computer to learn by trying things and seeing what happens.
```mermaid
graph TD
    A["🤖 Agent tries something"] --> B{What happened?}
    B -->|Good result| C["✅ Got a reward!"]
    B -->|Bad result| D["❌ Got punished"]
    C --> E["Do this more!"]
    D --> F["Avoid this next time"]
    E --> A
    F --> A
```
Real Life Examples:
- A game AI learning to beat you at chess
- A robot learning to walk without falling
- YouTube learning what videos you like
RL Problem Formulation
Think of RL as a simple loop:
- Look at what’s happening
- Do something
- Get feedback (reward or punishment)
- Learn from what happened
- Repeat!
Example: Teaching a toddler to walk
- Toddler looks around (sees the room)
- Toddler takes a step (action)
- Either stays standing (reward!) or falls (oops!)
- Learns what movements work
- Tries again and again
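If you like seeing ideas as code, here is that same loop (and the toddler example) as a tiny Python sketch. The `ToddlerEnv` class, its two actions, and the fall probabilities are invented just for this illustration, but real RL libraries such as Gymnasium use a very similar `reset()`/`step()` pattern.

```python
import random

# A toy "toddler learning to walk" environment, invented for this sketch.
class ToddlerEnv:
    def reset(self):
        return "standing"                      # 1. initial state

    def step(self, action):
        # Pretend a small step is safer than a big step (made-up numbers).
        fall_chance = 0.2 if action == "small step" else 0.6
        fell = random.random() < fall_chance
        reward = -1 if fell else +1            # fall = oops, stay up = reward!
        state = "fallen" if fell else "standing"
        done = fell                            # the episode ends when the toddler falls
        return state, reward, done

env = ToddlerEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice(["small step", "big step"])  # 2. do something
    state, reward, done = env.step(action)               # 3. get feedback
    total_reward += reward                                # 4. learn (here we just keep score)
print("Total reward this episode:", total_reward)         # 5. repeat for many episodes
```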
Agent and Environment
The Agent 🤖
The Agent is the learner—the one making decisions.
Think of the agent as a player in a video game. The player doesn’t control the world, but they can choose what to do in it.
Examples of Agents:
- A robot vacuum deciding where to clean
- A chess program deciding which piece to move
- Max the robot dog deciding whether to sit or run
The Environment 🌍
The Environment is everything the agent interacts with.
It’s the world around the agent—the game board, the room, the maze.
Examples of Environments:
- The chess board and pieces
- Your house (for the robot vacuum)
- The park (for Max the robot dog)
```mermaid
graph LR
    A["🤖 Agent"] -->|takes action| B["🌍 Environment"]
    B -->|sends back| C["📊 State + Reward"]
    C --> A
```
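Here is a minimal sketch of that split, using the robot-vacuum example. The class names, the three dirty spots, and the reward numbers are made up for illustration; the point is that the agent only chooses, and the environment answers with a new state and a reward.

```python
import random

class Agent:
    """The learner: looks at the state and picks an action."""
    def act(self, state):
        # Placeholder decision rule; a real agent would learn this.
        return random.choice(["clean", "turn_left", "turn_right"])

class Environment:
    """The world: takes the agent's action, sends back a new state and a reward."""
    def __init__(self):
        self.dirt_left = 3

    def step(self, action):
        if action == "clean" and self.dirt_left > 0:
            self.dirt_left -= 1
            reward = +1                        # cleaned a spot
        else:
            reward = 0                         # nothing special happened
        new_state = f"{self.dirt_left} dirty spots left"
        return new_state, reward

agent, env = Agent(), Environment()
state = "3 dirty spots left"
for _ in range(5):
    action = agent.act(state)                  # agent takes an action...
    state, reward = env.step(action)           # ...environment sends back state + reward
    print(action, "→", state, "| reward:", reward)
```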
State and Observation
What is State? 📍
State = Everything about the world right now.
Imagine taking a photo of a chess game. That photo shows:
- Where every piece is
- Whose turn it is
- Whether anyone has castled
That complete picture is the state.
What is Observation? 👀
Sometimes the agent can’t see everything. What it CAN see is called an observation.
Example: In a card game:
- Full state = All cards (yours, opponent’s, deck)
- Observation = Only your cards + cards on table
Simple Analogy:
- State = The whole room with lights on
- Observation = What you see with a flashlight
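The card-game example can be written as a small Python sketch. The cards and the dictionary layout are invented; the point is that `observe()` hands the agent only a slice of the full state.

```python
# Full state: everything about the card game (toy example).
full_state = {
    "your_cards": ["A♠", "K♥"],
    "opponent_cards": ["7♦", "7♣"],        # hidden from you!
    "cards_on_table": ["2♠", "9♥", "Q♦"],
    "deck": ["3♣", "5♦", "J♠"],            # also hidden
}

def observe(state):
    """What the agent can actually see: its own cards plus the table."""
    return {
        "your_cards": state["your_cards"],
        "cards_on_table": state["cards_on_table"],
    }

print(observe(full_state))
# {'your_cards': ['A♠', 'K♥'], 'cards_on_table': ['2♠', '9♥', 'Q♦']}
```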
Actions
Actions are things the agent can do.
In any situation, the agent picks ONE action from its list of possible moves.
Examples:
| Agent | Possible Actions |
|---|---|
| Chess AI | Move pawn, move knight, castle… |
| Robot vacuum | Go forward, turn left, turn right, dock |
| Max the dog | Sit, bark, fetch, lie down |
Key Point: The agent chooses actions. The environment responds to those actions.
```mermaid
graph TD
    A["Agent sees state"] --> B["Thinks about options"]
    B --> C["Picks best action"]
    C --> D["Does the action"]
    D --> E["Environment changes"]
```
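In code, an action set is often just a list of allowed moves, and the agent picks exactly one each turn. The lists below mirror the table above; the "take the first action" rule is only a placeholder, since choosing well is what the policy (next sections) is for.

```python
# Each agent has its own list of possible actions (names made up to match the table).
ACTIONS = {
    "chess_ai": ["move_pawn", "move_knight", "castle"],
    "robot_vacuum": ["forward", "turn_left", "turn_right", "dock"],
    "max_the_dog": ["sit", "bark", "fetch", "lie_down"],
}

def pick_action(agent_name, state):
    """Pick exactly ONE action from the agent's list.
    Placeholder: always take the first one; a real policy would look at the state."""
    return ACTIONS[agent_name][0]

print(pick_action("max_the_dog", state="owner says 'sit'"))   # → 'sit'
```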
Rewards
Rewards tell the agent if it did well or poorly.
Think of rewards like points in a video game:
- +10 points for eating a cherry 🍒
- -5 points for hitting a wall 🧱
- +100 points for winning! 🏆
The Goal
The agent’s only goal: Get as many reward points as possible over time.
Examples:
| Situation | Reward |
|---|---|
| Robot vacuum cleans spot | +1 |
| Robot vacuum hits furniture | -2 |
| Game AI wins the game | +100 |
| Game AI loses the game | -100 |
Important: Rewards can be:
- Positive (good job, do more of this!)
- Negative (bad move, avoid this!)
- Zero (nothing special happened)
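A reward signal is usually just a function (or table) that turns "what happened" into a number. The event names and point values below are invented, echoing the table above.

```python
# A toy reward function: event → points (values made up for illustration).
def reward(event):
    rewards = {
        "cleaned_spot": +1,
        "hit_furniture": -2,
        "won_game": +100,
        "lost_game": -100,
    }
    return rewards.get(event, 0)   # zero = nothing special happened

episode = ["cleaned_spot", "cleaned_spot", "hit_furniture", "won_game"]
total = sum(reward(e) for e in episode)
print("Total reward:", total)      # 1 + 1 - 2 + 100 = 100
```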
Policy
Policy = The agent’s strategy or game plan.
It answers: “When I see THIS situation, what should I DO?”
Simple Example:
Max the robot dog has a policy:
- See “sit” command → Action: Sit down
- See food bowl → Action: Walk to bowl
- See stranger → Action: Bark
Written as Math (but simple!)
Policy is often written as π (the Greek letter “pi”).
π(state) = action
English: “My policy tells me what action to take in each state.”
```mermaid
graph LR
    A["📍 Current State"] --> B["🧠 Policy π"]
    B --> C["✋ Action to take"]
```
Goal: Find the BEST policy—the one that gets the most rewards!
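The simplest possible policy is a lookup table from states to actions. Here is Max's policy from above as a Python dict (the state strings are invented for illustration).

```python
# Max's policy as a plain lookup table: state → action.
policy = {
    "hear 'sit' command": "sit down",
    "see food bowl": "walk to bowl",
    "see stranger": "bark",
}

def pi(state):
    """π(state) = action: the policy tells us what to do in each state."""
    return policy.get(state, "do nothing")

print(pi("see food bowl"))   # → 'walk to bowl'
```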
Value Function
Value Function answers: “How good is it to be HERE?”
Think of it like this:
- You’re in a maze looking for treasure
- Some spots are close to treasure (HIGH value)
- Some spots are dead ends (LOW value)
Why It Matters
The value function helps the agent make smart choices.
Example:
Two paths in a video game:
- Path A: Leads to a room with coins (Value = HIGH)
- Path B: Leads to a monster (Value = LOW)
The value function says: “Go to Path A!”
Written Simply
V(state) = Expected total future rewards from this state
English: “How many points can I probably get from here?”
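A toy, hand-computed example: a four-cell corridor with treasure at the end. Future rewards are usually counted a little less than immediate ones (a "discount factor", written gamma, is a standard RL convention); the numbers below are made up, but they show that cells closer to the treasure get higher values.

```python
# A tiny corridor: cells 0..3, treasure (+10) sits in cell 3.
gamma = 0.9              # discount factor: future rewards count a bit less
treasure_reward = 10

# V(cell) = discounted reward for walking right until the treasure.
V = {cell: treasure_reward * gamma ** (3 - cell) for cell in range(4)}
for cell, value in V.items():
    print(f"V(cell {cell}) = {value:.2f}")
# V(cell 0) = 7.29 ... V(cell 3) = 10.00 — closer to the treasure, higher the value.
```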
Q-Function (Action-Value Function)
Q-Function answers: “How good is it to take THIS action in THIS situation?”
It’s like the value function, but more specific.
The Difference
| Function | Question |
|---|---|
| Value (V) | “How good is this place?” |
| Q-Function | “How good is doing THIS action in this place?” |
Example
Max is in the living room. He can:
- Sit: Q = 10 (owner gives treat!)
- Bark: Q = -5 (owner says “No!”)
- Fetch ball: Q = 15 (owner plays with him!)
The Q-function tells Max: “Fetching the ball is the best choice here!”
Written Simply
Q(state, action) = Expected total rewards if I do this action here
```mermaid
graph TD
    A["📍 State: Living Room"] --> B["Sit → Q=10"]
    A --> C["Bark → Q=-5"]
    A --> D["Fetch → Q=15"]
    D --> E["🏆 Best Choice!"]
```
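For a small problem, a Q-function can literally be a table keyed by (state, action), and picking the best action is just a max over that table. The numbers below come straight from the Max example above.

```python
# Q-values for Max in the living room (same numbers as the example).
Q = {
    ("living_room", "sit"): 10,
    ("living_room", "bark"): -5,
    ("living_room", "fetch"): 15,
}

def best_action(state, actions):
    """Pick the action with the highest Q(state, action)."""
    return max(actions, key=lambda a: Q[(state, a)])

print(best_action("living_room", ["sit", "bark", "fetch"]))   # → 'fetch'
```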
Putting It All Together
Let’s see how all pieces work together with Max the robot dog:
| Concept | Example |
|---|---|
| Agent | Max the robot dog |
| Environment | Your house |
| State | Where Max is, what he sees |
| Observation | What Max can actually sense |
| Actions | Sit, bark, fetch, run |
| Rewards | +5 for good behavior, -3 for bad |
| Policy | Max’s strategy for getting treats |
| Value Function | “The kitchen is great!” (often gets food) |
| Q-Function | “Sitting when told = high reward” |
The RL Learning Loop
Here’s how learning happens:
```mermaid
graph TD
    A["1. Agent sees State"] --> B["2. Policy picks Action"]
    B --> C["3. Agent does Action"]
    C --> D["4. Environment responds"]
    D --> E["5. Agent gets Reward"]
    E --> F["6. Agent updates Q/Value"]
    F --> G["7. Policy improves"]
    G --> A
```
Over many tries, the agent gets better and better!
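Here is the whole seven-step loop as a minimal tabular Q-learning sketch in Python. The states, actions, and reward numbers are invented for Max; the update rule (nudging Q toward the reward plus the discounted best future value) is the standard Q-learning idea.

```python
import random

# Toy problem: Max should sit when told and fetch when he sees the ball.
states = ["owner_says_sit", "sees_ball"]
actions = ["sit", "bark", "fetch"]

def environment(state, action):
    """The environment responds with a reward and (here) a random next state."""
    good = {("owner_says_sit", "sit"): 5, ("sees_ball", "fetch"): 5}
    reward = good.get((state, action), -1)     # made-up reward numbers
    return random.choice(states), reward

Q = {(s, a): 0.0 for s in states for a in actions}   # start knowing nothing
alpha, gamma, epsilon = 0.5, 0.9, 0.1                # learning rate, discount, exploration

state = random.choice(states)
for step in range(1000):
    # 1-2. See the state; let an epsilon-greedy policy pick an action.
    if random.random() < epsilon:
        action = random.choice(actions)               # explore
    else:
        action = max(actions, key=lambda a: Q[(state, a)])   # exploit
    # 3-5. Do the action; environment responds; agent gets a reward.
    next_state, reward = environment(state, action)
    # 6. Update Q toward reward + discounted best future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    # 7. The policy improves automatically, because it reads from the updated Q-table.
    state = next_state

for s in states:
    print(s, "→", max(actions, key=lambda a: Q[(s, a)]))
# After many tries: 'owner_says_sit' → 'sit', 'sees_ball' → 'fetch'
```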
Why This Matters 🌟
Reinforcement Learning is everywhere:
- Self-driving cars learn to navigate roads
- Game AIs like AlphaGo beat world champions
- Robots learn to walk, grab, and dance
- Recommendation systems learn what you like
You now understand the foundation! Every RL system uses:
- An Agent making decisions
- An Environment responding
- States showing what’s happening
- Actions the agent can take
- Rewards guiding learning
- Policies encoding strategies
- Value/Q-Functions measuring goodness
You’ve just learned how machines learn to think! 🧠✨
