Series: The Learn Arc – 50 posts teaching Active Inference through a live BEAM-native workbench. Part 25 covered Session 4.3; this is Part 26.
The session
Chapter 4, §4. Session title: MDP world. Route: /learn/session/4/s4_mdp_world.
Sessions 4.1–4.3 introduced the three lists, the A matrix, and the concrete EFE computation. Session 4.4 wires them all together for the first time: a fully-specified MDP world where you can watch every matrix do its job against a Jido agent running the full Perceive → Plan → Act loop.
The world
The canonical teaching MDP in the Workbench is tiny_open_goal – a 3×3 maze with:
- States: 9 grid cells {(0,0), (0,1), ..., (2,2)}.
- Observations: a 6-channel Markov blanket (wall-north/south/east/west + at-goal + position-hint) with moderate noise on A.
- Actions: {north, south, east, west} – four cardinal moves.
- Preferences: concentrated mass on the goal cell.
- Dynamics: B is mostly deterministic (walls block, open cells move), with a small chance of slipping.
All four matrices. One episode. Run button. The next 60 seconds of watching that run is where every piece you've been learning lands.
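That spec translates directly into matrices. Here is a minimal NumPy sketch, collapsing the 6-channel blanket into a one-hot position observation, using a deterministic B (the real bundle adds a small slip probability), and illustrative noise and preference values rather than the Workbench's actual numbers:

```python
import numpy as np

n_states = 9                          # 3x3 grid, row-major: s = 3*row + col

# A: p(o|s). Near-identity plus a little observation noise.
noise = 0.05
A = (1 - noise) * np.eye(n_states) + noise / n_states
A = A / A.sum(axis=0)                 # each column is a distribution over o

# B: p(s'|s, a), one slice per action; deterministic moves, walls block.
actions = ["north", "south", "east", "west"]
deltas = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
B = np.zeros((len(actions), n_states, n_states))
for ai, a in enumerate(actions):
    dr, dc = deltas[a]
    for s in range(n_states):
        r, c = divmod(s, 3)
        r2, c2 = r + dr, c + dc
        s2 = 3 * r2 + c2 if (0 <= r2 < 3 and 0 <= c2 < 3) else s
        B[ai, s2, s] = 1.0            # column s lands in s2 (walls = stay put)

# C: preference mass concentrated on the goal cell (2,2) = state 8.
C = np.full(n_states, 0.1 / (n_states - 1))
C[8] = 0.9

# D: uniform prior over the 9 states.
D = np.full(n_states, 1.0 / n_states)
```

Every column of A and of each B slice is a probability distribution, which is the invariant the Workbench's bundle validation also enforces on load.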
What you actually see
The LiveView surface on /world with Tiny Open Goal:
- Maze grid – 3×3 cells, the agent's position marked @, the goal marked with its own symbol.
- Belief heatmap – 3×3 cells shaded by the Q(s) marginal. Bright = likely current state.
- Policy posterior table – top-5 policies with F, G, risk, and ambiguity columns.
- Step history – a strip of action labels scrolling right.
- Controls – Step, Run, Pause, Reset.
Five panels. Four matrices. One agent. Press Step ten times and you've seen the whole chapter fold into motion.
The agent's internal state
Every time you press Step, the agent's internal state updates in place. It holds:
```elixir
%{
  agent_id: "agent-tiny-...",
  Q_s: [[0.02, 0.03, ...], ...],                # belief over 9 states, per tick
  policy_posterior: [...],                      # Q(π) = softmax(-γG - F)
  best_policy: [:east, :east, :north, :north],  # argmax first-step
  best_f: -0.41, best_g: 0.18,                  # policy-posterior stats
  last_action: :east,
  bundle: %{A: ..., B: ..., C: ..., D: ...},    # the four matrices
  ...
}
```
The bundle field is where A, B, C, and D live. They're not recomputed each tick (unless the agent is Dirichlet-learning them) – they're the fixed structure the agent uses to perceive and plan.
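What one Step press does with that bundle can be sketched in NumPy rather than Elixir. This is illustrative only: it scores one-step policies and folds F into the softmax implicitly by omitting it, whereas the Workbench scores full multi-step policies with softmax(-γG - F); the names and the γ value are assumptions, not the real API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # stabilized softmax
    return e / e.sum()

def step(bundle, Q_s, o, gamma=4.0):
    """One Perceive -> Plan -> Act tick over one-step policies (sketch)."""
    A, B, C = bundle["A"], bundle["B"], bundle["C"]
    eps = 1e-16

    # Perceive: Bayesian belief update, Q(s) proportional to p(o|s) Q(s).
    Q_s = A[o, :] * Q_s
    Q_s = Q_s / Q_s.sum()

    # Plan: expected free energy G = risk + ambiguity, per action.
    H_cols = -np.sum(A * np.log(A + eps), axis=0)   # H[A_col(s)] per state
    G = np.zeros(B.shape[0])
    for ai in range(B.shape[0]):
        Q_next = B[ai] @ Q_s                        # predicted next-state belief
        Q_o = A @ Q_next                            # predicted observation dist
        risk = np.sum(Q_o * (np.log(Q_o + eps) - np.log(C + eps)))
        G[ai] = risk + H_cols @ Q_next              # ambiguity = E_Q[H[A_col(s)]]

    # Act: policy posterior, then take its most probable first action.
    Q_pi = softmax(-gamma * G)
    return Q_s, Q_pi, int(np.argmax(Q_pi))
```

The three phases map one-to-one onto the perceive / plan_G / act signals the Glass audit shows later in this post.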
Why MDPs get a dedicated session
MDPs are a degenerate case of POMDPs where observation = state (A is identity). Chapter 4 covers them anyway because:
1. MDPs are where a lot of RL intuition lives. If you've done gridworld RL, you've lived in MDP world. Session 4.4 bridges the vocabularies: in Active Inference, an MDP is just a POMDP with identity A, and every RL technique you know maps onto a special case of the framework.
2. Ambiguity drops out. When A is identity, H[A_col(s)] = 0 for every s. The ambiguity term in G vanishes; only risk remains. The agent becomes purely pragmatic. This is useful pedagogically โ you can isolate the risk term and study its behavior without ambiguity muddying the picture.
3. Debugging is easier. When the agent's belief doesn't match reality in a POMDP, you don't know if it's A's fault, B's fault, or a sensor glitch. In an MDP, observation = state, so if belief is wrong, it's B or the prior.
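Point 2 above is easy to check numerically. A small sketch with illustrative matrices (not the session's actual A):

```python
import numpy as np

def column_entropies(A):
    """H[A_col(s)]: entropy of p(o|s) for each state s."""
    A = np.asarray(A, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(A > 0, A * np.log(A), 0.0)
    return -terms.sum(axis=0)

identity_A = np.eye(9)                    # MDP: observation = state
noisy_A = 0.9 * np.eye(9) + 0.1 / 9       # blurred, POMDP-like sensor
noisy_A = noisy_A / noisy_A.sum(axis=0)

h_mdp = column_entropies(identity_A)      # all zeros: ambiguity vanishes
h_pomdp = column_entropies(noisy_A)       # all positive: ambiguity re-enters G
```

With identity A every column entropy is exactly zero, so the ambiguity term contributes nothing to G no matter what the beliefs are; any blur on A makes it strictly positive again.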
The worked example
The session's worked example opens tiny_open_goal and walks tick-by-tick:
- Tick 0: the agent starts at (0,0) with a uniform prior; the belief heatmap is flat.
- Tick 1: it observes wall-north + wall-west; belief sharpens toward (0,0).
- Tick 2: it picks east (G minimized by approach-plus-discover); arrives at (1,0); belief jumps.
- Tick 3: it observes a new wall pattern; belief concentrates at (1,0).
- ...and so on through to the goal.
Every tick the book predicts what the agent should do next. Every tick the Workbench shows you what it actually did. The match is the proof Chapters 1–4 earned.
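The tick-0 → tick-1 belief jump is a single Bayes update. With made-up likelihood numbers standing in for A's wall channels (the real values come from the bundle):

```python
import numpy as np

# Tick 0: uniform prior over the 9 cells -- a flat belief heatmap.
Q_s = np.full(9, 1.0 / 9)

# Tick 1: likelihood of seeing "wall-north + wall-west" in each cell.
# (0,0) is the only cell with both walls; the small leakage elsewhere
# stands in for A's observation noise.
likelihood = np.full(9, 0.02)
likelihood[0] = 0.9

# Bayes: Q(s) proportional to p(o|s) Q(s), then renormalize.
Q_s = likelihood * Q_s
Q_s = Q_s / Q_s.sum()
print(Q_s[0])   # ~0.85: belief mass piles onto (0,0)
```

One observation, one multiply-and-renormalize, and the heatmap goes from flat to a bright corner cell.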
The Glass audit
Open /glass/agent/<id> while /world is running. You'll see one signal per equation-fire per tick:
12:34:56.789 perceive eq_4_13_state_belief_update Q_s updated
12:34:56.792 plan_G eq_4_14_policy_posterior G computed per policy
12:34:56.793 plan_F eq_2_6_vfe F computed per policy
12:34:56.794 softmax eq_4_14_policy_posterior policy_posterior produced
12:34:56.795 act eq_4_14_policy_posterior action emitted
That's Chapter 4 running. Click any equation ID → land on the equation page → read the math → come back to the signal and inspect its payload. The whole chapter in one audit trail.
The concepts this session surfaces
- MDP – Markov Decision Process; A is the identity.
- POMDP degenerate case – an MDP is a POMDP with a deterministic sensor.
- Bundle – the A/B/C/D struct carried by every agent.
- Agent state – the GenServer's live reconstruction of Q and its cache.
The quiz
Q: In the tiny-open-goal MDP, why does the ambiguity term in G stay near zero?
- ✗ The world is too small for ambiguity to matter.
- ✓ The A matrix is near-identity, so H[A_col(s)] ≈ 0 for every s.
- ✗ The softmax temperature is too high.
- ✗ The Dirichlet prior is too strong.
Why: In an MDP the observation deterministically identifies the state, so the per-column entropy of A is near zero. With H ≈ 0, the ambiguity term in G drops out; only risk drives policy selection. That's exactly what MDPs are: fully observed states.
Run it yourself
- /learn/session/4/s4_mdp_world – the session page.
- /world – Tiny Open Goal; press Step 10 times.
- /cookbook/pomdp-tiny-corridor – a proper POMDP for contrast (non-identity A).
- /glass – per-tick signal audit.
The mental move
After Session 4.4, Active Inference stops being a theoretical commitment and starts being an engineering style. You pick three lists, fill four matrices, compute two equations per tick, and softmax the result. The world shows you whether your choices were right.
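"Compute two equations per tick and softmax the result" is literally the planner step. A sketch using the softmax form from the agent-state comment above, with an assumed precision γ and made-up per-policy scores in the spirit of the best_g / best_f fields:

```python
import numpy as np

def policy_posterior(G, F, gamma=4.0):
    """Q(pi) = softmax(-gamma*G - F): lower expected free energy (G) and
    lower variational free energy (F) both raise a policy's probability."""
    x = -gamma * np.asarray(G, dtype=float) - np.asarray(F, dtype=float)
    e = np.exp(x - x.max())                  # stabilized softmax
    return e / e.sum()

# Illustrative scores for three candidate policies.
G = np.array([0.18, 0.55, 0.90])
F = np.array([-0.41, -0.30, 0.10])
Q_pi = policy_posterior(G, F)
```

The first policy, with the lowest G and lowest F, dominates the posterior; γ just sharpens or flattens how decisively it wins.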
Next
Part 27: Session 4.5 – Practice. Chapter 4's closing session is a full-session workshop: build your own generative model from scratch, save it as a spec, instantiate a Studio-tracked agent, attach it to a world, run and inspect. The whole chapter as a single extended exercise.
⭐ Repo: github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench · MIT license
📖 Active Inference, Parr, Pezzulo, Friston – MIT Press 2022, CC BY-NC-ND: mitpress.mit.edu/9780262045353/active-inference
← Part 25: Session 4.3 · Part 26: Session 4.4 (this post) · Part 27: Session 4.5 – coming soon