Series: The Learn Arc – 50 posts teaching Active Inference through a live BEAM-native workbench. ← Part 19: Session 3.1. This is Part 20.
The session
Chapter 3, §2. Session title: Epistemic vs pragmatic value. Route: /learn/session/3/s2_epistemic_pragmatic.
Session 3.1 wrote the equation. Session 3.2 walks both columns, one at a time, and shows you the crossover point where a good Active Inference agent transitions from exploring to exploiting.
The two columns, one at a time
Risk (pragmatic): KL[ Q(o|π) ‖ C ]
What this quantity is saying, word for word: "under this policy, the observations I'll likely see differ from the observations I'd prefer to see, by this many nats." Small risk = plan lands near preferences. Large risk = plan lands far from preferences.
Risk is zero when the expected observation distribution under the policy exactly matches the preference distribution C. That happens only when the world is fully characterized and you picked a policy that exactly takes you where you want to go.
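A minimal numeric sketch of the risk term (plain Python with illustrative distributions of my own, not the workbench's code):

```python
import math

def kl_nats(q, c, eps=1e-12):
    """KL[q || c] in nats: how far predicted observations sit from preferred ones."""
    return sum(qi * math.log((qi + eps) / (ci + eps)) for qi, ci in zip(q, c))

# A policy whose predicted observations match the preferences: risk ~ 0.
print(kl_nats([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))
# A policy that lands far from the preferences: risk is large.
print(kl_nats([0.1, 0.2, 0.7], [0.7, 0.2, 0.1]))
```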
Ambiguity (epistemic): E_Q[ H[ P(o|s) ] ]
Word for word: "averaged over what I currently believe about hidden states, how uncertain is my own sensor model about what I'd see?" Small ambiguity = "I know what my sensors would tell me under any of these states." Large ambiguity = "my sensors are noisy or under-specified."
Ambiguity is zero when the sensor model P(o|s) is deterministic for every plausible state – i.e., when seeing the observation tells you unambiguously which state produced it. In a well-lit room with precise sensors, ambiguity is small. In fog, ambiguity is large.
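The well-lit-versus-fog contrast is easy to check numerically (a sketch with made-up two-state, two-observation sensor models, not the workbench's):

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def ambiguity(q_s, sensor):
    """E_Q[H[P(o|s)]]: sensor entropy averaged over current state beliefs."""
    return sum(q * entropy(row) for q, row in zip(q_s, sensor))

belief = [0.5, 0.5]
well_lit = [[1.0, 0.0], [0.0, 1.0]]   # each state yields exactly one observation
fog = [[0.5, 0.5], [0.5, 0.5]]        # observations barely discriminate states
print(ambiguity(belief, well_lit))    # ~0 nats
print(ambiguity(belief, fog))         # ~log(2) nats
```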
The crossover
Here's where it gets interesting: the two terms can trade off, but not symmetrically.
- Early in a run (lots of uncertainty about s), ambiguity is high across most policies. Plans that reduce uncertainty – pointing sensors, moving to vantage points, asking questions – have the lowest G. The agent explores.
- As observations accumulate, Q(s) sharpens. Ambiguity drops across all policies. Risk starts to dominate. Plans that bring expected observations close to C win. The agent exploits.
No tunable ε. No scheduled annealing. The transition is a consequence of the posterior sharpening, which is Chapter 2's machinery running forward.
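The crossover can be reproduced in a few lines. This is a toy of my own (hypothetical sensor models and preferences, not a workbench recipe): a "peek" policy always yields an informative cue, while a "go" policy yields the preferred reward only if the believed context is right. The same G functional picks peek under a diffuse belief and go under a sharp one:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(q, c):
    return sum(x * math.log(x / y) for x, y in zip(q, c) if x > 0)

def G(q_s, A, C):
    """Expected free energy of one policy: risk + ambiguity.
    q_s: predicted hidden-state distribution under the policy,
    A[s]: sensor model P(o|s), C: preferred observation distribution."""
    q_o = [sum(q * A[s][o] for s, q in enumerate(q_s)) for o in range(len(C))]
    risk = kl(q_o, C)
    ambiguity = sum(q * entropy(A[s]) for s, q in enumerate(q_s))
    return risk + ambiguity

# Observations: [reward, shock, cue]. Preferences tolerate the cue somewhat.
C = [0.60, 0.02, 0.38]
# "Peek" policy: always see the cue, whatever the hidden context.
A_peek = [[0.01, 0.01, 0.98], [0.01, 0.01, 0.98]]
# "Go" policy: reward if the context is L, a noisy coin flip if it is R.
A_go = [[0.98, 0.01, 0.01], [0.50, 0.49, 0.01]]

for q_s in ([0.5, 0.5], [0.95, 0.05]):  # diffuse belief, then sharpened belief
    print(q_s, G(q_s, A_peek, C), G(q_s, A_go, C))
```

With the diffuse belief, G(peek) < G(go); once the belief sharpens, G(go) < G(peek). Nothing was scheduled: only the state belief changed.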
The side-by-side recipe
/cookbook/epistemic-info-gain-vs-reward runs two agents in the same world. One has a sharp preference distribution (low-temperature C concentrated on one cell). The other has a diffuse preference (high-temperature C spread across many cells). Same world, same agent architecture, different C.
The sharp-preference agent beelines. The diffuse-preference agent wanders and gathers information. The divergence is predicted by Chapter 3 without any parameter tweak – the shape of C changes which column of G dominates, which changes what the softmax selects.
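The "temperature on C" idea can be sketched directly (hypothetical logits; the recipe defines its own C):

```python
import math

def softmax_temp(logits, T):
    """Turn raw preference logits into a distribution C; T is the temperature."""
    e = [math.exp(l / T) for l in logits]
    s = sum(e)
    return [x / s for x in e]

logits = [3.0, 0.0, 0.0, 0.0]        # one preferred cell among four
sharp = softmax_temp(logits, 0.5)    # low temperature: mass piles on one cell
diffuse = softmax_temp(logits, 5.0)  # high temperature: near-uniform
print(sharp[0], diffuse[0])
```

Against a near-uniform C, the risk term barely separates policies, so ambiguity decides; against a sharply peaked C, risk dominates.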
An important nuance
The book is careful here: ambiguity is not the same as "exploration." Ambiguity is a specific, computable quantity – the expected entropy of the sensor model averaged over Q(s). Exploration is a behavior that can arise from high ambiguity, but it's not a mechanism; it's a consequence.
That distinction matters in practice. When an RL engineer says "we add an exploration bonus," they're admitting their theory doesn't predict exploration, so they bolt it on. Chapter 3's whole move is to say: no, you don't need to bolt it on. The epistemic term is in the math already.
The mini-recipe corpus
Session 3.2 foregrounds three runnable demos, each isolating one side of the crossover:
- /cookbook/epistemic-curiosity-driver – flat C, so risk never favors one observation over another and the epistemic term runs the show. The agent explores forever.
- /cookbook/epistemic-disambiguate-before-exploit – sharp C but high initial ambiguity. The agent explores, then exploits. Classic crossover.
- /cookbook/epistemic-risk-vs-ambiguity – worlds where the two terms disagree on the right action. Watch G resolve the tension.
The concepts this session surfaces
- Pragmatic value – the negative of the risk term.
- Epistemic value – the negative of the ambiguity term.
- Crossover – the point where risk overtakes ambiguity as the dominant term in G.
- Sensor entropy – H[P(o|s)], the uncertainty of the sensor model at a single state.
The quiz
Q: An Active Inference agent in a new environment initially explores and then exploits. Why?
- ✗ It has a schedule that switches modes at a fixed tick.
- ✓ Ambiguity dominates G early; risk dominates once beliefs sharpen.
- ✗ A meta-controller toggles between two policies.
- ✗ The softmax temperature anneals.
Why: There's no schedule, no meta-controller, no annealing. The crossover falls out of Q(s) sharpening over time, which reduces ambiguity across all policies, which lets risk dominate policy selection. One functional, one softmax – the behavior emerges.
Run it yourself
- /learn/session/3/s2_epistemic_pragmatic – session page.
- /cookbook/epistemic-info-gain-vs-reward – two agents, two preference distributions.
- /cookbook/epistemic-disambiguate-before-exploit – crossover live.
- /cookbook/epistemic-curiosity-driver – pure exploration.
- /cookbook/epistemic-uniform-prior-blind-spot – when priors mask ambiguity.
The mental move
Active Inference's most impressive trick is that it derives the explore-exploit transition from math you already had. No new mechanism. No new scalar. Just the decomposition Session 3.2 spells out. That's why this chapter takes the rhetoric seriously and earns it.
Next
Part 21: Session §3.3 – The softmax policy. Chapter 3's third session. The precision parameter γ in Q(π) = softmax(−γ·G) is itself a meaningful biological quantity – and the hinge on which Chapter 5's neuromodulator story swings. We unpack it.
⭐ Repo: github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench · MIT license
📘 Active Inference, Parr, Pezzulo, Friston – MIT Press 2022, CC BY-NC-ND: mitpress.mit.edu/9780262045353/active-inference
← Part 19: Session 3.1 · Part 20: Session 3.2 (this post) · Part 21: Session 3.3 → coming soon