Skill Series (05): Skill Workflow Chaining — 4 Patterns, Real Data, Parallel Speedup 1.5x

The Limit of a Single Skill

A single Skill is designed to do one thing. A "technical writer" Skill can't search for competitor data, analyze it, and produce a formatted report — that requires multiple steps, multiple specializations, chained together.

Workflow chaining composes single Skills into pipelines that go beyond what any one Skill can do.

Four Chaining Patterns

Pattern 1: Sequential Chain

A → B → C
each Skill's output is the next Skill's input

The simplest structure, appropriate for linear tasks. Critical constraint: if one step fails, the whole chain stops.
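A minimal sketch of the sequential pattern, assuming each Skill can be modeled as a function from string to string. The three step functions are hypothetical stand-ins (a real Skill would call an LLM), and `run_chain` is an illustrative helper, not part of LangGraph or the demo code.

```python
from typing import Callable

# Hypothetical stand-ins for Skills; a real Skill would call an LLM here.
def extract_keywords(topic: str) -> str:
    return f"keywords({topic})"

def build_outline(keywords: str) -> str:
    return f"outline({keywords})"

def write_article(outline: str) -> str:
    return f"article({outline})"

def run_chain(steps: list[Callable[[str], str]], initial: str) -> str:
    """Feed each step's output into the next step.

    Any exception raised by a step propagates and stops the whole chain.
    """
    value = initial
    for step in steps:
        value = step(value)
    return value

result = run_chain([extract_keywords, build_outline, write_article], "async/await")
print(result)
```

Because the loop has no error handling, a failure in any step aborts everything after it, which is exactly the constraint described above.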

Pattern 2: Parallel Fan-out

         → B1 →
A → split → B2 → merge → C
         → B3 →

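Multiple Skills run concurrently and their results are merged. For the fan-out phase, the theoretical speedup equals the number of branches (assuming branches of similar duration). The actual end-to-end speedup depends on how long the serial merge step takes.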
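A rough sketch of fan-out and merge with `ThreadPoolExecutor`. The `analyze` and `merge` functions are hypothetical placeholders, and `time.sleep` stands in for LLM latency; only the timing structure (fan-out time vs. total time) mirrors what the demo measures.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def analyze(aspect: str, company: str) -> str:
    # Placeholder for one analyzer Skill; the sleep simulates LLM latency.
    time.sleep(0.1)
    return f"{aspect} analysis of {company}"

def merge(parts: list[str]) -> str:
    # Placeholder for the merge Skill; in the real pipeline this step is serial.
    return " | ".join(parts)

def fan_out(company: str, aspects: list[str]) -> tuple[str, float, float]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(aspects)) as pool:
        # pool.map returns results in input order, regardless of finish order.
        parts = list(pool.map(lambda a: analyze(a, company), aspects))
    fan_out_s = time.perf_counter() - start
    merged = merge(parts)
    total_s = time.perf_counter() - start
    return merged, fan_out_s, total_s

merged, fan_out_s, total_s = fan_out("Notion", ["product", "market", "tech"])
print(f"fan-out: {fan_out_s:.2f}s  total: {total_s:.2f}s")
```

Threads suit this case because the branches spend their time waiting on network I/O rather than on CPU work.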

Pattern 3: Conditional Routing

A → Router → [technical] → tech-writer
             [marketing] → marketing-writer
             [default]   → general-writer

A router Skill outputs an enum value; the workflow branches based on the result. The router must return a specific enumerated type — not free text.
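One way to enforce the enum constraint is to parse the classifier's raw text into an `Enum` and fall back to the default branch on anything unrecognized. This is a sketch under assumptions: `Route`, `parse_route`, `WRITERS`, and `pick_writer` are illustrative names, and the writer names are taken from the diagram above.

```python
from enum import Enum

class Route(str, Enum):
    TECHNICAL = "technical"
    MARKETING = "marketing"
    DEFAULT = "default"

def parse_route(raw: str) -> Route:
    """Map raw classifier text onto the enum; unknown values go to DEFAULT."""
    try:
        return Route(raw.strip().lower())
    except ValueError:
        return Route.DEFAULT

# Writer names follow the routing diagram; the mapping itself is illustrative.
WRITERS = {
    Route.TECHNICAL: "tech-writer",
    Route.MARKETING: "marketing-writer",
    Route.DEFAULT: "general-writer",
}

def pick_writer(raw: str) -> str:
    return WRITERS[parse_route(raw)]

print(pick_writer(" Technical\n"))
print(pick_writer("I think this is marketing"))
```

Free text such as "I think this is marketing" does not match any enum value, so it lands on the default writer instead of silently picking a branch.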

Pattern 4: Feedback Loop

A → Evaluator → [score ≥ 7] → output
                    ↓
               [score < 7] → feedback → A (retry, max 3)

Quality gate: if the output doesn't meet the threshold, rewrite it using the evaluator's feedback. Always set a max retry count to prevent infinite loops.
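The loop can be sketched as follows, assuming a writer that accepts optional feedback and an evaluator that returns a score plus feedback. `feedback_loop`, `stub_write`, and `stub_evaluate` are hypothetical; the stubs exist only to make the control flow runnable without an LLM.

```python
from typing import Callable, Optional

def feedback_loop(
    write: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], tuple[int, str]],
    topic: str,
    threshold: int = 7,
    max_rounds: int = 3,
) -> dict:
    """Write, score, and rewrite with feedback until the score passes
    or max_rounds is reached. Always returns the best draft seen."""
    best = {"draft": "", "score": -1, "rounds": 0, "passed": False}
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = write(topic, feedback)
        score, feedback = evaluate(draft)
        if score > best["score"]:
            best.update(draft=draft, score=score)
        best["rounds"] = round_no
        if score >= threshold:
            best["passed"] = True
            break
    return best

# Stub Skills: the first draft scores 5, a revised draft scores 8.
def stub_write(topic: str, feedback: Optional[str]) -> str:
    return f"{topic} (revised: {feedback})" if feedback else topic

def stub_evaluate(draft: str) -> tuple[int, str]:
    score = 8 if "revised" in draft else 5
    return score, "add a code example"

result = feedback_loop(stub_write, stub_evaluate, "Redis Cluster sharding")
print(result["score"], result["rounds"], result["passed"])
```

Tracking the best draft rather than the last one means that hitting the retry cap still yields the strongest attempt, which can then be returned with a quality annotation.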

Demo Design

All four patterns use real LLM calls:

Pattern	Implementation	What's measured
Sequential	LangGraph 3-node graph: keywords → outline → write	End-to-end latency
Parallel fan-out	`ThreadPoolExecutor` × 3 → merge	Fan-out time, total time, speedup ratio
Conditional routing	LLM classifies input type → routes to 3 writers	Routing accuracy, output style comparison
Feedback loop	Write → score (1-10) → rewrite with feedback, max 3 rounds	Iteration count, score per round

Run Results

Pattern 1: Sequential Chain

Topic:    Python async/await: from coroutines to production-ready patterns
Keywords: async programming, coroutines, await, production-ready patterns
Outline:  - Introduction to Async Programming in Python
          - Understanding Coroutines and the `async` Keyword
          - Implementing `await` ...
Article:  ### Introduction to Async Programming in Python
          Async programming in Python has revolutionized the way we handle I/O...
Time: 35.1s (3 sequential LLM calls)

Pattern 2: Parallel Fan-out

Company: Notion
Product:  Notion stands out with its comprehensive suite...
Market:   Notion's market positioning is as a versatile productivity platform...
Tech:     Notion's technology stack is notable for its robust collaboration...
Merged:   Notion's competitive edge lies in its versatile productivity suite...

Fan-out time: 12.4s  |  Total (incl. merge): 24.5s
Sequential equiv: ~37.2s  |  Speedup: ~1.5x

Pattern 3: Conditional Routing

Input:  "Explain how Kubernetes pod scheduling works with a code example"
Route:  technical  (18.9s)
Output: Kubernetes pod scheduling is the process of assigning a pod to a node...

Input:  "Write a compelling product description for our new AI writing tool"
Route:  marketing  (10.7s)
Output: Unleash Your Words with Unmatched Precision! Transform your writing...

Input:  "What is machine learning and why does it matter?"
Route:  technical  (23.5s)
Output: Machine learning (ML) is a subset of artificial intelligence...

Pattern 4: Feedback Loop

Topic: Write a technical article about Redis Cluster sharding strategy

  Iteration 1: score=8/10  ✓ PASS
               feedback: The article provides a clear explanation of Redis Cluster
               sharding, but could benefit from...

Final score: 8/10  |  Iterations: 1/3  |  Time: 44.8s

Three Findings

Finding 1: Parallel Speedup 1.5x, Not the Theoretical 3x

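Three analyzers ran concurrently. Fan-out phase: 12.4s. Sequential equivalent of the fan-out: ~37.2s (12.4 × 3, assuming each branch takes about as long as the slowest one). Fan-out phase alone: ~3x speedup. Total pipeline (fan-out + merge): 24.5s. Reported total speedup: 37.2 / 24.5 ≈ 1.5x. That ratio leaves the 12.1s merge out of the sequential baseline; counting it on both sides gives 49.3 / 24.5 ≈ 2.0x. Either way, the pipeline lands well short of 3x.

Amdahl's Law at work:

Total speedup = 1 / (serial fraction + parallel fraction / N)

This run, as a share of the 24.5s wall-clock time:
  Parallel portion (fan-out):  12.4s / 24.5s ≈ 51%
  Serial portion (merge):      12.1s / 24.5s ≈ 49%

As fractions of the 49.3s sequential baseline (what the formula expects):
  Parallel fraction:  37.2s / 49.3s ≈ 75%
  Serial fraction:    12.1s / 49.3s ≈ 25%
  N = 3:              1 / (0.25 + 0.75 / 3) ≈ 2.0x

The merge already takes about half of the wall-clock time, and the
pipeline can never finish faster than the 12.1s merge itself,
regardless of how many concurrent branches you add.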

The optimization lever is the merge step, not more concurrent branches. Cut the merge prompt's token load, and the serial bottleneck shrinks — raising the effective speedup ceiling.
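The arithmetic can be checked in a few lines. This sketch uses only the two measured timings (12.4s fan-out, 24.5s total) and assumes, as the 12.4 × 3 estimate does, that every branch takes about as long as the slowest one. The like-for-like ratio and the ceiling are derived here, not reported by the run; note that Amdahl's fractions are taken against the sequential baseline including the merge, not against the parallel run's wall-clock time.

```python
def amdahl_speedup(serial_fraction: float, n: float) -> float:
    """Amdahl's Law: speedup with n-way parallelism on the parallel part."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

fan_out_s = 12.4                       # measured: parallel fan-out phase
total_s = 24.5                         # measured: fan-out + merge
merge_s = total_s - fan_out_s          # ~12.1s, serial
seq_fan_out_s = 3 * fan_out_s          # ~37.2s, assumes equal-length branches
seq_total_s = seq_fan_out_s + merge_s  # ~49.3s, sequential baseline incl. merge

reported = seq_fan_out_s / total_s       # ratio quoted in the run output
like_for_like = seq_total_s / total_s    # merge counted on both sides
serial_fraction = merge_s / seq_total_s  # fraction of the sequential baseline
ceiling = 1.0 / serial_fraction          # limit as parallelism grows

print(f"reported:      {reported:.2f}x")
print(f"like-for-like: {like_for_like:.2f}x")
print(f"Amdahl, N=3:   {amdahl_speedup(serial_fraction, 3):.2f}x")
print(f"ceiling:       {ceiling:.2f}x")
```

Shrinking `merge_s` lowers the serial fraction and raises every number above, which is why the merge step is the lever.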

Finding 2: The Third Routing Result Was "Technical," Not "General"

"What is machine learning and why does it matter?" routed to the technical writer, not general.

The classification is reasonable but audience-dependent. For an engineering team, ML is a technical topic. For a product manager, it's general. The classifier saw only the question text, not who was asking.

Production fix: include audience information in the classifier input:

classifier_input = f"Request: {request}\nTarget audience: {workflow_input.audience}"

Without it, the router makes an implicit assumption about who's asking. That assumption is invisible and wrong for at least one audience type.

Finding 3: Feedback Loop Passed First Iteration — That's the Point

Score 8/10 on iteration 1, no retries needed. Time: 44.8s.

The gate let the first attempt through because 8/10 is good enough. That's the design working correctly — quality gates filter genuinely poor outputs, not every output.

Threshold calibration:

5/10 is too low: almost nothing triggers; the gate becomes decorative
9/10 is too high: almost everything triggers a retry; token cost at least doubles
7/10 is a useful starting point: blocks real quality gaps, allows a solid first draft through

Feedback quality determines retry effectiveness:

Effective: "Missing code examples for write-behind pattern; clarify TTL vs eviction policy"
Ineffective: "Make it better and more comprehensive"

The second feedback gives the writer nothing to act on. The loop runs, tokens are spent, and the output barely changes.

Error Handling

Chained workflows encounter four error types, each requiring a different response:

Transient (LLM timeout, rate limit)
  → Retry 3x with exponential backoff: 1s, 2s, 4s

Quality gate failure
  → Retry with feedback, max 3 rounds
  → After max retries: return best result + quality annotation

Fatal (permission denied, malformed input)
  → Abort immediately, surface clear error to user

Partial completion (one parallel branch failed)
  → Merge available results, annotate missing branch
  → Don't fail the whole pipeline for one missing piece
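Two of these responses can be sketched compactly: retry with exponential backoff for transient errors, and merge-what-you-have for partial branch failure. `TransientError`, `retry_with_backoff`, and `run_branches` are hypothetical names introduced for illustration; how a real client signals a timeout or rate limit depends on the SDK in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical marker for timeouts and rate limits."""

def retry_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry on TransientError with delays of 1s, 2s, 4s by default.

    Any other exception is treated as fatal and propagates immediately.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")

def run_branches(branches: dict[str, Callable[[], str]]) -> dict:
    """Run branches concurrently; keep successes, annotate failures."""
    results: dict[str, str] = {}
    missing: list[str] = []
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn) for name, fn in branches.items()}
    # Leaving the with-block waits for all futures to finish.
    for name, future in futures.items():
        try:
            results[name] = future.result()
        except Exception as exc:
            missing.append(f"{name}: {exc}")
    return {"results": results, "missing": missing}

# Usage: a call that fails twice, then succeeds (delays recorded, not slept).
delays: list[float] = []
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limit")
    return "ok"

def failing_branch() -> str:
    raise RuntimeError("tech branch timed out")

outcome = retry_with_backoff(flaky, sleep=delays.append)
partial = run_branches({"product": lambda: "p", "tech": failing_branch})
print(outcome, delays, partial)
```

Injecting `sleep` as a parameter keeps the backoff logic testable without real waiting.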

State Schema

In a chain, the upstream Skill's output is the downstream Skill's input. Without a consistent schema, the chain breaks the first time an upstream output format changes.

{
  "status": "success",
  "output": {
    "main_content": "...",
    "metadata": { "word_count": 2500, "confidence": 0.92 }
  },
  "trace_id": "skill-abc-123"
}
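A downstream Skill can validate this envelope at the boundary rather than trusting it. The sketch below assumes the field names shown in the example; the `SkillResult` class and its `from_json` method are illustrative, not part of any library.

```python
import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SkillResult:
    status: str
    main_content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    trace_id: str = ""

    @classmethod
    def from_json(cls, raw: str) -> "SkillResult":
        """Parse the envelope; fail loudly if upstream did not succeed
        or if required fields are missing."""
        data = json.loads(raw)
        if data.get("status") != "success":
            raise ValueError(f"upstream status: {data.get('status')!r}")
        output = data["output"]  # KeyError here means a malformed envelope
        return cls(
            status=data["status"],
            main_content=output["main_content"],
            metadata=output.get("metadata", {}),
            trace_id=data.get("trace_id", ""),
        )

raw = """{
  "status": "success",
  "output": {
    "main_content": "...",
    "metadata": { "word_count": 2500, "confidence": 0.92 }
  },
  "trace_id": "skill-abc-123"
}"""
parsed = SkillResult.from_json(raw)
print(parsed.trace_id, parsed.metadata["word_count"])
```

Failing at parse time surfaces a schema drift at the step where it happened instead of several Skills later.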

Context compression for long pipelines: when upstream output exceeds ~5000 tokens, downstream Skills rarely need the full content. Three options:

Insert a summarizer Skill to compress before passing downstream
Pass only the fields the downstream Skill needs (output.metadata, not output.main_content)
Store large intermediate artifacts externally; downstream retrieves on demand
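The second option (passing only the needed fields) can be as simple as projecting dotted paths out of the envelope. `project` is a hypothetical helper written for this sketch, and the choice of which fields to forward is an assumption about what the downstream Skill needs.

```python
def project(envelope: dict, paths: list[str]) -> dict:
    """Copy only the given dotted paths (e.g. 'output.metadata') into a new dict."""
    slim: dict = {}
    for path in paths:
        keys = path.split(".")
        src, dst = envelope, slim
        # Walk down to the parent of the final key, creating dicts as needed.
        for key in keys[:-1]:
            src = src[key]
            dst = dst.setdefault(key, {})
        dst[keys[-1]] = src[keys[-1]]
    return slim

envelope = {
    "status": "success",
    "output": {
        "main_content": "a very long article body",
        "metadata": {"word_count": 2500, "confidence": 0.92},
    },
    "trace_id": "skill-abc-123",
}

slim = project(envelope, ["status", "output.metadata", "trace_id"])
print(slim)
```

The large `main_content` field never reaches the downstream prompt, while `trace_id` is kept so the step can still be correlated with its upstream result.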

Design Checklist

Sequential chain

[ ] Each step's output format is defined (downstream can parse it)
[ ] Critical steps have fallback logic (not just abort-on-failure)

Parallel fan-out

[ ] Merge Skill handles partial branch failure (annotate, don't fail)
[ ] Measure merge step latency before deciding how many branches to add

Conditional routing

[ ] Router outputs an enumerated type, not free text
[ ] Default branch covers unclassified inputs
[ ] Routing input includes audience or context metadata

Feedback loop

[ ] Max retry count is set (3 is a reasonable default)
[ ] Feedback targets specific issues, not general "be better"
[ ] After max retries: return best result + annotation, not an error

Summary

Parallel speedup was 1.5x, not 3x: the fan-out phase ran ~3x faster, but the serial merge step took about as long as the fan-out (12.1s vs 12.4s), which held the measured total to ~1.5x; the fix is a lighter merge step
Conditional routing needs audience context: topic alone is ambiguous; the same question routes differently for technical vs general audiences
Feedback loop efficiency depends on threshold design: a first-attempt pass is what a sensibly calibrated threshold should allow for a solid draft; the gate's job is filtering real quality gaps, not forcing retries

References

LangGraph StateGraph documentation
Full demo code: skill-05-workflow
