What I learned building a scripted two-host video pipeline with edge-tts and ffmpeg

One limitation I kept hitting with the YouTube Shorts pipeline I wrote about earlier was format. Shorts lock you into vertical, 60-second content. For longer, explanatory content — the kind that covers "NocoDB vs Baserow vs Teable, which one for your use case" in enough depth to actually help someone — I needed 16:9, 8–12 minutes, and a format that doesn't feel like a robot reading a list aloud.

A two-host dialogue is the simplest format that creates that feeling. Two voices alternate, disagree on minor points, and hand off naturally. That rhythm holds attention across 10 minutes in a way that a single-voice narration doesn't. I built it from scratch rather than buying a video tool because the CI-first constraint ruled out most options: the render has to produce a deterministic MP4 from a commit, unattended, in a GitHub Actions job.

The spec JSON format

Everything flows from a JSON file with video metadata and a segments array.

{
  "title": "NocoDB vs Baserow vs Teable: which Airtable alternative to pick",
  "description": "We compare three open-source Airtable alternatives...",
  "tags": ["opensource", "selfhosted", "nocode"],
  "privacy": "public",
  "segments": [
    {
      "speaker": "A",
      "text": "Let's talk about NocoDB, Baserow, and Teable. All three replace Airtable. None of them replace it identically.",
      "slide": { "kind": "title", "title": "NocoDB vs Baserow vs Teable", "subtitle": "Which one for your use case?" }
    },
    {
      "speaker": "B",
      "text": "The differences matter more than the star counts. Teable doesn't own your schema — which is either a huge plus or a confusing minus depending on your setup."
    }
  ]
}

A few design choices in that structure:

Slides are optional per segment. If a segment omits the slide key, the renderer carries the previous slide forward. This lets the script be dense with dialogue without requiring a new visual for every sentence. Transitions happen when the topic changes, not when the sentence changes.

Speaker is either A or B. The pipeline maps A → en-US-AndrewNeural and B → en-US-AvaNeural. Both are Microsoft neural text-to-speech voices, reached through the open-source edge-tts Python package, which calls the same endpoint the Edge browser uses for Immersive Reader. No API key, no cost.

Text segments are short conversational sentences. Long paragraphs sound unnatural when synthesized. Keeping each segment under 25 words produces more natural speech rhythm, including appropriate pauses at sentence boundaries. I generate the spec with Claude — the same shared Haiku client I use for ETL content on the three directory sites — which handles the awkward task of splitting continuous prose into A/B dialogue naturally.
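The carry-forward rule for slides is easy to express in a few lines. This is a minimal sketch, not the renderer's actual code: the function name resolve_slides is my own, and I'm assuming a segment that appears before any slide simply gets None.

```python
from typing import Any, Optional


def resolve_slides(segments: list[dict[str, Any]]) -> list[Optional[dict[str, Any]]]:
    """Return one slide per segment, carrying the previous slide forward.

    A segment without a "slide" key reuses the most recent slide seen.
    Segments that come before any slide get None (an assumption of this sketch).
    """
    resolved: list[Optional[dict[str, Any]]] = []
    current: Optional[dict[str, Any]] = None
    for seg in segments:
        if "slide" in seg:
            current = seg["slide"]
        resolved.append(current)
    return resolved


segments = [
    {"speaker": "A", "text": "Intro.", "slide": {"kind": "title", "title": "Intro"}},
    {"speaker": "B", "text": "Reply."},
    {"speaker": "A", "text": "Next.", "slide": {"kind": "bullets", "title": "Points"}},
]
print([s["kind"] for s in resolve_slides(segments)])  # ['title', 'title', 'bullets']
```

Resolving slides up front, before any rendering, also means each distinct slide only needs to be drawn once.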

Rendering slides with Pillow

The slide renderer generates 1920x1080 PNG files from JSON slide specs using Pillow. No browser. No Playwright. No headless Chrome. Playwright-based screenshot rendering works for OG images but adds 30–60 seconds to CI startup for browser binary download and launch. For a render job that produces 40+ slide images, that overhead isn't justified.

The slide spec supports five kind values:

kind	Layout	Use case
`title`	Centered title + subtitle	Section transitions, intro, outro
`bullets`	Heading + unordered list	Key points, comparison criteria
`table`	Heading + column/row grid	Side-by-side feature comparison
`tool`	Name, star count, license, take	Individual tool review card
`outro`	CTA + URL	End card with subscribe prompt

Every slide draws the same brand chrome: a 10px accent bar at the top, the channel wordmark, a footer with site URLs, and an optional page number. The chrome function runs before any slide-specific content, so brand consistency is automatic and can't be accidentally omitted from a new slide kind.

Font resolution is a CI-vs-local problem I didn't anticipate. On Ubuntu (GitHub Actions), the bold face is DejaVu Sans at /usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf. For local development on macOS, the equivalent is Arial Bold at /System/Library/Fonts/Supplemental/Arial Bold.ttf. The resolver iterates a candidate list for each weight and raises a clear error if none of the paths exist. Running sudo apt-get install -y fonts-dejavu-core in the CI setup step covers the Ubuntu case cleanly.
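A resolver like the one described above can be very small. The sketch below is illustrative rather than the pipeline's real code: resolve_font and BOLD_CANDIDATES are names I made up, and the candidate list contains only the two paths mentioned in this article.

```python
import os
from typing import Iterable

# Candidate paths from the article; extend the list for other platforms.
BOLD_CANDIDATES = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",  # Ubuntu / GitHub Actions
    "/System/Library/Fonts/Supplemental/Arial Bold.ttf",     # macOS
]


def resolve_font(candidates: Iterable[str]) -> str:
    """Return the first candidate path that exists, or raise a clear error."""
    tried = list(candidates)
    for path in tried:
        if os.path.isfile(path):
            return path
    # List every path that was checked so the CI log says what to install.
    raise FileNotFoundError("No usable font found. Tried:\n  " + "\n  ".join(tried))
```

The returned path can then be handed to Pillow's ImageFont.truetype(path, size). Listing the attempted paths in the error is the part that saves time when a CI image changes.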

The _wrap() helper deserves a note. Pillow's draw.textlength() method measures rendered pixel width, not character count — the word-wrap algorithm uses that to split text at word boundaries while staying within the slide margin. This matters for proportional fonts: "WWW" is much wider than "iii" at the same font size, and a naive character count would produce lines that overflow the slide on the right.
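To make the width-based wrapping concrete, here is a minimal greedy sketch. It is not the actual _wrap() implementation: the signature is hypothetical, and the width function is passed in as a parameter so the example runs without Pillow. The fake_width function is a stand-in that makes "W" three times as wide as "i".

```python
from typing import Callable


def wrap(text: str, max_width: float, measure: Callable[[str], float]) -> list[str]:
    """Greedy word wrap: break at spaces so each line measures <= max_width.

    `measure` returns the rendered width of a string in pixels. A single word
    wider than max_width is left on its own line rather than split.
    """
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and measure(candidate) > max_width:
            lines.append(current)  # the line is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# Stand-in measure for demonstration only: "W" is 30px, everything else 10px.
def fake_width(s: str) -> float:
    return sum(30 if ch == "W" else 10 for ch in s)


print(wrap("iii iii iii WWW WWW", 140, fake_width))  # ['iii iii iii', 'WWW', 'WWW']
```

Three narrow words fit on one line while two wide ones do not, even though every word has three characters. With Pillow, the measure argument would be something like lambda s: draw.textlength(s, font=font).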

Synthesizing dialogue with edge-tts

Each text segment gets synthesized independently:

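import os
import subprocess
import sys

# VOICE and VOICE_FALLBACK map a speaker ("A" or "B") to a voice name.
def tts(text: str, speaker: str, out_mp3: str) -> str:
    # Try the primary voice, then the fallback; skip a speaker with no mapping.
    for voice in filter(None, (VOICE.get(speaker), VOICE_FALLBACK.get(speaker))):
        r = subprocess.run(
            ["edge-tts", "--voice", voice, "--text", text, "--write-media", out_mp3],
            capture_output=True, text=True,
        )
        # Exit code 0 alone is not proof of audio, so check the file size too.
        if r.returncode == 0 and os.path.isfile(out_mp3) and os.path.getsize(out_mp3) > 1000:
            return voice
    sys.exit(f"ERROR: edge-tts failed for speaker {speaker}: {text[:50]}")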

Two things surprised me here.

The 1000-byte file size check. edge-tts can return exit code 0 but write an almost-empty MP3 (a valid MP3 header, no actual audio) when the requested voice isn't available on the Microsoft endpoint that day. Without the size check, ffmpeg would silently produce a clip with a duration of 0.001 seconds. The check catches this before ffmpeg runs.

The voice fallback. en-US-AndrewNeural and en-US-AvaNeural are the primary pair; en-US-GuyNeural and en-US-AriaNeural are the fallbacks. Microsoft occasionally rotates neural voice availability between edge-tts package versions. Both voices in a pair have similar enough tone and cadence that the audio doesn't sound jarring if one call uses the fallback.

The pipeline calls edge-tts as a subprocess rather than using its Python API directly because the CLI version handles voice negotiation internally. Calling the Python API exposed an async event loop conflict with the Pillow rendering code that wasn't worth tracing.

Assembling clips into a video with ffmpeg

Each segment produces two artifacts: a PNG slide image and an MP3 audio file. These combine into a single .ts transport stream clip:

run(["ffmpeg", "-y",
    "-loop", "1", "-i", slide_img,
    "-i", audio_mp3,
    "-c:v", "libx264", "-tune", "stillimage",
    "-c:a", "aac", "-b:a", "128k",
    "-pix_fmt", "yuv420p",
    "-shortest",
    clip_ts])

-loop 1 makes ffmpeg repeat the still image as an endless video stream. -tune stillimage applies x264's stillimage tuning (a tune, not a preset), which adjusts the encoder for content that doesn't change from frame to frame; that fits here, since the image is static within a clip. -shortest ends the clip when the audio ends, which is needed because the looped image stream would otherwise never finish.
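The run() helper used in these snippets isn't shown in this article. A minimal sketch, assuming it is a thin wrapper around subprocess.run that stops the build on a non-zero exit code (the exact behavior and error format here are my assumptions):

```python
import subprocess
import sys


def run(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command and abort the whole build if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Print only the tail of stderr: ffmpeg logs a long banner first
        # and puts the actual error near the end.
        sys.exit(f"ERROR: {cmd[0]} exited {result.returncode}\n{result.stderr[-2000:]}")
    return result
```

Failing fast matters here because a broken clip that slips through would only surface later, at the concat step, with a much less specific error.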

After all clips render, they concatenate with ffmpeg -f concat:

with open(clips_list, "w") as f:
    for p in clip_paths:
        f.write(f"file '{os.path.abspath(p)}'\n")

run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", clips_list, "-c", "copy", out_mp4])

-c copy means no re-encode at the concat step: ffmpeg stitches the transport streams together without touching the encoded data, which is fast even for a 60-clip video. The clip paths go into a text file because ffmpeg's concat demuxer takes a list file as its input rather than a series of path arguments. Like the manifest in the thumbnail pipeline, that keeps the command line short no matter how many clips there are.
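One edge case in the list file: a path containing a single quote would break the file '...' line. The sketch below is a defensive variant, not what the pipeline currently does; concat_entry and write_concat_list are hypothetical names. It relies on ffmpeg's quoting rule that a quote inside a single-quoted string has to be written by closing the quotes, adding an escaped quote, and reopening.

```python
import os


def concat_entry(path: str) -> str:
    """Format one line of an ffmpeg concat list, quoting the path safely."""
    absolute = os.path.abspath(path)
    # Inside single quotes ffmpeg reads characters literally, so a quote in
    # the path becomes: close quote, escaped quote, reopen quote.
    escaped = absolute.replace("'", "'\\''")
    return f"file '{escaped}'\n"


def write_concat_list(clip_paths: list[str], list_path: str) -> None:
    """Write all clip paths, in order, to the concat list file."""
    with open(list_path, "w", encoding="utf-8") as f:
        for p in clip_paths:
            f.write(concat_entry(p))
```

For a work directory like /tmp/lf this never triggers, so it is mostly insurance against someone running the build from an oddly named folder.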

Total render time for a 10-minute, 80-segment video is 4–5 minutes in CI: roughly 3 minutes of TTS network round trips to the Microsoft endpoint, and 1 minute of Pillow rendering plus ffmpeg work.

The CI workflow

The GitHub Actions workflow triggers on any push to main that modifies a file in content/yt-longform-queue/*.json. The path filter is the key: it means none of the other pipelines in the shared CI — article publishing, ETL cron jobs, Bluesky queue drains — accidentally trigger a 5-minute video render.

After a successful render and upload, the workflow moves the JSON spec from the queue directory into an uploaded/ subdirectory and commits that move back to main. The commit message includes [skip yt-longform], which the workflow's if: condition checks so the job is skipped on the pipeline's own commit. Without this, the workflow would fire on its own commit, find no new queue file, and exit cleanly, but you'd burn a job startup every time.

This pattern — queue directory → render → move to uploaded → commit with skip token — is the same one I use for the Bluesky image-upload pipeline. Once I had it working reliably in one place, reusing it for the video queue was about 20 minutes of YAML editing.

The upload step uses the YouTube Data API v3. It reads credentials from a GitHub Actions secret (YT_SERVICE_ACCOUNT_JSON), a JSON service account key I covered in more detail in the Google Service Account article. The long-form pipeline uses google-api-python-client rather than a raw JWT because the resumable upload API for large MP4 files is complex enough that the library earns its weight.

What I'd do differently

TTS synthesis is the bottleneck, and it's sequential. Each segment round-trips to a Microsoft endpoint; 80 segments at roughly 2 seconds each is about 160 seconds of pure network wait. Parallelizing with asyncio.gather() and a semaphore to cap concurrent requests would cut that to 20–30 seconds. I kept the sequential version to simplify debugging — when a segment fails, the error output is unambiguous — but I'll switch to async before scaling past 120 segments.
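The gather-plus-semaphore pattern would look roughly like this. It's a sketch under assumptions: synth_all and its parameters are hypothetical names, the cap of 8 concurrent requests is a guess rather than a measured limit, and fake_synth stands in for the real TTS call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable


async def synth_all(
    texts: list[str],
    synth_one: Callable[[int, str], Awaitable[str]],
    max_concurrent: int = 8,
) -> list[str]:
    """Run synth_one for every segment, at most max_concurrent at a time.

    gather() returns results in the order the awaitables were passed in,
    so the output list lines up with the segment order.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(index: int, text: str) -> str:
        async with sem:  # wait here while max_concurrent calls are in flight
            return await synth_one(index, text)

    return await asyncio.gather(*(bounded(i, t) for i, t in enumerate(texts)))


# Stand-in for a real TTS call (hypothetical): sleeps instead of using the network.
async def fake_synth(index: int, text: str) -> str:
    await asyncio.sleep(0.01)
    return f"seg-{index:03d}.mp3"


if __name__ == "__main__":
    print(asyncio.run(synth_all(["Hello.", "Hi there."], fake_synth)))
    # ['seg-000.mp3', 'seg-001.mp3']
```

With 80 segments at about 2 seconds each, a cap of 8 works out to roughly 20 seconds of wall time, which is where the estimate above comes from. One way to keep using the CLI would be to have synth_one launch it with asyncio.create_subprocess_exec. Because gather() raises the first exception it sees, a failed segment still stops the build, though the error is less obviously tied to one segment than in the sequential version.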

Slide hand-off between segments isn't visible in the spec. A segment that omits slide silently inherits the previous one. This is fine when writing sequentially but confusing when reordering. An explicit "slide": "carry" field would make the intent obvious without changing the render logic.

I should have added spec validation first. The first three specs I wrote had typos in slide.kind that the renderer handled by outputting a blank slide — silent failure. The JSON-LD audit pattern I use for structured data gives me a template: validate at schema level before any expensive render work. I'll add a JSON schema validator as a pre-flight step in the next iteration.
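As a starting point for that pre-flight step, here is a minimal sketch using hand-written checks in plain Python rather than a real JSON Schema validator. validate_spec is a hypothetical name, and the rules cover only what this article describes: speakers A and B, non-empty text, and the five slide kinds.

```python
from typing import Any

ALLOWED_KINDS = {"title", "bullets", "table", "tool", "outro"}
ALLOWED_SPEAKERS = {"A", "B"}


def validate_spec(spec: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    segments = spec.get("segments")
    if not isinstance(segments, list) or not segments:
        return ["'segments' must be a non-empty list"]

    errors: list[str] = []
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            errors.append(f"segment {i}: must be an object")
            continue
        if seg.get("speaker") not in ALLOWED_SPEAKERS:
            errors.append(f"segment {i}: speaker must be 'A' or 'B', got {seg.get('speaker')!r}")
        text = seg.get("text")
        if not isinstance(text, str) or not text.strip():
            errors.append(f"segment {i}: 'text' must be a non-empty string")
        if "slide" in seg:  # slides are optional, but a present one must be valid
            slide = seg["slide"]
            kind = slide.get("kind") if isinstance(slide, dict) else None
            if kind not in ALLOWED_KINDS:
                errors.append(f"segment {i}: unknown slide.kind {kind!r}")
    return errors
```

Collecting every problem instead of stopping at the first one means a spec with three typos needs one fix-and-push cycle, not three.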

Script generation quality varies. Claude writes reasonable dialogue specs, but the A/B voice split sometimes drifts toward one speaker dominating a section. Adding an explicit constraint in the prompt — "alternate speakers every 1–3 lines, max 4 consecutive lines per speaker" — would help. I don't know how much of the quality problem is the prompt vs. Haiku 4.5 vs. the single-shot generation approach. The pairwise evaluation setup I built for the AI tools site would be useful here too: compare specs generated by different prompts, pick the one where the dialogue feels more balanced.
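The "max 4 consecutive lines per speaker" rule can also be checked after generation instead of trusting the prompt. A minimal sketch, with longest_run as a hypothetical helper name:

```python
def longest_run(speakers: list[str]) -> tuple[str, int]:
    """Return (speaker, length) of the longest streak of consecutive lines."""
    best_speaker, best_len = "", 0
    run_speaker, run_len = "", 0
    for s in speakers:
        # Extend the current streak, or start a new one on a speaker change.
        run_len = run_len + 1 if s == run_speaker else 1
        run_speaker = s
        if run_len > best_len:
            best_speaker, best_len = run_speaker, run_len
    return best_speaker, best_len


speakers = list("ABAAAAAB")  # speaker A holds the floor for five lines
who, length = longest_run(speakers)
if length > 4:
    print(f"speaker {who} has {length} consecutive lines")
# prints: speaker A has 5 consecutive lines
```

Run against the speaker field of each segment, this gives a cheap pass/fail signal for regenerating a spec before any rendering starts.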

FAQ

Why edge-tts instead of OpenAI TTS or ElevenLabs?

Cost and CI ergonomics. OpenAI TTS costs $0.015 per 1,000 characters. A 10-minute video script is roughly 5,000 characters per voice, so about 10,000 characters in total, or $0.15 per video. edge-tts calls the same endpoint as the Edge browser's Immersive Reader: no API key, no billing. ElevenLabs has better voice quality for long-form, but their free tier exhausts in one video. I'll revisit if the channel earns enough to justify it.

How long does a full render take in GitHub Actions?

4–5 minutes for an 80-segment, 10-minute video. TTS is the dominant cost, not ffmpeg. The ubuntu-latest runner has enough CPU for x264 encoding without a GPU; the -tune stillimage flag keeps encoding fast since there's no motion to encode.

Can I add slide transitions or animations?

Not with the current approach. The pipeline builds static PNGs held as still images; smooth transitions would require either rendering individual frames (extremely slow) or using ffmpeg xfade filter graphs between clip segments. The xfade approach is doable but adds complexity to the concat step. Abrupt cuts between slides are fine for an educational talking-heads format.

What happens if edge-tts fails midway through a 60-segment spec?

The script exits immediately with a clear error that names the speaker and the first 50 characters of the failing text. output.mp4 is not written, because ffmpeg never runs. Re-runs are safe because the pipeline regenerates everything from scratch; there's no partial state to clean up.

How do I test a spec locally before pushing to CI?

python3 scripts/yt-longform/build_longform.py my-spec.json \
  --workdir /tmp/lf --outdir /tmp/lf/slides
open /tmp/lf/output.mp4

The local requirements are Python 3, ffmpeg, Pillow, and edge-tts (pip install pillow edge-tts), plus one of the fonts the slide renderer looks for. The render is identical to CI: the pipeline doesn't use any GitHub-specific environment variables during render, only during the upload step.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
