Corrective Action: Media Transcription Pipeline Silent Failure — 36 Episodes Stuck for 11 Weeks

Date: March 30, 2026
Category: Verification Failure + Missing Monitoring
Impact: 36 episodes (Jan 13 – Mar 29) accumulated in transcriptionStatus: "processing" without completing. No agent or system detected the stall. Transcription intelligence (summaries, key points, trap detection) was missing from these episodes, and the content multiplication pipeline operated on incomplete data.
Resolution Time: ~30 minutes (batch reset). Root cause investigation ongoing.


Incident

What Happened

During the March 30 /media-prep briefing, Encore flagged that the Tech Stack Reimagined episode (Mar 29, 153 minutes with Nico) had been in transcriptionStatus: "processing" for over 24 hours. Investigation revealed this was not an isolated case — 36 episodes spanning January 13 through March 29, 2026 were stuck in the same state. The Inngest transcription workflow was receiving events (status was being set to "processing") but never completing. Meanwhile, 71 other episodes had completed transcriptions, indicating the pipeline works only intermittently.

Contributing Factors

  1. Inngest credentials were only added to Vercel on March 25, 2026. The 33 episodes from Jan 13 – Mar 24 were set to "processing" by the Mux webhook handler, but the Inngest workflow had no credentials to execute.
  2. No monitoring exists between webhook dispatch and workflow completion. The Mux webhook fires, sets status to "processing", sends an Inngest event, and returns 200. Nothing checks whether Inngest actually picks up and completes the workflow.
  3. /media-recap only checks yesterday's recordings when manually run. It does not scan for systemic pipeline stalls.
  4. /media-prep morning briefing did not surface transcription health. The dashboard shows individual episode transcript status but not pipeline-wide health.
  5. get-social-schedule.js contained a hardcoded "26 undistributed articles" string (lines 180, 223) that was presented as live data, masking the actual distribution state.
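The gap described in factor 2 can be sketched as follows. This is an illustrative TypeScript reconstruction, not the actual handler in apps/website/src/pages/api/webhooks/mux.ts; the `setStatus` and `sendEvent` parameters are hypothetical stand-ins for the Sanity and Inngest clients:

```typescript
// Illustrative shape of the fire-and-forget webhook flow described above.
// setStatus and sendEvent stand in for the Sanity and Inngest clients.
type Deps = {
  setStatus: (episodeId: string, status: string) => Promise<void>;
  sendEvent: (name: string, data: unknown) => Promise<void>;
};

export async function handleAssetReady(
  episodeId: string,
  deps: Deps,
): Promise<number> {
  await deps.setStatus(episodeId, "processing"); // 1. mark the episode
  await deps.sendEvent("mux/video.asset.ready", { episodeId }); // 2. dispatch
  return 200; // 3. report success — nothing ever verifies the workflow completed
}
```

Step 3 is the gap: the handler's 200 asserts only that the event was dispatched, not that transcription finished.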

Timeline

| Date | Event |
| --- | --- |
| Jan 13 | Earliest episode set to processing (AI Daily, Data) |
| Jan 13 – Mar 24 | 33 episodes accumulate in processing without Inngest credentials |
| Mar 25 | INNGEST_EVENT_KEY and INNGEST_SIGNING_KEY added to Vercel |
| Mar 24-29 | 3 more episodes enter processing post-credential-add (different failure mode) |
| Mar 30 05:58 | /media-prep run; Encore flags Tech Stack Reimagined stall |
| Mar 30 06:15 | Investigation reveals 36 total stuck episodes |
| Mar 30 06:25 | Batch reset: all 36 episodes set from "processing" → "pending" |

Root Cause

Two distinct failures:

Failure 1 (33 episodes, Jan 13 – Mar 24): The Mux webhook handler at apps/website/src/pages/api/webhooks/mux.ts successfully received video.asset.ready events and dispatched Inngest events via inngest.send(). However, the Inngest runtime on Vercel had no credentials (INNGEST_EVENT_KEY, INNGEST_SIGNING_KEY) until March 25. The inngest.send() call likely failed silently or the events were accepted but no function could authenticate to process them.

Failure 2 (3 episodes, Mar 24-29): After credentials were added, 3 more episodes still stalled. This indicates a secondary issue — possibly the Inngest function itself failing (missing GEMINI_API_KEY, Gemini API quota, audio download failure, or function timeout on long recordings like the 153-minute Tech Stack Reimagined).

Monitoring failure (all 36): No system checked whether the pipeline completed. The status was set to "processing" by the webhook handler and never updated because the downstream workflow never ran. The only detection mechanism — Encore checking for >24h stalls — is manual and session-dependent.

Category: Verification Failure + Missing Monitoring

This is a pipeline without a circuit breaker. The webhook handler returns success after dispatching the event, but the actual transcription happens asynchronously with no completion verification. When the async step fails, the pipeline enters a permanently stuck state that accumulates silently.
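One shape the missing circuit breaker could take, sketched under the assumption that episode documents carry a transcriptionStatus and an ISO-8601 airDate field (the Episode interface below is illustrative, not the Sanity schema):

```typescript
// Sketch of a completion verifier: any episode still "processing" more than
// 24 hours after its air date is treated as a pipeline failure, not a
// pending job.
interface Episode {
  _id: string;
  transcriptionStatus: string;
  airDate: string; // ISO 8601
}

const DAY_MS = 24 * 60 * 60 * 1000;

export function findStalled(episodes: Episode[], now: Date = new Date()): Episode[] {
  return episodes.filter(
    (ep) =>
      ep.transcriptionStatus === "processing" &&
      now.getTime() - Date.parse(ep.airDate) > DAY_MS,
  );
}
```

Run against all episodes on a schedule, a non-empty result is an alert, not a backlog.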

Additional finding: get-social-schedule.js contained hardcoded strings ("26 undistributed articles") on lines 180 and 223 that were presented as live data by the Broadcast agent. This is a data integrity issue — static strings in operational scripts that agents report as facts.


Fix Applied

Immediate Resolution

  1. Batch reset all 36 episodes from transcriptionStatus: "processing" → "pending" via Sanity client
  2. Removed hardcoded "26 undistributed articles" from get-social-schedule.js
  3. Updated /media-prep command to include yesterday's on-demand audience-ready links (previously missing per Mar 24 feedback)
  4. Updated content-queue.json — "The Leads Trap in Disguise" status corrected from drafted → published
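The batch reset in step 1 amounts to a set of Sanity patch mutations. A sketch that builds the mutation payloads as plain objects so their shape is visible — episode IDs would come from the stuck-episode query, and committing them via @sanity/client is left out:

```typescript
// Build the patch mutations for a batch reset of stuck episodes. Each
// mutation sets transcriptionStatus back to "pending" so the workflow can
// be re-triggered.
export function buildResetMutations(ids: string[]) {
  return ids.map((id) => ({
    patch: {
      id,
      set: { transcriptionStatus: "pending" },
    },
  }));
}
```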

Code/Configuration Changes

| File | Change |
| --- | --- |
| scripts/hubspot/get-social-schedule.js:180 | Removed hardcoded "26 undistributed articles" from empty-schedule message |
| scripts/hubspot/get-social-schedule.js:223 | Removed hardcoded "26 undistributed articles" from gap-detected message |
| .claude/commands/media-prep.md | Added audience-ready verification for yesterday's episodes in schedule post section |
| agents/content-multiplier/data/content-queue.json | Updated "The Leads Trap in Disguise" status: drafted → published |
| 36 Sanity episodes | transcriptionStatus: "processing" → "pending" |

Verification

node scripts/sanity/query.js --query 'count(*[_type == "episode" && transcriptionStatus == "processing"])'
→ 0

node scripts/sanity/query.js --query 'count(*[_type == "episode" && transcriptionStatus == "pending"])'
→ (increased by 36)

Prevention Measures

Rules Added

| Layer | File | Rule |
| --- | --- | --- |
| Critical Lessons | MEMORY.md | NEVER trust async pipeline completion without monitoring. The Mux → Inngest → Gemini transcription pipeline ran for 11 weeks with 36 failures because no agent checked completion. Every async dispatch needs a completion verifier. |
| Critical Lessons | MEMORY.md | NEVER hardcode counts or metrics in operational scripts. get-social-schedule.js had "26 undistributed articles" as a static string. Agents reported it as live data. Every number in an ops script must come from a query. |
| Operations | memory/operations.md | Transcription pipeline status: 36 episodes reset to pending (Mar 30). Inngest credentials added Mar 25. Pipeline needs completion monitoring — no agent currently watches for stalled transcriptions. |

Detection Triggers

For /media-prep and /media-recap: Add a pipeline health check: query `count(*[_type == "episode" && transcriptionStatus == "processing" && dateTime(airDate) < dateTime(now()) - 86400])` — any episode in "processing" for >24 hours is a pipeline failure, not a pending job.

Structural Gaps Identified

  1. No codename for Content Pipeline orchestrator — runs create-weekly-episodes.ts, run-pipeline.ts, check-shows.ts but has no identity. Unnamed agents get less coherent behavior.
  2. No codename for the Inngest transcription workflow — it's infrastructure, not an agent, but it needs monitoring ownership.
  3. No codename for Transcript Harvester or Transcript Backfill — both are manual batch tools with no agent identity.
  4. Encore only runs when Marquee spawns it — no autonomous stall detection.
  5. Content queue scan (lastScanDate) has no freshness monitoring — went 13 days stale without detection.

Lessons

Every async pipeline needs a completion verifier. "Fire and forget" is acceptable for the dispatch — but something must check the other end. In this system, the Mux webhook handler returns 200 and moves on. Nothing asks "did the Inngest workflow finish?" The gap between dispatch and completion is where 36 episodes disappeared for 11 weeks.

Static strings in operational scripts are a particularly insidious form of data drift. Agents trust tool output as ground truth. When a script says "26 undistributed articles," the agent reports it as fact, the briefing presents it as intelligence, and humans act on fabricated data. Every number in an ops script must be computed from a live query.
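A sketch of the corrected approach: the number is derived from two live inputs rather than embedded as a string. Both function and parameter names are illustrative; in get-social-schedule.js the inputs would come from the Sanity article query and the HubSpot broadcast query:

```typescript
// Derive the undistributed-article count from live inputs instead of a
// hardcoded string. publishedArticles and distributedArticles would come
// from Sanity and HubSpot respectively (names illustrative).
export function undistributedCount(
  publishedArticles: number,
  distributedArticles: number,
): number {
  return Math.max(0, publishedArticles - distributedArticles);
}

// Render the message from the computed gap, so the agent can never report
// a stale literal as live data.
export function scheduleMessage(gap: number): string {
  return gap === 0
    ? "No undistributed articles."
    : `${gap} undistributed article${gap === 1 ? "" : "s"} awaiting distribution.`;
}
```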


Related Incidents

  • Mar 8, 2026: Calendar invitees reported as attendees — a different form of "trusting upstream data without verification"
  • Feb 18, 2026: HubSpot Broadcast API published 15 posts immediately — fire-and-forget without verification
  • Mar 4, 2026: esbuild/tsx block comment failures — silent infrastructure failures that accumulate
  • Mar 23, 2026: Background workers shut down because GitHub Actions reports never reached local filesystem — another case of a pipeline producing "success" while delivering nothing

Recommended Next Steps

  1. Assign codenames to all unnamed media pipeline agents (Content Pipeline, Transcript Harvester, Transcript Backfill)
  2. Build a transcription completion monitor — owned by a named agent, runs during /media-prep and /media-recap
  3. Add pipeline health query to media commands — surface stalled transcriptions automatically
  4. Make get-social-schedule.js compute distribution gaps live from Sanity article count vs. HubSpot broadcast count
  5. Investigate Failure 2 — why 3 post-Mar-25 episodes still stalled (Gemini quota? Audio download? Function timeout?)