Streaming / Entertainment

Auto Subtitles — Workers AI Nova-3 + Whisper

Upload any video or audio file. Workers AI routes to the best STT model per language — Nova-3 for English, Whisper for Thai and other local languages — then runs a two-pass AI correction before exporting a timestamped WebVTT file.

The Problem

"Streaming platforms need subtitles in the local language their viewers actually speak. Manual transcription costs $1–3/min. AWS Transcribe costs $0.024/min and requires S3, IAM, and a batch pipeline. Workers AI: Whisper at $0.0005/min for Thai, Nova-3 at $0.0052/min for English — both serverless, zero infrastructure."

The Outcome

47× cheaper than AWS Transcribe Standard. Whisper large-v3-turbo runs at $0.00051/audio-minute on the Cloudflare edge, versus $0.024/min on AWS Transcribe Standard (US-East-1, first 250K min). Supports 90+ languages including Thai, Indonesian, Vietnamese, and English. No servers, no S3 buckets, no batch jobs.
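The 47× figure follows from the two per-minute list prices quoted above. A minimal sketch of that arithmetic, using only those two rates; the constant and function names are illustrative and not part of any API:

```javascript
// Per-audio-minute list prices quoted on this page (USD).
const RATES = {
  awsTranscribeStandard: 0.024,
  whisperLargeV3Turbo: 0.00051,
}

// Cost of transcribing `minutes` of audio at a given per-minute rate.
function transcriptionCost(minutes, ratePerMinute) {
  return minutes * ratePerMinute
}

// How many times cheaper `cheap` is than `expensive`.
function timesCheaper(expensive, cheap) {
  return expensive / cheap
}

const ratio = timesCheaper(RATES.awsTranscribeStandard, RATES.whisperLargeV3Turbo)
console.log(Math.floor(ratio) + 'x cheaper')
console.log('2-hour file on Whisper: $' + transcriptionCost(120, RATES.whisperLargeV3Turbo).toFixed(4))
console.log('2-hour file on AWS:     $' + transcriptionCost(120, RATES.awsTranscribeStandard).toFixed(2))
```

For a 2-hour file this works out to roughly 6 cents on Whisper against $2.88 on AWS Transcribe Standard, before any storage or pipeline costs on either side.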

AWS Transcribe: $0.024/min · batch pipeline · S3 + IAM required
Workers AI Whisper: $0.00051/min · 47× cheaper · multilingual · no infra
Supported formats: MP4, MOV, MP3, M4A, WAV, WebM, OGG, AAC, FLAC, MKV
Best for: lectures, interviews, presentations, voice-overs — any clear speech content
Note: music with instrumentation will have gaps — Whisper is a speech model, not a music transcription model
Tip for long files such as lecture recordings: extract the audio track first (MP3/M4A ≈ 50–120 MB) to cut decode time
# Extract audio only, without re-encoding (works when the source audio is already AAC): 2hr lecture → ~50 MB M4A
ffmpeg -i lecture.mp4 -vn -acodec copy lecture.m4a

# Or convert to MP3 at 64 kbps (smaller, still great quality)
ffmpeg -i lecture.mp4 -vn -acodec libmp3lame -b:a 64k lecture.mp3
Processing pipeline
1. Upload: any audio/video format
2. Decode + resample: in the browser, to 16 kHz mono WAV
3. Split into 30s chunks: 5s overlap, ~1 MB each
4. Transcribe, 3 chunks in parallel: Whisper large-v3-turbo · all languages
5. Correct + VTT: SEA-LION + GPT-OSS 120B + export
2-hour lecture (~290 chunks at a 25s stride, i.e. 30s chunks with 5s overlap) · 3 chunks run in parallel
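The split and parallel steps of the pipeline can be sketched as follows. This is a minimal illustration and not the demo's actual code: planChunks and mapLimit are hypothetical helper names, and the window and overlap defaults are taken from the step descriptions above.

```javascript
// Plan overlapping windows over an audio file of `totalSec` seconds.
// Defaults mirror the pipeline steps: 30 s windows with 5 s of overlap.
function planChunks(totalSec, windowSec = 30, overlapSec = 5) {
  const stride = windowSec - overlapSec
  if (stride <= 0) throw new Error('overlap must be smaller than the window')
  const chunks = []
  for (let start = 0; start < totalSec; start += stride) {
    const end = Math.min(start + windowSec, totalSec)
    chunks.push({ index: chunks.length, start, end })
    if (end >= totalSec) break // last window reached the end of the file
  }
  return chunks
}

// Run `worker` over `items` with at most `limit` calls in flight,
// preserving input order in the returned array.
async function mapLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  async function lane() {
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i], i)
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane)
  await Promise.all(lanes)
  return results
}

// Usage sketch: transcribeChunk is a hypothetical function that uploads one chunk.
// const chunks = planChunks(durationSec)
// const results = await mapLimit(chunks, 3, chunk => transcribeChunk(chunk))
```

Capping concurrency at 3 keeps memory use in the browser bounded, since each in-flight chunk holds about 1 MB of WAV data plus its request.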
API Worker — POST /api/subtitles/transcribe
// Bind Workers AI as AI in wrangler.toml: [ai] binding = "AI"

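// Helper: convert seconds to WebVTT timestamp "HH:MM:SS.mmm"
function toVttTime(s) {
  // Work in whole milliseconds so rounding can never produce "60.000" seconds
  const ms  = Math.max(0, Math.round(s * 1000))
  const h   = Math.floor(ms / 3600000)
  const m   = Math.floor((ms % 3600000) / 60000)
  const sec = (ms % 60000) / 1000
  return String(h).padStart(2,'0') + ':' + String(m).padStart(2,'0') + ':' +
         sec.toFixed(3).padStart(6,'0')
}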

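// Helper: encode ArrayBuffer to base64 without Node.js Buffer
function bufferToBase64(ab) {
  // Spreading a multi-megabyte array into fromCharCode overflows the call stack,
  // so build the binary string in 32 KB slices
  const bytes = new Uint8Array(ab)
  let binary = ''
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000))
  }
  return btoa(binary)
}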

export default {
  async fetch(request, env) {
    if (request.method !== 'POST') return fetch(request)

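    const form     = await request.formData()
    const file     = form.get('file')
    // Reject requests without an uploaded file instead of throwing below
    if (!file || typeof file === 'string') {
      return Response.json({ error: 'Missing "file" upload' }, { status: 400 })
    }
    const language = form.get('language') || 'auto'
    const buffer   = await file.arrayBuffer()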

    // Model routing based on language:
    //   Thai/ASEAN  ->  Whisper large-v3-turbo  ($0.0005/min)
    //   English/EU  ->  Deepgram Nova-3          ($0.0052/min, lower WER)
    const ASEAN = ['th','id','vi','ms','km','lo','my','tl']
    const isAsean = ASEAN.includes(language)

    let transcript = ''
    let segments   = []

    if (isAsean) {
      // Whisper requires base64-encoded audio — use btoa(), not Node.js Buffer
      const base64 = bufferToBase64(buffer)
      const result = await env.AI.run('@cf/openai/whisper-large-v3-turbo', {
        audio: base64, language
      })
      transcript = result.text ?? ''
      segments   = (result.segments ?? []).map(s => ({
        start: s.start, end: s.end, text: s.text
      }))
    } else {
      // Nova-3 accepts a ReadableStream for audio body
      const stream = new ReadableStream({
        start(c) { c.enqueue(new Uint8Array(buffer)); c.close() }
      })
      const result = await env.AI.run('@cf/deepgram/nova-3', {
        audio: { body: stream, contentType: file.type },
        smart_format: true, utterances: true, detect_language: true
      })
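      // Deepgram nests utterances under `results`; fall back to the top level just in case
      segments   = (result.results?.utterances ?? result.utterances ?? []).map(u => ({
        start: u.start, end: u.end, text: u.transcript
      }))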
      transcript = result.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? ''
    }

    // Build WebVTT inline — no external helper needed
    const vtt = 'WEBVTT\n\n' + segments.map((s, i) =>
      (i + 1) + '\n' + toVttTime(s.start) + ' --> ' + toVttTime(s.end) + '\n' + s.text
    ).join('\n\n')

    // Not shown in this sample: the two-pass correction that runs after STT
    //   SEA-LION 27B  cleans per-chunk hallucinations (foreign script, garbage words)
    //   GPT-OSS 120B  corrects domain vocabulary across the full transcript

    return Response.json({ transcript, segments, vtt })
  }
}
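
A caller-side sketch of how a page might invoke the Worker above. It assumes the Worker is routed at /api/subtitles/transcribe as the heading states; buildTranscribeRequest and requestSubtitles are hypothetical helper names, not part of the demo.

```javascript
// Build the multipart request the Worker expects:
// a "file" part plus an optional "language" code.
function buildTranscribeRequest(file, language = 'auto') {
  const form = new FormData()
  form.append('file', file, file.name ?? 'audio.wav')
  form.append('language', language)
  return { method: 'POST', body: form }
}

// POST the file and return the WebVTT text from the JSON response.
async function requestSubtitles(file, language, endpoint = '/api/subtitles/transcribe') {
  const res = await fetch(endpoint, buildTranscribeRequest(file, language))
  if (!res.ok) throw new Error('Transcription failed: HTTP ' + res.status)
  const { vtt } = await res.json()
  return vtt
}
```

The Content-Type header is left unset on purpose: fetch generates the multipart boundary itself when the body is a FormData object.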

Productionising this

What changes when you ship this for real

Pick the right Whisper variant

Use @cf/openai/whisper-large-v3-turbo ($0.00051/min) for short-form or interactive use. Use @cf/openai/whisper ($0.00045/min) for batch jobs where the turbo model's small accuracy gain is not worth the extra cost.

Chunk long audio

Whisper caps the audio length per call. For files longer than 30 min, split the audio into 5–10 min chunks in a Worker, transcribe them in parallel with Promise.all, and stitch the VTT timestamps client-side.
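
Stitching means shifting each chunk's segment times by that chunk's start offset and dropping speech that was already captured in the previous chunk's overlap. A minimal sketch under those assumptions; stitchSegments and the { offsetSec, segments } shape are hypothetical, and the overlap rule is a simple heuristic rather than a full alignment:

```javascript
// chunkResults: [{ offsetSec, segments: [{ start, end, text }] }, ...]
// Each segment's times are relative to the start of its own chunk.
function stitchSegments(chunkResults) {
  const ordered = [...chunkResults].sort((a, b) => a.offsetSec - b.offsetSec)
  const stitched = []
  let coveredUntil = 0 // absolute end time of the last segment kept
  for (const { offsetSec, segments } of ordered) {
    for (const seg of segments) {
      const start = seg.start + offsetSec
      const end = seg.end + offsetSec
      // Skip speech already emitted from the previous chunk's overlap region
      if (start < coveredUntil) continue
      stitched.push({ start, end, text: seg.text.trim() })
      coveredUntil = end
    }
  }
  return stitched
}
```

Because the rule drops any segment that starts inside already-covered time, a segment that straddles the overlap boundary can be lost; matching on text similarity would be a more careful alternative.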

Persist VTT to R2

Save the generated .vtt next to the video in R2. Stream's Captions API references it. Re-running Whisper on every playback wastes Neurons, the unit Workers AI bills in.
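
A minimal sketch of that caching step, assuming an R2 bucket binding is passed in (for example env.MEDIA, a hypothetical binding name). The helper names and the key scheme are illustrative; put, get, and httpMetadata are the standard R2 binding calls.

```javascript
// Derive the caption key from the video key: "videos/talk.mp4" -> "videos/talk.vtt"
function vttKeyFor(videoKey) {
  return videoKey.replace(/\.[^./]+$/, '') + '.vtt'
}

// Store the captions next to the video with the correct content type.
async function saveVtt(bucket, videoKey, vtt) {
  const key = vttKeyFor(videoKey)
  await bucket.put(key, vtt, { httpMetadata: { contentType: 'text/vtt' } })
  return key
}

// Return the cached captions, or null when the video has not been transcribed yet.
async function loadVtt(bucket, videoKey) {
  const object = await bucket.get(vttKeyFor(videoKey))
  return object ? await object.text() : null
}
```

Calling loadVtt before transcribing lets the Worker skip the model entirely whenever captions already exist for the asset.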

Language detection

Whisper auto-detects language. For multi-region content, log the detected language alongside the asset so future searches / analytics can filter.
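
A small sketch of picking the language to log with the asset. It assumes the Whisper response exposes its guess as transcription_info.language; treat that field name as an assumption and check it against the response you actually receive. detectedLanguage is a hypothetical helper.

```javascript
// Pick the language to record for an asset.
// Assumption: the model reports its guess under result.transcription_info.language.
function detectedLanguage(result, requested = 'auto') {
  const detected = result?.transcription_info?.language
  if (detected) return { language: detected, source: 'detected' }
  if (requested !== 'auto') return { language: requested, source: 'requested' }
  return { language: 'und', source: 'unknown' } // "und" = undetermined (ISO 639-2)
}
```

Recording the source alongside the code makes it possible to tell later whether a language tag came from the model or from the uploader.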

Quality bar

Run a quick post-process: drop segments with confidence below 0.6, then optionally pass the cleaned VTT through Llama 3.1 to fix grammar.
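
A sketch of the confidence filter. Whisper does not return a field literally named confidence, so this assumes each segment carries avg_logprob (the mean token log-probability) and treats exp(avg_logprob) as the score; both the mapping and the helper names are assumptions for illustration.

```javascript
// Assumption: each segment carries Whisper's avg_logprob.
// exp(avg_logprob) maps it to a 0..1 score that is treated as "confidence" here.
function segmentConfidence(seg) {
  return typeof seg.avg_logprob === 'number' ? Math.exp(seg.avg_logprob) : null
}

// Keep segments at or above the threshold; segments with no score are kept
// rather than discarded on a guess.
function dropLowConfidence(segments, threshold = 0.6) {
  return segments.filter(seg => {
    const c = segmentConfidence(seg)
    return c === null || c >= threshold
  })
}
```

The filter has to run on the raw model segments, before they are reduced to { start, end, text } for VTT export, because that reduction discards the score.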

GDPR / accessibility

Captions count as accessibility data — disclose retention. For user-recorded video, encrypt the source audio in R2 and only retain VTT for as long as the video is active.