🛒 E-commerce / Retail

AI Voice Agent + Video Call

Text chat (GLM-4.7-Flash), voice mode (Nova-3 STT + Aura-2 TTS), and RealtimeKit WebRTC video call — all from a single Cloudflare Worker. Zero backend changes to your origin.

The Problem

"HÖMSTYLE customers can't get instant furniture advice; support email takes 3–5 days."

The Outcome

One Worker to deploy. Text chat → opt-in voice → RealtimeKit video call. All in one script.

Live demo below

Productionising this

What changes when you ship this for real

API token scope

RTK_API_TOKEN must be a Cloudflare API token with the Realtime permission only. Never reuse a global token. Rotate quarterly.

Per-meeting auth tokens

Participant tokens are JWTs with exp ~24h. Don't cache them — generate one per meeting per participant. The /api/rtk/meeting endpoint already does this.
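
To confirm the expiry on a freshly issued participant token, you can read its exp claim. The sketch below is a minimal helper under stated assumptions: the token is a standard three-segment JWT with a numeric exp in seconds, and the function names are illustrative, not part of the demo. It does not verify the signature, so use it for logging and diagnostics only.

```typescript
// Decode a JWT payload WITHOUT verifying the signature.
// Suitable for inspecting expiry in logs; never use it for trust decisions.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("not a JWT: expected 3 dot-separated segments");
  }
  // Convert base64url to base64 and restore the stripped padding.
  const b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  // atob is available in Workers and modern Node; ASCII claims such as
  // exp decode correctly, non-ASCII string claims may not.
  return JSON.parse(atob(padded)) as Record<string, unknown>;
}

// Seconds until the token expires; a negative value means already expired.
function secondsUntilExpiry(token: string, nowMs: number = Date.now()): number {
  const exp = decodeJwtPayload(token).exp;
  if (typeof exp !== "number") {
    throw new Error("token has no numeric exp claim");
  }
  return exp - Math.floor(nowMs / 1000);
}
```

Logging secondsUntilExpiry(token) at issue time makes it obvious if a token is ever handed out with less lifetime than expected, which usually points at accidental reuse.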

Custom presets

Default group_call_participant / group_call_host presets are the demo's starting point. For production, create custom presets in the dashboard with the exact permissions your role needs (mute, kick, screen-share, etc.).
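
One way to keep preset assignment explicit is a single role-to-preset table that fails closed on unknown roles. This is a sketch: the role names "customer" and "advisor" are hypothetical, and the values shown are the demo defaults, which you would replace with your own custom preset names.

```typescript
// Hypothetical application roles; adjust to your own model.
type AppRole = "customer" | "advisor";

// The values are the demo's default presets. In production, replace them
// with the custom preset names you created in the dashboard.
const PRESET_BY_ROLE: Record<AppRole, string> = {
  customer: "group_call_participant",
  advisor: "group_call_host",
};

// Resolve a preset name, throwing on unknown roles so that nobody is
// silently given a default (possibly over-privileged) preset.
function presetForRole(role: string): string {
  if (!Object.prototype.hasOwnProperty.call(PRESET_BY_ROLE, role)) {
    throw new Error(`unknown role: ${role}`);
  }
  return PRESET_BY_ROLE[role as AppRole];
}
```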

STT word error rate

Nova-3 has a lower WER than Whisper on conversational English; Whisper-Turbo wins on long-form and multilingual audio. Pick per use case; both are billed at ~$0.0005/audio-minute.
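
The per-use-case choice can live in one small function. The sketch below is an illustration, not the demo's code: the model IDs are assumed Workers AI identifiers to verify against the catalog, and the 10-minute long-form threshold and SttRequest shape are hypothetical.

```typescript
// Assumed Workers AI model IDs; verify against the current model catalog.
const NOVA_3 = "@cf/deepgram/nova-3";
const WHISPER_TURBO = "@cf/openai/whisper-large-v3-turbo";

// Hypothetical cut-off above which audio is treated as long-form.
const LONG_FORM_MINUTES = 10;

interface SttRequest {
  language: string;        // BCP 47 tag, e.g. "en-US" or "de"
  expectedMinutes: number; // rough expected audio length
  conversational: boolean; // live dialogue rather than dictation or uploads
}

// Nova-3 for short conversational English, Whisper-Turbo for everything else.
function pickSttModel(req: SttRequest): string {
  const isEnglish = req.language.toLowerCase().startsWith("en");
  const isLongForm = req.expectedMinutes > LONG_FORM_MINUTES;
  return isEnglish && req.conversational && !isLongForm ? NOVA_3 : WHISPER_TURBO;
}
```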

Recording + retention

Stream the call audio to R2 via the RealtimeKit recording API. Encrypt at rest. Set R2 lifecycle to delete recordings after 90 days unless flagged.
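
"Delete after 90 days unless flagged" is easiest to enforce if flagged and unflagged recordings live under different key prefixes, so a lifecycle rule can target only the unflagged one. The sketch below assumes that layout; the prefix names, the .webm extension, and both helper names are hypothetical.

```typescript
const RETENTION_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical key layout: a lifecycle rule scoped to "recordings/unflagged/"
// expires objects after 90 days, while "recordings/flagged/" has no rule.
function recordingKey(meetingId: string, flagged: boolean): string {
  const prefix = flagged ? "recordings/flagged/" : "recordings/unflagged/";
  return `${prefix}${meetingId}.webm`; // file extension is an assumption
}

// The same policy as a pure function, useful for audits or a cleanup job.
function isPastRetention(uploadedAtMs: number, flagged: boolean, nowMs: number): boolean {
  if (flagged) return false; // flagged recordings are kept
  return nowMs - uploadedAtMs > RETENTION_DAYS * DAY_MS;
}
```

Flagging a recording then means copying the object to the flagged prefix and deleting the original, rather than changing metadata in place.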

Observability

The /api/rtk/meeting handler now surfaces cloudflareStatus + cloudflareError on failures. Pipe these into your incident dashboard so a 401 from a rotated token gets caught immediately.
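
A small classifier can turn those two fields into an alert severity before they reach the dashboard. This is a sketch under assumptions: cloudflareStatus is taken to be the upstream HTTP status as a number and cloudflareError a string, and the severity labels and function name are hypothetical.

```typescript
// Assumed shape of the failure fields surfaced by /api/rtk/meeting.
interface MeetingFailure {
  cloudflareStatus: number; // upstream HTTP status (assumed numeric)
  cloudflareError: string;  // upstream error message (assumed string)
}

type Severity = "page" | "warn" | "info";

// Map an upstream failure to a severity and a short reason for the alert.
function classifyMeetingFailure(f: MeetingFailure): { severity: Severity; reason: string } {
  if (f.cloudflareStatus === 401 || f.cloudflareStatus === 403) {
    // Most likely a rotated, expired, or under-scoped RTK_API_TOKEN.
    return { severity: "page", reason: "auth failure: check RTK_API_TOKEN rotation and scope" };
  }
  if (f.cloudflareStatus === 429) {
    return { severity: "warn", reason: "rate limited upstream" };
  }
  if (f.cloudflareStatus >= 500) {
    return { severity: "page", reason: "upstream server error" };
  }
  return { severity: "info", reason: `unclassified upstream error: ${f.cloudflareError}` };
}
```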

In-progress: real-time streaming rework

A continuous WebSocket rebuild of this demo using @cf/deepgram/flux (purpose-built for voice agents, with native turn detection) has been built and deployed to a Durable Object (src/workers/VoiceAgentRoom.ts), but it is NOT wired into this live demo yet. The upstream Flux connection has an unresolved issue: it cycles and reconnects roughly every 5 seconds regardless of audio activity or eot_timeout_ms tuning, which prevents sustained transcription. The rest of the infrastructure (DO, sidecar Worker binding, Astro WS proxy route, client-side PCM streaming) is fully built, and the WS handshake works end-to-end. Treat this as a documented starting point for whoever picks up the debugging next, not as abandoned work.
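
For whoever picks up the debugging, a first step could be measuring the reconnect cadence precisely, to tell a fixed timer from load-dependent drops. The sketch below is a hypothetical diagnostic helper, not code from VoiceAgentRoom.ts: call recordClose from the upstream socket's close handler, then inspect the intervals.

```typescript
// Hypothetical diagnostic: tracks timestamps of upstream WebSocket closes.
class ReconnectTracker {
  private closes: number[] = [];

  // Call from the upstream socket's close handler, e.g. with Date.now().
  recordClose(atMs: number): void {
    this.closes.push(atMs);
  }

  // Gaps between consecutive closes, in milliseconds.
  intervalsMs(): number[] {
    const out: number[] = [];
    for (let i = 1; i < this.closes.length; i++) {
      out.push(this.closes[i] - this.closes[i - 1]);
    }
    return out;
  }

  // Median gap, or null if fewer than two closes have been recorded.
  medianIntervalMs(): number | null {
    const sorted = this.intervalsMs().sort((a, b) => a - b);
    if (sorted.length === 0) return null;
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  // True when at least three gaps all sit within toleranceMs of the median,
  // which suggests a fixed timer rather than audio-dependent behaviour.
  looksPeriodic(toleranceMs: number = 500): boolean {
    const gaps = this.intervalsMs();
    const median = this.medianIntervalMs();
    if (median === null || gaps.length < 3) return false;
    return gaps.every((g) => Math.abs(g - median) <= toleranceMs);
  }
}
```

A tightly periodic result would point at a timeout or keepalive setting on one side of the connection; irregular gaps would point back at audio framing or turn-detection behaviour.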