A real-time STT pipeline streams audio chunks from the browser over a WebSocket, processes them with AI models such as Whisper or Sarvam, and streams live transcription back over the same connection.
🎙️ Introduction
Real-time speech-to-text (STT) powers:
- Voice assistants 🗣️
- Live captions 📺
- AI call systems 📞
- Healthcare voice notes 🏥
But building it is not just “send audio → get text”.
You need:
- Streaming pipeline
- Buffering strategy
- AI processing
- Real-time response
Let’s break it down end-to-end 🔥
🧠 High-Level Architecture
🎤 Frontend (Mic)
│ (audio chunks via WebSocket)
▼
🌐 Backend Gateway (NestJS)
▼
📦 Message Broker (Kafka / RabbitMQ)
▼
🤖 STT Service (Whisper / Sarvam)
▼
📤 Transcription Stream
▼
🌐 Backend → WebSocket
▼
💻 Frontend (Live Text)
🔌 Connection Flow (Who Talks to Whom)
Frontend ──WS──▶ Backend
Backend ──publish──▶ Kafka / RabbitMQ
Broker ──consume──▶ STT Worker
STT Worker ──publish result──▶ Broker / Backend
Backend ──WS──▶ Frontend
🔁 End-to-End Real-Time Flow
1. 🎤 Audio Streaming (Frontend)
const ws = new WebSocket("ws://localhost:3000");
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, { mimeType: "audio/webm" });

mediaRecorder.ondataavailable = (e) => {
  if (ws.readyState === WebSocket.OPEN) ws.send(e.data); // binary chunk
};
mediaRecorder.start(200); // emit a chunk every 200ms
2. ⚙️ Backend Receives Audio
handleAudio(client, chunk) {
  // kafkajs expects a `messages` array; keying by client id routes
  // every chunk from one user to the same partition (preserves order)
  producer.send({
    topic: "audio-stream",
    messages: [{ key: client.id, value: chunk }]
  });
}
3. 📦 Queue (Kafka / RabbitMQ)
Acts as:
- buffer
- load balancer
- decoupler
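Those three roles can be felt with a minimal in-process stand-in. This sketch uses Python's `queue.Queue` instead of a real broker; `stt_worker` and the `transcribed:` prefix are placeholders for actual STT, but the buffering and decoupling behave the same way:

```python
import queue
import threading

# A broker decouples producer rate from consumer rate. A bounded
# queue buffers bursts and applies backpressure when full.
audio_queue = queue.Queue(maxsize=100)
results = []

def stt_worker():
    while True:
        chunk = audio_queue.get()
        if chunk is None:                       # sentinel: shut down
            audio_queue.task_done()
            break
        results.append(f"transcribed:{chunk}")  # placeholder for real STT
        audio_queue.task_done()

worker = threading.Thread(target=stt_worker)
worker.start()

for i in range(5):            # the backend "publishes" chunks
    audio_queue.put(f"chunk{i}")
audio_queue.put(None)
worker.join()
```

Swapping the queue for Kafka or RabbitMQ changes the transport, not this shape: producers never wait on the model, and you scale by adding workers.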
4. 🤖 STT Worker Processes Audio
# buffers = defaultdict(list)  — per-user chunk accumulator
buffers[user_id].append(chunk)

if len(buffers[user_id]) >= 5:
    audio = b''.join(buffers[user_id])
    # sliding window (important): keep the last 2 chunks for context
    buffers[user_id] = buffers[user_id][-2:]
    text = transcribe(audio)
5. 🔁 Send Transcription Back
producer.send("transcription", {
    "user_id": user_id,
    "text": text
})  # assumes the producer is configured with a JSON value serializer
Backend:
ws.sendToUser(user_id, text)
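For `sendToUser` to work, the backend consumer needs a registry mapping each `user_id` to its live socket. A hypothetical sketch (with `FakeSocket` standing in for a real WebSocket connection):

```python
class FakeSocket:
    """Stand-in for a real WebSocket connection object."""
    def __init__(self):
        self.sent = []
    def send(self, text):
        self.sent.append(text)

connections = {}  # user_id -> socket, populated on WS connect

def register(user_id, sock):
    connections[user_id] = sock

def on_transcription(message):
    """Called for each message consumed from the transcription topic."""
    sock = connections.get(message["user_id"])
    if sock:                      # user may have disconnected; drop silently
        sock.send(message["text"])

register("u1", FakeSocket())
on_transcription({"user_id": "u1", "text": "hello world"})
```

In production this map lives in the gateway process (or Redis, for multiple gateways), and entries are removed on disconnect.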
🧠 Buffering Strategy (Most Important 🔥)
❌ Wrong Way
- Process every 200ms chunk → broken words + high cost
✅ Correct Way (Sliding Window)
[chunk1 + chunk2 + chunk3] → process
[chunk2 + chunk3 + chunk4] → process
✔ Keeps context ✔ Avoids word cutting
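The diagram above can be packed into a small helper. Window and overlap sizes here are illustrative (3 chunks per window, 2 kept for context, matching the diagram):

```python
class SlidingWindow:
    """Accumulate audio chunks; emit overlapping batches for transcription.

    window=3, overlap=2 reproduces the diagram:
    [c1+c2+c3] -> process, then [c2+c3+c4] -> process, ...
    """
    def __init__(self, window=3, overlap=2):
        self.window = window
        self.overlap = overlap
        self.chunks = []

    def add(self, chunk):
        """Return a batch to transcribe, or None if still filling."""
        self.chunks.append(chunk)
        if len(self.chunks) >= self.window:
            batch = b"".join(self.chunks)
            self.chunks = self.chunks[-self.overlap:]  # keep context
            return batch
        return None
```

Each user gets their own `SlidingWindow` instance; tune `window`/`overlap` against your chunk duration and model latency.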
⚡ Whisper vs Sarvam (STT Choices)
🧠 Whisper
✔ High accuracy ✔ Local / GPU ❌ Not truly streaming
⚡ Sarvam
✔ Real-time streaming ✔ Better for Indian languages 🇮🇳 ✔ Lower latency
💡 Best Approach
- Use Sarvam → real-time partial text
- Use Whisper → final correction
🔥 Kafka vs RabbitMQ (Where It Fits)
🧠 Core Difference
| Feature | Kafka 🔥 | RabbitMQ 🐰 |
|---|---|---|
| Model | Streaming | Task Queue |
| Throughput | Very high | Medium |
| Replay | Yes | No (classic queues) |
| Best for | Real-time streams | Jobs/tasks |
📍 In This Pipeline
Kafka Flow
Backend → Kafka → STT → Kafka → Backend
✔ Handles continuous audio streams ✔ Scales to millions ✔ Maintains ordering
RabbitMQ Flow
Backend → RabbitMQ → STT → Backend
✔ Simpler setup ✔ Good for MVP
🏆 Verdict
👉 For real-time voice systems → Kafka wins 🔥
🔁 Backend ↔ AI Communication
🟢 Option 1: Queue-Based (Recommended)
Backend → Kafka → AI
AI → Kafka → Backend
✔ scalable ✔ fault-tolerant
⚡ Option 2: Direct Call
Backend → AI → response
✔ low latency ❌ not scalable
🔥 Option 3: Streaming (Advanced)
Backend ↔ AI via WebSocket/gRPC
✔ best latency ✔ complex
🚨 Challenges & Solutions
⚠️ Audio Format Mismatch
- webm → wav (16kHz)
✔ Use ffmpeg
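The conversion itself is one ffmpeg invocation (`-ar 16000` resamples, `-ac 1` downmixes to mono). A small helper that builds the argv, which a worker would hand to `subprocess.run`:

```python
def ffmpeg_to_wav16k(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that converts browser webm audio
    to the 16 kHz mono WAV most STT models expect."""
    return [
        "ffmpeg", "-i", src,
        "-ar", "16000",   # resample to 16 kHz
        "-ac", "1",       # downmix to mono
        "-f", "wav",
        dst,
    ]

cmd = ffmpeg_to_wav16k("chunk.webm", "chunk.wav")
```

In the worker: `subprocess.run(cmd, check=True)`. For lower latency you can instead pipe stdin/stdout with `-i pipe:0 ... pipe:1` and skip temp files.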
⚠️ Latency
✔ streaming responses ✔ smaller chunks
⚠️ Ordering Issues
✔ Kafka partition by user_id
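Partitioning by key works because Kafka routes a keyed message to `hash(key) % num_partitions`, so all chunks from one user land on the same partition and arrive in order. A toy model of that routing (Kafka's real partitioner uses murmur2; a byte sum is used here only because it is deterministic):

```python
NUM_PARTITIONS = 6

def partition_for(user_id: str) -> int:
    """Deterministic stand-in for Kafka's default keyed partitioner."""
    return sum(user_id.encode()) % NUM_PARTITIONS

# Every message keyed by the same user hits the same partition:
assignments = [partition_for("user-42") for _ in range(3)]
```

The guarantee is per-partition only, which is exactly what you need: each user's stream is ordered, while different users process in parallel.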
⚠️ Backpressure
✔ Kafka buffering ✔ scale consumers
⚠️ Silence Handling
✔ use VAD (voice activity detection)
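A minimal stand-in for a real VAD is an energy gate: skip chunks whose RMS amplitude falls below a threshold. The threshold here is illustrative; production systems use a proper VAD such as webrtcvad or Silero:

```python
import array
import math

def is_speech(pcm_bytes: bytes, threshold: float = 500.0) -> bool:
    """Crude energy-based VAD: True if the 16-bit PCM chunk's RMS
    amplitude is above `threshold` (i.e., probably not silence)."""
    samples = array.array("h", pcm_bytes)  # 16-bit signed little-endian PCM
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

silence = array.array("h", [0] * 160).tobytes()
tone = array.array("h", [4000, -4000] * 80).tobytes()
```

Dropping silent chunks before they reach the broker cuts both STT cost and latency, since the model never sees empty audio.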
🏗️ Production Architecture
Frontend (Mic)
↓
WebSocket Gateway
↓
Kafka (partition by user_id)
↓
STT Workers (Whisper / Sarvam)
↓
Kafka (transcription topic)
↓
Backend Consumer
↓
WebSocket → Frontend
💡 Pro Tips
- Use binary audio (not base64)
- Send partial + final transcripts
- Add sequence numbers
- Use Redis for session state
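Two of these tips (partial + final transcripts, sequence numbers) combine into one message envelope. A hypothetical sketch of the JSON the backend would push over the WebSocket:

```python
import json
from itertools import count

_seq = count(1)  # per-session monotonic counter (illustrative)

def make_transcript_msg(user_id: str, text: str, final: bool) -> str:
    """Envelope for transcripts sent to the frontend: the seq number lets
    the client drop out-of-order updates; `final` tells it whether to
    replace the current interim text or append a finished sentence."""
    return json.dumps({
        "user_id": user_id,
        "seq": next(_seq),
        "final": final,
        "text": text,
    })

m1 = json.loads(make_transcript_msg("u1", "hel", final=False))
m2 = json.loads(make_transcript_msg("u1", "hello", final=True))
```

Audio going up stays binary; only the text coming down is JSON, so the envelope adds negligible overhead.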
🔗 Internal Linking Ideas
- How WebSockets Work in Real-Time Systems
- Kafka for Streaming Applications
- Building a RAG Pipeline
- Scaling LLM Systems
❓ FAQ
1. Can I skip Kafka?
Yes. For an MVP the backend can call the STT service directly; at scale you need a broker for buffering, ordering, and fan-out across workers.
2. Why not HTTP instead of WebSocket?
Request/response HTTP adds connection and header overhead per chunk; a WebSocket keeps one full-duplex connection open for continuous streaming in both directions.
3. Which STT is best?
- Sarvam → real-time
- Whisper → accuracy
4. How to reduce latency?
- streaming STT
- smaller buffers
- GPU inference
🏁 Final Thoughts
A real-time STT pipeline is not just AI — it’s a streaming system:
WebSocket → Queue → AI → WebSocket
If you build this right, you can power:
- Voice assistants 🗣️
- AI doctors 🏥
- Call automation 📞
💥 You’re basically building voice infrastructure like ChatGPT Voice