Architecture Overview

A Twilio AI voice agent consists of four layers working together in under 400 ms per turn: telephony (Twilio), speech-to-text (Deepgram), the language model (GPT-4o), and text-to-speech.

Your Python server sits in the middle of these four services, orchestrating the real-time audio pipeline over WebSockets.
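To make the orchestration role concrete, here is a minimal sketch of one conversational turn with per-stage timing. The function name run_turn, the stage callables, and the stub stages are all hypothetical; real stages would wrap Deepgram, GPT-4o, and a TTS engine, and the 400 ms figure is used only as a budget to compare against.

```python
import asyncio
import time

async def run_turn(caller_audio, stt, llm, tts, budget_ms=400.0):
    """Run one turn (audio -> text -> reply -> audio) and time each stage in ms."""
    timings = {}
    value = caller_audio
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        started = time.perf_counter()
        value = await stage(value)  # each stage feeds the next one
        timings[name] = (time.perf_counter() - started) * 1000.0
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    timings["within_budget"] = timings["total"] <= budget_ms
    return value, timings

# Stub stages standing in for the real services (illustrative only).
async def fake_stt(audio: bytes) -> str:
    return "what are your opening hours"

async def fake_llm(text: str) -> str:
    return f"You asked: {text}"

async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

if __name__ == "__main__":
    audio_out, timings = asyncio.run(
        run_turn(b"\xff" * 160, fake_stt, fake_llm, fake_tts)
    )
    print(audio_out, timings)
```

Logging these timings per call is one way to see which layer dominates the turn latency before tuning anything.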

1 Prerequisites

2 Install Dependencies

pip install fastapi uvicorn websockets openai deepgram-sdk twilio python-dotenv
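Since python-dotenv is in the install list, the API keys presumably live in environment variables. The sketch below validates them at startup using only the standard library; the helper name load_settings is hypothetical, and the variable names are taken from the DEEPGRAM_API_KEY and OPENAI_API_KEY identifiers used in the later snippets.

```python
import os
from typing import Mapping

# Key names follow the identifiers used in the code later in this guide.
REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPGRAM_API_KEY")

def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required API keys, failing fast if any is missing or empty."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}

# With python-dotenv installed, calling dotenv.load_dotenv() before
# load_settings() would populate os.environ from a local .env file.
```

Failing at startup is usually preferable to discovering a missing key in the middle of a live call.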

3 Configure Twilio Webhook

When a call arrives, Twilio fetches a TwiML response from your server. This TwiML tells Twilio to start a Media Stream WebSocket to your server:

    from fastapi import FastAPI, Request
    from fastapi.responses import Response

    app = FastAPI()

    @app.post("/voice")
    async def voice_webhook(request: Request):
        # TwiML that tells Twilio to open a Media Stream WebSocket to this server
        twiml = '''<?xml version="1.0" encoding="UTF-8"?>
    <Response>
        <Connect>
            <Stream url="wss://yourdomain.com/ws/audio" />
        </Connect>
    </Response>'''
        return Response(content=twiml, media_type="application/xml")
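If the WebSocket URL differs between environments, the TwiML can be generated rather than hardcoded. The following is a sketch using only the standard library; build_stream_twiml is a hypothetical helper, and the wss:// check reflects the assumption that Twilio expects a secure WebSocket URL for Media Streams.

```python
import xml.etree.ElementTree as ET

def build_stream_twiml(ws_url: str) -> str:
    """Build a <Response><Connect><Stream/></Connect></Response> document."""
    if not ws_url.startswith("wss://"):
        raise ValueError("expected a secure wss:// URL for the media stream")
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # ElementTree escapes attribute values, so the URL cannot break the XML.
    ET.SubElement(connect, "Stream", {"url": ws_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body

if __name__ == "__main__":
    print(build_stream_twiml("wss://yourdomain.com/ws/audio"))
```

The returned string can be passed to Response(content=..., media_type="application/xml") exactly as in the handler above.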

4 Handle the Audio WebSocket

Twilio sends base64-encoded mulaw audio chunks over the WebSocket. Your handler decodes the audio and streams it to Deepgram:

    import asyncio, base64, json, os
    from fastapi import WebSocket
    from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

    DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

    @app.websocket("/ws/audio")
    async def audio_ws(websocket: WebSocket):
        await websocket.accept()
        dg_client = DeepgramClient(DEEPGRAM_API_KEY)

        # Open Deepgram live transcription. The async client is used so that the
        # async handler below is actually awaited (assumes deepgram-sdk v3; newer
        # releases expose the same client as listen.asyncwebsocket).
        dg_conn = dg_client.listen.asynclive.v("1")

        async def on_transcript(self, result, **kwargs):
            sentence = result.channel.alternatives[0].transcript
            if len(sentence) == 0:
                return
            # Call GPT-4o with the transcript
            reply = await get_gpt_reply(sentence)
            # TTS the reply and send back to Twilio
            # (stream_audio_reply is not shown in this section)
            await stream_audio_reply(websocket, reply)

        dg_conn.on(LiveTranscriptionEvents.Transcript, on_transcript)

        # Match Twilio's audio format: 8 kHz, single-channel mulaw
        opts = LiveOptions(
            model="nova-2",
            language="en-US",
            encoding="mulaw",
            sample_rate=8000,
            channels=1,
        )
        await dg_conn.start(opts)

        try:
            async for message in websocket.iter_text():
                data = json.loads(message)
                if data["event"] == "media":
                    chunk = base64.b64decode(data["media"]["payload"])
                    await dg_conn.send(chunk)
                elif data["event"] == "stop":
                    break
        finally:
            # Close the Deepgram connection when the call ends
            await dg_conn.finish()
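The handler above only reads audio; sending synthesized speech back uses the same JSON envelope in the other direction. Here is a minimal sketch of both directions, assuming the Media Streams message shape in which outbound "media" frames carry the streamSid that Twilio reports in its "start" event. The helper names decode_media and encode_media are hypothetical.

```python
import base64
import json
from typing import Optional

def decode_media(message: str) -> Optional[bytes]:
    """Return raw mulaw bytes for a "media" event, or None for any other event."""
    data = json.loads(message)
    if data.get("event") != "media":
        return None
    return base64.b64decode(data["media"]["payload"])

def encode_media(stream_sid: str, mulaw_audio: bytes) -> str:
    """Build the JSON text frame that plays mulaw audio back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })

if __name__ == "__main__":
    frame = encode_media("MZ-example-sid", b"\xff\x7f\x00")
    print(frame)
    print(decode_media(frame))
```

Inside the WebSocket handler, a helper such as stream_audio_reply could send each frame with await websocket.send_text(encode_media(stream_sid, chunk)), where stream_sid was captured from the "start" event.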

5 OpenAI GPT-4o Conversation

    import os
    from openai import AsyncOpenAI

    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
    openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

    # Note: a module-level list is shared by every call this process handles;
    # keep one history per call in production.
    conversation_history = []

    async def get_gpt_reply(user_text: str) -> str:
        conversation_history.append({"role": "user", "content": user_text})
        response = await openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful business assistant. Be concise; this is a phone call."},
                *conversation_history
            ],
            max_tokens=150,   # cap reply length to keep spoken answers short
            temperature=0.7
        )
        reply = response.choices[0].message.content
        conversation_history.append({"role": "assistant", "content": reply})
        return reply

Production tip: Keep GPT-4o responses under 150 tokens. Callers hear silence while the reply is generated and synthesized, so shorter replies feel more natural and reduce perceived latency. Set max_tokens=150 and instruct the model to be brief in the system prompt.
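A complementary way to shorten the silence, not part of the code above, is to hand text to TTS sentence by sentence instead of waiting for the full reply. The sketch below assumes the reply arrives as a stream of text fragments (for example, from a chat completion requested with stream=True); iter_sentences is a hypothetical helper and its sentence detection is deliberately naive, so abbreviations such as "a.m." will be split early.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")

def iter_sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text fragments."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        while True:
            # Look for the first ending character that is followed by whitespace.
            cut = -1
            for i, ch in enumerate(buffer[:-1]):
                if ch in SENTENCE_ENDINGS and buffer[i + 1].isspace():
                    cut = i + 1
                    break
            if cut == -1:
                break
            sentence, buffer = buffer[:cut].strip(), buffer[cut:].lstrip()
            if sentence:
                yield sentence
    # Whatever is left when the stream ends is the final sentence.
    if buffer.strip():
        yield buffer.strip()

if __name__ == "__main__":
    fragments = ["Sure", ". We open", " at nine. Any", "thing else?"]
    for sentence in iter_sentences(fragments):
        print(sentence)
```

Each yielded sentence could be passed to the TTS step immediately, so the caller starts hearing the first sentence while the rest is still being generated.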

Challenges You Will Hit in Production

Skip the Build — Use RITZ BYOK Platform

Building this yourself takes 2–4 weeks and ongoing maintenance. RITZ gives you the full production-grade Twilio + GPT-4o + Deepgram pipeline as a configured platform — you bring your own API keys and get a working AI voice agent in minutes.

Deploy a Production AI Voice Agent Today

Use your own Twilio, OpenAI, and Deepgram keys. RITZ handles the orchestration. Start with 10 free calls — no credit card required.
