How to Build an AI Voice Agent with Twilio and OpenAI GPT-4o (2025 Guide)

Architecture Overview

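A Twilio AI voice agent consists of four layers working together on every conversational turn. The target is under 400ms per turn, although the full STT → LLM → TTS pipeline often lands in the 400–800ms range discussed under the production challenges below: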

Twilio: Handles the phone call, streams audio via WebSocket (Media Streams)
Deepgram: Converts caller speech to text in real-time (Nova-2 model)
OpenAI GPT-4o: Generates a conversational reply from the transcript
TTS Engine: Converts the reply to audio and streams it back to the caller

Your Python server sits in the middle of these four services, orchestrating the real-time audio pipeline over WebSockets.
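To make the per-turn flow concrete, here is a minimal sketch of one turn through the pipeline with per-stage timing. The function name run_turn and the three stage callables are illustrative assumptions, not part of any SDK; in a real agent each stage would wrap Deepgram, GPT-4o, and your TTS engine, and would stream rather than run strictly in sequence.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],
    generate_reply: Callable[[str], Awaitable[str]],
    synthesize: Callable[[str], Awaitable[bytes]],
) -> tuple[bytes, dict[str, float]]:
    """Run STT -> LLM -> TTS once and report milliseconds spent per stage."""
    timings: dict[str, float] = {}

    async def timed(name: str, coro):
        # Measure how long a single stage takes to complete.
        start = time.perf_counter()
        result = await coro
        timings[name] = (time.perf_counter() - start) * 1000
        return result

    text = await timed("stt", transcribe(audio))
    reply = await timed("llm", generate_reply(text))
    speech = await timed("tts", synthesize(reply))
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return speech, timings


# Hypothetical stand-ins for the real services, useful for local testing.
async def fake_stt(audio: bytes) -> str:
    return "what are your opening hours"


async def fake_llm(text: str) -> str:
    return "We are open nine to five."


async def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")
```

Calling asyncio.run(run_turn(b"...", fake_stt, fake_llm, fake_tts)) returns the synthesized bytes plus a timing dictionary, which shows which stage dominates the latency budget once the stubs are replaced with real calls.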

1 Prerequisites

Twilio account with a phone number ($1/month) — twilio.com
OpenAI API key — platform.openai.com
Deepgram API key — deepgram.com
Python 3.10+ with FastAPI (the examples below use FastAPI; Flask needs a WebSocket extension such as Flask-Sock)
Public HTTPS URL (ngrok for dev, your server for prod)

2 Install Dependencies

pip install fastapi uvicorn websockets openai deepgram-sdk twilio python-dotenv
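The later snippets refer to OPENAI_API_KEY and DEEPGRAM_API_KEY. One possible way to load and validate them at startup is sketched below; the helper name load_settings and the choice of environment variable names are assumptions for illustration. If you keep the keys in a .env file, calling load_dotenv() from python-dotenv beforehand copies them into os.environ.

```python
import os
from typing import Mapping

# Assumed variable names; adjust to match your own .env file.
REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPGRAM_API_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required API keys, failing fast if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Failing at startup is usually preferable to discovering a missing key on the first live call.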

3 Configure Twilio Webhook

When a call arrives, Twilio fetches a TwiML response from your server. This TwiML tells Twilio to start a Media Stream WebSocket to your server:

from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/voice")
async def voice_webhook(request: Request):
    twiml = '''<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://yourdomain.com/ws/audio" />
  </Connect>
</Response>'''
    return Response(content=twiml, media_type="application/xml")
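The webhook above hardcodes yourdomain.com. A small sketch of building the same TwiML from a host name, so the code works unchanged behind ngrok and in production, might look like this. The helper name build_stream_twiml is hypothetical, and it assumes the host you pass in is the public hostname Twilio can reach.

```python
from xml.sax.saxutils import quoteattr


def build_stream_twiml(host: str, path: str = "/ws/audio") -> str:
    """Return TwiML that connects the call to a Media Stream at wss://host/path."""
    ws_url = f"wss://{host}{path}"
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "  <Connect>\n"
        # quoteattr escapes the value and adds the surrounding quotes.
        f"    <Stream url={quoteattr(ws_url)} />\n"
        "  </Connect>\n"
        "</Response>"
    )
```

Inside the FastAPI handler, request.headers["host"] is one plausible source for the host argument, provided your proxy forwards the original Host header.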

4 Handle the Audio WebSocket

Twilio sends base64-encoded mulaw audio chunks over the WebSocket. Your handler decodes the audio and streams it to Deepgram:

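import base64, json, os
from dotenv import load_dotenv
from fastapi import WebSocket
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

load_dotenv()  # read keys from a local .env file, if present
DEEPGRAM_API_KEY = os.getenv("DEEPGRAM_API_KEY")

@app.websocket("/ws/audio")
async def audio_ws(websocket: WebSocket):
    await websocket.accept()
    dg_client = DeepgramClient(DEEPGRAM_API_KEY)

    # Open Deepgram live transcription. The handler below is a coroutine, so
    # use the async client (assumes deepgram-sdk 3.x; later releases also
    # expose it as listen.asyncwebsocket).
    dg_conn = dg_client.listen.asynclive.v("1")

    async def on_transcript(self, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if len(sentence) == 0:
            return
        # Call GPT-4o with the transcript
        reply = await get_gpt_reply(sentence)
        # TTS the reply and send back to Twilio.
        # stream_audio_reply is not defined in this guide; it depends on your TTS engine.
        await stream_audio_reply(websocket, reply)

    dg_conn.on(LiveTranscriptionEvents.Transcript, on_transcript)
    # Twilio Media Streams deliver 8 kHz mono mulaw, so tell Deepgram to expect it
    opts = LiveOptions(model="nova-2", language="en-US", encoding="mulaw",
                       sample_rate=8000, channels=1)
    await dg_conn.start(opts)

    try:
        async for message in websocket.iter_text():
            data = json.loads(message)
            if data["event"] == "media":
                chunk = base64.b64decode(data["media"]["payload"])
                await dg_conn.send(chunk)
            elif data["event"] == "stop":
                break  # Twilio signals the end of the stream
    finally:
        await dg_conn.finish()  # always close the Deepgram connection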
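Besides media, Twilio's Media Streams protocol sends a start event carrying the streamSid (which you need later to send audio back) and a stop event when the call ends. The following is a minimal sketch of parsing those messages outside the WebSocket handler, which makes the logic easy to unit test. StreamState and handle_twilio_message are hypothetical names introduced here for illustration.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamState:
    """Per-connection state collected from Twilio's control messages."""
    stream_sid: Optional[str] = None
    call_sid: Optional[str] = None
    closed: bool = False


def handle_twilio_message(raw: str, state: StreamState) -> Optional[bytes]:
    """Update state from one Twilio message; return decoded mulaw audio if any."""
    data = json.loads(raw)
    event = data.get("event")
    if event == "start":
        # The start payload identifies the stream and the call it belongs to.
        state.stream_sid = data["start"]["streamSid"]
        state.call_sid = data["start"].get("callSid")
    elif event == "media":
        # Audio arrives as base64-encoded 8 kHz mulaw.
        return base64.b64decode(data["media"]["payload"])
    elif event == "stop":
        state.closed = True
    # "connected", "mark", and unknown events carry no audio.
    return None
```

In the handler, you would create one StreamState per connection, forward any returned bytes to Deepgram, and leave the loop once state.closed is set.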

5 OpenAI GPT-4o Conversation

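import os
from openai import AsyncOpenAI

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # assumes the key is set in the environment
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
# Module-level history keeps the example short, but it is shared by every call;
# see "Session management" under the production challenges.
conversation_history = []

async def get_gpt_reply(user_text: str) -> str:
    conversation_history.append({"role": "user", "content": user_text})

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful business assistant. Be concise; this is a phone call."},
            *conversation_history
        ],
        max_tokens=150,   # hard cap on reply length
        temperature=0.7
    )

    # content can be None (for example on a refusal), so fall back to an empty string
    reply = response.choices[0].message.content or ""
    conversation_history.append({"role": "assistant", "content": reply})
    return reply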
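The module-level conversation_history above is shared by every caller. One possible way to isolate it is to keep a small session object per call, keyed by Twilio's call SID. The class and function names below (CallSession, get_session, end_session) are hypothetical, and the cap of 20 messages is an arbitrary assumption to keep prompts short.

```python
from typing import Dict, List


class CallSession:
    """Conversation state for a single phone call."""

    def __init__(self, system_prompt: str, max_messages: int = 20) -> None:
        self.system_prompt = system_prompt
        self.max_messages = max_messages
        self.history: List[dict] = []

    def add(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
        # Drop the oldest messages so the prompt does not grow without bound.
        if len(self.history) > self.max_messages:
            self.history = self.history[-self.max_messages:]

    def messages(self) -> List[dict]:
        """Return the message list to pass to the chat completion call."""
        return [{"role": "system", "content": self.system_prompt}, *self.history]


_sessions: Dict[str, CallSession] = {}


def get_session(call_sid: str, system_prompt: str) -> CallSession:
    """Fetch the session for a call, creating it on first use."""
    if call_sid not in _sessions:
        _sessions[call_sid] = CallSession(system_prompt)
    return _sessions[call_sid]


def end_session(call_sid: str) -> None:
    """Forget a call's history once the stream stops."""
    _sessions.pop(call_sid, None)
```

With this in place, the reply function would take a CallSession argument and pass session.messages() as the messages parameter instead of reading a global list.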

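Production tip: Keep GPT-4o responses under 150 tokens. Callers hear silence while the reply is generated and synthesized, so shorter replies feel more natural and reduce perceived latency. Set max_tokens=150 as a hard cap, and also instruct the model to be brief in the system prompt, because max_tokens alone truncates a long answer mid-sentence rather than shortening it.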

Challenges You Will Hit in Production

Latency: The STT → LLM → TTS pipeline adds 400–800ms. Use streaming TTS and start sending audio before GPT-4o finishes.
Interruption handling: Callers talk over the agent. You need to detect new speech mid-response and cancel the current TTS stream.
Session management: Each call needs isolated conversation history. Shared state, such as the module-level conversation_history list in step 5, causes cross-call contamination.
Twilio rate limits and dropped connections: WebSocket reconnections need exponential backoff. A connection that drops mid-call leaves the caller in silence.
Audio encoding: Twilio uses mulaw 8kHz. Deepgram needs this configured. TTS output must be re-encoded to mulaw before sending back.
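Two of the challenges above, audio encoding and interruption handling, come down to what you send back over the Twilio WebSocket. The sketch below converts 16-bit PCM to mulaw with the standard G.711 companding steps and builds the outbound media and clear messages (clear asks Twilio to discard audio it has buffered, which is useful when the caller interrupts). The function names are hypothetical, and the code assumes your TTS output is already 8 kHz, mono, little-endian PCM; resampling is out of scope here.

```python
import base64
import json
import struct

_BIAS = 0x84   # G.711 mulaw bias (132)
_CLIP = 32635  # largest magnitude representable after adding the bias


def pcm16_sample_to_mulaw(sample: int) -> int:
    """Compand one signed 16-bit sample into one mulaw byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit, 7..0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mulaw bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM bytes to mulaw bytes."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(pcm16_sample_to_mulaw(s) for s in samples)


def media_message(stream_sid: str, mulaw_audio: bytes) -> str:
    """Build the JSON message that plays audio to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })


def clear_message(stream_sid: str) -> str:
    """Build the JSON message that drops audio Twilio has not played yet."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On an interruption, a handler would typically cancel its in-flight TTS task and then send clear_message(stream_sid) with websocket.send_text, so the caller stops hearing the stale reply.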

Skip the Build — Use RITZ BYOK Platform

Building this yourself takes 2–4 weeks and ongoing maintenance. RITZ gives you the full production-grade Twilio + GPT-4o + Deepgram pipeline as a configured platform — you bring your own API keys and get a working AI voice agent in minutes.

Dashboard-based agent configuration (no code)
BYOK: connect your Twilio, OpenAI, and Deepgram accounts
Built-in call logs, recordings, analytics
Knowledge base upload for custom FAQ responses
50+ voices and languages

Deploy a Production AI Voice Agent Today

Use your own Twilio, OpenAI, and Deepgram keys. RITZ handles the orchestration. Start with 10 free calls — no credit card required.
