This pattern gives you the lowest possible “time to first audio”. Instead of waiting for the agent to finish, every token is forwarded to a streaming TTS callback that buffers up sentences and dispatches them to the speech engine the moment they’re complete.
Step 1: Install dependencies
Step 2: Build the agent
Step 3: Create the streaming TTS callback
StreamingTTSCallback is a callable that satisfies Swarms’s streaming_callback contract. With stream_mode=True, the audio for each sentence is played the moment it’s synthesised.
Voice options include alloy, echo, fable, onyx, nova, and shimmer.
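To make the sentence buffering concrete, here is a minimal sketch of how complete sentences might be separated from an unfinished remainder as tokens arrive. The helper `split_complete_sentences` and its regex are hypothetical illustrations, not the internals of StreamingTTSCallback.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not part of Swarms or any TTS library.
# A sentence ends at one or more terminators followed by whitespace
# or by the end of the buffer.
_SENTENCE_END = re.compile(r"[.?!…]+(?:\s+|$)")


def split_complete_sentences(buffer: str) -> Tuple[List[str], str]:
    """Return (complete sentences, unfinished remainder) for a text buffer."""
    sentences: List[str] = []
    start = 0
    for match in _SENTENCE_END.finditer(buffer):
        sentence = buffer[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    return sentences, buffer[start:]


# Feed tokens one at a time, as a streaming callback would receive them.
buffer = ""
spoken: List[str] = []  # stand-in for sentences handed to the speech engine
for token in ["Hello", " there", ".", " How", " are", " you", "?", " I", " am"]:
    buffer += token
    sentences, buffer = split_complete_sentences(buffer)
    spoken.extend(sentences)
```

In this sketch a terminator at the very end of the buffer counts as a sentence end, which keeps latency low but means a token boundary inside a number such as 3.14 could trigger an early split.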
Step 4: Run the agent with the callback
Pass the callback as streaming_callback. Tokens flow into the agent’s response and into the TTS engine in parallel.
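The fan-out can be pictured with a small stand-alone sketch. Everything here is an assumption made for illustration: `run_with_streaming` stands in for an agent run, and the queue plus worker thread stand in for a speech engine that consumes tokens without blocking the token stream. It does not show how Swarms implements this internally.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_with_streaming(
    tokens: Iterable[str], streaming_callback: Callable[[str], None]
) -> str:
    """Hypothetical stand-in for an agent run that streams tokens."""
    parts: List[str] = []
    for token in tokens:
        parts.append(token)        # the token becomes part of the response
        streaming_callback(token)  # and is forwarded to the callback right away
    return "".join(parts)


speech_queue: queue.Queue = queue.Queue()
played: List[str] = []  # stand-in for audio output


def speech_worker() -> None:
    # Consumes tokens on its own thread so the producer never waits on audio.
    while True:
        item = speech_queue.get()
        if item is None:  # sentinel: no more tokens
            break
        played.append(item)


worker = threading.Thread(target=speech_worker)
worker.start()
response = run_with_streaming(["Hi", " there", "."], speech_queue.put)
speech_queue.put(None)
worker.join()
```

The callback is just a callable that accepts one string, so `speech_queue.put` satisfies the same shape as a dedicated callback object would.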
Step 5: Flush the buffer
StreamingTTSCallback buffers the last sentence until it sees a terminator (., ?, !, …). Always call flush() at the end so the final sentence is spoken.
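A try/finally block is one way to make sure the flush always happens, even if the run raises. The sketch below uses a hypothetical `SentenceBuffer` class as a stand-in for the real callback, with a plain list in place of the speech engine; only the `flush()` call itself comes from the text above.

```python
from typing import Callable, List

TERMINATORS = (".", "?", "!", "…")


class SentenceBuffer:
    """Hypothetical stand-in for a sentence-buffering TTS callback."""

    def __init__(self, speak: Callable[[str], None]) -> None:
        self._speak = speak  # stand-in for the speech engine
        self._buffer = ""

    def __call__(self, token: str) -> None:
        self._buffer += token
        # Dispatch as soon as the buffered text ends with a terminator.
        if self._buffer.rstrip().endswith(TERMINATORS):
            self.flush()

    def flush(self) -> None:
        # Speak whatever is left, then reset; safe to call more than once.
        text = self._buffer.strip()
        self._buffer = ""
        if text:
            self._speak(text)


spoken: List[str] = []
callback = SentenceBuffer(speak=spoken.append)

try:
    # The final sentence has no terminator, so it stays in the buffer.
    for token in ["Done", ".", " One", " more", " thing"]:
        callback(token)
finally:
    # Without this call, "One more thing" would never be spoken.
    callback.flush()
```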
Full example
When to use this pattern
- You want time to first audio as low as possible.
- The agent’s output is long enough that waiting for completion would be awkward.
- You’re fine with sentence-level granularity (the callback buffers per sentence, not per token).
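A back-of-the-envelope estimate shows why output length matters for the first two points. The numbers below are illustrative assumptions, not measurements, and the model assumes a constant token rate and a fixed synthesis latency.

```python
def time_to_first_audio(
    tokens_before_audio: int, tokens_per_second: float, tts_latency_s: float
) -> float:
    """Seconds until audio starts: generation time for the tokens needed, plus synthesis."""
    return tokens_before_audio / tokens_per_second + tts_latency_s


# Illustrative values only.
TOKENS_PER_SECOND = 50.0
TTS_LATENCY_S = 0.5
FIRST_SENTENCE_TOKENS = 20
FULL_RESPONSE_TOKENS = 400

# Streaming: audio can start once the first sentence is complete.
streaming = time_to_first_audio(FIRST_SENTENCE_TOKENS, TOKENS_PER_SECOND, TTS_LATENCY_S)
# Blocking: audio starts only after the whole response is generated.
blocking = time_to_first_audio(FULL_RESPONSE_TOKENS, TOKENS_PER_SECOND, TTS_LATENCY_S)

print(f"streaming: {streaming:.1f}s, blocking: {blocking:.1f}s")
```

Under these assumptions the streaming delay depends only on the length of the first sentence, while the blocking delay grows with the full response.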
See also
- Autonomous Voice Agent: same pattern but with max_loops="auto" and tools.
- Hierarchical Speech Swarm: distinct voice per agent in a swarm.