Choosing a method:

- `convert()` — most use cases
- `stream()` — memory efficiency when handling large files
- `stream_websocket()` — real-time streaming of dynamically generated text (LLM responses, conversational AI)
Generate speech from text with a single function call:
```python
from fishaudio import FishAudio
from fishaudio.utils import save, play

client = FishAudio()

# Generate speech (returns bytes)
audio = client.tts.convert(text="Hello, welcome to Fish Audio!")

# Play or save the audio
play(audio)
save(audio, "output.mp3")
```
Specify a voice model for consistent voice characteristics:
```python
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Use a specific voice
audio = client.tts.convert(
    text="This uses a specific voice model",
    reference_id="bf322df2096a46f18c579d0baa36f41d"  # Adrian
)
play(audio)
```
Get voice model IDs from the Fish Audio website or programmatically:
```python
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# List available voices
voices = client.voices.list(language="en", tags="male")
for voice in voices.items:
    print(f"{voice.title}: {voice.id}")

# Use a voice from the list
audio = client.tts.convert(
    text="Generated with discovered voice",
    reference_id=voices.items[0].id
)
play(audio)
```
Add emotional expressions to make speech more natural:
```python
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

text = """
(happy) I'm excited to announce this!
(sad) Unfortunately, it didn't work out.
(angry) This is so frustrating!
(calm) Let me explain the details.
"""

audio = client.tts.convert(
    text=text,
    reference_id="933563129e564b19a115bedd57b7406a"  # Sarah
)
play(audio)
```
Create a configuration once and reuse it across multiple generations:
```python
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody

client = FishAudio()

# Define config once
my_config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    format="wav",
    latency="balanced"
)

# Reuse across multiple generations
audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config)
audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)
```
Use stream() for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments):
```python
from fishaudio import FishAudio

client = FishAudio()

# Collect all chunks efficiently
audio_stream = client.tts.stream(text="Long text here")
audio = audio_stream.collect()  # Returns complete audio as bytes
```
For streaming to files or network without buffering in memory:
```python
from fishaudio import FishAudio

client = FishAudio()

# Stream directly to file (memory efficient for large audio)
audio_stream = client.tts.stream(text="Very long text...")
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)  # Write each chunk as it arrives
```
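The same write-as-it-arrives loop works for any byte sink, not just files. A minimal sketch of the pattern in isolation — the `fake_chunks` list and `drain_stream` helper are stand-ins for a real `client.tts.stream(...)` call, so this runs without the API:

```python
import io

def drain_stream(chunks, sink):
    """Write each chunk to sink as it arrives; return total bytes written.

    Peak memory stays at one chunk, regardless of total audio size.
    """
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    return total

# Stand-in for a real audio stream: three network-sized chunks
fake_chunks = [b"\x00" * 4096, b"\x00" * 4096, b"\x00" * 1024]
buf = io.BytesIO()
print(drain_stream(fake_chunks, buf))  # 9216
```

Because the sink only needs a `write` method, the same helper streams to a socket, an HTTP response body, or a pipe without buffering the full audio in memory.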
Use stream() when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use stream_websocket() instead.
For real-time applications where text is generated dynamically, use stream_websocket(). This is perfect for LLM integrations, conversational AI, and live captions:
The FlushEvent forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn’t reached the optimal chunk size.
```python
from fishaudio import FishAudio
from fishaudio.types import FlushEvent

client = FishAudio()

# Use FlushEvent to force immediate generation
def text_with_flush():
    yield "This is the first sentence. "
    yield "This is the second sentence. "
    yield FlushEvent()  # Force audio generation NOW
    yield "This starts a new segment. "
    yield "And continues here."
    yield FlushEvent()  # Force final generation

audio_stream = client.tts.stream_websocket(text_with_flush())

# Process each audio chunk as it arrives
for chunk in audio_stream:
    print(f"Received audio chunk: {len(chunk)} bytes")
```
Without FlushEvent, the engine automatically generates audio when the buffer reaches an optimal size. Use FlushEvent to control exactly when audio should be generated, which can reduce perceived latency in interactive applications.
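A common interactive pattern is to flush at sentence boundaries, so each sentence starts generating as soon as it is complete. A minimal sketch of such a wrapper — the helper name and the punctuation heuristic are illustrative, not part of the SDK; a string sentinel stands in for `FlushEvent` so the logic runs without the API:

```python
import re

def flush_at_sentences(tokens, make_flush):
    """Yield text tokens unchanged, inserting a flush marker after any
    token that ends a sentence (., !, or ?, optionally followed by a
    closing quote or bracket and trailing whitespace)."""
    sentence_end = re.compile(r'[.!?]["\')\]]?\s*$')
    for token in tokens:
        yield token
        if sentence_end.search(token):
            yield make_flush()

# With the real SDK this would be:
#   client.tts.stream_websocket(flush_at_sentences(llm_stream(), FlushEvent))
# Here a string sentinel stands in for FlushEvent:
out = list(flush_at_sentences(["Hello there. ", "How ", "are you? ", "Bye"],
                              lambda: "<flush>"))
print(out)  # ['Hello there. ', '<flush>', 'How ', 'are you? ', '<flush>', 'Bye']
```

Passing the `FlushEvent` class itself as `make_flush` yields a fresh event per sentence, keeping the wrapper a plain generator that `stream_websocket()` can consume directly.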
WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio:
```python
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Simulate a streaming LLM response
def llm_stream():
    """Simulates text chunks from an LLM"""
    tokens = [
        "The ", "weather ", "today ", "is ", "sunny ",
        "with ", "clear ", "skies. ", "Perfect ", "for ",
        "outdoor ", "activities!"
    ]
    for token in tokens:
        yield token

# Stream to speech in real-time
audio_stream = client.tts.stream_websocket(llm_stream())
play(audio_stream)
```
The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don’t need to manually batch tokens unless you want to force generation at specific points using FlushEvent.