This guide shows you how to generate speech from text using the VibeVoice streaming model.
Prerequisites
Before running inference, ensure you have:
Installed VibeVoice and its dependencies
Downloaded a model or have access to one (e.g., microsoft/VibeVoice-Realtime-0.5B)
Voice prompt files in .pt format (located in demo/voices/streaming_model/)
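A quick, hedged way to sanity-check the first two prerequisites from Python (the vibevoice package name is taken from the Python API example later on this page):

import importlib.util
import torch

# Confirm the package imports and report which accelerators torch can see.
assert importlib.util.find_spec("vibevoice") is not None, "VibeVoice is not installed"
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())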
Basic Usage
Prepare Your Text File
Create a text file with the content you want to convert to speech. For example, demo/text_examples/1p_vibevoice.txt contains:
Hello, this is a test of the VibeVoice text-to-speech system.
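If you prefer to create the file from Python, a minimal sketch; my_script.txt is a hypothetical filename, so pass whatever path you use to --txt_path when you run inference:

from pathlib import Path

# Hypothetical path; any text file works as long as --txt_path points to it
Path("demo/text_examples/my_script.txt").write_text(
    "Hello, this is a test of the VibeVoice text-to-speech system.\n",
    encoding="utf-8",
)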
Run Inference
Use the realtime_model_inference_from_file.py script:
python demo/realtime_model_inference_from_file.py \
    --model_path microsoft/VibeVoice-Realtime-0.5B \
    --txt_path demo/text_examples/1p_vibevoice.txt \
    --speaker_name Wayne \
    --output_dir ./outputs
Check Output
The generated audio will be saved to the output directory:
ls outputs/
# 1p_vibevoice_generated.wav
Command-Line Arguments
model_path
string
default: "microsoft/VibeVoice-Realtime-0.5B"
Path to the HuggingFace model directory or model ID
txt_path
string
default: "demo/text_examples/1p_vibevoice.txt"
Path to the text file containing the script to synthesize
speaker_name
string
Name of the speaker voice to use. Must match a voice file in demo/voices/streaming_model/
output_dir
string
default: "./outputs"
Directory where the generated audio files will be saved
device
string
Device for inference. Options: cuda, mps, or cpu. Defaults to CUDA if available, otherwise MPS or CPU
cfg_scale
float
default: 1.5
CFG (Classifier-Free Guidance) scale for generation. Higher values increase adherence to the input prompt
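Putting these together, a full invocation that sets every documented argument explicitly:

python demo/realtime_model_inference_from_file.py \
    --model_path microsoft/VibeVoice-Realtime-0.5B \
    --txt_path demo/text_examples/1p_vibevoice.txt \
    --speaker_name Wayne \
    --output_dir ./outputs \
    --device cuda \
    --cfg_scale 1.5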
Device-Specific Configuration
CUDA (NVIDIA GPUs)
python demo/realtime_model_inference_from_file.py \
    --device cuda \
    --txt_path demo/text_examples/1p_vibevoice.txt
CUDA devices use the bfloat16 dtype and flash_attention_2 for optimal performance.
MPS (Apple Silicon)
python demo/realtime_model_inference_from_file.py \
    --device mps \
    --txt_path demo/text_examples/1p_vibevoice.txt
MPS requires the float32 dtype and uses the SDPA attention implementation, as flash_attention_2 is not supported.
CPU
python demo/realtime_model_inference_from_file.py \
    --device cpu \
    --txt_path demo/text_examples/1p_vibevoice.txt
CPU inference is significantly slower than GPU inference and should only be used for testing.
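The dtype and attention pairings above can be summarized in a small helper. This is a sketch of the mapping described in this section, not code from the inference script itself:

import torch

def pick_runtime(device: str) -> tuple[torch.dtype, str]:
    """Return (dtype, attn_implementation) for a device, per the notes above."""
    if device == "cuda":
        # CUDA: bfloat16 + flash_attention_2 for optimal performance
        return torch.bfloat16, "flash_attention_2"
    # MPS requires float32 and SDPA; the docs don't specify a CPU dtype,
    # so float32 is assumed here as a safe default
    return torch.float32, "sdpa"

dtype, attn_impl = pick_runtime("mps")  # -> (torch.float32, "sdpa")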
Understanding the Output
After generation completes, you’ll see a summary with performance metrics:
==================================================
GENERATION SUMMARY
==================================================
Input file: demo/text_examples/1p_vibevoice.txt
Output file: ./outputs/1p_vibevoice_generated.wav
Speaker names: Wayne
Prefilling text tokens: 42
Generated speech tokens: 1250
Total tokens: 1292
Generation time: 3.45 seconds
Audio duration: 5.20 seconds
RTF (Real Time Factor): 0.66x
==================================================
Key Metrics
Prefilling text tokens: Number of input text tokens processed
Generated speech tokens: Number of speech tokens generated by the model
RTF (Real Time Factor): Generation time divided by audio duration. Values < 1.0 indicate faster than real-time generation
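As a concrete check, the numbers in the sample summary above reproduce the reported RTF:

# Values taken from the sample generation summary above
generation_time = 3.45  # seconds
audio_duration = 5.20   # seconds
rtf = generation_time / audio_duration
print(f"RTF: {rtf:.2f}x")  # prints "RTF: 0.66x" -- faster than real time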
Advanced Configuration
Adjusting CFG Scale
The CFG scale controls how closely the model follows the input prompt. Lower values (around 1.0) are conservative, the default of 1.5 is balanced, and higher values make generation adhere to the prompt more aggressively. For example, a conservative run:
python demo/realtime_model_inference_from_file.py \
    --cfg_scale 1.0 \
    --txt_path demo/text_examples/1p_vibevoice.txt
Start with the default CFG scale of 1.5 and adjust based on your audio quality preferences.
Python API Usage
You can also use VibeVoice programmatically:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor,
)

# Load processor and model
processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
)
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt",
    map_location="cuda",
    weights_only=False,
)

# Prepare inputs
text = "Hello, this is VibeVoice."
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

# Move tensors to the device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={"do_sample": False},
    verbose=True,
    all_prefilled_outputs=voice_prompt,
)

# Save audio
processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")
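To run the same example on Apple Silicon, swap in the MPS settings from the device-specific section above. Assuming from_pretrained accepts the same arguments there, only these lines change:

# MPS variant of the loading code (per the device-specific notes above)
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.float32,       # MPS requires float32
    device_map="mps",
    attn_implementation="sdpa",      # flash_attention_2 is unsupported on MPS
)
voice_prompt = torch.load(
    "demo/voices/streaming_model/Wayne.pt",
    map_location="mps",
    weights_only=False,
)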
Troubleshooting
Flash Attention Errors
If you encounter errors with flash_attention_2, the model will automatically fall back to SDPA:
Error loading the model. Trying to use SDPA. However, note that only
flash_attention_2 has been fully tested, and using SDPA may result in
lower audio quality.
For best results, install flash-attention:
pip install flash-attn --no-build-isolation
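If you want to choose the attention implementation up front instead of relying on the fallback, you can probe for the package. A minimal sketch:

import importlib.util

# flash-attn also requires a CUDA GPU; this only checks that the package is installed
attn_impl = (
    "flash_attention_2"
    if importlib.util.find_spec("flash_attn") is not None
    else "sdpa"
)
print("Using attention implementation:", attn_impl)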
Voice File Not Found
If your specified speaker name doesn’t match any voice files:
Warning: No voice preset found for 'InvalidName', using default voice
List available voices by checking the demo/voices/streaming_model/ directory.
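To print the valid speaker names from Python (each .pt file stem is one), a one-liner along these lines works:

from pathlib import Path

# Each file stem under the voices directory is a valid --speaker_name, e.g. "Wayne"
print(sorted(p.stem for p in Path("demo/voices/streaming_model").glob("*.pt")))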