

This guide shows you how to generate speech from text using the VibeVoice streaming model.

Prerequisites

Before running inference, ensure you have:
  • Installed VibeVoice and its dependencies
  • Downloaded, or have access to, a model (e.g., microsoft/VibeVoice-Realtime-0.5B)
  • Voice prompt files in .pt format (located in demo/voices/streaming_model/)

Basic Usage

Step 1: Prepare Your Text File

Create a text file with the content you want to convert to speech:
demo/text_examples/1p_vibevoice.txt
Hello, this is a test of the VibeVoice text-to-speech system.
Step 2: Run Inference

Use the realtime_model_inference_from_file.py script:
python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt \
  --speaker_name Wayne \
  --output_dir ./outputs
Step 3: Check Output

The generated audio will be saved to the output directory:
ls outputs/
# 1p_vibevoice_generated.wav
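
To sanity-check the generated file, you can read its header with Python's standard-library wave module. This is a minimal sketch; the helper name `wav_info` is ours, not part of VibeVoice:

```python
import wave

def wav_info(path):
    """Return (sample_rate, duration_seconds, channels) for a WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate  # frames / frames-per-second
        return rate, duration, wav.getnchannels()
```

For example, `wav_info("outputs/1p_vibevoice_generated.wav")` reports the sample rate and duration of the clip produced above.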

Command-Line Arguments

model_path (string, default: "microsoft/VibeVoice-Realtime-0.5B")
  Path to the HuggingFace model directory or model ID.

txt_path (string, default: "demo/text_examples/1p_vibevoice.txt")
  Path to the text file containing the script to synthesize.

speaker_name (string, default: "Wayne")
  Name of the speaker voice to use. Must match a voice file in demo/voices/streaming_model/.

output_dir (string, default: "./outputs")
  Directory where the generated audio files will be saved.

device (string, default: "auto")
  Device for inference. Options: cuda, mps, or cpu. Defaults to CUDA if available, otherwise MPS, then CPU.

cfg_scale (float, default: 1.5)
  CFG (Classifier-Free Guidance) scale for generation. Higher values increase adherence to the input prompt.
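
For reference, the documented flags and defaults could be declared with argparse roughly as follows. This is a hypothetical sketch mirroring the table above; the actual parser in realtime_model_inference_from_file.py may differ:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented CLI flags and defaults.
    p = argparse.ArgumentParser(description="VibeVoice streaming inference")
    p.add_argument("--model_path", default="microsoft/VibeVoice-Realtime-0.5B")
    p.add_argument("--txt_path", default="demo/text_examples/1p_vibevoice.txt")
    p.add_argument("--speaker_name", default="Wayne")
    p.add_argument("--output_dir", default="./outputs")
    p.add_argument("--device", default="auto",
                   choices=["auto", "cuda", "mps", "cpu"])
    p.add_argument("--cfg_scale", type=float, default=1.5)
    return p
```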

Device-Specific Configuration

CUDA (NVIDIA GPUs)

python demo/realtime_model_inference_from_file.py \
  --device cuda \
  --txt_path demo/text_examples/1p_vibevoice.txt
CUDA devices use bfloat16 dtype and flash_attention_2 for optimal performance.

MPS (Apple Silicon)

python demo/realtime_model_inference_from_file.py \
  --device mps \
  --txt_path demo/text_examples/1p_vibevoice.txt
MPS requires the float32 dtype and uses the SDPA attention implementation, since flash_attention_2 is not supported on MPS.

CPU

python demo/realtime_model_inference_from_file.py \
  --device cpu \
  --txt_path demo/text_examples/1p_vibevoice.txt
CPU inference is significantly slower than GPU inference and should only be used for testing.
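
The documented "auto" fallback order (CUDA, then MPS, then CPU) can be expressed as a small pure function. This is a sketch of one possible implementation, not the script's actual code; in practice the availability flags would come from torch.cuda.is_available() and torch.backends.mps.is_available():

```python
def resolve_device(requested="auto", cuda_available=False, mps_available=False):
    """Pick a device following the documented priority: CUDA > MPS > CPU.

    An explicit request (e.g. "mps") always wins over auto-detection.
    """
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```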

Understanding the Output

After generation completes, you’ll see a summary with performance metrics:
==================================================
GENERATION SUMMARY
==================================================
Input file: demo/text_examples/1p_vibevoice.txt
Output file: ./outputs/1p_vibevoice_generated.wav
Speaker names: Wayne
Prefilling text tokens: 42
Generated speech tokens: 1250
Total tokens: 1292
Generation time: 3.45 seconds
Audio duration: 5.20 seconds
RTF (Real Time Factor): 0.66x
==================================================

Key Metrics

  • Prefilling text tokens: Number of input text tokens processed
  • Generated speech tokens: Number of speech tokens generated by the model
  • RTF (Real Time Factor): Generation time divided by audio duration. Values < 1.0 indicate faster than real-time generation
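
The RTF from the summary above can be reproduced directly; this one-line helper is just for illustration:

```python
def real_time_factor(generation_seconds, audio_seconds):
    # RTF = generation time / audio duration; values below 1.0 mean
    # audio is generated faster than real time.
    return generation_seconds / audio_seconds

# Using the numbers from the example summary: 3.45 s to generate 5.20 s of audio.
print(f"RTF: {real_time_factor(3.45, 5.20):.2f}x")  # RTF: 0.66x
```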

Advanced Configuration

Adjusting CFG Scale

The CFG scale controls how closely the model follows the input prompt:
python demo/realtime_model_inference_from_file.py \
  --cfg_scale 1.0 \
  --txt_path demo/text_examples/1p_vibevoice.txt
Start with the default CFG scale of 1.5 and adjust based on your audio quality preferences.

Python API Usage

You can also use VibeVoice programmatically:
import torch
from vibevoice import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingProcessor
)

# Load processor and model
processor = VibeVoiceStreamingProcessor.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
model.eval()
model.set_ddpm_inference_steps(num_steps=5)

# Load voice prompt
voice_prompt = torch.load("demo/voices/streaming_model/Wayne.pt", map_location="cuda", weights_only=False)

# Prepare inputs
text = "Hello, this is VibeVoice."
inputs = processor.process_input_with_cached_prompt(
    text=text,
    cached_prompt=voice_prompt,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Move to device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda")

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=1.5,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    verbose=True,
    all_prefilled_outputs=voice_prompt
)

# Save audio
processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")

Troubleshooting

Flash Attention Errors

If you encounter errors with flash_attention_2, the model will automatically fall back to SDPA:
Error loading the model. Trying to use SDPA. However, note that only 
flash_attention_2 has been fully tested, and using SDPA may result in 
lower audio quality.
For best results, install flash-attention:
pip install flash-attn --no-build-isolation
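
The fallback behavior described above follows a common try/except pattern. The sketch below is generic and not VibeVoice's actual loading code; `load_fn` stands in for a from_pretrained call that accepts an attn_implementation keyword:

```python
def load_with_attention_fallback(load_fn):
    """Try flash_attention_2 first; fall back to SDPA if loading fails."""
    try:
        return load_fn(attn_implementation="flash_attention_2"), "flash_attention_2"
    except Exception:
        # e.g. flash-attn not installed, or unsupported on this device
        return load_fn(attn_implementation="sdpa"), "sdpa"
```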

Voice File Not Found

If your specified speaker name doesn’t match any voice files:
Warning: No voice preset found for 'InvalidName', using default voice
List available voices by checking the demo/voices/streaming_model/ directory.
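
Since voice prompts are stored as .pt files whose stems serve as speaker names (e.g. Wayne.pt for --speaker_name Wayne), a quick directory scan lists the valid options. This helper is our own sketch, not a VibeVoice API:

```python
from pathlib import Path

def list_voices(voice_dir="demo/voices/streaming_model"):
    """Return available speaker names: the stem of each .pt voice file."""
    return sorted(p.stem for p in Path(voice_dir).glob("*.pt"))
```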