This guide covers advanced configuration options for VibeVoice, including model settings, generation parameters, and performance tuning.
## Model Configuration

### VibeVoiceStreamingConfig
The main configuration class for VibeVoice streaming models.

### Configuration Components

The configuration is composed of three sub-configurations:

- **Acoustic Tokenizer**: handles audio encoding and decoding with a VAE architecture
- **Decoder**: language model backbone (Qwen2) for text and speech processing
- **Diffusion Head**: diffusion-based audio generation component
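A minimal sketch of inspecting these components, assuming Hugging Face-style sub-config attribute names modeled on the reference VibeVoice implementation (the import path and attribute names are assumptions):

```python
# Sketch only: the import path and attribute names follow the reference
# VibeVoice implementation and are assumptions for this build.
from vibevoice import VibeVoiceStreamingConfig

config = VibeVoiceStreamingConfig()

print(config.acoustic_tokenizer_config)  # VAE audio encode/decode settings
print(config.decoder_config)             # Qwen2 language-model backbone settings
print(config.diffusion_head_config)      # diffusion-based generation settings
```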
## Processor Configuration

### VibeVoiceStreamingProcessor
The processor handles text and audio preprocessing. Its key settings are:

- Compression ratio for speech tokenization; determines how many audio samples map to one speech token.
- Whether to apply decibel normalization to audio inputs for consistent volume levels.
### Audio Processor Settings
- Audio sampling rate in Hz. VibeVoice uses 24 kHz by default.
- Enable audio normalization to a target dB level.
- Target dBFS (decibels relative to full scale) for audio normalization; controls output volume.
- Small epsilon value for numerical stability in normalization.
## Generation Parameters

### Diffusion Inference Steps
Control the quality/speed tradeoff. More steps generally improve quality but increase generation latency; 5 steps provides a good balance for real-time applications.
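For example, assuming your build exposes the `set_ddpm_inference_steps` helper found in the reference VibeVoice code:

```python
# Assumes a loaded `model`; the helper name follows the reference
# VibeVoice API and is an assumption for forks.
model.set_ddpm_inference_steps(num_steps=5)    # real-time friendly
# model.set_ddpm_inference_steps(num_steps=10) # higher quality, slower
```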
### Noise Scheduler Configuration
Customize the diffusion noise scheduler with two settings:

- Diffusion solver algorithm. Options include `sde-dpmsolver++`, `dpmsolver`, and `euler`.
- Noise schedule for the diffusion process. Affects generation characteristics.
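As an illustration, these options map naturally onto a diffusers-style `DPMSolverMultistepScheduler`; whether VibeVoice wires its scheduler through this exact class is an assumption:

```python
from diffusers import DPMSolverMultistepScheduler

# Illustrative only: VibeVoice's internal scheduler plumbing may differ.
scheduler = DPMSolverMultistepScheduler(
    algorithm_type="sde-dpmsolver++",   # solver algorithm
    beta_schedule="squaredcos_cap_v2",  # noise schedule
)
```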
### CFG Scale (Classifier-Free Guidance)
| CFG Scale | Effect | Use Case |
|---|---|---|
| 1.0 | Minimal guidance | More creative, diverse outputs |
| 1.5 | Balanced (default) | General purpose |
| 2.0 | Strong guidance | Higher prompt adherence |
| 2.5+ | Very strong | Maximum control, may reduce naturalness |
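A hedged generation sketch tying the parameters in this section together; the exact kwargs accepted by `generate` (notably `cfg_scale`) follow the reference VibeVoice API and are assumptions for other builds:

```python
# Assumes a loaded `model` and `inputs` produced by the processor.
outputs = model.generate(
    **inputs,
    cfg_scale=1.5,        # balanced guidance (see table above)
    do_sample=False,      # deterministic output (default)
    max_new_tokens=None,  # let the model pick the length from the input text
)
```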
### Sampling Parameters
- `do_sample`: whether to use sampling or greedy decoding. `False` gives deterministic output.
- `temperature`: sampling temperature; higher values increase randomness. Only used if `do_sample=True`.
- `top_p`: nucleus sampling threshold; only the smallest set of top tokens whose cumulative probability exceeds `top_p` is considered. Only used if `do_sample=True`.

For deterministic, reproducible output, use `do_sample=False` (the default).

## Device-Specific Optimization
### CUDA Configuration

### MPS Configuration (Apple Silicon)

### CPU Configuration
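A combined sketch covering the three targets above, assuming a Hugging Face-style `from_pretrained` loader; the model class name is a hypothetical stand-in for whatever your build exports:

```python
import torch

from vibevoice import VibeVoiceStreamingForConditionalGeneration  # hypothetical name

if torch.cuda.is_available():
    device, dtype = "cuda", torch.bfloat16  # CUDA: half precision for speed
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float32    # Apple Silicon: fp32 is safest on MPS
else:
    device, dtype = "cpu", torch.float32    # CPU: fp32, expect higher latency

model = VibeVoiceStreamingForConditionalGeneration.from_pretrained(
    "path/to/vibevoice",  # placeholder model path
    torch_dtype=dtype,
).to(device)
```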
## Attention Implementation

### Flash Attention 2 (Recommended)
Benefits:

- Faster inference
- Lower memory usage
- Better audio quality (fully tested)

Requirements:

- CUDA-compatible GPU
- Flash Attention installed:

```bash
pip install flash-attn --no-build-isolation
```
### SDPA (Scaled Dot-Product Attention)
Use SDPA for:

- CPU inference
- MPS (Apple Silicon)
- Systems without Flash Attention support
SDPA is the automatic fallback if Flash Attention 2 fails to load.
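A sketch of that selection logic, using the standard Hugging Face `attn_implementation` kwarg (the model class name is again a hypothetical stand-in):

```python
# Try Flash Attention 2 first; fall back to SDPA if it cannot be loaded.
try:
    model = VibeVoiceStreamingForConditionalGeneration.from_pretrained(
        "path/to/vibevoice", attn_implementation="flash_attention_2"
    )
except (ImportError, ValueError):
    model = VibeVoiceStreamingForConditionalGeneration.from_pretrained(
        "path/to/vibevoice", attn_implementation="sdpa"
    )
```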
## TTS Backbone Configuration

### Layer Partitioning
The decoder is divided into text-encoding and TTS layers. `tts_backbone_num_hidden_layers` sets the number of upper Transformer layers used for TTS; the layers below perform text-only encoding.

- Lower layers: text encoding only
- Upper `tts_backbone_num_hidden_layers` layers: text + speech generation
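For example, assuming the value is passed through the config constructor (the kwarg name comes from this guide; the constructor route is an assumption):

```python
# Reserve the top 8 decoder layers for TTS; lower layers stay text-only.
# Passing the value via the constructor is an assumption.
config = VibeVoiceStreamingConfig(tts_backbone_num_hidden_layers=8)
```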
## Streaming Configuration

### Audio Streamer Settings

### Stop Event Handling
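A hedged sketch covering both settings: a background thread consumes streamed audio chunks while a stop event allows interruption. `AudioStreamer` and its constructor/`get_stream` signature follow the reference VibeVoice streamer and are assumptions for other builds; `play` is a hypothetical playback function:

```python
import threading

from vibevoice.modular.streamer import AudioStreamer  # import path assumed

stop_event = threading.Event()
streamer = AudioStreamer(batch_size=1, stop_signal=stop_event)

def consume():
    for chunk in streamer.get_stream(0):  # yields audio chunks as they decode
        if stop_event.is_set():           # external stop request
            break
        play(chunk)                       # hypothetical playback function

threading.Thread(target=consume, daemon=True).start()
# ... later, to interrupt generation early:
stop_event.set()
```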
## Advanced Generation Options

### Refresh Negative Prompt
Whether to regenerate the negative prompt for CFG. Set to `False` for faster repeated generations.

### Verbose Output
### Max New Tokens
Setting `max_new_tokens=None` lets the model determine the appropriate length based on the input text.

## Saving and Loading Configurations
### Save Processor Configuration
Saving writes `preprocessor_config.json` with all processor settings.
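Assuming the processor follows the standard Hugging Face `save_pretrained` API:

```python
# Writes preprocessor_config.json into the target directory.
processor.save_pretrained("./my_vibevoice_config")
```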
### Load Custom Configuration
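Assuming the standard Hugging Face `from_pretrained` API (the import path is an assumption):

```python
from vibevoice import VibeVoiceStreamingProcessor

# Loads preprocessor_config.json from a local directory or hub repo.
processor = VibeVoiceStreamingProcessor.from_pretrained("./my_vibevoice_config")
```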
### Configuration File Example
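An illustrative `preprocessor_config.json`. Key names and defaults follow the reference VibeVoice processor and should be treated as assumptions for this build; the 24 kHz sampling rate is the default stated above:

```json
{
  "speech_tok_compress_ratio": 3200,
  "db_normalize": true,
  "audio_processor": {
    "sampling_rate": 24000,
    "normalize_audio": true,
    "target_dB_FS": -25,
    "eps": 1e-6
  }
}
```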
## Performance Tuning

### Optimize for Latency

### Optimize for Quality

### Optimize for Memory
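Hedged sketches of the three profiles; helper and kwarg names follow the earlier sketches and are assumptions where this guide does not state them:

```python
import torch

# Latency: few diffusion steps, deterministic decoding (pair with Flash Attention 2).
model.set_ddpm_inference_steps(num_steps=5)
audio = model.generate(**inputs, cfg_scale=1.5, do_sample=False)

# Quality: more diffusion steps, stronger guidance.
model.set_ddpm_inference_steps(num_steps=10)
audio = model.generate(**inputs, cfg_scale=2.0, do_sample=False)

# Memory: load in half precision (hypothetical class name, as above).
model = VibeVoiceStreamingForConditionalGeneration.from_pretrained(
    "path/to/vibevoice", torch_dtype=torch.bfloat16
).to("cuda")
```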
## Troubleshooting

### Out of Memory

- Load the model in half precision (`bfloat16`/`float16`)
- Use Flash Attention 2, which lowers memory usage
- Reduce batch size or generate shorter segments

### Slow Generation
- Install Flash Attention 2 for CUDA GPUs
- Reduce `num_steps` in diffusion inference
- Use `do_sample=False` for deterministic generation
- Ensure the model is on GPU, not CPU
### Poor Audio Quality
- Increase diffusion inference steps to 7-10
- Ensure Flash Attention 2 is being used (check logs)
- Adjust CFG scale (try 1.5-2.0 range)
- Verify voice prompt is appropriate for target language