title: VibeVoice
emoji: ๐ŸŒ
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: VibeVoice-Realtime-0.5B - Real-time neural voice generation

๏ฟฝ๏ธ VibeVoice: Open-Source Frontier Voice AI

Project Page | Hugging Face | Technical Report

Real-time neural voice synthesis system powered by Microsoft VibeVoice-Realtime-0.5B.
This Space demonstrates how to run the model using Docker on Hugging Face Spaces.


🚀 Space Features

  • ⚡ Real-time voice generation (~300 ms to first audio)
  • 🧠 Lightweight 0.5B-parameter model
  • 🐳 Docker-based deployment (model weights downloaded at runtime)
  • Runs on CPU (Zero GPU supported)

📦 Model Details


๐Ÿ—๏ธ Technical Overview (From Original Repo)

📰 News

New Realtime TTS

2025-12-03: 📣 We open-sourced VibeVoice-Realtime-0.5B.

To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team.

Overview

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

Realtime streaming TTS model: Produces initial audible speech in ~300 ms and supports streaming text input for single-speaker real-time speech generation.
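
The ~300 ms figure refers to the time until the first audio chunk is produced, not the time to synthesize the whole utterance. A minimal sketch of how one might measure that for any streaming generator is shown below. `fake_tts_stream` is a hypothetical stand-in, not the VibeVoice API, and its delay and dummy audio bytes are made up for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text_chunks: Iterable[str], delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model (not the VibeVoice API).

    Consumes text incrementally and yields one dummy audio chunk per text chunk.
    """
    for chunk in text_chunks:
        time.sleep(delay_s)  # pretend to synthesize this piece of text
        yield b"\x00" * (len(chunk) * 2)  # placeholder bytes, not real audio


def first_chunk_latency(stream: Iterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total bytes received)."""
    start = time.perf_counter()
    latency = None
    total = 0
    for audio in stream:
        if latency is None:
            # The first chunk is what the listener perceives as responsiveness.
            latency = time.perf_counter() - start
        total += len(audio)
    if latency is None:
        raise ValueError("stream produced no audio")
    return latency, total


latency, total_bytes = first_chunk_latency(fake_tts_stream(["Hello, ", "world."]))
print(f"first chunk after {latency * 1000:.1f} ms, {total_bytes} bytes total")
```

Because the generator is lazy, the timer starts before any synthesis work happens, so the measured latency includes the model's start-up cost for the first chunk. To time a real streaming model, pass its output iterator to `first_chunk_latency` in place of the stand-in.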

[Figures: MOS Preference Results; VibeVoice Overview]

🎵 Demo Examples

English

https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784

Cross-Lingual

https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722

For more examples, see the Project Page.

Risks and Limitations

While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate.

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused. Users must ensure transcripts are reliable and avoid using generated content in misleading ways.

Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
