SylReg-Decoder Base

Model Details

Model Description

  • Model type: Flow-matching-based Diffusion Transformer (DiT) with BigVGAN-v2
  • Language(s) (NLP): English
  • License: MIT

Model Sources

How to Get Started with the Model

Use the code below to get started with the model.

git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
import re

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from hugging face hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder-Base", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()

Training Details

Training Data

License Provider
LibriTTS-R CC BY 4.0 Y. Koizumi et al.

Training Hyperparameters

  • Training regime: fp16 mixed precision

Hardware

2 x A6000

Downloads last month
17
Safetensors
Model size
45.1M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including ryota-komatsu/SylReg-Decoder-Base