Sopro: A 169M parameter real-time TTS model with zero-shot voice cloning

(github.com)

6 points | by marques576 21 hours ago

1 comment

marques576 21 hours ago
Some features:
169M parameters
Streaming support
Zero-shot voice cloning
0.25 RTF (real-time factor) on CPU, meaning it generates 30 seconds of audio in about 7.5 seconds
Requires 3-12 seconds of reference audio for voice cloning
Apache 2.0 license
The model was trained on a single L40S GPU. It’s not SOTA in most cases, can be a bit unstable, and sometimes fails to capture voice likeness.
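
For anyone checking the RTF figure in the list above, here is a minimal sketch of the arithmetic. It assumes the usual definition of real-time factor (generation time divided by the duration of the audio produced). The function names are hypothetical and not part of Sopro's API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock time to generate audio_seconds of speech."""
    return rtf * audio_seconds


# Figures quoted in the post: 0.25 RTF, 30 seconds of audio.
print(synthesis_time(0.25, 30.0))    # 7.5
print(real_time_factor(7.5, 30.0))   # 0.25
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what makes streaming practical; at 0.25 the model would run roughly four times faster than real time on the CPU in question.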