According to the Open ASR Leaderboard [1], it looks like Parakeet V2/V3 and Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are open, but Parakeet is the smallest of the 3. I use Parakeet V3 with Handy and it works great locally for me.
By the way, I've been using a Whisper model, specifically WhisperX, to do all my work, and for whatever reason I just simply was not familiar with the Handy app. I've now downloaded and used it, and what a great suggestion. Thank you for putting it here, along with the direct link to the leaderboard.
I can tell that this is now definitely going to be my go-to model and app on all my clients.
I've helped many Twitch streamers set up https://github.com/royshil/obs-localvocal to plug transcription & translation into their streams, mainly for German audio to English subtitles.
I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.
I see a couple of obvious problems. This doesn't seem to support translation, which is unfortunate because that's pretty key for this use case. It also only supports one language at a time, which is a problem because streamers frequently code-switch while talking to their chat in different languages, or on Discord with their gameplay partners. Maybe such a plugin would be able to detect which language is spoken and route to one or the other model as needed?
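For what it's worth, the detect-and-route idea could look roughly like the sketch below. Everything in it is hypothetical: `LanguageRouter`, the `detect` callable and the per-language transcribers are placeholders I made up, not part of Moonshine, Whisper or obs-localvocal. It only shows the routing logic, assuming you already have some language-ID function and one single-language model per language.

```python
from typing import Callable, Dict

# Hypothetical type: takes a chunk of raw audio bytes, returns text.
Transcriber = Callable[[bytes], str]


class LanguageRouter:
    """Route each audio chunk to a single-language model (illustrative only)."""

    def __init__(
        self,
        detect: Callable[[bytes], str],   # assumed language-ID function
        models: Dict[str, Transcriber],   # language code -> transcriber
        fallback: str,                    # language used when detection misses
    ) -> None:
        if fallback not in models:
            raise ValueError(f"no model registered for fallback {fallback!r}")
        self.detect = detect
        self.models = models
        self.fallback = fallback

    def transcribe(self, chunk: bytes) -> str:
        lang = self.detect(chunk)
        # Unknown or unsupported languages fall back to the default model.
        model = self.models.get(lang, self.models[self.fallback])
        return model(chunk)


if __name__ == "__main__":
    # Stub detector and models, standing in for real ones.
    router = LanguageRouter(
        detect=lambda chunk: "de" if chunk.startswith(b"de") else "en",
        models={"en": lambda c: "[en transcript]", "de": lambda c: "[de transcript]"},
        fallback="en",
    )
    print(router.transcribe(b"de-audio"))
```

In practice the hard part is the detector: per-chunk language ID on short, code-switched speech is noisy, so a real plugin would probably need some smoothing over time before switching models.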
Any plans regarding JavaScript support in the browser?
There was an issue with a demo but it's missing now. I can't recall for sure but I think I got it working locally myself too but then found it broke unexpectedly and I didn't manage to find out why.
Accuracy is often presumed to mean English accuracy, which is fine, but just saying "higher" is vague. Does it mean higher in English only? Higher in some subset of languages? Which ones?
The minimum useful data for this stuff is a small table of language | WER for dataset
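To make that concrete, here is a minimal sketch of how such a table could be produced. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function names and the input layout are my own assumptions, and the lowercase-and-split tokenization is deliberately simplistic; real benchmarks apply more careful text normalization.

```python
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Simplistic normalization: lowercase and split on whitespace.
    return text.lower().split()


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_table(samples: Dict[str, List[Tuple[str, str]]]) -> str:
    """samples maps language -> list of (reference, hypothesis) pairs."""
    rows = ["language | WER"]
    for lang in sorted(samples):
        edits = 0
        words = 0
        for reference, hypothesis in samples[lang]:
            ref = tokenize(reference)
            edits += word_edit_distance(ref, tokenize(hypothesis))
            words += len(ref)
        if words == 0:
            raise ValueError(f"no reference words for language {lang!r}")
        # Corpus-level WER: total edits over total reference words.
        rows.append(f"{lang} | {edits / words:.1%}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(wer_table({"en": [("the cat sat on the mat", "the cat sit on mat")]}))
```

Note this pools errors across all utterances per language rather than averaging per-utterance WER, so long utterances count proportionally more.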
This is awesome, well done guys, I’m gonna try it as my ASR component on the local voice assistant I’ve been building https://github.com/acatovic/ova. The tiny streaming latencies you show look insane
No idea why 'sudo pip install --break-system-packages moonshine-voice' is the recommended way to install on raspi?
The authors do acknowledge this, though, and give a slightly too complex way to do this with uv in an example project. (FYI, you don't need to source anything if you use uv run.)
For those wondering about the language support, currently English, Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, Vietnamese are available (most in Base size = 58M params)
Haven't tested it yet, but I'm wondering how it will behave with a lot of IT jargon and tech acronyms. For that reason I mostly had to run an LLM after STT, but that was slowing down Parakeet inference. Otherwise it sometimes had problems properly detecting terms such as CoreML, int8, fp16, half float, ARKit, AVFoundation, ONNX, etc.
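One cheaper alternative to a full LLM pass, sketched under assumptions: a small glossary of known mis-transcriptions applied with a regex after STT. The mis-hearings in `GLOSSARY` are invented examples, not observed Parakeet or Moonshine output; you would collect the real ones from your own transcripts. This is a different technique from the LLM cleanup described above and will only catch phrases you have listed.

```python
import re
from typing import Dict

# Hypothetical mis-hearings (lowercase keys) mapped to the intended term.
GLOSSARY: Dict[str, str] = {
    "core ml": "CoreML",
    "int eight": "int8",
    "fp sixteen": "fp16",
    "onyx": "ONNX",
    "a r kit": "ARKit",
}


def build_pattern(glossary: Dict[str, str]) -> "re.Pattern[str]":
    # Longest phrases first so longer entries win over shorter overlapping ones.
    keys = sorted(glossary, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def fix_jargon(text: str, glossary: Dict[str, str] = GLOSSARY) -> str:
    """Replace known mis-transcribed phrases with their canonical spelling."""
    pattern = build_pattern(glossary)
    return pattern.sub(lambda m: glossary[m.group(1).lower()], text)


if __name__ == "__main__":
    print(fix_jargon("export the model to onyx with int eight weights"))
```

The obvious trade-off is false positives: a glossary entry like "onyx" would also rewrite the gemstone, so this only makes sense when the audio is known to be technical.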
> This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
> The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
> The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder.
reading through readme.md
"License
This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder."
[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
I'm actually a little surprised they haven't added model size to that chart.
Timestamp 1: 2026-02-25T00:31:28 1771979488 https://news.ycombinator.com/item?id=47145661
Timestamp 2: 2026-02-25T00:32:03 1771979523 https://news.ycombinator.com/item?id=47145666
Two large, detailed comments in two different threads within a 35-second span, from a new account.
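A quick check of that gap from the two timestamps quoted above, assuming the ISO times are UTC (the epoch values agree with that reading):

```python
from datetime import datetime, timezone

# Values copied from the two timestamp lines; UTC is an assumption.
first_iso, first_epoch = "2026-02-25T00:31:28", 1771979488
second_iso, second_epoch = "2026-02-25T00:32:03", 1771979523


def to_utc(iso: str) -> datetime:
    return datetime.fromisoformat(iso).replace(tzinfo=timezone.utc)


# Gap computed two ways: from the ISO strings and from the epoch seconds.
gap_from_iso = int((to_utc(second_iso) - to_utc(first_iso)).total_seconds())
gap_from_epoch = second_epoch - first_epoch

print(gap_from_iso, gap_from_epoch)
```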