Handy is awesome! I used it for quite a while before Claude Code added voice support. Solid software, with very good Linux and Mac integration. Shoutout to the Parakeet models as well: extremely fast and solid for their relatively modest memory requirements.
I love Handy and have been using it for a while too. What we need is this for mobile; I don't think there are any free apps, and native dictation is not always fully local and not as good.
Whisper is still old reliable - I find that it's less prone to hallucinations than newer models, easier to run (on an AMD GPU, via whisper.cpp), and only ~2x slower than Parakeet. I even bothered to "port" Parakeet to NeMo-less PyTorch to run it on my GPU, and still went back to Whisper after a couple of days.
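For anyone curious about the whisper.cpp route, the simplest way to drive it from Python is just shelling out to the CLI. A minimal sketch; it assumes the binary is built as `whisper-cli` (newer builds; older ones called it `main`) and the model path is whatever you downloaded:

```python
# Minimal sketch: shelling out to whisper.cpp's CLI from Python.
# Binary name and model path are assumptions about your local build.
import subprocess

def transcribe(wav_path: str, model_path: str = "models/ggml-large-v3.bin") -> str:
    result = subprocess.run(
        ["whisper-cli", "-m", model_path, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(transcribe("clip.wav"))
```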
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.
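For reference, running the turbo checkpoint with faster-whisper is only a few lines (device and compute type here assume an NVIDIA GPU; swap to `device="cpu", compute_type="int8"` otherwise):

```python
# Rough sketch: large-v3-turbo via faster-whisper (CTranslate2 backend).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```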
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
I've been running Whisper large-v3 on an M2 Max through a self-hosted endpoint, and honestly the accuracy is good enough that I stopped bothering with cleanup models. The bigger annoyance for me was latency on longer chunks: anything over 30 seconds starts feeling sluggish, even with Metal acceleration. Haven't tried WhisperKit specifically, but I'm curious how it handles longer audio compared to the full model.
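If your self-hosted server speaks the OpenAI audio API (many do), the client side is just this; the base URL and model name below are assumptions about your particular setup:

```python
# Sketch: querying a self-hosted, OpenAI-compatible transcription endpoint.
# base_url and model name depend on what your server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-large-v3", file=f)
print(transcript.text)
```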
Thank you for sharing, I appreciate the emphasis on local speed and privacy. As a current user of Hex (https://github.com/kitlangton/Hex), which has similar goals, what are your thoughts on how they compare?
I see quite a few of these; the killer feature to me will be one that fine-tunes the model on your own voice.
E.g. if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means you can forget about ever dictating your name or email; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use, and you would have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data", but I haven't found any "predict MY most common word" setups.
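To be fair, nothing stops you from doing a small personal fine-tune of Whisper yourself with Hugging Face transformers. A very rough sketch; the dataset iterable is hypothetical, you'd build it from your own (recording, transcript) pairs:

```python
# Very rough sketch: fine-tuning Whisper on your own recordings so it learns
# personal vocabulary ("Donold", your email, jargon). my_voice_dataset is
# hypothetical: an iterable of (16 kHz mono float array, transcript string).
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for audio_array, text in my_voice_dataset:  # hypothetical iterable
    features = processor(audio_array, sampling_rate=16000,
                         return_tensors="pt").input_features
    labels = processor.tokenizer(text, return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```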
I was mainly motivated by papers like this: https://arxiv.org/pdf/2602.16800. But I found myself using it during a vacation when I did not have an internet connection.
I like this idea and it should work -- whatever microphone you have on should be able to hear the speaker. LMK if not (e.g., are you wearing headphones? if so, the mic can't hear the speaker)
If you don't feel like downloading a large model, you can also use `yap dictate`. Yap leverages the built-in models exposed through Speech.framework on macOS 26 (Tahoe).
Thanks! We currently have two multilingual options available:
- Whisper small (~466 MB, supports many languages)
- Parakeet v3 (~1.4 GB, supports 25 languages via FluidAudio)
https://github.com/cjpais/handy
I've been using Parakeet v3, which is fantastic (and tiny). I'm confused that Whisper is still out there.
Also vibe-coded a way to use Parakeet from the same parakeet-piper server on my GrapheneOS phone: https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines, "Continuous recording" mode outputs when silence is detected via a configurable threshold.
This outputs as you speak, in more reasonable chunks; in aggregate it's "the same output", just chunked efficiently.
Maybe you can try hackin' that up?
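A toy version of that silence-gated chunking, just to illustrate the idea (the frame count and RMS threshold are arbitrary numbers I picked, not Handy's actual values):

```python
# Toy sketch of silence-gated chunking: buffer audio frames and emit a chunk
# once RMS energy stays under a threshold for several consecutive frames.
import numpy as np

SILENCE_RMS = 0.01      # energy threshold; tune per microphone
SILENCE_FRAMES = 15     # consecutive quiet frames before cutting a chunk

def chunk_on_silence(frames):
    """frames: iterable of float32 numpy arrays; yields speech chunks."""
    buffer, quiet = [], 0
    for frame in frames:
        buffer.append(frame)
        rms = np.sqrt(np.mean(frame ** 2))
        quiet = quiet + 1 if rms < SILENCE_RMS else 0
        if quiet >= SILENCE_FRAMES and len(buffer) > quiet:
            yield np.concatenate(buffer[:-quiet])  # drop the trailing silence
            buffer, quiet = [], 0
    if buffer:
        yield np.concatenate(buffer)
```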
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model; they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For the foot pedal:
Yes, conceptually it's just another evdev trigger source, assuming the pedal exposes usable key/button events.
Otherwise we'd bridge it into the existing external control interface. Either way, the hooks are there. :)
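For the curious, reading a pedal as an evdev trigger is about this much code with python-evdev; the event node path below is an assumption (find yours with `python -m evdev.evtest`):

```python
# Sketch: treating a USB foot pedal as a push-to-talk trigger via evdev.
# /dev/input/event5 is a placeholder; your pedal's node will differ.
from evdev import InputDevice, ecodes

pedal = InputDevice("/dev/input/event5")
for event in pedal.read_loop():
    if event.type == ecodes.EV_KEY:
        if event.value == 1:      # key down -> start recording
            print("PTT pressed")
        elif event.value == 0:    # key up -> stop and transcribe
            print("PTT released")
```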
Parakeet does both just fine.
https://developers.openai.com/cookbook/examples/whisper_prom...
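The gist of that cookbook is biasing the decoder with your own vocabulary through the prompt; with the reference openai-whisper package it's the `initial_prompt` argument (the names in the prompt string are just the example from above):

```python
# Sketch: biasing Whisper toward personal vocabulary via the prompt.
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "voicemail.wav",
    initial_prompt="Speaker: Donold Smith <donold@example.com>. Jargon: Kubernetes, evdev, Parakeet.",
)
print(result["text"])
```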
Here is an example: https://www.youtube.com/watch?v=Dw_q6l3Cwp4
https://hitoku.me/draft/
I set up a code for people to download it (HITOKUHN2026), in case you want to compare or just give feedback!
Project repo: https://github.com/finnvoor/yap
EDIT: I see there is an open issue for that on GitHub.