Handy is awesome! I used it for quite a while before Claude Code added voice support. Solid software, with very good Linux and Mac integration. Shoutout to the Parakeet models as well: extremely fast and solid for their relatively modest memory requirements.
I love Handy and have been using it for a while too. What we need is this for mobile; I don't think there are any free apps, and native dictation is not always fully local and not as good.
Whisper is still old reliable - I find that it's less prone to hallucinations than newer models, easier to run (on an AMD GPU, via whisper.cpp), and only ~2x slower than Parakeet. I even bothered to "port" Parakeet to NeMo-less PyTorch to run it on my GPU, and still went back to Whisper after a couple of days.
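For anyone curious about the whisper.cpp route, the simplest way to drive it from Python is just shelling out to the CLI. A minimal sketch; it assumes the binary is built as `whisper-cli` (newer builds; older ones called it `main`) and the model path is whatever you downloaded:

```python
# Minimal sketch: shelling out to whisper.cpp's CLI from Python.
# Binary name and model path are assumptions about your local build.
import subprocess

def transcribe(wav_path: str, model_path: str = "models/ggml-large-v3.bin") -> str:
    result = subprocess.run(
        ["whisper-cli", "-m", model_path, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(transcribe("clip.wav"))
```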
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.
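For reference, running the turbo checkpoint with faster-whisper is only a few lines (device and compute type here assume an NVIDIA GPU; swap to `device="cpu", compute_type="int8"` otherwise):

```python
# Rough sketch: large-v3-turbo via faster-whisper (CTranslate2 backend).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```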
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
I've been running Whisper large-v3 on an M2 Max through a self-hosted endpoint, and honestly the accuracy is good enough that I stopped bothering with cleanup models. The bigger annoyance for me was latency on longer chunks: anything over 30 seconds starts feeling sluggish, even with Metal acceleration. Haven't tried WhisperKit specifically, but I'm curious how it handles longer audio compared to the full model.
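If your self-hosted server speaks the OpenAI audio API (many do), the client side is just this; the base URL and model name below are assumptions about your particular setup:

```python
# Sketch: querying a self-hosted, OpenAI-compatible transcription endpoint.
# base_url and model name depend on what your server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-large-v3", file=f)
print(transcript.text)
```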
Thank you for sharing, I appreciate the emphasis on local speed and privacy. As a current user of Hex (https://github.com/kitlangton/Hex), which has similar goals, what are your thoughts on how they compare?
I see quite a few of these; the killer feature to me will be one that fine-tunes the model on your own voice.
E.g. if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means you can forget about ever dictating your name or email; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use, and you would have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data", but I haven't found any "predict MY most common word" setups.
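To be fair, nothing stops you from doing a small personal fine-tune of Whisper yourself with Hugging Face transformers. A very rough sketch; the dataset iterable is hypothetical, you'd build it from your own (recording, transcript) pairs:

```python
# Very rough sketch: fine-tuning Whisper on your own recordings so it learns
# personal vocabulary ("Donold", your email, jargon). my_voice_dataset is
# hypothetical: an iterable of (16 kHz mono float array, transcript string).
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for audio_array, text in my_voice_dataset:  # hypothetical iterable
    features = processor(audio_array, sampling_rate=16000,
                         return_tensors="pt").input_features
    labels = processor.tokenizer(text, return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```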
I was mainly motivated by papers like this: https://arxiv.org/pdf/2602.16800. But I found myself using it during a vacation when I did not have an internet connection.
I like this idea and it should work -- whatever microphone you have on should be able to hear the speaker. LMK if not (e.g., are you wearing headphones? if so, the mic can't hear the speaker)
If you don't feel like downloading a large model, you can also use `yap dictate`. Yap leverages the built-in models exposed through Speech.framework on macOS 26 (Tahoe).
Thanks! We currently have two multilingual options available:
- Whisper small (~466 MB, supports many languages)
- Parakeet v3 (~1.4 GB, supports 25 languages via FluidAudio)
https://github.com/cjpais/handy
I've been using Parakeet v3, which is fantastic (and tiny). I'm confused that Whisper is still out there.
Also vibe-coded a way to use Parakeet from the same parakeet-piper server on my GrapheneOS phone: https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines, "Continuous recording" mode outputs when silence is detected via a configurable threshold.
This outputs as you speak, in more reasonable chunks; in aggregate it's "the same output", just chunked efficiently.
Maybe you can try hackin' that up?
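A toy version of that silence-gated chunking, just to illustrate the idea (the frame count and RMS threshold are arbitrary numbers I picked, not Handy's actual values):

```python
# Toy sketch of silence-gated chunking: buffer audio frames and emit a chunk
# once RMS energy stays under a threshold for several consecutive frames.
import numpy as np

SILENCE_RMS = 0.01      # energy threshold; tune per microphone
SILENCE_FRAMES = 15     # consecutive quiet frames before cutting a chunk

def chunk_on_silence(frames):
    """frames: iterable of float32 numpy arrays; yields speech chunks."""
    buffer, quiet = [], 0
    for frame in frames:
        buffer.append(frame)
        rms = np.sqrt(np.mean(frame ** 2))
        quiet = quiet + 1 if rms < SILENCE_RMS else 0
        if quiet >= SILENCE_FRAMES and len(buffer) > quiet:
            yield np.concatenate(buffer[:-quiet])  # drop the trailing silence
            buffer, quiet = [], 0
    if buffer:
        yield np.concatenate(buffer)
```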
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model; they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For the foot pedal:
Yes, conceptually it's just another evdev trigger source, assuming the pedal exposes usable key/button events.
Otherwise we'd bridge it into the existing external control interface. Either way, the hooks are there. :)
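For the curious, reading a pedal as an evdev trigger is about this much code with python-evdev; the event node path below is an assumption (find yours with `python -m evdev.evtest`):

```python
# Sketch: treating a USB foot pedal as a push-to-talk trigger via evdev.
# /dev/input/event5 is a placeholder; your pedal's node will differ.
from evdev import InputDevice, ecodes

pedal = InputDevice("/dev/input/event5")
for event in pedal.read_loop():
    if event.type == ecodes.EV_KEY:
        if event.value == 1:      # key down -> start recording
            print("PTT pressed")
        elif event.value == 0:    # key up -> stop and transcribe
            print("PTT released")
```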
Parakeet does both just fine.
https://developers.openai.com/cookbook/examples/whisper_prom...
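The gist of that cookbook is biasing the decoder with your own vocabulary through the prompt; with the reference openai-whisper package it's the `initial_prompt` argument (the names in the prompt string are just the example from above):

```python
# Sketch: biasing Whisper toward personal vocabulary via the prompt.
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "voicemail.wav",
    initial_prompt="Speaker: Donold Smith <donold@example.com>. Jargon: Kubernetes, evdev, Parakeet.",
)
print(result["text"])
```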
Here is an example: https://www.youtube.com/watch?v=Dw_q6l3Cwp4
https://hitoku.me/draft/
I set up a code for people to download it (HITOKUHN2026), in case you want to compare or just give feedback!
Project repo: https://github.com/finnvoor/yap
EDIT: I see there is an open issue for that on GitHub.