How fast is N tokens per second really?

(mikeveerman.github.io)

123 points | by hexagr 2 days ago

16 comments

SXX 13 minutes ago
I think your demo needs more realistic thinking logs, because thinking usually burns at least 2x to 3x as many tokens as the code itself, and much more for harder tasks.
adampzakaria 16 minutes ago
This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!
aurareturn 26 minutes ago
We truly are in the dial up era of GenAI.
antirez 38 minutes ago
Token/sec only makes sense once you tell me four things:
1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.
2. prefill t/s, that is, prompt processing speed.
3. What is the slope of those two numbers as the context size increases? An implementation that decodes at 50 t/s with 2k context but at 7 t/s with 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.
4. What's your use case? Reading a huge text and then producing a small output like "fraud probability=12%"? Or reading a small question and generating a lot of text? Whether a model is usable at a given prefill/decoding speed depends substantially on the answer.
For instance, my DS4F inference on the DGX Spark prefills at 350 t/s, dropping to 200 t/s on already large contexts, but decodes at 13 t/s.
On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.
The two systems can perform dramatically differently or almost the same depending on the use case. In general, for local inference to be acceptable, even if slow, you want at least 100 t/s prefill and at least 10 t/s generation. To be OK-ish, 200 to 400 t/s prefill and 15-25 t/s generation. To be a wonderful experience, thousands of t/s prefill and 100 t/s generation.
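To make the use-case point concrete, here is a rough sketch that turns the two speeds into wall-clock time. The prefill and decode figures are the ones quoted above; the two workloads and the `Machine` / `wall_clock_seconds` names are made up for illustration, and the model assumes constant speeds, so it ignores the slowdown with growing context from point 3.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    prefill_tps: float  # prompt-processing speed, tokens/second
    decode_tps: float   # generation speed, tokens/second


def wall_clock_seconds(m: Machine, prompt_tokens: int, output_tokens: int) -> float:
    """Time to process the prompt plus time to generate the output.

    Assumes both speeds stay constant regardless of context size.
    """
    return prompt_tokens / m.prefill_tps + output_tokens / m.decode_tps


# Speeds quoted in the comment above.
spark = Machine("DGX Spark", prefill_tps=350, decode_tps=13)
mac = Machine("Mac Ultra", prefill_tps=400, decode_tps=35)

# Hypothetical workloads: (prompt tokens, output tokens).
workloads = {
    "long read, short answer": (50_000, 20),
    "short question, long answer": (200, 2_000),
}

for label, (prompt, output) in workloads.items():
    for m in (spark, mac):
        secs = wall_clock_seconds(m, prompt, output)
        print(f"{label:28s} {m.name:10s} {secs:7.1f} s")
```

Under these assumptions the long-read workload takes about 144 s on the Spark and 126 s on the Mac (nearly the same), while the long-answer workload takes about 154 s versus 58 s (dramatically different).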
bjelkeman-again 1 hour ago
Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.
- NitpickLawyer 1 hour ago
  It really depends. With the new "thinking" models they usually spend some time before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add that to the prompt processing, and diminishing speeds as context increases, and it becomes really slow for longer sessions.
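  As a back-of-the-envelope check on the spinner time, a small sketch. It counts only the hidden reasoning tokens and assumes a constant decode speed; `spinner_seconds` is just an illustrative name, and prompt processing would add to these numbers.

```python
def spinner_seconds(thinking_tokens: int, decode_tps: float) -> float:
    """Seconds spent generating hidden reasoning before the answer starts."""
    return thinking_tokens / decode_tps


for tps in (10, 20, 30, 100):
    wait = spinner_seconds(1_000, tps)
    print(f"{tps:3d} t/s -> {wait:5.1f} s of waiting for 1k thinking tokens")
```

  At the 20-30 t/s mentioned above, 1k thinking tokens works out to roughly 33 to 50 seconds before the first visible word.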
  - mudkipdev 3 minutes ago
    Reminds me of the possibility of running DeepSeek at 3-4 t/s with SSD streaming, could be viable if you are running something overnight for example
ohadron 54 minutes ago
This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?
- tekacs 11 minutes ago
  Google's 3.5 Flash – which came out yesterday – is 200-300 tokens/second (albeit purportedly inefficient in its use of reasoning tokens) and according to Google, 800-1500+ tokens/second on their 8i TPUs when they're out!
  It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.
- black_knight 16 minutes ago
  People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating it: the right abstraction, the right meta-programming technique.
  I constantly push Opus and GPT, and they are getting better. But I still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!
- dkersten 44 minutes ago
  For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/llama 3.3/qwen 3, then you can get upwards of 600 TPS on groq and up to 3k TPS on Cerebras.
  Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single-purpose requests) you absolutely can use these models and get insane speeds.
- 8note 39 minutes ago
  i really want a qwen on one of these chips: https://chatjimmy.ai
  15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem
  - Cerium 32 minutes ago
    Why is that? To me it seems to go in the other direction. I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and I have an incentive to try to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention, then there is nothing more to gain from having parallel threads.
- c7b 27 minutes ago
  Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?
  - colechristensen 6 minutes ago
    Branching strategies: do 10 things in parallel and evaluate for the best at the end, or something along the lines of an evolutionary algorithm. Turn up the temperature on an LLM, add a survival mechanism, and generate solutions to the same problem over and over.
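    A toy sketch of that loop, with the LLM and the evaluator replaced by stand-ins. `propose` just adds Gaussian noise scaled by a "temperature" and `score` measures distance to an arbitrary target; both are hypothetical placeholders for a real model call and a real test suite or benchmark.

```python
import random


def propose(parent: float, temperature: float, rng: random.Random) -> float:
    """Stand-in for one LLM attempt: a noisy variation of the parent."""
    return parent + rng.gauss(0.0, temperature)


def score(candidate: float) -> float:
    """Stand-in evaluator (tests passing, benchmark, ...). Higher is better."""
    return -abs(candidate - 42.0)


def evolve(generations: int, branches: int, survivors: int,
           temperature: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    population = [0.0]  # start from a single naive solution
    for _ in range(generations):
        # Branch: every survivor spawns several variations.
        children = [propose(p, temperature, rng)
                    for p in population for _ in range(branches)]
        # Survival: keep only the best few, parents included.
        ranked = sorted(population + children, key=score, reverse=True)
        population = ranked[:survivors]
    return population[0]


best = evolve(generations=30, branches=10, survivors=3, temperature=5.0)
print(f"best candidate: {best:.2f}, score: {score(best):.2f}")
```

    Because parents compete alongside their children, the best score never gets worse from one generation to the next. With a real model each generation costs `branches` times the tokens of a single attempt, which is why this only starts to feel practical at very high tokens/s.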
- philipp-gayret 46 minutes ago
  If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.
  - KronisLV 39 minutes ago
    Used them for a while! They didn't seem to have prompt caching, so I burnt through the daily 24M token limit really quickly when doing large-scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be redone and I still spotted some issues months down the line. In part I blame their model catalogue, which did get an update to GLM 4.7 sometime way back but is definitely showing its age: https://inference-docs.cerebras.ai/models/overview
    Quality-wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, which after 2-10 loops usually finds most issues). Token amount wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though, and apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.
    Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.
  - dkersten 40 minutes ago
    It’s GLM 4.7, GPT OSS 120B, or llama 3.1 8B so not exactly the latest or best models.
    But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!
    [edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]
emehrkay 35 minutes ago
I just looked up what my computer is capable of (m2 MacBook Air) and it says 15-35 tokens per second. I could live with that writing code with a local model.
johng 2 days ago
Neat website, the visualization is great. I had a hard time wrapping my head around the tokens/s thing but this made it easy.
raverbashing 1 hour ago
On avg 1 token = 4 chars
So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
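A quick sketch of that conversion. It assumes 4 characters per token as above, one 8-bit byte per character, no start/stop-bit overhead, and treats "2400 baud" as 2400 bit/s; the constant and function names are made up for illustration.

```python
CHARS_PER_TOKEN = 4  # rough average quoted above; varies by tokenizer and text
BITS_PER_CHAR = 8    # assumption: one 8-bit byte per character, no framing


def equivalent_bits_per_second(tokens_per_second: float) -> float:
    """Raw text bandwidth of a token stream, in bits per second."""
    return tokens_per_second * CHARS_PER_TOKEN * BITS_PER_CHAR


for tps in (5, 20, 75, 600):
    chars = tps * CHARS_PER_TOKEN
    bits = equivalent_bits_per_second(tps)
    print(f"{tps:4d} tok/s ~ {chars:5d} chars/s ~ {bits:7.0f} bit/s")
```

With these assumptions 75 tok/s comes out at 300 chars/s, or 2400 bit/s.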
tantalor 18 minutes ago
> Now switch between c and t at the same rate. The difference is striking — and intentional.
I don't see a big difference.
dfollent 2 days ago
Neat visual. 5 tok/s is still faster than me!
- himata4113 1 hour ago
  I had the opposite reaction, 5tok/s is so slow that when you include all the reasoning and thinking + warmup it is far slower than me.
  - warmwaffles 41 minutes ago
    The sweet spot for being just fast enough to not irritate you is 10tok/s. Still slow, but faster than you can sustain while typing and thinking. Just interesting to observe.
- zurfer 49 minutes ago
  yeah 3t/s seems human. only that i never wrote code perfectly top to bottom.
Eswo 59 minutes ago
super cool, thanks
dario-dentes 2 days ago
Thank you for this great utility. I love the "gut feel" calibration utilities like this one!
dbalatero 1 hour ago
This is cool, thanks for making it.
victorbjorklund 57 minutes ago
This is great.