I actually can’t wait for the future where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription.
There are many problems I want to work on which require billions of tokens. These are completely inaccessible without corporate project sponsorship at the moment. An ASIC generation machine which can pump out a few tens of thousands of tokens per second at Opus 4.6 quality would be more than sufficient.
A company called Taalas is working on something like that. Not Opus 4.6 quality, but I'm sure they're targeting larger models. Currently they're using a Llama 8B model. It runs at ~17k tokens per second, and you can test it at https://chatjimmy.ai/.
"Design me a 3d printable rocket engine for a hobby rocket project. Verify it's design in a full simulation. Iterate until it works reliably in simulation based on a verified printable design on a consumer laser sintering device (or substitute contract manufacture for under 1000 dollars)."
This is a hobby version of a project, but you can imagine commercial versions of the same prompt for new databases, genomics studies, material analysis, operating systems etc.
I’m not convinced at all that the model won’t just get stuck in a loop where it doesn’t understand how to fix the broken rocket. I see similar failure modes in far simpler projects strictly confined to coding. This feels closer to “make me a profitable business, make no mistakes” than to a simple coding project.
Right now - there are some heavily subsidized subscriptions that are more or less cheating. For instance, GitHub Copilot at $39/month gives you Claude Opus 4.6. They're going to close that off, but right now it's like a freebie for those doing API agentic harnesses.
That said, if you are doing always-on agents and you spend $3k-$4k on a GB10, or $5k+ on Apple Silicon, as your sunk cost, you will probably come out ahead.
I've got 5 agents running a purely experimental social experiment. They operate in an Evennia MUD (a familiar-sounding city called "gothmud"). I've built a channel, idle prompts, a sleep schedule. I feed in real-world news and weather. There's a character up in a clock tower that reads Evennia's audit logs every 20 minutes to surveil the city, and a cast of people wandering around, investigating things, having coffee, repairing robots. This is all hitting qwen3.6-35-A3B on the Asus GB10, which cost me $3k.
Over the last 30 days, I've hit 394M input tokens and 1.6B output tokens. I would have spent between $1600 and $1700 if I was using OpenRouter. Not calculated - I also have ComfyUI running in the spare space, and the agents "take photos" of the rooms they're in, selfies, workshop photos, etc.
How much did I spend on electricity? I don't have a meter on my box. My total electric bill for the last 30 days was $220, so I know it's less than that. My rate to compare is 11.7c/kWh, but it's closer to 15c/kWh total. The Asus GX10 has a 240W power supply, and it's probably only pulling 180W. I estimate $15-$20/month. But worst case, red-lining: 240 watts x 720 hours = ~173 kWh, and at $0.20/kWh I come to about $35.
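The electricity arithmetic above is easy to redo for other boxes and rates. A minimal sketch, using only the figures quoted in the comment (240W worst case, 180W likely draw, 720 hours, $0.15-$0.20 per kWh, $3k hardware, roughly $1600/month of API spend avoided); the function names are made up for illustration:

```python
def monthly_energy_cost(watts: float, hours: float, dollars_per_kwh: float) -> tuple[float, float]:
    """Return (kWh used, cost in dollars) for a constant power draw."""
    kwh = watts * hours / 1000.0
    return kwh, kwh * dollars_per_kwh


def payback_months(hardware_cost: float, monthly_api_cost: float, monthly_power_cost: float) -> float:
    """Months until the hardware pays for itself versus paying per token."""
    return hardware_cost / (monthly_api_cost - monthly_power_cost)


HOURS_PER_30_DAYS = 30 * 24  # 720

# Worst case from the comment: power supply red-lined at a pessimistic rate.
worst_kwh, worst_cost = monthly_energy_cost(240, HOURS_PER_30_DAYS, 0.20)
# Likely case from the comment: ~180W at the ~15c/kWh all-in rate.
likely_kwh, likely_cost = monthly_energy_cost(180, HOURS_PER_30_DAYS, 0.15)

print(f"worst case: {worst_kwh:.1f} kWh, ${worst_cost:.2f}")
print(f"likely:     {likely_kwh:.1f} kWh, ${likely_cost:.2f}")
print(f"payback:    {payback_months(3000, 1600, worst_cost):.1f} months")
```

Even with the pessimistic power figure, the payback period is dominated by the avoided API spend, not the electricity.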
Here's the kicker thought - that GitHub Copilot subscription I mentioned? I have another agent running on that, reading all my other agent logs, managing my Obsidian notes, doing research, sending briefings. And all by itself, it used almost the same amount of Claude Opus tokens for that $39/month subscription. I was actually a bit shocked when I pulled a recent report and saw that. I'm working to migrate functionality away from the Copilot subscription to the local model. A lot of the initial setup might have needed it, but not the ongoing review-style work it does.
A little of A, a little of B. I have a lot of fun building it out, it's surpassed Factorio in addiction, and I've been able to flesh out some patterns that I roll back into more productive agent harness bits.
For A:
The learning is in building agent harnesses that aren't just cron jobs reading a file like HEARTBEAT.md. I have some serious tools for my own use. One main assistant/coordinator agent, one SRE/coder agent (with sub-agents of its own).
I originally just started last year with the AI assistant (Jane from the Enderverse). Along the way I was building scheduled systems, hand-offs to other agents, etc. As I ran into problems, I'd be rewriting and refactoring. So I spent some time making a low-stakes chatbot with history and routines. Instead of a from-scratch Go harness, I built it around pi and extensions. Time-of-day prompt splices (extensions can inject into or modify prompts on the fly), wake-up reminders. Things that you do in the main session vs. spinning up an ephemeral session. Self-improvement daydreaming (modify your own skills and AGENTS.md). A lot of that went back into rebuilding Jane into something more useful for me.
For B:
The "dynamic dollhouse" as you put it was seeing where I could take that living chatbot next. There are a lot of projects pointing agents at Slack, Discord, message boards. I figured why not a MUD with rooms, weather, and props. Lots of interesting challenges. How to keep bots from nesting in their own room, how to keep them from yes-anding each other all day long. How to slow down 3 bots talking at each other so a human can get a word in edgewise.
Different levels. There's plain old NPCs that have dice roll random responses. There's LLM driven NPCs that only remember the last 5-10 messages. And the main ones are bot agents. Full agent harness, moving around the environment. Long lived context windows. One character (a nurse at the hospital) gets into arguments with an NPC receptionist that treats her as another patient. Complains about it to other characters, they remember and the word spreads.
The agents get prompted to write down notes, then head home for sleep (and session compaction). Next time they enter a room with that person after the compact, their notes get loaded automatically. This kind of behavior can feed back into the more productivity-based agents.
A few things. I replied to someone else above, but I feed lessons learned from my social ant farm agents back into more productive agents.
Memory recall:
Lots of systems out there to give agents memory. I've used a bunch and written a couple. Storing memories is easy, but getting an agent to recall them, no matter how much you mention it in your AGENT/CLAUDE.md, is a bit of an uphill battle. I've even watched Claude make useful project memories and never refer to them again.
In my agent ant farm - agents go "to sleep" at night. They get nudged to head home, once there they get prompted to make notes about their day, about other characters. Then we do a compact with custom instructions. After compact/sleep cycle, if they enter a room with one of the characters in their notes, that gets loaded back into context automatically.
That all boils down to hooks in Pi like before_agent_turn. You can intercept a prompt, check it against code/flat files, and smartly inject more information into context. You can have a long running main session with compacts that discard procedural bits and offload the rest to memory.
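The room-entry recall described above can be sketched in a few lines. This is not Pi's extension API (which is not shown in the thread); the hook name `before_agent_turn` comes from the comment, but its signature, the `AgentMemory` class, and the note format are all assumptions made for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Notes keyed by character name, written during the nightly "sleep" step.
    notes: dict[str, str] = field(default_factory=dict)
    # Characters whose notes are already in context since the last compact.
    loaded: set[str] = field(default_factory=set)

    def compact(self) -> None:
        """After compaction the context is fresh, so notes may be injected again."""
        self.loaded.clear()


def before_agent_turn(prompt: str, occupants: list[str], memory: AgentMemory) -> str:
    """Hypothetical hook: prepend notes about anyone present and not yet loaded."""
    recalled = []
    for name in occupants:
        if name in memory.notes and name not in memory.loaded:
            recalled.append(f"[notes on {name}] {memory.notes[name]}")
            memory.loaded.add(name)
    if not recalled:
        return prompt
    return "\n".join(recalled) + "\n" + prompt


memory = AgentMemory(notes={"receptionist": "Keeps treating me as a patient."})
print(before_agent_turn("You enter the hospital lobby.", ["receptionist", "janitor"], memory))
```

The point of doing this in a hook rather than in the system prompt is that the agent never has to decide to look anything up: the harness checks the room occupants against flat files and injects the notes itself.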
Time Awareness:
Agents have no concept of time. You can send them a message at 5am, then at 10pm, and it's been 2 turns for them. For coding, this is fine. But for assistant-level stuff, adding a message like "It's 3PM. It has been 3 hours since the last interaction with the user" goes a long way. Without me saying something like "new topic", it now knows that time has passed and I'm probably onto something new. If I left something hanging, it will remind me about it, or maybe go check on things that should have happened during the day.
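A time-awareness line like the one quoted above can be generated from two timestamps. A minimal sketch; the function names and the exact wording of the header are assumptions, and how the line gets spliced into the prompt depends on the harness:

```python
from datetime import datetime, timedelta


def describe_gap(gap: timedelta) -> str:
    """Render an elapsed time as a rough, human-readable phrase."""
    minutes = int(gap.total_seconds() // 60)
    if minutes < 1:
        return "less than a minute"
    if minutes < 60:
        return f"{minutes} minute{'s' if minutes != 1 else ''}"
    hours = minutes // 60
    return f"{hours} hour{'s' if hours != 1 else ''}"


def time_header(now: datetime, last_interaction: datetime) -> str:
    """Build the line to prepend to the next user message."""
    hour12 = now.hour % 12 or 12
    suffix = "AM" if now.hour < 12 else "PM"
    clock = f"{hour12}:{now.minute:02d} {suffix}"
    gap = describe_gap(now - last_interaction)
    return f"It's {clock}. It has been {gap} since the last interaction with the user."


print(time_header(datetime(2025, 1, 1, 15, 0), datetime(2025, 1, 1, 12, 0)))
```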
Inner Thoughts/Idle nudges:
I can have an extension run every 5 seconds, check a schedule, check the activity level of the main session, and fire off nudges on the main session. These look like the user sent them, but I generally prefix them with [inner thought]. For my social bot, I tested this along the lines of "[inner thought] it's been 3 hours since you last talked with user, why not reach out, let him know what's new, maybe send a selfie or a photo of where you are". For my assistant bot, it's an 8am, 3pm, and 7pm nudge along the lines of "[inner thought] put together an activity report of work things that have changed since the last report". This all runs in the main context: they get the thought, have historical context, can run a skill to check on vault updates, open beads, anything observed from ingesting other agent sessions, and send me a summary. It takes into account my idle factor. If I'm heavily engaged in conversation at 3PM, the report might get delayed 15 minutes or an hour, or skipped altogether.
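The fire/delay/skip decision described above might look roughly like this. It is a sketch under assumptions: the `Nudge` class, the 10-minute "busy" window, and the one-hour skip threshold are invented for illustration, and the polling loop that would call `check_nudge` every few seconds is left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Nudge:
    due: datetime                            # scheduled time, e.g. 3pm
    text: str                                # what the agent should "think"
    max_delay: timedelta = timedelta(hours=1)  # assumed: give up after this long


def check_nudge(nudge: Nudge, now: datetime, last_user_message: datetime,
                busy_window: timedelta = timedelta(minutes=10)) -> str:
    """Return 'wait', 'delay', 'fire' or 'skip' for one polling tick."""
    if now < nudge.due:
        return "wait"    # not scheduled yet
    if now - nudge.due > nudge.max_delay:
        return "skip"    # too stale to be useful
    if now - last_user_message < busy_window:
        return "delay"   # user is mid-conversation, try again next tick
    return "fire"


def format_nudge(nudge: Nudge) -> str:
    """Nudges arrive as user messages, so tag them to mark their origin."""
    return f"[inner thought] {nudge.text}"


report = Nudge(due=datetime(2025, 1, 1, 15, 0),
               text="put together an activity report since the last report")
print(check_nudge(report, datetime(2025, 1, 1, 15, 0), datetime(2025, 1, 1, 14, 58)))
print(format_nudge(report))
```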
For open models, usually not well. You get 5+ providers competing on cost, all with cheaper electricity and better hardware utilization than your local setup.
The TL;DR though is that a 10-15b param model baked into an ASIC with the latest fab tech would take around 62W of power draw when active. At ~10k+ t/s though it likely would only be active for short bursts of time. It'd fit perfectly fine within the thermal envelope of a laptop.
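To put the "short bursts" point in numbers, here is a minimal sketch using the 62W and ~10k t/s figures from the comment. The workload (100k tokens per hour) and the zero idle draw are assumptions chosen only to illustrate the duty-cycle effect:

```python
def joules_per_token(active_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token while the chip is active."""
    return active_watts / tokens_per_second


def burst_seconds(tokens: int, tokens_per_second: float) -> float:
    """How long the chip is active to produce a response of this length."""
    return tokens / tokens_per_second


def average_watts(active_watts: float, tokens_per_hour: float,
                  tokens_per_second: float, idle_watts: float = 0.0) -> float:
    """Average draw over an hour, given the fraction of time spent generating."""
    duty = min(1.0, tokens_per_hour / (tokens_per_second * 3600))
    return idle_watts + (active_watts - idle_watts) * duty


print(f"{joules_per_token(62, 10_000) * 1000:.1f} mJ per token")
print(f"{burst_seconds(2_000, 10_000):.1f} s for a 2,000-token reply")
print(f"{average_watts(62, 100_000, 10_000):.2f} W average at 100k tokens/hour")
```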
The approach makes a lot of sense. Once you get to those speeds, latency of the network becomes one of the bigger bottlenecks, so local has a real advantage over a subscription.
This assumes Claude's price doesn't change, which isn't a great assumption considering inference providers are moving to usage-based billing. Also, the VC money isn't going to last indefinitely. Current inference providers are being subsidized with VC money at this point.
Ok, here's the thing: you will never be able to truly do this, due to logic.
Logically five people pooling their resources beats one guy.
therefore datacenters will always win because they get higher time utilization.
so forget it.
I always wonder the same but I let logic tell me it's a fantasy: on average you can't outspend a whole group of people making better use of the hardware.
you will get better hardware though, cutting edge will always be cloud
Laptops/desktops are cheaper per flop than any datacenter hardware by a good order of magnitude.
The problem is that expectations rise in datacenters, hardware/power/security/availability guarantees cost real money. Then the operator providing these guarantees expects some margin.
You can see this most clearly with "developer desktops", a gcp instance costs about 10x a hetzner instance which costs between 5 and 10x the same hardware sitting in the back of an office somewhere. While all of these premiums matter for 24/7 systems under active development, they don't really matter for ephemeral small scale workloads.
They just mean this part: "where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription."
Upgrading local hardware will remain the more expensive alternative to the subscription regardless of what the relative cost of running the models themselves is. If the local hardware to do so becomes affordable then the subscription will be even more affordable, not expensive.
At least for these kinds of mega tasks. For more micro tasks we will always end up with unutilized local compute we already purchased, which will be "free" since we already paid for it for non-AI reasons (e.g. a gaming GPU while not gaming).
I saw '1-bit' and my mind first went to 1-bit dithered B&W image generation, not 1-bit model weights....
and so now I'm wondering how cool /fast / compressed a diffusion image generator could be if the images it was trained on / space it worked in was limited to 1 bit (Floyd-Steinberg / Atkinson / your favorite algo here) dithered images.
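For reference, this is what the Floyd-Steinberg step mentioned above does to a grayscale image: each pixel is thresholded to 0 or 1 and the rounding error is pushed onto not-yet-visited neighbours with the standard 7/16, 3/16, 5/16, 1/16 weights. A plain-Python sketch (nested lists instead of an image library, values assumed to be in 0..1); it says nothing about how a diffusion model would be trained on such images:

```python
def floyd_steinberg(gray: list[list[float]]) -> list[list[int]]:
    """Dither a grayscale image (values in 0..1) down to 1 bit per pixel."""
    height, width = len(gray), len(gray[0])
    work = [row[:] for row in gray]  # copy so the input is not modified
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 1 if old >= 0.5 else 0
            out[y][x] = new
            err = old - new
            # Diffuse the quantization error to unvisited neighbours.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A horizontal ramp from black to white, rendered as text.
ramp = [[x / 15 for x in range(16)] for _ in range(8)]
for row in floyd_steinberg(ramp):
    print("".join("#" if p else "." for p in row))
```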
Training would surely be pretty quick and probably fit onto one modern GPU.
IME, the bottleneck when using diffusion models isn't storage space or memory, it's generation time. Lots of models will run on 8-12 GB 1080-generation GPUs onwards, or on Macs with similar memory, which are probably the bottom end from a GPU power perspective anyway. I also note that these models are marginally slower than the small FLUX.2 model they're based on.
Okay, maybe this allows running a local model on something that has a reasonably powerful GPU and limited memory, like an iPhone, but is that really a common requirement?
It's useful progress. Decent-fidelity local-scale inference means that you can create a product that generates throwaway images frequently without worrying about cost. Thus far every product I've seen that generates images is metered, which severely limits the value. I don't know if this is actually at the "decent fidelity" point yet.
It solves part of the download issue if they actually deliver a whole 1-bit package (currently their download is around 3.5 GiB, still not ideal, since for FLUX.2 [klein] 4B you can get a package including the text encoder at ~6 GiB).
For speed, no. Draw Things runs on iPhone just fine and generally faster than their implementation on the same model (FLUX.2 [klein] 4B).
We are in an era of extreme demand for GPU and limited supply. Every inference we push to the edge frees cloud resources for other tasks. Every efficiency gain increases what we can achieve with existing resources. If images can be rendered with half as much compute, we need half as many GPUs.
I think the value of it is currently more academic than useful in the real world. Everything at the frontier is still only marginally Good Enough (in image generation, most of it is shit even from the best models), so things far behind the frontier in terms of capability (as a tiny 1-bit model necessarily must be) are unusable.
But, getting remarkably higher density of capability per unit of compute is a big thing. It means the frontier can get better and cheaper to operate and less resource hungry, and it means what can be accomplished at the edge, on personal laptops or phones, becomes a broader spectrum of tasks.
And, for privacy, there are a lot of things that should run on-device and not everyone has big dedicated GPUs.
It’s like asking how Memoji generation on iPhone solved a real problem.
It does not need to directly solve any particular problem to be overall good for consumers, by putting pressure on all those subscription-based solutions… at least it’s private and does not require you to provide all your data…
Genuine question: doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?
Their 1-bit quantized Diffusion Transformer is just under 1 GB. You also need the text-encoder (4-bit quantized) and VAE (unquantized) for inference and their combined weight is ~3.42 GB.
Yeah, it's pretty incredible. And I guess that's mostly what's behind the question: whether this is more of an impressive research/technique demonstrator, or a real product advancement solving a need.
> doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?
I can make this into a 5-line Python program. I’m not saying the images will match the description, but that isn’t part of your spec ;)
Not quite as I understand it. The ternary approach bonsai uses leverages a FP16 scaling factor that each value in the ternary maps to. You're still using 16 bit multiplication, it's just that the weights are far more compressed.
fair, i think i was referring more to 1.58 bit architecture in general since the original paper (Figure 3) shows that we eliminate FP16 multiplication and addition just for INT8 addition. I need to dive deeper into bonsai overall if it differs
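The two views in this exchange can be put side by side in code. The sketch below follows the absmean scheme from the 1.58-bit paper (scale = mean |w|, weights rounded and clipped to {-1, 0, 1}); whether Bonsai uses this exact scheme, per-tensor or per-group scales, is not stated in the thread, so treat it as an assumption. The inner loop needs only additions and subtractions, and the floating-point scale is applied once per output:

```python
def quantize_ternary(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Absmean quantization: returns ternary codes and one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def ternary_dot(codes: list[int], scale: float, x: list[float]) -> float:
    """Dot product with ternary weights: add/subtract, then one multiply."""
    acc = 0.0
    for t, xi in zip(codes, x):
        if t == 1:
            acc += xi
        elif t == -1:
            acc -= xi
        # t == 0 contributes nothing
    return acc * scale


weights = [0.9, -1.1, 0.05, 1.0]
codes, scale = quantize_ternary(weights)
print(codes, scale)
print(ternary_dot(codes, scale, [1.0, 2.0, 3.0, 4.0]))
```

In this toy version the activations stay in floating point; the paper additionally quantizes activations to 8 bits so the accumulation can be done in integers.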
Yes, it's a huge deal because these are starting to get bound by memory bandwidth, not compute. Therefore one-bit weights stream way faster, leading to substantially better results. At least that's what I'd guess!
> To our knowledge, Bonsai Image 4B is the first image model in its parameter class to run directly on an iPhone.
This is wrong. But they worded it carefully to be not entirely wrong.
FLUX.2 [klein] 4B (the same parameter class, basically the same model) runs on iPhone through Draw Things app, with 8-bit or 6-bit quantization (hence not "directly", I guess, but that is the technicality that sounds fishy enough).
Stuff like this is great - more promises of things that can run on phones please!
Sadly right now the expensive developer subscription means the few folks willing to hold a forever subscription make something that barely works then move on… or make something with so many ads it is an app. For example Google’s “Model Garden” app has no ads but still has major UX issues and isn’t suitable for daily use, even though the models are amazing.
Raising awareness of how capable today’s phone hardware is will make normal people demand to run what they choose on their phones. It’d be a much stronger way back to general purpose computing than via all the legislation that has been tried so far.
Couldn't try it because the demo app is iOS only and the web version just crashes my browser. The small model is impressive but if you front load a 1.8GB text encoder model, the savings aren't quite as useful.
The white paper says "mean-active memory pressure down to 1.95 GB for 1-bit Bonsai Image 4B and 2.38 GB for Ternary Bonsai Image 4B". Storage is on the linked page, and is about half that.
That is very low, looks like it should run on a base Mac mini M4 with 16GB RAM. I understand it is not released yet? What sort of harness is necessary for this type of model? (I have only used coding agents through GH Copilot in VS Code, the JetBrains AI tool and Pi, this last one was sort of a pain to set up…)
I run a moderately popular image comparison benchmark site called GenAI Image Showdown [1]. You can click “View All Models” and filter the list down to just locally runnable options (Flux, Qwen, Hunyuan, etc.).
Except for two (GPT-Image-2 and Nano Banana Pro), anything displayed here can run on a 16 GiB MacBook (including FLUX.2 [dev]): https://tests.drawthings.ai/generate
I believe it's the way the HN algorithm works. In order to give new and obscure posts a shot, it will add them to people's feeds on their front page and see how they measure. Otherwise new posts wouldn't get seen and the flywheel would never get started.
So everyone acts as a sort of beta tester for obscure posts.
On weekends, yes. During the week, that’s also true if they arrive within a short time frame, e.g., three minutes. Almost no one looks at “New”. That is the real issue.
It’s about how quickly they get those points. It doesn’t have to be bots. Sending a post to friends with reputable human profiles and asking for a vote kinda works on most social networks. Some social networks claim they have protection against this but I wouldn’t bet they catch everything.
Just a side note, that this website is classified by Apple as an Adult website. I have Limit Adult Websites set in Content & Privacy Restrictions switched on.
Led me to wonder what happens if a domain gets a new owner, and they want to petition Apple to remove the block.
What trade-off would one need to clear to justify the hardware and the work to get this running locally as part of a broader system? It’s a lot of work setting up and maintaining a production harness/system on a local device. I don’t personally repeatedly generate images at a scale where using a lab’s app somehow burns all my tokens. I like the idea of local AI but I don’t see widespread adoption of it happening in commercial or customer situations anytime soon, no matter how small/good enough the models get. Even Uber- token burn whiplash but I doubt their answer will be “run some of it local”. IT nightmare, I’d imagine.
This is why I don't think the big AI companies and nvidia will dominate the market. AIs will just run locally, on whatever hardware you have. Perhaps that's why they worked on this yet-to-be-defined partnership with ARM.
Can't speak for browser demos, but I just got the ternary model working on my M5 generating images. The 1 bit didn't work, as it has a known bug with XCode 24.5 and I wasn't in the mood for installing 24.4 alongside.
The online demos require WebGPU, so Firefox on mobile and privacy-enhanced browsers will break. WebGPU support on Linux and other open source systems is also trash; you can force it to work in Chrome but it won't be happy.
I would very easily find ways to hit that level of token usage if it was cheaper/faster.
What is the experiment? What are you hoping to learn from all this?
Or do you just mean you've made a dynamic dollhouse that you think is cool? The Sims on your own terms?
Which explains why you're using a dumb terminal to access compute services?
At 10x you have to be at hours per day and 5x you’re at 4h.
HBM has way higher bandwidth and it's not all about flops.
Also the FP4 flops (inference) are so mind-bogglingly high on these things.
Lastly, what you fail to consider is the chip-to-chip bandwidth, which is critical.
The people running these know that networking is just as critical.
All-reduce etc.
They wouldn't pay if they could get something better value.
Don't think anyone was refuting that?
And of course when you pool resources you have access to more resources.
TBF, even at that size it's no less mind blowing.
https://arxiv.org/pdf/2402.17764
Isn't SD XL 3.5B? And the refiner model is even larger. Those can run on an iPhone 13 Pro.
I do wonder how these compare to existing image generation models. I've tried https://github.com/alichherawalla/off-grid-mobile-ai for a while but I find the image generation models rather lacking.
https://genai-showdown.specr.net
Here's a generation in your honor: https://peterc.org/img/johndoe.png
Is it compatible with Ollama, ComfyUI or are those providers unneeded, compatible with low-end hardware?
Also, where does "./setup.sh" drop the components on Linux?
Thank you, Sol
having trouble loading the webgl browser demo on my phone but no biggy