(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)
Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.
To build this demo, we had to do the following (among other things! but these were the hardest):
1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation. The distilled model achieves 25 fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4K resolution. (A rough sketch of the distillation step follows the list.)
2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a clip by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models suffer quality degradation after multiple extensions due to the accumulation of generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. It significantly reduces artifact accumulation and allows us to generate indefinitely long videos. (A sketch of the chunked autoregressive loop follows the list.)
3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar zoom call requires several building blocks, including voice transcription, LLM inference, and text-to-speech generation in addition to video generation. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to help build a parallel processing pipeline that orchestrates everything via continuously streaming chunks. Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response; our target is <2 seconds. (A conceptual sketch of the pipeline follows the list.)
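To make these concrete, here are three rough sketches, one per item above. First, the teacher-student distillation step from item 1; the teacher/student modules and the denoise() signature are hypothetical stand-ins, not our production training code:

    # Rough sketch of one distillation step (hypothetical modules, not our real code).
    import torch
    import torch.nn.functional as F

    def distillation_step(teacher, student, optimizer, noisy_latents, audio_emb):
        # The teacher runs its full, slow sampling schedule.
        with torch.no_grad():
            target = teacher.denoise(noisy_latents, audio_emb, num_steps=50)
        # The student is trained to reach the same result in far fewer steps.
        pred = student.denoise(noisy_latents, audio_emb, num_steps=4)
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()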
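Second, the chunked autoregressive loop from item 2; generate_chunk and its arguments are made up for illustration:

    # Rough sketch of autoregressive chunk extension (hypothetical generate_chunk API).
    def stream_avatar_video(model, audio_chunks, chunk_frames=16, context_frames=4):
        context = None  # the first chunk has no conditioning frames
        for audio in audio_chunks:
            frames = model.generate_chunk(
                audio=audio,
                context=context,        # last frames of the previous chunk
                num_frames=chunk_frames,
            )
            yield frames
            # Naively carrying the tail forward is where errors accumulate;
            # the consistency-preservation step keeps this stable over long calls.
            context = frames[-context_frames:]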
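Third, a conceptual view of the streaming pipeline from item 3. The stage objects here are stand-ins rather than the real Deepgram/Pipecat/Daily integrations, and in production the stages run concurrently and hand chunks to each other through queues rather than nested loops:

    # Conceptual sketch only; every stage object here is a hypothetical stand-in.
    import asyncio

    async def run_call(mic, stt, llm, tts, video, transport):
        async for user_text in stt.stream(mic):                 # speech-to-text
            async for sentence in llm.stream_reply(user_text):  # LLM reply, sentence by sentence
                async for speech in tts.stream(sentence):       # text-to-speech audio chunks
                    async for frames in video.stream(speech):   # lip-synced video chunks
                        await transport.send(speech, frames)    # push to the call

    # asyncio.run(run_call(...)) once the stage objects are wired up.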
More technical details here: https://lemonslice.com/live/technical-report.
Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what they see to create a more natural and engaging conversation.
We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!
We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.
Overall, a fun experience. I think that MH was better than Scott. Max was missing the glitches and moving background but I'd imagine both of those are technically challenging to achieve.
Michael Scott's mouth seemed a bit wrong - I was thinking Michael J Fox but my wife then corrected that with Jason Bateman - which is much more like it. He knew Office references alright, but wasn't quite Steve Carell enough.
The default state while it was listening could do with some work, I think - that was the least convincing bit; Max would probably have just glitched or even stayed completely still. Michael Scott seemed too synthetic at this point.
Don't get me wrong, this was pretty clever and I enjoyed it, just trying to say what I found lacking without trying to sound like I could do better (which I couldn't!).
I am sure there will be open-source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.
How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
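To make the budget concrete, here is a toy illustration with made-up numbers; generate_chunk below just simulates the real model call:

    # Toy illustration of the faster-than-real-time requirement.
    import time

    FPS = 25
    CHUNK_FRAMES = 16
    PLAYBACK_BUDGET = CHUNK_FRAMES / FPS   # 0.64 s of video per chunk

    def generate_chunk():
        time.sleep(0.5)                    # pretend the model spends 0.5 s per chunk
        return ["frame"] * CHUNK_FRAMES

    start = time.time()
    frames = generate_chunk()
    elapsed = time.time() - start
    # As long as elapsed stays under the budget, the playback buffer never drains.
    print(f"{len(frames)} frames in {elapsed:.2f}s (budget {PLAYBACK_BUDGET:.2f}s)")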
I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." That means it would take them 37.5s to generate 1 second of video, which is fast for video but way slower than real time.
I mostly meant that it was the insight of using the previous frames to generate new frames that reminded me of it, but I lack knowledge of the specifics of the work.
Glad if checking out the paper is useful for your work/research.
edit: the real-time-ness also depends on what hardware you're running the model on - obviously it's easier to achieve on an H100 than on a 3090. But these memory optimizations really help make these models usable at all for local setups, which I think is a great win for overall adoption and for further things being built on top of them - a bit like how sd-webui from automatic1111, alongside the open-sourced Stable Diffusion weights, set off a boom in image gen a couple of years back.
And yes, exactly. In between each character interaction we need to do speech-to-text, LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
IP law tends to be "richer party wins". There are going to be a bunch of huge fights over this, as both individual artists and content megacorps are furious about this copyright infringement, but OpenAI and friends will get the "we're a hundred-billion-dollar company, we can buy our own legislation" treatment.
e.g. https://www.theguardian.com/technology/2024/dec/17/uk-propos... a straightforward nationalisation of all UK IP so that it can be instantly given away for free to US megacorps.
And that's probably still a subsidized cost!
Btw, "what is the best weed whacker" is John C. Dvorak's "AI test"
How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.
Impressive technology - impressive demo! Sadly, the conversation seems a little bit overplayed. Might be worth plugging ChatGPT or some better LLM into the logic section.
Also, you do realize that this will be used to defraud people of money and/or property, right?
All about coulda, not shoulda.
Sadly, no one cares. LLM-driven fraud is already happening, and it is about to become more profitable.
Super cool product in any case.
It seems clumsy to use copyrighted characters in your demos.
Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.
Obviously games could use this.
I'll pass thanks.
Each person gets a dedicated GPU, so we were worried about costs before. But let's just go for it.
Really looking forward to trying this out!