Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.

145 points | by maziyar 3 days ago

14 comments

seamossfet 22 hours ago
The problem with models like this is that they're built on very little actual training data we can trace back to verifiable protein data. The Protein Data Bank, and other sources of training data for stuff like this, contains a lot of broken structures and "creative liberties" taken to infer a structure from instrument data. It's a very complex process that leaves a lot open to interpretation.
On top of that, we don't have a clear understanding of how certain positions (conformations) of a structure affect underlying biological mechanisms.
Yes, these models can predict surprisingly accurate structures and sequences. Do we know if these outputs are biologically useful? Not quite.
This technology is amazing, don't get me wrong, but the average person might see this and wonder why we can't go full futurism and solve every pathology with models like these.
We've come a long way, but there's still a very very long way to go.
- stardust2 19 hours ago
  How do we get more verifiable protein data? So even if we had better data, we don't yet understand how the structure impacts the biology?
nradclif 13 hours ago
"Complete results, architectural decisions, and runnable code below."
This is a weird post; there doesn't seem to be any "below" here. Another comment linked the article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...
- justinclift 15 minutes ago
  Yeah. Things like "Complete results, architectural decisions, and runnable code below." are literally how AI outputs stuff, so I'd expect the post was AI written too. :(
maziyar 3 days ago
full article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...
- pfisherman 21 hours ago
  Nice work! Here is an article you may find helpful if you have not already come across it [0]. You may also want to consider benchmarking against some non-ML methods [1].
  0. https://pubmed.ncbi.nlm.nih.gov/35318324/
  1. https://www.nature.com/articles/s41586-023-06127-z
- xyz100 1 day ago
  What makes this dataset or problem worth solving compared to other health datasets? Would the results on this task be broadly useful to health?
  - CyberDildonics 1 day ago
    What other "datasets" are you talking about? How do you "solve a dataset" ?
    - xyz100 11 hours ago
      You solve a dataset when you learn what there is to learn about the phenomenon of interest. The limit of such phenomena is “cure all disease”, and clearly this is not solving that.
      - CyberDildonics 3 hours ago
        What are you talking about? "the phenomenon of interest"? There is nothing you wrote in either comment that makes sense.
        What is a "dataset" that has been "solved" and what did the program do that 'solved' it?
rubicon33 1 day ago
Can someone explain what one might use this model for? As a developer with a casual interest in biology, it would be fun to play with, but honestly I'm not sure what I would do with it.
- colechristensen 1 day ago
  You can get your feet wet with genetic engineering for surprisingly little money.
  This guy shows a lot of how it's done: https://www.youtube.com/@thethoughtemporium
  Basically you can design/edit/inject custom genes into things and see real results spending on the scale of $100-$1000.
  - com2kid 16 hours ago
    We actually did this in my high school genetics class back in 1999! We made bacteria change color by splicing in a gene. Awesome stuff.
    The (public!) school had a grant from one of Seattle's biotech boom companies.
  - someuser54541 1 day ago
    Is there something like this in text/readable format?
  - _zoltan_ 22 hours ago
    My main concern is using fungi. If it ends up in my lungs I'm most likely screwed, right?
    - nurettin 21 hours ago
      Yes, but most students produce their best work while infected.
    - colechristensen 20 hours ago
      This is the classic meme https://www.reddit.com/r/labrats/comments/mmv2ig/lab_strains...
      Lab strains of things tend to be extremely sensitive and not human adapted. You shouldn't study and modify human-infecting organisms in your basement anyway. While you shouldn't ignore protective equipment and proper procedure... paranoia about infecting yourself with a lab leak isn't warranted.
jazzpush2 10 hours ago
A codon-based model is cool. I know NVIDIA is building quite a large one.
At GTC they showed an SAE they built on a smaller version of it, allowing you to see what their model learned: https://research.nvidia.com/labs/dbr/blog/sae/
dhruv3006 13 hours ago
Interesting work - Looks like AI for science is having its day right now.
khalic 1 day ago
> In Progress: CodonJEPA
JEPA is going to break the whole industry :D
- digdugdirk 1 day ago
  Can you explain this? I haven't heard of JEPA, and from a quick search it seems to be vision/robotics based?
  - khalic 1 day ago
    It’s a self-supervised learning architecture, and it’s pretty much universal. The loss function runs on embeddings, and there are some other smart architectural choices all over. Worth diving into for a few hours; Yann LeCun gives some interesting talks about it.
  - lukeinator42 1 day ago
    https://openreview.net/pdf?id=BZ5a1r-kVsf
colingauvin 21 hours ago
HN's blindspots never cease to amaze me.
I am a structural biologist working in pharmaceutical design and this type of thing could be wildly useful (if it works).
- justinclift 8 minutes ago
  Blind spot?
yieldcrv 1 day ago
Distributing the load on this will probably be infinitely more useful than “folding at home”
simianwords 1 day ago
What makes these domain-specific models work when we don’t have good domain models for health care, chemistry, economics, and so on?
- colechristensen 1 day ago
  >we don’t have good domain models for health care, chemistry, economics and so on
  Who says we don't?
  - simianwords 1 day ago
    Examples please?
    - colechristensen 1 day ago
      No, it's really simple to search for domain-specific models being used "in production" all over the place.
      - simianwords 23 hours ago
        I didn’t find a single one that outperforms a general model.
        - colechristensen 23 hours ago
          Ok, alphafold.
          - simianwords 23 hours ago
            It’s not a large language model
HocusLocus 1 day ago
gray goo of the future
skyskys 23 hours ago
hmmmm seems like some fake hype.