Meta released No Language Left Behind (NLLB) [1], I think in 2022. I wonder why this is not "NLLB 2.0". These companies love introducing new names to confuse things.
I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are missing a lot of quality data sources that exist if you know where to look, and (2) the sources they have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a specific language, whereas many language learning sources have language pairs, etc.).
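To make point (2) concrete, here is a rough sketch of labelling a page line by line instead of giving the whole page one language. It is not how finepdfs works. The function names are hypothetical, and it only guesses the writing script from Unicode character names (standard library only), which is a crude stand-in for real language identification.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the dominant script of a string from its letters.

    Assumption: the first word of a character's Unicode name
    (e.g. "KHMER" in "KHMER LETTER SA") is a good enough script label.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def scripts_per_line(page: str) -> list[tuple[str, str]]:
    """Label every non-empty line of a page with its dominant script."""
    return [(dominant_script(line), line) for line in page.splitlines() if line.strip()]


def is_mixed(page: str) -> bool:
    """True if the page contains lines in more than one script."""
    scripts = {script for script, _ in scripts_per_line(page)}
    scripts.discard("UNKNOWN")
    return len(scripts) > 1


if __name__ == "__main__":
    # A made-up phrasebook page: a Khmer line followed by its English gloss.
    page = "\u179f\u17bd\u179f\u17d2\u178f\u17b8\nhello\n"
    for script, line in scripts_per_line(page):
        print(script, line)
    print("mixed page:", is_mixed(page))
```

A page flagged as mixed is a candidate for being split into language pairs rather than filed under a single language. Script detection cannot tell apart languages that share a script (English vs. French, say), so a real pipeline would need a proper language-ID model at this step.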
I find that Meta’s translations are very poor compared to others, at least for relatively obscure languages, which I figured was relevant considering the article.
Google Translate is a good default, but LLMs are really good at translations, as they’re better at understanding context and providing culturally appropriate translations.
So, LLMs are noticeably better in Khmer than Google Translate? I wonder why Google Translate doesn't use Gemini under the hood. Perhaps it's more prone to hallucinations.
I'm interested in finding some thorough testing of translations on different LLMs vs. translation APIs.
There's a dropdown on Google Translate that lets you choose "Advanced" mode or "Classic" mode. Advanced mode uses Gemini but it's only available for select languages.
Hello from Siem Reap, Cambodia! Awesome to see a fellow tech enthusiast from Cambodia.
I actually found Facebook’s translations pretty good (better than Google Translate for things longer than a sentence). From my understanding of Khmer, it is a bit more verbose and context-dependent, hence LLMs would be a big help in understanding those nuances.
In the inverse case (LLMs generating Khmer from English), I heard from locals that it sounds formal and “robotic”, which I found quite interesting.
That's a high count, but still a bit away from "Omni". The usual count is between 4k and 8k, depending on the source. But the first 1k might be the hardest, certainly.
Off topic: since the AI craze, MS's documentation translation has ridiculous errors, like translating try catch keywords to "versuchen" and "fangen" on German pages.
I’m very wary of celebrating Meta’s language work when the company was credibly found to have contributed to the genocide against the Rohingya in Myanmar, and separately, to human rights abuses against Tigrayans during the conflict in northern Ethiopia. Be careful whose sins you’re laundering.
[1] https://ai.meta.com/research/no-language-left-behind/
(I live in Cambodia where they speak Khmer)
(Sorry I had to)
Is it open weight? If so, why isn't there just a straight link to the models?
And the errors are really basic, like translating "shortly" to "short", not the same thing at all!
https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli... https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...