Unsupervised Machine Translation in the Age of LLMs

9 minute read

Published:

Behind the Paper

For years, machine translation improved by scaling data. But scaling parallel data is not an option for most languages. In the LLM era, that bottleneck did not disappear. Few-shot prompting works best when good translation examples already exist, and for many language pairs they do not. In our recent paper, we asked a harder question: can we mine those examples automatically and still translate well?[4]

Paper summary: this post explains why unsupervised machine translation still matters, how in-context learning changes the problem, and how self-mined examples can improve multilingual LLM translation for low-resource languages without large parallel corpora.

This article is especially relevant to unsupervised machine translation, low-resource translation, in-context learning for machine translation, and multilingual LLM adaptation.

What is unsupervised MT?

Translation without human-aligned sentence pairs, bootstrapped from monolingual text and weak cross-lingual signals.[2][4]

What changed with LLMs?

LLMs made few-shot translation practical, but example quality and example selection still matter a lot.[1][3]

What our paper adds

We self-mine word pairs, turn them into weak sentence examples, and rank the best demonstrations with similarity filtering plus BM25.[4]

Why it matters

If translation examples can be mined from small amounts of unlabeled text, more languages can benefit from modern translation systems.[2][4]

What is unsupervised machine translation?

Direct answer: Unsupervised machine translation (UMT) is translation between languages without human-labeled parallel sentences. Classical UMT starts from monolingual corpora, a weak cross-lingual initialization, and iterative back-translation that gradually improves translation quality.[2]

This line of work matters because conventional machine translation depends heavily on large parallel corpora, while many language pairs have little or no such data. One of the key insights in early UMT was that monolingual text is much easier to obtain than aligned bitext, so the right question is not only “How do we get more labels?” but also “How far can we go without them?”[2]

A useful mental model comes from the 2018 UMT literature: first align the problem just enough to get off the ground, then rely on language modeling, denoising, and back-translation to iteratively refine the system. That recipe turned an ill-posed task into something trainable even before LLM prompting entered the picture.[2]

Toy illustration of the principles of unsupervised machine translation: initialization, language modeling, and back-translation.
A compact visual summary of the classical UMT recipe: weak initialization, denoising or language modeling, and iterative back-translation. Source: Lample et al. (2018).

UMT in the era of LLMs and in-context learning

Large language models changed the interface of translation. Brown et al. framed few-shot learning as giving the model a small number of task demonstrations directly in the prompt, with no gradient updates at inference time. In that setup, a few-shot translation prompt can be as simple as repeated source sentence and target translation pairs followed by the new source sentence to translate.[1]

That is powerful, but it does not solve the hardest part for low-resource translation: where do those demonstrations come from? Brown et al. explicitly note that few-shot learning still requires a small amount of task-specific data. For machine translation, later work showed that the number and quality of prompt examples matter, that performance varies with prompt design, and that directly using monolingual examples can hurt translation while pseudo-parallel examples help.[1][3]

The bottleneck moved, but it did not disappear: LLMs can translate with prompts, yet low-resource settings still suffer from a missing-example problem. The challenge is not only prompting the model, but also constructing the prompt when no parallel data exist.[3][4]

Figure from prior prompting research showing that direct monolingual examples hurt translation while pseudo-parallel examples improve it.
Prior prompting work for machine translation found that monolingual-only demonstrations are usually harmful, while pseudo-parallel examples created by back-translation or forward translation improve prompting quality. Source: Zhang et al. (2023).

Our paper: self-mining in-context examples for unsupervised MT

One-sentence summary: we treat the missing-demonstration problem as an unsupervised mining problem: first mine reliable word translations, then use them to create and filter sentence-level examples that an LLM can use for translation in context.[4]

In our Findings of NAACL 2025 paper, we assume access to a multilingual LLM, vocabularies in the source and target languages, and a small amount of unlabeled text in each language. Importantly, the learning phase uses no human-labeled parallel data, and the paper emphasizes a data-scarce regime with fewer than 1,000 unlabeled sentences per language in the studied setup.[4]

1. Mine word pairs

Use zero-shot prompting to translate frequent source words, reverse the direction, keep consistent back-translations, and rank the remaining pairs by cross-lingual similarity to retain high-quality lexical anchors.[4]

2. Bootstrap with word-level prompts

Feed the best mined word pairs back into the model as in-context examples, refining the word inventory before moving to sentence-level translation.[4]

3. Create weak sentence translations

Translate sentences word by word to obtain rough but semantically useful sentence pairs. They are noisy, but they preserve enough meaning to seed the next stage.[4]

4. Select the right demonstrations

Back-translate to obtain more natural pairs, then choose input-specific examples with a two-step filter: similarity threshold first, BM25 ranking second. The final method is TopK+BM25.[4]

Key idea: instead of assuming demonstrations already exist, the system manufactures them from unlabeled text and then ranks them for each test input. Example selection becomes part of the learning pipeline, not an afterthought.[4]

Results

We evaluated the approach with Llama-3 8B and Bloom 7B on 288 translation directions from FLORES-200. The headline result is that the unsupervised method can be comparable to, and sometimes better than, translation with regular in-context examples drawn from human-annotated data, while also outperforming prior UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]

288 directions

Evaluation scale across FLORES-200 translation directions with two multilingual LLMs.[4]

+7 BLEU

Average improvement over prior state-of-the-art unsupervised MT methods reported in the paper's abstract.[4]

55.76 spBLEU

Average score for TopK+BM25 on the English-involving subset, competitive with regular human-annotated in-context learning.[4]

40.13 BLEU

WMT benchmark average in the paper's Table 2, ahead of the best listed baseline at 33.68.[4]

Heatmap showing translation performance by source and target language resource levels.
Translation performance is highest when both source and target languages are high-resource and lower when either side becomes more data-scarce. Source: El Mekki and Abdul-Mageed (2025).
Line charts showing that using more mined in-context examples generally improves translation until gains taper.
More mined in-context examples generally help, especially when moving from 1 to about 8 examples, after which gains become smaller. Source: El Mekki and Abdul-Mageed (2025).

Two findings stood out to me. First, resource level still matters: even with strong multilingual LLMs, translation is easier when the target side is better represented. Second, better prompting is not just about the model. It is also about retrieval and filtering. In our experiments, a carefully selected unsupervised demonstration set was the difference between a rough translation and a competitive one.[4]

Why this matters for low-resource translation

The broader significance is straightforward. If translation quality depends on large curated bitexts, then many languages remain blocked by a data collection problem before they can benefit from new models. But if an LLM can bootstrap usable demonstrations from a small amount of unlabeled text, the entry cost drops dramatically.[2][4]

That does not mean the problem is solved. Low-resource translation remains harder, and our heatmap makes that visible. But it does mean the path forward looks different. Instead of waiting for perfect parallel corpora, we can start from weak lexical evidence, noisy sentence pairs, and strong multilingual priors, then iteratively mine something useful.[4]

My own takeaway is that unsupervised MT is newly relevant in the LLM era. Not because LLMs made supervision obsolete, but because they made bootstrapping supervision more plausible. For underrepresented languages, that distinction matters. It is the difference between “we cannot build this yet” and “we can begin with what we have.”[2][4]

Bottom line: if we can mine trustworthy in-context examples from unlabeled data, translation systems no longer have to wait for abundant parallel corpora before they become useful. That is a practical route toward broader language coverage in search, assistants, education, and public-facing digital tools.

FAQ

What is unsupervised machine translation?

It is translation without human-aligned sentence pairs. Instead of supervised bitext, the system has to bootstrap from monolingual data, weak lexical alignments, denoising, and back-translation.[2][4]

Is unsupervised MT the same as zero-shot translation?

No. Zero-shot translation is an inference setting where the model receives an instruction but no examples. In our work, the main problem is how to create reusable in-context examples from unlabeled data so the model can translate more reliably than plain zero-shot prompting.[1][4]

Why not just use monolingual examples as demonstrations?

Prior work on prompting for machine translation found that monolingual-only demonstrations generally hurt translation, whereas pseudo-parallel examples created through zero-shot back-translation or forward translation are much more effective.[3]

How much unlabeled data does the approach assume?

The paper studies a setting with fewer than 1,000 unlabeled sentences in each language, together with source and target vocabularies, a multilingual LLM, and an unsupervised sentence similarity function.[4]

What is the main empirical takeaway?

The paper reports that self-mined in-context examples can match or beat regular human-annotated in-context learning in many settings, while improving on previous UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]

References

  1. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. PDF
  2. Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. A. (2018). Phrase-Based and Neural Unsupervised Machine Translation. EMNLP. PDF
  3. Zhang, B., Haddow, B., and Birch, A. (2023). Prompting Large Language Model for Machine Translation: A Case Study. ICML / PMLR. PDF
  4. El Mekki, A., and Abdul-Mageed, M. (2025). Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs. Findings of NAACL 2025. PDF

Disclaimer (April 04, 2026): The latest version of this blog post was post-edited and formatted using an LLM.