Research blog • translation • low-resource languages

Unsupervised Machine Translation in the Age of LLMs

For years, machine translation improved by scaling data. But scaling parallel data is not an option for most languages. In the LLM era, that old problem has not disappeared: few-shot prompting works best when good examples already exist, and for many languages they simply do not. In our recent paper, we asked a harder question: can we mine those examples automatically and use them to translate anyway?[1][2][4]

By Abdellah El Mekki · Based on Findings of NAACL 2025 · Topic: Unsupervised MT + in-context learning
On this page

What is unsupervised MT?

Translation without human-aligned sentence pairs. The system has to bootstrap from monolingual text and weak cross-lingual signals rather than conventional bitext.[2][4]

What changed with LLMs?

LLMs made in-context learning practical, but they did not remove the need for good demonstrations. Example quality and selection still matter a lot.[1][3]

What our paper adds

We self-mine word pairs, turn them into weak sentence translations, then select the most useful sentence-level demonstrations with a similarity threshold plus BM25.[4]

Why it matters

If good examples can be mined from small amounts of unlabeled text, translation becomes more reachable for languages that lack large parallel corpora.[2][4]

What is unsupervised machine translation?

Direct answer: Unsupervised machine translation (UMT) is the task of translating between languages without human-labeled parallel sentences. Classical UMT relies on monolingual corpora, a way to initialize weak cross-lingual correspondences, and iterative back-translation to gradually improve translation quality.[2]

This line of work became important because conventional MT depends heavily on large parallel corpora, while many language pairs have very little or no such data. One of the key insights in early UMT was that monolingual text is often easier to obtain than aligned bitext, so the right question is not “How do we get more labels?” but “How far can we go without them?”[2]

A useful mental model comes from the 2018 UMT literature: first align the problem just enough to get off the ground, then use language modeling and back-translation to iteratively refine the system. That recipe turned an ill-posed problem into something trainable, even before LLM prompting entered the picture.[2]
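That control flow can be sketched in a few lines. This is a toy outline of the classical recipe only, not the paper's method; `translate` and `train` are hypothetical stand-ins for a real model's inference and update steps:

```python
def umt_training_loop(src_mono, tgt_mono, translate, train, rounds=3):
    """Classical UMT recipe in outline: after a weak initialization,
    repeatedly turn monolingual text into synthetic parallel pairs via
    back-translation and retrain each direction on them."""
    for _ in range(rounds):
        # Back-translate target monolingual text -> synthetic (src, tgt) pairs,
        # then train the src->tgt direction on them.
        train("src->tgt", [(translate(t, "tgt->src"), t) for t in tgt_mono])
        # Symmetric step for the other direction.
        train("tgt->src", [(translate(s, "src->tgt"), s) for s in src_mono])

# Toy run with stand-in callables, just to show the alternation:
calls = []
umt_training_loop(
    ["hello"], ["bonjour"],
    translate=lambda text, direction: text[::-1],         # toy "model"
    train=lambda direction, pairs: calls.append((direction, pairs)),
    rounds=2,
)
```

The key property is the alternation: each direction is trained only on synthetic pairs produced by the other, so both improve together across rounds.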

A compact visual summary of the classical UMT recipe: weak initialization, denoising or language modeling, and iterative back-translation. Source: Lample et al. (2018), Phrase-Based & Neural Unsupervised Machine Translation.

UMT in the era of LLMs and in-context learning

Large language models changed the interface of translation. Brown et al. framed few-shot learning as giving the model a small number of task demonstrations directly in the prompt, with no gradient updates at inference time. In their formulation, a few-shot example can be as simple as a source sentence followed by its translation, repeated K times before the test sentence.[1]
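To make that format concrete, here is a minimal sketch of assembling such a prompt. The template, language names, and delimiters are illustrative assumptions, not the exact formatting Brown et al. used:

```python
def build_kshot_prompt(examples, test_source, src_lang="French", tgt_lang="English"):
    """Lay out K demonstrations, each a source sentence followed by its
    translation, then the test sentence left for the model to complete."""
    blocks = [f"Translate {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        blocks.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    # The final block has no translation: the model fills it in.
    blocks.append(f"{src_lang}: {test_source}\n{tgt_lang}:")
    return "\n\n".join(blocks)

prompt = build_kshot_prompt(
    [("le chat dort", "the cat sleeps"), ("il pleut", "it is raining")],
    "le chien court",
)
```

No gradients are involved: the demonstrations exist only inside the prompt string, which is exactly why their availability and quality become the bottleneck.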

That is powerful, but it does not solve the hardest part for low-resource translation: where do those demonstrations come from? Brown et al. explicitly note that few-shot learning still requires a small amount of task-specific data. For machine translation, later work showed that the number and quality of prompt examples matter, that performance varies with prompt design, and that directly using monolingual examples can hurt translation while pseudo-parallel examples help.[1][3]

The bottleneck moved, but it did not disappear: LLMs can translate with prompts, yet low-resource settings still suffer from a missing-example problem. In other words, the challenge is not only prompting the model, but also constructing the prompt when no parallel data exist.[3][4]

Prior prompting work for MT found that monolingual-only demonstrations are usually harmful, while pseudo-parallel examples created by back-translation or forward-translation can improve prompting quality. Source: Zhang et al. (2023), Prompting Large Language Model for Machine Translation: A Case Study.

Our paper: self-mining in-context examples for unsupervised MT

One-sentence summary: We treat the missing-demonstration problem as an unsupervised mining problem: first mine reliable word translations, then use them to create and filter sentence-level examples that an LLM can use for translation in context.[4]

In our Findings of NAACL 2025 paper, we assume access to a multilingual LLM, vocabularies in the source and target languages, and a small amount of unlabeled text in each language. Importantly, the learning phase has no human-labeled parallel data; the paper also emphasizes a data-scarce regime with fewer than 1,000 unlabeled sentences per language in the setup it studies.[4]

Step 1: Mine word pairs

Use zero-shot prompting to translate frequent source words, reverse the direction, keep consistent back-translations, and rank the remaining pairs by cross-lingual similarity to retain high-quality lexical anchors.[4]
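The round-trip consistency filter at the heart of this step can be sketched as follows. Here `translate_word` and `xsim` are hypothetical stand-ins for the zero-shot LLM call and the cross-lingual similarity scorer; the toy dictionaries below only illustrate the filtering logic:

```python
def mine_word_pairs(source_words, translate_word, xsim, top_k=100):
    """Keep a candidate (source, target) pair only if translating the
    target word back recovers the source word, then rank the survivors
    by cross-lingual similarity."""
    kept = []
    for w in source_words:
        t = translate_word(w, direction="src->tgt")
        if translate_word(t, direction="tgt->src") == w:  # consistent round trip
            kept.append((w, t))
    return sorted(kept, key=lambda pair: xsim(*pair), reverse=True)[:top_k]

# Toy dictionaries standing in for zero-shot LLM word translation:
fwd = {"chat": "cat", "chien": "dog", "pomme": "orange"}
bwd = {"cat": "chat", "dog": "chien", "orange": "fruit"}
toy_translate = lambda w, direction: (fwd if direction == "src->tgt" else bwd).get(w, w)
pairs = mine_word_pairs(["chat", "chien", "pomme"], toy_translate, xsim=lambda s, t: 1.0)
# "pomme" is filtered out because its back-translation is inconsistent
```

The round trip acts as a cheap precision filter: an LLM that hallucinates a translation in one direction rarely hallucinates the exact inverse in the other.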

Step 2: Bootstrap with k-shot word prompts

Feed the best mined word pairs back into the model as in-context examples, refining the word-level inventory before moving to sentence-level translation.[4]

Step 3: Create weak sentence translations

Translate sentences word by word to obtain rough but semantically useful sentence pairs. These are noisy, but they preserve enough meaning to seed the next stage.[4]
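At its simplest, this step is lexicon substitution. A minimal sketch, with a toy hand-made lexicon standing in for the mined word pairs:

```python
def word_by_word(sentence, lexicon):
    """Produce a rough gloss by replacing each source token with its
    mined translation, leaving unknown tokens unchanged."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
draft = word_by_word("le chat dort profondément", lexicon)
# noisy and ungrammatical at times, but meaning-preserving enough to seed the next stage
```

The output ignores word order, morphology, and out-of-lexicon tokens, which is why the next stage refines these drafts rather than using them directly.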

Step 4: Select the right sentence examples

Back-translate to obtain more natural pairs, then choose input-specific examples with a two-part filter: first a similarity threshold, then BM25 ranking over the surviving pairs. That final selection method is TopK+BM25.[4]
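The two-part filter can be sketched like this. The `similarity` function is pluggable and the threshold, tokenization, and BM25 parameters here are illustrative assumptions; the paper's actual similarity model and settings may differ:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Minimal Okapi BM25 over pre-tokenized documents."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

def select_examples(test_input, pairs, similarity, threshold=0.3, k=8):
    """TopK+BM25 sketch: keep pairs whose source side passes a similarity
    threshold to the test input, then rank the survivors by BM25."""
    kept = [(s, t) for s, t in pairs if similarity(test_input, s) >= threshold]
    if not kept:
        return []
    scores = bm25_scores(test_input.split(), [s.split() for s, _ in kept])
    order = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
    return [kept[i] for i in order[:k]]

# Toy pseudo-parallel pool and a crude word-overlap similarity:
pairs = [
    ("the cat sleeps on the mat", "le chat dort sur le tapis"),
    ("stock prices fell sharply", "les prix des actions ont chuté"),
    ("the cat eats fish", "le chat mange du poisson"),
]
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()))
chosen = select_examples("the cat sleeps", pairs, overlap, threshold=0.5, k=2)
```

The threshold removes off-topic pairs cheaply before the lexical BM25 ranking picks the demonstrations most similar to the specific test input.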

This design matters because it turns “example selection” from a luxury into a core part of the model-building process. Instead of assuming demonstrations already exist, the system manufactures them from unlabeled text and then ranks them for each test input.[4]

What we found

We evaluated the approach with Llama-3 8B and Bloom 7B on 288 translation directions from FLORES-200. The headline result is that the unsupervised method can be comparable to, and sometimes better than, translation with regular in-context examples drawn from human-annotated data, while also outperforming prior UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]

At a glance:

- Scale of evaluation: 288 translation directions on FLORES-200 with two multilingual LLMs.[4]
- Against prior UMT: a +7 BLEU average improvement, as reported in the paper’s abstract, over previous state-of-the-art UMT methods.[4]
- English-involving subset: 55.76 average spBLEU for TopK+BM25, compared with 55.07 for regular k-shot ICL and 56.93 for regular BM25-based k-shot ICL, with the unsupervised method using no human-annotated data.[4]
- WMT benchmark average: 40.13 BLEU in the paper’s Table 2, ahead of the best listed baseline at 33.68.[4]

Translation performance is highest when both source and target languages are high-resource and lower when either side becomes more data-scarce. Source: El Mekki & Abdul-Mageed (2025), Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs.
More mined in-context examples generally help, especially when moving from 1 to about 8 examples, after which gains become smaller. Source: El Mekki & Abdul-Mageed (2025), Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs.

Two findings stood out to me. First, resource level still matters: even with strong multilingual LLMs, translation is easier when the target side is better represented. Second, better prompting is not just about the model; it is about the retrieval and filtering of examples. In our experiments, a carefully selected unsupervised demonstration set was the difference between a rough translation and a competitive one.[4]

Why this matters for language access

The broader significance is straightforward. If translation quality depends on large curated bitexts, then many languages remain blocked by a data collection problem before they can benefit from new models. But if an LLM can bootstrap usable demonstrations from a small amount of unlabeled text, the entry cost drops dramatically.[2][4]

That does not mean the problem is solved. Low-resource translation remains harder, and our own heatmap makes that visible. But it does mean the path forward looks different. Instead of waiting for perfect parallel corpora, we can start from weak lexical evidence, noisy sentence pairs, and strong multilingual priors, then iteratively mine something useful.[4]

My own takeaway is that unsupervised MT is newly relevant in the LLM era. Not because LLMs made supervision obsolete, but because they made bootstrapping supervision more plausible. For underrepresented languages, that distinction matters. It is the difference between “we cannot build this yet” and “we can begin with what we have.”[2][4]

Bottom line: if we can mine trustworthy in-context examples from unlabeled data, translation systems no longer have to wait for abundant parallel corpora before they become useful. That is a practical route toward broader language coverage in search, assistants, education, and public-facing digital tools.

FAQ

What is unsupervised machine translation?

It is translation without human-aligned sentence pairs. Instead of learning from supervised bitext, the system has to bootstrap from monolingual data, weak lexical alignments, denoising, and back-translation.[2][4]

Is unsupervised MT the same as zero-shot translation?

No. Zero-shot translation is an inference setting where the model receives an instruction but no examples. In our work, the key problem is how to create reusable in-context examples from unlabeled data so the model can translate more reliably than plain zero-shot prompting.[1][4]

Why not just use monolingual examples as demonstrations?

Prior work on prompting for MT found that monolingual-only demonstrations generally hurt translation, whereas pseudo-parallel examples created through zero-shot back-translation or forward-translation are much more effective.[3]

How much unlabeled data does this approach assume?

The paper studies a setup with fewer than 1,000 unlabeled sentences in each language, together with source and target vocabularies, a multilingual LLM, and an unsupervised sentence similarity function.[4]

What is the main empirical takeaway?

The paper reports that self-mined in-context examples can match or beat regular human-annotated ICL in many settings, while improving on previous UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]

References

This post cites the main papers directly in the text and in figure captions so that readers, search engines, and AI assistants can trace every major claim back to the primary source.

  1. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. PDF
  2. Lample, G., Ott, M., Conneau, A., Denoyer, L., & Ranzato, M. A. (2018). Phrase-Based & Neural Unsupervised Machine Translation. EMNLP. PDF
  3. Zhang, B., Haddow, B., & Birch, A. (2023). Prompting Large Language Model for Machine Translation: A Case Study. ICML / PMLR. PDF
  4. El Mekki, A., & Abdul-Mageed, M. (2025). Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs. Findings of NAACL 2025. PDF