BLEU score

A numerical metric that measures how closely a machine translation matches one or more human reference translations, expressed as a value between 0 and 1.

BLEU stands for Bilingual Evaluation Understudy. Developed by IBM researchers in 2002, it remains the most widely used automatic metric for evaluating machine translation output. The score compares word sequences, called n-grams, between the machine-translated text and a human-translated reference. The more sequences match, the higher the score. A score of 0 means no overlap at all. A score of 1 means an exact match, which in practice is nearly impossible to achieve even with high-quality human translations.

The calculation uses modified precision, which counts how many n-grams from the machine translation appear in the reference, with counts clipped so that repeating a word cannot earn more credit than the number of times that word appears in the reference. It also applies a brevity penalty to discourage systems from generating very short outputs that score well simply by including fewer words.
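To make the mechanics concrete, here is a minimal from-scratch sketch in Python. The function names are illustrative, not any standard API; real-world work would use an established implementation such as sacreBLEU, which also handles tokenization and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of length n in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n n-gram precisions times a brevity
    penalty; returns a value between 0 and 1."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    # Brevity penalty: outputs shorter than the reference are discounted.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean

candidate = "the cat sat on the mat".split()
reference = "the cat sat on the mat quietly".split()
print(round(bleu(candidate, reference), 3))  # 0.846
```

In this example every candidate n-gram matches, so all four precisions are 1.0, yet the six-word output against a seven-word reference triggers the brevity penalty and pulls the score down to roughly 0.85.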

🤔 How to read a BLEU score #️⃣

BLEU scores are often expressed as a percentage (0–100) rather than a decimal. General interpretation benchmarks used in the MT industry:

  • Below 20 — output is mostly unusable without significant editing
  • 20–30 — rough, understandable but requires heavy post-editing
  • 30–50 — acceptable quality, suitable for many production workflows with light editing
  • 50+ — high quality, approaching human-level output for certain content types

These ranges are not universal. The same score means different things depending on the language pair, content domain, and how many reference translations were used. A score of 35 on legal contracts is not the same as 35 on marketing copy.

🔢 Key points about BLEU scores in localization #️⃣

  • BLEU is useful for comparing MT engines on the same content type. It gives a consistent, repeatable signal for whether one system performs better than another on your specific material.
  • A higher BLEU score usually means lower post-editing effort. In many practical scenarios the two correlate well, though the strength of the relationship varies by domain and language pair.
  • BLEU scores are not comparable across different test sets. A vendor showing a high BLEU score on generic news data tells you very little about how their engine will perform on your product UI strings or technical documentation.
  • BLEU does not measure meaning. A translation can use the right words in nearly the right order, score well, and still reverse the meaning of a negation or misidentify the subject of a sentence.
  • Using only one reference translation limits the score. BLEU was designed for multiple references. When only one human reference is available, perfectly valid alternative phrasings get penalized (see the example after this list).
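As a rough illustration of the multiple-reference point, here is how a corpus score might be computed with the open-source sacreBLEU library. The sentences are invented for this example:

```python
import sacrebleu

# Machine translations for two source sentences (invented examples).
hypotheses = [
    "The app saves your settings automatically.",
    "Click the button to start the upload.",
]

# Two reference streams: references[k][i] is the (k+1)-th human
# reference for sentence i. Extra streams let valid rewordings match.
references = [
    [  # first human reference for each sentence
        "The app saves your settings automatically.",
        "Press the button to begin the upload.",
    ],
    [  # second human reference for each sentence
        "The application stores your preferences on its own.",
        "Click the button to start the upload.",
    ],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus BLEU on the 0-100 scale
```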

🙅🏻‍♂️ What BLEU does not catch #️⃣

BLEU operates on surface-level word matches. It cannot recognize synonyms, so “big” and “large” are treated as different words even when the meaning is identical. It does not assess grammatical correctness, fluency, cultural appropriateness, or whether the translation is actually accurate in meaning. A sentence that reuses the right words in nearly the right order can score well while completely distorting the intent.
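A quick way to see this surface-level behavior is to score two candidates against the same reference. The sentences below are invented, and sacreBLEU's sentence-level scoring applies smoothing, so exact numbers will vary:

```python
import sacrebleu

reference = ["The new release includes a large number of fixes."]

same_meaning = "The new release includes a big number of fixes."
same_words   = "The new release includes a large number of fixes."

# BLEU only sees surface tokens: swapping "large" for the synonym
# "big" costs points even though the meaning is unchanged.
print(sacrebleu.sentence_bleu(same_meaning, reference).score)
print(sacrebleu.sentence_bleu(same_words, reference).score)  # 100.0
```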

This is why industry experts and localization teams increasingly use BLEU alongside newer metrics like COMET (Crosslingual Optimized Metric for Evaluation of Translation), which uses neural models trained on human judgments to evaluate semantic quality more reliably, especially for domain-specific content. BLEU remains useful as a fast, cheap baseline signal during MT engine selection and regression testing. For production quality evaluation, it should not be the only metric.
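For orientation, a rough sketch of scoring with Unbabel's open-source COMET package follows, assuming the wmt22-comet-da checkpoint; the example sentences are invented, and exact calls may differ across library versions:

```python
# Requires: pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item pairs a source sentence, its machine translation,
# and a human reference (invented examples).
data = [{
    "src": "Die Einstellungen werden automatisch gespeichert.",
    "mt":  "The settings are saved automatically.",
    "ref": "Your settings are stored automatically.",
}]

prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction.system_score)  # higher means closer to human judgments
```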

⚙️ BLEU in localization workflows #️⃣

In practice, localization teams use BLEU in two main ways. First, during MT engine selection: running candidate engines on a sample of actual project content and comparing BLEU scores gives an objective basis for choosing the best-performing system for that content type. Second, as a KPI during post-editing workflows: a higher baseline BLEU score generally means translators spend less time correcting MT output, reducing cost and turnaround time.
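A minimal sketch of that engine-selection comparison, again using sacreBLEU with invented strings, might look like this; because both engines are scored on the same test set against the same references, the two numbers are directly comparable:

```python
import sacrebleu

# Hypothetical outputs of two candidate MT engines on the same
# sample of project content, plus one human reference per line.
engine_a = ["Open the settings menu.", "Your file was uploaded."]
engine_b = ["Open menu of settings.", "Your file has been uploaded."]
references = [["Open the settings menu.", "Your file has been uploaded."]]

score_a = sacrebleu.corpus_bleu(engine_a, references).score
score_b = sacrebleu.corpus_bleu(engine_b, references).score
print(f"Engine A: {score_a:.1f}  Engine B: {score_b:.1f}")
```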

What BLEU does not replace is human review. For brand-sensitive, legal, or safety-critical content, automatic metrics of any kind are a starting point, not a final judgment.

Curious about software localization beyond the terminology?

⚡ Manage your translations with Localazy! 🌍