A numerical metric that measures how closely a machine translation matches one or more human reference translations, expressed as a value between 0 and 1.
BLEU stands for Bilingual Evaluation Understudy. Developed by IBM researchers in 2002, it remains the most widely used automatic metric for evaluating machine translation output. The score compares word sequences, called n-grams, between the machine-translated text and a human-translated reference. The more sequences match, the higher the score. A score of 0 means no overlap at all. A score of 1 means an exact match with the reference, which in practice almost never occurs, since even a second high-quality human translation rarely reproduces the reference word for word.
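The n-gram comparison at the heart of the score can be sketched in a few lines of Python. This is a minimal illustration with an invented sentence pair, not a full BLEU implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = "the cat sat on the mat".split()   # hypothetical MT output
reference = "the cat is on the mat".split()    # hypothetical human reference

for n in (1, 2):
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference (the "clipping" that BLEU relies on).
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    print(f"{n}-grams: {overlap} of {sum(cand.values())} match")
```

For this pair, 5 of 6 unigrams and 3 of 5 bigrams match, so the candidate would score well on short n-grams despite the wrong verb.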
The calculation uses modified precision, which counts how many n-grams from the machine translation appear in the reference, capped to avoid inflating scores through word repetition. It also applies a brevity penalty to discourage systems from generating very short outputs that score well simply by including fewer words.
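Putting the two pieces together, a sentence-level BLEU with clipped n-gram precisions up to 4-grams and a brevity penalty might look like the sketch below. It is an illustrative implementation under the standard formula, not a reference one; real evaluations typically use a tested library:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified n-gram precisions (n = 1..max_n),
    multiplied by the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each n-gram count to the most it appears in any one reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty: punish candidates shorter than the closest reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a very short candidate is pulled down by the brevity penalty even when every word it contains appears in the reference.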
BLEU scores are often expressed as a percentage (0–100) rather than a decimal. Commonly cited interpretation benchmarks in the MT industry (for example, in Google Cloud's machine translation documentation):

- Under 10: almost useless output
- 10–19: hard to get the gist
- 20–29: the gist is clear, but with significant grammatical errors
- 30–39: understandable to good translations
- 40–49: high-quality translations
- 50–59: very high-quality, adequate, fluent translations
- 60 and above: quality often better than a human reference
These ranges are not universal. The same score means different things depending on the language pair, content domain, and how many reference translations were used. A score of 35 on legal contracts is not the same as 35 on marketing copy.
BLEU operates on surface-level word matches. It cannot recognize synonyms, so “big” and “large” are treated as different words even when the meaning is identical. It does not assess grammatical correctness, fluency, cultural appropriateness, or whether the translation is actually accurate in meaning. A sentence that reuses the right words in nearly the right order can score well while completely distorting the intent.
This is why industry experts and localization teams increasingly use BLEU alongside newer metrics like COMET (Cross‑lingual Optimized Metric for Evaluation of Translation), which uses neural models trained on human judgments to evaluate semantic quality more reliably, especially for domain-specific content. BLEU remains useful as a fast, cheap baseline signal during MT engine selection and regression testing. For production quality evaluation, it should not be the only metric.
In practice, localization teams use BLEU in two main ways. First, during MT engine selection: running candidate engines on a sample of actual project content and comparing BLEU scores gives an objective basis for choosing the best-performing system for that content type. Second, as a KPI during post-editing workflows: a higher baseline BLEU score means translators spend less time correcting MT output, directly reducing cost and turnaround time.
What BLEU does not replace is human review. For brand-sensitive, legal, or safety-critical content, automatic metrics of any kind are a starting point, not a final judgment.