An automated method of assessing machine translation quality that predicts how good MT output is without requiring a human reference translation.
Commonly written as QE and often referred to as MTQE (Machine Translation Quality Estimation), quality estimation uses machine learning models to score translated segments in real time. Unlike evaluation metrics such as BLEU or COMET, which compare MT output against a human reference, QE works on new content where no reference exists yet. The model analyzes the source and target text together and produces a score indicating how likely the translation is to be accurate, fluent, and ready for use.
Scores are typically expressed on a scale of 0 to 100, though some systems return categorical labels such as Good, Fair, or Poor. The underlying models are trained on large datasets of machine-translated content that has been reviewed and corrected by human translators, so the estimations reflect patterns learned from real post-editing behavior rather than abstract linguistic rules.
The primary use case is intelligent routing, sometimes called hybrid post-editing. Instead of sending every MT segment to a human reviewer regardless of quality, teams set score thresholds that determine what happens to each segment automatically:

- Segments scoring above the upper threshold are approved as-is, with no human review.
- Segments in the middle range are routed to human post-editing.
- Segments below the lower threshold are flagged for full human translation or retranslation.
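As a minimal sketch, the threshold-based routing described above might look like the following. The thresholds (90 and 60) and the routing labels are illustrative assumptions, not values prescribed by any particular QE tool.

```python
def route_segment(qe_score: float) -> str:
    """Route an MT segment based on a QE score on a 0-100 scale.

    Thresholds here are hypothetical; real teams tune them per
    content type, language pair, and risk tolerance.
    """
    if qe_score >= 90:
        return "publish"      # high confidence: skip human review
    elif qe_score >= 60:
        return "post-edit"    # medium: send for light post-editing
    else:
        return "retranslate"  # low: full human translation


# Example: a batch of segments fans out to different queues.
scores = [95.2, 71.0, 42.8]
routes = [route_segment(s) for s in scores]
```

In practice the thresholds are calibrated empirically, for example by measuring how often segments above a candidate cutoff still required edits during a pilot period.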
This approach concentrates human effort where it adds the most value. Routine, predictable content moves through without review, while complex or uncertain segments get the attention they need. Teams that have implemented QE-driven workflows have reported significant reductions in post-editing volume and cost.
QE also helps with MT engine selection. Running candidate engines on a sample of real project content and comparing QE scores across segments provides a more practical signal than generic benchmark comparisons.
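A simple sketch of this comparison, assuming QE scores have already been obtained for each candidate engine on the same sample segments (the engine names and score values below are invented for illustration):

```python
from statistics import mean


def rank_engines(qe_scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank candidate MT engines by mean QE score over a shared sample.

    Each engine must be scored on the same source segments so the
    averages are comparable.
    """
    return sorted(
        ((name, mean(scores)) for name, scores in qe_scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical per-segment QE scores for two candidate engines.
sample_scores = {
    "engine_a": [88.0, 92.0, 84.0],
    "engine_b": [81.0, 85.0, 90.0],
}
ranking = rank_engines(sample_scores)
```

Averages alone can hide variance, so teams often also inspect the distribution of scores or the count of segments below the post-editing threshold before committing to an engine.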
QE scores can be skewed when the MT engine and the QE model are trained on the same data. In that case, the estimator tends to rate the engine’s output more favorably than an independent model would, potentially allowing errors to slip through with high confidence scores. Teams should treat unusually high average scores as a signal to audit, not as confirmation of quality.
QE also cannot catch errors that require domain knowledge, cultural context, or knowing the brand voice. A sentence can be grammatically correct, semantically close to the source, and still be wrong for a specific audience. Automated scoring is a triage tool, not a replacement for human review on high-stakes content.
BLEU and COMET are evaluation metrics: they measure quality after the fact by comparing MT output to human references. QE is a prediction mechanism: it estimates quality before any human review takes place. In practice, teams use all three at different stages: BLEU and COMET for MT engine benchmarking, QE for live production workflows.