The Coming of Age of Translation Quality Scoring

Introduction

Translation quality scoring refers to the process of evaluating and assigning a numerical or qualitative score to assess the quality of a translation. It involves measuring the degree to which a translated text meets specific criteria or standards.

Scoring translation quality is desirable for several reasons:

To assess and manage the quality of deliverables from directly employed linguists, or of linguistic services provided by Language Service Provider (LSP) vendors, as part of vendor management or to determine which vendors are best suited to which content types and languages.
To score the quality of machine translation models, especially when a model is continuously retrained and you need to know whether the next iteration improves on the prior version before deciding whether to release it.
To identify unacceptably bad translations produced by machine translation models so they can be remediated (e.g., sent for linguist post-edit by exception) in workflows that use MT without post-edit as the default.
To identify machine translations that are good enough not to need post-editing, so that the post-editing stage (and its cost) can be skipped for them in workflows that use MT with post-edit as the default.

The mechanisms for assessing translation quality vary depending on whether you are assessing a linguist or LSP service, an NMT model, or a specific translation produced by an NMT model. The goal in all cases is to assess translation quality on a set of dimensions including accuracy, fluency, cultural appropriateness, style, and terminology.

Accuracy. A translation’s accuracy is how well it conveys the meaning, context, and intent of the source text. An inaccurate translation can lead to miscommunication, confusion, and even legal disputes.
Fluency. Fluency is the extent to which the translation is free of grammatical errors and is easily read and understood by the target audience. A lack of fluency can cause misunderstandings, make the message hard for the target audience to follow, and create negative brand associations.
Cultural Appropriateness. High quality translations express the intent of the original text in a culturally appropriate way, adapting things like metaphors or cultural references as necessary to preserve the intent of the text. A well-adapted translation will be more effective and will help the target audience to better understand the message.
Style. The style of a translation refers to factors such as the formality of tone, verbosity, and implied degree of emotional energy of the source text. A mismatch in style between the source text and the translation can lead to confusion and make the message less effective.
Terminology. This refers to specialized words and phrases used in a particular field or industry, which are expected to be used accurately and consistently. These terms are defined in glossaries and are often not translated at all (e.g., names of products). In many cases it is a terminology error to translate such a term at all, and it is always an error to translate a term inconsistently.

Human Evaluation and Scoring Methods

The most widely accepted and utilized method of translation quality assessment is the Multidimensional Quality Metrics (MQM) model. MQM defines a taxonomy of error types. Linguist reviewers annotate translations, marking each error with the relevant type and a severity level of Minor, Major, or Critical. This annotation data can then be reduced to a quality score (0-100): starting from a perfect score of 100, points are deducted for each error according to weighting factors for its error type and severity. Organizations can adjust the weights so that the score cares more or less about particular aspects of quality. For instance, some organizations may want to focus mostly on accuracy and fluency with less concern for tone and style, while others may care more about style or terminology.
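As a rough illustration of the deduction scheme described above, here is a minimal Python sketch. The severity points, the error-type weights, and the function name mqm_score are all hypothetical values chosen for the example, not figures from the MQM framework or from any particular organization's configuration.

```python
# Hypothetical penalty points per severity level (assumed for illustration).
SEVERITY_POINTS = {"minor": 1, "major": 5, "critical": 25}

# Hypothetical per-type weights; an organization would tune these to its priorities.
TYPE_WEIGHTS = {"accuracy": 1.0, "fluency": 1.0, "terminology": 1.0, "style": 0.5}


def mqm_score(errors, type_weights=TYPE_WEIGHTS, severity_points=SEVERITY_POINTS):
    """Reduce reviewer annotations to a 0-100 score.

    errors: list of (error_type, severity) tuples marked up by a linguist.
    """
    penalty = sum(type_weights[etype] * severity_points[sev] for etype, sev in errors)
    # Deduct from a perfect score and never go below zero.
    return max(0.0, 100.0 - penalty)


annotations = [("accuracy", "major"), ("fluency", "minor"), ("style", "minor")]
print(mqm_score(annotations))

# An organization that cares more about style can raise that weight.
style_focused = dict(TYPE_WEIGHTS, style=2.0)
print(mqm_score(annotations, type_weights=style_focused))
```

With the example weights, the three annotated errors deduct 5 + 1 + 0.5 points. Raising the style weight lowers the score for the same annotations, which is how the same review data can serve organizations with different priorities.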

Linguists can also use this form of scoring to review the output quality of Neural Machine Translation (NMT) models. While these models are typically evaluated with the automated methods described below, it can be useful to benchmark those methods against a linguist quality review of a representative sample. This confirms that the automated metrics correlate with human evaluation and can therefore be trusted.

Machine Quality Evaluation and Estimation

Machine Quality Evaluation and Quality Estimation are similar terms that mean quite different things.

Quality Evaluation is the process of evaluating a newly trained NMT model. This is done by reserving a small portion of the training data that will not be used to train the model, only to evaluate it. It constitutes a set of known good or “reference” translations. Once the model is trained using all the other data, it is then asked to produce translations for all the sentences in the evaluation set. These translations are then programmatically compared with the known good reference translations in the evaluation set. BLEU is a commonly used metric that measures how close the model-produced translations are to the reference translations. TER, F-measure, and COMET are other commonly used metrics that differ in the details of how the scoring is done but serve the same purpose.
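To make the comparison step concrete, here is a simplified BLEU-style score in plain Python. It is a sketch under stated assumptions (whitespace tokenization, one reference per sentence, no smoothing), not a substitute for a standard implementation. The function names and the example sentences are invented for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Simplified corpus-level BLEU: clipped n-gram precision with a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c_tokens, r_tokens = cand.split(), ref.split()
        cand_len += len(c_tokens)
        ref_len += len(r_tokens)
        for n in range(1, max_n + 1):
            c_counts = ngram_counts(c_tokens, n)
            r_counts = ngram_counts(r_tokens, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum(min(k, r_counts[g]) for g, k in c_counts.items())
            totals[n - 1] += sum(c_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


# Held-out reference translations and the output of two model iterations.
references = ["the cat sat on the mat today",
              "please restart the device before updating the software"]
old_model = ["the cat is on the mat today",
             "please reboot the device before you update software"]
new_model = ["the cat sat on the mat today",
             "please restart the device before updating software"]

print("old:", round(corpus_bleu(old_model, references), 3))
print("new:", round(corpus_bleu(new_model, references), 3))
```

Here the newer iteration scores higher because more of its word sequences match the references, which is the kind of signal used to decide whether a retrained model should be released.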

Quality Estimation is different, and more similar in purpose to traditional human quality scoring, in that it is assessing the quality of an individual translation as objectively as possible but without known good reference translations to compare against.

This is not a trivial problem to solve. It is incredibly hard for any algorithm or AI classifier to accurately assess all the dimensions of translation quality described earlier and be 100% right 100% of the time so that the machine evaluation correlates perfectly with human review.

In our labs at MotionPoint we have been testing various AI-based QE models. These vary in approach, from semantic similarity assessments that use vector embeddings produced by domain-adapted embedding models, to Large Language Models that perform MQM-style evaluations.
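The semantic similarity idea can be sketched in a few lines of Python. This assumes a multilingual embedding model has already mapped the source sentence and each candidate translation into the same vector space. The three-dimensional vectors below are toy values made up for the example, and the sketch does not describe any specific QE model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


# Toy embeddings (assumed): real models produce hundreds of dimensions.
source_vec = [0.9, 0.1, 0.3]
faithful_translation_vec = [0.8, 0.2, 0.3]
off_topic_translation_vec = [0.1, 0.9, 0.2]

print(cosine_similarity(source_vec, faithful_translation_vec))
print(cosine_similarity(source_vec, off_topic_translation_vec))
```

A translation whose embedding sits close to the source embedding gets a similarity near 1, while one that drifts in meaning scores lower. That score can then serve as one input to a reference-free quality estimate.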

We are finding these models perform best at the extremes. They identify very bad translations well and rarely incorrectly tag translations as such. Similarly, they identify very good ones well and rarely incorrectly tag translations as excellent. They are less discerning in the middle ground, but the ability to perform well at the extremes nonetheless makes them very useful in a couple of ways, despite being far from perfect.

Automated Quality Estimation as a Safety Net

Say you are risk tolerant on translation quality generally and just really need translations to be basically accurate and intelligible. If you have a QE model that can at least identify the really bad translations (i.e., inaccurate or unintelligible) with high accuracy, you can then process these by exception, e.g., have a linguist do a post-edit. This is a more cost-effective way to manage the risk of bad translations than sending everything for post-edit “just in case.”

Automated Quality Estimation as a Cost Optimizer

On the other end of the spectrum, say you have a high quality bar and machine translation with post-editing (MTPE) is the default workflow for everything. To the extent that QE models can accurately identify at least some of the machine translations that are good enough to use as-is, post-editing can be skipped for these translations, with significant cost savings and a low risk of sacrificing quality.
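Both uses come down to a threshold on the QE score, applied in opposite directions. The sketch below assumes a QE score between 0 and 1 where higher is better. The threshold values, the workflow labels, and the function name route are hypothetical and would need to be calibrated against human review.

```python
def route(qe_score, workflow, bad_threshold=0.3, good_threshold=0.9):
    """Decide what happens to one machine-translated segment.

    workflow "mt_only": MT is published by default; QE acts as a safety net.
    workflow "mtpe":    MT is post-edited by default; QE acts as a cost optimizer.
    """
    if workflow == "mt_only":
        # Only the clearly bad translations are sent to a linguist by exception.
        return "post_edit" if qe_score < bad_threshold else "publish"
    if workflow == "mtpe":
        # Only the clearly excellent translations skip the post-editing stage.
        return "publish" if qe_score >= good_threshold else "post_edit"
    raise ValueError(f"unknown workflow: {workflow}")


for score in (0.1, 0.6, 0.95):
    print(score, route(score, "mt_only"), route(score, "mtpe"))
```

Because each workflow acts only on one extreme of the score range, a mid-range score simply falls back to that workflow's default. This is why a QE model that is discerning only at the extremes can still be useful.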

Learn more about AdaptiveQE, MotionPoint’s AI-powered quality estimation tool that optimizes translation workflows, reduces costs, and ensures precise linguistic accuracy for your business needs.

In our lab testing we are finding that, in datasets where human evaluation indicated that about 40% of translations did not need post-editing, the QE models could reliably identify about 15% of the translations as not needing post-edit with high accuracy.

This is valuable if it can save 15% of post-editing costs in MTPE workflows. As the technology improves and the 15% gets closer to 30% or more, the cost savings will become extremely compelling.
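The arithmetic behind these figures can be sketched as follows. It assumes every segment costs the same to post-edit and that every translation the QE model flags really is good enough to use as-is (a precision of 1.0). Both are simplifying assumptions for the example rather than reported results, and the function name is made up.

```python
def qe_skip_summary(share_good, share_flagged, precision=1.0):
    """Summarize what skipping post-edit for QE-flagged segments achieves.

    share_good:    fraction of translations humans judged as not needing post-edit.
    share_flagged: fraction of translations the QE model flags as good enough.
    precision:     fraction of flagged translations that truly are good (assumed).
    """
    correctly_skipped = share_flagged * precision
    return {
        # Every flagged segment skips post-editing, so this is the cost avoided.
        "post_edit_cost_saved": share_flagged,
        # How much of the skippable work the QE model actually finds.
        "share_of_good_found": correctly_skipped / share_good,
        # Good translations that are still post-edited unnecessarily.
        "good_still_post_edited": share_good - correctly_skipped,
    }


summary = qe_skip_summary(share_good=0.40, share_flagged=0.15)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions, flagging 15% of translations captures a bit more than a third of the 40% that could have skipped post-editing. The remaining quarter of all translations is the headroom that better QE models could still unlock.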

Don’t Let Your Translation Quality Slack

At MotionPoint we have 20 years of history providing top quality translations to our customers and are investing in the technology platform necessary to deliver superior business outcomes, intelligently balancing the cost vs risk tradeoff as it relates to translation.

We are doing this by investing in our platform to ensure we deliver the best quality machine translation engines, and smart workflow options that utilize automated quality estimation tools to convert that quality into beneficial outcomes in terms of cost and risk for our customers.

Want to learn more about how we’re revolutionizing the translation industry by offering human-quality translations for up to 60% less? Download our recent webinar on website translation in light of AI for free! You can find it here.