BLEU score is often bandied about by translation companies and software developers in relation to Machine Translation (MT) systems.

But what is a BLEU score? How does it work? Is it important?

In this article, we take a look at everything you need to know about BLEU scores. And that might be more than you think…

Because if you’re using the BLEU metric as a way to compare two different Machine Translation engines before you purchase, you might not have all of the information you need to make the right decision.

What is BLEU score?

BLEU stands for BiLingual Evaluation Understudy. A BLEU score is a quality metric assigned to a text which has been translated by a Machine Translation engine.

The goal of MT is to produce output equal in quality to that of a professional human translator. BLEU is an algorithm designed to measure how close a system gets to that goal.

Because it is fast and cheap to carry out, BLEU has remained one of the most popular automated metrics for judging the quality of Machine Translation, even though it was one of the first automated metrics ever created.

Why is BLEU score important?

The best way to judge the quality of the output of an MT system is to have a skilled human linguist evaluate it.

But, much like the difference between Machine Translation and skilled human translation, using a person costs more and tends to be slower. It is also often less objective.

In the same way that MT itself was created to try and make the translation process faster and more cost-effective, metrics like BLEU can offer a cheaper, faster alternative when MT systems need to be tested regularly.

How does BLEU score work?

A BLEU score runs on a scale from 0 to 1. You will sometimes see that decimal scale converted into a 0 to 100 scale so the scores are easier to read.

But even that 0 to 100 scale should not be understood as a percentage measure of accuracy. It isn’t one.

A BLEU score measures a given machine-translated text against a test reference text which has been human-translated. On the BLEU scale:

  • 1 (or 100) is the top end of the scale. The closer a BLEU score is to 1, the more it resembles the reference text from the human translator. If a BLEU score actually hit 1, it would be identical to the specific human reference work supplied. However, this is so unlikely that it is essentially impossible.
  • 0 (zero) is the bottom end of the scale. Results towards this end of the scale indicate that there is much less overlap with the test human reference text.
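To make the mechanics a little more concrete, the sketch below shows, in Python, the kind of calculation involved: counting overlapping word sequences (n-grams) of one to four words, clipping repeated matches, taking a geometric mean of the resulting precisions and applying a penalty for output shorter than the reference. The function name and example sentences are invented for illustration, and real toolkits (such as NLTK or sacrebleu) add smoothing and standardised tokenisation, so treat this as a sketch of the idea rather than the exact formula any particular vendor uses.

    # Simplified, illustrative BLEU calculation (sentence-level, up to 4-grams).
    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Count every n-gram of length n in a list of tokens."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def simple_bleu(candidate, reference, max_n=4):
        """Score an MT sentence against one human reference, on a 0-to-1 scale."""
        cand = candidate.lower().split()
        ref = reference.lower().split()

        precisions = []
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # "Clipped" matches: each reference n-gram is only credited once.
            matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = sum(cand_counts.values())
            if total == 0 or matches == 0:
                return 0.0  # no overlap at this n-gram length collapses the score
            precisions.append(matches / total)

        # Geometric mean of the 1-gram to 4-gram precisions.
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)

        # Brevity penalty: punish candidates shorter than the reference.
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
        return bp * geo_mean

    reference = "the committee approved the proposal without any further changes"
    print(simple_bleu(reference, reference))  # 1.0: identical to the reference
    print(simple_bleu("the committee approved the proposal with no further changes",
                      reference))             # well below 1.0, despite near-identical meaning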

Is a BLEU score an accurate measure of quality?

BLEU score is a fast and cost-effective way to measure the quality of Machine Translation output. But it does have its limitations. Common issues include:

1) The art of translation

It is important to remember that translation is in some ways as much an art as it is a science.

For example, two highly skilled human translators working on the same material may produce two very good translated texts. Yet if one translation were scored against the other, the BLEU score would typically only be around 0.6 to 0.7 (60 to 70).

This is because each linguist would likely use different specific words or phrase things slightly differently.

Each could produce an excellent translation. But they can easily be slightly – or very – different from one another in specifics while still being accurate and conveying the same sense as the original document.

This is a good thing to bear in mind when considering BLEU scores.

2) Overfitting measurements

BLEU scores of over 0.7 (70) should be treated with some reservation. As previously mentioned, even the work of two highly skilled human translators would likely struggle to create a BLEU score of more than 0.7.

Thus, very high BLEU scores are often an indication that the metric is being applied incorrectly – for example, that the engine has already been trained on the reference text – rather than that the output is genuinely that good.

3) Word order and vocabulary choice

The BLEU algorithm assigns better scores to sequences of words in the MT output that appear in the same order as in the reference document.

Two matching words in a row is good. Four or five in a row would be very good, according to the algorithm.

But what if the document contains all of the same matching words in a different order? Or different words which have the same meaning?

The BLEU metric might score documents displaying either of those characteristics more harshly. Yet despite this, to a human eye, the MT output might actually compare very favourably with the human reference text.
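As a concrete (and invented) example of that harshness, the snippet below reuses the simple_bleu() sketch from earlier. The first candidate uses exactly the same words as the reference with its two clauses swapped round; the second says the same thing with synonyms. A human reviewer would accept either, yet both are marked down.

    reference = "if it rains tomorrow the match will be cancelled"

    # Same words, clauses swapped: the longer n-gram matches are broken,
    # so the score lands noticeably below 1 despite identical vocabulary.
    print(simple_bleu("the match will be cancelled if it rains tomorrow", reference))

    # Same meaning, different words: synonyms such as "game" and "called off"
    # earn no credit at all, so the score falls much further.
    print(simple_bleu("if it rains tomorrow the game will be called off", reference))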

What are the advantages of BLEU scores?

A BLEU score might not be the perfect judge of Machine Translation quality. Yet it does have several advantages over other metrics and methods:

1) Fast and affordable

Having a human measure the results of your Machine Translation might be better in some circumstances. But using an automated system like BLEU is much faster and cheaper.

2) Easy to understand

A human might be able to create detailed reporting which explains where an MT engine has gone wrong and where it could do better.

For most people and most purposes though, having numerical values is easier to understand.

3) Independent of language

In order to measure the quality of Machine Translation systems designed for different language groups, you would need specialist human translators in every language pair.

Not so with BLEU score. The BLEU algorithm is language independent.

4) Results are close to human evaluation

On average, the scores which the BLEU metric produces correlate reasonably well with human judgements of the quality of a given document.

5) It is in common usage

One of the biggest advantages of BLEU is that it has been adopted across the translation industry as a measure of MT quality.

This means that, once you understand BLEU scores, you can use them and know you are getting at least some kind of accurate understanding of the quality a Machine Translation engine can produce.

How to compare BLEU scores (and how not to)

There are essentially two different purposes which people use the BLEU metric for. One makes good use of the advantages of BLEU scores. The other risks giving an inaccurate impression of what the metric really means and of the quality of the MT system involved:

1) BLEU scores in R&D (the right way)

If you are aiming to evaluate an MT engine you are training (or which you are having trained by custom MT engine training specialists like Asian Absolute), a BLEU score is a handy thing to have. In fact, it’s difficult to build MT systems without one.

BLEU scores give MT engine designers, trainers and users the ability to measure an MT system’s evolving quality over time.

Because the score is objective and can be returned very quickly, it is the ideal measuring system when designers need swift, regular feedback on the latest improvements they have made.
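As a minimal sketch of what that feedback loop can look like in practice, the snippet below scores two successive versions of an engine’s output against one fixed, human-translated reference set. It assumes the open-source sacrebleu package and uses hypothetical file names; it illustrates the workflow rather than any particular vendor’s tooling.

    # pip install sacrebleu -- one widely used open-source BLEU implementation.
    import sacrebleu

    # The human-translated reference and the source test set stay fixed;
    # only the engine producing the output files changes between runs.
    with open("reference.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f]

    for output_file in ["engine_v1_output.txt", "engine_v2_output.txt"]:
        with open(output_file, encoding="utf-8") as f:
            hypotheses = [line.strip() for line in f]
        # corpus_bleu takes the system output plus a list of reference streams.
        result = sacrebleu.corpus_bleu(hypotheses, [references])
        print(output_file, round(result.score, 1))  # sacrebleu reports on the 0-100 scale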

2) BLEU scores in MT system comparison and purchase (not the right way)

It might seem like having two objective scores which demonstrate how “good” two different Machine Translation systems are would be a great way of comparing them.

But using BLEU scores in this way doesn’t usually give you a very accurate picture of the comparative quality of two different engines.

For example, one MT engine might have been trained using the human reference document. Another might never have “seen it” before. There have even been occasions when an MT engine was itself used to develop the reference data.

In general, using BLEU scores to compare different engines, languages, or systems working on two different bodies of text against each other will almost always result in incorrect conclusions.

Where to use BLEU scores

There are several rules of thumb to bear in mind when deciding when to use BLEU scores. If you want to be sure that the metric is giving you useful data on which to base your decisions, you should understand:

1) Same language. Same engine. Same body of text.

If you compare BLEU scores from different MT systems, across different languages, or across different bodies of text, you will end up with figures which can’t properly be compared with one another. Always aim to compare like with like.

2) Same reference text and body of text

If you compare BLEU scores generated using the same body of text but different reference texts, you will again have figures which should not really be set against one another.

If you want an accurate reading of quality, always ensure the text corpora and human reference text are the same.

3) Same scores do not mean the same thing

Overall, the thing to bear in mind when using BLEU scores is that not all scores – or even most scores – can be compared against each other.

One MT system might score 0.3 (30). Another might produce output with a quality of 0.4 (40). But if those two different engines are designed for different languages or subject matter, you cannot necessarily conclude that the “40” engine is “better” than the “30”.

The real use of BLEU scores is in measuring the development and evolution of your chosen MT system as your developers and trainers work on it. The body of text, test reference text, and language all need to stay the same. Only the alterations made to the engine should be different.

What BLEU scores do and do not do

Think of custom MT engine training and development like a college science experiment. If you change multiple variables at once, you cannot tell why or how the quality of the system has become better or worse, either compared with its earlier self or with other systems.

A BLEU score is only really useful as a measure of progress during the internal development of an MT system.

What a BLEU score is not is a percentage measure of quality which you can use to compare any Machine Translation engine against any other.

Are you trying to decide which Machine Translation engine is right for your project?

Asian Absolute specialises in the selection and training of custom MT engines. We’ve been used by the Financial Times and businesses in every industry.

Let’s talk about your project. There’s no commitment to use us. Just the relevant information you need to help make your decision.