What is BLEU?
BLEU stands for “Bilingual Evaluation Understudy.” It is a metric for measuring how closely a machine translation matches one or more human reference translations. Originally developed at IBM, BLEU remains one of the most popular tools for gauging the quality of machine translation.
BLEU compares the machine translation to one or more reference translations by matching n-grams (contiguous sequences of words). The machine-generated translation receives a score between 0 and 1, with 1 signifying a perfect match with the reference translation. In other words, a translation scores well when its n-grams overlap closely with those of the references.
The BLEU metric has limitations: it can produce misleading results when comparing translations across languages with different grammatical structures or word order, since it rewards surface-level n-gram overlap rather than meaning. Still, because of its simplicity and convenience, it remains one of the most widely used measures for evaluating machine translation.
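For a quick feel for the metric before digging into the math, here is a minimal example using NLTK’s sentence_bleu (this assumes the nltk package is installed; the sentences are invented for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu

# Tokenized reference and candidate translations (toy example)
reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

# weights=(0.5, 0.5) scores with 1-gram and 2-gram precision (BLEU-2)
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))
print(f"BLEU-2: {score:.3f}")  # ~0.707 for this pair
```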
How to calculate the BLEU score?
To determine BLEU scores, do the following steps:
- Determine the n-gram precision: count the n-grams in the machine-generated translation that also occur in the reference translation(s), and divide that count by the total number of n-grams in the machine-generated translation. (Standard BLEU uses modified precision: each candidate n-gram is credited at most as many times as it appears in any single reference.)
- Determine the brevity penalty: this penalizes translations that are shorter than the reference translations. If the machine-generated translation (length c) is at least as long as the closest reference (length r), the penalty is 1; otherwise it is exp(1 - r/c).
- Combine the n-gram precisions: take the weighted geometric mean of the n-gram precisions. This yields a single score that shows how well the translation matches the reference translations in terms of n-gram precision.
- Determine the final BLEU score: multiply the combined n-gram precision by the brevity penalty. (A from-scratch sketch of these four steps follows below.)
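To make these steps concrete, here is a minimal from-scratch sketch in Python. The function names (ngrams, modified_precision, brevity_penalty, bleu) are our own, not from any library, and a production system should prefer an established implementation such as NLTK’s:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Step 1: clipped n-gram precision of the candidate against the references."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its maximum count in any one reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total

def brevity_penalty(candidate, references):
    """Step 2: penalize candidates shorter than the closest reference."""
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Steps 3 and 4: weighted geometric mean of the precisions times the penalty."""
    weights = [1.0 / max_n] * max_n
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean collapses to zero if any precision is zero
    geo_mean = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
    return brevity_penalty(candidate, references) * geo_mean
```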
The following is the formula for calculating the BLEU score:
- BLEU = brevity_penalty * exp(sum(w_n * log(p_n)))
where:
- brevity_penalty is the penalty described above.
- w_n is the weight applied to the precision for n-grams of size n. Weights are often set to 1/N, where N is the number of n-gram sizes used (e.g., 0.25 each when 1- through 4-grams are used).
- p_n is the precision score for n-grams of size n.
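Plugging illustrative numbers into this formula (the precision and penalty values below are made up purely for demonstration):

```python
import math

precisions = [0.6, 0.4]  # hypothetical 1-gram and 2-gram precisions
weights = [0.5, 0.5]     # 1/N with N = 2 n-gram sizes
bp = 1.0                 # assume the candidate is at least as long as the reference

bleu_score = bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
print(f"BLEU: {bleu_score:.3f}")  # geometric mean of 0.6 and 0.4, about 0.490
```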
For instance, to compute the BLEU score with 1-gram and 2-gram precisions, given a machine-generated translation and two reference translations, one would:
- Calculate the 1-gram precision: count the number of 1-grams in the machine-generated translation that also appear in the reference translations and divide by the total number of 1-grams in the machine-generated translation.
- Calculate the 2-gram precision: count the number of 2-grams in the machine-generated translation that also appear in the reference translations and divide by the total number of 2-grams in the machine-generated translation.
- Determine the brevity penalty: compare the length of the machine-generated translation with the closest reference length and, if the candidate is shorter, apply exp(1 - r/c) as described above.
- Combine the n-gram precisions: take the geometric mean of the 1-gram and 2-gram precisions.
- Determine the final BLEU score: multiply the combined n-gram precision by the brevity penalty.
Since only two n-gram sizes are utilized in this example, the weights would be set to 0.5 for each n-gram precision.
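Using the bleu sketch from the previous section with max_n=2 (so each weight is 0.5), the whole computation looks as follows; the sentences are again invented for illustration:

```python
candidate = "the quick brown fox jumps over the dog".split()
references = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a quick brown fox leaps over the lazy dog".split(),
]

# 1-gram and 2-gram precisions, brevity penalty, and final score in one call
score = bleu(candidate, references, max_n=2)
print(f"BLEU-2: {score:.3f}")  # ~0.817: p1 = 1.0, p2 = 6/7, penalty = exp(1 - 9/8)
```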
Cumulative and Individual BLEU Scores
The BLEU statistic may be used to assess machine translation output in two ways: cumulative and individual.
Individual BLEU is determined for each reference translation separately and then averaged. This method is useful when there are several reference translations for the same source text, since it lets you assess how closely the machine translation resembles each reference on its own.
Cumulative BLEU, on the other hand, is determined by combining the scores across all reference translations and taking their geometric mean. This technique is useful when you want a single number that reflects how well the machine translation matches the reference set as a whole.
Here’s an illustration of the distinction between cumulative and individual BLEU. Assume you have three reference translations (A, B, and C) and a machine-generated translation (D) of the same source sentence. Individual BLEU is calculated by computing the BLEU score between D and each reference independently and then averaging the results, which might look something like this:
- BLEU(A,D) = 0.3
- BLEU(B,D) = 0.4
- BLEU(C,D) = 0.2
- BLEU average = (0.3 + 0.4 + 0.2) / 3 = 0.3
To get the cumulative BLEU score, combine the scores from all three reference translations (A, B, and C) by taking their geometric mean:
- BLEU cumulative = exp(1/3 * (log(0.3) + log(0.4) + log(0.2))) ≈ 0.288
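Both numbers are easy to verify in a couple of lines of Python:

```python
import math

scores = [0.3, 0.4, 0.2]  # BLEU(A,D), BLEU(B,D), BLEU(C,D)

individual = sum(scores) / len(scores)  # arithmetic mean -> 0.3
cumulative = math.exp(sum(math.log(s) for s in scores) / len(scores))  # geometric mean -> ~0.288
print(f"individual: {individual:.3f}  cumulative: {cumulative:.3f}")
```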
As you can see, the individual BLEU scores let you assess how well the machine translation matches each reference translation on its own, while the cumulative BLEU score assesses how well it matches all of the reference translations combined. Both approaches have advantages and disadvantages, and the choice between them depends on the assessment task at hand.
Overall, BLEU is a valuable tool for assessing machine translations; however, it should be combined with other assessment metrics and human judgment to provide a full picture of translation quality.