Evaluate Language Model Translations with BLEU & ROUGE Scores

Language models have significantly advanced, with translation being a key application. To evaluate language model translations effectively, we need quantitative metrics that compare the generated outputs with human references. Two widely-used metrics for this purpose are BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This article provides a complete guide to leveraging these metrics, including their advantages, disadvantages, and limitations.


Why Evaluate Translations?

Evaluating machine translations is crucial to:

  1. Ensure accuracy: Verify that the model captures the correct meaning, tone, and context.
  2. Compare models: Benchmark one model against others to select the best performer.
  3. Refine performance: Identify areas where the model struggles and optimize it further.

Among many evaluation techniques, BLEU and ROUGE stand out as established standards in the field. However, each has its unique strengths and limitations.


Understanding BLEU Score to Evaluate Language Model Translations

What is BLEU?

The BLEU score measures the precision of n-grams (contiguous word sequences) in the candidate translation compared to one or more reference translations. It focuses on how many words in the candidate overlap with the reference while penalizing overly short translations using a brevity penalty.

How Does BLEU Work?

BLEU considers n-grams from unigrams (single words) to higher-order n-grams (e.g., sequences of 2, 3, or 4 words). The final BLEU score is a weighted geometric mean of precision scores for these n-grams, adjusted with the brevity penalty.
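
For a concrete sense of the computation, here is a minimal sketch using NLTK's sentence_bleu; the sentences are illustrative, and NLTK expects pre-tokenized word lists:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()    # tokenized reference translation
candidate = "the cat sat on the mat".split()   # tokenized candidate translation

# Smoothing keeps the geometric mean from collapsing to zero when a
# higher-order n-gram has no overlap at all.
smoothie = SmoothingFunction().method4
score = sentence_bleu(
    [reference],                               # a list of one or more references
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),          # equal weights for 1- to 4-grams (standard BLEU-4)
    smoothing_function=smoothie,
)
print(f"BLEU-4: {score:.3f}")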

Advantages of BLEU

  1. Automated and fast: Enables quick evaluation of large datasets.
  2. Widely used: Standard for benchmarking translation systems.
  3. Reproducible: Provides consistent results across different experiments.

Disadvantages of BLEU

  1. Ignores semantics: BLEU focuses solely on word overlap and cannot measure the meaning of translations.
  2. Penalizes diversity: Valid translations that merely differ in phrasing from the reference receive low scores (see the sketch after this list).
  3. Fixed n-gram focus: Fails to capture long-term dependencies or word order beyond the n-gram range.
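
To make the first two limitations concrete, compare an exact match with a valid paraphrase. This is a hedged sketch with made-up sentences, using NLTK's default BLEU-4 weights:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "the meeting was postponed until next week".split()
exact      = "the meeting was postponed until next week".split()
paraphrase = "they delayed the meeting to the following week".split()  # same meaning, different wording

smoothie = SmoothingFunction().method4
print(sentence_bleu([reference], exact, smoothing_function=smoothie))       # 1.0: perfect n-gram overlap
print(sentence_bleu([reference], paraphrase, smoothing_function=smoothie))  # far lower, despite being a valid translation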

Ranges and Interpretation

  1. BLEU scores range from 0 to 1, where 1 indicates a perfect match with the reference.
  2. In practice, scores above 0.5 are considered high for machine translations.
  3. Lower scores may still indicate acceptable translations, especially for diverse language pairs.

Understanding ROUGE Score to Evaluate Language Model Translations

What is ROUGE?

ROUGE measures recall rather than precision, focusing on how much of the reference is captured by the candidate translation. It evaluates word overlap, bigram overlap, and the longest common subsequence (LCS) between the reference and candidate translations.

Types of ROUGE

  1. ROUGE-1: Measures unigram (word-level) overlap.
  2. ROUGE-2: Measures bigram (two-word sequence) overlap.
  3. ROUGE-L: Measures the longest common subsequence (LCS) to capture sentence structure similarity (see the sketch after this list).
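
These three variants can be computed with the rouge-score package; below is a minimal sketch on an illustrative sentence pair (note that scorer.score() takes the reference first and the candidate second):

from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)   # reference first, candidate second

for name, score in scores.items():
    # Each entry exposes precision, recall, and F1 (fmeasure).
    print(f"{name}: recall={score.recall:.3f} precision={score.precision:.3f} f1={score.fmeasure:.3f}")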

Advantages of ROUGE

  1. Recall-focused: Ensures important information from the reference is retained.
  2. Adaptable: Works well for summarization tasks as well as translation.
  3. Human-readable: Offers intuitive scores tied to overlapping content.

Disadvantages of ROUGE

  1. Context-insensitive: Cannot measure grammatical or semantic correctness.
  2. Dependent on references: Performance varies greatly based on reference quality.
  3. Limited to overlap: Cannot handle valid paraphrases or synonyms effectively.

Ranges and Interpretation

  1. ROUGE scores are typically presented as percentages (0 to 100%) or decimals (0 to 1).
  2. Higher ROUGE scores indicate better recall.
  3. ROUGE-1 and ROUGE-2 scores above 0.6 are generally considered good for translations.

Advantages and Limitations of BLEU and ROUGE

BLEU Advantages

  1. Measures precision and penalizes brevity, making it suitable for translations that emphasize fluency.
  2. Straightforward implementation and compatibility with n-gram-based evaluations.

BLEU Limitations

  1. Struggles with synonyms and paraphrases.
  2. Focuses on word overlap rather than the overall context or meaning.

ROUGE Advantages

  1. Captures the content coverage of a translation.
  2. Performs well for summarization and language pairs where recall matters more.

ROUGE Limitations

  1. Ignores precision, which can result in overly verbose translations.
  2. Cannot differentiate between grammatically correct and incorrect phrases.

Setting Up the Evaluation

To demonstrate BLEU and ROUGE evaluation, we’ll use the H2OGPTE client for translations and Python’s nltk and rouge-score libraries for metric calculations.

1: Installing Dependencies

pip install h2ogpte nltk rouge-score

Additionally, download NLTK resources:

import nltk
nltk.download('punkt')

2: Initializing the H2OGPTE Client

from config import H2O_GPT_E_API_KEY, REMOTE_ADDRESS
from h2ogpte import H2OGPTE

client = H2OGPTE(address=REMOTE_ADDRESS, api_key=H2O_GPT_E_API_KEY)
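
The config module imported above is not shown here; a minimal, hypothetical config.py would simply hold your endpoint and API key (both values below are placeholders to replace with your own):

# config.py (hypothetical example; replace the placeholders with your own values)
H2O_GPT_E_API_KEY = "your-h2ogpte-api-key"
REMOTE_ADDRESS = "https://your-h2ogpte-instance.example.com"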

3: Translation Function

def translate_text(question, target_language="Spanish", llm="gpt-4o-mini", llm_args=None):
    # Default generation settings: moderate temperature, plain-text response.
    if llm_args is None:
        llm_args = dict(temperature=0.5, max_new_tokens=100, response_format="text")

    prompt = (
        "You are a professional translation assistant. "
        f"Translate ```{question}``` from English into {target_language}, "
        "maintaining the original meaning, tone, and context. "
        "Provide only the translation in the final output and nothing else."
    )

    response = client.answer_question(question=prompt, llm=llm, llm_args=llm_args)
    return response.content.strip()
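
As a quick sanity check of the helper, assuming the client above is configured correctly (the exact output depends on the model and temperature):

print(translate_text("Good morning", target_language="Spanish"))
# typically prints something close to: Buenos días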

4: Metric Evaluation Function

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def evaluate_translation(reference, candidate):
    # Smoothing avoids zero BLEU scores when an n-gram has no overlap,
    # which is common for very short sentences.
    smoothie = SmoothingFunction().method4
    # weights=(1.0, 0, 0, 0) restricts the score to unigram precision (BLEU-1),
    # a reasonable choice for the short sentences evaluated here.
    bleu_score = sentence_bleu(
        [reference.split()],
        candidate.split(),
        weights=(1.0, 0, 0, 0),
        smoothing_function=smoothie,
    )

    # ROUGE expects raw strings; score() takes the reference first, then the candidate.
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(reference, candidate)

    return {
        "BLEU": bleu_score,
        "ROUGE-1": rouge_scores['rouge1'].fmeasure,
        "ROUGE-2": rouge_scores['rouge2'].fmeasure,
        "ROUGE-L": rouge_scores['rougeL'].fmeasure
    }
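
A quick way to validate the scorer before running the full workflow: an exact match should score 1.0 on every metric, while partial overlap yields intermediate values.

print(evaluate_translation("Buenos días", "Buenos días"))
# exact match: BLEU and all ROUGE variants are 1.0
print(evaluate_translation("¿Cómo estás?", "¿Cómo te encuentras?"))
# partial overlap: lower but non-zero scores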

Demonstration with Example Data

Dataset

wmt_data = [
    {"source": "How are you?", "reference": "¿Cómo estás?"},
    {"source": "Good morning", "reference": "Buenos días"},
    {"source": "What is your name?", "reference": "¿Cómo te llamas?"},
]

Evaluation Workflow

results = []
for data in wmt_data:
    source = data["source"]
    reference = data["reference"]
    
    candidate = translate_text(source)
    metrics = evaluate_translation(reference, candidate)
    results.append({"source": source, "reference": reference, "candidate": candidate, "metrics": metrics})

Displaying Results

for result in results:
    print(f"\nSource: {result['source']}")
    print(f"Reference: {result['reference']}")
    print(f"Candidate: {result['candidate']}")
    print(f"Metrics: {result['metrics']}")

Sample output: evaluation results for each translated sentence, showing the source, reference, candidate, and metric scores.
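
Per-sentence scores on three examples are noisy, so for model comparison it also helps to report corpus-level averages. A minimal sketch, assuming the results list built above:

# Average each metric over the whole dataset for a rough corpus-level summary.
metric_names = ["BLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L"]
averages = {name: sum(r["metrics"][name] for r in results) / len(results) for name in metric_names}
print("Average scores:", averages)

Note that averaging sentence-level BLEU is only an approximation of corpus-level BLEU; NLTK's corpus_bleu computes the latter directly from pooled n-gram counts.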

Interested in more content on Generative AI? Check out my article Andrew Ng’s Python AISuite, a Game-Changer for AI Developers!


Conclusion

BLEU and ROUGE scores are invaluable tools for evaluating language model translations. However, they are not without limitations. BLEU excels at precision-based evaluation, while ROUGE highlights recall, making them complementary metrics. Despite their shortcomings in capturing semantic meaning, they remain vital for quick and reproducible assessments. For a more holistic evaluation, consider combining these metrics with human judgment or semantic similarity metrics such as METEOR and BERTScore.

Contact us with any queries. If you’re eager to learn how to use H2OGPTE, stay tuned for the next article! 🙂

GitHub Repository Link: Evaluate Large Language Model Translation using BLEU and ROUGE Score.
