speechbrain.utils.bertscore module

Provides a metrics class for the BERTScore metric.

Authors:
  • Sylvain de Langen, 2024

Summary

Classes:

BERTScoreStats

Computes BERTScore with a provided HuggingFace Transformers text encoder, using the method described in the paper BERTScore: Evaluating Text Generation with BERT.

Functions:

get_bert_token_mask

Returns a token mask with special tokens masked.

get_bertscore_token_weights

Returns token weights for use with the BERTScore metric.

Reference

class speechbrain.utils.bertscore.BERTScoreStats(lm: TextEncoder, batch_size: int = 64, use_idf: bool = True, sentence_level_averaging: bool = True, allow_matching_special_tokens: bool = False)[source]

Bases: MetricStats

Computes BERTScore with a provided HuggingFace Transformers text encoder, using the method described in the paper BERTScore: Evaluating Text Generation with BERT.

BERTScore operates over contextualized token embeddings (e.g. the output of BERT, though many other models would work). Since cosine similarities are used, every score lies between -1 and 1. See the linked resources for more details.
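
To make the greedy matching concrete, below is a minimal sketch of the core computation for a single pair of embedded sentences, in plain PyTorch. It ignores IDF weighting, special-token masking and batching, all of which this class handles; the helper name and shapes are illustrative, not part of this module.

    import torch
    import torch.nn.functional as F

    def bertscore_pair(ref_emb: torch.Tensor, hyp_emb: torch.Tensor):
        """Greedy-matched BERTScore for one pair of contextualized
        embeddings, shaped [ref_len, dim] and [hyp_len, dim]."""
        # Cosine similarity reduces to a matrix product of L2-normalized rows.
        sim = F.normalize(ref_emb, dim=-1) @ F.normalize(hyp_emb, dim=-1).T

        # Each reference token greedily matches its most similar hypothesis
        # token (recall); each hypothesis token matches its most similar
        # reference token (precision).
        recall = sim.max(dim=1).values.mean()
        precision = sim.max(dim=0).values.mean()
        f1 = 2 * precision * recall / (precision + recall)
        return recall, precision, f1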

Special tokens (as queried from the tokenizer) are entirely ignored.

The authors’ reference implementation of the metric can be found at https://github.com/Tiiiger/bert_score. The linked page extensively describes the approach and compares how BERTScore relates to human evaluation across many different models.

Warning

Out of the box, this implementation may not strictly match the results of the reference implementation. Please read the argument documentation to understand the differences.

Parameters:
  • lm (speechbrain.lobes.models.huggingface_transformers.TextEncoder) – HF Transformers tokenizer and text encoder wrapper to use as the LM.

  • batch_size (int, optional) – How many pairs of utterances should be considered at once. Higher is faster but may result in OOM.

  • use_idf (bool, optional) – If enabled (default), tokens in the reference are weighted by their Inverse Document Frequency (IDF), which reduces the impact of common words that carry less information. Every appended sentence is considered a document in the IDF calculation.

  • sentence_level_averaging (bool, optional) – When True, the final recall/precision metrics are averaged over each tested sentence rather than over each tested token, i.e. a very long sentence weighs as much as a very short sentence in the final metrics. The default is True, which matches the reference implementation.

  • allow_matching_special_tokens (bool, optional) – When True, non-special tokens may match against special tokens (e.g. [CLS]/[SEP]) during greedy matching. The batch size must be 1 due to padding handling. The default is False, which differs from the behavior of the reference implementation (see bert_score#180).
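
A minimal end-to-end usage sketch, assuming TextEncoder accepts the source/save_path arguments common to SpeechBrain's HF Transformers wrappers; the model name, paths and sample sentences are illustrative:

    from speechbrain.lobes.models.huggingface_transformers import TextEncoder
    from speechbrain.utils.bertscore import BERTScoreStats

    # Assumption: TextEncoder takes a HF model name and a local cache
    # directory, like other SpeechBrain HF wrappers; adjust as needed.
    lm = TextEncoder(source="roberta-base", save_path="pretrained_models/")

    stats = BERTScoreStats(lm, batch_size=32, use_idf=True)
    stats.append(
        ids=["utt1", "utt2"],
        predict=["the cat sat on the mat", "hello world"],
        target=["a cat sat on a mat", "hello there, world"],
    )
    # Returns a dict with bertscore-recall, bertscore-precision, bertscore-f1.
    print(stats.summarize())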

clear()[source]

Clears the collected statistics

append(ids, predict, target)[source]

Appends inputs, predictions and targets to internal lists

Parameters:
  • ids (list) – the string IDs for the samples

  • predict (list) – the model’s predictions in a tokenizable format

  • target (list) – the ground truths in a tokenizable format

summarize(field=None)[source]

Summarizes the metric scores. Performs the actual LM inference and BERTScore estimation.

Full set of fields:
  • bertscore-recall, optionally weighted by the IDF of reference tokens

  • bertscore-precision, optionally weighted by the IDF of hypothesis tokens

  • bertscore-f1

Parameters:

field (str, optional) – If provided, returns only the selected statistic. Otherwise, returns all computed statistics.

Returns:

Returns a float if field is provided, otherwise returns a dictionary containing all computed stats.

Return type:

float or dict
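
Continuing the usage sketch above, passing a field name returns a single float:

    # 'stats' is the BERTScoreStats instance from the earlier sketch.
    f1 = stats.summarize(field="bertscore-f1")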

speechbrain.utils.bertscore.get_bert_token_mask(tokenizer) → BoolTensor[source]

Returns a token mask with special tokens masked.

Parameters:

tokenizer – HuggingFace tokenizer for the BERT model.

Returns:

A mask tensor that can be indexed by token ID (of shape [vocab_size]).

Return type:

torch.BoolTensor
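
A small usage sketch; the model name is illustrative:

    from transformers import AutoTokenizer
    from speechbrain.utils.bertscore import get_bert_token_mask

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    mask = get_bert_token_mask(tokenizer)
    print(mask.shape)                    # [vocab_size]
    print(mask[tokenizer.cls_token_id])  # entry for the [CLS] special token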

speechbrain.utils.bertscore.get_bertscore_token_weights(tokenizer, corpus: Iterable[str] | None = None) → Tensor[source]

Returns token weights for use with the BERTScore metric. When a corpus is specified, the weights are the Inverse Document Frequency (IDF) of each token, computed over that corpus.

The IDF formula is adapted from the BERTScore paper, with plus-one smoothing so that words missing from the reference corpus still receive a finite weight.
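
Concretely, the authors’ reference implementation uses the plus-one-smoothed form

    idf(w) = log((M + 1) / (df(w) + 1))

where M is the number of documents and df(w) is the number of documents containing token w. This sketch follows the reference implementation and may differ from this module in minor details.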

Parameters:
  • tokenizer – HuggingFace tokenizer for the BERT model.

  • corpus (Iterable[str], optional) – Iterable corpus to compute the IDF from. Each iterated value is considered a document in the corpus in the IDF calculation. If omitted, no IDF weighting is done.

Returns:

A floating-point tensor of shape [vocab_size] that can be indexed by token ID, where each entry is the factor by which the impact of the corresponding token should be multiplied.

Return type:

torch.Tensor
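
A small usage sketch; the model name and toy corpus are illustrative:

    from transformers import AutoTokenizer
    from speechbrain.utils.bertscore import get_bertscore_token_weights

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Each iterated string counts as one document in the IDF calculation.
    corpus = ["the cat sat on the mat", "the dog barked", "a cat chased the dog"]
    weights = get_bertscore_token_weights(tokenizer, corpus)
    print(weights.shape)  # [vocab_size]; rarer tokens receive larger weights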