speechbrain.tokenizers.SentencePiece module

Library for Byte-pair-encoding (BPE) tokenization.

Authors

  • Abdelwahab Heba 2020

  • Loren Lugosch 2020




BPE class that calls the SentencePiece unsupervised text tokenizer from Google.


Mutable streaming context for a single SentencePiece streaming session.



Fetch list of tokens, can be indexed by token id


Assuming the tokenizer is sentencepiece, decodes the input hypothesis but avoids incorrectly stripping leading spaces when streaming.


class speechbrain.tokenizers.SentencePiece.SentencePiece(model_dir, vocab_size, annotation_train=None, annotation_read=None, model_type='unigram', char_format_input=False, character_coverage=1.0, user_defined_symbols=None, max_sentencepiece_length=10, bos_id=-1, eos_id=-1, pad_id=-1, unk_id=0, split_by_whitespace=True, num_sequences=None, annotation_list_to_check=None, annotation_format='csv', text_file=None, add_dummy_prefix=True)[source]

Bases: object

BPE class that calls the SentencePiece unsupervised text tokenizer from Google. Reference: https://github.com/google/sentencepiece

SentencePiece is an unsupervised text tokenizer and detokenizer. It implements subword units such as Byte-pair-encoding (BPE) and the Unigram language model, as well as char/word tokenizers.

Parameters:

  • model_dir (str) – The directory where the model will be saved (or is already stored).

  • vocab_size (int, None, optional) – Vocab size for the chosen tokenizer type (BPE, Unigram). The vocab_size is optional for char, and mandatory for BPE & unigram tokenization.

  • annotation_train (str) – Path of the annotation file which is used to learn the tokenizer. It can be in JSON or csv format.

  • annotation_read (str) – The data entry which contains the word sequence in the annotation file.

  • model_type (str) – One of (bpe, char, unigram). If “bpe”, train unsupervised tokenization into word pieces, see: https://www.aclweb.org/anthology/P16-1162/ If “word”, take the vocabulary from the input text. If “unigram”, do word-piece tokenization using a unigram language model, see: https://arxiv.org/abs/1804.10959 (default: “unigram”)

  • char_format_input (bool) – Whether the read entry is in character format (e.g., a p p l e _ i s _ g o o d). (default: False)

  • character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set. (default: 1.0)

  • user_defined_symbols (string) – String containing a list of symbols separated by commas. User-defined symbols are handled as one piece in any context. (default: None)

  • max_sentencepiece_length (int) – Maximum number of characters for the tokens. (default: 10)

  • bos_id (int) – If -1, bos_id = unk_id = 0; otherwise, bos_id is the given int. (default: -1)

  • eos_id (int) – If -1, eos_id = unk_id = 0; otherwise, eos_id is the given int. (default: -1)

  • pad_id (int) – If -1, pad_id = unk_id = 0; otherwise, pad_id is the given int. (default: -1)

  • unk_id (int) – The token corresponding to an unknown symbol (not in token set). (default: 0)

  • split_by_whitespace (bool) – If False, allow SentencePiece to extract pieces crossing multiple words. This feature is important for Chinese/Japanese/Korean. (default: True)

  • num_sequences (int) – If not None, use at most this many sequences to train the tokenizer (for large datasets). (default: None)

  • annotation_list_to_check (list) – List of annotation files used for checking the accuracy of recovering words from the tokenizer. (default: None)

  • annotation_format (str) – The format of the annotation file. JSON and csv are the supported formats. (default: 'csv')

  • text_file (str) – An alternate path to the text file (needed when multiple models are trained on the same data file)

  • add_dummy_prefix (bool) – If True the tokenizer adds dummy whitespace at the beginning of text. (default: True)
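
As a rough illustration of how the training-related parameters above relate to the underlying SentencePiece trainer, the sketch below assembles a flag string from them. The helper name `build_spm_train_args` and its `model_prefix` argument are hypothetical and not part of SpeechBrain; the flag names are the ones the SentencePiece trainer uses, but the exact query this class builds internally is an assumption here.

```python
def build_spm_train_args(
    text_file,
    model_prefix,
    model_type="unigram",
    vocab_size=None,
    character_coverage=1.0,
    user_defined_symbols=None,
    max_sentencepiece_length=10,
    bos_id=-1,
    eos_id=-1,
    pad_id=-1,
    unk_id=0,
    split_by_whitespace=True,
    add_dummy_prefix=True,
):
    """Hypothetical helper: build a SentencePiece trainer flag string."""
    # vocab_size is optional for char, mandatory for BPE and unigram.
    if model_type != "char" and vocab_size is None:
        raise ValueError("vocab_size is required unless model_type is 'char'")

    flags = {
        "input": text_file,
        "model_prefix": model_prefix,
        "model_type": model_type,
        "character_coverage": character_coverage,
        "max_sentencepiece_length": max_sentencepiece_length,
        "bos_id": bos_id,
        "eos_id": eos_id,
        "pad_id": pad_id,
        "unk_id": unk_id,
        # Boolean flags are written in lowercase.
        "split_by_whitespace": str(split_by_whitespace).lower(),
        "add_dummy_prefix": str(add_dummy_prefix).lower(),
    }
    # Optional flags are only emitted when a value was provided.
    if vocab_size is not None:
        flags["vocab_size"] = vocab_size
    if user_defined_symbols is not None:
        flags["user_defined_symbols"] = user_defined_symbols

    return " ".join(f"--{name}={value}" for name, value in flags.items())


# Illustrative file names only.
print(build_spm_train_args("train.txt", "tokenizer_data/100_bpe",
                           model_type="bpe", vocab_size=100))
```

A string of this shape is what the SentencePiece Python trainer accepts as command-line style arguments; the class takes care of this step, so the helper is only meant to make the parameter semantics concrete.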


>>> import torch
>>> from speechbrain.tokenizers.SentencePiece import SentencePiece
>>> dict_int2lab = {1: "HELLO", 2: "MORNING"}
>>> model_dir = getfixture('tmpdir') / "tokenizer_data"
>>> # Example with csv
>>> annotation_train = "tests/samples/annotation/dev-clean.csv"
>>> annotation_read = "wrd"
>>> model_type = "bpe"
>>> bpe = SentencePiece(str(model_dir), 100, annotation_train, annotation_read, model_type)
>>> batch_seq = torch.Tensor([[1, 2, 2, 1],[1, 2, 1, 0]])
>>> batch_lens = torch.Tensor([1.0, 0.75])
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
>>> # Example using JSON
>>> annotation_train = str(model_dir + "/dev-clean.json")
>>> annotation_read = "wrd"
>>> bpe = SentencePiece(str(model_dir), 100, annotation_train, annotation_read, model_type, annotation_format="json")
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )

__call__(batch, batch_lens=None, ind2lab=None, task='encode')[source]

This __call__ function implements the tokenizer encoder and decoder (restoring the string of words) for BPE, Regularized BPE (with unigram), and char (speechbrain/nnet/RNN.py).

Parameters:

  • batch – Contains the original labels. Shape: [batch_size, max_length]. A list if batch_lens = None and task = “decode_from_list”.

  • batch_lens (tensor.LongTensor) – Contains the relative length of each label sequence. Must be a 1D tensor of shape: [batch_size]. (default: None)

  • ind2lab (dict) – Dictionary which maps the index from label sequences (batch tensor) to string label.

  • task (str) – One of (“encode”, “decode”, “decode_from_list”).

    “encode”: convert the batch tensor into sequences of tokens. The output contains a list of (tokens_seq, tokens_lens).

    “decode”: convert a tensor of tokens to a list of word sequences.

    “decode_from_list”: convert a list of token sequences to a list of word sequences.
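
To make the roles of batch, batch_lens and ind2lab concrete, here is a minimal pure-Python sketch of the preprocessing that an "encode" call implies: relative lengths are turned into absolute lengths, padding is dropped, and label ids are mapped to words. The helper `labels_to_word_sequences` is hypothetical, it uses plain lists instead of tensors, and the rounding rule is an assumption rather than the library's confirmed behavior.

```python
def labels_to_word_sequences(batch, batch_lens, ind2lab):
    """Hypothetical helper: turn padded label ids into word strings."""
    max_length = len(batch[0])
    sentences = []
    for label_ids, rel_len in zip(batch, batch_lens):
        # Relative length (fraction of max_length) -> absolute length.
        abs_len = int(round(rel_len * max_length))
        # Keep only the valid (non-padded) labels and map them to words.
        words = [ind2lab[int(idx)] for idx in label_ids[:abs_len]]
        sentences.append(" ".join(words))
    return sentences


batch = [[1, 2, 2, 1], [1, 2, 1, 0]]
batch_lens = [1.0, 0.75]
ind2lab = {1: "HELLO", 2: "MORNING"}

sentences = labels_to_word_sequences(batch, batch_lens, ind2lab)
print(sentences)
# ['HELLO MORNING MORNING HELLO', 'HELLO MORNING HELLO']
```

The second sequence has relative length 0.75, so only its first three labels are used and the trailing padding id 0 never needs an entry in ind2lab. Word strings of this kind are what the tokenizer would then split into pieces.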


Fetch list of tokens, can be indexed by token id

The resulting list can be used to map id to token.

Parameters:

model_path (str) – Path to SentencePiece model

Returns:

Tokens in order by id (can be indexed by id)

Return type:


class speechbrain.tokenizers.SentencePiece.SentencePieceDecoderStreamingContext(emitted_symbol_count: int = 0)[source]

Bases: object

Mutable streaming context for a single SentencePiece streaming session.

emitted_symbol_count: int = 0

The number of symbols that have been emitted for this transcription.

speechbrain.tokenizers.SentencePiece.spm_decode_preserve_leading_space(tokenizer: SentencePieceProcessor, hyps: List[int], context: SentencePieceDecoderStreamingContext) → List[str][source]

Assuming the tokenizer is sentencepiece, decodes the input hypothesis but avoids incorrectly stripping leading spaces when streaming. Operates on a single hypothesis, not a batch of hypotheses.

Normally, the tokenizer always decodes full sentences at a time, with the consequence that the first space in decoding will get removed. However, when streaming, we might be decoding mid-utterance where spaces must not be removed mid-sentence. This function handles this case.

For example, if within the same streaming context you decode ["▁how", "▁are"] then ["▁you"], the decoder would normally return "how areyou" instead of "how are you", which is what this function returns.
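
The behavior described above can be sketched without a trained model. The toy code below works on piece strings rather than token ids and replaces the real SentencePieceProcessor with a stand-in decoder that strips the leading space; `ToyStreamingContext`, `toy_full_sentence_decode` and `toy_decode_preserve_leading_space` are hypothetical names used for illustration only and are not the library's implementation.

```python
from dataclasses import dataclass
from typing import List

SPACE_MARK = "\u2581"  # the "▁" marker SentencePiece uses for spaces


@dataclass
class ToyStreamingContext:
    """Stand-in for the streaming context: counts emitted symbols."""
    emitted_symbol_count: int = 0


def toy_full_sentence_decode(pieces: List[str]) -> str:
    """Stand-in sentence-level decoder: always drops the leading space."""
    return "".join(pieces).replace(SPACE_MARK, " ").lstrip(" ")


def toy_decode_preserve_leading_space(
    pieces: List[str], context: ToyStreamingContext
) -> str:
    """Decode one chunk, restoring the leading space when mid-stream."""
    text = toy_full_sentence_decode(pieces)
    if pieces:
        mid_stream = context.emitted_symbol_count > 0
        if mid_stream and pieces[0].startswith(SPACE_MARK):
            # The sentence-level decoder removed a space we still need.
            text = " " + text
        context.emitted_symbol_count += len(pieces)
    return text


chunks = [["▁how", "▁are"], ["▁you"]]

naive = "".join(toy_full_sentence_decode(c) for c in chunks)
print(naive)  # how areyou

context = ToyStreamingContext()
streamed = "".join(
    toy_decode_preserve_leading_space(c, context) for c in chunks
)
print(streamed)  # how are you
```

The same context object has to be reused for every chunk of one stream, since the decision to restore the space depends on how many symbols were emitted before.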

Parameters:

  • tokenizer (sentencepiece.SentencePieceProcessor) – The SentencePiece processor to use for decoding.

  • hyps (list of output token hypotheses) – List of tokens to decode, of any length >= 0.

  • context (SentencePieceDecoderStreamingContext) – Mutable streaming context for the sentencepiece decoder, which should be reused across calls for the same decoding stream.


Returns:

Decoded text. Leading spaces are preserved, except at the start of a transcription.

Return type:
