
What are tokenizers and why are they important in NLP?


The Scenario

You are an ML engineer at a global e-commerce company. The company is expanding into new markets and needs to be able to analyze customer feedback in multiple languages. You are working on a new sentiment analysis model that has been trained on a multilingual dataset.

However, the model's performance is poor. You have tried a variety of model architectures and hyperparameters, but nothing has helped. You suspect that the issue might be with the tokenization process.

You are currently using a word-based tokenizer, and you have noticed that the vocabulary size is very large and that there are a lot of out-of-vocabulary (OOV) tokens.

The Challenge

Explain why a word-based tokenizer is not suitable for this task. What tokenization strategy would you use instead, and why? Outline your plan for training a new tokenizer from scratch and using it to tokenize the dataset.

Wrong Approach

A junior engineer might not recognize that the tokenization process is the source of the problem. They might keep trying different model architectures and hyperparameters without realizing that the underlying issue is how the data is represented.

Right Approach

A senior engineer would immediately suspect that the tokenization process is the source of the problem. They would be able to explain why a word-based tokenizer is not suitable for a multilingual dataset and would know how to train a new tokenizer from scratch using a more appropriate strategy, like BPE or SentencePiece.

Step 1: Diagnose the Problem

The first step is to diagnose the problem with the current tokenization strategy. A word-based tokenizer is not suitable for a multilingual dataset for two main reasons:

  1. Large vocabulary: A multilingual dataset will have a very large number of unique words, which will lead to a very large vocabulary. This will increase the memory footprint of the model and can slow down training.
  2. OOV tokens: A word-based tokenizer cannot handle words that are not in its vocabulary. This is a major problem for a multilingual dataset, because there will always be new words that the tokenizer has not seen before. The sketch below makes this concrete.
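
To make the OOV problem concrete, here is a minimal, self-contained sketch. The tiny vocabulary and the two review sentences are invented purely for illustration; the point is that a word-level tokenizer maps every unseen word to a single [UNK] id, so a review in a new language loses all of its content.

# Toy word-level vocabulary built from English training data only.
vocab = {"[UNK]": 0, "the": 1, "delivery": 2, "was": 3, "fast": 4}

def word_tokenize(text):
    # Any word missing from the vocabulary collapses to the [UNK] id.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(word_tokenize("The delivery was fast"))      # [1, 2, 3, 4]
print(word_tokenize("Die Lieferung war schnell"))  # [0, 0, 0, 0] -- every word is OOV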

Step 2: Choose a Better Tokenization Strategy

A subword-based tokenization strategy, like BPE (Byte-Pair Encoding), WordPiece, or SentencePiece, is a much better choice for this task.

  - BPE: starts with a vocabulary of individual characters and iteratively merges the most frequent pairs of tokens. Pros: handles OOV words and works well for morphologically rich languages. Cons: can be slow to train.
  - WordPiece: similar to BPE, but merges tokens based on the likelihood of the training data. Pros: used by popular models such as BERT and is generally faster than BPE. Cons: can be more difficult to implement.
  - SentencePiece: treats the input text as a raw stream of Unicode characters. Pros: language-agnostic, so it can handle any language without requiring pre-tokenization. Cons: can be more difficult to interpret.

We will use SentencePiece because it is language-agnostic and well suited to multilingual datasets.
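
As a quick sanity check of this choice, a pretrained multilingual SentencePiece tokenizer can be inspected. The snippet below uses the publicly available xlm-roberta-base checkpoint purely as an example (it assumes the transformers library is installed and the checkpoint can be downloaded); unseen or rare words are split into smaller known pieces rather than mapped to an unknown token.

from transformers import AutoTokenizer

# xlm-roberta-base ships a SentencePiece (Unigram) tokenizer trained on ~100 languages.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Both sentences are segmented into subword pieces; nothing is discarded as <unk>.
print(tok.tokenize("The delivery was unbelievably fast"))
print(tok.tokenize("Die Lieferung war unglaublich schnell"))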

Step 3: Train a New Tokenizer

The next step is to train a new tokenizer from scratch on our own corpus. We can use the tokenizers library from Hugging Face to do this. The example below trains a BPE tokenizer, which is the simplest to set up; a SentencePiece-style variant follows the same workflow and is sketched after the code.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1. Initialize a tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Create a trainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# 3. Train the tokenizer
files = ["my_dataset.txt"] # A text file with one sentence per line
tokenizer.train(files, trainer)

# 4. Save the tokenizer
tokenizer.save("my-tokenizer.json")
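
Since the plan calls for a SentencePiece-style tokenizer, here is the same workflow sketched with the tokenizers library's Unigram model and Metaspace pre-tokenizer, which together approximate SentencePiece's default behaviour. The file names and special tokens simply mirror the BPE example above.

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

# Unigram + Metaspace approximates SentencePiece's defaults: whitespace is encoded
# as a visible marker and subword pieces are learned probabilistically.
sp_tokenizer = Tokenizer(Unigram())
sp_tokenizer.pre_tokenizer = Metaspace()

sp_trainer = UnigramTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    unk_token="[UNK]",
)

sp_tokenizer.train(["my_dataset.txt"], sp_trainer)
sp_tokenizer.save("my-sentencepiece-tokenizer.json")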

Step 4: Use the New Tokenizer

Once we have trained the new tokenizer, we can use it to tokenize our dataset and fine-tune our model.

from transformers import PreTrainedTokenizerFast

# Load the trained tokenizer and register the special tokens so that padding
# and masking work during fine-tuning
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)

# ... (use the tokenizer to tokenize the dataset and fine-tune the model) ...
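
For example, the tokenizer can now be called directly to build padded, truncated batches for fine-tuning. The sentences below are hypothetical, and the resulting ids depend entirely on the corpus the tokenizer was trained on.

# Encode a small batch; padding works because pad_token was registered above.
batch = tokenizer(
    ["The delivery was fast", "Die Lieferung war unglaublich schnell"],
    padding=True,
    truncation=True,
    max_length=64,
)
print(batch["input_ids"])
print(batch["attention_mask"])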

Practice Question

You are working with a language that does not use spaces to separate words, like Japanese. Which tokenization strategy would be the most suitable?