Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens.
These tokens can be as small as characters or as long as words. The primary reason this process matters is that it helps machines understand human language by breaking it down into bite-sized pieces, which are easier to analyze.
To delve deeper into the mechanics, consider the sentence, "Chatbots are helpful." When we tokenize this sentence by words, it transforms into an array of individual words:
["Chatbots", "are", "helpful"].
This is a straightforward approach where spaces typically dictate the boundaries of tokens. However, if we were to tokenize by characters, the sentence would fragment into
["C", "h", "a", "t", "b", "o", "t", "s", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"].
This character-level breakdown is more granular and can be especially useful for certain languages or specific NLP tasks.
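To make this concrete, here is a minimal sketch using nothing but the Python standard library. Note that a plain whitespace split keeps the trailing period attached to "helpful", which is one reason real tokenizers do more than split on spaces.

```python
sentence = "Chatbots are helpful."

# Word-level tokens: a naive whitespace split.
word_tokens = sentence.split()
print(word_tokens)    # ['Chatbots', 'are', 'helpful.']

# Character-level tokens: every character, including spaces and punctuation.
char_tokens = list(sentence)
print(char_tokens)    # ['C', 'h', 'a', 't', 'b', 'o', 't', 's', ' ', 'a', ...]
```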
Tokenization methods vary based on the granularity of the text breakdown and the specific requirements of the task at hand.
Word tokenization. This method breaks text down into individual words. It's the most common approach and is particularly effective for languages with clear word boundaries like English.
Character tokenization. Here, the text is segmented into individual characters. This method is beneficial for languages that lack clear word boundaries or for tasks that require a granular analysis, such as spelling correction.
Subword tokenization. Striking a balance between word and character tokenization, this method breaks text into units that might be larger than a single character but smaller than a full word.
For instance, "Chatbots" could be tokenized into "Chat" and "bots". This approach is especially useful for languages that form meaning by combining smaller units or when dealing with out-of-vocabulary words in NLP tasks.
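To illustrate the idea, here is a toy greedy longest-match sketch over a small hand-picked subword vocabulary. Real subword tokenizers such as BPE and WordPiece learn their vocabularies from data rather than relying on a fixed list like the one below.

```python
# Hypothetical toy vocabulary, hand-picked purely for illustration.
SUBWORD_VOCAB = {"Chat", "bots", "bot", "s", "help", "ful", "are"}

def subword_tokenize(word, vocab=SUBWORD_VOCAB):
    """Greedy longest-match segmentation, in the spirit of WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # try the longest candidate first
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append(word[start])            # unknown: fall back to a character
            start += 1
    return tokens

print(subword_tokenize("Chatbots"))   # ['Chat', 'bots']
print(subword_tokenize("helpful"))    # ['help', 'ful']
```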
The landscape of Natural Language Processing offers a plethora of tools, each tailored to specific needs and complexities. Here's a guide to some of the most prominent tools and methodologies available for tokenization:
NLTK (Natural Language Toolkit). A stalwart in the NLP community, NLTK is a comprehensive Python library that caters to a wide range of linguistic needs. It offers both word and sentence tokenization functionalities, making it a versatile choice for beginners and seasoned practitioners alike.
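A minimal NLTK sketch (it assumes the punkt tokenizer models have been downloaded; newer NLTK releases may also ask for the punkt_tab resource):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")   # tokenizer models; only needed the first time

text = "Chatbots are helpful. They answer questions around the clock."
print(sent_tokenize(text))
# ['Chatbots are helpful.', 'They answer questions around the clock.']

print(word_tokenize("Chatbots are helpful."))
# ['Chatbots', 'are', 'helpful', '.']   <- punctuation becomes its own token
```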
spaCy. A modern and efficient alternative to NLTK, spaCy is another Python-based NLP library. It is built for speed, supports multiple languages, and is a favorite for large-scale applications.
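A comparable spaCy sketch, assuming the small English pipeline en_core_web_sm has been installed with python -m spacy download en_core_web_sm:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Chatbots are helpful.")
print([token.text for token in doc])
# ['Chatbots', 'are', 'helpful', '.']
```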
BERT tokenizer. The subword (WordPiece) tokenizer that ships with the BERT pre-trained model. Because it splits unknown words into known subword pieces, it handles the quirks of real-world text gracefully, making it a top choice for advanced NLP projects.
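A short sketch using the Hugging Face transformers library; the exact subword split depends on the learned WordPiece vocabulary, so the output shown is only indicative:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with the original BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Chatbots are helpful."))
# e.g. ['chat', '##bots', 'are', 'helpful', '.']  ('##' marks a continuation piece)
```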
Advanced techniques:
Byte-Pair Encoding (BPE). An adaptive tokenization method that builds its vocabulary by repeatedly merging the most frequent pairs of symbols in a corpus. It's particularly effective for languages that form meaning by combining smaller units; a step-by-step walkthrough appears later in this article.
SentencePiece. An unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation tasks. It handles multiple languages with a single model and can tokenize text into subwords, making it versatile for various NLP tasks.
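A sketch of the sentencepiece Python API; corpus.txt, the model prefix, and the vocabulary size below are placeholders chosen for illustration:

```python
import sentencepiece as spm

# Train a small subword model on a plain-text file (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=2000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Chatbots are helpful.", out_type=str))
# e.g. ['▁Chat', 'bots', '▁are', '▁helpful', '.']  ('▁' marks a word boundary)
```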
Your choice of tool should align with the specific requirements of your project. For those taking their initial steps in NLP, NLTK or spaCy might offer a more approachable learning curve. However, for projects demanding a deeper understanding of context and nuance, the BERT tokenizer stands out as a robust option.
Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.
BPE begins with individual characters and progressively merges the most frequent pairs.
It enables the representation of words using subword units, effectively handling out-of-vocabulary words.
The final vocabulary size and the number of iterations depend on the specific application and the desired balance between vocabulary size and representation granularity.
Suppose our corpus consists of five words:
running runner jumped jumping highest
Initialization: The vocabulary starts with all the unique characters present in the corpus.
Vocabulary = { r, u, n, i, g, e, j, m, p, d, h, s, t }
Iterative Merging: We now iterate, identifying the most frequent pair of adjacent tokens and merging them into a new token.
Iteration 1:
Most frequent pair: i n (occurs 2 times, in running and jumping; several pairs are tied at this count, so we pick one of them)
Merge i and n into a new token in
Updated Vocabulary: { r, u, n, i, g, e, j, m, p, d, h, s, t, in }
Iteration 2:
Most frequent pair: n n (occurs 2 times: running, runner)
Merge n and n into a new token nn
Updated Vocabulary: { r, u, n, i, g, e, j, m, p, d, h, s, t, in, nn }
Iteration 3:
Most frequent pair: in g (occurs 2 times: running, jumping)
Merge in and g into a new token ing
Updated Vocabulary: { r, u, n, i, g, e, j, m, p, d, h, s, t, in, nn, ing }
Iteration 4:
Most frequent pair: r u (occurs 2 times: running, runner)
Merge r and u into a new token ru
Updated Vocabulary: { r, u, n, i, g, e, j, m, p, d, h, s, t, in, nn, ing, ru }
After four merges, the final vocabulary is { r, u, n, i, g, e, j, m, p, d, h, s, t, in, nn, ing, ru }. Applying the learned merges to each word in the corpus gives:
running -> ru nn ing
runner -> ru nn e r
jumped -> j u m p e d
jumping -> j u m p ing
highest -> h i g h e s t
Notice that er, ed, igh, and est were never learned as merges, so those parts of the words remain single characters; running the algorithm for more iterations would produce larger subword units.
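The walkthrough above can be reproduced with a compact Python sketch. Ties between equally frequent pairs are broken arbitrarily here, so the exact merge order may differ from the one shown, but the mechanics are the same.

```python
from collections import Counter

corpus = ["running", "runner", "jumped", "jumping", "highest"]

# Every word starts as a sequence of single-character tokens.
words = [list(w) for w in corpus]

def pair_counts(words):
    """Count how often each pair of adjacent tokens occurs across the corpus."""
    counts = Counter()
    for tokens in words:
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts

merges = []
for _ in range(4):                                   # number of merge iterations
    counts = pair_counts(words)
    if not counts:
        break
    (a, b), _freq = counts.most_common(1)[0]         # ties broken arbitrarily
    merges.append(a + b)
    for tokens in words:                             # apply the merge everywhere
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]
            else:
                i += 1

print(merges)                          # learned subword units, e.g. ['ru', 'un', ...]
print([" ".join(t) for t in words])    # each word segmented with those merges
```

Production implementations (for example Hugging Face's tokenizers library) add refinements such as end-of-word markers and byte-level fallback, but the core loop is this same count-and-merge cycle.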