Introduction to Natural Language Processing with PyTorch
This is my personal notes from the course: "PyTorch Fundamentals", taught freely by Microsoft. [Link]
I. INTRODUCTION:
In this module, we will explore different neural network architectures for dealing with natural language texts.
In recent years, Natural Language Processing (NLP) has experienced rapid growth as a field, primarily because the performance of language models depends on their overall ability to "understand" text, and they can be trained in an unsupervised manner on large text corpora. Thus, pre-trained text models such as BERT simplified many NLP tasks, and dramatically improved performance.
We will focus on the fundamental aspects of representing text as tensors in PyTorch, and on classical NLP architectures, such as bag-of-words, embeddings and recurrent neural networks.
1) Natural Language Tasks:
There are several NLP tasks that we traditionally try to solve using neural networks:
Text Classification is used when we need to classify a text fragment into one of several pre-defined classes. Examples include e-mail spam detection, news categorization, assigning a support request to one of the categories, and more.
Intent Classification is one specific case of text classification, when we want to map an input utterance in a conversational AI system into one of the intents that represent the actual meaning of the phrase, or the intent of the user.
Sentiment Analysis is a regression task, where we want to understand the degree of negativity of a given piece of text. We may want to label texts in a dataset from the most negative (-1) to the most positive (+1), and train a model that will output a number representing the "positiveness" of a text.
Named Entity Recognition (NER) is the task of extracting certain entities from text, such as dates, addresses, people's names, etc. Together with intent classification, NER is often used in dialog systems to extract parameters from the user's utterance.
Keyword Extraction can be used to find the most meaningful words inside a text, which can then be used as tags.
Text Summarization extracts the most meaningful pieces of text, giving a user a compressed version that contains most of the meaning.
Question Answering is the task of extracting an answer from a piece of text. The model gets a text fragment and a question as input, and needs to find the exact place within the text that contains the answer. For example, given the text "John is a 22 year old student who loves to use Microsoft Learn" and the question How old is John?, the model should provide us with the answer 22.
II. REPRESENTING TEXT AS TENSORS:
1) Representing text:
If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.
We understand what each letter represents, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.
Therefore, we can use different approaches when representing text:
Character-level representation, in which we represent text by treating each character as a number. Given that we have C different characters in our text corpus, the word Hello would be represented by a 5×C tensor, with each letter corresponding to a tensor column in one-hot encoding (see the short sketch after this list).
Word-level representation, in which we create a vocabulary of all words in our text, and then represent words using one-hot encoding. This approach is somewhat better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts - words - we simplify the task for the neural network. However, given a large dictionary size, we need to deal with high-dimensional sparse tensors.
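For illustration, here is a minimal sketch of character-level one-hot encoding; it uses a toy character set built from the word itself rather than a full corpus, so C is tiny here:
import torch
# toy character set derived from the word itself (a real corpus would give a much larger C)
chars = sorted(set('Hello'))                 # ['H', 'e', 'l', 'o'] -> C = 4
char_to_idx = {c: i for i, c in enumerate(chars)}
indices = torch.tensor([char_to_idx[c] for c in 'Hello'])
one_hot = torch.nn.functional.one_hot(indices, num_classes=len(chars))
print(one_hot.shape)   # torch.Size([5, 4]) -- a 5 x C tensor, one row per letter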
Let's start by installing some required Python packages we'll use in this module.
!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlppytorch/requirements.txt
2) Text classification task:
In this module, we will start with a simple text classification task based on the AG_NEWS dataset, which is to classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech. This dataset is built into the torchtext module, so we can easily access it:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
Here, train_dataset and test_dataset contain iterators that return pairs of label (class number) and text respectively, for example:
next(train_dataset)
So, let's print out the first 5 news headlines from our dataset:
for i,x in zip(range(5),train_dataset):
    print(f"**{classes[x[0]]}** -> {x[1]}")
Because datasets are iterators, if we want to use the data multiple times we need to convert it to a list:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)
Now we need to convert text into numbers that can be represented as tensors. If we want a word-level representation, we need to do two things:
use a tokenizer to split text into tokens;
build a vocabulary of those tokens.
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenizer('He said: hello')
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
vocab = torchtext.vocab.Vocab(counter, min_freq=1)
Using the vocabulary, we can easily encode our tokenized string into a set of numbers:
vocab_size = len(vocab)
print(f"Vocab size is {vocab_size}")
def encode(x):
    return [vocab.stoi[s] for s in tokenizer(x)]
encode('I love to play with my words') #output: [283, 2321, 5, 337, 19, 1301, 2357]
3) Bag of Words text representation:
Because words represent meaning, sometimes we can figure out the meaning of a text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like weather, snow are likely to indicate weather forecast, while words like stocks, dollar would count towards financial news.
Bag-of-Words (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, and the vector element contains the number of occurrences of that word in a given document.
Note: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.
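As a quick illustration of that note, here is a minimal sketch with a hypothetical four-word vocabulary, where the BoW vector is literally the sum of the one-hot rows:
import torch
toy_vocab = {'i': 0, 'like': 1, 'hot': 2, 'dogs': 3}      # hypothetical toy vocabulary
tokens = ['i', 'like', 'hot', 'hot', 'dogs']               # tokenized toy sentence
one_hot = torch.nn.functional.one_hot(
    torch.tensor([toy_vocab[t] for t in tokens]), num_classes=len(toy_vocab))
print(one_hot.sum(dim=0))   # tensor([1, 1, 2, 1]) -- the BoW vector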
Below is an example of how to generate a bag-of-words representation using the Scikit-Learn Python library:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'I like hot dogs.',
'The dog ran fast.',
'Its hot outside.',
]
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()
# array([[1, 1, 0, 2, 0, 0, 0, 0, 0]])
To compute a bag-of-words vector from the encoded representation of our AG_NEWS dataset, we can use the following function:
vocab_size = len(vocab)
def to_bow(text,bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size,dtype=torch.float32)
    for i in encode(text):
        if i<bow_vocab_size:
            res[i] += 1
    return res
print(to_bow(train_dataset[0][1]))
#tensor([0., 0., 2., ..., 0., 0., 0.])
Note: Here we are using the global vocab_size variable to specify the default size of the vocabulary. Since the vocabulary size is often quite big, we can limit the size of the vocabulary to the most frequent words. Try lowering the vocab_size value and running the code below, and see how it affects the accuracy. You should expect some accuracy drop, but not a dramatic one, in exchange for higher performance.
4) Training BoW classifier:
Now that we have learned how to build the Bag-of-Words representation of our text, let's train a classifier on top of it. First, we need to convert our dataset for training in such a way that all positional vector representations are converted to the bag-of-words representation.
This can be achieved by passing a bowify function as the collate_fn parameter to the standard torch DataLoader:
from torch.utils.data import DataLoader
import numpy as np
# this collate function gets list of batch_size tuples, and needs to
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([to_bow(t[1]) for t in b])
    )
train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
Now let's define a simple classifier neural network that contains one linear layer. The size of the input vector equals vocab_size, and the output size corresponds to the number of classes (4). Because we are solving a classification task, the final activation function is LogSoftmax().
net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))
Now we will define a standard PyTorch training loop. Because our dataset is quite large, for teaching purposes we will train only for one epoch, and sometimes even for less than an epoch (specifying the epoch_size parameter allows us to limit training). We will also report accumulated training accuracy during training; the frequency of reporting is specified using the report_freq parameter.
def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.NLLLoss(),epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,features in dataloader:
        optimizer.zero_grad()
        out = net(features)
        loss = loss_fn(out,labels) #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss+=loss
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss.item()/count, acc.item()/count
train_epoch(net,train_loader,epoch_size=15000)
5) BiGrams, TriGrams and N-Grams:
One limitation of the bag-of-words approach is that some words are part of multi-word expressions; for example, the phrase 'hot dog' has a completely different meaning than the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' by the same vectors, it can confuse our model.
To address this, N-gram representations are often used in methods of document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In bi-gram representation, for example, we will add all word pairs to the vocabulary, in addition to original words.
Below is an example of how to generate a bigram bag-of-words representation using Scikit-Learn:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
'I like hot dogs.',
'The dog ran fast.',
'Its hot outside.',
]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()
# Vocabulary:
# {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}
# array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The main drawback of the N-gram approach is that vocabulary size starts to grow extremely fast. In practice, we need to combine the N-gram representation with some dimensionality reduction technique, such as embeddings, which we will discuss in the next unit.
To use the N-gram representation in our AG News dataset, we need to build a special ngram vocabulary:
counter = collections.Counter()
for (label, line) in train_dataset:
    l = tokenizer(line)
    counter.update(torchtext.data.utils.ngrams_iterator(l,ngrams=2))
bi_vocab = torchtext.vocab.Vocab(counter, min_freq=1)
print("Bigram vocabulary length = ",len(bi_vocab))
# Bigram vocabulary length = 1308844
We could then use the same code as above to train the classifier; however, it would be very memory-inefficient. In the next unit, we will train a bigram classifier using embeddings.
Note: You can keep only those ngrams that occur in the text more than a specified number of times. This will make sure that infrequent bigrams are omitted, and will decrease the dimensionality significantly. To do this, set the min_freq parameter to a higher value, and observe how the length of the vocabulary changes.
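For example, rebuilding the bigram vocabulary with a hypothetical threshold of 2 (reusing the counter built above) drops all ngrams that occur only once; the exact resulting length depends on the threshold you pick:
bi_vocab = torchtext.vocab.Vocab(counter, min_freq=2)   # keep only ngrams seen at least twice
print("Bigram vocabulary length = ", len(bi_vocab))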
6) Term Frequency Inverse Document Frequency (TF-IDF):
In BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, it is clear that frequent words, such as a, in, etc. are much less important for the classification, than specialized terms. In fact, in most NLP tasks some words are more relevant than others.
TF-IDF stands for term frequency–inverse document frequency. It is a variation of bag of words, where instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of word occurrence in the corpus.
More formally, the weight w_ij of a word i in document j is defined as:
w_ij = tf_ij × log(N / df_i)
where: tf_ij is the number of occurrences of word i in document j, i.e. the BoW value we have seen before; N is the number of documents in the collection; df_i is the number of documents containing word i in the whole collection.
The TF-IDF value w_ij increases proportionally to the number of times a word appears in a document, and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if the word appears in every document in the collection, then df_i = N, and such terms would be completely disregarded.
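To make the formula concrete, here is a minimal hand computation on the toy corpus from the earlier examples (note that Scikit-Learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ):
import math
corpus = ['I like hot dogs.', 'The dog ran fast.', 'Its hot outside.']
docs = [c.lower().replace('.', '').split() for c in corpus]
N = len(docs)                              # number of documents
word = 'hot'
tf = docs[0].count(word)                   # tf_ij: occurrences of 'hot' in document 0
df = sum(1 for d in docs if word in d)     # df_i: documents containing 'hot'
print(tf * math.log(N / df))               # w_ij = 1 * log(3/2) ≈ 0.405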
You can easily create a TF-IDF vectorization of text using Scikit-Learn:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()
However, even though TF-IDF representations provide frequency weight to different words, they are unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.”.
We will learn in the later units how to capture contextual information from text using language modeling.
III. REPRESENT WORDS WITH EMBEDDINGS:
1) Embeddings:
In our previous example, we operated on high-dimensional bag-of-words vectors with length vocab_size, and we were explicitly converting from a low-dimensional positional representation into a sparse one-hot representation. This one-hot representation is not memory-efficient; in addition, each word is treated independently from the others, i.e. one-hot encoded vectors do not express any semantic similarity between words.
In this unit, we will continue exploring the AG News dataset. To begin, let's load the data and get some definitions from the previous unit.
import torch
import torchtext
import numpy as np
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
print("Vocab size = ",vocab_size)
The idea of an embedding is to represent words by lower-dimensional dense vectors which somehow reflect the semantic meaning of a word.
An embedding layer takes a word as input, and produces an output vector of the specified embedding_size. In a sense, it is very similar to a Linear layer, but instead of taking a one-hot encoded vector, it is able to take a word number as input.
By using an embedding layer as the first layer in our network, we can switch from the bag-of-words to an embedding bag model, where we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as sum, average or max.
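As a quick illustration of the embedding layer in isolation (toy sizes, hypothetical word numbers):
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=3)   # toy vocabulary of 10 words
word_ids = torch.LongTensor([1, 5, 2])                         # three word numbers
print(emb(word_ids).shape)                                     # torch.Size([3, 3]) -- one dense vector per word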
Our classifier neural network will start with an embedding layer, then an aggregation layer, and a linear classifier on top of it:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)
    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x,dim=1)
        return self.fc(x)
2) Dealing with variable sequence size:
As a result of this architecture, minibatches to our network need to be created in a certain way. In the previous unit, when using bag-of-words, all BoW tensors in a minibatch had the same size vocab_size, regardless of the actual length of our text sequence. Once we move to word embeddings, we end up with a variable number of words in each text sample, and when combining those samples into minibatches we have to apply some padding.
This can be done using the same technique of providing a collate_fn function to the DataLoader:
def padify(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label,
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])
    )
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
3) Training embedding classifier
Now that we have defined a proper dataloader, we can train the model using the training function we defined in the previous unit:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=1, epoch_size=25000)
4) EmbeddingBag Layer and Variable-Length Sequence Representation:
In the previous architecture, we needed to pad all sequences to the same length in order to fit them into a minibatch. This is not the most efficient way to represent variable-length sequences - another approach would be to use an offset vector, which holds the offsets of all sequences stored in one large vector.
Note: In the picture above, we show a sequence of characters, but in our example we are working with sequences of words. However, the general principle of representing sequences with an offset vector remains the same.
To work with the offset representation, we use the EmbeddingBag layer. It is similar to Embedding, but it takes a content vector and an offset vector as input, and it also includes an averaging layer, which can be mean, sum or max.
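Here is a minimal sketch of EmbeddingBag with toy sizes, where two sequences are concatenated into one content vector and described by their offsets:
emb_bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')
text = torch.LongTensor([1, 2, 4, 5, 4, 3, 2, 9])   # two toy sequences concatenated: [1,2,4,5] and [4,3,2,9]
offsets = torch.LongTensor([0, 4])                  # starting position of each sequence inside `text`
print(emb_bag(text, offsets).shape)                 # torch.Size([2, 3]) -- one averaged vector per sequence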
Here is a modified network that uses EmbeddingBag:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)
    def forward(self, text, off):
        x = self.embedding(text, off)
        return self.fc(x)
To prepare the dataset for training, we need to provide a conversion function that will prepare the offset vector:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1])) for t in b]
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return (
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text
        o
    )
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)
Note that, unlike in all previous examples, our network now accepts two parameters: a data vector and an offset vector, which are of different sizes.
Similarly, our data loader also provides us with 3 values instead of 2: both text and offset vectors are provided as features.
Therefore, we need to slightly adjust our training function to take care of that:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)
def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,text,off in dataloader:
        optimizer.zero_grad()
        labels,text,off = labels.to(device), text.to(device), off.to(device)
        out = net(text, off)
        loss = loss_fn(out,labels) #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss+=loss
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss.item()/count, acc.item()/count
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)
5) Semantic Embeddings: Word2Vec
In our previous example, the model's embedding layer learnt to map words to vector representations; however, this representation did not have much semantic meaning. It would be nice to learn a vector representation such that similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (e.g. Euclidean distance).
To do that, we need to pre-train our embedding model on a large collection of text in a specific way. One of the first ways to train semantic embeddings is called Word2Vec. It is based on two main architectures that are used to produce a distributed representation of words:
Continuous bag-of-words (CBoW) - in this architecture, we train the model to predict a word from the surrounding context. Given the ngram (W_{-2}, W_{-1}, W_0, W_1, W_2), the goal of the model is to predict W_0 from (W_{-2}, W_{-1}, W_1, W_2).
Continuous skip-gram is opposite to CBoW: the model uses the current word to predict the surrounding window of context words.
CBoW is faster, while skip-gram is slower, but does a better job of representing infrequent words.
To experiment with a word2vec embedding pre-trained on the Google News dataset, we can use the gensim library. Below we find the words most similar to 'neural'.
Note: When you first create word vectors, downloading them can take some time!
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")
""" Output:
neuronal -> 0.780479907989502
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851128578186
neuron -> 0.7174385190010071
cortical -> 0.6941086053848267
brain_circuitry -> 0.6923245787620544
synaptic -> 0.6699119210243225
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688
"""
We can also extract the vector embedding from a word, to be used in training a classification model (we only show the first 20 components of the vector for clarity):
w2v.word_vec('play')[:20]
The great thing about semantic embeddings is that you can manipulate the vector encoding to change the semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words king and woman, and as far away as possible from the word man:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]
Both CBOW and Skip-Grams are “predictive” embeddings, in that they only take local contexts into account. Word2Vec does not take advantage of global context.
FastText builds on Word2Vec by learning vector representations for each word and the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word information.
GloVe leverages the idea of a co-occurrence matrix, and uses neural methods to decompose the co-occurrence matrix into more expressive and non-linear word vectors.
You can play with the example by changing embeddings to FastText or GloVe, since gensim supports several different word embedding models:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]
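For example, a GloVe model can be loaded through the same gensim downloader; the model name below comes from the gensim-data catalog (a FastText variant such as 'fasttext-wiki-news-subwords-300' can be loaded the same way), and the download can take a while:
glove = api.load('glove-wiki-gigaword-100')    # assumed gensim-data model name
print(glove.most_similar(positive=['king','woman'], negative=['man'])[0])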
6) Using Pre-Trained Embeddings in PyTorch
We can modify the example above to pre-populate the matrix in our embedding layer with semantic embeddings, such as Word2Vec. We need to take into account that the vocabularies of the pre-trained embedding and our text corpus will likely not match, so we will initialize weights for the missing words with random values:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')
net = EmbedClassifier(vocab_size,embed_size,len(classes))
print('Populating matrix, this will take some time...',end='')
found, not_found = 0,0
for i,w in enumerate(vocab.itos):
    try:
        net.embedding.weight[i].data = torch.tensor(w2v.get_vector(w))
        found+=1
    except:
        net.embedding.weight[i].data = torch.normal(0.0,1.0,(embed_size,))
        not_found+=1
print(f"Done, found {found} words, {not_found} words missing")
net = net.to(device)
Now let's train our model. Note that the time it takes to train the model is significantly larger than in the previous example, due to the larger embedding layer size, and thus a much higher number of parameters. Also, because of this, we may need to train our model on more examples if we want to avoid overfitting.
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)
In our case we do not see a huge increase in accuracy, which is likely due to quite different vocabularies. To overcome the problem of different vocabularies, we can use one of the following solutions:
Re-train word2vec model on our vocabulary.
Load our dataset with the vocabulary from the pre-trained word2vec model. Vocabulary used to load the dataset can be specified during loading.
The latter approach seems easier, especially because the PyTorch torchtext framework contains built-in support for embeddings. We can, for example, instantiate a GloVe-based vocabulary in the following manner:
vocab = torchtext.vocab.GloVe(name='6B', dim=50)
Loaded vocabulary has the following basic operations:
vocab.stoi dictionary allows us to convert word into its dictionary index
vocab.itos does the opposite - converts number into word
vocab.vectors is the array of embedding vectors, so to get the embedding of a word s we need to use vocab.vectors[vocab.stoi[s]]
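A quick check of those operations (assuming the GloVe vocabulary loaded above; 'cat' is just an example word):
idx = vocab.stoi['cat']               # word -> index
print(idx, vocab.itos[idx])           # index -> word
print(vocab.vectors[idx].shape)       # torch.Size([50]) -- the 50-dimensional embedding of 'cat'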
Here is an example of manipulating embeddings to demonstrate the equation king - man + woman = queen (I had to tweak the coefficient a bit to make it work):
# get the vector corresponding to king-man+woman
qvec = vocab.vectors[vocab.stoi['king']]-vocab.vectors[vocab.stoi['man']]+1.3*vocab.vectors[vocab.stoi['woman']]
# find the index of the closest embedding vector
d = torch.sum((vocab.vectors-qvec)**2,dim=1)
min_idx = torch.argmin(d)
# find the corresponding word
vocab.itos[min_idx]
To train the classifier using those embeddings, we first need to encode our dataset using GloVe vocabulary:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1],voc=vocab)) for t in b] # pass the instance of vocab to encode function!
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return (
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text
        o
    )
As we have seen above, all vector embeddings are stored in the vocab.vectors matrix. This makes it super-easy to load those weights into the weights of the embedding layer by simple copying:
net = EmbedClassifier(len(vocab),len(vocab.vectors[0]),len(classes))
net.embedding.weight.data = vocab.vectors
net = net.to(device)
Now let's train our model and see if we get better results:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)
One of the reasons we are not seeing a significant increase in accuracy is that some words from our dataset are missing in the pre-trained GloVe vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings on our dataset.
7) Train your own embeddings:
In our examples, we have been using pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained using either the CBoW or Skip-gram architectures. This exercise goes beyond this module, but those interested might want to check out the official PyTorch tutorial on Language Modeling. Also, the gensim framework can be used to train the most commonly used embeddings in a few lines of code, as described in the documentation.
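As a rough illustration, here is a minimal sketch of training Word2Vec with gensim on our own tokenized headlines. It assumes gensim 4.x (older versions use size instead of vector_size), reuses the tokenizer from the first unit, and assumes train_dataset still yields (label, text) pairs; training on the full dataset takes a few minutes:
from gensim.models import Word2Vec
sentences = [tokenizer(line) for _, line in train_dataset]                  # tokenized headlines
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)   # sg=1: skip-gram, sg=0: CBoW
print(model.wv.most_similar('microsoft')[:3])                               # nearest neighbours in the learned space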
8) Contextual Embeddings:
One key limitation of traditional pretrained embedding representations such as Word2Vec is the problem of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as the word 'play', have different meanings depending on the context they are used in.
For example, the word 'play' in these two sentences has quite different meanings:
I went to a play at the theatre.
John wants to play with his friends.
The pretrained embeddings above represent both of these meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on the language model, which is trained on a large corpus of text, and knows how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.
IV. CAPTURE PATTERNS WITH RECURRENT NEURAL NETWORKS:
1) Recurrent neural networks:
In the previous module, we have been using rich semantic representations of text, and a simple linear classifier on top of the embeddings. What this architecture does is capture the aggregated meaning of words in a sentence, but it does not take into account the order of words, because the aggregation operation on top of the embeddings removed this information from the original text. Because these models are unable to model word ordering, they cannot solve more complex or ambiguous tasks such as text generation or question answering.
To capture the meaning of a text sequence, we need to use another neural network architecture, which is called a recurrent neural network, or RNN. In an RNN, we pass our sentence through the network one symbol at a time, and the network produces some state, which we then pass to the network again with the next symbol.
Given the input sequence of tokens X_0, ..., X_n, the RNN creates a sequence of neural network blocks, and trains this sequence end-to-end using back propagation. Each network block takes a pair (X_i, S_i) as input, and produces S_{i+1} as a result. The final state S_n or output X_n goes into a linear classifier to produce the result. All network blocks share the same weights, and are trained end-to-end using one back-propagation pass.
Because the state vectors S_0, ..., S_n are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word not appears somewhere in the sequence, it can learn to negate certain elements within the state vector, resulting in negation.
Let's see how recurrent neural networks can help us classify our news dataset.
import torch
import torchtext
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
2) Simple RNN classifier:
In the case of a simple RNN, each recurrent unit is a simple linear network, which takes a concatenated input vector and state vector, and produces a new state vector. PyTorch represents this unit with the RNNCell class, and a network of such cells as the RNN layer.
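To make the recurrence explicit, here is a minimal sketch (toy sizes) that applies an RNNCell step by step, feeding the state back in together with each new input:
cell = torch.nn.RNNCell(input_size=8, hidden_size=16)
inputs = torch.randn(5, 1, 8)          # a toy sequence of 5 steps, batch size 1
h = torch.zeros(1, 16)                 # initial state
for x_t in inputs:
    h = cell(x_t, h)                   # each step consumes the input and the previous state
print(h.shape)                         # torch.Size([1, 16]) -- the final state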
To define an RNN classifier, we will first apply an embedding layer to lower the dimensionality of the input vocabulary, and then add an RNN layer on top of it:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)
    def forward(self, x):
        batch_size = x.size(0)
        x = self.embedding(x)
        x,h = self.rnn(x)
        return self.fc(x.mean(dim=1))
Note: We use an untrained embedding layer here for simplicity, but for even better results we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous unit. For better understanding, you might want to adapt this code to work with pre-trained embeddings.
In our case, we will use the padded data loader, so each batch will have a number of padded sequences of the same length. The RNN layer will take the sequence of embedding tensors, and produce two outputs:
x is a sequence of RNN cell outputs at each step.
h is a final hidden state for the last element of the sequence.
We then apply a fully-connected linear classifier on top to produce the class scores.
Note: RNNs are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the resulting number of layers involved in back propagation is quite large. Thus we need to select a small learning rate, and train the network on a larger dataset to produce good results. It can take quite a long time, so using a GPU is preferred.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
net = RNNClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)
3) Long Short Term Memory (LSTM)
One of the main problems of classical RNNs is the so-called vanishing gradients problem. Because RNNs are trained end-to-end in one back-propagation pass, they have a hard time propagating error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. One of the ways to avoid this problem is to introduce explicit state management by using so-called gates. The two most known architectures of this kind are Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU).
An LSTM network [Christopher Olah] is organized in a manner similar to an RNN, but there are two states that are being passed from layer to layer: the actual state c, and the hidden vector h. At each unit, the hidden vector h_i is concatenated with the input x_i, and together they control what happens to the state c via gates. Each gate is a neural network with sigmoid activation (output in the range [0,1]), which can be thought of as a bitwise mask when multiplied by the state vector. There are the following gates (from left to right on the picture above):
forget gate takes the hidden vector and determines which components of the vector c we need to forget, and which to pass through.
input gate takes some information from the input and the hidden vector, and inserts it into the state.
output gate transforms the state via some linear layer with tanh activation, then selects some of its components using the hidden vector h_i to produce the new hidden vector h_{i+1}.
Components of the state c can be thought of as some flags that can be switched on and off.
For example, when we encounter the name Alice in the sequence, we may want to assume that it refers to a female character, and raise the flag in the state that we have a female noun in the sentence. When we further encounter the phrase and Tom, we will raise the flag that we have a plural noun. Thus by manipulating the state we can supposedly keep track of the grammatical properties of sentence parts.
While the internal structure of an LSTM cell may look complex, PyTorch hides this implementation inside the LSTMCell class, and provides the LSTM object to represent the whole LSTM layer. Thus, the implementation of an LSTM classifier will be pretty similar to the simple RNN we have seen above:
class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)
    def forward(self, x):
        batch_size = x.size(0)
        x = self.embedding(x)
        x,(h,c) = self.rnn(x)
        return self.fc(h[-1])
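A small shape check (toy numbers) of what torch.nn.LSTM returns with batch_first=True, and why the classifier above feeds h[-1] into the linear layer:
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h, c) = lstm(torch.randn(3, 5, 8))     # a toy batch of 3 sequences, 5 steps each
print(out.shape, h.shape, c.shape)           # torch.Size([3, 5, 16]) torch.Size([1, 3, 16]) torch.Size([1, 3, 16])
# h[-1] holds the final hidden state of each sequence -- one 16-dimensional vector per batch element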
Now let's train our network. Note that training an LSTM is also quite slow, and you may not see much rise in accuracy in the beginning of training. Also, you may need to play with the lr learning rate parameter to find a learning rate that results in reasonable training speed, yet does not make training unstable.
net = LSTMClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)
4) Packed sequences:
In our example, we had to pad all sequences in the minibatch with zero vectors. While this results in some memory waste, with RNNs it is more critical that additional RNN cells are created for the padded input items, which take part in training, yet do not carry any important input information. It would be much better to train the RNN only on the actual sequence sizes.
To do that, a special format of padded sequence storage is introduced in PyTorch. Suppose we have an input padded minibatch which looks like this:
[[1,2,3,4,5],
[6,7,8,0,0],
[9,0,0,0,0]]
Here 0 represents padded values, and the actual length vector of input sequences is [5,3,1].
In order to effectively train an RNN with a padded sequence, we want to begin training the first group of RNN cells with a large minibatch ([1,6,9]), but then end processing of the third sequence, and continue training with shorter minibatches ([2,7], [3,8]), and so on. Thus, a packed sequence is represented as one vector - in our case [1,6,9,2,7,3,8,4,5] - and a length vector ([5,3,1]), from which we can easily reconstruct the original padded minibatch.
To produce a packed sequence, we can use the torch.nn.utils.rnn.pack_padded_sequence function. All recurrent layers, including RNN, LSTM and GRU, support packed sequences as input, and produce packed output, which can be decoded using torch.nn.utils.rnn.pad_packed_sequence.
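To see the packing in action on the toy minibatch above, here is a small sketch (the packed data vector matches the order described in the previous paragraph):
from torch.nn.utils.rnn import pack_padded_sequence
padded = torch.tensor([[1,2,3,4,5],
                       [6,7,8,0,0],
                       [9,0,0,0,0]])
lengths = torch.tensor([5,3,1])
packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data)          # tensor([1, 6, 9, 2, 7, 3, 8, 4, 5])
print(packed.batch_sizes)   # tensor([3, 2, 2, 1, 1]) -- how many sequences are still active at each step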
To be able to produce the packed sequence, we need to pass the length vector to the network, and thus we need a different function to prepare minibatches:
def pad_length(b):
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch and the length sequence itself
    len_seq = list(map(len,v))
    l = max(len_seq)
    return ( # tuple of three tensors - labels, padded features, length sequence
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v]),
        torch.tensor(len_seq)
    )
train_loader_len = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)
test_loader_len = torch.utils.data.DataLoader(test_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)