Introduction to Audio Classification with PyTorch
These are my personal notes from the course "PyTorch Fundamentals", taught freely by Microsoft. [Link]
I. INTRODUCTION TO VOICE ASSISTANTS:
In this Learn module, we will learn how to do audio classification with PyTorch.
There are multiple ways to build an audio classification model. You can use the waveform directly, tag sections of a wave file, or even apply computer vision to the spectrogram image. In this tutorial we will first break down how to understand audio data, from analog to digital representations, and then we will build the model using computer vision on the spectrogram images. That's right: you can turn audio into an image representation and then use computer vision to classify the word spoken!
In this module, we want to look at how we get the text from the spoken audio. Of course, audio classification is useful for many things beyond speech assistants. For example, in music you can classify genres, or you can detect illness from the tone of someone's voice, along with many other applications we haven't even thought of yet.
We will build a simple model that can recognize yes and no. The dataset we will use is the open Speech Commands dataset, which is built into the PyTorch datasets. It has 36 different words/sounds that can be used for classification, and each utterance is stored as a one-second (or shorter) WAVE file. We will only use yes and no for binary classification.
II. UNDERSTAND AUDIO DATA AND CONCEPTS:
1) Audio data:
We will look at some key concepts and features of audio data.
Let's think about the digital representation of analog sound. How does sound get recorded anyway? Just like with images, we need to take our physical world and convert it into numbers, a digital representation the computer can understand. For audio, a microphone captures the sound, which is then converted from analog to digital by sampling at consistent intervals of time. The number of samples per second is called the sample rate. The higher the sample rate, the higher the quality of the sound; however, after a certain point the difference cannot be detected by the human ear. A common audio sample rate is 48 kHz, or 48,000 samples per second. This dataset was sampled at 16 kHz, so our sample rate is 16,000.
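To make the sample-rate idea concrete, here is a minimal sketch (not part of the course code) showing how the sample rate determines how many values a one-second clip contains:
# A minimal sketch: how many samples a clip holds at a given sample rate.
sample_rate = 16000        # Speech Commands is sampled at 16 kHz
clip_duration_s = 1.0      # each utterance is one second or less
num_samples = int(sample_rate * clip_duration_s)
print(num_samples)         # 16000 samples for a full one-second clip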
When the audio is sampled, we capture the frequency (the pitch of the sound) and the amplitude (how loud the audio is). We can then take our sample rate and frequency and represent the signal visually. This signal can be represented as a waveform, which is the signal plotted over time in a graphical format. The audio can be recorded in different channels; for example, stereo recordings have 2 channels, right and left.
Now that we understand a bit about how we get our audio file, let's take a moment to understand how we might want to parse a file. For example, if you have longer audio files, you may want to split them into frames, sections of the audio to be classified individually. For this dataset we don't need to set any frames for our audio samples, as each sample is only one second and one word. Another processing option is an offset, which is the number of frames from the start of the file at which to begin loading data.
2) Get set up with TorchAudio:
TorchAudio is a library in the PyTorch ecosystem that provides I/O functionality, popular open datasets, and the common audio transformations we will need to build our model. We will use this library to work with our audio data.
Let's get started! First we will import the packages we need:
# import the packages
import os
import torchaudio
import IPython.display as ipd
import matplotlib.pyplot as plt
3) Get the Speech Commands dataset:
PyTorch has a variety of datasets built in, which is super helpful when trying to learn and play around with different audio models. We will use one of these datasets, Speech Commands. We will download the full dataset, but we are only going to use the yes and no classes to create a binary classification model.
Create a data folder:
default_dir = os.getcwd()
folder = 'data'
print(f'Data directory will be: {default_dir}/{folder}')

if os.path.isdir(folder):
    print("Data folder exists.")
else:
    print("Creating folder.")
    os.mkdir(folder)
Download the dataset to the data folder:
trainset_speechcommands = torchaudio.datasets.SPEECHCOMMANDS(f'./{folder}/', download=True)
Visualize the classes available in the dataset:
os.chdir(f'./{folder}/SpeechCommands/speech_commands_v0.02/')
labels = [name for name in os.listdir('.') if os.path.isdir(name)]
# back to default directory
os.chdir(default_dir)
print(f'Total Labels: {len(labels)}')
print(f'Label Names: {labels}')
Convert the sound to Tensor:
You have likely used a wave file before and understand that it is one format in which we save the digital representation of our analog audio so it can be shared and played. The Speech Commands dataset we are using for this tutorial is stored in wave files that are all one second or less.
Let's load one of the wave files and take a look at what the waveform tensor looks like. We load the files using torchaudio.load, which loads an audio file into a torch.Tensor object. TorchAudio abstracts the load function over the different audio back ends, so you don't have to worry about the implementation. torchaudio.load returns the waveform as a tensor and the sample_rate as an int. Check out more about load in the PyTorch docs.
filename = "./data/SpeechCommands/speech_commands_v0.02/yes/00f0204f_nohash_0.wav"
waveform, sample_rate = torchaudio.load(filepath=filename, num_frames=3)
print(f'waveform tensor:{waveform}')
waveform, sample_rate = torchaudio.load(filepath=filename, num_frames=3, offset =2)
print(waveform)
waveform, sample_rate = torchaudio.load(filepath=filename)
print(waveform)
Plot the waveform: We will create a plot_audio function to display the waveform and listen to a sample of each class.
def plot_audio(filename):
    waveform, sample_rate = torchaudio.load(filename)

    print("Shape of waveform: {}".format(waveform.size()))
    print("Sample rate of waveform: {}".format(sample_rate))

    plt.figure()
    plt.plot(waveform.t().numpy())

    return waveform, sample_rate
filename = "./data/SpeechCommands/speech_commands_v0.02/yes/00f0204f_nohash_0.wav"
waveform, sample_rate = plot_audio(filename)
ipd.Audio(waveform.numpy(), rate=sample_rate)
filename = "./data/SpeechCommands/speech_commands_v0.02/no/0b40aa8e_nohash_0.wav"
waveform, sample_rate = plot_audio(filename)
ipd.Audio(waveform.numpy(), rate=sample_rate)
III. AUDIO TRANSFORM AND VISUALIZATION:
Now that we have downloaded our dataset, let's learn more about visualizing and transforming audio data. TorchAudio has many transforms available in the library; take a look at the list below to see the supported transformations.
AUDIO MANIPULATION WITH TORCHAUDIO: [Link]
From this list we are going to take a deeper look at understanding the following concepts and transforms:
Resample: Resample waveform to a different sample rate.
Spectrogram: Create a spectrogram from a waveform.
GriffinLim: Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.
ComputeDeltas: Compute delta coefficients of a tensor, usually a spectrogram.
ComplexNorm: Compute the norm of a complex tensor.
MelScale: This turns a normal STFT into a Mel-frequency STFT, using a conversion matrix.
AmplitudeToDB: This turns a spectrogram from the power/amplitude scale to the decibel scale.
MFCC: Create the Mel-frequency cepstrum coefficients from a waveform.
MelSpectrogram: Create MEL Spectrograms from a waveform using the STFT function in PyTorch.
MuLawEncoding: Encode waveform based on mu-law companding.
MuLawDecoding: Decode mu-law encoded waveform.
TimeStretch: Stretch a spectrogram in time without modifying pitch for a given rate.
FrequencyMasking: Apply masking to a spectrogram in the frequency domain.
TimeMasking: Apply masking to a spectrogram in the time domain.
Once we understand these concepts we will create our spectrogram images of the yes/no dataset to be used in the computer vision model.
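Before we dive in, here is a small sketch (not part of the course code) showing how a couple of the transforms listed above can be chained together on a dummy waveform:
import torch
import torchaudio

# one second of fake mono audio at 16 kHz, just to exercise the transforms
dummy_waveform = torch.randn(1, 16000)

resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000)
spectrogram = torchaudio.transforms.Spectrogram()
to_db = torchaudio.transforms.AmplitudeToDB()

# resample, compute a spectrogram, then convert to decibels
spec_db = to_db(spectrogram(resample(dummy_waveform)))
print(spec_db.shape)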
1) Load the Dataset folders into a DataLoader:
Here we import the packages and create a load_audio_files function to load audio files from a specified path into a dataset.
import os
import torch
import torchaudio
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from pathlib import Path

def load_audio_files(path: str, label: str):
    dataset = []
    walker = sorted(str(p) for p in Path(path).glob('*.wav'))

    for i, file_path in enumerate(walker):
        path, filename = os.path.split(file_path)
        speaker, _ = os.path.splitext(filename)
        speaker_id, utterance_number = speaker.split("_nohash_")
        utterance_number = int(utterance_number)

        # Load audio
        waveform, sample_rate = torchaudio.load(file_path)
        dataset.append([waveform, sample_rate, label, speaker_id, utterance_number])

    return dataset
Call the load_audio_files function for each class we are going to use, then print the length of the dataset.
trainset_speechcommands_yes = load_audio_files('./data/SpeechCommands/speech_commands_v0.02/yes', 'yes')
trainset_speechcommands_no = load_audio_files('./data/SpeechCommands/speech_commands_v0.02/no', 'no')
print(f'Length of yes dataset: {len(trainset_speechcommands_yes)}')
print(f'Length of no dataset: {len(trainset_speechcommands_no)}')
# Length of yes dataset: 4044
# Length of no dataset: 3941
Now load the dataset into a DataLoader:
trainloader_yes = torch.utils.data.DataLoader(trainset_speechcommands_yes, batch_size=1,
shuffle=True, num_workers=0)
trainloader_no = torch.utils.data.DataLoader(trainset_speechcommands_no, batch_size=1,
shuffle=True, num_workers=0)
Here we grab the waveform and sample_rate from each class and print out a sample of the dataset to see what our data looks like.
yes_waveform = trainset_speechcommands_yes[0][0]
yes_sample_rate = trainset_speechcommands_yes[0][1]
print(f'Yes Waveform: {yes_waveform}')
print(f'Yes Sample Rate: {yes_sample_rate}')
print(f'Yes Label: {trainset_speechcommands_yes[0][2]}')
print(f'Yes ID: {trainset_speechcommands_yes[0][3]}')
no_waveform = trainset_speechcommands_no[0][0]
no_sample_rate = trainset_speechcommands_no[0][1]
print(f'No Waveform: {no_waveform}')
print(f'No Sample Rate: {no_sample_rate}')
print(f'No Label: {trainset_speechcommands_no[0][2]}')
print(f'No ID: {trainset_speechcommands_no[0][3]}')
2) Transform and visualize:
2.1 Waveform:
As we saw earlier, the waveform is the signal plotted over time in a graphical format, generated from the sample rate and frequency. The audio can be recorded in different channels; for example, stereo recordings have 2 channels, right and left.
Here we will show you how to use the Resample transform to reduce the size of the waveform, then graph it to visualize the new waveform shape.
def show_waveform(waveform, sample_rate, label):
    print("Waveform: {}\nSample rate: {}\nLabels: {}".format(waveform, sample_rate, label))
    new_sample_rate = sample_rate / 10
    print(new_sample_rate)

    # Resample applies to a single channel, we resample the first channel here
    channel = 0
    waveform_transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel, :].view(1, -1))

    print("Shape of transformed waveform: {}".format(waveform_transformed.size()))

    plt.figure()
    plt.plot(waveform_transformed[0, :].numpy())
show_waveform(yes_waveform, yes_sample_rate, 'yes')
2.2 Spectrogram:
Next we will look at the Spectrogram transform and concept.
What is a spectrogram anyway? A spectrogram shows how the frequency content of an audio file changes over time, allowing us to visualize the audio data by frequency. This image is what we will use for computer vision classification on our audio files.
def show_spectrogram(waveform):
    spectrogram = torchaudio.transforms.Spectrogram()(waveform)
    # print(spectrogram)
    print("Shape of spectrogram: {}".format(spectrogram.size()))

    plt.figure()
    plt.imshow(spectrogram.log2()[0, :, :].numpy(), cmap='gray')
    # plt.imsave(f'test/spectrogram_img.png', spectrogram.log2()[0,:,:].numpy(), cmap='gray')
show_spectrogram(yes_waveform)
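As an aside, torchaudio's Spectrogram transform defaults to an STFT with n_fft=400 and hop_length=200, which gives 201 frequency bins and, for a full one-second 16,000-sample clip, 81 time frames. That is where the 201 x 81 image size we use later when resizing the spectrogram images comes from. A quick sketch on a dummy waveform (not course code) to sanity-check this:
import torch
import torchaudio

dummy = torch.randn(1, 16000)                      # fake one-second mono clip at 16 kHz
spec = torchaudio.transforms.Spectrogram()(dummy)  # default n_fft=400, hop_length=200
print(spec.shape)                                  # expect torch.Size([1, 201, 81])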
2.3 Mel Spectrogram:
A mel spectrogram also maps frequency against time, but the frequency axis is converted to the mel scale.
The mel scale rescales frequency to match how humans perceive differences in pitch: equal distances on the mel scale sound equally far apart to a listener. The transform converts the frequencies to the mel scale and then creates the spectrogram image.
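For reference, one common mel-scale formula (the HTK convention, which torchaudio supports) maps a frequency in Hz to mels as m = 2595 * log10(1 + f / 700). A tiny sketch, not part of the course code:
import math

def hz_to_mel(hz: float) -> float:
    # HTK-style mel conversion: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + hz / 700.0)

print(hz_to_mel(1000))   # roughly 1000 mels, by construction of the scale
print(hz_to_mel(8000))   # higher frequencies get compressed together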
def show_melspectrogram(waveform, sample_rate):
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mel_spectrogram.size()))

    plt.figure()
    plt.imshow(mel_spectrogram.log2()[0, :, :].numpy(), cmap='gray')
show_melspectrogram(yes_waveform, yes_sample_rate)
2.4 Mel-frequency cepstral coefficients (MFCC):
A simplified explanation of what the MFCC transform does: it converts the signal to a mel spectrogram, takes the log, and then applies a discrete cosine transform; the resulting coefficients compactly describe the spectral envelope of the sound. Let's take a look at what this looks like.
def show_mfcc(waveform, sample_rate):
    mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mfcc_spectrogram.size()))

    plt.figure()
    fig1 = plt.gcf()
    plt.imshow(mfcc_spectrogram.log2()[0, :, :].numpy(), cmap='gray')

    plt.figure()
    plt.plot(mfcc_spectrogram.log2()[0, :, :].numpy())
    plt.draw()
show_mfcc(no_waveform, no_sample_rate)
[Figures: waveform, spectrogram, and mel spectrogram plots for yes_waveform / yes_sample_rate, and MFCC plots for no_waveform / no_sample_rate.]
3) Create an image from a Spectrogram:
We have broken down some of the ways to understand our audio data and the different transformations we can use on it. Now let's create the images we will use for classification.
Below are two functions: one creates Spectrogram images and the other creates MFCC images for classification. We will use the spectrogram images in this example, but feel free to play around with MFCC classification using the MFCC image function below.
def create_images(trainloader, label_dir):
    # make directory
    directory = f'./data/spectrograms/{label_dir}/'
    if os.path.isdir(directory):
        print("Data exists")
    else:
        os.makedirs(directory, mode=0o777, exist_ok=True)

    for i, data in enumerate(trainloader):
        waveform = data[0]
        sample_rate = data[1][0]
        label = data[2]
        ID = data[3]

        # create transformed waveforms
        spectrogram_tensor = torchaudio.transforms.Spectrogram()(waveform)

        fig = plt.figure()
        plt.imsave(f'./data/spectrograms/{label_dir}/spec_img{i}.png', spectrogram_tensor[0].log2()[0, :, :].numpy(), cmap='gray')
def create_mfcc_images(trainloader, label_dir):
    # make directory
    os.makedirs(f'./data/mfcc_spectrograms/{label_dir}/', mode=0o777, exist_ok=True)

    for i, data in enumerate(trainloader):
        waveform = data[0]
        sample_rate = data[1][0]
        label = data[2]
        ID = data[3]

        mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)

        plt.figure()
        fig1 = plt.gcf()
        plt.imshow(mfcc_spectrogram[0].log2()[0, :, :].numpy(), cmap='gray')
        plt.draw()
        fig1.savefig(f'./data/mfcc_spectrograms/{label_dir}/spec_img{i}.png', dpi=100)
        # spectrogram_train.append([spectrogram_tensor, label, sample_rate, ID])
create_images(trainloader_yes, 'yes')
create_images(trainloader_no, 'no')
IV. BUILD THE SPEECH MODEL:
Now that we have created the spectrogram images, it's time to build the computer vision model. If you are following along with the learning path, then you already created a computer vision model in the second module of this path. We will use the torchvision package to build our vision model. Let's import the packages we need to build the model.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, models, transforms
1) Load Spectrogram images into a DataLoader for training
Here we provide the path to our image data and use the ImageFolder helper to load the images into tensors. The labels are created based on the name of the folders.
data_path = './data/spectrograms' #looking in subfolder train
yes_no_dataset = datasets.ImageFolder(
root=data_path,
transform=transforms.Compose([transforms.Resize((201,81)),
transforms.ToTensor()
])
)
print(yes_no_dataset)
print(yes_no_dataset[5][0].size())
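ImageFolder derives the class labels from the sub-folder names. If you want to confirm the mapping, a quick optional check (not part of the course code) is:
print(yes_no_dataset.class_to_idx)   # expected: {'no': 0, 'yes': 1}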
2) Split the data for training and testing
Split the data to use 80% to train the model and 20% to test.
#split data to test and train
#use 80% to train
train_size = int(0.8 * len(yes_no_dataset))
test_size = len(yes_no_dataset) - train_size
yes_no_train_dataset, yes_no_test_dataset = torch.utils.data.random_split(yes_no_dataset, [train_size, test_size])
print(len(yes_no_train_dataset))
print(len(yes_no_test_dataset))
Load the data into the DataLoader
train_dataloader = torch.utils.data.DataLoader(
yes_no_train_dataset,
batch_size=15,
num_workers=2,
shuffle=True
)
test_dataloader = torch.utils.data.DataLoader(
yes_no_test_dataset,
batch_size=15,
num_workers=2,
shuffle=True
)
Let's take a look at what our tensor looks like:
train_dataloader.dataset[0][0][0][0]
Use a GPU for training if one is available; otherwise fall back to the CPU:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
3) Create the neural network
Create the Convolutional Neural Network and set the device.
class CNNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(51136, 50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        # x = x.view(x.size(0), -1)
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)
model = CNNet().to(device)
print(model)
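In case you are wondering where the 51136 input size for fc1 comes from: with the 3-channel images resized to 201 x 81, the two conv (kernel size 5) and max-pool (size 2) stages shrink the feature map to 64 channels of 47 x 17. A small sketch of that arithmetic (an illustration, not course code):
h, w = 201, 81                       # resized spectrogram image height and width
h, w = (h - 4) // 2, (w - 4) // 2    # conv1 (kernel 5) then max_pool2d(2): 98 x 38
h, w = (h - 4) // 2, (w - 4) // 2    # conv2 (kernel 5) then max_pool2d(2): 47 x 17
print(64 * h * w)                    # 64 channels * 47 * 17 = 51136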
4) Create the train and test functions
Here we will set the cost function, learning rate, and optimizer. Then we set up the train and test functions that we will call next.
# cost function used to determine best parameters
cost = torch.nn.CrossEntropyLoss()

# used to create optimal parameters
learning_rate = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Create the training function
def train(dataloader, model, loss, optimizer):
    model.train()
    size = len(dataloader.dataset)

    for batch, (X, Y) in enumerate(dataloader):
        X, Y = X.to(device), Y.to(device)
        optimizer.zero_grad()
        pred = model(X)
        loss = cost(pred, Y)
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f'loss: {loss:>7f} [{current:>5d}/{size:>5d}]')
# Create the validation/test function
def test(dataloader, model):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0, 0

    with torch.no_grad():
        for batch, (X, Y) in enumerate(dataloader):
            X, Y = X.to(device), Y.to(device)
            pred = model(X)

            test_loss += cost(pred, Y).item()
            correct += (pred.argmax(1) == Y).type(torch.float).sum().item()

    test_loss /= size
    correct /= size

    print(f'\nTest Error:\nacc: {(100*correct):>0.1f}%, avg loss: {test_loss:>8f}\n')
5) Train the model
Now let's set the number of epochs and call our train and test functions for each epoch.
epochs = 15

for t in range(epochs):
    print(f'Epoch {t+1}\n-------------------------------')
    train(train_dataloader, model, cost, optimizer)
    test(test_dataloader, model)
print('Done!')
6) Test the model
Awesome! You should have reached an accuracy somewhere between 93% and 95% by the 15th epoch. Here we grab a batch from our test data and see how the model's predictions compare to the actual labels.
model.eval()
test_loss, correct = 0, 0

with torch.no_grad():
    for batch, (X, Y) in enumerate(test_dataloader):
        X, Y = X.to(device), Y.to(device)
        pred = model(X)

        print("Predicted:")
        print(f"{pred.argmax(1)}")
        print("Actual:")
        print(f"{Y}")
        break
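If you prefer to see the predictions as words rather than class indices, you can map the indices back through the dataset's classes attribute. A small optional sketch (assuming pred and Y from the loop above are still in scope):
class_map = yes_no_dataset.classes                      # e.g. ['no', 'yes'] from the folder names
print([class_map[i] for i in pred.argmax(1).tolist()])  # predicted words
print([class_map[i] for i in Y.tolist()])               # actual words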
V. SUMMARY:
Congratulations on building an audio binary classification speech model!
We have covered the basics of building an audio machine learning model, from understanding how analog sound becomes digital sound to creating spectrogram images of our wave files. We used the Speech Commands dataset, narrowed the classes down to yes and no, then looked at ways to understand and visualize audio data. From there, we took the spectrograms, created images, and used a convolutional neural network to build our model.