September 22, 2020

Identifying the Gender of a Movie Character with Deep Learning, NLP, and PyTorch

A Primer on Text Classification with PyTorch

If you were given a single line from a movie, would you be able to identify the gender of the character who delivered the line? Unless you’ve memorized a lot of movie scripts, probably not. Lucky for you, you don’t have to do this as long as we have computers! The field of Natural Language Processing (NLP) has us covered. By applying Deep Learning to NLP and creating a text classifier we can train a computer to identify whether a line from a movie was delivered by a male or female character!

Setting Up Your Environment

Colab

Deep learning usually requires a large amount of computing power and a solid GPU, and Deep Learning NLP is no exception. This used to be a barrier to entry in the field, but thanks to Google Colaboratory, it no longer is. Google Colab is a platform that lets you train models on cloud GPUs through Jupyter Notebooks, completely for free! You’ll be following along with this tutorial on Colab.

To get started with Google Colab, all you need to do is go to https://colab.research.google.com/ and sign in with your Google account. Once you’ve done that, you can make Jupyter Notebooks and use them like you normally would. Usually, you’ll want to read from and write to your Google Drive so that you can actually use data and save your results. In order to do this, add the following block of code as the first cell in every Colab Notebook you create.

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/pathtofolderwithfileshere/"  # keep the trailing slash: later cells build paths with root_dir + filename

Now you have a great environment to train your models in!

Installing Libraries

One of the best things about Colab is that it comes with all the big data science libraries like PyTorch, Tensorflow, Numpy, Matplotlib, and Scikit-Learn out of the box! In fact, it also comes with NLTK and spaCy, two of the most important NLP libraries. In short, you don’t need to install any libraries to follow along with this tutorial as long as you’re using Colab.

Obtaining Data

The data that this tutorial uses comes from the Cornell Movie-Dialogs Corpus, which contains information about 617 Hollywood films. This article is concerned with the conversational data: the lines delivered by the characters in the movies and the genders of those characters. I extracted all the relevant data and merged it into one file for easy use, which you can find here: https://drive.google.com/file/d/1pD6u40QVZ6bHeLgUUmH2eRNZv4F-byz_/view?usp=sharing. This data file contains a lot of lines from movies with associated information about the characters, including their gender, which is essential for building the classifier. Once you download the data file, upload it to a folder named data that is within your root_dir as defined above on Google Drive. When you finish, you’re ready to move on to the preprocessing stage of the tutorial.

Preprocessing Data

The first step one takes in most Data Science projects of any kind is to examine the data they’re working with and then preprocess it. For those of you who are unfamiliar with the term, preprocessing is really just making the data usable for whatever task you intend to do with it. The preprocessing tasks vary based on the field, and this section will cover a few common NLP preprocessing tasks that will help with the larger Deep Learning NLP goal.

Preliminary Analysis of the Dataset

Before you begin a Data Science project, it is always good to take a brief look at what your dataset actually looks like. We know that we only have two classes to classify: male and female. Hence, you’ll also need to look for imbalances between the two classes, which is just a fancy way of saying you need to see if the number of data points for each class is around the same. Thankfully, to get a quick overview of the data you just need to use a little Python! While you conduct the preliminary analysis of the data, you’ll also be creating two new files, male.txt and female.txt, which will make things easier when we’re training our models. Before you follow along with the following steps, create a new notebook in Colab and call it preprocessing. Once you do that and add the cell that mounts your drive on Colab, you can follow along with the rest of this section.

The first thing you should do is open up the collated_data.txt file and look at its format. You’ll notice that it uses “+++$+++” as the delimiter, which is just what it uses to separate different data values; in a CSV the delimiter is a comma. You’ll also notice that there are 7 different pieces of information in each line. In order, they are: line number, character id, movie id, character name, character gender, line text, and character’s position in the credits. You may also see that there is a ? in place of the character gender in some places. That is because the people who put the dataset together were unable to ascertain the gender of the character.
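
To make the format concrete, here is a minimal sketch of parsing a single line. The sample line is made up for illustration (it is not taken from the dataset), and parse_line and FIELDS are hypothetical helper names rather than anything the dataset ships with.

```python
DELIMITER = "+++$+++"
# Field order as described above
FIELDS = ["line_no", "chr_id", "mov_id", "chr_name", "gender", "text", "credit"]

def parse_line(raw):
    """Split one raw line into a dict, or return None if it is malformed."""
    parts = [part.strip() for part in raw.strip().split(DELIMITER)]
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

# Hypothetical sample line, written by hand for this example
sample = "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ f +++$+++ Hello there. +++$+++ 1"
record = parse_line(sample)
print(record["gender"], "|", record["text"])
```

Returning None for a malformed line is a design choice of this sketch; it lets a caller skip bad rows instead of stopping on an unpacking error.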

Now that you’ve taken a brief look over your data, you’ll need to separate the data based on the gender of the characters. This can be done in base Python without the help of any libraries. Follow along with these steps:

Add the following cell to your notebook. All it does is initialize a list that will contain the text of the lines delivered by males and then another list for females.

male_lines = []
female_lines = []

Next, you’ll need to loop through the data file and add the text to either the male or female list depending on the gender of the speaker. You can do this with basic file operations and conditional statements. Note that root_dir is defined in the first cell discussed in this tutorial and that collated_data.txt should be present in that directory.

with open(root_dir+'collated_data.txt', encoding="charmap") as data:
    for line in data:
        # Each line holds 7 fields separated by the "+++$+++" delimiter
        line_no, chr_id, mov_id, chr_name, gender, text, credit = line.strip().split("+++$+++")
        gender = gender.strip().lower()
        if gender == 'm':
            male_lines.append(text)
        elif gender == 'f':  # lines with an unknown gender ('?') are skipped
            female_lines.append(text)

You’ll notice that we split the line based on the delimiter and then have variables for each attribute that is on the line. The only two that matter for this tutorial are the character gender and the line text.

You’ll now do some preliminary analysis of the dataset which just boils down to looking at the number of male and female lines. This just requires use of the len() function.

print(len(male_lines)) #Output: 170768	
print(len(female_lines)) #Output: 71255

Yikes! There are almost 100,000 more male data points than there are female data points! That’s a massive imbalance and something that will need to be corrected before a classifier can be constructed. In the meantime, however, we can proceed with writing the male and female lines to separate text files.
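
To put a number on the imbalance, the short sketch below turns the two counts printed above into class shares. It is only arithmetic on those counts; the variable names are chosen for this example.

```python
male_count = 170768    # number of male lines printed above
female_count = 71255   # number of female lines printed above

total = male_count + female_count
male_share = male_count / total
female_share = female_count / total

print(f"male share:   {male_share:.1%}")
print(f"female share: {female_share:.1%}")
print(f"difference:   {male_count - female_count}")
```

The male share is also the accuracy a classifier would reach on this data by always guessing male, which is why accuracy on imbalanced data can look better than it is.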

By writing the male and female lines to separate files you’ll be doing yourself a favor and making it easier to reuse the data for future projects.

with open(root_dir+'male.txt', mode='w+') as male:
    for line in male_lines:
        male.write(line + '\n')

with open(root_dir+'female.txt', mode='w+') as female:
    for line in female_lines:
        female.write(line + '\n')

The above code blocks are separate because I encourage you to put them in different cells of your notebook for clarity’s sake. Now that you’re done with some very basic preprocessing, it’s time that you do some preprocessing tasks that are exclusive to NLP.

Introduction to NLP Terms and Preprocessing

There’s a lot of information that can be gleaned from words in the English language; however, you often don’t need the whole sentence to be able to ascertain its meaning. In NLP, there’s usually a lot of unimportant data that we can clear out so as to reduce the noise in the inputs to our model. The most common of these preprocessing steps include tokenization, stopword removal, and stemming. However, these steps are not always applied because sometimes they remove useful data. In fact, there will be no stopword removal or stemming applied to this dataset because of the important information that may be removed by those steps. Here are some definitions of each of these steps.

Tokenization

Tokenization is the breaking down of a sentence or document into individual tokens, which are essentially just words. This can be done with the help of a function from the NLTK library called nltk.tokenize.word_tokenize(). You can also have this done by torchtext when you’re loading data into your model, and this is what you will be doing when you write the LSTM model. By turning sentences into individual tokens you’re creating sequential data that is very useful for LSTMs.
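
As a rough illustration of what a tokenizer produces, here is a minimal regex-based sketch. It is not the NLTK or spaCy tokenizer, which handle many more cases; simple_tokenize is a hypothetical name used only here.

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into word tokens (keeping contractions together) and punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

tokens = simple_tokenize("I don't know, really.")
print(tokens)
```

Each movie line becomes an ordered list of tokens like this one, and that ordering is what the LSTM later reads step by step.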

Stopword Removal

Stopword removal is the process of removing common words in the English language from text. Often this is done so that models don’t weight extremely common words disproportionately in comparison to rarer words in the English language that show up more often in that particular text. However, no stopword removal should be done for this project since the movie lines are already fairly short and all the tokens are valuable.
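
The sketch below shows why stopword removal can hurt on short lines. The stopword set is a tiny hand-picked list made up for this example (real lists, such as the one in NLTK, are much longer), and remove_stopwords is a hypothetical helper.

```python
# Hypothetical mini stopword list, for illustration only
STOPWORDS = {"i", "you", "do", "not", "it", "the", "a", "to", "is"}

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

line = "I do not want to do it".split()
kept = remove_stopwords(line)
print(kept)
```

With this list, a seven-word line collapses to a single token and the negation is lost, which is the kind of information the classifier may need.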

Stemming

As the name suggests, stemming is just turning words into their stems. This is helpful when knowing the tense or form of a word doesn’t matter to the task at hand. However, for this text classification task the tense or form of a word may prove to be incredibly useful, which is why no stemming is applied.
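
For a feel of what stemming does, here is a deliberately crude suffix-stripping sketch. It is not the Porter stemmer or any real library’s algorithm; naive_stem and its three-suffix rule are assumptions made for illustration.

```python
def naive_stem(word):
    """Strip one common suffix, as long as at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walking", "walked", "walks", "was"]]
print(stems)
```

Three different forms of the same verb end up as one stem, so any signal carried by tense or form is gone after this step.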

Splitting the Data into Training and Testing sets

One of the most important preprocessing steps in Machine Learning in general is dividing your dataset into training and testing sets. Remember, there are a lot more male data points than female data points, which means you’ll have to correct this imbalance somehow. Keep this in mind as you begin to divide the data. The following code will still be a part of the preprocessing notebook. Your main goal will be to create a file that will contain the training data and a file that will contain the testing data.

When creating training and testing sets you must keep in mind that the number of data points for each class should be roughly the same in both the training and testing sets. The testing set will usually be much smaller than the training set, often following an 80/20 split of all the data. Scikit-learn has a built-in function, train_test_split, that splits data into training and testing for you! Before you divide the data, throw your mind back to the imbalance in the data that we saw. There are a lot more male data points than female data points. You can combat this by either randomly oversampling or randomly undersampling the training set. By randomly oversampling the train set you will increase the number of female data points by using some of them multiple times until the number of female lines matches the number of male lines in the train set. By randomly undersampling the train set you will decrease the number of male data points to match the number of female lines in the train set. Oftentimes, randomly undersampling will lead to a lower accuracy for a model because there just isn’t enough data, and for that reason you’ll be randomly oversampling the train set.

Now you may be wondering why only the train set is being randomly oversampled. If we were to randomly oversample the entire dataset before splitting it, copies of the same line would likely land in both the train set and the test set, which would then lead to an inaccurate representation of the performance of the model.
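
Random oversampling itself is a small idea, and a plain-Python sketch may help before a library does it for you. The function random_oversample and the toy lists are hypothetical, and the sketch assumes the minority list is not empty.

```python
import random

def random_oversample(majority, minority, seed=42):
    """Return the minority list padded with random repeats until it matches the majority in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shortfall = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return minority + extra

# Toy stand-ins for the train-set lists
male_train_toy = ["m1", "m2", "m3", "m4", "m5"]
female_train_toy = ["f1", "f2"]

female_balanced = random_oversample(male_train_toy, female_train_toy)
print(len(male_train_toy), len(female_balanced))
```

Every added item is a copy of an existing minority item, which is exactly why this must happen after the test set has been set aside.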

Alright, enough theory! It is time for you to write some code. Let us have around 10% of the data be for testing and 90% be for training. To properly split your data into training and testing, follow along with these steps.

Create the testing set first by simply taking the first 10,000 lines from both the male and female lists.

male_test = male_lines[:10000]
female_test = female_lines[:10000]
X_test = male_test + female_test

Now that there exists the X portion of the testing set, the labels need to be constructed. Our labels in this case will be 0 if the line was delivered by a male and a 1 if the line was delivered by a female. This can be accomplished with two simple for loops.

Y_test = []
for x in male_test:
    Y_test.append(0)
for x in female_test:
    Y_test.append(1)

The test set is now complete, so it is time to create the train set. First, take everything that wasn’t used in the test set and put it into two new lists: male_train and female_train.

male_train = male_lines[10000:]
female_train = female_lines[10000:]
X_train = male_train + female_train

Now you need to create Y_train, which will contain the labels for the lines in X_train. This is the same process that was used to make the labels for the test set.

Y_train = []
for x in male_train:
    Y_train.append(0)
for x in female_train:
    Y_train.append(1)

Since the male lines significantly outnumber the female lines in the train set, you’ll need to oversample the female lines. This can be done with the help of a library called imblearn, which is included in your Colab environment. You’ll also need to import numpy. The following code oversamples until the number of female lines is equal to the number of male lines.

import numpy as np
from imblearn.over_sampling import RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_train, Y_train = oversample.fit_resample(np.array(X_train).reshape(-1,1), Y_train)

The X_train that is created in the above code block is actually a two-dimensional array in which each row holds a single element, the movie line. It should just be a list of strings. This is an easy conversion with a quick for loop and indexing.

train_lines = []
for phrase in X_train:
    train_lines.append(phrase[0].strip())
X_train = train_lines

Now that both the training and test sets are completely constructed, they need to be converted into pandas dataframes and then saved as CSVs. The dataframes will have two columns: text and target, where text is a movie line and target is either 0 or 1 depending on the gender of the speaker. To do all of this, pandas will need to be imported but the code itself is fairly simple. You will create two dataframes and fill them with the train and test lists that have been created and then save them to root_dir.

import pandas as pd
train_df = pd.DataFrame()
test_df = pd.DataFrame()

train_df['text'] = X_train
train_df['target'] = Y_train

test_df['text'] = X_test
test_df['target'] = Y_test

train_df.to_csv(root_dir + 'train.csv')
test_df.to_csv(root_dir + 'test.csv')

Amazing! You now have cleaned data and a training and testing set! The actual creation of the models will take you less time than the preprocessing stage and this is often true of real-life data science projects. Without further ado, it is time to move on to building the classifiers.

LSTMs for Text Classification

How does a Recurrent Neural Network (RNN) work?

Long Short Term Memory Networks (LSTMs) are a version of RNNs. To properly understand how LSTMs work, one needs to know how RNNs work. What’s great about RNNs is that they have internal memory, which feed-forward Neural Networks do not. People use RNNs when they’re dealing with sequential data, such as language! In this explanation of the workings of an RNN, it is assumed that you know how basic feed-forward networks work.

In an RNN the input data cycles in a loop: when it comes time to make a decision, the RNN takes into account the current input and what it has retained from the inputs that came before it. A normal RNN only has short-term memory, which is one of the reasons LSTMs need to be used, so that the network has long-term memory as well. In essence an RNN has two inputs: the current input, and the recent inputs. This provides an edge when doing language related tasks. Additionally, unlike Feed-Forward networks which can only map one input to one output, RNNs can do one to many, many to many, and many to one. This is a brief summary of RNNs and there’s a lot of in-depth math that one can get into, and I would advise you to read up on that to get a really thorough understanding of the network.
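
The loop described above can be sketched with a single scalar hidden state. This is a toy many-to-one RNN with hand-picked weights (w_x, w_h, and the function name are assumptions for illustration), not a trained network.

```python
import math

def rnn_many_to_one(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a scalar RNN over a sequence and return the final hidden state."""
    h = h0
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
    return h

early = rnn_many_to_one([1.0, 0.0, 0.0])  # the signal arrives first
late = rnn_many_to_one([0.0, 0.0, 1.0])   # the signal arrives last
print(early, late)
```

The two sequences contain the same values in a different order yet give different final states, and the early signal has already faded by the end, which is the short-term memory problem that LSTMs address.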

GloVe Vectors

GloVe vectors are what we’ll be using as the inputs to our model. GloVe stands for Global Vectors for Word Representation and it is used to create word embeddings. Word embeddings often serve as the inputs for Deep Learning NLP models and are just a way to convert textual information like sentences into numerical data. This makes the input understandable for deep learning models. This section will walk you through the first few steps of writing the LSTM in PyTorch, which is really just loading the data and the pretrained GloVe vectors.
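
Conceptually, a word embedding is a lookup from a token to a fixed-length list of numbers. The sketch below uses made-up three-dimensional vectors (real GloVe values and sizes differ), and embed is a hypothetical helper.

```python
EMBEDDING_DIM_TOY = 3  # toy size; the tutorial uses 200-dimensional GloVe vectors

# Made-up vectors for illustration; these are not real GloVe values
toy_embeddings = {
    "hello": [0.1, 0.3, -0.2],
    "there": [0.0, -0.4, 0.5],
}

def embed(tokens):
    """Map each token to its vector; unknown tokens get an all-zero vector."""
    unknown = [0.0] * EMBEDDING_DIM_TOY
    return [toy_embeddings.get(tok.lower(), unknown) for tok in tokens]

vectors = embed(["Hello", "there", "stranger"])
print(vectors)
```

A sentence of three tokens becomes three vectors of equal length, which is the numerical form the network consumes.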

First you’ll need to create a new notebook on Colab to actually write the LSTM in. You should probably name it something along the lines of GenderClassifierLSTM.ipynb. Before you type up any lines of code, make sure to change the runtime of your notebook and ensure that it is utilizing a GPU. To do this, click on Runtime > Change Runtime Type and then change the Hardware Accelerator to a GPU. To load in the data and set the base for your LSTM, follow along with these steps.

Mount your Google Drive in Colab

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/pathtoyourdatahere/"  # keep the trailing slash: later cells build paths with root_dir + filename

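Import all the necessary libraries. Don’t be scared by everything that is being imported here; you’ll know what everything means by the end. Some quick highlights are that we’re using PyTorch, Numpy, Pandas, and Scikit-Learn. The PyTorch documentation is something that you’ll need to continuously look at and you can find it at https://pytorch.org/docs/stable/index.html. Note that the Field, LabelField, and BucketIterator classes used below belong to torchtext’s older API: starting with torchtext 0.9 they moved to torchtext.legacy, and they were removed in 0.12, so you may need an older torchtext version to run this code as written.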

import torch
import torch.nn as nn 
import torch.nn.functional as F
import torchtext 
import numpy as np
import pandas as pd
from torchtext import data
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence 
from sklearn.metrics import mean_squared_error

Now it is time to load in the data that you have, a fairly easy task.

train_df = pd.read_csv(root_dir + 'train.csv')
test_df = pd.read_csv(root_dir + 'test.csv')

Now, when you load these CSVs back in, you’ll end up with an extra column at the beginning, because to_csv also saved the dataframe index. The only columns you need are text and target, so extract them as follows.

X_train = train_df['text']
Y_train = train_df['target']
X_test = test_df['text']
Y_test = test_df['target']

One final bit of preprocessing is removing NaN values from your data. When pandas reads a CSV, an empty line of text comes back as NaN, which is a float, so those rows would break tokenization later on. The datasets for the model are built from train_df and test_df, so drop the NaN rows from the dataframes themselves and then refresh the extracted columns.

# Keep only the two relevant columns and drop rows whose text is NaN
train_df = train_df[['text', 'target']].dropna(subset=['text']).reset_index(drop=True)
test_df = test_df[['text', 'target']].dropna(subset=['text']).reset_index(drop=True)

# Refresh the extracted columns so they match the cleaned dataframes
X_train, Y_train = train_df['text'], train_df['target']
X_test, Y_test = test_df['text'], test_df['target']

Now you will seed your notebook so that you’ll get the same results every time you run the notebook.

SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Now if you recall, one of the important parts of preprocessing textual data is tokenization. PyTorch allows us to do this when we’re creating the fields of the model, of which we have two: TEXT and LABEL which are self-explanatory.

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

As you can see, we’re creating two fields using the built-in field classes from torchtext’s data module. The text is tokenized using spaCy, one of the text processing libraries commonly used in such projects, and include_lengths = True makes each batch carry the length of every line alongside the text.

When working with Deep Learning NLP in PyTorch and any other type of Deep Learning, you usually need to write classes to accommodate your custom Datasets and make sure you can load it into your model. In this case, you’ll be writing a custom class that will represent a dataframe.

class DataFrameDataset(data.Dataset):

    def __init__(self, df, fields, is_test=False, **kwargs):
        examples = []
        for i, row in df.iterrows():
            label = row.target if not is_test else None
            text = row.text
            examples.append(data.Example.fromlist([text, label], fields))

        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    @classmethod
    def splits(cls, fields, train_df, val_df=None, test_df=None, **kwargs):
        train_data, val_data, test_data = (None, None, None)
        data_field = fields

        if train_df is not None:
            train_data = cls(train_df.copy(), data_field, **kwargs)
        if val_df is not None:
            val_data = cls(val_df.copy(), data_field, **kwargs)
        if test_df is not None:
            test_data = cls(test_df.copy(), data_field, True, **kwargs)

        return tuple(d for d in (train_data, val_data, test_data) if d is not None)

All of this code is standard among many projects that I’ve done before and you will most likely end up using this class multiple times, so be sure to save it! The most important part of this class is the splits method, which is used to turn the dataframes into train and test datasets that are readable by the model we create, using the TEXT and LABEL fields.

The next step is to make the train and test datasets readable to the model, and to do this you’ll be using the splits method in the DataFrameDataset class that you wrote. Note that the test dataframe is passed as val_df so that it keeps its labels for evaluation.

fields = [('text',TEXT), ('label',LABEL)]
train_ds, test_ds = DataFrameDataset.splits(fields, train_df=train_df, val_df=test_df)

Now you have your train and test datasets in a readable format, and you are ready to construct GloVe vectors.

When building the vocabulary you have to set a cap on its size, and in this case the cap will be the number of lines in the training set. You also have to choose the size of each GloVe vector, or how many dimensions it will have. The code below uses the 200-dimensional glove.6B vectors. In the following code block, you are building the vocabulary for your TEXT field. Words that are missing from GloVe are initialized to zeros through unk_init.

MAX_VOCAB_SIZE = len(train_df['text'])

TEXT.build_vocab(train_ds, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = 'glove.6B.200d',
                 unk_init = torch.Tensor.zero_)

Having built the vocabulary for the text, you’ll need to do the same for your labels but you won’t be using GloVe vectors.

LABEL.build_vocab(train_ds)
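
To make the vocabulary step less of a black box, here is a rough plain-Python sketch of the idea: count tokens, keep the most frequent ones up to a cap, and reserve slots for unknown and padding tokens. It is not torchtext’s implementation; build_vocab_sketch, the tie-breaking rule, and the placement of the special tokens are assumptions.

```python
from collections import Counter

def build_vocab_sketch(tokenized_lines, max_size):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    counts = Counter(tok for line in tokenized_lines for tok in line)
    # Most frequent first; ties broken alphabetically (an assumption of this sketch)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in ranked[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

toy_lines = [["i", "love", "you"], ["i", "know"], ["you", "know", "me"]]
itos, stoi = build_vocab_sketch(toy_lines, max_size=3)

# Tokens outside the vocabulary fall back to the <unk> index
love_index = stoi.get("love", stoi["<unk>"])
print(itos, love_index)
```

The cap only limits how many distinct tokens get their own index; everything else shares the unknown slot.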

Alright! You’ve finished all the preprocessing you need and have set the basis for writing your LSTM. It is time to learn more about the wonder that is a Long Short Term Memory Network.

What are LSTMs?

LSTMs are an improvement upon RNNs. It was mentioned earlier that a normal RNN only has short-term memory, and this is what LSTMs improve upon. LSTMs are able to maintain memories long-term, which significantly boosts their performance.

LSTMs are centered around something called the cell state which is commonly thought of as a conveyor belt. This conveyor belt goes through the chain of modules of the neural network. Information usually goes through the chain unchanged and uninterrupted. However, the LSTM can alter the information that the cell state has through the use of “gates”. Gates are made of a pointwise multiplication operation and a sigmoid neural net layer. If you’re familiar with deep learning you’ll know that the sigmoid layer just outputs numbers between zero and one. This corresponds to how much of each component should be let through. As you may surmise, 0 means nothing should be let through and 1 means everything should be let through. LSTMs have three such gates and that is how they control the flow of information in the networks. I’d suggest that you read more about the math behind Long Short Term Memory Networks after you implement one and it will help you gain a better understanding of the network.
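
A single gate, as described above, is just a sigmoid followed by a pointwise multiplication. The sketch below shows that one operation on a toy cell state; the gate scores are hand-picked here, whereas a real LSTM learns them, and apply_gate is a hypothetical name.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(values, gate_scores):
    """Scale each value by the sigmoid of its gate score (pointwise multiplication)."""
    return [v * sigmoid(g) for v, g in zip(values, gate_scores)]

cell_state = [2.0, -1.0, 0.5]
gate_scores = [10.0, -10.0, 0.0]  # roughly: let through, block, let half through

gated = apply_gate(cell_state, gate_scores)
print(gated)
```

A large positive score keeps a component almost unchanged, a large negative score nearly erases it, and a score of zero lets exactly half through.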

Enough theory! Time to implement this in PyTorch. Follow along with these steps and you’ll be golden the next time you want to implement an LSTM.

Before you really get into writing the LSTM, there are some housekeeping tasks to do that are common amongst most PyTorch Neural Network implementations. Namely, making sure that you’ll be using a GPU to train and choosing some hyperparameters. Another important thing that you’re doing is declaring a train_iterator and a valid_iterator. These will be used during the training and testing of the model respectively to, as the names suggest, iterate through the data in batches.

BATCH_SIZE = 256

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_ds, test_ds), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

Alright, the only hyperparameter that is defined so far is the BATCH_SIZE. There are a lot of other important hyperparameters that should be discussed. They are all in the code block below with accompanying explanations.

# Hyperparameters
num_epochs = 25 #The number of full passes over the training data
learning_rate = 0.001 #The step size the optimizer uses when updating the weights

INPUT_DIM = len(TEXT.vocab) #The number of tokens in the vocabulary (rows of the embedding table)
EMBEDDING_DIM = 200 #The GloVe embedding dimension, which is 200
HIDDEN_DIM = 256 #The size of the LSTM's hidden state
OUTPUT_DIM = 1 #The number of output dimensions: 1 (a single score for the 0/1 label)
N_LAYERS = 4 #The number of stacked LSTM layers
BIDIRECTIONAL = True #Reads each line in both directions; keep this True because fc1 expects hidden_dim * 2 inputs
DROPOUT = 0.2 #Dropout randomly ignores neurons during training; the higher the value, the greater the share ignored
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] #Index of the padding token; shorter sequences in a batch are padded up to the longest sequence in that batch
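
Padding is easier to picture on a tiny batch. The sketch below pads token lists to the longest one in the batch and sorts them longest first, which mirrors sort_within_batch = True and the ordering that pack_padded_sequence expects by default. pad_batch and the toy batch are hypothetical, and the iterator does this work for you in the real code.

```python
PAD = "<pad>"

def pad_batch(batch):
    """Sort sequences longest first and pad each one to the longest length in the batch."""
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    longest = max(lengths, default=0)
    padded = [seq + [PAD] * (longest - len(seq)) for seq in ordered]
    return padded, lengths

toy_batch = [["hi"], ["how", "are", "you"], ["go", "away"]]
padded, lengths = pad_batch(toy_batch)
print(padded)
print(lengths)
```

Keeping the original lengths next to the padded batch is what later lets the network ignore the padding positions.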

Now comes the exciting part, actually writing the LSTM. You’ll be creating a class called LSTM_net that inherits from PyTorch’s nn.Module. As with any class that one writes in Python, the first thing to do is write the __init__ method.

class LSTM_net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc1 = nn.Linear(hidden_dim * 2, hidden_dim)
        
        self.fc2 = nn.Linear(hidden_dim, 1)
        
        self.dropout = nn.Dropout(dropout)

If you take a look at the parameters that the __init__ method takes, you’ll notice that they are the hyperparameters we’ve already set and that they’re being used to construct the LSTM. The method starts off with a classic trait of inheritance in Python, which is calling super().__init__ to call the init method of the nn.Module class; its documentation is worth a look. Next, the embedding for the LSTM is constructed using the vocab size, embedding dimensions, and padding index. This embedding is just a simple lookup table that stores embeddings of a fixed dictionary and size, and it is being used to store the GloVe word embeddings.

You’ll also notice that the LSTM itself, stored in self.rnn, is built with nn.LSTM using some of the hyperparameters that have already been defined. You may be confused by the two variables called self.fc1 and self.fc2, but don’t fear, these are just two fully connected (FC) layers that sit on top of the LSTM, with the first one being larger than the second one. The first takes hidden_dim * 2 inputs because the network is bidirectional. The last variable that is initialized is the dropout of the network, which was discussed earlier.

Now it is time to move on to the second of the two methods that this class will have: forward(), which encodes the forward pass for the network.

    # forward() is a method of LSTM_net, so it goes inside the class, below __init__
    def forward(self, text, text_lengths):
        embedded = self.embedding(text)

        # pack the padded batch so the LSTM skips the padding tokens;
        # recent PyTorch versions require the lengths to be on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        #hidden = [batch size, hid dim * num directions]
        output = self.fc1(hidden)
        output = self.dropout(self.fc2(output))
        return output

The forward pass looks up the embedding for each token, packs the padded batch, and runs it through the LSTM. It then concatenates the final forward and backward hidden states, applies dropout, and passes the result through the two fully connected layers, with dropout applied once more, to get the final output. If you would like to know more about these functions I would suggest you take a look at the PyTorch documentation.

Ok, now that the LSTM class has been created, you’ll need to make an instance of the class.

#creating instance of our LSTM_net class

model = LSTM_net(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

It is time to store the embeddings in a variable, conveniently labeled pretrained_embeddings, and then impart this knowledge to the model by copying them into its embedding layer.

pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

The padding token should not carry any meaning, so you’ll need to fill its embedding with zeros, which can be done with the following line of code.

model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

To make sure the model trains on the GPU, use the following line of code.

model.to(device)

All neural networks need a loss function and optimizer! Add them with the following block of code.

# Loss and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), learning_rate)
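
For intuition about the loss, the sketch below computes binary cross-entropy directly from raw scores (logits) in plain Python, using a common numerically stable form of the formula. It illustrates the quantity that nn.BCEWithLogitsLoss averages over a batch and is not PyTorch’s code; bce_with_logits is a hypothetical name.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw scores and 0/1 targets."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

confident_right = bce_with_logits([5.0], [1.0])
confident_wrong = bce_with_logits([5.0], [0.0])
undecided = bce_with_logits([0.0], [1.0])
print(confident_right, confident_wrong, undecided)
```

A confident correct score costs almost nothing, a confident wrong score is penalized heavily, and a score of zero (a 50/50 guess) costs log 2.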

Next, you’ll be writing a function that will calculate the accuracy of your model’s predictions with some basic logic. Pay attention to the comments to understand what’s happening!

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
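
The same accuracy logic can be traced by hand on a tiny batch. This plain-Python sketch mirrors the steps of binary_accuracy (sigmoid, round, compare) on made-up scores; binary_accuracy_plain is a hypothetical name and the numbers are for illustration only.

```python
import math

def binary_accuracy_plain(logits, labels):
    """Share of examples whose rounded sigmoid score matches the 0/1 label."""
    preds = [round(1.0 / (1.0 + math.exp(-x))) for x in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

# Made-up raw scores and labels (0 = male, 1 = female)
accuracy = binary_accuracy_plain([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0])
print(accuracy)
```

Positive scores round to 1 and negative scores round to 0, so three of the four toy predictions match their labels.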

Training the LSTM

Whew! You’ve worked through a lot so far and you’re almost at the end of the road! It is time to train and test the model!

You’ll need to write a function that you’ll use to train the model.

def train(model, iterator):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        text, text_lengths = batch.text
        optimizer.zero_grad()
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

This function runs one epoch of training. For every batch it clears the old gradients, performs a forward pass, computes the loss, measures accuracy with the binary_accuracy function you wrote earlier, and then backpropagates and updates the weights. When the epoch is done, it returns the loss and accuracy averaged over all batches.
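
One detail worth knowing: dividing the summed per-batch accuracy by len(iterator) gives a mean of batch means, which only equals the true per-example accuracy when every batch has the same size. The sketch below uses made-up 0/1 correctness values and deliberately uneven batches to exaggerate the gap; both helper names are hypothetical.

```python
def mean_of_batch_means(batches):
    """What train() and evaluate() compute: average the per-batch averages."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def per_sample_mean(batches):
    """Average over every example, regardless of which batch it was in."""
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    return total / count

# Made-up correctness flags: two batches of four and a final batch of one
batches = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0], [0.0]]
batch_mean = mean_of_batch_means(batches)
sample_mean = per_sample_mean(batches)

# With equal-sized batches the two numbers agree
equal_batches = [[1.0, 0.0], [1.0, 1.0]]
```

In practice only the last batch tends to be smaller, so the difference is usually tiny, but it explains why the reported numbers can differ slightly from an accuracy computed over the whole dataset at once.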

You’ll also need a function that you’ll use to evaluate the model’s performance.

def evaluate(model, iterator):
    
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text

            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

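This function loops through whichever iterator it is given (validation or test data), feeds each batch to the model, and compares the predictions with the actual labels. Because it runs in evaluation mode inside torch.no_grad(), no gradients are computed and the weights are not updated. It then returns the average loss and accuracy for the epoch.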

This will be the last block of code you write, and it is what will actually train the model. It is fairly plain Python and requires one import: the time module, which is part of Python's standard library, so you don’t have to install anything.

import time

t = time.time()
loss=[]
acc=[]
val_acc=[]
val_losses=[]

for epoch in range(num_epochs):
    train_loss, train_acc = train(model, train_iterator)
    val_loss, valid_acc = evaluate(model, valid_iterator)
    print("Epoch " + str(epoch) + " :")
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\tVal Loss: {val_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
    print('\n')
    loss.append(train_loss)
    acc.append(train_acc)
    val_acc.append(valid_acc)
    val_losses.append(val_loss)
print(f'time:{time.time()-t:.3f}')

The code block above records the loss and accuracy for each epoch in lists that you can graph to see how the model performs over the epochs. It also measures the total time the training run takes and prints it at the end. With the current hyperparameters, you’ll end up with a validation accuracy of around 70% and a training time of about 30 minutes. By adjusting the hyperparameters you can boost the performance of the model, but that may come at the cost of a longer training time.
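
Since the lists hold one entry per epoch, they can also tell you which epoch performed best on the validation set. A minimal sketch follows; best_epoch is a hypothetical helper, and the numbers are made up for illustration rather than taken from this model.

```python
def best_epoch(val_losses):
    """Return (epoch index, loss) for the lowest validation loss."""
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index, val_losses[index]

# Made-up history in the same shape as the lists built by the training loop
val_losses = [0.66, 0.61, 0.58, 0.60, 0.63]
val_acc = [0.61, 0.66, 0.70, 0.69, 0.68]

epoch, lowest_loss = best_epoch(val_losses)
acc_at_best = val_acc[epoch]
```

A validation loss that starts rising again while the training loss keeps falling, as in the made-up values above, is a common sign of overfitting and a reasonable point to stop training.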

Conclusion

You’ve learned a lot in this article, mainly how to perform binary text classification on a dataset with PyTorch. These skills are transferable to any other text dataset with two labels, but the amount of work required will vary from dataset to dataset. Some come pre-cleaned, in which case you just have to build a model. Others are rough, and you’ll have to do a lot of text preprocessing before you even think about building a model. Preprocessing is usually the most time-consuming part of developing a model besides training the model itself. You are now armed with incredibly valuable knowledge, and I advise you to go out, find a dataset, and practice the skills you just learned.

If you enjoyed this post and feel like you learned something, consider subscribing to my newsletter. Every Sunday, I send out a newsletter that contains the best programming and learning related content I’ve seen in the past week along with my own thoughts on the events of the week. The main goal of the newsletter is to bring meaningful and thought-provoking ideas to your inbox every Sunday. Consider signing up if you’re interested.

Siddhant Dubey
