Getting Started With NLP + Transformers
Introduction
This tutorial goes over natural language processing (NLP) using deep learning in Python, aimed at people who are completely new to the topic. I’ll be using the Hugging Face Transformers library (docs) with a pre-trained model, applying it to a current Kaggle competition — the English Language Learners competition. In this competition, essays written by 8th-12th grade English language learners are scored to assess their English proficiency. Using deep learning to automate this kind of assessment can give students faster feedback and reduce teachers’ grading burden.
This is part of the work I’ve done for Jeremy Howard’s fast.ai 2022 Part 1 course. If you haven’t heard of this course, I recommend checking it out as a great introduction to deep learning.
To follow along with this tutorial, create a Kaggle notebook linked to this competition so that you can access the dataset and online computing resources.
Data Exploration
Load the training dataset into a pandas DataFrame, and print the columns to check their non-null counts and data types:
import pandas as pd
path = '../input/feedback-prize-english-language-learning/'
df = pd.read_csv(path + 'train.csv')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text_id 3911 non-null object
1 full_text 3911 non-null object
2 cohesion 3911 non-null float64
3 syntax 3911 non-null float64
4 vocabulary 3911 non-null float64
5 phraseology 3911 non-null float64
6 grammar 3911 non-null float64
7 conventions 3911 non-null float64
dtypes: float64(6), object(2)
memory usage: 244.6+ KB
None
There are 3911 rows in the training set, and it looks like there are no missing values. text_id and full_text are strings, whereas the 6 essay scoring measures are floats.
Now I’m going to look at the distribution of values for the 6 scoring measures:
df.describe()
Looks like the average is roughly 3 points. Note that the scores range from 1.0 to 5.0 in increments of 0.5.
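To confirm that, you can list the unique values of one of the measures (a quick sketch; any of the six scoring columns works the same way):
sorted(df['cohesion'].unique())   # should show values between 1.0 and 5.0 in steps of 0.5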
Print out some sample rows to get an idea of what the dataset looks like:
df.sample(n=5, random_state=1)
You can see that text_id is a unique identifier for each student essay, and full_text is the column that contains the full text of each essay.
One Hot Encoding
Each scoring measure can take on values from 1.0 to 5.0, in increments of 0.5. Since there is a limited number of possible values (as opposed to being continuous), I’m going to treat this as a classification problem. Accordingly, I need to perform one hot encoding to make dummy variables representing each outcome (e.g., grammar-1.0, grammar-1.5, grammar-2.0, …). Each dummy variable will have a value that is either 0 or 1 to denote if it is a member of that category.
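As a minimal illustration of what one hot encoding produces, here is a sketch with a tiny made-up column (the column name, values, and output below are just for this toy example):
toy = pd.DataFrame({'grammar': [1.0, 2.5, 1.0]})
print(pd.get_dummies(toy, columns=['grammar'], dtype='float64'))
#    grammar_1.0  grammar_2.5
# 0          1.0          0.0
# 1          0.0          1.0
# 2          1.0          0.0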
Make dummy variables:
df_original = df.copy()
target_cols = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
df = pd.get_dummies(df, columns=target_cols, dtype='float64')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 56 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text_id 3911 non-null object
1 full_text 3911 non-null object
2 cohesion_1.0 3911 non-null float64
3 cohesion_1.5 3911 non-null float64
4 cohesion_2.0 3911 non-null float64
5 cohesion_2.5 3911 non-null float64
6 cohesion_3.0 3911 non-null float64
7 cohesion_3.5 3911 non-null float64
8 cohesion_4.0 3911 non-null float64
9 cohesion_4.5 3911 non-null float64
10 cohesion_5.0 3911 non-null float64
11 syntax_1.0 3911 non-null float64
12 syntax_1.5 3911 non-null float64
13 syntax_2.0 3911 non-null float64
14 syntax_2.5 3911 non-null float64
15 syntax_3.0 3911 non-null float64
16 syntax_3.5 3911 non-null float64
17 syntax_4.0 3911 non-null float64
18 syntax_4.5 3911 non-null float64
19 syntax_5.0 3911 non-null float64
20 vocabulary_1.0 3911 non-null float64
21 vocabulary_1.5 3911 non-null float64
22 vocabulary_2.0 3911 non-null float64
23 vocabulary_2.5 3911 non-null float64
24 vocabulary_3.0 3911 non-null float64
25 vocabulary_3.5 3911 non-null float64
26 vocabulary_4.0 3911 non-null float64
27 vocabulary_4.5 3911 non-null float64
28 vocabulary_5.0 3911 non-null float64
29 phraseology_1.0 3911 non-null float64
30 phraseology_1.5 3911 non-null float64
31 phraseology_2.0 3911 non-null float64
32 phraseology_2.5 3911 non-null float64
33 phraseology_3.0 3911 non-null float64
34 phraseology_3.5 3911 non-null float64
35 phraseology_4.0 3911 non-null float64
36 phraseology_4.5 3911 non-null float64
37 phraseology_5.0 3911 non-null float64
38 grammar_1.0 3911 non-null float64
39 grammar_1.5 3911 non-null float64
40 grammar_2.0 3911 non-null float64
41 grammar_2.5 3911 non-null float64
42 grammar_3.0 3911 non-null float64
43 grammar_3.5 3911 non-null float64
44 grammar_4.0 3911 non-null float64
45 grammar_4.5 3911 non-null float64
46 grammar_5.0 3911 non-null float64
47 conventions_1.0 3911 non-null float64
48 conventions_1.5 3911 non-null float64
49 conventions_2.0 3911 non-null float64
50 conventions_2.5 3911 non-null float64
51 conventions_3.0 3911 non-null float64
52 conventions_3.5 3911 non-null float64
53 conventions_4.0 3911 non-null float64
54 conventions_4.5 3911 non-null float64
55 conventions_5.0 3911 non-null float64
dtypes: float64(54), object(2)
memory usage: 1.7+ MB
None
As you can see, all of the dummy variables are now present in the DataFrame df, in 54 different columns. However, Transformers assumes that a single column called labels contains all of the “labels” (i.e., the correct answers), so I’m going to create a labels column to hold all 54 numbers for a given essay:
labels = df.columns[2:]
df['labels'] = [df.iloc[i][labels].to_numpy() for i in df.index]
df = df.drop(labels=labels, axis=1)
df
Looks good! You can see the newly added labels column on the right-hand side of the DataFrame, holding a vector of the one-hot-encoded scores for each essay. I’m going to print the first row:
df.iloc[0]['labels']
array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
0.0, 0.0], dtype=object)
As expected, most values are 0.0 since only one category per scoring measure should have a value of 1.0.
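As a quick check (a small sketch), each labels vector should contain exactly six 1.0s, one per scoring measure:
print(df.iloc[0]['labels'].sum())   # expect 6.0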
Next, convert the DataFrame to the Dataset object that Transformers uses:
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds
Dataset({
features: ['text_id', 'full_text', 'labels'],
num_rows: 3911
})
You can see that ds contains all of the DataFrame’s columns and the correct number of rows.
Tokenization & Numericalization
To perform tokenization and numericalization, the next step is to pick a pre-trained model. I want to use something small (deberta-v3-small) so that it runs fast and I can iterate quickly. There are several ways you can access this model. If internet is turned ON in your Kaggle notebook, you can use trained_model = 'microsoft/deberta-v3-small' in lieu of the code below. If internet is OFF, add this public Kaggle dataset to your notebook’s input data and use trained_model = '../input/debertav3small'. If you intend to submit your notebook to the competition’s Leaderboard, you will eventually need to turn internet off, so the latter option is best.
Create a tokenizer for this model:
trained_model = '../input/debertav3small'
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(trained_model)
Encoding means translating text to numbers. It’s a 2-step process: (1) tokenization, (2) numericalization. Tokenization is when the text is split into tokens, which can be words or sub-words. Numericalization is when each token is converted into a number based on a given vocabulary.
This tokenizer can split the student essays, or any text, into tokens:
tokz.tokenize('Hello everyone! I hope you are having a wonderful day.')
['▁Hello',
'▁everyone',
'!',
'▁I',
'▁hope',
'▁you',
'▁are',
'▁having',
'▁a',
'▁wonderful',
'▁day',
'.']
The underscore represents the start of a new word.
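Numericalization happens when the tokenizer is called directly: each token is mapped to its integer ID in the model’s vocabulary, and the model’s special tokens are added. A small sketch (the exact IDs depend on the deberta-v3-small vocabulary):
enc = tokz('Hello everyone! I hope you are having a wonderful day.')
print(enc['input_ids'])   # one integer per token, plus the special tokens the model expects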
Now I’m going to tokenize every student essay in the full_text column using batched=True, which speeds things up by applying the tokenization function tok_func to multiple elements of the dataset at once instead of one at a time. I’m also going to define tok_func so that it truncates each essay at a maximum length of 512 tokens (to speed up training):
def tok_func(x):
return tokz(x['full_text'], max_length=512, truncation=True)
tok_ds = ds.map(tok_func, batched=True)
tok_ds
Dataset({
features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 3911
})
Notice that there are 3 new columns: input_ids, token_type_ids, and attention_mask.
- input_ids contains a vector of integers that corresponds to the tokens in each student’s essay, based on the tokenizer’s vocabulary
- token_type_ids is usually used for classification on pairs of sentences or for question-answering tasks, where 0s indicate the first sentence and 1s indicate the second sentence (for models that accept them)
- attention_mask has 0s and 1s (think of how you would use a Boolean mask), where tokens with a 1 should be attended to and tokens with a 0 should be ignored; the 0s correspond to padding tokens added when a text is shorter than the maximum length (since each sequence in a batch needs to have the same length)
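To see these fields on a real row, you can decode the first essay’s input_ids back into text (a quick sketch):
row = tok_ds[0]
print(len(row['input_ids']))                # number of tokens for this essay (at most 512 after truncation)
print(tokz.decode(row['input_ids'][:20]))   # first 20 tokens decoded back to (roughly) the original text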
Training, Validation, and Test Sets
I’m going to split the dataset into 75% and 25% for the training set and validation set, respectively. I’m going to use a DatasetDict to hold these datasets since that’s what Transformers uses. Note that the “test” dataset below actually refers to the validation set, so I’m doing some renaming to make its purpose clear:
train_valid = tok_ds.train_test_split(0.25, seed=42)
ds_dict = DatasetDict({
'train': train_valid['train'],
'valid': train_valid['test']
})
ds_dict
DatasetDict({
train: Dataset({
features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 2933
})
valid: Dataset({
features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 978
})
})
I also need to prepare the test set. Repeat the processing steps (performed in the previous section) for the test set:
test_df = pd.read_csv(path + 'test.csv')
test_ds = Dataset.from_pandas(test_df).map(tok_func, batched=True)
Note that the test set only has 3 rows. Since this is a Kaggle code competition, the notebook itself is submitted and it will be internally run with a secret test set. This 3-row test set is just a formality to check that the notebook executes without error and can successfully create a submission.csv file.
Model Training
Now I’m almost ready to train the model! As described in the competition, the predetermined Kaggle metric is the mean columnwise root mean squared error (MCRMSE):
$$\textrm{MCRMSE} = \frac{1}{N_{t}}\sum_{j=1}^{N_{t}}\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{ij} - \hat{y}_{ij})^2}$$
where $N_t$ is the number of scored ground truth target columns (a total of 6), and $y$ and $\hat{y}$ are the actual and predicted values, respectively.
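In plain NumPy, the metric looks like this (a minimal sketch with made-up 2-essay, 6-column arrays, just to make the column-wise averaging concrete):
import numpy as np
y    = np.array([[3.0, 3.5, 4.0, 3.0, 2.5, 3.0],
                 [2.0, 2.5, 3.0, 2.5, 2.0, 2.5]])   # actual scores (rows = essays, columns = the 6 measures)
yhat = np.array([[3.5, 3.5, 3.5, 3.0, 3.0, 3.0],
                 [2.0, 3.0, 3.0, 2.0, 2.0, 3.0]])   # predicted scores
rmse_per_column = np.sqrt(np.mean((y - yhat) ** 2, axis=0))   # RMSE for each of the 6 measures
mcrmse = rmse_per_column.mean()                               # average the column-wise RMSEs
print(mcrmse)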
Since the MCRMSE is based on the actual scores (not the one-hot-encoded categories), I need to write a function to construct the scores from the one-hot-encoded columns by taking a weighted average:
import numpy as np
# 54-element vector of the score levels behind the one-hot-encoded categories: 1.0, 1.5, ..., 5.0, repeated for each of the 6 measures
levels_9 = np.arange(2, 11) / 2
levels_54 = np.tile(levels_9, 6)
def construct_scores(weights):
    # Multiply each weight (probability or one-hot value) by its corresponding score level
    levels_matrix = np.tile(levels_54, (len(weights), 1))
    weights_levels = weights * levels_matrix
    # Sum each 9-column block to get one score per measure: cohesion, syntax, vocabulary, phraseology, grammar, conventions
    weighted_avg = np.array([np.sum(weights_levels[:, 0:9], axis=1), np.sum(weights_levels[:, 9:18], axis=1),
                             np.sum(weights_levels[:, 18:27], axis=1), np.sum(weights_levels[:, 27:36], axis=1),
                             np.sum(weights_levels[:, 36:45], axis=1), np.sum(weights_levels[:, 45:54], axis=1)])
    return np.transpose(weighted_avg)
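A quick way to check this function is to feed it a perfect one-hot row: with a single 1.0 in each 9-column block, the weighted sum recovers the original scores exactly (a small sketch using the first training row):
check = construct_scores(np.array([df.iloc[0]['labels'].astype('float64')]))
print(check)                                       # reconstructed scores for the first essay
print(df_original.iloc[0][target_cols].values)     # should match the original 6 scores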
Next, I’m going to define a function called mcrmse_d that returns a dict mapping strings (the name of the returned metric) to floats (the metric’s value), since that’s what Transformers expects. This function must take an EvalPrediction object, which is a named tuple with a predictions field and a label_ids field.
Before the MCRMSE metric can be calculated, I need to first convert the logits (raw output from the model; see Hugging Face Transformers course for more info) into probabilities using a sigmoid layer, then reconstruct the scores using the probabilities as weights.
Convert raw outputs to the MCRMSE metric:
import torch
def mcrmse_d(eval_pred):
    # Extract the predicted logits and the actual labels
    logits = eval_pred.predictions
    y = eval_pred.label_ids
    # Sigmoid layer to turn the logits into probabilities between 0 and 1 for the dummy variables
    sigmoid = torch.nn.Sigmoid()
    prob = sigmoid(torch.Tensor(logits)).numpy()
    # (Re)construct scores using a weighted average
    predic_scores = construct_scores(prob)
    actual_scores = construct_scores(y)
    # Calculate MCRMSE: RMSE per scoring measure (column), averaged across the 6 measures
    mcrmse = np.mean(np.sqrt(np.mean(np.square(predic_scores - actual_scores), axis=0)), axis=0)
    return {'mcrmse': mcrmse}
I’m going to define some training hyperparameters: the batch size, the number of epochs, and the learning rate. The batch size should fit on the GPU I’m using, and a small number of epochs keeps iteration fast. For the learning rate, I used trial and error to find a value that worked well; feel free to experiment with different values. This is what I used:
bs = 16
epochs = 4
lr = 8e-5
Create a TrainingArguments object, which holds all of the hyperparameters needed for training and evaluation:
from transformers import TrainingArguments, Trainer
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
evaluation_strategy='epoch', per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
num_train_epochs=epochs, weight_decay=0.01, report_to='none')
The only required argument is the directory where the trained model will be saved; in this case, 'outputs'.
Create the model, using the trained_model we specified earlier as well as the number of labels:
model = AutoModelForSequenceClassification.from_pretrained(trained_model, problem_type='multi_label_classification', num_labels=54)
Create the Trainer, where train_dataset refers to the training set and eval_dataset refers to the validation set:
trainer = Trainer(model, args, train_dataset=ds_dict['train'], eval_dataset=ds_dict['valid'], tokenizer=tokz, compute_metrics=mcrmse_d)
At this point, you should turn on the GPU setting in Kaggle, if not enabled already. Here’s my GPU info:
!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P0 34W / 250W | 1315MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Finally, train the model:
trainer.train()
Training took about 9 minutes, and yielded a validation loss of 0.25 and an MCRMSE of 0.53 at the last epoch. At this point, you can save the model if you wish: trainer.save_model("trainer").
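If you do save it, the model can be reloaded later with the same from_pretrained API (a minimal sketch, assuming the "trainer" directory from the save call above):
reloaded_model = AutoModelForSequenceClassification.from_pretrained('trainer')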
Test Predictions
Now that I’ve concluded training, I’m going to calculate predictions on the test set:
preds = trainer.predict(test_ds).predictions.astype(float)
Recall that these outputs are raw logits over the 54 one-hot-encoded categories, not actual scores.
Convert to actual scores:
# Sigmoid layer to output prediction probabilities between 0 and 1 for dummy variables
sigmoid = torch.nn.Sigmoid()
prob = sigmoid(torch.Tensor(preds))
prob = torch.Tensor.numpy(prob)
# Reconstruct scores using a weighted average
predic_scores = construct_scores(prob)
Almost done!
Create a submission file:
import datasets
submission = datasets.Dataset.from_dict({
'text_id': test_ds['text_id'],
'cohesion': predic_scores[:, 0],
'syntax': predic_scores[:, 1],
'vocabulary': predic_scores[:, 2],
'phraseology': predic_scores[:, 3],
'grammar': predic_scores[:, 4],
'conventions': predic_scores[:, 5]
})
submission.to_csv('submission.csv', index=False)
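Before submitting, it’s worth reading the file back to confirm it has the expected columns and 3 rows (a quick sketch):
print(pd.read_csv('submission.csv'))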
And that’s it! Since this is a code competition, we don’t need to upload a CSV file to Kaggle; instead, we submit the notebook and it will be run on a secret test set to determine scoring. The notebook is required to have a submission.csv output. If you run into errors during notebook submission, check out Code Competitions’ Errors & Debugging Tips.
At the time of my submission, the top score on the Leaderboard was 0.43, and my submission scored 0.50. Looking at the MCRMSE equation, this metric roughly corresponds to the typical difference between the correct essay grade and the predicted one. Getting within about 0.5 of the right grade is pretty good, considering that two human graders (teachers) may not even agree within 0.5 on the same essay.
Resources
- Hugging Face Transformers library
- Hugging Face Transformers course
- fast.ai Lesson 4 Course Page