End of Sequence Token Explained


Let's talk about the End of Sequence (EOS) token: specifically, how it's used in instruction fine-tuning. While learning to fine-tune large language models (LLMs), I've read through some great tutorials from Phil Schmid and others that walk through the process step by step.

Training is generally explained using one of two classes: either the Huggingface Trainer class is used directly, or the TRL SFTTrainer class is used. In almost all of these tutorials, we see the line

tokenizer.pad_token = tokenizer.eos_token

But what is this doing?

I'm probably not the first one to be confused by this, given at least one Stack Overflow question I stumbled upon: https://stackoverflow.com/questions/76446228/setting-padding-token-as-eos-token-when-using-datacollatorforlanguagemodeling-fr

In this blog post, we'll dive into the differences between the EOS and pad tokens and explore their significance in the fine-tuning process, and by the end you should understand the impact of setting pad_token = eos_token!

What are the Pad and EOS tokens?

Pad Token

The pad token, often abbreviated as [PAD], is a special token used to standardize the length of input sequences during training. In fine-tuning, language models are trained on batches of data, and these batches typically consist of sequences of varying lengths. To efficiently process these sequences, they need to be of the same length. This is where the pad token comes into play.

When a sequence is shorter than the desired length, pad tokens are appended to the end of the sequence until it reaches that length. These pad tokens carry no meaningful information; they are placeholders that keep the batch a uniform shape, and the model is told (via the attention mask) to ignore them and focus on the actual content of the sequences.
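To make this concrete, here is a minimal sketch of padding a batch. GPT-2 is used purely for illustration, and (conveniently for this post) its tokenizer ships without a pad token, so we have to assign one:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # the line from the tutorials

batch = tokenizer(
    ["Short example.", "A noticeably longer example that sets the batch length."],
    padding=True,               # pad the shorter sequence up to the longest one
    return_tensors="pt",
)
print(batch["input_ids"])       # the shorter row is padded out with id 50256
print(batch["attention_mask"])  # 0s mark the padded positions to be ignored

Note that it is the attention mask, not the pad token id itself, that tells the model which positions to skip over during the forward pass.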

End of Sequence Token

The end of sequence token, denoted [EOS] or something similar, signals to the model that a sequence has reached its conclusion. It marks the termination point of a sequence and helps the model understand the boundaries between different pieces of text. In natural language generation tasks, the end of sequence token guides the model to produce coherent, well-structured output: producing the EOS token tells the generation algorithm to stop generating additional text. This is crucial for training a chatbot or text summarizer. If no EOS token is ever generated, the model will keep outputting text until it hits a length limit, and that extra text will inevitably be repetitive, incorrect, and unwanted.

When fine-tuning a language model, the end of sequence token is particularly important: it is what the model must learn to emit so that its responses end where they should. When generating paragraphs or sentences, the model should stop producing text once it emits the end of sequence token, which keeps the output from trailing off into fragmented or nonsensical text.
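As a rough illustration of how a generation loop uses the EOS token, here is a minimal sketch; the model and prompt are placeholders chosen purely for illustration, and the relevant part is the eos_token_id argument to generate.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,                   # hard cap in case EOS is never produced
    eos_token_id=tokenizer.eos_token_id,  # generation halts as soon as this id is emitted
    pad_token_id=tokenizer.eos_token_id,  # avoids the "no pad token" warning for GPT-2
)
print(tokenizer.decode(output_ids[0]))

If a fine-tuned model never learns to emit EOS, the max_new_tokens cap is the only thing that stops it, which is exactly the runaway-generation failure described above.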

How is the pad token handled in Trainer and SFTTrainer classes?

Although Trainer and SFTTrainer have similar APIs, they differ in several ways that are important to understand when choosing between them and designing a training process for an LLM.

Trainer

The Huggingface Trainer class is instantiated with a DataCollator, which is Huggingface's standard mechanism for organizing tokenized training data into batches. The data collator has a torch_call function that is executed when the Trainer requests the next batch of training data (there are corresponding tf_call functions for TensorFlow, etc.).

The torch_call of DataCollatorForLanguageModeling contains this logic:

if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
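To see this logic in action, here is a small sketch that calls DataCollatorForLanguageModeling directly with mlm=False (the causal-LM setting); this is essentially the call the Trainer makes through torch_call when it assembles a batch. GPT-2 is again just an illustrative choice.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # the line from the tutorials

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

examples = [
    tokenizer("A short example."),
    tokenizer("A somewhat longer example that forces the first one to be padded."),
]
batch = collator(examples)
print(batch["labels"])
# The padded positions in the shorter row come back as -100.
# Note that because pad_token == eos_token, any genuine EOS token in the
# inputs would be relabeled to -100 as well.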

This means that every pad token is given a label of -100. And where does this -100 value come into play? It turns out that -100 is the conventional way to mark a token whose loss should be ignored: it is the default ignore_index in the PyTorch implementation of CrossEntropyLoss, the loss function usually used when training a transformer model. The PyTorch CrossEntropyLoss implementation lives in C++, where we can see the code masking out the loss for every position whose target equals ignore_index, which is the -100 value torch defaults to.

// Note that this is the code for label_smoothed loss, but similar logic exists for non label-smoothed loss.
auto ignore_mask = target == ignore_index;
smooth_loss.index_put_({ignore_mask}, 0.0);

Finally, we can trace this behavior through a sample forward pass of the LLaMA architecture, whose documentation confirms that a label of -100 means the loss is ignored for that token. But now we know this is true not just because the docs say so: we've looked at the PyTorch source code and confirmed it!
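If you'd rather verify the ignore_index behavior without reading C++, a few lines of PyTorch reproduce it; the logits and targets below are arbitrary toy values.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # 4 token positions, vocabulary of 10
targets = torch.tensor([1, 3, -100, -100])   # last two positions are "padding"

# Positions labeled -100 contribute nothing: ignore_index defaults to -100,
# and the mean reduction only averages over the positions that were kept.
loss_all = F.cross_entropy(logits, targets)
loss_real = F.cross_entropy(logits[:2], targets[:2])
print(torch.allclose(loss_all, loss_real))   # True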

SFTTrainer

Next, how is this behavior handled in SFTTrainer? The biggest difference, visible immediately, is that the SFTTrainer doesn't use the DataCollator class directly, which means the default behavior of setting every pad token's label to -100 no longer applies.

SFTTrainer can be configured to use a variety of data-generating classes, but a common choice is the ConstantLengthDataset.

With this dataset, the functionality is quite different from the DataCollator: while the DataCollator assumes that the EOS token is already present in your dataset, the ConstantLengthDataset adds the EOS token for you:

for tokenized_input in tokenized_inputs:
    # concat_token_id is the tokenizer's EOS token id, appended after every example
    all_token_ids.extend(tokenized_input + [self.concat_token_id])

Since the purpose of the ConstantLengthDataset is to pack examples together so that no space is wasted, the pad token is never used, and there is never any need to set labels to -100.
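For reference, this is roughly how that packed setup gets wired up in practice. Exact argument names have shifted between TRL releases (newer versions move several of these onto an SFTConfig), and TrainingArguments are omitted here, so treat this as a sketch under those assumptions rather than copy-paste code:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
dataset = load_dataset("imdb", split="train")   # any dataset with a text column

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # which column holds the raw text
    packing=True,                # wraps the data in a ConstantLengthDataset
    max_seq_length=512,          # the constant length each packed chunk is cut to
)
trainer.train()

Because packing concatenates examples (separated by EOS) and slices them into full-length chunks, every token in every batch is a real token, which is why padding, and therefore the -100 masking, never enters the picture here.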

Some of the other data collators that can be used with SFTTrainer do support padding and use the -100 ignore_index to mask out the loss for pad tokens (e.g. DPODataCollatorWithPadding).

What's the point?

The main takeaway is that data preparation is massively important and easy to get wrong. Two libraries (Huggingface and TRL) that look relatively similar from the outside actually differ in important ways, and those differences can be a source of frustrating results.