
Part 2: Transformers: Input Embedding and Positional Encoding

Updated: Jan 24

Understanding Transformers: Input Embedding and Positional Encoding


Before reading this blog, refer to Part 1 below to understand the different components of Transformers.

Part 1: Transformers: Introduction to the backbone of Generative AI - https://www.aiconnectgrow.com/post/part-1-transformers-introduction-to-the-backbone-of-generative-ai, or refer to the video https://youtu.be/acUX5uIJYVs


YouTube Video: You can refer to the video https://youtu.be/JQzlCrqTMLU for a detailed explanation of this blog.


Code using PyTorch for this blog: the YouTube video https://youtu.be/Ing1vzG9Rjk explains how to code Input Embedding and Positional Encoding in detail.


Input Embedding and Positional Encoding:


Welcome to my blog! Today, I'll guide you through the essential building blocks of the Transformer, focusing on Input Embedding and Positional Encoding. We'll explore both the underlying theories and practical implementations using PyTorch. My goal is to break down the complex mechanics of Transformers, making each component clear and easy to understand.



The encoder-decoder structure of the Transformer architecture, taken from "Attention Is All You Need"

The initial step in any transformer model involves input embeddings and positional encoding. These two components work together to prepare the raw input data so the model can process it effectively.


Input Embedding and Positional Encoding of the Transformer architecture

  • Input embeddings are like a translator for the model. They convert words or tokens from the input sequence into dense vectors of numbers that the transformer can understand. These vectors represent not just the word itself but also capture its meaning and relationship to other words in the context of the sequence.


  • Positional encoding adds crucial information about the order of words in the sequence. Unlike humans, transformers don’t inherently understand the sequence of data since they process inputs all at once (in parallel). Positional encoding fills this gap, helping the model know which word comes first, second, and so on.


Together, input embeddings and positional encoding act as the foundation for the transformer. Without them, the model wouldn’t be able to grasp the meaning of the input or how the parts of the input relate to each other. Let’s now take a closer look at input embeddings to see how they work!



Let us dive into the minute details of Input Embedding and Positional Encoding. The image below illustrates how data is transformed by these layers. Once you have read the complete blog, you can refer back to this image to revise your concepts.


Input Embedding and Positional Encoding


Input Embedding


In any deep learning model, input embeddings are used to convert raw data, such as words or tokens, into numerical representations that can be processed by the model. In the context of transformers, input embeddings are crucial because they translate each word or token in a sequence into a high-dimensional vector. This vector represents the meaning and semantic properties of the word in a way that the model can understand and manipulate.


Why Is It Needed?


Natural language is made up of words or tokens, and these individual pieces of data need to be transformed into a form that a machine learning model can process. The input embedding converts each word into a vector, which the model can then use to perform tasks such as classification, translation, or generation.


For example, the word "cat" might be represented by a vector like [0.2, -0.5, 0.7]. This vector contains information that the model uses to understand the word’s relationship with other words in the sequence. The more advanced the embedding method, the more effectively it captures these semantic relationships.


How Does It Work?


In the Transformer model, the input embedding can be initialized from pre-trained embeddings such as Word2Vec or GloVe, or learned from scratch during training. Each word in the sequence is mapped to a vector of a fixed size. These vectors are learned in such a way that similar words (like "dog" and "cat") end up with embeddings that are closer together in the vector space, as the short sketch below illustrates.
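To make "closer in the vector space" concrete, here is a tiny sketch that compares made-up embedding vectors with cosine similarity (the vectors and their 4-dimensional size are invented purely for illustration; real trained embeddings have hundreds of dimensions):

import torch
import torch.nn.functional as F

# Hypothetical 4-dimensional "embeddings", for illustration only
cat = torch.tensor([0.20, -0.50, 0.70, 0.10])
dog = torch.tensor([0.25, -0.40, 0.60, 0.00])
car = torch.tensor([-0.80, 0.30, -0.10, 0.90])

# Cosine similarity is close to 1 for vectors pointing in similar directions
print(F.cosine_similarity(cat, dog, dim=0))  # relatively high: "cat" and "dog" are related
print(F.cosine_similarity(cat, car, dim=0))  # relatively low: "cat" and "car" are not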

Here’s a breakdown of how input embedding works (a toy example follows the list):


  1. Tokenization: First, the input text is tokenized into smaller pieces (words, subwords, or characters).


  2. Lookup Table: A lookup table, also called an embedding matrix, contains the embeddings for each token in the vocabulary. Each token is assigned an embedding vector, typically of high dimension (e.g., 512 or 1024).


  3. Embedding Layer: The tokenized sequence is passed through an embedding layer that maps each token to its corresponding embedding vector from the lookup table.
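
As a rough end-to-end illustration of these three steps, here is a toy sketch (the whitespace tokenizer, the vocabulary, and the 8-dimensional embedding size are all made up for demonstration):

import torch
import torch.nn as nn

# Step 1: Tokenization (a trivial whitespace tokenizer, for illustration only)
sentence = "I love programming"
tokens = sentence.lower().split()                        # ['i', 'love', 'programming']

# Step 2: Lookup table / vocabulary mapping each token to an index (hypothetical)
vocab = {"i": 0, "love": 1, "programming": 2, "transformers": 3}
token_ids = torch.tensor([[vocab[t] for t in tokens]])   # shape (1, 3)

# Step 3: Embedding layer maps each index to a dense vector (8-dimensional here)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(token_ids)                           # shape (1, 3, 8)
print(vectors.shape)                                     # torch.Size([1, 3, 8])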


Mathematics Behind It:
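
Formally, the embedding layer is just a learned lookup table. Using notation introduced here for illustration: if E is the embedding matrix of shape (vocab_size, d_model) and t is the vocabulary index of a token, then the embedding of that token is

    x_t = E[t] * sqrt(d_model)

where the sqrt(d_model) scaling follows "Attention Is All You Need" (simple implementations, including the sketch further below, often omit this scaling).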



Key Features of Input Embedding:


  • Fixed Representation: Each token is mapped to a fixed-dimensional vector; once training is complete, a given token always maps to the same vector regardless of context.


  • Learned during Training: The embeddings are initialized randomly and then fine-tuned during training to capture the nuances of the specific task the model is being trained for.


  • Semantic Similarity: Words with similar meanings tend to have similar embeddings, making it easier for the model to capture relationships between words.


PyTorch Code for Input Embedding
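
Here is a minimal sketch of the InputEmbedding class described below (class and variable names follow the explanation; the exact code in the companion video may differ slightly):

import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Lookup table: vocab_size rows, each an embed_dim-dimensional vector
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        # x: (batch_size, seq_len) tensor of token indices
        # returns: (batch_size, seq_len, embed_dim) tensor of embeddings
        return self.embedding(x)

# Example usage
vocab_size = 10000          # number of unique tokens in the vocabulary
embed_dim = 512             # size of each embedding vector

input_embedding = InputEmbedding(vocab_size, embed_dim)

# A batch of 2 sequences, each containing 3 token indices
example_input = torch.tensor([[6542, 8882, 8341],
                              [8972, 3651, 9371]])

output_embeddings = input_embedding(example_input)
print(output_embeddings.shape)   # torch.Size([2, 3, 512])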



Explanation of the Code


  1. InputEmbedding Class:


    • The class InputEmbedding inherits from nn.Module and contains an embedding layer (nn.Embedding) that is responsible for converting token indices into dense vectors.


    • The constructor __init__ initializes the embedding layer, where vocab_size is the number of unique tokens and embed_dim is the size of the embedding vectors.


  2. Forward Method:


    • The forward method takes an input tensor x (of shape (batch_size, seq_len)) containing token indices. It then returns the corresponding embeddings of shape (batch_size, seq_len, embed_dim) by passing x through the embedding layer.


  3. Example Usage:



    • The code demonstrates how to instantiate the InputEmbedding class and use it to transform a batch of tokenized sequences into embeddings.


    • vocab_size: Represents the size of your vocabulary. For example, if you have 10,000 unique tokens in your text corpus, set this to 10000.


    • embed_dim: Defines the size of the dense embedding vectors. This is typically a hyperparameter (e.g., 128, 256, 512).


    • example_input: A tensor where each row represents a sequence of token indices. Each value in the tensor corresponds to a token's index in the vocabulary.


    • Output: The output_embeddings tensor will have a shape of (batch_size, seq_len, embed_dim), e.g., (2, 3, 512).


Input Embedding Input / Output Tensor Dimensions


The dimensions of the input and output tensors for the Input Embedding example are as follows:


Input Tensor Dimensions


Thus, in the above example the input tensor shape is (N, T) = (2, 3): a batch of N = 2 sequences, each containing T = 3 tokens.


Output Tensor Dimensions



Thus, in the above example the output tensor shape is (N, T, D) = (2, 3, 512): the batch size remains 2, the sequence length remains 3, but each token is now represented by a 512-dimensional vector.


Sample Input and Output


Raw Text:


["I love programming", "Transformers are powerful"]


This text would be tokenized into the following indices based on the vocabulary of the model:


Sample Input (Token Indices):


tensor([[6542, 8882, 8341],   # Token indices for the first sentence

        [8972, 3651, 9371]])  # Token indices for the second sentence


Output Embeddings (Shape)


After passing the input tensor through the InputEmbedding layer, the output tensor will have the shape (2, 3, 512), as explained above.


Output Embeddings (Values)


The actual output tensor, which contains the embedding vectors for each token, will look like this (values are randomly initialized for illustration purposes):


tensor([[[ 1.1783, -0.4895,  1.2987,  ..., -0.2909, -0.1299, -1.4825],

         [ 0.7400,  0.2925,  1.4605,  ..., -0.9125,  0.8454, -0.3570],

         [ 0.2104,  0.6715, -0.2358,  ...,  0.1131, -0.3567,  0.5761]],


        [[ 0.3729,  0.1159,  0.6355,  ..., -0.5798,  1.2385,  0.0199],

         [-0.7880, -0.0142, -0.0356,  ..., -0.3110, -0.2363,  0.7364],

         [-0.3859, -0.2320,  0.1284,  ...,  0.0567,  0.5954, -0.1985]]])


Key Points:


  • Input Tensor: Contains the token indices for a batch of sequences.


  • Output Tensor: Contains the embeddings corresponding to each token index. Each token is represented as a 512-dimensional vector.


Github Input Embedding Sample code:


Positional Encoding


In simple terms, positional encoding is a clever trick used in Transformer models to help them understand the order of words in a sequence, such as in a sentence. Unlike RNNs or LSTMs, which process words sequentially and therefore naturally capture word order, Transformers process all words in parallel. This parallel processing means the model lacks a built-in sense of word order. Positional encoding solves this problem by adding information about the position of each word in the sequence, enabling the Transformer to understand the structure and meaning of the input.


Why Is It Needed?


Imagine trying to understand the sentence "I love programming" without knowing which word comes first, second, or last. In a Transformer model, positional encoding solves this problem by adding some extra information about the position of each word in the sequence.


How Does It Work?


Sine and cosine waves

Positional encoding uses sine and cosine functions to generate a unique signal for each word's position. Here’s a breakdown:


  1. Position as a Number: First, each position in the sequence is assigned a number (like 0 for the first word, 1 for the second, and so on).


  2. Sine and Cosine Waves: For each position, a sine wave is used for the even-numbered dimensions and a cosine wave for the odd-numbered dimensions. These waves are useful because they let us measure the difference between positions in a continuous way: they can represent both absolute positions (where is the word in the sentence?) and relative positions (how far apart are two words?).


  3. Mathematics Behind It:


    The positional encoding is defined using sine and cosine functions as follows:

    For each position p (ranging from 0 to max_len - 1) and each dimension-pair index i (so that 2i and 2i+1 together cover the embed_dim dimensions of the embedding vector):
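
    Following "Attention Is All You Need", the two formulas are:

        PE(p, 2i)   = sin( p / 10000^(2i / d_model) )
        PE(p, 2i+1) = cos( p / 10000^(2i / d_model) )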


    Where:


    • p is the position of the token in the sequence (starting from 0),

    • i indexes pairs of embedding dimensions (so 2i covers the even dimensions and 2i+1 the odd ones),

    • d_model is the embedding dimension (embed_dim).


    The terms 2i and 2i+1 indicate that even indices are associated with sine values, and odd indices with cosine values. This alternating pattern helps the model differentiate between positions at different embedding dimensions.


    Intuition Behind the Formula


    • Scaling: The term 10000^(2i/d_model) scales the position values so that they vary in a smooth and consistent manner across different dimensions. The lower-dimensional sine and cosine functions oscillate faster (shorter wavelengths), while the higher-dimensional ones oscillate more slowly (longer wavelengths).


    • Sine and Cosine Functions: The sine and cosine functions are periodic, so each dimension's encoding repeats with its own wavelength. This periodicity lets the model reason about positions relative to each other (e.g., how far apart position 0 and position 3 are); the identity below shows why relative offsets are easy to represent.
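
    A quick way to see the relative-position property (a point also noted in "Attention Is All You Need"): for a fixed frequency w = 1 / 10000^(2i / d_model), the standard angle-addition formulas give

        sin(w(p + k)) = sin(wp) * cos(wk) + cos(wp) * sin(wk)
        cos(w(p + k)) = cos(wp) * cos(wk) - sin(wp) * sin(wk)

    so the encoding of position p + k is a linear combination of the encoding of position p, with coefficients that depend only on the offset k.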


    Key Features of Positional Encoding:


    • No learnable parameters: Positional encoding is a fixed function of the position and dimension, which makes it easy to implement.


    • Continuous and smooth: The sine and cosine functions create smooth variations, allowing the model to compute relative positions.


    • Compatibility with any sequence length: Because the encoding is a fixed function of position, it can be computed for sequences of any length (up to the chosen max_len) without retraining.


PyTorch Code for Positional Encoding
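
Here is a minimal sketch of the PosEnc class that the explanation below refers to (names such as pos, div_term, and pos_enc match that explanation; the exact code in the companion video may differ slightly):

import math
import torch
import torch.nn as nn

class PosEnc(nn.Module):
    def __init__(self, embed_dim, max_len):
        super().__init__()
        # Zero tensor that will hold the encodings for every position and dimension
        pos_enc = torch.zeros(max_len, embed_dim)

        # Token positions 0, 1, ..., max_len-1 as a column vector of shape (max_len, 1)
        pos = torch.arange(0, max_len).float().unsqueeze(1)

        # Divisors: exponentially scaled frequencies for the even dimensions
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                             * -(math.log(10000.0) / embed_dim))

        # Sine on even indices, cosine on odd indices
        pos_enc[:, 0::2] = torch.sin(pos * div_term)
        pos_enc[:, 1::2] = torch.cos(pos * div_term)

        # Add a batch dimension and store as a non-trainable buffer
        self.register_buffer("pos_enc", pos_enc.unsqueeze(0))

    def forward(self, inputs):
        # inputs: (batch_size, seq_len, embed_dim)
        # Slice the encodings to the current sequence length and add them element-wise
        return inputs + self.pos_enc[:, :inputs.size(1)]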



Explanation of the Code


1. Class Initialization (__init__ Method)


Parameters:


  • embed_dim: The dimension of the embedding vector (e.g., 512). It determines how many values represent each position in the sequence.


  • max_len: The maximum sequence length. It specifies the maximum number of tokens the model can handle in one sequence.


Steps:


  • Creates a zero tensor to hold the positional encodings:


    This tensor will eventually store the sine and cosine values for all positions and embedding dimensions.


  • Defines the positions of tokens:


    • torch.arange(0, max_len): Creates a tensor [0, 1, 2, ..., max_len-1] representing token positions.

    • .unsqueeze(1): Adds a second dimension to make it a column vector of shape (max_len, 1).


  • Calculates the divisors for sine and cosine functions:


    • torch.arange(0, embed_dim, 2): Creates indices [0, 2, 4, ...] for even dimensions.

    • -(math.log(10000.0) / embed_dim): Scales the embedding dimensions exponentially.

    • torch.exp(...): Applies the exponential function to generate the divisors for encoding.


  • Assign sine to even indices and cosine to odd indices:


    • 0::2: Selects even indices.

    • 1::2: Selects odd indices.

    • pos * div_term: Combines position and divisor to calculate the sine and cosine values.


  • Register the positional encoding as a buffer:


    • unsqueeze(0): Adds a batch dimension to make the tensor shape (1, max_len, embed_dim).

    • register_buffer: Stores the positional encoding as part of the model but excludes it from training (i.e., it's not a trainable parameter).


2. Forward Pass


Input:


  • inputs: A tensor of input embeddings with shape (batch_size, seq_len, embed_dim).


Steps:


  • Add positional encoding to the input embeddings:


    • self.pos_enc[:, :inputs.size(1)]: Slices the positional encodings to match the sequence length (seq_len) of the input.

    • inputs + self.pos_enc[:, :inputs.size(1)]: Adds the sliced positional encodings to the input embeddings element-wise.

Output:


  • The output is a tensor of shape (batch_size, seq_len, embed_dim) with positional encodings added to the input embeddings.


Positional Encoding Input / Output Tensor Dimensions:


The dimensions of the input and output tensors for the positional encoding example are as follows:


Input Tensor Dimensions


The input tensor provided to the PosEnc class has dimensions:


  • Batch size (N): Number of sequences in a batch. In this example, it is 2.


  • Sequence length (T): Length of each sequence. In this example, it is 4.


  • Embedding dimension (D): Size of the embedding vector for each token in the sequence. In this example, it is 6.


Thus, the input tensor shape is: (N, T, D) = (2, 4, 6).


Output Tensor Dimensions


The output tensor produced by the PosEnc class has the same dimensions as the input tensor, as the positional encoding is added element-wise to the input tensor. Therefore, the output tensor shape is also: (N, T, D) = (2, 4, 6).


Summary


  • Input tensor dimensions: (2, 4, 6)


  • Output tensor dimensions: (2, 4, 6)


The positional encoding layer does not change the shape of the tensor; it only adds positional information to the existing embeddings.



Positional Encoding Sample Input and Output :


Input Parameters:


  • Sequence length (S) = 4

  • Batch size (B) = 2

  • Embedding dimension (d) = 6


Steps:


  • Positional Encoding Calculation: We compute the positional encodings for a sequence of length 4. The embedding dimension is 6, so we generate 6 positional encoding values for each position.


  • Shape of Positional Encoding: The encoding itself has shape (S, d); broadcast across a batch of 2, the encoded output has shape (B, S, d), i.e., (2, 4, 6). A minimal snippet to reproduce this setup is shown below.
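
A minimal way to reproduce this setup, assuming the PosEnc sketch shown earlier (the input values are random, so the exact numbers will differ from the sample below):

import torch

embed_dim, max_len = 6, 10
pos_enc_layer = PosEnc(embed_dim, max_len)   # PosEnc as sketched above

# A batch of 2 sequences of length 4 with 6-dimensional embeddings
sample_input = torch.randn(2, 4, 6)          # e.g., the output of InputEmbedding
sample_output = pos_enc_layer(sample_input)
print(sample_output.shape)                   # torch.Size([2, 4, 6])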


Sample Input


Sample Input Tensor (Shape: (2, 4, 6)), taken from the input embedding layer:


tensor([[[-0.8513, -1.1051,  1.0712, -0.6826,  0.5432,  0.1897],

         [ 0.4081,  0.7527,  1.2853, -0.0842,  0.3564, -1.6070],

         [ 0.4850,  0.8720,  0.8326, -1.6016,  0.0184, -0.2214],

         [-0.0454, -0.3287, -0.3017,  0.3715, -0.7881, -1.0301]],


        [[ 0.3276, -1.5497,  0.1396,  0.3583,  1.4391,  1.6798],

         [-0.0584,  0.1285, -1.5946, -1.1017, -0.3425, -0.5582],

         [-0.2654,  0.7975,  0.5505, -1.4156, -0.8361,  0.1532],

         [ 0.0812, -0.1763, -0.1225,  0.4071,  0.9467, -0.6651]]])


Output with Positional Encoding


To incorporate the positional encoding with the input tensor, we simply add the positional encoding to the input tensor. This results in the final representation, which contains both the original token embeddings and the position-related information.

The output tensor has the same shape as the input tensor, (2, 4, 6); each token vector now carries both its original embedding values and the positional encoding for its position in the sequence.


tensor([[[-0.8513, -0.1051,  1.0712,  0.3174,  0.5432,  1.1897],

         [ 0.4081,  1.7527,  1.2853,  0.9158,  0.3564, -0.6070],

         [ 0.4850,  1.8720,  0.8326, -0.6016,  0.0184,  0.7786],

         [-0.0454,  0.6713, -0.3017,  1.3715, -0.7881, -0.0301]],


        [[ 0.3276, -0.5497,  0.1396,  1.3583,  1.4391,  2.6798],

         [-0.0584,  1.1285, -1.5946, -0.1017, -0.3425,  0.4418],

         [-0.2654,  1.7975,  0.5505, -0.4156, -0.8361,  1.1532],

         [ 0.0812,  0.8237, -0.1225,  1.4071,  0.9467,  0.3349]]])


In the output, each element of the input tensor has been modified by adding the corresponding positional encoding values, demonstrating how positional information is incorporated into the input data.


This final output tensor can now be passed on to the next layers of the transformer model. These layers will use both the token's actual information and its position in the sequence to understand the relationships between tokens, making it great for tasks like translation or text classification.


In my upcoming blog, I'll dive deeper into the next exciting step: multi-head attention, which takes this process even further! Stay tuned to learn how it helps the model focus on different parts of the input sequence at once.



 
 
 
