The Multi-Headed Attention Mechanism in the Transformer Model
Introduction
The Transformer model, introduced by Vaswani et al. in their seminal 2017 paper “Attention is All You Need,” has revolutionized the field of natural language processing (NLP). Central to the Transformer’s success is the attention mechanism, particularly the multi-headed attention mechanism, which enables the model to capture diverse aspects of the input data. This paper delves into the workings of the multi-headed attention mechanism, its advantages, and its impact on the performance of the Transformer model.
Background
Traditional sequence models, such as recurrent neural networks (RNNs) and their variants like long short-term memory (LSTM) networks, face challenges in capturing long-range dependencies due to their sequential nature. The attention mechanism addresses this limitation by allowing the model to focus on different parts of the input sequence simultaneously, regardless of their distance from the current position. The multi-headed attention mechanism extends this idea by enabling the model to attend to multiple aspects of the input simultaneously.
The Attention Mechanism
At its core, the attention mechanism computes a weighted sum of values (V), where the weights are determined by the similarity between a query (Q) and keys (K). The attention output is calculated as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of the key vectors. The softmax function ensures that the weights sum to one, highlighting the most relevant parts of the input.
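As a quick numerical illustration of the softmax normalization (the scores here are chosen purely for illustration), three raw scores of 2.0, 1.0, and 0.1 become weights of roughly 0.66, 0.24, and 0.10, which sum to one:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1])  # illustrative raw similarity scores
weights = F.softmax(scores, dim=-1)     # approximately tensor([0.6590, 0.2424, 0.0986])
print(weights.sum())                    # sums to 1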
Steps in Single-Head Attention
1. Scaled Dot-Product Calculation: The similarity between each query and key is computed as a dot product, and the result is scaled by the square root of the dimension of the keys to prevent the dot products from growing too large.
2. Softmax Normalization: The resulting scores are normalized using the softmax function to obtain the attention weights.
3. Weighted Sum: These attention weights are then used to compute a weighted sum of the values, producing the output of the attention mechanism.
Example in Code
A simplified implementation of single-head attention in PyTorch might look like this:
import torch
import torch.nn.functional as F
import math

def single_head_attention(Q, K, V):
    d_k = Q.size(-1)
    # Scaled dot-product similarity between queries and keys
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to one
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    output = torch.matmul(attn_weights, V)
    return output

# Example usage
Q = torch.rand(1, 10, 64)  # (batch_size, seq_length, embedding_dim)
K = torch.rand(1, 10, 64)
V = torch.rand(1, 10, 64)
output = single_head_attention(Q, K, V)
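The scaling by the square root of d_k in step 1 can be seen directly: for random vectors, the spread of the raw dot products grows with d_k, so without scaling the softmax tends to saturate toward a near one-hot distribution. A minimal sketch, using illustrative values only:

import torch
import torch.nn.functional as F
import math

d_k = 512
q = torch.randn(d_k)
k = torch.randn(5, d_k)                # five random keys

raw = k @ q                            # raw dot products; spread grows with d_k
scaled = raw / math.sqrt(d_k)          # scaled dot products; spread stays near 1

print(F.softmax(raw, dim=-1))          # typically close to one-hot (saturated)
print(F.softmax(scaled, dim=-1))       # noticeably smoother distribution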
Multi-Headed Attention
The multi-headed attention mechanism enhances the standard attention mechanism by employing multiple attention heads. Each head performs its own attention operation, allowing the model to capture different relationships and features from the input data. The outputs of all heads are then concatenated and linearly transformed to produce the final output. Mathematically, this can be expressed as:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where each attention head head_i is computed as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

and W_i^Q, W_i^K, W_i^V, and W^O are learned projection matrices.[1]
Example in Code
A simplified implementation of multi-head attention in PyTorch might look like this:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension of each head
        # Learned projection matrices for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Similarity scores scaled by sqrt(d_k)
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 are blocked from receiving attention
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        # (batch, seq, d_model) -> (batch, num_heads, seq, d_k)
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # (batch, num_heads, seq, d_k) -> (batch, seq, d_model)
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        # Project inputs and split them into heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        # Each head attends independently; results are recombined and projected
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

# Example usage
d_model = 512
num_heads = 8
seq_length = 10
batch_size = 32
mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(batch_size, seq_length, d_model)
output = mha(x, x, x)
print(output.shape)  # Should be (batch_size, seq_length, d_model)
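The mask argument is not exercised in the example above. As a sketch reusing the same mha instance and input x, a causal (look-ahead) mask that blocks attention to future positions could be passed like this, relying on the masked_fill call in scaled_dot_product_attention treating zeros as blocked positions:

# Lower-triangular causal mask; shape (1, 1, seq_length, seq_length) broadcasts over batch and heads
causal_mask = torch.tril(torch.ones(seq_length, seq_length)).unsqueeze(0).unsqueeze(0)
masked_output = mha(x, x, x, mask=causal_mask)
print(masked_output.shape)  # (batch_size, seq_length, d_model)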
Detailed Steps in Multi-Headed Attention
1. Linear Projections: The input queries, keys, and values are linearly projected into lower-dimensional spaces using learned projection matrices; each projection creates a different representation of the inputs for each attention head.
2. Independent Attention Heads: Each attention head independently performs the dot-product attention calculation, producing different sets of attention weights and outputs.
3. Concatenation: The outputs from all attention heads are concatenated to form a single tensor.
4. Final Linear Projection: The concatenated outputs are projected back to the original input dimension using a learned matrix W^O, ensuring the final output can be used in subsequent layers of the Transformer.
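To make steps 3 and 4 concrete with the configuration from the example above (d_model = 512, num_heads = 8, so d_k = 64), the eight head outputs of shape (batch, seq, 64) are concatenated back into (batch, seq, 512) and passed through W^O. The combine_heads method above does this implicitly via transpose and view; an explicit sketch using torch.cat, with shapes chosen to match the earlier example, would be:

import torch
import torch.nn as nn

# Illustrative shapes only: batch_size=32, num_heads=8, seq_length=10, d_k=64
head_outputs = [torch.randn(32, 10, 64) for _ in range(8)]  # one output per attention head
concatenated = torch.cat(head_outputs, dim=-1)              # (32, 10, 512)
W_o = nn.Linear(512, 512)                                   # final learned projection W^O
final_output = W_o(concatenated)                            # (32, 10, 512)
print(final_output.shape)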
Benefits of Multi-Headed Attention
1. Parallel Processing: Because each head operates on a reduced-dimension projection, all heads can be computed in parallel, and attending to the input in several ways at once reduces the risk that information overlooked by any single head is lost.
2. Learning Diverse Representations: Each head can learn to focus on different aspects of the input, capturing a richer set of features and relationships.
3. Improved Performance: Empirically, multi-headed attention has been shown to improve the performance of the Transformer model on various NLP tasks, including machine translation and text generation.[1]
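The second point can be made tangible by inspecting the per-head attention weights. The MultiHeadAttention class above does not return them, but a sketch that reuses its projection layers and split_heads method looks like this (with the untrained, randomly initialized weights from the example, the head-to-head differences are arbitrary; in a trained model they reflect genuinely different learned patterns):

import torch
import torch.nn.functional as F
import math

# Reuses the mha instance and input x from the earlier example usage
with torch.no_grad():
    Qh = mha.split_heads(mha.W_q(x))   # (batch, num_heads, seq, d_k)
    Kh = mha.split_heads(mha.W_k(x))
    scores = torch.matmul(Qh, Kh.transpose(-2, -1)) / math.sqrt(mha.d_k)
    attn = F.softmax(scores, dim=-1)   # per-head attention weights

print(attn.shape)   # (batch_size, num_heads, seq_length, seq_length)
print(attn[0, 0])   # attention pattern of head 0 for the first sequence in the batch
print(attn[0, 1])   # head 1 generally shows a different pattern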