Attention Mechanisms in Transformers
Attention mechanisms are a way for a model to selectively focus on different parts of the input sequence when making predictions. In NLP, for example, this is useful when translating a sentence from one language to another, where the relevance of each source word varies depending on which target word is being produced.
In the Transformer architecture, the self-attention mechanism is used to weigh the importance of different parts of the input sequence. Specifically, for each input token, the model computes a set of attention scores that determine how much each token should contribute to the output.
The self-attention mechanism in the Transformer works by computing a set of query, key, and value vectors for each input token. These vectors are then used to compute a set of attention scores, which are normalized using a softmax function to ensure that they sum to 1. The attention scores are then used to compute a weighted sum of the value vectors, which is the output of the self-attention mechanism.
The mathematical formulation of the self-attention mechanism can be expressed as follows:
$$\text{Attention}(Q,K,V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ are matrices whose rows are the query, key, and value vectors, respectively, and $d_k$ is the dimensionality of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products at a manageable scale before the softmax is applied.
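To make the formula concrete, here is a direct transcription of it as a standalone PyTorch function. The tensor shapes and random inputs are placeholders, and this bare version leaves out the per-head projections that the full module below adds:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., seq_len, d_k) -- a direct transcription of the formula above
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of the value vectors

# Placeholder example: one sequence of 5 tokens with d_k = 8
Q = K = V = torch.randn(1, 5, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 8])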
Here's some code that shows how to implement the self-attention mechanism in PyTorch:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.out_linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.size()
        # Project the input and split it into num_heads heads
        q = self.q_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        k = self.k_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        v = self.v_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        # Scaled dot-product attention scores, normalized with softmax
        scores = torch.matmul(q, k.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32))  # B x num_heads x seq_len x seq_len
        attn = torch.softmax(scores, dim=-1)
        # Weighted sum of the value vectors, then merge the heads back together
        out = torch.matmul(attn, v).transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)  # B x seq_len x hidden_dim
        out = self.out_linear(out)
        return out
In this code, we define a SelfAttention class that takes in a hidden dimension and the number of attention heads. We create three linear layers for the query, key, and value vectors, respectively. We then apply these linear layers to the input sequence and reshape the resulting tensors to have the correct dimensions for the attention calculation. We compute the attention scores using a matrix multiplication and normalize them using a softmax function. We then compute the weighted sum of the value vectors using another matrix multiplication, and reshape the resulting tensor back to the original shape. Finally, we apply another linear layer to the output of the self-attention mechanism.
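As a quick sanity check (the dimensions here are arbitrary placeholders), the module can be applied to a random batch of embeddings and returns a tensor of the same shape:

attn = SelfAttention(hidden_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)  # a batch of 2 sequences, 10 tokens each
out = attn(x)
print(out.shape)            # torch.Size([2, 10, 64]) -- same shape as the input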
To use this SelfAttention class in a Transformer model, we would typically stack multiple instances of the self-attention mechanism, along with other components such as feedforward layers and layer normalization.
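The TransformerLayer class used in the next snippet is not defined in this post; a minimal sketch, assuming the standard arrangement of a self-attention sublayer and a two-layer feedforward sublayer, each wrapped in a residual connection followed by layer normalization, might look like this:

class TransformerLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads, ff_dim=None):
        super().__init__()
        ff_dim = ff_dim or 4 * hidden_dim  # 4x expansion is a common choice, assumed here
        self.attn = SelfAttention(hidden_dim, num_heads)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, hidden_dim),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))  # residual connection around self-attention
        x = self.norm2(x + self.ff(x))    # residual connection around the feedforward block
        return x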
Here's an example of how to use the SelfAttention class in a simple Transformer model:
class Transformer(nn.Module):
    def __init__(self, vocab_size, hidden_dim, num_heads, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.pos_encoding = PositionalEncoding(hidden_dim)
        self.layers = nn.ModuleList([TransformerLayer(hidden_dim, num_heads) for _ in range(num_layers)])
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for layer in self.layers:
            x = layer(x)
        x = self.fc(x)
        return x
In this code, we define a Transformer class that takes in the vocabulary size, hidden dimension, number of attention heads, and number of layers as arguments. We create an embedding layer and a positional encoding layer, and then stack multiple instances of the TransformerLayer class, which contains a self-attention mechanism and feedforward layers. Finally, we apply a linear layer to the output of the final layer to get the predicted logits.
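The PositionalEncoding module is also not shown above; a minimal sketch, assuming the fixed sinusoidal encoding from the original Transformer paper (and an even hidden_dim), could be:

import math

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_dim, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, hidden_dim, 2) * (-math.log(10000.0) / hidden_dim))
        pe = torch.zeros(max_len, hidden_dim)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, hidden_dim)

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim)
        return x + self.pe[:, :x.size(1)]

With these pieces in place, the whole model can be exercised end to end (all of the numbers below are placeholders):

model = Transformer(vocab_size=1000, hidden_dim=64, num_heads=8, num_layers=2)
tokens = torch.randint(0, 1000, (2, 10))  # a batch of 2 sequences of 10 token ids
logits = model(tokens)
print(logits.shape)                       # torch.Size([2, 10, 1000])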
Overall, the self-attention mechanism is a key component of the Transformer architecture that allows it to selectively focus on different parts of the input sequence. It has become a popular choice for NLP tasks due to its strong performance and ability to handle variable-length inputs.