Attention Mechanisms in Transformers

Attention mechanisms let a model selectively focus on different parts of the input sequence when making predictions. In NLP, for example, this is useful when translating a sentence from one language to another, where some source words matter more than others for predicting a given target word.

In the Transformer architecture, the self-attention mechanism weighs the importance of different parts of the input sequence. Specifically, for each input token, the model computes a set of attention scores that determine how much every token (including the token itself) contributes to that token's output representation.

The self-attention mechanism in the Transformer works by computing a query, key, and value vector for each input token. The queries are compared against the keys to produce attention scores, which are normalized with a softmax function so that they sum to 1. These scores are then used to form a weighted sum of the value vectors, and that weighted sum is the output of the self-attention mechanism for the token.

The mathematical formulation of the self-attention mechanism can be expressed as follows:

$$\text{Attention}(Q,K,V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, and $V$ are the matrices whose rows are the query, key, and value vectors, respectively, and $d_k$ is the dimensionality of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large as the key dimension increases, which would otherwise push the softmax into regions with very small gradients.
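
To make the formula concrete, here is a minimal sketch that applies it directly to small random tensors with plain PyTorch operations (the shapes and values are illustrative choices, not taken from the text above):

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8                      # illustrative sizes: 4 tokens, key dimension 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5            # QK^T / sqrt(d_k), shape seq_len x seq_len
weights = F.softmax(scores, dim=-1)      # each row of attention weights sums to 1
output = weights @ V                     # weighted sum of value vectors, seq_len x d_k

print(weights.sum(dim=-1))               # all ones, confirming the softmax normalization
```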

Here's some code that shows how to implement the self-attention mechanism in PyTorch:

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.out_linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.size()
        q = self.q_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        k = self.k_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        v = self.v_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # B x num_heads x seq_len x head_dim
        scores = torch.matmul(q, k.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32))  # B x num_heads x seq_len x seq_len
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, v).transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)  # B x seq_len x hidden_dim
        out = self.out_linear(out)
        return out
```

In this code, we define a SelfAttention class that takes a hidden dimension and a number of attention heads. We create separate linear layers for the query, key, and value projections, apply them to the input sequence, and reshape the resulting tensors so that each attention head operates on its own head_dim-sized slice of the hidden dimension. We compute the attention scores with a matrix multiplication, scale them by the square root of the head dimension, and normalize them with a softmax. We then compute the weighted sum of the value vectors with another matrix multiplication and reshape the result back to the original shape. Finally, we apply an output linear layer to the result of the self-attention computation.
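
As a quick sanity check, the module can be run on a random batch; the batch size, sequence length, and dimensions below are arbitrary illustrative values:

```python
attention = SelfAttention(hidden_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)   # batch of 2 sequences, 10 tokens each, hidden_dim = 64
out = attention(x)
print(out.shape)             # torch.Size([2, 10, 64]), the same shape as the input
```

The output has the same shape as the input, which is what allows these layers to be stacked.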

To use this SelfAttention class in a Transformer model, we would typically stack multiple instances of the self-attention mechanism, along with other components such as feedforward layers and layer normalization.
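
The example below relies on a TransformerLayer module that is not defined in this post. A minimal sketch of such a layer, assuming the standard pattern of residual connections and layer normalization around the attention and feedforward sub-blocks (the feedforward width and ReLU activation are illustrative assumptions), could look like this:

```python
class TransformerLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads, ff_dim=None):
        super().__init__()
        ff_dim = ff_dim or 4 * hidden_dim            # common default: 4x the hidden dimension
        self.attention = SelfAttention(hidden_dim, num_heads)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, hidden_dim),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        x = self.norm1(x + self.attention(x))        # residual connection around self-attention
        x = self.norm2(x + self.ff(x))               # residual connection around the feedforward block
        return x
```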

Here's an example of how to use the SelfAttention class in a simple Transformer model:

```python
class Transformer(nn.Module):
    def __init__(self, vocab_size, hidden_dim, num_heads, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.pos_encoding = PositionalEncoding(hidden_dim)
        self.layers = nn.ModuleList([TransformerLayer(hidden_dim, num_heads) for _ in range(num_layers)])
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)       # token IDs -> embeddings: B x seq_len x hidden_dim
        x = self.pos_encoding(x)    # add position information
        for layer in self.layers:   # pass through the stacked Transformer layers
            x = layer(x)
        x = self.fc(x)              # project to vocabulary logits: B x seq_len x vocab_size
        return x
```

In this code, we define a Transformer class that takes in the vocabulary size, hidden dimension, number of attention heads, and number of layers as arguments. We create an embedding layer and a positional encoding layer, and then stack multiple instances of the TransformerLayer class, which contains a self-attention mechanism and feedforward layers. Finally, we apply a linear layer to the output of the final layer to get the predicted logits.
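
The PositionalEncoding module referenced above is also not shown here. One common choice is the fixed sinusoidal encoding from the original Transformer paper; the sketch below assumes an even hidden dimension and a maximum sequence length of 5000:

```python
import math

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_dim, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # max_len x 1
        div_term = torch.exp(torch.arange(0, hidden_dim, 2) * (-math.log(10000.0) / hidden_dim))
        pe = torch.zeros(max_len, hidden_dim)
        pe[:, 0::2] = torch.sin(position * div_term)                      # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                      # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: batch_size x seq_len x hidden_dim
        return x + self.pe[: x.size(1)]
```

With these pieces in place, the model can be exercised end to end on random token IDs (the vocabulary size and other hyperparameters below are arbitrary):

```python
model = Transformer(vocab_size=10000, hidden_dim=64, num_heads=8, num_layers=2)
tokens = torch.randint(0, 10000, (2, 10))   # batch of 2 sequences of 10 token IDs
logits = model(tokens)
print(logits.shape)                          # torch.Size([2, 10, 10000])
```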

Overall, the self-attention mechanism is a key component of the Transformer architecture that allows it to selectively focus on different parts of the input sequence. It has become a popular choice for NLP tasks due to its strong performance and ability to handle variable-length inputs.

