Transformer-based Image Recognition
Transformers were originally developed for natural language processing, but they have also shown promising results on image recognition tasks. One of the key challenges in adapting Transformers for image recognition is how to process 2D input data (i.e., images) using the self-attention mechanism, which was originally designed for 1D sequences.
One approach to handling 2D input data is to flatten the image into a 1D sequence of vectors and apply the self-attention mechanism as usual, as in the sketch below. However, flattening alone discards the 2D spatial structure of the image unless positional information is added back in.
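As a minimal sketch of this flattening approach (the tensor sizes here are arbitrary and purely for illustration):

import torch
import torch.nn as nn

# Treat each pixel of a (batch, channels, height, width) feature map as a token
# and run standard multi-head self-attention over the flattened sequence.
x = torch.randn(2, 64, 16, 16)                    # (batch, channels, height, width)
tokens = x.flatten(2).transpose(1, 2)             # (2, 256, 64): one 64-dim token per pixel
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)             # (2, 256, 64)
out = out.transpose(1, 2).reshape(2, 64, 16, 16)  # restore the 2D layout

Without an explicit positional encoding added to the tokens, the attention layer has no way to tell which pixels are neighbors.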
A more direct approach is to modify the self-attention mechanism to operate on 2D feature maps. One way to do this is to use 1x1 convolutions to compute per-position query, key, and value tensors, and then apply standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, across all height * width positions, so that every position can attend to every other position in the image.
Here's an example of how to implement a Transformer-based image recognition model using PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttention2d(nn.Module):
    def __init__(self, in_channels, num_heads):
        super().__init__()
        assert in_channels % num_heads == 0, "in_channels must be divisible by num_heads"
        self.in_channels = in_channels
        self.num_heads = num_heads
        # 1x1 convolutions act as per-position linear projections for Q, K, V
        self.q_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, bias=False)
        self.k_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, bias=False)
        self.v_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, bias=False)
        self.out_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, bias=False)

    def forward(self, x):
        batch_size, channels, height, width = x.size()
        head_dim = channels // self.num_heads
        # Compute query, key, and value tensors, split into heads
        q = self.q_conv(x).view(batch_size, self.num_heads, head_dim, height * width).transpose(2, 3)  # (B, heads, HW, head_dim)
        k = self.k_conv(x).view(batch_size, self.num_heads, head_dim, height * width)                  # (B, heads, head_dim, HW)
        v = self.v_conv(x).view(batch_size, self.num_heads, head_dim, height * width)                  # (B, heads, head_dim, HW)
        # Scaled dot-product attention scores over all spatial positions
        scores = torch.matmul(q, k) / head_dim ** 0.5  # (B, heads, HW, HW)
        attn = F.softmax(scores, dim=-1)
        # Apply attention to the value tensor and merge the heads back
        out = torch.matmul(attn, v.transpose(2, 3)).transpose(2, 3).contiguous().view(batch_size, channels, height, width)
        out = self.out_conv(out)
        return out
class TransformerBlock2d(nn.Module):
    def __init__(self, in_channels, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(in_channels)
        self.attn = SelfAttention2d(in_channels, num_heads)
        self.norm2 = nn.LayerNorm(in_channels)
        self.fc = nn.Sequential(
            nn.Conv2d(in_channels, 4 * in_channels, kernel_size=1, stride=1, bias=False),
            nn.GELU(),
            nn.Conv2d(4 * in_channels, in_channels, kernel_size=1, stride=1, bias=False)
        )

    @staticmethod
    def _channel_norm(norm, x):
        # nn.LayerNorm normalizes the last dimension, so move channels last,
        # normalize, and restore the channels-first layout expected by the convs
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x):
        # Self-attention sub-layer (pre-norm with residual connection)
        residual = x
        x = self._channel_norm(self.norm1, x)
        x = self.attn(x)
        x = x + residual
        # Feedforward sub-layer (pre-norm with residual connection)
        residual = x
        x = self._channel_norm(self.norm2, x)
        x = self.fc(x)
        x = x + residual
        return x
class TransformerEncoder2d(nn.Module):
    def __init__(self, in_channels, num_heads, num_layers):
        super().__init__()
        self.blocks = nn.ModuleList([
            TransformerBlock2d(in_channels, num_heads) for _ in range(num_layers)
        ])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
class TransformerImageClassifier(nn.Module):
    def __init__(self, in_channels, embed_dim, num_heads, num_layers, num_classes):
        super().__init__()
        # Lift the raw image channels (e.g. 3 for RGB) to an embedding width
        # that is divisible by num_heads before running the encoder
        self.stem = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.encoder = TransformerEncoder2d(embed_dim, num_heads, num_layers)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.encoder(x)
        x = x.mean(dim=(2, 3))  # global average pooling over height and width
        x = self.fc(x)
        return x
In this code, we define three modules: SelfAttention2d, TransformerBlock2d, and TransformerEncoder2d. SelfAttention2d is the modified self-attention mechanism that can process 2D input data. TransformerBlock2d is a block that contains a self-attention layer and a feedforward layer, and TransformerEncoder2d is a stack of TransformerBlock2d layers. Finally, we define a TransformerImageClassifier module that projects the raw image channels to an embedding width divisible by the number of heads, processes the result with the TransformerEncoder2d module, and classifies the image with a fully connected layer. Note that this sketch omits positional embeddings; because 1x1 convolutions and global attention carry no notion of position, practical models such as the Vision Transformer add positional information explicitly.
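A quick way to sanity-check the attention module is to pass a dummy feature map through it and confirm the shape is preserved (the sizes below are arbitrary):

attn = SelfAttention2d(in_channels=64, num_heads=4)
x = torch.randn(2, 64, 16, 16)   # (batch, channels, height, width)
print(attn(x).shape)             # torch.Size([2, 64, 16, 16])

Keep in mind that the attention matrix has (height * width)^2 entries per head, so memory grows quadratically with the number of pixels; small inputs such as CIFAR-10's 32x32 images are practical here, while full-resolution 224x224 images are not.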
To use this model to classify images, we can instantiate a TransformerImageClassifier object and call it on an input image tensor:

model = TransformerImageClassifier(in_channels=3, embed_dim=64, num_heads=4, num_layers=6, num_classes=10)
x = torch.randn(1, 3, 32, 32)  # Input image tensor (CIFAR-10 sized)
y = model(x)  # Output logits
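Here y has shape (1, 10): one unnormalized logit per class for the single image in the batch. Apply a softmax if class probabilities are needed.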
The SelfAttention2d module plays the role that nn.MultiheadAttention plays in a standard Transformer block. Note, however, that the two are not drop-in interchangeable: nn.MultiheadAttention expects a flattened sequence of token embeddings (e.g., shape (batch, seq_len, embed_dim) with batch_first=True), whereas SelfAttention2d consumes and returns (batch, channels, height, width) feature maps directly. This is why TransformerBlock2d needs no flattening or reshaping around its attention layer.
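For comparison, here is a minimal sketch of the sequence-based block that TransformerBlock2d mirrors, built on nn.MultiheadAttention (the class name TransformerBlock1d is ours, purely for illustration):

class TransformerBlock1d(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Self-attention sub-layer (pre-norm with residual connection)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Feedforward sub-layer (pre-norm with residual connection)
        x = x + self.fc(self.norm2(x))
        return x

The structure is identical; only the data layout and the normalization plumbing differ.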
Returning to the classifier: the encoder output still has shape (batch_size, embed_dim, height, width). In TransformerImageClassifier, we apply a mean pooling operation along the spatial dimensions (height and width) to obtain a tensor of shape (batch_size, embed_dim), and then apply a fully connected layer to obtain the logits for each class.
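If it reads more clearly, the same pooling step can be written with PyTorch's built-in adaptive average pooling:

pool = nn.AdaptiveAvgPool2d(1)   # averages over all spatial positions
pooled = pool(x).flatten(1)      # identical to x.mean(dim=(2, 3))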
To train the TransformerImageClassifier model, we can use a standard cross-entropy loss and a stochastic gradient descent (SGD) optimizer:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Then, we can train the model on a dataset of images using PyTorch's DataLoader class:
import torchvision
import torchvision.transforms as transforms

dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

num_epochs = 10      # example values; tune for your setup
log_interval = 100

model.train()
for epoch in range(num_epochs):
    for i, (x, y_true) in enumerate(dataloader):
        optimizer.zero_grad()
        y_pred = model(x)
        loss = criterion(y_pred, y_true)
        loss.backward()
        optimizer.step()
        if i % log_interval == 0:
            print(f"Epoch {epoch}, Iteration {i}: Loss = {loss.item():.4f}")
In this example, we use the CIFAR-10 dataset, which consists of 32x32 RGB images in 10 classes. We create a DataLoader object that loads batches of 32 images at a time and shuffles the order of the images at each epoch.
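In practice, normalizing the inputs usually helps convergence. A common transform pipeline for CIFAR-10 might look like this (the mean/std values below are the commonly cited per-channel CIFAR-10 statistics; treat them as an assumption to verify):

transform = transforms.Compose([
    transforms.ToTensor(),
    # Approximate per-channel mean and std of the CIFAR-10 training set
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
])
dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)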
During training, we loop over the batches in the DataLoader, compute the model's predictions on the input images, and compute the cross-entropy loss between the predicted and true labels. We then backpropagate the loss and update the model's parameters using the SGD optimizer.
After training, we can evaluate the model's performance on a held-out test set using the same cross-entropy loss:
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms.ToTensor())
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

model.eval()  # switch to evaluation mode
total_loss = 0.0
correct = 0
total = 0
with torch.no_grad():
    for x, y_true in test_dataloader:
        y_pred = model(x)
        loss = criterion(y_pred, y_true)
        total_loss += loss.item() * x.size(0)
        _, predicted = y_pred.max(1)
        correct += predicted.eq(y_true).sum().item()
        total += y_true.size(0)

test_loss = total_loss / total
test_acc = correct / total
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")
In this example, we use the same CIFAR-10 dataset, but with the train=False argument to load the test set. We then compute the model's predictions on the test set and compute the cross-entropy loss and accuracy. By comparing the test accuracy to the training accuracy, we can get an estimate of how well the model generalizes to new data.
That's it for using Transformer models for image recognition tasks! Transformer-based models have reached state-of-the-art results on many image classification benchmarks, and as this example shows, adapting the self-attention mechanism to process 2D input data requires relatively few modifications to the original Transformer architecture.