# PyTorch Tutorial - The basics

### EE-556: Mathematics of Data: From Theory to Computation. Instructor: Prof. Volkan Cevher

This tutorial introduces the basics of using the PyTorch library.

## What is PyTorch?
PyTorch is a deep learning framework that is widely used in research and industry. It was originally developed by Facebook's AI Research lab (FAIR), and is known for its flexibility and ease of use. Its dynamic computation graph and automatic differentiation capabilities make it especially well-suited for research in machine learning and deep learning.

## Installation
PyTorch is straightforward to install with a package manager such as conda (https://conda.io) or pip. The exact installation command depends on your OS, package manager, Python version, and whether you have a GPU with CUDA support. The PyTorch homepage (https://pytorch.org) provides the correct command for your setup. For example, with pip you can simply run the following command to install PyTorch (the leading `!` runs it as a shell command from a notebook cell; omit it in a terminal):

In [22]:
!pip install torch torchvision matplotlib


## Tensor creation and manipulation
The basic data structure in PyTorch is the tensor. A tensor is a multi-dimensional array that can be used to represent scalars, vectors, matrices, and higher-dimensional arrays. Tensors are similar to NumPy arrays, but they have additional features that make them suitable for deep learning tasks. For example, tensors can be moved to a GPU to accelerate computations, and they can record the operations applied to them for automatic differentiation.

In [2]:
import torch

x = torch.randn(10,3,28,28) # typical image batch shape: NCHW is the PyTorch convention
v = torch.rand(28,28)
y1 = x * v # elementwise multiplication, with v broadcast across N,C
y2 = x @ v # matrix multiplication over the last two dimensions, same as torch.matmul(x,v)
x_flat = x.reshape(10,3,784) # flatten spatial dimensions (done while still in NCHW layout)
x_nhwc = x.permute(0,2,3,1) # NHWC layout, as used by TensorFlow
x.flatten().shape[0] == x.reshape(-1).shape[0] == (10*3*28*28) # flatten everything

True
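
As a small sketch of the broadcasting, GPU, and NumPy points above (the variable names `device`, `x_dev`, `arr`, and `t` are illustrative, and the code falls back to the CPU when no CUDA device is present):

```python
import numpy as np
import torch

# Pick a GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(10, 3, 28, 28)
v = torch.rand(28, 28)

# Broadcasting: v has shape (28, 28) and is aligned with the two trailing
# dimensions of x, so every image and channel is multiplied by the same v
y = x * v
same = torch.equal(y[4, 2], x[4, 2] * v)

# Moving a tensor returns a copy on the target device
x_dev = x.to(device)

# Conversion from and to NumPy (on the CPU the memory is shared, not copied)
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(arr)
back = t.numpy()

print(y.shape, x_dev.device, t.shape, same)
```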

## Autograd
One of the most important features of PyTorch is its automatic differentiation engine, called autograd. Autograd allows you to compute gradients of functions with respect to specified tensors. This is useful for training neural networks using gradient-based optimization algorithms with backpropagation.

In [3]:
x = torch.randn(10,3,28,28)
v = torch.randn(28,28)
x.requires_grad = True # <= enable gradient tracking

# forward pass
A = torch.matmul(x, v) # matrix multiplication 
L = A.sum() # reduce to scalar
# backward pass
L.backward() # compute gradients

assert v.grad is None # no tracking == no gradient
assert x.grad is not None # gradient with respect to x
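
To check what autograd computed, one can compare against the gradient derived by hand. For $L = \sum (x v)$, the derivative with respect to $x_{ij}$ is $\sum_k v_{jk}$, independent of the batch, channel, and row index. The following sketch uses smaller, arbitrarily chosen shapes than the cell above:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4, requires_grad=True)
v = torch.randn(4, 4)  # no gradient tracking for v

# forward and backward pass, as in the example above
L = torch.matmul(x, v).sum()
L.backward()

# Hand-derived gradient: dL/dx[..., i, j] = sum_k v[j, k]
expected = v.sum(dim=1).expand_as(x)

print(torch.allclose(x.grad, expected, atol=1e-5))
```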

## Module API
PyTorch provides a high-level API for building neural networks, the `torch.nn.Module` API. A simple example is provided below.

In [4]:
class Bias(torch.nn.Module): 
    def __init__(self, dim):
        super().__init__()
        # after this is called, can register parameters
        self.bias = torch.nn.Parameter(torch.rand(dim), requires_grad=True) # self.bias will now be found by optimisers etc. and have a gradient
    
    def forward(self, a): 
        return a + self.bias

a = torch.rand(10,5)
mu = Bias(5)
output = mu(a)

Question: what is the size of the output variable?
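
The following sketch illustrates what registering a parameter means in practice. It repeats the `Bias` module so that it runs on its own; `PlainBias` is a hypothetical counter-example, introduced here only for contrast, that stores a plain tensor instead of a `torch.nn.Parameter`.

```python
import torch

class Bias(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.rand(dim))  # registered

    def forward(self, a):
        return a + self.bias

class PlainBias(torch.nn.Module):  # hypothetical, for contrast only
    def __init__(self, dim):
        super().__init__()
        self.bias = torch.rand(dim)  # plain tensor: NOT registered

    def forward(self, a):
        return a + self.bias

mu = Bias(5)
names = [name for name, _ in mu.named_parameters()]
n_params = sum(p.numel() for p in mu.parameters())
n_plain = len(list(PlainBias(5).parameters()))

print(names, n_params, n_plain)
```

Only the registered parameter is visible through `parameters()`, which is what an optimizer receives.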

## Linear layers
The `torch.nn.Linear` module is used to define a linear transformation with learnable weights. The input and output sizes are specified when the module is created. The forward method is used to compute the output of the layer. Example:


In [5]:
import torch.nn as nn

# Create a linear layer with input size 10 and output size 5
linear = nn.Linear(10, 5, bias=True)

# Create a random input tensor of size 10
x = torch.randn(10)

# Compute the output of the linear layer
output = linear(x)


Mathematically, the above is equivalent to the following:
$$
\text{output} = W x + b
$$
where $W$ is a matrix of size $5 \times 10$ and $b$ is a vector of size $5$.
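
The equivalence can be checked numerically with the layer's `weight` and `bias` attributes. This is a minimal sketch; the batch size of 32 in the second part is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(10, 5, bias=True)
x = torch.randn(10)

# W has shape (5, 10) and b has shape (5,)
manual = linear.weight @ x + linear.bias
print(linear.weight.shape, linear.bias.shape)
print(torch.allclose(linear(x), manual, atol=1e-6))

# The same layer also accepts a batch of inputs: (32, 10) -> (32, 5)
batch = torch.randn(32, 10)
out_batch = linear(batch)
print(out_batch.shape)
```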

A multi-layer perceptron (MLP) stacks two or more linear layers, interleaved with activation functions.

## Activation functions
Activation functions are non-linear functions that are applied element-wise and give the neural network its expressivity. Historically, the sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ was common, but due to optimization issues such as vanishing gradients, the rectified linear unit (ReLU) $f(x) = \max(0, x)$ is the most common choice nowadays. In PyTorch, these activation functions are available as `torch.nn.Sigmoid` and `torch.nn.ReLU`, respectively.
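
A short sketch of both activations applied to a small, hand-picked tensor, with the sigmoid also computed directly from its formula for comparison:

```python
import torch
import torch.nn as nn

z = torch.tensor([-2.0, 0.0, 3.0])

relu_out = nn.ReLU()(z)        # negative entries are set to zero
sigmoid_out = nn.Sigmoid()(z)  # every entry is squashed into (0, 1)

# The sigmoid written out from its definition
manual_sigmoid = 1 / (1 + torch.exp(-z))

print(relu_out)
print(sigmoid_out)
```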

## MLP
We can now define a simple multi-layer perceptron (MLP) using the `torch.nn.Module` API. The following code defines a simple MLP with one hidden layer and a ReLU activation function.

In [6]:
class MLP(torch.nn.Module):
    def __init__(self, dim, hidden_dim=1, out_dim=1):
        super().__init__()
        # after this is called, can register parameters
        self.layer1 = torch.nn.Linear(dim, hidden_dim)
        self.layer2 = torch.nn.Linear(hidden_dim, out_dim)
        self.activation = torch.nn.ReLU()

    def forward(self, a):
        hidden = self.activation(self.layer1(a)) 
        return self.layer2(hidden)

a = torch.rand(10,5)
mu = MLP(5)
output = mu(a)
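
For a plain stack of layers like this one, `torch.nn.Sequential` offers a more compact alternative to writing a custom module. The sketch below is not the class defined above but an equivalent architecture; the sizes `hidden_dim=16` and `out_dim=3` are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

dim, hidden_dim, out_dim = 5, 16, 3  # illustrative sizes

mlp = nn.Sequential(
    nn.Linear(dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, out_dim),
)

a = torch.rand(10, dim)
output = mlp(a)

# weights and biases of both linear layers: (5*16 + 16) + (16*3 + 3)
n_params = sum(p.numel() for p in mlp.parameters())
print(output.shape, n_params)
```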

## Convolutional layers
Convolutional layers are used in computer vision tasks to extract features from images. The `torch.nn.Conv2d` module is used to define a convolutional layer. The input and output channels, kernel size, and stride are specified when the module is created. The forward method is used to compute the output of the layer. Example:

In [7]:
import torch
import torch.nn as nn

# Define a convolutional layer
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Create a random input tensor
x = torch.randn(1, 3, 32, 32)

# Compute the output of the layer
output = conv(x)
print(output.size())

torch.Size([1, 16, 32, 32])


The parameters of the `nn.Conv2d` module are:
- `in_channels`: the number of input channels (e.g., 3 for RGB images)
- `out_channels`: the number of output channels (i.e., the number of filters)
- `kernel_size`: the size of the convolutional kernel
- `stride`: the stride of the convolution, i.e., the number of pixels by which the kernel is shifted
- `padding`: the number of pixels to add to the input image before applying the convolution
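
These parameters determine the spatial size of the output. With the default dilation of 1, each spatial dimension becomes $\lfloor (H + 2p - k)/s \rfloor + 1$ for input size $H$, padding $p$, kernel size $k$, and stride $s$. The helper `conv_out_size` below is a hypothetical name used only for this sketch; it is not part of PyTorch.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension (assumes dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

x = torch.randn(1, 3, 32, 32)

# (kernel_size, stride, padding) combinations to compare against nn.Conv2d
for k, s, p in [(3, 1, 1), (3, 2, 1), (5, 1, 0)]:
    conv = nn.Conv2d(3, 16, kernel_size=k, stride=s, padding=p)
    h = conv_out_size(32, k, s, p)
    assert conv(x).shape == (1, 16, h, h)
    print(f"kernel={k} stride={s} padding={p} -> {h}x{h}")
```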

## Attention layers
(Self-)attention is a sequence-to-sequence operation that assigns weights to the input sequence elements based on their relevance to each other. Self-attention operates on input sequence $X$ and projects it into three spaces: query $Q$, key $K$, and value $V$:

$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$
where $W^Q$, $W^K$, and $W^V$ are learnable weight matrices. The attention matrix is computed as:

$$
A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$
where $d_k$ is the dimension of the key space. The output of the attention layer is finally:
$$
Y = A V.
$$

In [8]:
class SoftmaxAttention(nn.Module):
    def __init__(self):
        super(SoftmaxAttention, self).__init__()
        self.wq = nn.Linear(64, 64)
        self.wk = nn.Linear(64, 64)
        self.wv = nn.Linear(64, 64)

    def forward(self, x):
        q = self.wq(x)
        k = self.wk(x)
        v = self.wv(x)

        attention = torch.matmul(q, k.transpose(-2, -1)) / k.size(-1) ** 0.5 # scale by sqrt(d_k)
        attention = torch.nn.functional.softmax(attention, dim=-1)
        out = torch.matmul(attention, v)
        return out

attn = SoftmaxAttention()
x = torch.randn(10, 20, 64)
output = attn(x)
print(output.size())

torch.Size([10, 20, 64])


The above is known as non-causal (or bidirectional) attention, as it allows each element to attend to all other elements. In causal (or autoregressive) attention, the attention matrix is masked to prevent each element from attending to future elements in the input sequence.
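
One common way to implement the causal variant is to set the scores of future positions to $-\infty$ before the softmax, so that they receive zero weight. The following is a minimal sketch on a single sequence without learned projections; the sizes `T` and `d_k` are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
T, d_k = 5, 8  # sequence length and key dimension (illustrative)
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)

# True above the diagonal, i.e. where the key position lies in the future
future = torch.ones(T, T).triu(diagonal=1).bool()
scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # each row still sums to 1
out = weights @ v

print(weights)
```

The first position can only attend to itself, so its output equals the first row of `v`.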

## Loss functions
A loss function expresses the task that your model is intended to perform on the data.

- For regression, a sensible default is the mean squared error $MSE(a,b) = \frac{1}{N}\sum_{i=1}^N(a_i - b_i)^2$
- For classification, a sensible default is cross-entropy $H(a,b) = -\sum_{i=1}^C a_i \log(b_i)$, where $a$ and $b$ are the true and predicted class probabilities, respectively, and $C$ is the number of classes
- In PyTorch, these are available as `torch.nn.MSELoss` and `torch.nn.CrossEntropyLoss`, respectively. Note that `torch.nn.CrossEntropyLoss` expects unnormalized scores (logits) as input and applies the softmax internally

In [9]:
import torch
import torch.nn as nn

# Define the loss function
loss_fn = nn.CrossEntropyLoss()

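# Define the logits and the target (named `logits` to avoid shadowing the built-in `input`)
logits = torch.randn(3, 5) # (batch_size, num_classes), unnormalized scores
target = torch.empty(3, dtype=torch.long).random_(5) # (batch_size), integer class labels

# Compute the loss
output = loss_fn(logits, target)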

# Print the loss
print(output)


tensor(0.4129)
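
To connect the two losses with their formulas, the sketch below recomputes them by hand: the cross-entropy as the negative log-softmax of the logits at the target class, averaged over the batch, and the MSE as the mean of the squared differences. The inputs are random or hand-picked values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Cross-entropy: logits of shape (batch_size, num_classes), integer targets
logits = torch.randn(3, 5)
target = torch.tensor([1, 0, 4])
ce_builtin = nn.CrossEntropyLoss()(logits, target)
log_probs = F.log_softmax(logits, dim=1)
ce_manual = -log_probs[torch.arange(3), target].mean()

# Mean squared error
a, b = torch.randn(4, 2), torch.randn(4, 2)
mse_builtin = nn.MSELoss()(a, b)
mse_manual = ((a - b) ** 2).mean()

print(torch.allclose(ce_builtin, ce_manual), torch.allclose(mse_builtin, mse_manual))
```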


## Optimizer
The optimizer updates the parameters of the model during training, using the gradients computed by autograd. The most common optimizer is stochastic gradient descent (SGD), but there are many others available in PyTorch, such as Adam, RMSprop, and Adagrad.

In [10]:
class NestedModel(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.rand(dim), requires_grad=True)
        self.mod1 = Bias(dim)
        self.mod2 = Bias(dim)
        
    def forward(self, a):
        out = self.mod1(a + self.bias) 
        return self.mod2(out)

a, b = torch.rand(10,5), torch.rand(10,5)
model = NestedModel(5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001) # recursively tracks module parameters
output = model(a)
loss = torch.mean((output - b)**2)
optimizer.zero_grad() # clear old gradients
loss.backward() # calculate gradients
optimizer.step() # one optimization step
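
For plain SGD without momentum or weight decay, one call to `step()` amounts to the update $w \leftarrow w - \text{lr} \cdot \nabla_w L$. The sketch below checks this on a single parameter vector; the loss $\sum_i w_i^2$ and the learning rate are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
lr = 0.1
w = torch.nn.Parameter(torch.randn(3))
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()
optimizer.zero_grad()  # clear old gradients
loss.backward()        # here the gradient is 2 * w

# The update written out by hand, computed before the optimizer changes w
expected = (w - lr * w.grad).detach().clone()

optimizer.step()       # in-place update of w
print(torch.allclose(w, expected))
```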

## Putting things together: training loop for MLP classification
The following code shows a simple training loop for a multi-layer perceptron (MLP) that is trained on the MNIST dataset. The model consists of two linear layers with a ReLU activation function in between. The loss function is cross-entropy, and the optimizer is stochastic gradient descent (SGD).

In [21]:
from torch.utils.data import DataLoader 
from torchvision.datasets import MNIST 
from torchvision import transforms
import numpy as np

# MLP defined above; hidden_dim=128 is an arbitrary choice, out_dim=10 matches the ten digit classes
network = MLP(dim=784, hidden_dim=128, out_dim=10)
optimizer = torch.optim.SGD(network.parameters(),lr=1e-3)
cross_entropy = torch.nn.CrossEntropyLoss()
# build the dataset and the loader once, rather than once per epoch
train_loader = DataLoader(MNIST(root=".", transform=transforms.ToTensor(), download=True), batch_size=32)
EPOCHS = 5
for epoch in range(EPOCHS):
    loss_avg = []
    for a, b in train_loader:
        output = network(a.view(-1, 784)) # flatten each 28x28 image into a vector of size 784
        loss = cross_entropy(output, b) # output holds logits, b holds integer class labels
        optimizer.zero_grad() # clear old gradients
        loss.backward() # calculate gradients
        optimizer.step() # one optimization step
        loss_avg.append(loss.item())
    print(f"Epoch {epoch} Loss: {np.mean(loss_avg)}")
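
The training loop reports only the average loss. For classification one usually also wants the accuracy, which can be computed without gradient tracking. The sketch below is one possible way to do this; the helper `accuracy` is a hypothetical name, and random stand-in data with MNIST-like shapes replaces the real dataset so that the example runs without a download.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()  # no gradients are needed for evaluation
def accuracy(model, loader):
    """Fraction of samples whose highest-scoring class equals the label."""
    model.eval()
    correct, total = 0, 0
    for a, b in loader:
        logits = model(a.view(a.size(0), -1))  # flatten each sample
        correct += (logits.argmax(dim=1) == b).sum().item()
        total += b.numel()
    return correct / total

# Stand-in data: 64 random "images" with random labels (not real MNIST)
torch.manual_seed(0)
images = torch.rand(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32)

# Untrained model with the same layout as the MLP in this tutorial
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
acc = accuracy(model, loader)
print(f"Accuracy: {acc:.3f}")
```

On random data and an untrained model the accuracy is near chance level; with a trained network and a held-out test set, the same function gives the test accuracy.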
