Convolution Operation
Using a 3x3 filter applied to a single-channel 4x4 image
Given a 32x32 image with RGB channels, we can represent it as a tensor of shape (32, 32, 3), which is (height, width, channels). When we perform convolution, we need a filter that has the same channel depth as the image. For example, we can use a 5x5 filter, which is of shape (5, 5, 3), and slide it across the image left to right, top to bottom with a stride of 1 to perform convolution.
The question is: since we are starting from the top-left corner, what if the filter goes out of bounds? We can use zero padding in this case.

Filter size of 3x3 with stride of 1 and no padding

Filter size of 3x3 with stride of 2 and with padding

Thus, we now have four hyperparameters:
F: number of filters

Hf or Wf: spatial extent of the filters, which is 5 in this case

S: stride size

P: amount of padding
Ignoring the example in the pictures above and returning to the original example, let's assume padding=2 and stride=1.
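The output size along each spatial dimension follows the standard formula W_out = (W - Hf + 2P) / S + 1. Below is a minimal sketch of that calculation (the function name is mine, for illustration):

```python
def conv_output_size(W, Hf, S, P):
    """Spatial output size of a convolution along one dimension."""
    return (W - Hf + 2 * P) // S + 1

# Original example: 32x32 RGB image, 5x5 filter, padding=2, stride=1.
print(conv_output_size(W=32, Hf=5, S=1, P=2))  # 32 -- padding=2 preserves the spatial size
```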
From the calculation, we can see that given a tensor of shape (W, H, C), convolution produces a tensor of shape (W_out, H_out, F). The depth of the output tensor depends on the number of filters being applied to the input.
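To make the shape behavior concrete, here is a naive, loop-based forward convolution sketch in NumPy. This is a minimal illustration, not an optimized implementation, and the function and variable names are my own:

```python
import numpy as np

def conv_forward(x, filters, stride=1, pad=0):
    """Naive convolution. x: (H, W, C), filters: (F, Hf, Wf, C)."""
    H, W, C = x.shape
    F, Hf, Wf, _ = filters.shape
    x_padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    H_out = (H - Hf + 2 * pad) // stride + 1
    W_out = (W - Wf + 2 * pad) // stride + 1
    out = np.zeros((H_out, W_out, F))
    for i in range(H_out):
        for j in range(W_out):
            window = x_padded[i*stride:i*stride+Hf, j*stride:j*stride+Wf, :]
            for f in range(F):  # each filter produces one output channel
                out[i, j, f] = np.sum(window * filters[f])
    return out

x = np.random.randn(32, 32, 3)
filters = np.random.randn(10, 5, 5, 3)  # F=10 filters of shape 5x5x3
print(conv_forward(x, filters, stride=1, pad=2).shape)  # (32, 32, 10)
```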
The backward pass of a convolution operation (for both the input and the weights) is also a convolution, but with spatially flipped filters. It is easy to derive using a 1-dimensional example.
Let's say we have x of shape (3, 2, 2), that is, a 2x2 image with 3 channels, and a filter of shape (3, 1, 1), which is a one-pixel filter; just imagine the filter as [weight[0], weight[1], weight[2]]. When we perform convolution using stride=1, we can see that:

out[i][j] = weight[0] * x[0][i][j] + weight[1] * x[1][i][j] + weight[2] * x[2][i][j]
When we calculate the derivative of the loss with respect to each weight:

dL/dweight[k] = sum over all (i, j) of dL/dout[i][j] * x[k][i][j]
Notice how this is flipped? For forward propagation, we iterate through the number of filters for each pair of {i, j}. For backpropagation, we iterate through each pair of {i, j} for every filter.
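Here is a minimal NumPy sketch of this example, with illustrative names, showing the two loop orders side by side:

```python
import numpy as np

x = np.random.randn(3, 2, 2)   # 2x2 image with 3 channels
weight = np.random.randn(3)    # one-pixel filter: [weight[0], weight[1], weight[2]]

# Forward pass: for each (i, j), iterate through the filter weights.
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        for k in range(3):
            out[i, j] += weight[k] * x[k, i, j]

# Upstream gradient dL/dout from the rest of the network (random here).
dout = np.random.randn(2, 2)

# Backward pass: for each weight, iterate through every (i, j) --
# the loop order is flipped relative to the forward pass.
dweight = np.zeros(3)
for k in range(3):
    for i in range(2):
        for j in range(2):
            dweight[k] += dout[i, j] * x[k, i, j]

# Vectorized sanity checks.
assert np.allclose(out, np.einsum('k,kij->ij', weight, x))
assert np.allclose(dweight, np.einsum('ij,kij->k', dout, x))
```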
So far, the convolutions we have seen actually downsample the image, i.e. they create an output that is smaller in spatial dimensions than the input. When I say spatial dimensions, I am referring to the width and height of the input tensor. However, there are times we need to upsample, and there is an operation called transpose convolution that does exactly this.
Given a 4x4 input, using a 3x3 filter with stride=1 and pad=1, we should expect an output of 4x4. Similarly, if we increase the stride to 2, we should expect an output of 2x2. Now, a transpose convolution does the opposite: given an input of 2x2, we produce an output of 4x4 using a 3x3 transpose filter with stride=2 and pad=1. What it does is take one element of the input and multiply the filter by it, as a scalar multiplication on the whole filter. This is also called fractionally strided convolution. Where the resulting regions overlap, we simply sum them.
Given an input array of [a, b] and a filter array of [x, y, z], using stride=2, we can see that the output should be [ax, ay, az + bx, by, bz].
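Here is a quick sketch of this fractionally strided behavior in NumPy, using numeric stand-ins for the symbols so the overlap at az + bx is visible (the function name is mine, for illustration):

```python
import numpy as np

def conv_transpose_1d(inp, filt, stride):
    """Naive 1D transpose convolution: scale the filter by each input
    element, place the copies `stride` apart, and sum the overlaps."""
    out_len = (len(inp) - 1) * stride + len(filt)
    out = np.zeros(out_len)
    for i, v in enumerate(inp):
        out[i * stride : i * stride + len(filt)] += v * filt
    return out

inp = np.array([1.0, 10.0])        # stand-ins for a, b
filt = np.array([2.0, 3.0, 5.0])   # stand-ins for x, y, z
print(conv_transpose_1d(inp, filt, stride=2))
# [ 2.  3. 25. 30. 50.] == [ax, ay, az + bx, by, bz]
```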
Let's extend the example to something a bit more complicated. Given an input array of [a, b, c, d] and a filter of [x, y, z], we can express a convolution as a matrix multiplication. Using pad=1 and stride=1, the padded input is [0, a, b, c, d, 0] and the convolution becomes:

[y z 0 0]   [a]   [ya + zb]
[x y z 0] . [b] = [xa + yb + zc]
[0 x y z]   [c]   [xb + yc + zd]
[0 0 x y]   [d]   [xc + yd]

Notice that we are sliding [x, y, z] one step to the right per row in the matrix on the left-hand side.
Now let's perform a transpose convolution with stride=1. The padding rule in transpose convolution is different: we cannot arbitrarily insert padding; it must correspond to the padding applied to the input prior to the original convolution transformation.
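To see both directions in code, here is a sketch using numeric stand-ins for the symbols above. The matrix M is the convolution matrix from the example, built by hand for clarity; the transpose convolution is then just a multiplication by M.T (an illustrative construction, not any library's API):

```python
import numpy as np

x_, y_, z_ = 2.0, 3.0, 5.0  # stand-ins for filter [x, y, z]

# Convolution of [a, b, c, d] with pad=1, stride=1 as a matrix multiply.
# Each row slides the filter one step to the right; the first and last
# rows are truncated by the zero padding.
M = np.array([
    [y_, z_, 0., 0.],
    [x_, y_, z_, 0.],
    [0., x_, y_, z_],
    [0., 0., x_, y_],
])

inp = np.array([1.0, 10.0, 100.0, 1000.0])  # stand-ins for a, b, c, d
out = M @ inp          # ordinary convolution, length 4

# Transpose convolution: multiply by M transposed, which maps the
# output back to the shape of the pre-convolution input.
restored = M.T @ out
print(out.shape, restored.shape)  # (4,) (4,)
```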
The primary takeaway for transpose convolution is that it RESTORES the spatial dimensions of an input that was previously downsampled.