This is equivalent to the reverse of a convolution, hence the term transpose. I've already briefly talked about the transpose convolution operation in the Convolution Operation section of my GitBook. Here I want to dive deeper into various upsampling techniques, of which transpose convolution is just one.
Downsample via Convolution
Downsampling is what convolution normally does. Given an input tensor and a filter/kernel tensor, e.g. input=(5, 5, 3) and kernel=(3, 3, 3), using stride=1, the output is a (3, 3, 1) tensor. Every filter matches the input in channel size (depth), so the result of applying one filter is always a tensor with depth=1. We can compute the height and width of the output tensor with a simple formula.
W: width, H: height, P: padding, S: stride

$$W_{output} = 1 + \frac{W_{input} - W_{kernel} + 2P}{S}$$

$$H_{output} = 1 + \frac{H_{input} - H_{kernel} + 2P}{S}$$
Therefore, if we substitute in the values from our example:
$$W_{output} = H_{output} = 1 + \frac{5 - 3 + 2 \cdot 0}{1} = 3$$
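As a quick sanity check, here is a minimal NumPy sketch of this shape arithmetic (the helper names `conv_output_size` and `conv2d_single_filter` are my own, not from any library), assuming a valid convolution with a single filter:

```python
import numpy as np

def conv_output_size(w_in, w_kernel, padding=0, stride=1):
    # W_output = 1 + (W_input - W_kernel + 2P) / S
    return 1 + (w_in - w_kernel + 2 * padding) // stride

def conv2d_single_filter(x, kernel, stride=1):
    """Naive valid convolution of x (H, W, C) with one filter (kH, kW, C)."""
    h_out = conv_output_size(x.shape[0], kernel.shape[0], 0, stride)
    w_out = conv_output_size(x.shape[1], kernel.shape[1], 0, stride)
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i*stride:i*stride + kernel.shape[0],
                      j*stride:j*stride + kernel.shape[1], :]
            out[i, j] = np.sum(patch * kernel)  # filter depth matches input depth
    return out

x = np.random.randn(5, 5, 3)       # input (5, 5, 3)
kernel = np.random.randn(3, 3, 3)  # kernel (3, 3, 3)
print(conv2d_single_filter(x, kernel).shape)  # (3, 3), i.e. a (3, 3, 1) output per filter
```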
Upsample Techniques
K-Nearest Neighbors
We take every element of an input tensor and duplicate it by a factor of K. For example, K=4:
$$\text{input} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \mapsto \text{output} = \begin{bmatrix} 1 & 1 & 2 & 2 \\ 1 & 1 & 2 & 2 \\ 3 & 3 & 4 & 4 \\ 3 & 3 & 4 & 4 \end{bmatrix}$$
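A tiny NumPy sketch of this (just illustrative; `np.repeat` does the duplication along each axis):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Duplicate every element into a 2x2 block (K = 4 copies in total).
out = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
print(out)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```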
Bi-Linear Interpolation
We take the elements of an input tensor and place them at the corners of the output. Then we fill in every missing element with a weighted average of its neighbors (interpolation), as sketched below.
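Here is a minimal NumPy sketch of that idea, assuming the corner-aligned convention described above (the helper name `bilinear_upsample` is mine):

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear upsampling that pins the input values to the output corners,
    then fills everything in between with weighted averages."""
    in_h, in_w = x.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Map the output coordinate back to a fractional input coordinate.
            src_i = i * (in_h - 1) / (out_h - 1)
            src_j = j * (in_w - 1) / (out_w - 1)
            i0, j0 = int(src_i), int(src_j)
            i1, j1 = min(i0 + 1, in_h - 1), min(j0 + 1, in_w - 1)
            di, dj = src_i - i0, src_j - j0
            # Weighted average of the four surrounding input values.
            out[i, j] = (x[i0, j0] * (1 - di) * (1 - dj) +
                         x[i0, j1] * (1 - di) * dj +
                         x[i1, j0] * di * (1 - dj) +
                         x[i1, j1] * di * dj)
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(bilinear_upsample(x, 4, 4))
# corners stay 1, 2, 3, 4; everything else is interpolated
```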
Bed of Nails

We copy every element of the input tensor to the output tensor and set everything else to zero. Each input value is placed at the top-left corner of its expanded cell.
$$\text{input} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \mapsto \text{output} = \begin{bmatrix} 1 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 \\ 3 & 0 & 4 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
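A small NumPy sketch of this (again just illustrative), using strided assignment to drop each value into the top-left corner of its cell:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Bed of nails: each input value lands in the top-left corner of a 2x2 cell,
# and everything else stays zero.
out = np.zeros((4, 4), dtype=x.dtype)
out[::2, ::2] = x
print(out)
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```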
Max-Unpooling
Max pooling takes the maximum among all values inside a kernel window. Max unpooling performs the opposite, but it requires information from the corresponding max pooling layer to know the original position of each max element.

We keep track of the original positions of the max elements during pooling. Some layers later, we perform unpooling using that positional information and fill the rest with zeros.
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \xrightarrow{\text{max-unpool}} \begin{bmatrix} 0 & 0 & 2 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 4 & 0 \end{bmatrix}$$
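A small NumPy sketch of the unpooling step; the remembered positions in `max_indices` are hypothetical values chosen to reproduce the example above, since in practice they come from the earlier max pooling layer:

```python
import numpy as np

# Values arriving at the unpooling layer.
x = np.array([[1, 2],
              [3, 4]])

# Hypothetical (row, col) positions recorded by the earlier 2x2 max pooling layer,
# one per pooled element.
max_indices = {(0, 0): (1, 1), (0, 1): (0, 2),
               (1, 0): (2, 1), (1, 1): (3, 2)}

# Max unpooling: put each value back at its remembered position, zeros elsewhere.
out = np.zeros((4, 4), dtype=x.dtype)
for (i, j), (r, c) in max_indices.items():
    out[r, c] = x[i, j]
print(out)
# [[0 0 2 0]
#  [0 1 0 0]
#  [0 3 0 0]
#  [0 0 4 0]]
```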
Upsample via Transpose Convolution
Suppose we have an input
$$\text{input} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
We have a kernel that is trainable; backpropagation computes the derivatives of the loss with respect to the kernel. For now, let's assume the kernel is initialized to all 5's for ease of demonstration.
$$\text{kernel} = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix}$$
Assuming zero padding and unit stride, we have W = H = 2 for both the input and the kernel, P = 0, and S = 1.
Now we take an element of the input and multiply it by every element of the kernel to produce a partially filled output: the product is placed in the output grid at an offset given by the input element's position times the stride. We do this for every element of the input.
$$1 \times \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix}$$

$$2 \times \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} = \begin{bmatrix} 10 & 10 \\ 10 & 10 \end{bmatrix}$$

$$3 \times \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} = \begin{bmatrix} 15 & 15 \\ 15 & 15 \end{bmatrix}$$

$$4 \times \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} = \begin{bmatrix} 20 & 20 \\ 20 & 20 \end{bmatrix}$$
Then we sum all of these partial outputs, adding overlapping entries together, to produce the final 3×3 output of the transpose convolution operation.
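A minimal NumPy sketch of the whole operation, assuming zero padding and the stride-offset placement described above (the helper `transpose_conv2d` is my own naming, not a library function):

```python
import numpy as np

def transpose_conv2d(x, kernel, stride=1):
    """Naive 2D transpose convolution: scatter-add a scaled copy of the kernel
    into the output for every input element; overlapping entries are summed."""
    h_in, w_in = x.shape
    k_h, k_w = kernel.shape
    h_out = (h_in - 1) * stride + k_h
    w_out = (w_in - 1) * stride + k_w
    out = np.zeros((h_out, w_out))
    for i in range(h_in):
        for j in range(w_in):
            # Partial output for x[i, j], placed at the stride offset.
            out[i*stride:i*stride + k_h, j*stride:j*stride + k_w] += x[i, j] * kernel
    return out

x = np.array([[1, 2],
              [3, 4]], dtype=float)
kernel = np.full((2, 2), 5.0)
print(transpose_conv2d(x, kernel))
# [[ 5. 15. 10.]
#  [20. 50. 30.]
#  [15. 35. 20.]]
```

Note how the output grows from 2×2 back to 3×3, undoing the shape reduction from the convolution example at the top of this section.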