A convolutional neural network (CNN, or ConvNet) is an artificial neural network that detects patterns in images using filters. CNNs are mainly used to analyze images for computer vision tasks.
The inputs to convolutional layers are called input channels, and the outputs are called output channels.
The transformation is called a convolution operation in the deep learning community; mathematically, it is cross-correlation. A convolution operation maps an input to an output by sliding a filter over the input.
We need to specify the number of filters each layer should have. The number of filters determines the number of output channels.
These filters are what detect the patterns. In the first layers, patterns are simple: edges, curves, colors, textures. The deeper the network goes, the more sophisticated the filters become, eventually detecting shapes and whole objects.
The convolution of a filter and each subset of the same size in the input channel is calculated as the summation of the element-wise products.
For example, for a 3x3 filter:
```
filter = | a11 a12 a13 |    subset = | b11 b12 b13 |
         | a21 a22 a23 |             | b21 b22 b23 |
         | a31 a32 a33 |             | b31 b32 b33 |

product = a11*b11 + a12*b12 + ... + a33*b33
```
Because the filter is 3x3, the resulting output channel will be smaller by a margin of 1 pixel on all sides.
Sometimes this is called a dot product, since it is an inner product (a generalization of the dot product). Other names are the "Frobenius inner product" or the "summation of the Hadamard product".
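A minimal NumPy sketch of this computation for one window position (the filter and subset values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A hypothetical 3x3 filter (a simple vertical-edge detector)
filt = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]])

# A 3x3 subset of the input channel at one window position
subset = np.array([[3, 1, 0],
                   [2, 5, 1],
                   [4, 2, 3]])

# Summation of the element-wise (Hadamard) products
value = np.sum(filt * subset)
print(value)  # 5
```

Sliding the filter over every valid window position and writing each such value to the output produces one output channel.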
In the past, computer vision experts would develop filters manually (for example, the Sobel filter). Today, pattern detectors are derived automatically by the network as it learns. The filter values start out with random values, and the values change as the network learns during training.
Convolutions reduce the output channel dimensions: the output is
(n - f + 1) x (m - g + 1), where
n x m are the dimensions of the input and
f x g are the dimensions of the filter.
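The formula can be checked with a tiny helper (the function name is hypothetical; it assumes stride 1 and no padding):

```python
def conv_output_shape(n, m, f, g):
    """Spatial output dimensions of a convolution with
    no padding and stride 1: (n - f + 1, m - g + 1)."""
    return (n - f + 1, m - g + 1)

# A 20x20 input convolved with a 3x3 filter shrinks to 18x18
print(conv_output_shape(20, 20, 3, 3))  # (18, 18)
```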
This is a problem when there is meaningful information around the edges of the image. The solution is to use zero padding, a technique where we add a border of pixels with value zero around the edges of the input image. This allows us to preserve the original input size as convolutions are applied.
| Padding | Effect |
| --- | --- |
| No padding | Dimensions reduce |
| Zeros around the edges | Dimensions stay the same |
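Zero padding itself is easy to sketch with NumPy's `np.pad` (the example values are arbitrary):

```python
import numpy as np

# A 4x4 "image"
image = np.arange(16, dtype=float).reshape(4, 4)

# Add a one-pixel border of zeros on all sides -> 6x6
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (6, 6)

# A 3x3 filter over the padded 6x6 input yields (6 - 3 + 1) = 4,
# so the output is 4x4 -- the same size as the original input.
```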
In Keras we specify the filter size with `kernel_size` and the padding with `padding`:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D

model = Sequential([
    Dense(16, input_shape=(20, 20, 3), activation='relu'),
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='valid'),
    Dense(2, activation='softmax'),
])
```
Output goes from (20, 20) to (18, 18):

```
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_2 (Dense)              (None, 20, 20, 16)        64
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 18, 18, 32)        4640
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 16386
```
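As a sanity check, the parameter counts for the first two layers of a summary like this can be reproduced by hand (a sketch, with layer shapes taken from the model definition):

```python
# Dense: (input_features + 1 bias) * units. The first Dense layer acts
# on the last axis of the (20, 20, 3) input, i.e. 3 features.
dense_params = (3 + 1) * 16
print(dense_params)  # 64

# Conv2D: (filter_h * filter_w * input_channels + 1 bias) * filters
conv_params = (3 * 3 * 16 + 1) * 32
print(conv_params)  # 4640
```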
Max pooling is added after a convolutional layer, and it reduces the number of pixels in the output of that convolutional layer. It's used to reduce computational load (smaller feature maps mean fewer computations and fewer parameters in later layers) and to reduce overfitting.
We define the size n x m of the "pool", and a stride: how many pixels we move to find the next region. Every pixel of the output is the maximum of the values in the corresponding region of the input. With a 2x2 pool and a stride of 2, the output is reduced to half the size of the input in each dimension.
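A minimal NumPy sketch of 2x2 max pooling with stride 2 (the helper name is hypothetical, and it assumes even input dimensions):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a 2-D array.
    Assumes both dimensions of x are even."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take
    # the maximum within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]])
print(max_pool_2x2(x))
# [[6 8]
#  [9 6]]
```

The 4x4 input becomes a 2x2 output, each value being the maximum of its 2x2 region.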
Average pooling takes the average value from each region rather than the maximum. Today, max pooling is used far more often.
In Keras, there is a specific layer for max pooling, where we can specify both the pool size and the stride. It usually has no padding:

```python
MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid')
```
There are also