Conquering ImageNet with a 1MB Model
One of the hottest new models for image recognition is SqueezeNet, which delivers strong performance at an incredibly small size (under 1MB). Because it is so compact, SqueezeNet is computationally efficient and can be deployed directly on smaller devices such as smartphones.
In this blog post, we give an introduction to SqueezeNet and show how to implement its essential fire module in TensorFlow.
Example images from the ImageNet dataset.
The ImageNet Challenge requires creating a model that can accurately classify each image in the dataset into one of 1,000 object categories.
While AlexNet (and the more recent models it inspired) is incredibly powerful and accurate at image recognition, these models are also incredibly large and computationally expensive, which makes them difficult to train and store.
In contrast, the SqueezeNet model matches AlexNet's image recognition accuracy at a fraction of its size: it uses roughly 50x fewer parameters, and when combined with model compression it can be over 500x smaller.
Convolutional Neural Networks
At the heart of image recognition is the convolutional neural network (CNN). Nearly every top image recognition model is a variation of a CNN, including AlexNet and SqueezeNet. A CNN will take in an image (represented as pixels) for the input layer, then pass the image through various convolution layers to produce a prediction.
What differentiates a convolution layer from a regular feed-forward neural network layer is its kernel filters. The kernel filters are much better at extracting latent features from image data, which is why CNNs are so popular in image-related tasks.
An example of a convolution layer. The 3-channel input passes through 2 3x3 kernel filters to produce a 2-channel output.
After convolution layers, CNNs normally use max-pooling layers to reduce the spatial size of the convolution output. This keeps the data from growing too large and reduces the chance of overfitting.
An example of a max-pooling layer.
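The convolution and max-pooling layers described above can be sketched in a few lines with TensorFlow's Keras layers. The layer sizes here are illustrative (matching the 3-channel, 2-filter example above), not SqueezeNet's actual configuration:

```python
import tensorflow as tf

# A batch of one 32x32 RGB (3-channel) image with random pixel values.
inputs = tf.random.normal([1, 32, 32, 3])

# 2 3x3 kernel filters produce a 2-channel output, as in the figure above.
conv = tf.keras.layers.Conv2D(
    filters=2, kernel_size=3, padding="same", activation="relu")
# Max-pooling halves the spatial dimensions of the convolution output.
pool = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)

features = conv(inputs)   # shape: (1, 32, 32, 2)
reduced = pool(features)  # shape: (1, 16, 16, 2)
```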
For more information on convolution layers, max-pooling layers, and CNNs, check out the CNN Lab.
The problem with regular convolution layers is that they use a large number of weight parameters, which use a lot of memory. To mitigate this, the SqueezeNet model combines multiple smaller convolution layers to create a fire module. The fire module uses a squeeze-expand concept to maintain performance while drastically reducing the number of weight parameters used.
The fire module first squeezes the input to a smaller depth (the number of channels in the input data) with a small convolution layer. Then the fire module expands the squeezed input to a larger depth to produce the output. By doing this, it can match the performance of a much larger convolution layer while using fewer parameters.
The squeeze-expand concept (figure from the original SqueezeNet paper).
For example, a regular convolution layer with 100 3x3 kernel filters will use 45,100 weight parameters for a 50-channel input.
Number of weight parameters using the regular convolution layer. The addition term represents the kernel bias parameters.
In contrast, if we first squeeze the input to 10 channels using a convolution layer with 10 1x1 kernel filters, then use the 100 3x3 kernel filters to expand the squeezed input, the entire process only uses 9,610 weight parameters (about 5x fewer).
Number of weight parameters using squeeze-expand. The \(P_S\) and \(P_E\) variables represent the squeeze and expand steps, respectively.
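We can verify these counts with a few lines of Python. Each square-kernel convolution layer uses kernel_size² × in_channels weights per filter, plus one bias per filter:

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weight parameters for a square-kernel convolution layer,
    plus one bias parameter per kernel filter."""
    return kernel_size * kernel_size * in_channels * num_filters + num_filters

# Regular 3x3 convolution: 50 input channels, 100 filters.
regular = conv_params(3, 50, 100)   # 45,100

# Squeeze-expand: 1x1 squeeze to 10 channels, then 3x3 expand to 100.
squeeze = conv_params(1, 50, 10)    # 510
expand = conv_params(3, 10, 100)    # 9,100
print(regular, squeeze + expand)    # 45100 9610
```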
While the fire module sounds complex, it is actually pretty easy to code. The code for a generic fire module function, taken directly from chapter 4 of the SqueezeNet Lab, takes less than 20 lines in TensorFlow.
The custom_conv2d function used in the fire_module function is just a wrapper around TensorFlow's conv2d function, which implements a convolution layer.
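The Lab's exact code is not reproduced here, but a fire module can be sketched with TensorFlow's Keras layers. Following the original paper, the expand step uses parallel 1x1 and 3x3 filters whose outputs are concatenated along the channel axis (the function and argument names below are our own, not the Lab's):

```python
import tensorflow as tf

def fire_module(inputs, squeeze_channels, expand_channels):
    """Sketch of a SqueezeNet fire module: squeeze with 1x1 convolutions,
    then expand with parallel 1x1 and 3x3 convolutions and concatenate."""
    squeezed = tf.keras.layers.Conv2D(
        squeeze_channels, kernel_size=1, activation="relu")(inputs)
    expand_1x1 = tf.keras.layers.Conv2D(
        expand_channels, kernel_size=1, activation="relu")(squeezed)
    expand_3x3 = tf.keras.layers.Conv2D(
        expand_channels, kernel_size=3, padding="same",
        activation="relu")(squeezed)
    # The output depth is 2 * expand_channels.
    return tf.keras.layers.Concatenate(axis=-1)([expand_1x1, expand_3x3])

# Example sizes from the paper's first fire module: a 55x55x96 input is
# squeezed to 16 channels, then expanded to 64 + 64 channels.
x = tf.random.normal([1, 55, 55, 96])
out = fire_module(x, squeeze_channels=16, expand_channels=64)
print(out.shape)  # (1, 55, 55, 128)
```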
The SqueezeNet model starts off with a regular convolution layer followed by a max-pooling layer. We then stack fire modules (interspersed with a max-pooling layer) for the core of the model. At the end of the model we use dropout, a fully-connected layer, and global average-pooling to produce the logits.
An overview of the SqueezeNet architecture.
Training the model for image recognition is equivalent to training the model for multiclass classification. In other words, we minimize the model's softmax cross-entropy loss.
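As a minimal sketch of this loss (the class count and logit values below are made up for illustration; ImageNet has 1,000 classes), TensorFlow computes softmax cross-entropy directly from the model's logits:

```python
import tensorflow as tf

# Hypothetical logits for a batch of 2 images over 4 classes.
logits = tf.constant([[2.0, 0.5, 0.1, -1.0],
                      [0.2, 3.0, 0.0, 0.5]])
labels = tf.constant([0, 1])  # correct class index for each image

# from_logits=True applies the softmax inside the loss computation.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss = loss_fn(labels, logits)  # scalar mean loss over the batch
```

Minimizing this value with a gradient-based optimizer is what trains the model.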
A full and interactive walkthrough of the model architecture implementation is in the SqueezeNet Lab.
As previously mentioned, one of the main benefits of SqueezeNet is that it can be directly used in smaller devices like smartphones. There are numerous uses of image recognition on phone applications, such as facial recognition and product labeling.
With SqueezeNet, these phone applications can store a good image recognition model directly in memory, rather than needing to make an API call to a model hosted in the cloud. This results in much faster processing and works without a Wi-Fi or cellular data connection.