Conquering ImageNet with a 1MB Model

Allen Lu

One of the hottest new models for image recognition is SqueezeNet, which provides great performance at an incredibly small size (< 1MB). Due to its compact size, the SqueezeNet model is computationally efficient and can be used directly on smaller devices such as smartphones.

In this blog post, we give an introduction to SqueezeNet and show how to implement its essential fire module in TensorFlow.

Before SqueezeNet

First, a bit of background. In 2012, Alex Krizhevsky developed AlexNet. It won the 2012 ImageNet Challenge, and has since spurred on even greater advances in image recognition.

Example images from the ImageNet dataset.
The ImageNet Challenge requires creating a model that can accurately identify the images in the dataset.

While AlexNet and the more recent models it inspired are powerful and accurate at image recognition, they are also very large and computationally expensive. This makes training and storing the models quite difficult.

In contrast, the SqueezeNet model matches AlexNet's image recognition accuracy at a fraction of its size (over 500x smaller once model compression is applied).

Convolutional Neural Networks

At the heart of image recognition is the convolutional neural network (CNN). Nearly every top image recognition model is a variation of a CNN, including AlexNet and SqueezeNet. A CNN will take in an image (represented as pixels) for the input layer, then pass the image through various convolution layers to produce a prediction.

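What differentiates a convolution layer from a regular feed-forward neural network layer is its kernel filters. Each kernel filter is a small grid of weights that slides across the image, so the same weights are applied at every position. This makes kernel filters much better at extracting latent features from image data, which is why CNNs are so popular in image-related tasks.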

An example of a convolution layer. The 3-channel input passes through two 3x3 kernel filters to produce a 2-channel output.
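To make the channel arithmetic concrete, here is a minimal pure-Python sketch of a convolution layer. It is only an illustration: the `conv2d` function, the no-padding and stride-1 choices, and the toy kernel values are assumptions made for this example, and a real model would use an optimized library routine instead.

```python
def conv2d(image, kernels, biases):
    """Stride-1 convolution without padding.

    image:   nested list indexed as [row][col][in_channel]
    kernels: nested list indexed as [filter][i][j][in_channel]
    returns: nested list indexed as [row][col][filter]
    """
    height, width = len(image), len(image[0])
    k = len(kernels[0])  # kernel height and width (square kernels assumed)
    output = []
    for row in range(height - k + 1):
        out_row = []
        for col in range(width - k + 1):
            pixel = []
            # Each kernel filter produces one output channel
            for kernel, bias in zip(kernels, biases):
                total = bias
                for i in range(k):
                    for j in range(k):
                        for c, weight in enumerate(kernel[i][j]):
                            total += weight * image[row + i][col + j][c]
                pixel.append(total)
            out_row.append(pixel)
        output.append(out_row)
    return output


# A 4x4 image with 3 channels, every value set to 1.0
image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
# Two 3x3 kernel filters: one sums its window, the other averages it
sum_kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
mean_kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]

result = conv2d(image, [sum_kernel, mean_kernel], biases=[0.0, 0.0])
print(len(result), len(result[0]), len(result[0][0]))  # 2 2 2
```

The output has two channels because there are two kernel filters, regardless of how many channels the input has.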

After convolution layers, CNNs normally use max-pooling layers to reduce the size of the convolution output. This keeps the data from growing too large and reduces the chance of overfitting.

An example of a max-pooling layer.
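As a rough sketch of the idea, the following pure-Python function applies max-pooling to a single channel. The `max_pool2d` name, the 2x2 window, and the stride of 2 are assumptions chosen for the example, not details taken from SqueezeNet.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Max-pooling over one channel, given as a nested list [row][col]."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for row in range(0, height - pool_size + 1, stride):
        pooled_row = []
        for col in range(0, width - pool_size + 1, stride):
            # Keep only the largest value inside each window
            window = [
                feature_map[row + i][col + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 4], [7, 9]]
```

The 4x4 input shrinks to 2x2, which is how pooling keeps the data from growing too large.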

For more information on convolution layers, max-pooling layers, and CNNs, check out the CNN Lab.

Fire Module

The problem with regular convolution layers is that they use a large number of weight parameters, which take up a lot of memory. To mitigate this, the SqueezeNet model combines multiple smaller convolution layers into a fire module. The fire module uses a squeeze-expand concept to maintain performance while drastically reducing the number of weight parameters.

The fire module first uses a small convolution layer to squeeze the input to a smaller depth, where depth means the number of channels in the data. Then the fire module expands the squeezed input to a larger depth to produce the output. By doing this, it can match the performance of a much larger convolution layer while using fewer parameters.

The squeeze-expand concept (figure from the original SqueezeNet paper).

For example, a regular convolution layer with 100 3x3 kernel filters will use 45,100 weight parameters for a 50-channel input.

\begin{align} P_C &= 3 \times 3 \times 100 \times 50 + 100 \\ &= 45{,}100 \end{align}

Number of weight parameters using the regular convolution layer. The added term (+100) accounts for the bias parameters, one per kernel filter.

In contrast, if we first squeeze the input to 10 channels using a convolution layer with ten 1x1 kernel filters, then expand the squeezed input with the 100 3x3 kernel filters, the entire process uses only 9,610 weight parameters (about 4.7x fewer).

\begin{align} P_S &= 1 \times 1 \times 10 \times 50 + 10 = 510 \\ P_E &= 3 \times 3 \times 100 \times 10 + 100 = 9{,}100 \\ P_{SE} &= P_S + P_E = 9{,}610 \end{align}

Number of weight parameters using squeeze-expand. The \(P_S\) and \(P_E\) variables represent the squeeze and expand steps, respectively.
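The two counts can be double-checked with a few lines of Python. The `conv_params` helper below is a hypothetical name introduced for this sketch; it simply encodes the formula used above (kernel height x kernel width x input channels x number of filters, plus one bias per filter).

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases for a convolution layer with square kernels."""
    weights = kernel_size * kernel_size * in_channels * num_filters
    biases = num_filters
    return weights + biases


# Regular layer: 100 filters of size 3x3 applied to a 50-channel input
regular = conv_params(kernel_size=3, in_channels=50, num_filters=100)

# Squeeze-expand: 10 filters of size 1x1, then 100 filters of size 3x3
squeeze = conv_params(kernel_size=1, in_channels=50, num_filters=10)
expand = conv_params(kernel_size=3, in_channels=10, num_filters=100)
squeeze_expand = squeeze + expand

print(regular, squeeze_expand)  # 45100 9610
print(round(regular / squeeze_expand, 2))  # 4.69
```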

While the fire module sounds complex, it is actually pretty easy to code. The code for a generic fire module function, taken directly from chapter 4 of the SqueezeNet Lab, takes fewer than 20 lines in TensorFlow.

The custom_conv2d function used in the fire_module function is just a wrapper around TensorFlow's conv2d function, which implements a convolution layer.
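As a rough illustration of what such a function could look like, here is a minimal sketch. It is not the SqueezeNet Lab's code: the argument names, the ReLU activation, the "same" padding, and the use of `tf.keras.layers.Conv2D` inside `custom_conv2d` are all assumptions made for this example. Following the original SqueezeNet paper, the expand step here uses both 1x1 and 3x3 kernel filters and concatenates their outputs along the channel axis, which differs slightly from the 3x3-only arithmetic above.

```python
import tensorflow as tf


def custom_conv2d(inputs, filters, kernel_size, name):
    """Assumed wrapper: a ReLU convolution that preserves height and width."""
    layer = tf.keras.layers.Conv2D(
        filters=filters,
        kernel_size=kernel_size,
        padding="same",
        activation="relu",
        name=name,
    )
    return layer(inputs)


def fire_module(inputs, squeeze_depth, expand_depth, name):
    """Squeeze to squeeze_depth channels, then expand to 2 * expand_depth."""
    # Squeeze step: 1x1 kernel filters reduce the number of channels
    squeezed = custom_conv2d(inputs, squeeze_depth, 1, name + "_squeeze")
    # Expand step: parallel 1x1 and 3x3 convolutions on the squeezed data
    expand_1x1 = custom_conv2d(squeezed, expand_depth, 1, name + "_expand1x1")
    expand_3x3 = custom_conv2d(squeezed, expand_depth, 3, name + "_expand3x3")
    # Concatenate along the channel axis (the data is in NHWC layout)
    return tf.concat([expand_1x1, expand_3x3], axis=-1)


# A batch with one 32x32 input that has 50 channels
batch = tf.zeros([1, 32, 32, 50])
output = fire_module(batch, squeeze_depth=10, expand_depth=50, name="fire_demo")
print(output.shape)  # (1, 32, 32, 100)
```

The output depth is twice `expand_depth` because the two expand branches are concatenated.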

Building SqueezeNet

The SqueezeNet model starts off with a regular convolution layer followed by a max-pooling layer. We then stack fire modules (interspersed with max-pooling layers) for the core of the model. At the end of the model we use dropout, a final convolution layer that acts as the classifier, and global average-pooling to produce the logits.

An overview of the SqueezeNet architecture.
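To get a feel for how the stacked layers add up, the sketch below counts the weight parameters of the whole model. The layer sizes are assumptions: they follow the configuration reported in the original SqueezeNet paper (version 1.0, with 1,000 ImageNet classes) rather than the SqueezeNet Lab, and the final classifier is assumed to be a 1x1 convolution as in the paper. Max-pooling, dropout, and global average-pooling have no weights, so they do not appear in the count.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases for a convolution layer with square kernels."""
    return kernel_size * kernel_size * in_channels * num_filters + num_filters


def fire_params(in_channels, squeeze_depth, expand_depth):
    """Return (parameter count, output channels) for one fire module."""
    squeeze = conv_params(1, in_channels, squeeze_depth)
    expand_1x1 = conv_params(1, squeeze_depth, expand_depth)
    expand_3x3 = conv_params(3, squeeze_depth, expand_depth)
    # The two expand branches are concatenated, doubling the depth
    return squeeze + expand_1x1 + expand_3x3, 2 * expand_depth


# Assumed (squeeze_depth, expand_depth) for fire2 through fire9
FIRE_CONFIG = [
    (16, 64), (16, 64), (32, 128), (32, 128),
    (48, 192), (48, 192), (64, 256), (64, 256),
]

# First layer: 96 kernel filters of size 7x7 on an RGB image
total = conv_params(7, 3, 96)
channels = 96
for squeeze_depth, expand_depth in FIRE_CONFIG:
    params, channels = fire_params(channels, squeeze_depth, expand_depth)
    total += params
# Final classifier: a 1x1 convolution with one filter per class
total += conv_params(1, channels, 1000)

print(total)  # 1248424
```

Under these assumptions the model has roughly 1.25 million weight parameters.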

Training the model for image recognition is equivalent to training the model for multiclass classification. In other words, we minimize the model's softmax cross-entropy loss.
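For a single image, the softmax cross-entropy loss can be sketched in plain Python as follows. The function name and the toy logits are assumptions made for this example; in practice the loss would come from the deep learning framework and be averaged over a batch.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    # Subtract the largest logit before exponentiating for numerical stability
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[label]


# Four classes, true class at index 2
loss_uniform = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)
loss_confident = softmax_cross_entropy([0.0, 0.0, 10.0, 0.0], label=2)
print(loss_uniform > loss_confident)  # True
```

The loss shrinks as the model assigns a larger logit to the correct class, which is what training pushes it to do.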

A full and interactive walkthrough of the model architecture implementation is in the SqueezeNet Lab.

SqueezeNet Applications

As previously mentioned, one of the main benefits of SqueezeNet is that it can be used directly on smaller devices like smartphones. There are numerous uses of image recognition in phone applications, such as facial recognition and product labeling.

With SqueezeNet, these phone applications can store a good image recognition model directly in memory, rather than needing to make an API call to a model stored in the cloud. This results in much faster processing and does not require Wi-Fi or LTE data.
