Interpreting how convnets decompose an input image
Visualizing the filters learned by convnets
Visualizing areas in an image responsible for a certain classification decision
A fundamental problem when building a computer vision application is interpretability: why did our classifier think a particular image contained a refrigerator when all we can see is a truck? This is especially relevant to use cases where deep learning complements human expertise, such as medical imaging. This chapter will familiarize you with a range of techniques for visualizing what convnets learn and understanding the decisions they make.
It’s often said that deep learning models are “black boxes”: they learn representations that are difficult to extract and present in a human-readable form. Although this is partially true for certain types of deep learning models, it’s definitely not true for convnets. The representations learned by convnets are highly amenable to visualization, in large part because they’re representations of visual concepts. Since 2013, a wide array of techniques has been developed for visualizing and interpreting these representations. We won’t survey all of them, but we’ll cover three of the most accessible and useful ones:
Visualizing intermediate convnet outputs (intermediate activations)—Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters
Visualizing convnet filters—Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to
Visualizing heatmaps of class activation in an image—Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images
For the first method—activation visualization—we’ll use the small convnet that we trained from scratch on the dogs vs. cats classification problem in chapter 8. For the next two methods, we’ll use a pretrained Xception model.
Visualizing intermediate activations consists of displaying the values returned by various convolution and pooling layers in a model, given a certain input (the output of a layer is often called its activation: the output of the activation function). This gives a view into how an input is decomposed into the different filters learned by the network. The feature maps we want to visualize have three dimensions: width, height, and depth (channels). Each channel encodes relatively independent features, so the proper way to visualize these feature maps is to plot the contents of every channel independently as a 2D image. Let’s start by loading the model that we saved in section 8.2:
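Here’s one way to do that, along with a test image to feed the model (a minimal sketch using the keras3 package; the saved file name and the image URL are assumptions, so adjust them to match what you used in chapter 8):

library(keras3)
model <- load_model("convnet_from_scratch_with_augmentation.keras")

img_path <- get_file(
  fname = "cat.jpg",
  origin = "https://img-datasets.s3.amazonaws.com/cat.jpg"
)
img <- img_path |>
  image_load(target_size = c(180, 180)) |>    # the input size the model was trained on
  image_to_array()

A small helper makes it easy to display this image (and, later, generated images) with base R graphics: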
display_image <- function(x, ..., max = 255L, margin = 0) {
  par(mar = rep(margin, 4))
  x |>
    as.array() |>
    drop() |>
    as.raster(max = max) |>
    plot(..., interpolate = FALSE)
}
display_image(img)
The test cat picture
To extract the feature maps we want to look at, we’ll create a Keras model that takes batches of images as input and outputs the activations of all convolution and pooling layers.
Listing 10.3: Instantiating a model that returns layer activations
1 Extracts the outputs of all Conv2D and MaxPooling2D layers and puts them in a named list
2 Creates a model that will return these outputs, given the model input
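The listing’s code isn’t reproduced here; a minimal sketch matching the two annotations above might look like this (it assumes the layers kept Keras’s default names, which contain "conv2d" and "max_pooling2d"):

layer_outputs <- list()
for (layer in model$layers)                                # 1
  if (grepl("conv2d|max_pooling2d", layer$name))
    layer_outputs[[layer$name]] <- layer$output

activation_model <- keras_model(inputs = model$input,      # 2
                                outputs = layer_outputs)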
When fed an image input, this model returns the values of the layer activations in the original model as a list. This is the first multi-output model you’ve encountered in practice since learning about them in chapter 7: until now, the models you’ve seen have had exactly one input and one output. This model has one input and nine outputs: one output per layer activation.
Listing 10.4: Using the model to compute layer activations
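A sketch of what this step might look like: we add a batch dimension to the cat image and run the activation model on it.

img_tensor <- array(img, dim = c(1, dim(img)))     # a batch containing a single image
activations <- predict(activation_model, img_tensor)
first_layer_activation <- activations[[1]]
dim(first_layer_activation)
[1]   1 178 178  32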
The first activation is a 178 × 178 feature map with 32 channels. Let’s plot the sixth channel of the activation of the first layer of the original model (see figure 10.2).
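We’ll need a small helper that plots a single channel as an image; something along these lines works (this helper is also reused later, when we visualize the Grad-CAM heatmap):

plot_activations <- function(x, ...) {
  x <- as.array(x)
  if (sum(x) == 0)                             # an all-zero channel: draw a gray square
    return(plot(as.raster("gray")))
  rotate <- function(m) t(apply(m, 2, rev))    # orient the matrix the way image() expects
  image(rotate(x), asp = 1, axes = FALSE, useRaster = TRUE,
        col = terrain.colors(256), ...)
}

plot_activations(first_layer_activation[, , , 6])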
Sixth channel of the activation of the first layer on the test cat picture
This channel appears to encode a diagonal edge detector, but note that your own channels may look different, because the specific filters learned by convolution layers depend on their random initialization.
Now, let’s plot a complete visualization of all the activations in the network (see figure 10.3). We’ll extract and plot every channel in each of the layer activations, and we’ll stack the results in one big grid, with channels stacked side by side.
Listing 10.6: Visualizing every channel in every intermediate activation
1 Iterates over the activations (and the names of the corresponding layers)
2 The layer activation has shape (1, size, size, n_features).
3 Prepares a plotting grid layout for all channels using par()
4 Plots a single channel (or feature)
5 Adds a title for the full grid of activations for the layer
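Again, the listing itself isn’t shown, but a sketch following the five annotations above could look like this (it assumes predict() returned a named list, one element per layer, as set up in listing 10.3):

for (layer_name in names(activations)) {                 # 1
  layer_activation <- activations[[layer_name]]
  n_features <- dim(layer_activation)[4]                 # 2
  par(mfrow = n2mfrow(n_features),                       # 3
      mar = rep(0.1, 4), oma = c(0, 0, 1.5, 0))
  for (j in seq_len(n_features))                         # 4
    plot_activations(layer_activation[, , , j])
  title(main = layer_name, outer = TRUE)                 # 5
}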
Every channel of every layer activation on the test cat picture
There are a few things to note here:
The first layer acts as a collection of various edge detectors. At that stage, the activations retain almost all of the information present in the initial picture.
As we go higher, the activations become increasingly abstract and less visually interpretable. They begin to encode higher-level concepts such as “cat ear” and “cat eye.” Higher representations carry increasingly less information about the visual contents of the image and increasingly more information related to the class of the image.
The sparsity of the activations increases with the depth of the layer: in the first layer, all filters are activated by the input image, but in the following layers, more and more channels are blank. A blank channel means the pattern encoded by that filter isn’t found in the input image.
We have just observed an important universal characteristic of the representations learned by deep neural networks: the features extracted by a layer become increasingly abstract with the depth of the layer. The activations of higher layers carry less and less information about the specific input being seen and more and more information about the target (in this case, the class of the image: cat or dog). A deep neural network effectively acts as an information distillation pipeline, with raw data going in (in this case, RGB pictures) and being repeatedly transformed so that irrelevant information is filtered out (for example, the specific visual appearance of the image), and useful information is magnified and refined (for example, the class of the image).
This is analogous to the way humans and animals perceive the world: after observing a scene for a few seconds, a human can remember which abstract objects were present in it (bicycle, tree) but can’t remember the specific appearance of these objects. In fact, if you tried to draw a generic bicycle from memory, chances are you couldn’t get it remotely right, even though you’ve seen thousands of bicycles in your lifetime (see, for example, figure 10.4). Try it right now: this effect is absolutely real. Your brain has learned to completely abstract its visual input—to transform it into high-level visual concepts while filtering out irrelevant visual details—making it tremendously difficult to remember how things around you look.
Left: Attempts to draw a bicycle from memory. Right: What a schematic bicycle should look like.
Another easy way to inspect the filters learned by convnets is to display the visual pattern that each filter is meant to respond to. This can be done with gradient ascent in input space, directly optimizing the values of the input image of a convnet so as to maximize the response of a specific filter, starting from a blank input image. The resulting input image will be one that the chosen filter is maximally responsive to.
Let’s try this with the filters of the Xception model. The process is simple: we’ll build a loss function that maximizes the value of a given filter in a given convolution layer, and then we’ll use gradient ascent to adjust the values of the input image so as to maximize this activation value. This will be your second example of a low-level gradient-based loop: the first one in chapter 2 performed gradient descent to minimize a loss; here we’ll use gradient ascent to maximize an activation. We will show it for TensorFlow, PyTorch, and JAX.
First, let’s instantiate the Xception model trained on the ImageNet dataset. We can once again use the KerasHub library, exactly as we did in chapter 8:
1 Instantiates the feature extractor network from pretrained weights
2 Loads the matching preprocessing to scale our input images
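One plausible way to write this with KerasHub (a sketch, assuming keras_hub has been imported via reticulate, as in chapter 8):

model <- keras_hub$models$Backbone$from_preset(                 # 1
  "xception_41_imagenet"
)
preprocessor <- keras_hub$layers$ImageConverter$from_preset(    # 2
  "xception_41_imagenet"
)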
We’re interested in the convolutional layers of the model: the Conv2D and SeparableConv2D layers. We need to know their names so we can retrieve their outputs. Let’s print their names in order of depth.
Listing 10.7: Printing the names of Xception convolutional layers
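A sketch that relies on the Python class names exposed through reticulate (both Conv2D and SeparableConv2D contain the substring "Conv2D"):

for (layer in model$layers)
  if (any(grepl("Conv2D", class(layer))))   # matches Conv2D and SeparableConv2D layers
    print(layer$name)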
You’ll notice that the SeparableConv2D layers here are all named something like block6_sepconv1, block7_sepconv2, etc.—Xception is structured into blocks, each containing several convolutional layers.
Now, let’s create a second model that returns the output of a specific layer: a “feature extractor” model. Because our model is a Functional API model, it is inspectable: we can query the output of one of its layers and reuse it in a new model. No need to copy all of the Xception code.
Listing 10.8: Feature extractor model returning a specific output
1 We could replace this with the name of any layer in the Xception convolutional base.
2 The layer object we’re interested in
3 Uses model$input and layer$output to create a model that, given an input image, returns the output of our target layer
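A minimal sketch matching the annotations above (the layer name is just an example):

layer_name <- "block3_sepconv1"                          # 1
layer <- get_layer(model, layer_name)                    # 2
feature_extractor <- keras_model(inputs = model$input,   # 3
                                 outputs = layer$output)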
To use this model, we can simply call it on some input data, but we should be careful to apply our model-specific image preprocessing so that our images are scaled to the same range as the Xception pretraining data:
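For example, reusing the cat image batch from earlier (any batch of images with values in the 0–255 range would do):

activation <- feature_extractor(preprocessor(img_tensor))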
Let’s use our feature extractor model to define a function that returns a scalar value quantifying how much a given input image “activates” a given filter in the layer. This is the loss function that we’ll maximize during the gradient ascent process:
1 The loss function takes an image tensor and the index of the filter we consider (an integer).
2 We avoid border artifacts by involving only non-border pixels in the loss: we discard the first 2 pixels along the sides of the activation.
3 Returns the mean of the activation values for the filter
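Here’s a sketch of that loss function. The slicing syntax assumes the TensorFlow backend, where the tensorflow R package lets us write 3:-3 to trim a 2-pixel border; the image is assumed to already be in the [0, 1] range the model expects, so no preprocessing is applied here.

compute_loss <- function(image, filter_index) {          # 1
  activation <- feature_extractor(image)
  filter_activation <-
    activation[, 3:-3, 3:-3, filter_index]               # 2
  mean(filter_activation)                                # 3
}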
Note: The difference between predict(model, x) and model(x)
In the previous chapter, we used predict(model, x) for feature extraction. Here, we’re using model(x). What gives?
Both y <- predict(model, x) and y <- model(x) (where x is an array of input data) mean “run the model on x and retrieve the output y.” But they aren’t exactly the same thing.
predict() loops over the data in batches (in fact, we can specify the batch size via predict(model, x, batch_size=64)) and extracts the array value of the outputs. It’s schematically equivalent to
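Something like this, where split_into_batches() and stack_batches() are hypothetical helpers standing in for Keras’s internal batching logic:

predict_sketch <- function(model, x, batch_size = 32) {
  outputs <- list()
  for (x_batch in split_into_batches(x, batch_size))      # hypothetical helper
    outputs[[length(outputs) + 1]] <- as.array(model(x_batch))
  stack_batches(outputs)                                  # hypothetical helper
}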
This means predict() calls can scale to very large arrays. Meanwhile, model(x) happens in memory and doesn’t scale. On the other hand, predict() is not differentiable: TensorFlow, PyTorch, and JAX cannot backpropagate through it.
Another practical difference is that predict() returns an R array, whereas model() returns a tensor object (or, in the case of a multi-output model, a list of R arrays and a list of tensor objects, respectively).
You should use model(x) when you need to retrieve the gradients of the model call, and you should use predict() if you just need the output value. In other words, always use predict() unless you’re in the middle of writing a low-level gradient-based loop (as we are now).
A non-obvious trick to help the gradient-ascent process go smoothly is to normalize the gradient tensor by dividing it by its L2 norm (the square root of the sum of the squares of the values in the tensor). This ensures that the magnitude of the updates done to the input image is always within the same range.
Let’s set up the gradient ascent step function. Anything that involves gradients requires calling backend-level APIs, such as GradientTape() in TensorFlow, $backward() in PyTorch, and jax$grad() in JAX. Let’s line up all the code snippets for each of the three backends, starting with TensorFlow.
10.2.1 Gradient ascent in TensorFlow
For TensorFlow, we can open a GradientTape() scope and compute the loss inside of it to retrieve the gradients we need. We’ll wrap it with tf_function() to speed up computation.
Listing 10.9: Stochastic gradient ascent in TensorFlow
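A sketch of that step function for the TensorFlow backend (it assumes library(tensorflow) is loaded, so that tf, tf_function(), and the %as% helper are available):

gradient_ascent_step <- tf_function(
  function(image, filter_index, learning_rate) {
    with(tf$GradientTape() %as% tape, {
      tape$watch(image)                       # image is a plain tensor, not a Variable, so watch it explicitly
      loss <- compute_loss(image, filter_index)
    })
    grads <- tape$gradient(loss, image)
    grads <- tf$math$l2_normalize(grads)      # the gradient normalization trick described above
    image + (learning_rate * grads)           # ascent: move the image in the direction of the gradient
  }
)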
Now we have all the pieces. Let’s put them together in an R function that takes a filter index as input and returns a tensor representing the pattern that maximizes the activation of the specified filter in our target layer.
Listing 10.10: Function to generate filter visualizations
3 Initializes an image tensor with random values (the Xception model expects input values in the [0, 1] range, so here we pick a range centered on 0.5)
4 Repeatedly updates the values of the image tensor to maximize our loss function
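A sketch of such a function, with the iteration count, learning rate, and 200 × 200 image size picked as plausible defaults:

img_width <- img_height <- 200

generate_filter_pattern <- function(filter_index) {
  iterations <- 30
  learning_rate <- 10
  image <- tf$random$uniform(                            # 3
    minval = 0.4, maxval = 0.6,
    shape = c(1L, img_width, img_height, 3L)
  )
  for (i in seq_len(iterations))                         # 4
    image <- gradient_ascent_step(image, filter_index, learning_rate)
  image[1, , , ]                                         # drop the batch dimension
}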
The resulting image tensor is a floating-point array of shape (200, 200, 3), whose values may fall outside [0, 255] and aren’t integers. Hence, we need to post-process this tensor to turn it into a displayable image. We do so with the following straightforward utility function.
Listing 10.11: Utility function to convert a tensor into a valid image
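A sketch of the post-processing: normalize the values, rescale them into [0, 255], and (as an extra assumption) trim a 25-pixel border where artifacts tend to concentrate.

deprocess_image <- function(image) {
  image <- as.array(image)
  image <- image - mean(image)
  image <- image / sd(image)
  image <- (image * 64) + 128                  # most values now fall inside [0, 255]
  image <- pmax(0, pmin(255, image))           # clip the rest
  image[26:(nrow(image) - 25),
        26:(ncol(image) - 25), ]
}

display_image(deprocess_image(generate_filter_pattern(2L)))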
Some filter patterns for layers block2_sepconv1, block4_sepconv1, and block8_sepconv1.
These filter visualizations tell us a lot about how convnet layers see the world: each layer in a convnet learns a collection of filters such that their inputs can be expressed as a combination of the filters. This is similar to how the Fourier transform decomposes signals onto a bank of cosine functions. The filters in these convnet filter banks become increasingly complex and refined as we go higher in the model:
The filters from the first layers in the model encode simple directional edges and colors (or colored edges, in some cases).
The filters from layers a bit further up the stack, such as block4_sepconv1, encode simple textures made from combinations of edges and colors.
The filters in higher layers begin to resemble textures found in natural images: feathers, eyes, leaves, and so on.
Here’s one last visualization technique—one that is useful for understanding which parts of a given image led a convnet to its final classification decision. This is helpful for “debugging” the decision process of a convnet, particularly in the case of a classification mistake (a problem domain called model interpretability). It can also allow us to locate specific objects in an image.
This general category of techniques is called class activation map (CAM) visualization, and it consists of producing heatmaps of class activation over input images. A class activation heatmap is a 2D grid of scores associated with a specific output class, computed for every location in any input image, indicating how important each location is with respect to the class under consideration. For instance, given an image fed into a dogs vs. cats convnet, CAM visualization would allow us to generate a heatmap for the class “cat,” indicating how cat-like different parts of the image are, and also a heatmap for the class “dog,” indicating how dog-like parts of the image are. The specific implementation we’ll use is described in Ramprasaath R. Selvaraju et al., “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization” (2019), https://arxiv.org/abs/1610.02391.
Grad-CAM consists of taking the output feature map of a convolution layer, given an input image, and weighting every channel in that feature map by the gradient of the class with respect to the channel. Intuitively, one way to understand this trick is that we’re weighting a spatial map of “how intensely the input image activates different channels” by “how important each channel is with regard to the class,” resulting in a spatial map of “how intensely the input image activates the class.”
Let’s demonstrate this technique using the pretrained Xception model. Consider the image of two African elephants shown in figure 10.7, possibly a mother and her calf, strolling on the savanna. We can start by downloading this image and converting it to an array.
1 Downloads the image and stores it locally under the path img_path
2 Loads, converts to an array, and adds a batch dimension
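A sketch matching those annotations (the URL points at the book’s sample-image bucket, which is an assumption here):

img_path <- get_file(                            # 1
  fname = "elephant.jpg",
  origin = "https://img-datasets.s3.amazonaws.com/elephant.jpg"
)
img <- img_path |>                               # 2
  image_load() |>
  image_to_array()
img <- array(img, dim = c(1, dim(img)))          # add a batch dimension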
So far, we have used KerasHub only to instantiate a pretrained feature extractor network using the backbone class. For Grad-CAM, we need the entire Xception model, including the classification head: recall that Xception was trained on the ImageNet dataset with ~1 million labeled images belonging to 1,000 different classes.
KerasHub provides a high-level task API for common end-to-end workflows like image classification, text classification, image generation, and so on. A task wraps preprocessing, a feature extraction network, and a task-specific head into a single class that is easy to use. Let’s try it out:
model <- keras_hub$models$ImageClassifier$from_preset(
  "xception_41_imagenet",
  activation = "softmax"        # 1
)
preds <- predict(model, img)    # 2
str(preds)
1 We can configure the final activation of the classifier. Here we use a softmax activation, so our outputs are probabilities.
2 ImageNet has 1,000 classes, so each prediction from our classifier has 1,000 entries.
num [1, 1:1000] 0.00000535 0.00002684 0.0000167 0.00001146 0.00001136 ...
The top five classes predicted for this image are as follows:
African elephant (with 90% probability)
Tusker (with 5% probability)
Indian elephant (with 2% probability)
Triceratops and Mexican hairless dog with less than 0.1% probability
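If you want to check those numbers yourself, you can sort the prediction vector directly (mapping the indices to human-readable ImageNet class names is left out here):

top_ids <- order(preds[1, ], decreasing = TRUE)[1:5]
round(preds[1, top_ids], 3)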
The network has recognized the image as containing an undetermined quantity of African elephants. The entry in the prediction vector that was maximally activated is the one corresponding to the “African elephant” class, at index 387:
which.max(preds[1, ])
[1] 387
To visualize which parts of the image are the most African elephant–like, let’s set up the Grad-CAM process.
Note that we didn’t need to preprocess our image before calling the task model. That’s because the KerasHub ImageClassifier preprocesses inputs for us as part of predict(). Let’s preprocess the image ourselves so we can use the preprocessed inputs directly:
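One way to do that, assuming the task model exposes its bundled preprocessor:

preprocessed_img <- model$preprocessor(img)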
Then we compute the gradient of the top predicted class for our input image with respect to the activations of the last convolution layer. Once again, needing to compute gradients means we have to use backend APIs.
10.3.1 Getting the gradient of the top class: TensorFlow version
Let’s start with the TensorFlow version, once again using GradientTape.
Listing 10.15: Computing top class gradients with TensorFlow
2 Creates a copy of last_conv_layer_output that we can get gradients for
3 Retrieves the activation channel corresponding to the top predicted class
4 Gets the gradient of the top predicted class with regard to the output feature map of the last convolutional layer
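The listing itself isn’t reproduced, but a sketch for the TensorFlow backend might look like the following. It assumes two hypothetical helper models, not shown here: last_conv_layer_model, which maps a preprocessed image batch to the output of the last convolutional layer, and classifier_model, which maps that output to the softmax predictions.

top_pred_index <- which.max(preds[1, ])                              # 387, as computed above
last_conv_layer_output <- last_conv_layer_model(preprocessed_img)    # hypothetical helper model
with(tf$GradientTape() %as% tape, {
  last_conv_layer_output <- tf$identity(last_conv_layer_output)      # 2
  tape$watch(last_conv_layer_output)
  class_preds <- classifier_model(last_conv_layer_output)            # hypothetical helper model
  top_class_channel <- class_preds[, top_pred_index]                 # 3
})
grads <- tape$gradient(top_class_channel, last_conv_layer_output)    # 4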
10.3.3 Getting the gradient of the top class: JAX version
Finally, let’s do JAX. We define a separate loss-computation function that takes the final layer’s output and returns the activation channel corresponding to the top predicted class. We use this activation value as our loss, allowing us to compute the gradient.
Listing 10.17: Computing the top class gradients with JAX
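A sketch for the JAX backend, reusing the same hypothetical helper models and the top_pred_index from before; the one-hot mask is just a backend-agnostic way to select the top class score:

jax <- reticulate::import("jax")

last_conv_layer_output <- last_conv_layer_model(preprocessed_img)    # hypothetical helper model, as before

one_hot <- array(0, dim = c(1, 1000))
one_hot[1, top_pred_index] <- 1                                      # selects the top predicted class

compute_top_class_activation <- function(last_conv_layer_output) {
  class_preds <- classifier_model(last_conv_layer_output)            # hypothetical helper model
  op_sum(op_multiply(class_preds, op_convert_to_tensor(one_hot)))
}

grad_fn <- jax$grad(compute_top_class_activation)
grads <- grad_fn(last_conv_layer_output)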
1 Vector in which each entry is the mean intensity of the gradient for a given channel. It quantifies the importance of each channel with regard to the top predicted class. It has shape (1, 1, 1, 2048).
2 Multiplies each channel in the output of the last convolutional layer by “how important this channel is”. The output has shape (1, 10, 10, 2048).
3 The channel-wise mean of the resulting feature map is our heatmap of class activation, with shape (10, 10).
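The listing that computes the heatmap itself isn’t reproduced here; the three annotations above describe it (they keep everything as four-dimensional tensors). An equivalent computation on plain R arrays, assuming grads and last_conv_layer_output from the previous step, looks like this; the only difference is that pooled_grads is kept as a plain vector of length 2048 rather than a (1, 1, 1, 2048) tensor.

grads <- op_convert_to_numpy(grads)
conv_output <- op_convert_to_numpy(last_conv_layer_output)

pooled_grads <- apply(grads, 4, mean)           # 1
for (i in seq_along(pooled_grads))              # 2
  conv_output[, , , i] <- conv_output[, , , i] * pooled_grads[i]
heatmap <- apply(conv_output, c(2, 3), mean)    # 3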
The result is shown in figure 10.8.
Listing 10.19: Visualizing the heatmap
plot_activations(heatmap)
Standalone class activation heatmap.
Finally, let’s generate an image that superimposes the original image on the heatmap we just obtained (see figure 10.9).
Listing 10.20: Superimposing the heatmap on the original picture
1 Converts the heatmap into a raster object with 40% opacity
2 Loads the original image, this time without resizing or preprocessing
3 Superimposes the heatmap over the original image
4 Passes ncol(img) and nrow(img) so that the heatmap, which has fewer pixels, is drawn to match the size of the original image
5 Passes interpolate=FALSE so we can clearly see the activation map pixel boundaries
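A sketch that follows the five annotations; img_path is the elephant image downloaded earlier, heatmap is the 10 × 10 matrix we just computed, and display_image() is the helper from the start of the chapter:

pal <- hcl.colors(256, palette = "Spectral", alpha = 0.4, rev = TRUE)   # 1
heatmap_colored <- heatmap
heatmap_colored[] <- pal[cut(heatmap, 256)]     # map heatmap values to semi-transparent colors
heatmap_raster <- as.raster(heatmap_colored)

img <- img_path |> image_load() |> image_to_array()    # 2
display_image(img)
rasterImage(heatmap_raster,                             # 3
            0, 0, ncol(img), nrow(img),                 # 4
            interpolate = FALSE)                        # 5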
African elephant class activation heatmap over the test picture
This visualization technique answers two important questions:
Why did the network think this image contained an African elephant?
Where is the African elephant located in the picture?
In particular, it’s interesting to note that the ears of the elephant calf are strongly activated: this is probably how the network can tell the difference between African and Indian elephants.
10.4 Summary
Convnets process images by applying a set of learned filters. Filters from earlier layers detect edges and basic textures, and filters from later layers detect increasingly abstract concepts.
You can visualize both the pattern that a filter detects and a filter’s response map across an image.
You can use the Grad-CAM technique to visualize what area(s) in an image were responsible for a classifier’s decision.
Together, these techniques make convnets highly interpretable.