Scientific Papers

Visualizing Weights

This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.


The problem of understanding a neural network is a little bit like reverse engineering a large compiled binary of a computer program. In this analogy, the weights of the neural network are the compiled assembly instructions. At the end of the day, the weights are the fundamental thing you want to understand: how does this sequence of convolutions and matrix multiplications give rise to model behavior?

Trying to understand artificial neural networks also has a lot in common with neuroscience, which tries to understand biological neural networks. As you may know, one major endeavor in modern neuroscience is mapping the connectomes of biological neural networks: which neurons connect to which. These connections, however, will only tell neuroscientists which weights are non-zero. Getting the weights – knowing whether a connection excites or inhibits, and by how much – would be a significant further step. One imagines neuroscientists might give a great deal to have the access to weights that those of us studying artificial neural networks get for free.

And so, it’s rather surprising how little attention we actually give to looking at the weights of neural networks. There are a few exceptions to this, of course. It’s quite common for researchers to show pictures of the first layer weights in vision models (these are directly connected to RGB channels, so they’re easy to understand as images). In some work, especially historically, we see researchers reason about the weights of toy neural networks by hand. And we quite often see researchers discuss aggregate statistics of weights. But actually looking at the weights of a neural network other than the first layer is quite uncommon – to the best of our knowledge, mapping weights between hidden layers to meaningful algorithms is novel to the circuits project.

What’s the difference between visualizing activations, weights, and attributions?

In this article, we’re focusing on visualizing weights. But people often visualize activations, attributions, gradients, and much more. How should we think about the meaning of visualizing these different objects?

  • Activations: We generally think of these as being “what” the network saw. If understanding a neural network is like reverse compiling a computer program, the neurons are the variables, and the activations are the values of those variables.
  • Weights: We generally think of these as being “how” the neural network computes one layer from the previous one. In the reverse engineering analogy, these are compiled assembly instructions.
  • Attributions: Attributions try to tell us the extent to which one neuron influenced a later neuron.We often think of this as “why” the neuron fired. We need to be careful with attributions, because they’re a human-defined object on top of a neural network rather than a fundamental object. They aren’t always well defined, and people mean different things by them. (They are very well defined if you are only operating across adjacent layers!)

Why it’s non-trivial to study weights in hidden layers

It seems to us that there are three main barriers to making sense of the weights in neural networks, which may have contributed to researchers tending to not directly inspect them:

  • Lack of Contextualization: Researchers often visualize weights in the first layer, because they are linked to RGB values that we understand. That connection makes weights in the first layer meaningful. But weights between hidden layers are meaningless by default: knowing nothing about either the source or the destination, how can we make sense of them?
  • Indirect Interaction: Sometimes, the meaningful weight interactions are between neurons which aren’t literally adjacent in a neural network. For example, in a residual network, the output of one neuron can pass through the additive residual stream and linearly interact with a neuron much later in the network. In other cases, neurons may interact through intermediate neurons without significant nonlinear interactions. How can we efficiently reason about these interactions?
  • Dimensionality and Scale: Neural networks have lots of neurons. Those neurons connect to lots of other neurons. There’s a lot of data to display! How can we reduce it to a human-scale amount of information?

Many of the methods we’ll use to address these problems were previously explored in Building Blocks in the context of understanding activation vectors. The goal of this article is to show how similar ideas can be applied to weights instead of activations. Of course, we’ve already implicitly used these methods in various circuit articles, but in those articles the methods have been of secondary interest to the results. It seems useful to give some dedicated discussion to the methods.

Aside: One Simple Trick

Interpretability methods often fail to take off because they’re hard to use. So before diving into sophisticated approaches, we wanted to offer a simple, easy to apply method.

In a convolutional network, the input weights for a given neuron have shape [width, height, input_channels]. Unless this is the first convolutional layer, this probably can’t be easily visualized because input_channels is large. (If this is the first convolutional layer, visualize it as is!) However, one can use dimensionality reduction to collapse input_channels down to 3 dimensions. We find one-sided NMF especially effective for this.

NMF of input weights in InceptionV1 mixed4d_5x5, for a selection of ten neurons. The red, green, and blue channels on each grid indicate the weights for each of the 3 NMF factors.

This visualization doesn’t tell you very much about what your weights are doing in the context of the larger model, but it does show you that they are learning nice spatial structures. This can be an easy sanity check that your neurons are learning, and a first step towards understanding your neuron’s behavior. We’ll also see later that this general approach of factoring weights can be extended into a powerful tool for studying neurons.

Despite this lack of contextualization, one-sided NMF can be a great technique for investigating multiple channels at a glance. One thing you may quickly discover using this method is that, in models with global average pooling at the end of their convolutional layers, the last few layers will have all their weights be horizontal bands.

Horizontally-banded weights in InceptionV1 mixed5b_5x5, for a selection of eight neurons. As in Figure 1, the red, green, and blue channels on each grid indicate the weights for each of the 3 NMF factors.

We call this phenomenon weight banding. One-sided NMF allows for quickly testing and validating hypotheses about phenomena such as weight banding.

Contextualizing Weights with Feature Visualization

Of course, looking at weights in a vacuum isn’t very interesting. In order to really understand what’s going on, we need to contextualize weights in the broader context of the network. The challenge of contextualization is a recurring challenge in understanding neural networks: we can easily observe every activation, every weight, and every gradient; the challenge lies in determining what those values represent.

Recall that the weights between two convolutional layers are a four dimensional array of the shape:

[relative x position, relative y position,
input channels, output channels]

If we fix the input channel and the output channel, we get a 2D array we can present with traditional data visualization. Let’s assume we know which neuron we’re interested in understanding, so we have the output channel. We can pick the input channels with high magnitude weights to our output channel.

But what does the input represent? What about the output?

The key trick is that techniques like feature visualization (or deeper investigations of neurons) can help us understand what the input and output neurons represent, contextualizing the graph. Feature visualizations are especially attractive because they’re automatic, and produce a single image which is often very informative about the neuron. As a result, we often represent neurons as feature visualizations in weight diagrams.

3: Contextualizing weights.

This approach is the weight analogue of using feature visualizations to contextualize activation vectors in Building Blocks (see the section titled “Making Sense of Hidden Layers”).

We can liken this to how, when reverse-engineering a normal compiled computer program, one would need to start assigning variable names to the values stored in registers to keep track of them. Feature visualizations are essentially automatic variable names for neurons, which are roughly analogous to those registers or variables.

Small Multiples

Of course, neurons have multiple inputs, and it can be helpful to show the weights to several inputs at a time as a small multiple:

And if we have two families of related neurons interacting, it can sometimes even be helpful to show the weights between all of them as a grid of small multiples:

5: Small multiple weights for a variety of curve detectors.

Advanced Approaches to Contextualization with Feature Visualization

Although we most often use feature visualization to visualize neurons, we can visualize any direction (linear combination of neurons). This opens up a very wide space of possibilities for visualizing weights, of which we’ll explore a couple particularly useful ones.

Visualizing Spatial Position Weights

Recall that the weights for a single neuron have shape [width, height, input_channels]. In the previous section we split up input_channels and visualized each [width, height] matrix. But an alternative approach is to think of there as being a vector over input neurons at each spatial position, and to apply feature visualization to each of those vectors. You can think of this as telling us what the weights in that position are collectively looking for.

6. Left: Feature visualization of a car neuron. Right: Feature visualizations of the vector over input neurons at each spatial position of the car neuron’s weights. As we see, the car neuron broadly responds to window features above wheel features.

This visualization is the weight analogue of the “Activation Grid” visualization from Building Blocks. It can be a nice, high density way to get an overview of what the weights for one neuron are doing. However, it will be unable to capture cases where one position responds to multiple very different things, as in a multi-faceted or polysemantic neuron.

Visualizing Weight Factors

Feature visualization can also be applied to factorizations of the weights, which we briefly discussed earlier. This is the weight analogue to the “Neuron Groups” visualization from Building Blocks.

This can be especially helpful when you have a group of neurons like high-low frequency detectors or black and white vs color detectors that look are all mostly looking for a small number of factors. For example, a large number of high-low frequency detectors can be significantly understood as combining just two factors – a high frequency factor and a low-frequency factor – in different patterns.



NMF factorization on the weights (excitatory and inhibitory) connecting six high-low frequency detectors in InceptionV1 to the
layer conv2d2.

These factors can then be decomposed into individual neurons for more detailed understanding.

8: Neurons (by their feature visualizations) that contribute to the two NMF factors from Figure 7, plus the weighted amount they contribute to
each factor.

Dealing with Indirect Interactions

As we mentioned earlier, sometimes the meaningful weight interactions are between neurons which aren’t literally adjacent in a neural network, or where the weights aren’t directly represented in a single weight tensor. A few examples:

  • In a residual network, the output of one neuron can pass through the additive residual stream and linearly interact with a neuron much later in the network.
  • In a separable convolution, weights are stored as two or more factors, and need to be expanded to link neurons.
  • In a bottleneck architecture, neurons in the bottleneck may primarily be a low-rank projection of neurons from the previous layer.
  • The behavior of an intermediate layer simply doesn’t introduce much non-linear behavior, leaving two neurons in non-adjacent layers with a significant linear interaction.

As a result, we often work with “expanded weights” – that is, the result of multiplying adjacent weight matrices, potentially ignoring non-linearities. We generally implement expanded weights by taking gradients through our model, ignoring or replacing all non-linear operations with the closest linear one.

These expanded weights have the following properties:

  • If two layers interact linearly, the expanded weights will give the true linear map, even if the model doesn’t explicitly represent the weights in a single weight matrix.
  • If two layers interact non-linearly, the expanded weights can be seen as the expected value of the gradient up to a constant factor, under the assumption that all neurons have an equal (and independent) probability of firing.

They also have one additional benefit, which is more of an implementation detail: because they’re implemented in terms of gradients, you don’t need to know how the weights are represented. For example, in TensorFlow, you don’t need to know which variable object represents the weights. This can be a significant convenience when you’re working with unfamiliar models!

Benefits of Expanded Weights

Multiplying out the weights like this can sometimes help us see a simpler underlying structure. For example, mixed3b 208 is a black and white center detector. It’s built by combining a bunch of black and white vs color detectors together.

9. mixed3b 208 along with five neurons from mixed3a that contribute the strongest weights to it.

Expanding out the weights allows us to see an important aggregate effect of these connections: together, they look for the absence of color in the center one layer further back.

10. Top eighteen expanded weights from conv2d2 to mixed3b 208, organized in two rows according to weight factorization.

A particularly important use of this method – which we’ve been implicitly using in earlier examples – is to jump over “bottleneck layers.” Bottleneck layers are layers of the network which squeeze the number of channels down to a much smaller number, typically in a branch, making large spatial convolutions cheaper. The bottleneck layers of InceptionV1 are one example. Since so much information is compressed, these layers are often polysemantic, and it can often be more helpful to jump over them and understand the connection to the wider layer before them.

Cases where expanded weights are misleading

Expanded weights can, of course, be misleading when non-linear structure is important. A good example of this is boundary detectors. Recall that boundary detectors usually detect both low-to-high and high-to-low frequency transitions:

11. Boundary detectors such as mixed3b 345 detect both low-to-high and high-to-low frequency transitions.

Since high-low frequency detectors are usually excited by high-frequency patterns on one side and inhibited on the other (and vice versa for low frequency), detecting both directions means that the expanded weights cancel out! As a result, expanded weights appear to show that boundary detectors are neither excited or inhibited by high frequency detectors two layers back, when in fact they are both excited and also inhibited by high frequency, depending on the context, and it’s just that those two different cases cancel out.


Neurons two layers back (such as conv2d2 89) may have a strong influence on the high-low frequency detectors that contribute to mixed3b 345 (top), but that influence washes out when we look at the expanded weights (bottom) directly between conv2d2 89 and mixed3b 345.

More sophisticated techniques for describing multi-layer interactions can help us understand cases like this. For example, one can determine what the “best case” excitation interaction between two neurons is (that is, the maximum achievable gradient between them). Or you can look at the gradient for a particular example. Or you can factor the gradient over many examples to determine major possible cases. These are all useful techniques, but we’ll leave them for a future article to discuss.

Qualitative properties

One qualitative property of expanding weights across many layers deserves mention before we end our discussion of them. Expanded weights often get this kind of “electron orbital”-like smooth spatial structures:

13. Smooth spatial structure of some expanded weights from mixed3b 268 to conv2d1.

Although the exact structures present may vary from neuron to neuron, this example is not cherry-picked: this smoothness is typical of most multiple-layer expanded weights.

It’s not clear how to interpret this, but it’s suggestive of rich spatial structure on the scale of multiple layers.

Dimensionality and Scale

So far, we’ve addressed the challenges of contextualization and indirection interactions. But we’ve only given a bit of attention to our third challenge of dimensionality and scale. Neural networks contain many neurons and each one connects to many others, creating a huge amount of weights. How do we pick which connections between neurons to look at?

For the purposes of this article, we’ll put the question of which neurons we want to study outside of our scope, and only discuss the problem of picking which connections to study. (We may be trying to comprehensively study a model, in which case we want to study all neurons. But we might also, for example, be trying to study neurons we’ve determined related to some narrower aspect of model behavior.)

Generally, we chose to look at the largest weights, as we did at the beginning of the section on contextualization. Unfortunately, there tends to be a long tail of small weights, and at some point it generally gets impractical to look at these. How much of the story is really hiding in these small weights? We don’t know, but polysemantic neurons suggest there could be a very important and subtle story hiding here! There’s some hope that sparse neural networks might make this much better, by getting rid of small weights, but whether such conclusions can be drawn about non-sparse networks is presently speculative.

An alternative strategy that we’ve brushed on a few times is to reduce your weights into a few components and then study those factors (for example, with NMF). Often, a very small number of components can explain much of the variance. In fact, sometimes a small number of factors can explain the weights of an entire set of neurons! Prominent examples of this are high-low frequency detectors (as we saw earlier) and black and white vs color detectors.

However, this approach also has downsides. Firstly, these components can be harder to understand and even polysemantic. For example, if you apply the basic version of this method to a boundary detector, one component will contain both high-to-low and low-to-high frequency detectors which will make it hard to analyze. Secondly, your factors no longer align with activation functions, which makes analysis much messier. Finally, because you will be reasoning about every neuron in a different basis, it is difficult to build a bigger picture view of the model unless you convert your components back to neurons.

This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.

Source link