Deep Orientation

Brain-Like Orientation Invariance in Deep Nets

Colin Conwell https://colinconwell.github.io/ (Harvard University), George Alvarez https://scorsese.wjh.harvard.edu/George/ (Harvard University)
January 5, 2020

Orientation Invariance

… in the Human Visual System

The ability to recognize objects despite substantial variation in the position, size, lighting and orientation in which those objects appear is a defining characteristic of biological visual intelligence, often referred to simply as invariance or invariant representation. Orientation invariance – defined here as the representation of a stimulus that does not significantly vary as the stimulus is viewed from different angles – emerges abruptly in the human visual cortical information processing cascade (Morgan & Alvarez, VSS 2014). In V3, for example, we observe little to no invariance: the difference between the neural patterns elicited by a given stimulus and the same stimulus rotated to 90 degrees is as large as the difference between the neural patterns elicited by two different stimuli. In LOC, on the other hand, we observe strong invariance, with no statistically significant differences across the neural activity elicited by the same stimuli across any rotations.

… in Deep Neural Networks

Deep neural network models (computer vision algorithms defined by distributed computations in depth) have previously been shown to capture the representational geometry of neural responses to different objects, but it remains unclear whether they show the same types of invariance we observe in different parts of the human visual system. The question we ask, then, is: where, if anywhere, does orientation invariance emerge in deep neural networks (DNNs)?

fMRI Data

Participants in the fMRI study were shown a series of 8 stimuli at 5 different rotations (0, 45, 90, 135, and 180 degrees) – examples of which you can explore in the carousel below.

We organize our neural responses by voxel into large-scale regions of interest (ROIs) across the dorsal and ventral streams of visual cortex, including: early visual cortex (EVC), lateral occipital complex (LOC), occipitotemporal cortex (OTC), and occipitoparietal cortex (OPC). The responses in each ROI are then used to compute representational dissimilarity matrices (RDMs) with a Pearson distance metric.
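As a concrete (and hypothetical) sketch of this step – assuming a conditions-by-voxels response matrix, with one row per stimulus-by-rotation condition – the RDM computation might look like:

```python
import numpy as np

def pearson_rdm(responses):
    """Pearson-distance RDM from a (conditions x voxels) response matrix.

    Each row is the ROI's response pattern to one stimulus-by-rotation
    condition; output cell [i, j] is 1 - r between conditions i and j.
    """
    # np.corrcoef correlates rows, giving a condition-by-condition matrix
    return 1.0 - np.corrcoef(responses)

# e.g. 8 stimuli x 5 rotations = 40 conditions for one ROI
roi_responses = np.random.randn(40, 500)  # placeholder voxel data
rdm = pearson_rdm(roi_responses)
```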

The key to assessing orientation invariance here lies in properly parsing our representational (dis)similarities (the cells of the RDM) into two broad categories: the similarity of each stimulus to the same stimulus at different rotations (‘within-category’ similarity) and to different stimuli at the same rotation (‘across-category’ similarity). (A third similarity – the similarity of a stimulus to itself – can be used to compute a “reliability ceiling”, though this does require some variance across repeated measures.)
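Under one plausible reading of this parsing – assuming conditions are ordered stimulus-major, with all rotations of one stimulus grouped together, and comparing each stimulus at 0 degrees to the rotated conditions – a hypothetical helper might look like:

```python
import numpy as np

def parse_rdm(rdm, n_stimuli=8, n_rotations=5):
    """Split RDM cells into within- and across-category similarities,
    keyed by rotation index (0 = unrotated)."""
    sim = 1.0 - rdm  # convert Pearson distance back to similarity
    within = {r: [] for r in range(n_rotations)}
    across = {r: [] for r in range(n_rotations)}
    for s1 in range(n_stimuli):
        for r in range(n_rotations):
            # within: stimulus s1 at 0 degrees vs the same stimulus rotated
            within[r].append(sim[s1 * n_rotations, s1 * n_rotations + r])
            # across: stimulus s1 at 0 degrees vs every *other* stimulus
            # shown at that rotation (one plausible reading of the text)
            for s2 in range(n_stimuli):
                if s2 != s1:
                    across[r].append(sim[s1 * n_rotations, s2 * n_rotations + r])
    return within, across
```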

When we’ve properly parsed our RDMs for each brain area, we obtain the following plot:

On the x axis here are the differences in orientation between the reference stimulus and the stimulus being compared. On the y axis is the similarity score. Error bars are 95% confidence intervals calculated across the 8 different stimulus types.

The tell-tale sign of invariance in this plot is the overlap (or absence thereof) in the error bars of the within- and across-category similarities at 90 degrees of rotation. If an overlap is present, invariance is absent, and vice versa. Notice that early visual cortex (EVC) shows little to no invariance, while lateral occipital complex (LOC) and occipitotemporal cortex (OTC) show strong invariance.

Models and Modeling

To survey the possibility space of orientation-invariant representations in DNNs, we gather a large battery of models (60 trained and randomly initialized models with different architectures from the Torchvision model zoo, and 24 models with the same architecture trained on different tasks from the Taskonomy project). A table of all models we use is shown below; a sketch of how the Torchvision models can be loaded follows the table.

model train_type description
alexnet imagenet AlexNet trained on image classification with the ImageNet dataset.
vgg11 imagenet VGG11 trained on image classification with the ImageNet dataset.
vgg13 imagenet VGG13 trained on image classification with the ImageNet dataset.
vgg16 imagenet VGG16 trained on image classification with the ImageNet dataset.
vgg19 imagenet VGG19 trained on image classification with the ImageNet dataset.
vgg11_bn imagenet VGG11-BatchNorm trained on image classification with the ImageNet dataset.
vgg13_bn imagenet VGG13-BatchNorm trained on image classification with the ImageNet dataset.
vgg16_bn imagenet VGG16-BatchNorm trained on image classification with the ImageNet dataset.
vgg19_bn imagenet VGG19-BatchNorm trained on image classification with the ImageNet dataset.
resnet18 imagenet ResNet18 trained on image classification with the ImageNet dataset.
resnet34 imagenet ResNet34 trained on image classification with the ImageNet dataset.
resnet50 imagenet ResNet50 trained on image classification with the ImageNet dataset.
resnet101 imagenet ResNet101 trained on image classification with the ImageNet dataset.
resnet152 imagenet ResNet152 trained on image classification with the ImageNet dataset.
squeezenet1_0 imagenet SqueezeNet1.0 trained on image classification with the ImageNet dataset.
squeezenet1_1 imagenet SqueezeNet1.1 trained on image classification with the ImageNet dataset.
densenet121 imagenet DenseNet121 trained on image classification with the ImageNet dataset.
densenet161 imagenet DenseNet161 trained on image classification with the ImageNet dataset.
densenet169 imagenet DenseNet169 trained on image classification with the ImageNet dataset.
densenet201 imagenet DenseNet201 trained on image classification with the ImageNet dataset.
googlenet imagenet GoogleNet trained on image classification with the ImageNet dataset.
shufflenet_v2_x0_5 imagenet ShuffleNet-V2-x0.5 trained on image classification with the ImageNet dataset.
shufflenet_v2_x1_0 imagenet ShuffleNet-V2-x1.0 trained on image classification with the ImageNet dataset.
mobilenet_v2 imagenet MobileNet-V2 trained on image classification with the ImageNet dataset.
resnext50_32x4d imagenet ResNext50-32x4D trained on image classification with the ImageNet dataset.
resnext101_32x8d imagenet ResNext101-32x8D trained on image classification with the ImageNet dataset.
wide_resnet50_2 imagenet Wide-ResNet50 trained on image classification with the ImageNet dataset.
wide_resnet101_2 imagenet Wide-ResNet101 trained on image classification with the ImageNet dataset.
mnasnet0_5 imagenet MNASNet0.5 trained on image classification with the ImageNet dataset.
mnasnet1_0 imagenet MNASNet1.0 trained on image classification with the ImageNet dataset.
alexnet random AlexNet randomly initialized, with no training.
vgg11 random VGG11 randomly initialized, with no training.
vgg13 random VGG13 randomly initialized, with no training.
vgg16 random VGG16 randomly initialized, with no training.
vgg19 random VGG19 randomly initialized, with no training.
vgg11_bn random VGG11-BatchNorm randomly initialized, with no training.
vgg13_bn random VGG13-BatchNorm randomly initialized, with no training.
vgg16_bn random VGG16-BatchNorm randomly initialized, with no training.
vgg19_bn random VGG19-BatchNorm randomly initialized, with no training.
resnet18 random ResNet18 randomly initialized, with no training.
resnet34 random ResNet34 randomly initialized, with no training.
resnet50 random ResNet50 randomly initialized, with no training.
resnet101 random ResNet101 randomly initialized, with no training.
resnet152 random ResNet152 randomly initialized, with no training.
squeezenet1_0 random SqueezeNet1.0 randomly initialized, with no training.
squeezenet1_1 random SqueezeNet1.1 randomly initialized, with no training.
densenet121 random DenseNet121 randomly initialized, with no training.
densenet161 random DenseNet161 randomly initialized, with no training.
densenet169 random DenseNet169 randomly initialized, with no training.
densenet201 random DenseNet201 randomly initialized, with no training.
googlenet random GoogleNet randomly initialized, with no training.
shufflenet_v2_x0_5 random ShuffleNet-V2-x0.5 randomly initialized, with no training.
shufflenet_v2_x1_0 random ShuffleNet-V2-x1.0 randomly initialized, with no training.
mobilenet_v2 random MobileNet-V2 randomly initialized, with no training.
resnext50_32x4d random ResNext50-32x4D randomly initialized, with no training.
resnext101_32x8d random ResNext101-32x8D randomly initialized, with no training.
wide_resnet50_2 random Wide-ResNet50 randomly initialized, with no training.
wide_resnet101_2 random Wide-ResNet101 randomly initialized, with no training.
mnasnet0_5 random MNASNet0.5 randomly initialized, with no training.
mnasnet1_0 random MNASNet1.0 randomly initialized, with no training.
autoencoding taskonomy Image compression and decompression.
class_object taskonomy 1000-way object classification (via knowledge distillation from ImageNet).
class_scene taskonomy Scene Classification (via knowledge distillation from MIT Places).
curvature taskonomy Magnitude of 3D principal curvatures.
denoising taskonomy Uncorrupted version of corrupted image.
depth_euclidean taskonomy Depth estimation (Euclidean distance from the camera).
depth_zbuffer taskonomy Depth estimation (z-buffer depth).
edge_occlusion taskonomy Occlusion edges (edges that depend on the 3D structure of the scene).
edge_texture taskonomy Edges computed from RGB only (texture edges).
egomotion taskonomy Odometry (camera poses) given three input images.
fixated_pose taskonomy Relative camera pose with matching optical centers.
inpainting taskonomy Filling in masked center of image.
jigsaw taskonomy Putting scrambled image pieces back together.
keypoints2d taskonomy Keypoint estimation from RGB-only (texture features).
keypoints3d taskonomy 3D Keypoint estimation from underlying scene 3D.
nonfixated_pose taskonomy Relative camera pose with distinct optical centers.
normal taskonomy Pixel-wise surface normals.
point_matching taskonomy Classifying if centers of two images match or not.
reshading taskonomy Reshading with new lighting placed at camera location.
room_layout taskonomy Orientation and aspect ratio of cubic room layout.
segment_semantic taskonomy Pixel-wise semantic labeling (via knowledge distillation from MS COCO).
segment_unsup25d taskonomy Segmentation (graph cut approximation) on RGB-D-Normals-Curvature image.
segment_unsup2d taskonomy Segmentation (graph cut approximation) on RGB.
vanishing_point taskonomy Three Manhattan-world vanishing points.
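As a sketch of how the Torchvision half of this battery can be instantiated (the model list here is abbreviated; the `pretrained` flag toggles trained versus randomly initialized weights):

```python
import torchvision.models as models

# abbreviated list; the full battery covers all 30 architectures above
model_names = ['alexnet', 'vgg16', 'resnet50', 'densenet121',
               'googlenet', 'mobilenet_v2']

zoo = {}
for name in model_names:
    constructor = getattr(models, name)
    zoo[(name, 'imagenet')] = constructor(pretrained=True).eval()
    zoo[(name, 'random')] = constructor(pretrained=False).eval()
```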


We extract the responses of each layer of each network to the same stimulus set used in the fMRI, and perform the same operation as before to extract the within- and across-category similarities per layer.
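One way to implement this layer-wise extraction is with PyTorch forward hooks; the sketch below is hypothetical (the helper name get_layer_activations is ours) and records the flattened output of every leaf module:

```python
import torch

def get_layer_activations(model, images):
    """Collect each leaf module's response to a batch of images,
    flattened to (batch, features), in forward-pass order."""
    activations, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                activations[name] = output.detach().flatten(start_dim=1)
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(images)
    for h in hooks:
        h.remove()  # clean up so hooks don't accumulate across calls
    return activations
```

As an example, let’s visualize the result of this process for AlexNet trained on ImageNet: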

The 18 facets in this plot are the 18 layers of AlexNet (in order from first to last). The rest of the plot elements are the same as we saw with the fMRI data above.

Notice the pattern here: in the earlier layers, invariance is totally absent: within-category similarities are as low as (if not lower than) across-category similarities. But around the 4th or 5th convolutional layer, we start to see some separation – the emergence of orientation invariance. It’s weak at first, but by the second linear layer, all rotations (except the selfsame 0 degrees of rotation) are more or less equally similar.

We can summarise this development in a few ways. We could, for example, average the distance between the within- and across-category curves at each layer. The greater the average (positive) distance, the more invariant the representations. Another summary we could take, though, is the similarity of the curves at any given layer to the curves in the various brain areas we analysed above. Because we have a sense of where orientation invariance emerges in the brain, knowing how similar these model curves are to the curves in different brain areas should give us a gestalt of where orientation invariance emerges in our models.
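The first of these summaries is simple to compute from the parsed similarities; a hypothetical sketch:

```python
import numpy as np

def invariance_index(within, across, n_rotations=5):
    """Mean gap between the within- and across-category similarity
    curves over the nonzero rotations; larger positive values mean
    more orientation-invariant representations."""
    gaps = [np.mean(within[r]) - np.mean(across[r])
            for r in range(1, n_rotations)]  # skip the selfsame 0 degrees
    return float(np.mean(gaps))
```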

To get the second, brain-referenced summary, we aggregate together the within- and across-category similarities (per model per layer) and use a second Pearson distance metric to calculate their similarity to each brain area.
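A sketch of this second-order comparison, reusing the parsed curves from above (curve_vector and layer_to_roi_similarity are hypothetical names):

```python
import numpy as np

def curve_vector(within, across, n_rotations=5):
    """Concatenate the mean within- and across-category similarity at
    each rotation into a single profile vector for one layer or ROI."""
    return np.array([np.mean(within[r]) for r in range(n_rotations)] +
                    [np.mean(across[r]) for r in range(n_rotations)])

def layer_to_roi_similarity(layer_profile, roi_profile):
    """Second-order Pearson similarity (1 minus Pearson distance)
    between a model layer's profile and a brain ROI's profile."""
    return np.corrcoef(layer_profile, roi_profile)[0, 1]
```

Once this is done, we get plots that look something like this: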

This looks messy, I know, but what it represents is how similar each model layer is to each brain area. On the x axis is the index of the model layer (what you might also call the depth of the layer). On the y axis is the Pearson distance metric we apply to the aggregated similarities. Error bars are 95% confidence intervals calculated across the 8 stimulus types. In the next few steps, we’ll clean this up a bit, but keep your eye for now on the (statistically significant) gap that emerges around layer 12 between the brain areas we know to show little to no invariance (EVC / OPC) and the brain areas we know to show strong invariance (LOC / OTC).

Let’s first zoom in on EVC and LOC, the key locations in the fMRI data:

Now, let’s see if we can smooth out some of the roughness in these trajectories; we don’t care much about the smaller peaks and valleys, since they typically represent various suboperations within a full block of computation. To smooth these trajectories in a principled way, we’ll use a generalized additive model (GAM): an extension of the linear model that fits smooth functions of the predictors under a complexity penalty, keeping a nonlinearity only if the data strongly suggest it’s real.
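A minimal sketch of this smoothing step, assuming the pygam package (one GAM implementation among several; smooth_trajectory is our own name):

```python
import numpy as np
from pygam import LinearGAM, s  # pip install pygam

def smooth_trajectory(layer_indices, similarities):
    """Fit a penalized spline to one layer-by-layer similarity
    trajectory; grid search picks the smoothing penalty, so wiggles
    survive only if the data support them."""
    X = np.asarray(layer_indices).reshape(-1, 1)
    y = np.asarray(similarities)
    gam = LinearGAM(s(0)).gridsearch(X, y)
    return gam.predict(X)
```

When we apply our GAM to these trajectories, we get the following: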

With some of the chaos tamed, what do these plots show us? For AlexNet trained on ImageNet, at least, they show us that earlier and intermediate layers are more similar to EVC – with no invariance – and later layers are more similar to LOC – with strong invariance. Pivotally, the switch (what we henceforth call a ‘crossover’) occurs around layer 12, just as we saw in the layer-wise plots above.
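To locate that crossover programmatically, one hypothetical helper over the two smoothed trajectories:

```python
import numpy as np

def crossover_layer(evc_curve, loc_curve):
    """First layer index at which the smoothed LOC similarity
    overtakes the smoothed EVC similarity (None if no crossing)."""
    diff = np.asarray(loc_curve) - np.asarray(evc_curve)
    crossings = np.where((diff[:-1] <= 0) & (diff[1:] > 0))[0]
    return int(crossings[0] + 1) if crossings.size else None
```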

We can, of course, repeat all the steps above for every model in our repertoire…

Major Takeaways

Across the tabs below (organized by training category), you can visualize the trajectories for every model in our survey.

ImageNet

Random

Taskonomy

While we’re still analyzing the details of these trajectories, some larger trends become evident when we zoom out as far as possible.

Overall, our results provide a preliminary signature of human brain-like orientation invariance in deep neural networks. Future psychophysical work could help to clarify whether the computations undergirding this signature are (at a more granular level) mechanistically and algorithmically comparable to what we see in the brain, but for now, the main takeaway is that the signature (so far as we’ve computed it) does indeed exist.

For questions, please contact Colin Conwell: conwell[at]g[dot]harvard[dot]edu

For the homepage of this and other projects, visit: https://colinconwell.github.io/

Citation

For attribution, please cite this work as

Conwell & Alvarez (2020, Jan. 5). Deep Orientation. Retrieved from https://colinconwell.github.io/pages/orientation.html

BibTeX citation

@misc{conwell2020deep,
  author = {Conwell, Colin and Alvarez, George},
  title = {Deep Orientation},
  url = {https://colinconwell.github.io/pages/orientation.html},
  year = {2020}
}