Brain-Like Orientation Invariance in Deep Nets
The ability to recognize objects despite substantial variation in the position, size, lighting, and orientation in which they appear is a defining characteristic of biological visual intelligence, often referred to simply as invariance or invariant representation. Orientation invariance – defined here as a representation of a stimulus that does not significantly vary as the stimulus is viewed from different angles – emerges abruptly in the human visual cortical processing cascade (Morgan & Alvarez, VSS 2014). In V3, for example, we observe little to no invariance: the difference between the neural patterns elicited by a given stimulus and the same stimulus rotated by 90 degrees is as large as the difference between the neural patterns elicited by two different stimuli. In LOC, on the other hand, we observe strong invariance, with no statistically significant differences in the neural activity elicited by the same stimulus across any of the rotations.
Deep neural network models (computer vision algorithms defined by distributed computations in depth) have previously been shown to capture the representational geometry of neural responses to different objects, but it remains unclear whether they show the same kinds of invariance we observe in different parts of the human visual system. The question we ask, then, is: where, if anywhere, does orientation invariance emerge in deep neural networks (DNNs)?
Participants in the fMRI study were shown a series of 8 stimuli at 5 different degrees of rotation (0, 45, 90, 135, and 180 degrees) – examples of which you can explore in the carousel below.
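For readers who want to reproduce the general setup, here is a minimal, hypothetical sketch of how such a rotated stimulus set could be generated with PIL (the actual stimulus images from the fMRI study are assumed, not provided here):

```python
# Hypothetical sketch: generate one copy of a stimulus at each rotation.
# The actual fMRI stimulus images are an assumption here, not provided.
from PIL import Image

ROTATIONS = [0, 45, 90, 135, 180]

def make_rotated_set(image_path):
    """Return {angle: rotated image} for a single stimulus."""
    base = Image.open(image_path).convert('RGB')
    # expand=False keeps the canvas size constant across rotations
    return {angle: base.rotate(angle, expand=False) for angle in ROTATIONS}
```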
We organize our neural responses by voxel into large-scale regions of interest (ROIs) across the dorsal and ventral streams of visual cortex, including early visual cortex (EVC), lateral occipital complex (LOC), occipitotemporal cortex (OTC), and occipitoparietal cortex (OPC). The responses in each ROI are then used to compute representational dissimilarity matrices (RDMs) with a Pearson distance metric.
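As a concrete (if minimal) sketch of this step, assuming a `(n_conditions, n_voxels)` response matrix for each ROI, where the 40 conditions are the 8 stimuli crossed with the 5 rotations:

```python
import numpy as np

def pearson_rdm(responses):
    """RDM from a (n_conditions, n_voxels) response matrix.

    Each cell is the Pearson distance (1 - r) between the voxel
    patterns evoked by two conditions; here, the 40 conditions are
    the 8 stimuli crossed with the 5 rotations.
    """
    # np.corrcoef treats each row as a variable (a condition's pattern)
    return 1.0 - np.corrcoef(responses)
```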
The key to assessing orientation invariance here is properly parsing our representational (dis)similarities (the cells of the RDM) into two broad categories: the similarity of each stimulus to the same stimulus at different rotations (‘within category’ similarity) and to different stimuli at the same rotation (‘across category’ similarity). (A third similarity – the similarity of a stimulus to itself – can be used to compute a “reliability ceiling”, though this requires some variance across repeated measures.)
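A sketch of this parsing step is below. The exact pairing convention (a reference stimulus at 0 degrees, compared against itself or against other stimuli at each rotation) and the stimulus-major ordering of the RDM are our assumptions:

```python
import numpy as np

def parse_similarities(rdm, n_stimuli=8, rotations=(0, 45, 90, 135, 180)):
    """Split RDM cells into within- and across-category similarities.

    Assumes a stimulus-major ordering: the row/column index of
    (stimulus i, rotation j) is i * len(rotations) + j. Similarity
    is taken as 1 - Pearson distance.
    """
    n_rot = len(rotations)
    within, across = {}, {}
    for j, angle in enumerate(rotations):
        w, a = [], []
        for i in range(n_stimuli):
            ref = i * n_rot  # stimulus i at 0 degrees of rotation
            w.append(1 - rdm[ref, i * n_rot + j])  # same stimulus, rotated
            a.extend(1 - rdm[ref, k * n_rot + j]   # other stimuli, rotated
                     for k in range(n_stimuli) if k != i)
        within[angle], across[angle] = np.mean(w), np.mean(a)
    return within, across
```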
When we’ve properly parsed our RDMs for each brain area, we obtain the following plot:
On the x-axis here are the differences in orientation between the reference stimulus and the stimulus being compared. On the y-axis is the similarity score. Error bars are 95% confidence intervals calculated across the 8 different stimulus types.
The tell-tale sign of invariance in this plot is the overlap (or absence thereof) in the error bars of the within- and across-category similarities at 90 degrees of rotation. If an overlap is present, it means invariance is absent, and vice versa. Notice that early visual cortex (EVC) shows little to no invariance while lateral occipital complex (LOC) and occipitotemporal cortex (OTC) do.
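To make that criterion concrete, here is a minimal sketch of the overlap test, assuming per-stimulus similarity values (8 per condition) at 90 degrees of rotation:

```python
import numpy as np
from scipy import stats

def ci95(values):
    """(lower, upper) bounds of a 95% confidence interval across stimuli."""
    values = np.asarray(values)
    half = stats.sem(values) * stats.t.ppf(0.975, len(values) - 1)
    return values.mean() - half, values.mean() + half

def shows_invariance(within_at_90, across_at_90):
    """Invariance: the within-category CI sits wholly above the across CI."""
    within_lower, _ = ci95(within_at_90)
    _, across_upper = ci95(across_at_90)
    return within_lower > across_upper
```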
To survey the possibility space of orientation-invariant representations in DNNs, we gather a large battery of models (60 trained and randomly initialized models with different architectures from the Torchvision model zoo, and 24 models with the same architecture trained on different tasks from the Taskonomy project). A table of all models we use is shown below, followed by a sketch of how such a battery might be loaded:
model | train_type | description |
---|---|---|
alexnet | imagenet | AlexNet trained on image classification with the ImageNet dataset. |
vgg11 | imagenet | VGG11 trained on image classification with the ImageNet dataset. |
vgg13 | imagenet | VGG13 trained on image classification with the ImageNet dataset. |
vgg16 | imagenet | VGG16 trained on image classification with the ImageNet dataset. |
vgg19 | imagenet | VGG19 trained on image classification with the ImageNet dataset. |
vgg11_bn | imagenet | VGG11-BatchNorm trained on image classification with the ImageNet dataset. |
vgg13_bn | imagenet | VGG13-BatchNorm trained on image classification with the ImageNet dataset. |
vgg16_bn | imagenet | VGG16-BatchNorm trained on image classification with the ImageNet dataset. |
vgg19_bn | imagenet | VGG19-BatchNorm trained on image classification with the ImageNet dataset. |
resnet18 | imagenet | ResNet18 trained on image classification with the ImageNet dataset. |
resnet34 | imagenet | ResNet34 trained on image classification with the ImageNet dataset. |
resnet50 | imagenet | ResNet50 trained on image classification with the ImageNet dataset. |
resnet101 | imagenet | ResNet101 trained on image classification with the ImageNet dataset. |
resnet152 | imagenet | ResNet152 trained on image classification with the ImageNet dataset. |
squeezenet1_0 | imagenet | SqueezeNet1.0 trained on image classification with the ImageNet dataset. |
squeezenet1_1 | imagenet | SqueezeNet1.1 trained on image classification with the ImageNet dataset. |
densenet121 | imagenet | DenseNet121 trained on image classification with the ImageNet dataset. |
densenet161 | imagenet | DenseNet161 trained on image classification with the ImageNet dataset. |
densenet169 | imagenet | DenseNet169 trained on image classification with the ImageNet dataset. |
densenet201 | imagenet | DenseNet201 trained on image classification with the ImageNet dataset. |
googlenet | imagenet | GoogleNet trained on image classification with the ImageNet dataset. |
shufflenet_v2_x0_5 | imagenet | ShuffleNet-V2-x0.5 trained on image classification with the ImageNet dataset. |
shufflenet_v2_x1_0 | imagenet | ShuffleNet-V2-x1.0 trained on image classification with the ImageNet dataset. |
mobilenet_v2 | imagenet | MobileNet-V2 trained on image classification with the ImageNet dataset. |
resnext50_32x4d | imagenet | ResNext50-32x4D trained on image classification with the ImageNet dataset. |
resnext101_32x8d | imagenet | ResNext101-32x8D trained on image classification with the ImageNet dataset. |
wide_resnet50_2 | imagenet | Wide-ResNet50 trained on image classification with the ImageNet dataset. |
wide_resnet101_2 | imagenet | Wide-ResNet101 trained on image classification with the ImageNet dataset. |
mnasnet0_5 | imagenet | MNASNet0.5 trained on image classification with the ImageNet dataset. |
mnasnet1_0 | imagenet | MNASNet1.0 trained on image classification with the ImageNet dataset. |
alexnet | random | AlexNet randomly initialized, with no training. |
vgg11 | random | VGG11 randomly initialized, with no training. |
vgg13 | random | VGG13 randomly initialized, with no training. |
vgg16 | random | VGG16 randomly initialized, with no training. |
vgg19 | random | VGG19 randomly initialized, with no training. |
vgg11_bn | random | VGG11-BatchNorm randomly initialized, with no training. |
vgg13_bn | random | VGG13-BatchNorm randomly initialized, with no training. |
vgg16_bn | random | VGG16-BatchNorm randomly initialized, with no training. |
vgg19_bn | random | VGG19-BatchNorm randomly initialized, with no training. |
resnet18 | random | ResNet18 randomly initialized, with no training. |
resnet34 | random | ResNet34 randomly initialized, with no training. |
resnet50 | random | ResNet50 randomly initialized, with no training. |
resnet101 | random | ResNet101 randomly initialized, with no training. |
resnet152 | random | ResNet152 randomly initialized, with no training. |
squeezenet1_0 | random | SqueezeNet1.0 randomly initialized, with no training. |
squeezenet1_1 | random | SqueezeNet1.1 randomly initialized, with no training. |
densenet121 | random | DenseNet121 randomly initialized, with no training. |
densenet161 | random | DenseNet161 randomly initialized, with no training. |
densenet169 | random | DenseNet169 randomly initialized, with no training. |
densenet201 | random | DenseNet201 randomly initialized, with no training. |
googlenet | random | GoogleNet randomly initialized, with no training. |
shufflenet_v2_x0_5 | random | ShuffleNet-V2-x0.5 randomly initialized, with no training. |
shufflenet_v2_x1_0 | random | ShuffleNet-V2-x1.0 randomly initialized, with no training. |
mobilenet_v2 | random | MobileNet-V2 randomly initialized, with no training. |
resnext50_32x4d | random | ResNext50-32x4D randomly initialized, with no training. |
resnext101_32x8d | random | ResNext101-32x8D randomly initialized, with no training. |
wide_resnet50_2 | random | Wide-ResNet50 randomly initialized, with no training. |
wide_resnet101_2 | random | Wide-ResNet101 randomly initialized, with no training. |
mnasnet0_5 | random | MNASNet0.5 randomly initialized, with no training. |
mnasnet1_0 | random | MNASNet1.0 randomly initialized, with no training. |
autoencoding | taskonomy | Image compression and decompression.
class_object | taskonomy | 1000-way object classification (via knowledge distillation from ImageNet). |
class_scene | taskonomy | Scene classification (via knowledge distillation from MIT Places).
curvature | taskonomy | Magnitude of 3D principal curvatures.
denoising | taskonomy | Recovering the uncorrupted version of a corrupted image.
depth_euclidean | taskonomy | Euclidean depth estimation (distance from the camera).
depth_zbuffer | taskonomy | Z-buffer depth estimation.
edge_occlusion | taskonomy | Occlusion edges, which depend on scene geometry.
edge_texture | taskonomy | Edges computed from RGB only (texture edges). |
egomotion | taskonomy | Odometry (camera poses) given three input images. |
fixated_pose | taskonomy | Relative camera pose with matching optical centers. |
inpainting | taskonomy | Filling in masked center of image. |
jigsaw | taskonomy | Putting scrambled image pieces back together. |
keypoints2d | taskonomy | Keypoint estimation from RGB only (texture features).
keypoints3d | taskonomy | 3D keypoint estimation from the underlying scene geometry.
nonfixated_pose | taskonomy | Relative camera pose with distinct optical centers. |
normal | taskonomy | Pixel-wise surface normals. |
point_matching | taskonomy | Classifying if centers of two images match or not. |
reshading | taskonomy | Reshading with new lighting placed at camera location. |
room_layout | taskonomy | Orientation and aspect ratio of cubic room layout. |
segment_semantic | taskonomy | Pixel-wise semantic labeling (via knowledge distillation from MS COCO). |
segment_unsup25d | taskonomy | Segmentation (graph cut approximation) on RGB-D-Normals-Curvature image. |
segment_unsup2d | taskonomy | Segmentation (graph cut approximation) on RGB. |
vanishing_point | taskonomy | Three Manhattan-world vanishing points. |
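Loading the Torchvision half of this battery is straightforward; a minimal sketch (with an abridged list of architectures, and using the `pretrained=` argument from the Torchvision API of the time) might look like this:

```python
import torchvision.models as models

ARCHITECTURES = ['alexnet', 'vgg16', 'resnet50', 'densenet121',
                 'googlenet', 'mobilenet_v2']  # abridged; see table above

def load_battery(architectures=ARCHITECTURES):
    """Load each architecture twice: ImageNet-trained and randomly initialized."""
    battery = {}
    for name in architectures:
        constructor = getattr(models, name)
        # `pretrained=` is the torchvision API circa 2020;
        # newer releases use the `weights=` argument instead
        battery[(name, 'imagenet')] = constructor(pretrained=True).eval()
        battery[(name, 'random')] = constructor(pretrained=False).eval()
    return battery
```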
We extract the responses of each layer of each network to the same stimulus set used in the fMRI experiment, and perform the same operation as before to extract the within- and across-category similarities per layer.
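One way to implement this extraction (a sketch, not the original code; the use of forward hooks on leaf modules and the flattening of each output are our choices) is:

```python
import torch

def get_layer_activations(model, stimuli):
    """Record each leaf module's response to a preprocessed stimulus batch.

    `stimuli` is a (n_images, 3, H, W) tensor. Returns a dict mapping
    layer name to a (n_images, n_features) tensor, which can then be
    passed through pearson_rdm and parse_similarities from above.
    """
    activations, hooks = {}, []

    def save(name):
        def hook(module, inputs, output):
            activations[name] = output.flatten(start_dim=1).detach()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(save(name)))

    with torch.no_grad():
        model(stimuli)
    for handle in hooks:
        handle.remove()
    return activations
```

As an example, let’s visualize the result of this process for AlexNet trained on ImageNet: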
The 18 facets in this plot are the 18 layers of AlexNet (in order from first to last). The rest of the plot elements are the same as we saw with the fMRI data above.
Notice the pattern here: in the earlier layers, invariance is totally absent; within-category similarities are as low as (if not lower than) across-category similarities. But around the 4th or 5th convolutional layer, we start to see some separation – the emergence of orientation invariance. It’s weak at first, but by the second linear layer, all rotations (except the selfsame 0 degrees of rotation) are more or less equally similar.
We can summarize this development in a few ways. We could, for example, average (for each layer) the difference between the within- and across-category curves; the greater the average (positive) difference, the more invariant the representations. Another summary we could take, though, is the similarity of these curves at any given layer to those of the various brain areas we analyzed above. Because we have a sense of where orientation invariance emerges in the brain, knowing how similar these model curves are to the curves in different brain areas should give us a gestalt of where orientation invariance emerges in our models.
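A minimal sketch of the first summary (skipping the trivial 0-degree point is our choice):

```python
import numpy as np

def mean_separation(within, across):
    """Average gap between the within- and across-category curves.

    `within` and `across` map rotation angle to mean similarity, as
    returned by parse_similarities. Larger positive values indicate
    more orientation-invariant representations.
    """
    angles = [a for a in within if a != 0]  # skip the trivial 0-degree point
    return float(np.mean([within[a] - across[a] for a in angles]))
```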
To get this summary, we aggregate the within- and across-category similarities (per model, per layer) and compute a second-order Pearson similarity between these aggregated curves and the corresponding curves for each brain area. Once this is done, we get plots that look something like this:
This looks messy, I know, but what it represents is how similar each model layer is to each brain area. On the x-axis is the index of the model layer (what you might also call the depth of the layer). On the y-axis is the second-order Pearson similarity computed from the aggregated within- and across-category similarities. Error bars are 95% confidence intervals calculated across the 8 stimulus types. In the next few steps, we’ll clean this up a bit, but keep your eye for now on the (statistically significant) gap that emerges around layer 12 between the brain areas we know to show little to no invariance (EVC / OPC) and the brain areas we know to show strong invariance (LOC / OTC).
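For concreteness, a sketch of the second-order comparison just described (the exact concatenation of the within- and across-category values into a single profile vector is our assumption):

```python
import numpy as np

def brain_similarity(model_profile, brain_profile):
    """Second-order Pearson similarity between a model layer and a brain area.

    Each profile is the concatenated vector of within- and across-
    category similarities across rotations for that layer or area.
    """
    x, y = np.asarray(model_profile), np.asarray(brain_profile)
    return np.corrcoef(x, y)[0, 1]
```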
Let’s first zoom in on EVC and LOC, the key locations in the fMRI data:
Now, let’s see if we can smooth out some of the roughness in these trajectories; the smaller peaks and valleys don’t concern us much, since they typically reflect various suboperations within a full block of computation. To smooth these trajectories in a principled way, we’ll use a generalized additive model (GAM): an extension of linear regression that fits penalized smooth functions and keeps a nonlinearity only if the data strongly suggest it’s real.
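A minimal sketch of this smoothing step using the pygam library (the choice of pygam and of a grid search over the smoothing penalty are ours):

```python
import numpy as np
from pygam import LinearGAM, s

def smooth_trajectory(layer_indices, similarities):
    """Fit a penalized spline to a layer-by-layer similarity trajectory.

    gridsearch() selects the smoothing penalty, so wiggles survive
    only where the data strongly support them.
    """
    X = np.asarray(layer_indices, dtype=float).reshape(-1, 1)
    y = np.asarray(similarities, dtype=float)
    gam = LinearGAM(s(0)).gridsearch(X, y)
    return gam.predict(X)
```

When we apply our GAM to these trajectories, we get the following: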
With some of the chaos tamed, what do these plots show us? For AlexNet trained on ImageNet, at least, they show that earlier and intermediate layers are more similar to EVC – with no invariance – and later layers are more similar to LOC – with strong invariance. Pivotally, the switch (what we henceforth call a ‘crossover’) occurs around layer 12, just as we saw in the layer-wise plots above.
We can, of course, repeat all the steps above for every model in our repertoire…
Across the tabs below (organized by training category), you can visualize the trajectories for every model in our survey.
While we’re still analyzing the details of these trajectories, zooming out as far as possible makes some larger trends evident:
Overall, our results provide a preliminary signature of human brain-like orientation invariance in deep neural networks. Future psychophysical work could help to clarify whether the computations undergirding this signature are (at a more granular level) mechanistically and algorithmically comparable to what we see in the brain, but for now, the main takeaway is that the signature (so far as we’ve computed it) does indeed exist.
For questions, please contact Colin Conwell: conwell[at]g[dot]harvard[dot]edu
For the homepage of this and other projects, visit: https://colinconwell.github.io/
For attribution, please cite this work as
Conwell & Alvarez (2020, Jan. 5). Deep Orientation. Retrieved from https://colinconwell.github.io/pages/orientation.html
BibTeX citation
@misc{conwell2020deep,
  author = {Conwell, Colin and Alvarez, George},
  title = {Deep Orientation},
  url = {https://colinconwell.github.io/pages/orientation.html},
  year = {2020}
}