Brain-Like Orientation Invariance in Deep Nets
The ability to recognize objects despite substantial variation in the position, size, lighting, and orientation in which they appear is a defining characteristic of biological visual intelligence, often referred to simply as invariance or invariant representation. Orientation invariance – defined here as a representation of a stimulus that does not significantly vary as the stimulus is viewed from different angles – emerges abruptly in the human visual cortical processing cascade (Morgan & Alvarez, VSS 2014). In V3, for example, we observe little to no invariance: the difference between the neural patterns elicited by a given stimulus and the same stimulus rotated by 90 degrees is as large as the difference between the neural patterns elicited by two different stimuli. In LOC, on the other hand, we observe strong invariance, with no statistically significant differences in the neural activity elicited by the same stimulus across any of the rotations.
Deep neural network models (computer vision algorithms defined by distributed computations in depth) have previously been shown to capture the representational geometry of neural responses to different objects, but it remains unclear whether they show the same kinds of invariance we observe in different parts of the human visual system. The question we ask, then, is: where, if anywhere, does orientation invariance emerge in deep neural networks (DNNs)?
Participants in the fMRI study were shown a series of 8 stimuli at 5 different degrees of rotation (0, 45, 90, 135, and 180 degrees) – examples of which you can explore in the carousel below.
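For readers who want to reproduce the general setup, here is a minimal, hypothetical sketch of how such a rotated stimulus set could be generated with PIL (the actual stimulus images from the fMRI study are assumed, not provided here):

```python
# Hypothetical sketch: generate one copy of a stimulus at each rotation.
# The actual fMRI stimulus images are an assumption here, not provided.
from PIL import Image

ROTATIONS = [0, 45, 90, 135, 180]

def make_rotated_set(image_path):
    """Return {angle: rotated image} for a single stimulus."""
    base = Image.open(image_path).convert('RGB')
    # expand=False keeps the canvas size constant across rotations
    return {angle: base.rotate(angle, expand=False) for angle in ROTATIONS}
```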
We organize our neural responses by voxel into large-scale regions of interest (ROIs) across the dorsal and ventral streams of visual cortex, including early visual cortex (EVC), lateral occipital complex (LOC), occipitotemporal cortex (OTC), and occipitoparietal cortex (OPC). The responses in each ROI are then used to compute representational dissimilarity matrices (RDMs) with a Pearson distance metric.
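As a concrete (if minimal) sketch of this step, assuming a `(n_conditions, n_voxels)` response matrix for each ROI, where the 40 conditions are the 8 stimuli crossed with the 5 rotations:

```python
import numpy as np

def pearson_rdm(responses):
    """RDM from a (n_conditions, n_voxels) response matrix.

    Each cell is the Pearson distance (1 - r) between the voxel
    patterns evoked by two conditions; here, the 40 conditions are
    the 8 stimuli crossed with the 5 rotations.
    """
    # np.corrcoef treats each row as a variable (a condition's pattern)
    return 1.0 - np.corrcoef(responses)
```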
The key to assessing orientation invariance here is properly parsing our representational (dis)similarities (the cells of the RDM) into two broad categories: the similarity of each stimulus to the same stimulus at different rotations (‘within category’ similarity) and to different stimuli at the same rotation (‘across category’ similarity). (A third similarity – the similarity of a stimulus to itself – can be used to compute a “reliability ceiling”, though this requires some variance across repeated measures.)
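A sketch of this parsing step is below. The exact pairing convention (a reference stimulus at 0 degrees, compared against itself or against other stimuli at each rotation) and the stimulus-major ordering of the RDM are our assumptions:

```python
import numpy as np

def parse_similarities(rdm, n_stimuli=8, rotations=(0, 45, 90, 135, 180)):
    """Split RDM cells into within- and across-category similarities.

    Assumes a stimulus-major ordering: the row/column index of
    (stimulus i, rotation j) is i * len(rotations) + j. Similarity
    is taken as 1 - Pearson distance.
    """
    n_rot = len(rotations)
    within, across = {}, {}
    for j, angle in enumerate(rotations):
        w, a = [], []
        for i in range(n_stimuli):
            ref = i * n_rot  # stimulus i at 0 degrees of rotation
            w.append(1 - rdm[ref, i * n_rot + j])  # same stimulus, rotated
            a.extend(1 - rdm[ref, k * n_rot + j]   # other stimuli, rotated
                     for k in range(n_stimuli) if k != i)
        within[angle], across[angle] = np.mean(w), np.mean(a)
    return within, across
```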
When we’ve properly parsed our RDMs for each brain area, we obtain the following plot:
On the x-axis here are the differences in orientation between the reference stimulus and the stimulus being compared. On the y-axis is the similarity score. Error bars are 95% confidence intervals calculated across the 8 different stimulus types.
The tell-tale sign of invariance in this plot is the overlap (or absence thereof) in the error bars of the within- and across-category similarities at 90 degrees of rotation. If an overlap is present, it means invariance is absent, and vice versa. Notice that early visual cortex (EVC) shows little to no invariance while lateral occipital complex (LOC) and occipitotemporal cortex (OTC) do.
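To make that criterion concrete, here is a minimal sketch of the overlap test, assuming per-stimulus similarity values (8 per condition) at 90 degrees of rotation:

```python
import numpy as np
from scipy import stats

def ci95(values):
    """(lower, upper) bounds of a 95% confidence interval across stimuli."""
    values = np.asarray(values)
    half = stats.sem(values) * stats.t.ppf(0.975, len(values) - 1)
    return values.mean() - half, values.mean() + half

def shows_invariance(within_at_90, across_at_90):
    """Invariance: the within-category CI sits wholly above the across CI."""
    within_lower, _ = ci95(within_at_90)
    _, across_upper = ci95(across_at_90)
    return within_lower > across_upper
```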
To survey the possibility space of orientation-invariant representations in DNNs, we gather a large battery of models (60 trained and randomly initialized models with different architectures from the Torchvision model zoo, and 24 models with the same architecture trained on different tasks from the Taskonomy project). A table of all models we use is shown below, followed by a sketch of how such a battery might be loaded:
model | train_type | description |
---|---|---|
alexnet | imagenet | AlexNet trained on image classification with the ImageNet dataset. |
vgg11 | imagenet | VGG11 trained on image classification with the ImageNet dataset. |
vgg13 | imagenet | VGG13 trained on image classification with the ImageNet dataset. |
vgg16 | imagenet | VGG16 trained on image classification with the ImageNet dataset. |
vgg19 | imagenet | VGG19 trained on image classification with the ImageNet dataset. |
vgg11_bn | imagenet | VGG11-BatchNorm trained on image classification with the ImageNet dataset. |
vgg13_bn | imagenet | VGG13-BatchNorm trained on image classification with the ImageNet dataset. |
vgg16_bn | imagenet | VGG16-BatchNorm trained on image classification with the ImageNet dataset. |
vgg19_bn | imagenet | VGG19-BatchNorm trained on image classification with the ImageNet dataset. |
resnet18 | imagenet | ResNet18 trained on image classification with the ImageNet dataset. |
resnet34 | imagenet | ResNet34 trained on image classification with the ImageNet dataset. |
resnet50 | imagenet | ResNet50 trained on image classification with the ImageNet dataset. |
resnet101 | imagenet | ResNet101 trained on image classification with the ImageNet dataset. |
resnet152 | imagenet | ResNet152 trained on image classification with the ImageNet dataset. |
squeezenet1_0 | imagenet | SqueezeNet1.0 trained on image classification with the ImageNet dataset. |
squeezenet1_1 | imagenet | SqueezeNet1.1 trained on image classification with the ImageNet dataset. |
densenet121 | imagenet | DenseNet121 trained on image classification with the ImageNet dataset. |
densenet161 | imagenet | DenseNet161 trained on image classification with the ImageNet dataset. |
densenet169 | imagenet | DenseNet169 trained on image classification with the ImageNet dataset. |
densenet201 | imagenet | DenseNet201 trained on image classification with the ImageNet dataset. |
googlenet | imagenet | GoogleNet trained on image classification with the ImageNet dataset. |
shufflenet_v2_x0_5 | imagenet | ShuffleNet-V2-x0.5 trained on image classification with the ImageNet dataset. |
shufflenet_v2_x1_0 | imagenet | ShuffleNet-V2-x1.0 trained on image classification with the ImageNet dataset. |
mobilenet_v2 | imagenet | MobileNet-V2 trained on image classification with the ImageNet dataset. |
resnext50_32x4d | imagenet | ResNext50-32x4D trained on image classification with the ImageNet dataset. |
resnext101_32x8d | imagenet | ResNext101-32x8D trained on image classification with the ImageNet dataset. |
wide_resnet50_2 | imagenet | Wide-ResNet50 trained on image classification with the ImageNet dataset. |
wide_resnet101_2 | imagenet | Wide-ResNet101 trained on image classification with the ImageNet dataset. |
mnasnet0_5 | imagenet | MNASNet0.5 trained on image classification with the ImageNet dataset. |
mnasnet1_0 | imagenet | MNASNet1.0 trained on image classification with the ImageNet dataset. |
alexnet | random | AlexNet randomly initialized, with no training. |
vgg11 | random | VGG11 randomly initialized, with no training. |
vgg13 | random | VGG13 randomly initialized, with no training. |
vgg16 | random | VGG16 randomly initialized, with no training. |
vgg19 | random | VGG19 randomly initialized, with no training. |
vgg11_bn | random | VGG11-BatchNorm randomly initialized, with no training. |
vgg13_bn | random | VGG13-BatchNorm randomly initialized, with no training. |
vgg16_bn | random | VGG16-BatchNorm randomly initialized, with no training. |
vgg19_bn | random | VGG19-BatchNorm randomly initialized, with no training. |
resnet18 | random | ResNet18 randomly initialized, with no training. |
resnet34 | random | ResNet34 randomly initialized, with no training. |
resnet50 | random | ResNet50 randomly initialized, with no training. |
resnet101 | random | ResNet101 randomly initialized, with no training. |
resnet152 | random | ResNet152 randomly initialized, with no training. |
squeezenet1_0 | random | SqueezeNet1.0 randomly initialized, with no training. |
squeezenet1_1 | random | SqueezeNet1.1 randomly initialized, with no training. |
densenet121 | random | DenseNet121 randomly initialized, with no training. |
densenet161 | random | DenseNet161 randomly initialized, with no training. |
densenet169 | random | DenseNet169 randomly initialized, with no training. |
densenet201 | random | DenseNet201 randomly initialized, with no training. |
googlenet | random | GoogleNet randomly initialized, with no training. |
shufflenet_v2_x0_5 | random | ShuffleNet-V2-x0.5 randomly initialized, with no training. |
shufflenet_v2_x1_0 | random | ShuffleNet-V2-x1.0 randomly initialized, with no training. |
mobilenet_v2 | random | MobileNet-V2 randomly initialized, with no training. |
resnext50_32x4d | random | ResNext50-32x4D randomly initialized, with no training. |
resnext101_32x8d | random | ResNext101-32x8D randomly initialized, with no training. |
wide_resnet50_2 | random | Wide-ResNet50 randomly initialized, with no training. |
wide_resnet101_2 | random | Wide-ResNet101 randomly initialized, with no training. |
mnasnet0_5 | random | MNASNet0.5 randomly initialized, with no training. |
mnasnet1_0 | random | MNASNet1.0 randomly initialized, with no training. |
autoencoding | taskonomy | Image compression and decompression.
class_object | taskonomy | 1000-way object classification (via knowledge distillation from ImageNet). |
class_scene | taskonomy | Scene classification (via knowledge distillation from MIT Places).
curvature | taskonomy | Magnitude of 3D principal curvatures.
denoising | taskonomy | Recovering the uncorrupted version of a corrupted image.
depth_euclidean | taskonomy | Euclidean depth estimation (distance from the camera).
depth_zbuffer | taskonomy | Z-buffer depth estimation.
edge_occlusion | taskonomy | Occlusion edges, which depend on scene geometry.
edge_texture | taskonomy | Edges computed from RGB only (texture edges). |
egomotion | taskonomy | Odometry (camera poses) given three input images. |
fixated_pose | taskonomy | Relative camera pose with matching optical centers. |
inpainting | taskonomy | Filling in masked center of image. |
jigsaw | taskonomy | Putting scrambled image pieces back together. |
keypoints2d | taskonomy | Keypoint estimation from RGB only (texture features).
keypoints3d | taskonomy | 3D keypoint estimation from the underlying scene geometry.
nonfixated_pose | taskonomy | Relative camera pose with distinct optical centers. |
normal | taskonomy | Pixel-wise surface normals. |
point_matching | taskonomy | Classifying if centers of two images match or not. |
reshading | taskonomy | Reshading with new lighting placed at camera location. |
room_layout | taskonomy | Orientation and aspect ratio of cubic room layout. |
segment_semantic | taskonomy | Pixel-wise semantic labeling (via knowledge distillation from MS COCO). |
segment_unsup25d | taskonomy | Segmentation (graph cut approximation) on RGB-D-Normals-Curvature image. |
segment_unsup2d | taskonomy | Segmentation (graph cut approximation) on RGB. |
vanishing_point | taskonomy | Three Manhattan-world vanishing points. |
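Loading the Torchvision half of this battery is straightforward; a minimal sketch (with an abridged list of architectures, and using the `pretrained=` argument from the Torchvision API of the time) might look like this:

```python
import torchvision.models as models

ARCHITECTURES = ['alexnet', 'vgg16', 'resnet50', 'densenet121',
                 'googlenet', 'mobilenet_v2']  # abridged; see table above

def load_battery(architectures=ARCHITECTURES):
    """Load each architecture twice: ImageNet-trained and randomly initialized."""
    battery = {}
    for name in architectures:
        constructor = getattr(models, name)
        # `pretrained=` is the torchvision API circa 2020;
        # newer releases use the `weights=` argument instead
        battery[(name, 'imagenet')] = constructor(pretrained=True).eval()
        battery[(name, 'random')] = constructor(pretrained=False).eval()
    return battery
```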
We extract the responses of each layer of each network to the same stimulus set used in the fMRI experiment, and perform the same operation as before to extract the within- and across-category similarities per layer.
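One way to implement this extraction (a sketch, not the original code; the use of forward hooks on leaf modules and the flattening of each output are our choices) is:

```python
import torch

def get_layer_activations(model, stimuli):
    """Record each leaf module's response to a preprocessed stimulus batch.

    `stimuli` is a (n_images, 3, H, W) tensor. Returns a dict mapping
    layer name to a (n_images, n_features) tensor, which can then be
    passed through pearson_rdm and parse_similarities from above.
    """
    activations, hooks = {}, []

    def save(name):
        def hook(module, inputs, output):
            activations[name] = output.flatten(start_dim=1).detach()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(save(name)))

    with torch.no_grad():
        model(stimuli)
    for handle in hooks:
        handle.remove()
    return activations
```

As an example, let’s visualize the result of this process for AlexNet trained on ImageNet: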
The 18 facets in this plot are the 18 layers of AlexNet (in order from first to last). The rest of the plot elements are the same as we saw with the fMRI data above.
Notice the pattern here: in the earlier layers, invariance is totally absent; within-category similarities are as low as (if not lower than) across-category similarities. But around the 4th or 5th convolutional layer, we start to see some separation – the emergence of orientation invariance. It’s weak at first, but by the second linear layer, all rotations (except the selfsame 0 degrees of rotation) are more or less equally similar.
We can summarize this development in a few ways. We could, for example, average (for each layer) the difference between the within- and across-category curves; the greater the average (positive) difference, the more invariant the representations. Another summary we could take, though, is the similarity of these curves at any given layer to those of the various brain areas we analyzed above. Because we have a sense of where orientation invariance emerges in the brain, knowing how similar these model curves are to the curves in different brain areas should give us a gestalt of where orientation invariance emerges in our models.
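A minimal sketch of the first summary (skipping the trivial 0-degree point is our choice):

```python
import numpy as np

def mean_separation(within, across):
    """Average gap between the within- and across-category curves.

    `within` and `across` map rotation angle to mean similarity, as
    returned by parse_similarities. Larger positive values indicate
    more orientation-invariant representations.
    """
    angles = [a for a in within if a != 0]  # skip the trivial 0-degree point
    return float(np.mean([within[a] - across[a] for a in angles]))
```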
To get this summary, we aggregate the within- and across-category similarities (per model, per layer) and compute a second-order Pearson similarity between these aggregated curves and the corresponding curves for each brain area. Once this is done, we get plots that look something like this:
This looks messy, I know, but what it represents is how similar each model layer is to each brain area. On the x-axis is the index of the model layer (what you might also call the depth of the layer). On the y-axis is the second-order Pearson similarity computed from the aggregated within- and across-category similarities. Error bars are 95% confidence intervals calculated across the 8 stimulus types. In the next few steps, we’ll clean this up a bit, but keep your eye for now on the (statistically significant) gap that emerges around layer 12 between the brain areas we know to show little to no invariance (EVC / OPC) and the brain areas we know to show strong invariance (LOC / OTC).
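For concreteness, a sketch of the second-order comparison just described (the exact concatenation of the within- and across-category values into a single profile vector is our assumption):

```python
import numpy as np

def brain_similarity(model_profile, brain_profile):
    """Second-order Pearson similarity between a model layer and a brain area.

    Each profile is the concatenated vector of within- and across-
    category similarities across rotations for that layer or area.
    """
    x, y = np.asarray(model_profile), np.asarray(brain_profile)
    return np.corrcoef(x, y)[0, 1]
```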
Let’s first zoom in on EVC and LOC, the key locations in the fMRI data:
Now, let’s see if we can smooth out some of the roughness in these trajectories; the smaller peaks and valleys don’t concern us much, since they typically reflect various suboperations within a full block of computation. To smooth these trajectories in a principled way, we’ll use a generalized additive model (GAM): an extension of linear regression that fits penalized smooth functions and keeps a nonlinearity only if the data strongly suggest it’s real.
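A minimal sketch of this smoothing step using the pygam library (the choice of pygam and of a grid search over the smoothing penalty are ours):

```python
import numpy as np
from pygam import LinearGAM, s

def smooth_trajectory(layer_indices, similarities):
    """Fit a penalized spline to a layer-by-layer similarity trajectory.

    gridsearch() selects the smoothing penalty, so wiggles survive
    only where the data strongly support them.
    """
    X = np.asarray(layer_indices, dtype=float).reshape(-1, 1)
    y = np.asarray(similarities, dtype=float)
    gam = LinearGAM(s(0)).gridsearch(X, y)
    return gam.predict(X)
```

When we apply our GAM to these trajectories, we get the following: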
With some of the chaos tamed, what do these plots show us? For AlexNet trained on ImageNet, at least, they show that earlier and intermediate layers are more similar to EVC – with no invariance – and later layers are more similar to LOC – with strong invariance. Pivotally, the switch (what we henceforth call a ‘crossover’) occurs around layer 12, just as we saw in the layer-wise plots above.
We can, of course, repeat all the steps above for every model in our repertoire…
Across the tabs below (organized by training category), you can visualize the trajectories for every model in our survey.
While we’re still analyzing the details of these trajectories, zooming out as far as possible makes some larger trends evident:
Overall, our results provide a preliminary signature of human brain-like orientation invariance in deep neural networks. Future psychophysical work could help to clarify whether the computations undergirding this signature are (at a more granular level) mechanistically and algorithmically comparable to what we see in the brain, but for now, the main takeaway is that the signature (so far as we’ve computed it) does indeed exist.
For questions, please contact Colin Conwell: conwell[at]g[dot]harvard[dot]edu
For the homepage of this and other projects, visit: https://colinconwell.github.io/
For attribution, please cite this work as
Conwell & Alvarez (2020, Jan. 5). Deep Orientation. Retrieved from https://colinconwell.github.io/pages/orientation.html
BibTeX citation
@misc{conwell2020deep,
  author = {Conwell, Colin and Alvarez, George},
  title = {Deep Orientation},
  url = {https://colinconwell.github.io/pages/orientation.html},
  year = {2020}
}