Human Stability Judgments are Mimicked by Deep Nets
Never Trained to Judge Stability
This article is a more accessible summary of a larger paper, which you can find here.
Have a look at the image below: a set of blocks stacked one atop the other. Imagine this is the first frame of an animation. If we were to press play, would this tower stay upright or fall?
The judgment you just made is a feat of ‘intuitive physics’: the ability to reason about and take effective action as an embodied agent in a physical world.
Our intuitive physics as humans is often so effective we barely even notice it. You know the force to exert on each joint as you put one foot in front of the other, know whether or not a moving car is far enough away that you have time to cross the street without it hitting you, and have probably caught the vast majority of things thrown at you in life, even if you’re not so great a catcher.
How our brains manage feats of physical inference like these is an ongoing matter of debate. That some sort of computation occurs seems guaranteed; what exact computation is unclear.
On one side of the debate is the theory that our intuitive physics is much akin to the physics simulators employed by video games [1,2], designed to rapidly and realistically produce convincing models of physical scenarios in real time. This “intuitive physics engine” makes a number of commitments in terms of core design principles (and thus what we might expect to find when exploring intuitive physics in the brain). It relies on structure that’s largely built-in and able to decompose physical scenarios into latent variables (e.g. mass, friction) that can be manipulated like knobs on a control panel to quickly test hypotheses about how those scenarios will unfold in the future. One of the many appeals of a physics engine is its grounding in physical reality: the discipline of physics exists because physical scenarios are often highly predictable (sometimes perfectly determined) when we understand their latent states, and the realism of modern video games is a testament to the efficacy of the simulations we’ve built based on this understanding.
A fundamental issue with the intuitive physics engine as a model of human physical inference is that we don’t really have a sense yet of how that engine might be implemented in neural circuits, and even if we did, we have even less of a sense of how it gets there in the first place. Additionally, there’s a growing catalogue of human behaviors (in physical inference and elsewhere) that seems incompatible with a model so reliant on simulation and innate structure.
In this report, we explore an alternative model of intuitive physics, based on pattern recognition. In this formulation, we take physical inference to be a problem of identifying perceptual features that serve as proxies for the underlying physical states that produce them. Pattern recognition is one of the most prolifically studied computations implemented in neural hardware, and could be used by the brain to provide rapid, approximate solutions to problems that require rapid, appropriate solutions – including those commonly encountered in the physical world. Outside the brain, the quintessential neural framework for pattern recognition is that of deep convolutional neural networks.
Here, we’ll use the pattern recognition capabilities of deep neural networks to explore whether general features learned from the statistics of images (and not from explicit training on physical tasks) are sufficient to serve as the basis for physical inference. Our first goal in this will be to demonstrate sufficiency alone – that our neural networks (never trained on physics) can succeed in the first place. Our second goal will be to show that their success (and sometimes failure) in physical inference mirrors that of human agents.
Adapting a technique specified by [3], we generate a dataset of block towers of 2 to 6 blocks using an open-source 3D graphics program called Blender. The groundtruth for whether a tower will fall can be determined by computing, at each junction of blocks, the mean position (centroid) of all the blocks above the junction and comparing it to the position of the block beneath. If the centroid of the blocks above extends beyond the edge of the block beneath (at any junction), the tower should fall. We sample the positions of the blocks in such a way that, for every number of blocks, approximately 50% of towers are stable and 50% are unstable. The closer a tower is to the boundary between stable and unstable, the harder it is to judge whether it will fall. (Imagine a simple case of 2 blocks: if there’s only a small sliver of contact between the two, this configuration is clearly unstable; if, on the other hand, the top block overhangs by about half its width, it will be harder to tell.)
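The groundtruth rule above can be sketched in a few lines. This is a minimal illustration, not the code used to build the dataset: it assumes each block is summarized by a single horizontal center position, that all blocks have equal mass and unit width, and the function names are hypothetical.

```python
from typing import List

BLOCK_WIDTH = 1.0  # assumption: unit-width blocks, positions are block centers


def unstable_junctions(centers: List[float], width: float = BLOCK_WIDTH) -> List[int]:
    """Return the indices of junctions whose load overhangs the block beneath.

    centers[0] is the bottom block. Junction i sits between block i and
    everything stacked above it.
    """
    unstable = []
    for i in range(len(centers) - 1):
        above = centers[i + 1:]
        # Centroid of all blocks above the junction (equal masses assumed).
        centroid_above = sum(above) / len(above)
        # The load tips over once its centroid passes the supporting block's edge.
        if abs(centroid_above - centers[i]) > width / 2:
            unstable.append(i)
    return unstable


def will_fall(centers: List[float], width: float = BLOCK_WIDTH) -> bool:
    """A tower falls if any single junction is unstable."""
    return len(unstable_junctions(centers, width)) > 0
```

Under these assumptions, a tower with centers `[0.0, 0.4, 0.8]` falls even though each block overlaps its neighbor by more than half: the two upper blocks together have their centroid at about 0.6, which is past the edge of the bottom block.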
For each of the block towers we use in our benchmark stimulus set, we extract a number of features – single scalar values that each describe some property of the stimuli. These features range in complexity, but importantly are never calculated with any information but the position of the blocks – information that we assume is inherent to the image. Physical properties (like mass, friction, and weight distribution) are excluded entirely.
The features we compute are as follows:
feature | description | notes |
---|---|---|
configural deviation | the mean and max values for the distance of each block from the centroid of all the blocks above it | the max value of this feature is a perfect predictor of groundtruth in that any value beyond a given threshold (half the width of a block) means the tower will fall. any value below that threshold means the tower will remain upright. |
local (pairwise) deviation | the mean and max values for the distance of each block in the tower from the block above it, irrespective of other blocks | |
global deviation | the mean and max values for the distance of each block from the overall centroid of the tower (the centroid of all the blocks considered together) | |
number of instabilities | the number of junctions in the tower shown by groundtruth calculations to be unstable | this feature is also a perfect predictor of groundtruth, in that any value of 0 means the tower is stable and any value above 0 means the tower is unstable |
percent unstable | the number of unstable junctions in the tower as a proportion of the number of blocks in the tower (the denominator used in the sample rows below, where one unstable junction in a 2-block tower gives 0.5) | like number of instabilities (and for the same reasons) this feature is another perfect predictor of groundtruth |
horizontal extent | the horizontal distance from the right edge of the rightmost block in the tower to the left edge of the leftmost block in the tower: the tower’s width | |
vertical extent | the vertical distance from the bottom edge of the bottommost block to the upper edge of the uppermost block: the tower’s height | |
alignment distance | the numerically determined minimum distance each block must be moved to return the tower to a perfectly stable configuration, wherein each block is perfectly aligned with the others | |
minimum distance to stability | the minimum each block must be moved to return the tower to a minimally stable configuration, wherein there are no unstable junctions | because a value of 0 means the tower is already stable, and any value above 0 means the tower is unstable, this feature is yet another perfect predictor of groundtruth. |
As values in a dataframe, these features look something like these randomly sampled rows from our full feature set.
stack_size | image_index | number_of_instabilities | percent_unstable | mean_offset | max_offset | mean_pairwise | max_pairwise | mean_configural | max_configural | alignment_distance | minimum_to_stable |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 44 | 1 | 0.5000000 | 0.4690023 | 0.4690023 | 0.9380046 | 0.9380046 | 0.9380046 | 0.9380046 | 0.4690023 | 0.2190023 |
6 | 876 | 3 | 0.5000000 | 0.3057462 | 0.7611653 | 0.2554792 | 0.6050919 | 0.3649692 | 0.7099261 | 0.2752547 | 0.0581093 |
6 | 829 | 1 | 0.1666667 | 0.2185032 | 0.5571119 | 0.2635497 | 0.4659879 | 0.2988073 | 0.6685342 | 0.2185032 | 0.0280890 |
5 | 763 | 0 | 0.0000000 | 0.2379902 | 0.3460107 | 0.1710497 | 0.3609271 | 0.2684505 | 0.4472900 | 0.2155978 | 0.0000000 |
5 | 698 | 1 | 0.2000000 | 0.2200157 | 0.4932158 | 0.4230026 | 0.7528444 | 0.3270549 | 0.5710783 | 0.2116372 | 0.0142157 |
2 | 29 | 1 | 0.5000000 | 0.3920294 | 0.3920294 | 0.7840588 | 0.7840588 | 0.7840588 | 0.7840588 | 0.3920294 | 0.1420294 |
5 | 789 | 0 | 0.0000000 | 0.1965321 | 0.3806398 | 0.2939226 | 0.4942376 | 0.2825355 | 0.4757997 | 0.1934062 | 0.0000000 |
6 | 1000 | 0 | 0.0000000 | 0.2057760 | 0.4377859 | 0.2803946 | 0.8468966 | 0.2521971 | 0.4909329 | 0.1952770 | 0.0000000 |
5 | 648 | 2 | 0.4000000 | 0.3343302 | 0.5240816 | 0.2528125 | 0.3329855 | 0.4462971 | 0.5989432 | 0.3300818 | 0.0267393 |
6 | 889 | 1 | 0.1666667 | 0.2882641 | 0.5280648 | 0.3122998 | 0.5866901 | 0.3212427 | 0.6823588 | 0.2787162 | 0.0303931 |
5 | 793 | 0 | 0.0000000 | 0.1473294 | 0.2601249 | 0.2166011 | 0.4310448 | 0.2042263 | 0.4310448 | 0.1463445 | 0.0000000 |
2 | 122 | 0 | 0.0000000 | 0.1412559 | 0.1412559 | 0.2825117 | 0.2825117 | 0.2825117 | 0.2825117 | 0.1412559 | 0.0000000 |
2 | 94 | 1 | 0.5000000 | 0.3660010 | 0.3660010 | 0.7320020 | 0.7320020 | 0.7320020 | 0.7320020 | 0.3660010 | 0.1160010 |
6 | 884 | 2 | 0.3333333 | 0.3350267 | 0.5509477 | 0.2803022 | 0.4821332 | 0.3663415 | 0.7677735 | 0.3256932 | 0.0497190 |
4 | 494 | 1 | 0.2500000 | 0.1791887 | 0.3583773 | 0.3361123 | 0.5747764 | 0.2917062 | 0.5747764 | 0.1457911 | 0.0186941 |
4 | 442 | 2 | 0.5000000 | 0.6924933 | 1.0891963 | 0.6507152 | 0.8178275 | 0.8186449 | 1.1267395 | 0.6924933 | 0.2787557 |
6 | 965 | 0 | 0.0000000 | 0.1982528 | 0.3857316 | 0.1241329 | 0.2327803 | 0.2787720 | 0.3617466 | 0.1982528 | 0.0000000 |
3 | 345 | 0 | 0.0000000 | 0.1446989 | 0.2170484 | 0.1845133 | 0.2821184 | 0.2550429 | 0.2821184 | 0.1230089 | 0.0000000 |
2 | 138 | 0 | 0.0000000 | 0.1870227 | 0.1870227 | 0.3740455 | 0.3740455 | 0.3740455 | 0.3740455 | 0.1870227 | 0.0000000 |
5 | 749 | 0 | 0.0000000 | 0.0981315 | 0.1619850 | 0.1669901 | 0.2518584 | 0.1407651 | 0.1621288 | 0.0827974 | 0.0000000 |
5 | 681 | 1 | 0.2000000 | 0.3705760 | 0.5070146 | 0.2678504 | 0.6724794 | 0.3940879 | 0.8057557 | 0.3354686 | 0.0611511 |
4 | 443 | 2 | 0.5000000 | 0.4194836 | 0.8322247 | 0.5650873 | 0.8428094 | 0.6885742 | 0.9902404 | 0.4194836 | 0.2082624 |
5 | 627 | 4 | 0.8000000 | 0.4934614 | 1.0956999 | 0.4208380 | 0.9577462 | 0.6297500 | 0.9577462 | 0.4291699 | 0.1566704 |
5 | 781 | 0 | 0.0000000 | 0.0677768 | 0.0964896 | 0.0885573 | 0.1849234 | 0.0807456 | 0.1731548 | 0.0661986 | 0.0000000 |
2 | 132 | 0 | 0.0000000 | 0.0464227 | 0.0464227 | 0.0928454 | 0.0928454 | 0.0928454 | 0.0928454 | 0.0464227 | 0.0000000 |
3 | 260 | 1 | 0.3333333 | 0.2753119 | 0.4129679 | 0.4156738 | 0.6900838 | 0.3994100 | 0.6900838 | 0.2300279 | 0.0633613 |
4 | 503 | 0 | 0.0000000 | 0.0999262 | 0.1873735 | 0.1968036 | 0.3656003 | 0.1525298 | 0.2498314 | 0.0999262 | 0.0000000 |
3 | 397 | 0 | 0.0000000 | 0.1283207 | 0.1924810 | 0.1654278 | 0.2465873 | 0.1864949 | 0.2887215 | 0.1102852 | 0.0000000 |
6 | 983 | 0 | 0.0000000 | 0.1498617 | 0.3186350 | 0.1785938 | 0.3505711 | 0.2095452 | 0.3173374 | 0.1498617 | 0.0000000 |
6 | 887 | 1 | 0.1666667 | 0.1862643 | 0.4190824 | 0.3499487 | 0.6397563 | 0.2389786 | 0.5866343 | 0.1862643 | 0.0144390 |
4 | 515 | 0 | 0.0000000 | 0.0939787 | 0.1344122 | 0.1522606 | 0.1973766 | 0.1272851 | 0.1973766 | 0.0939787 | 0.0000000 |
5 | 622 | 1 | 0.2000000 | 0.2726537 | 0.6729629 | 0.2592977 | 0.6642915 | 0.2921382 | 0.8412036 | 0.2363319 | 0.0682407 |
2 | 41 | 1 | 0.5000000 | 0.4378049 | 0.4378049 | 0.8756099 | 0.8756099 | 0.8756099 | 0.8756099 | 0.4378049 | 0.1878049 |
5 | 701 | 0 | 0.0000000 | 0.1957247 | 0.4232981 | 0.2851936 | 0.3814421 | 0.3129102 | 0.4124256 | 0.1908932 | 0.0000000 |
2 | 35 | 1 | 0.5000000 | 0.3277445 | 0.3277445 | 0.6554891 | 0.6554891 | 0.6554891 | 0.6554891 | 0.3277445 | 0.0777445 |
4 | 548 | 0 | 0.0000000 | 0.1893296 | 0.2442359 | 0.2787333 | 0.3941250 | 0.2638523 | 0.3631936 | 0.1893296 | 0.0000000 |
3 | 315 | 0 | 0.0000000 | 0.0721386 | 0.1082078 | 0.1064718 | 0.1791891 | 0.0980332 | 0.1623117 | 0.0597297 | 0.0000000 |
6 | 870 | 2 | 0.3333333 | 0.2737882 | 0.5004552 | 0.3374449 | 0.6024884 | 0.3190397 | 0.6034715 | 0.2737882 | 0.0254555 |
2 | 163 | 0 | 0.0000000 | 0.0813453 | 0.0813453 | 0.1626905 | 0.1626905 | 0.1626905 | 0.1626905 | 0.0813453 | 0.0000000 |
3 | 207 | 1 | 0.3333333 | 0.2938381 | 0.4407572 | 0.6611358 | 0.7386559 | 0.5361358 | 0.5836158 | 0.2462186 | 0.0278719 |
2 | 72 | 1 | 0.5000000 | 0.3820717 | 0.3820717 | 0.7641433 | 0.7641433 | 0.7641433 | 0.7641433 | 0.3820717 | 0.1320717 |
3 | 213 | 1 | 0.3333333 | 0.2298307 | 0.3447461 | 0.3418996 | 0.3504390 | 0.4295094 | 0.5085798 | 0.2279331 | 0.0028599 |
5 | 706 | 0 | 0.0000000 | 0.1020953 | 0.2300429 | 0.1219225 | 0.2541348 | 0.1399699 | 0.2875537 | 0.0972770 | 0.0000000 |
6 | 890 | 1 | 0.1666667 | 0.3676904 | 0.5803013 | 0.2603350 | 0.5859337 | 0.3516431 | 0.8516295 | 0.3676904 | 0.0586049 |
3 | 392 | 0 | 0.0000000 | 0.2451482 | 0.3677223 | 0.3430386 | 0.4170897 | 0.4473110 | 0.4775324 | 0.2286924 | 0.0000000 |
5 | 644 | 1 | 0.2000000 | 0.1981296 | 0.4056977 | 0.3266799 | 0.3867294 | 0.2813021 | 0.5372720 | 0.1943359 | 0.0074544 |
5 | 696 | 1 | 0.2000000 | 0.2095509 | 0.5148292 | 0.2846444 | 0.5565383 | 0.3527962 | 0.5565383 | 0.2012090 | 0.0113077 |
2 | 12 | 1 | 0.5000000 | 0.3338699 | 0.3338699 | 0.6677399 | 0.6677399 | 0.6677399 | 0.6677399 | 0.3338699 | 0.0838699 |
3 | 242 | 1 | 0.3333333 | 0.3811761 | 0.5717642 | 0.4577897 | 0.8769573 | 0.4481342 | 0.8576463 | 0.2923191 | 0.1192154 |
4 | 440 | 1 | 0.2500000 | 0.3920738 | 0.4510911 | 0.2918747 | 0.6926710 | 0.4138055 | 0.7251301 | 0.3920738 | 0.0562825 |
3 | 233 | 1 | 0.3333333 | 0.4722274 | 0.7083411 | 0.6172862 | 0.8904509 | 0.7033166 | 1.0625117 | 0.4115242 | 0.1875039 |
6 | 960 | 0 | 0.0000000 | 0.1377617 | 0.3376095 | 0.2679944 | 0.4446810 | 0.2202040 | 0.3952441 | 0.1377617 | 0.0000000 |
2 | 171 | 0 | 0.0000000 | 0.1492993 | 0.1492993 | 0.2985987 | 0.2985987 | 0.2985987 | 0.2985987 | 0.1492993 | 0.0000000 |
2 | 157 | 0 | 0.0000000 | 0.1599455 | 0.1599455 | 0.3198910 | 0.3198910 | 0.3198910 | 0.3198910 | 0.1599455 | 0.0000000 |
3 | 227 | 2 | 0.6666667 | 0.5690855 | 0.8536283 | 0.8467229 | 0.8674390 | 0.9717229 | 1.0760069 | 0.5644820 | 0.3144820 |
3 | 337 | 0 | 0.0000000 | 0.0615625 | 0.0923437 | 0.1338096 | 0.1815501 | 0.0931280 | 0.1815501 | 0.0605167 | 0.0000000 |
5 | 619 | 2 | 0.4000000 | 0.3303974 | 0.6302148 | 0.4032878 | 0.6656197 | 0.5174669 | 0.6814119 | 0.3233164 | 0.0374060 |
4 | 474 | 1 | 0.2500000 | 0.2000714 | 0.3804100 | 0.2082132 | 0.3606771 | 0.2717781 | 0.5072133 | 0.2000714 | 0.0018033 |
5 | 779 | 0 | 0.0000000 | 0.1571537 | 0.2979691 | 0.1928644 | 0.2293644 | 0.2318699 | 0.3724614 | 0.1482398 | 0.0000000 |
3 | 286 | 1 | 0.3333333 | 0.2805922 | 0.4208883 | 0.3507239 | 0.5612171 | 0.3857815 | 0.6313324 | 0.2338159 | 0.0437775 |
4 | 552 | 0 | 0.0000000 | 0.1645941 | 0.2285887 | 0.2970948 | 0.3608970 | 0.2252877 | 0.3608970 | 0.1645941 | 0.0000000 |
3 | 359 | 0 | 0.0000000 | 0.1838418 | 0.2757627 | 0.2623305 | 0.3026271 | 0.3379873 | 0.3733475 | 0.1748870 | 0.0000000 |
5 | 728 | 0 | 0.0000000 | 0.2313988 | 0.3738546 | 0.2952939 | 0.5188853 | 0.3374334 | 0.4497608 | 0.2166673 | 0.0000000 |
6 | 990 | 0 | 0.0000000 | 0.1048971 | 0.2170217 | 0.2080634 | 0.3013339 | 0.1378396 | 0.2629201 | 0.1048971 | 0.0000000 |
6 | 815 | 1 | 0.1666667 | 0.1915293 | 0.3432148 | 0.3149511 | 0.6351984 | 0.2534309 | 0.5001536 | 0.1915293 | 0.0000256 |
5 | 625 | 3 | 0.6000000 | 0.2599565 | 0.5702270 | 0.3111018 | 0.4905627 | 0.4018095 | 0.5244147 | 0.2432773 | 0.0179874 |
6 | 886 | 3 | 0.5000000 | 0.5268158 | 0.9325646 | 0.3680348 | 0.6796238 | 0.5144943 | 1.0129572 | 0.5268158 | 0.1215361 |
3 | 273 | 2 | 0.6666667 | 0.4155513 | 0.6233269 | 0.6106685 | 0.6486438 | 0.7356685 | 0.8226933 | 0.4071124 | 0.1571124 |
2 | 153 | 0 | 0.0000000 | 0.2392161 | 0.2392161 | 0.4784322 | 0.4784322 | 0.4784322 | 0.4784322 | 0.2392161 | 0.0000000 |
6 | 838 | 1 | 0.1666667 | 0.3058684 | 0.4261666 | 0.1677717 | 0.3833879 | 0.3515892 | 0.5253065 | 0.3058684 | 0.0042177 |
3 | 275 | 1 | 0.3333333 | 0.2778046 | 0.4167070 | 0.3505385 | 0.5490440 | 0.3885467 | 0.6250604 | 0.2336923 | 0.0416868 |
3 | 254 | 1 | 0.3333333 | 0.2293951 | 0.3440927 | 0.3194088 | 0.3934604 | 0.3807482 | 0.5161391 | 0.2129392 | 0.0053797 |
2 | 198 | 0 | 0.0000000 | 0.0777017 | 0.0777017 | 0.1554034 | 0.1554034 | 0.1554034 | 0.1554034 | 0.0777017 | 0.0000000 |
4 | 590 | 0 | 0.0000000 | 0.1102278 | 0.1713906 | 0.1877456 | 0.2545567 | 0.1709152 | 0.2155028 | 0.1102278 | 0.0000000 |
4 | 414 | 1 | 0.2500000 | 0.1997590 | 0.3198862 | 0.2911842 | 0.5569034 | 0.2858436 | 0.5569034 | 0.1997590 | 0.0142258 |
4 | 580 | 0 | 0.0000000 | 0.0765990 | 0.1531979 | 0.1411293 | 0.2144571 | 0.0927763 | 0.2081938 | 0.0549958 | 0.0000000 |
3 | 205 | 1 | 0.3333333 | 0.2345428 | 0.3518142 | 0.3179830 | 0.5638029 | 0.3708199 | 0.5638029 | 0.1879343 | 0.0212676 |
5 | 771 | 0 | 0.0000000 | 0.2034445 | 0.2769005 | 0.2262782 | 0.4563009 | 0.2849505 | 0.3925716 | 0.1712723 | 0.0000000 |
2 | 74 | 1 | 0.5000000 | 0.3389592 | 0.3389592 | 0.6779184 | 0.6779184 | 0.6779184 | 0.6779184 | 0.3389592 | 0.0889592 |
3 | 319 | 0 | 0.0000000 | 0.1168295 | 0.1752442 | 0.1515996 | 0.2225335 | 0.1717661 | 0.2628664 | 0.1010664 | 0.0000000 |
5 | 709 | 0 | 0.0000000 | 0.1613303 | 0.2345831 | 0.2670159 | 0.3919659 | 0.2191630 | 0.3919659 | 0.1394189 | 0.0000000 |
5 | 765 | 0 | 0.0000000 | 0.1595025 | 0.3212647 | 0.2151102 | 0.2926510 | 0.2568042 | 0.3334866 | 0.1537798 | 0.0000000 |
2 | 103 | 0 | 0.0000000 | 0.0392579 | 0.0392579 | 0.0785159 | 0.0785159 | 0.0785159 | 0.0785159 | 0.0392579 | 0.0000000 |
5 | 711 | 0 | 0.0000000 | 0.1623711 | 0.3235006 | 0.2296314 | 0.3738340 | 0.2385571 | 0.3686399 | 0.1533433 | 0.0000000 |
6 | 888 | 1 | 0.1666667 | 0.2462385 | 0.3815555 | 0.2221110 | 0.4751615 | 0.2930344 | 0.5119898 | 0.2462385 | 0.0019983 |
4 | 562 | 0 | 0.0000000 | 0.1351732 | 0.2703463 | 0.1258685 | 0.3527720 | 0.1320807 | 0.3604618 | 0.0942544 | 0.0000000 |
2 | 56 | 1 | 0.5000000 | 0.4701536 | 0.4701536 | 0.9403072 | 0.9403072 | 0.9403072 | 0.9403072 | 0.4701536 | 0.2201536 |
4 | 521 | 0 | 0.0000000 | 0.1176232 | 0.1976810 | 0.1825715 | 0.3539149 | 0.1679127 | 0.3539149 | 0.1176232 | 0.0000000 |
5 | 683 | 3 | 0.6000000 | 0.3828250 | 0.9319919 | 0.5029328 | 0.9069213 | 0.5469022 | 0.9069213 | 0.3506987 | 0.1320533 |
4 | 441 | 3 | 0.7500000 | 0.4205143 | 0.8410286 | 0.4161932 | 0.8885117 | 0.6106376 | 0.8885117 | 0.3967728 | 0.1206031 |
3 | 215 | 1 | 0.3333333 | 0.2303072 | 0.3454607 | 0.3724666 | 0.5937718 | 0.3463052 | 0.5937718 | 0.1979239 | 0.0312573 |
4 | 559 | 0 | 0.0000000 | 0.2016471 | 0.3325543 | 0.2108995 | 0.3052194 | 0.3076757 | 0.4361266 | 0.2016471 | 0.0000000 |
5 | 736 | 0 | 0.0000000 | 0.1755084 | 0.3662160 | 0.2690448 | 0.5495437 | 0.2709191 | 0.4027131 | 0.1617886 | 0.0000000 |
6 | 962 | 0 | 0.0000000 | 0.2240777 | 0.3501600 | 0.3372261 | 0.5740782 | 0.2822087 | 0.4529200 | 0.2240777 | 0.0000000 |
4 | 537 | 0 | 0.0000000 | 0.2194555 | 0.3078602 | 0.3515438 | 0.4641372 | 0.2987473 | 0.4641372 | 0.2194555 | 0.0000000 |
6 | 980 | 0 | 0.0000000 | 0.1653668 | 0.3158146 | 0.2052644 | 0.3459786 | 0.2071336 | 0.4331345 | 0.1653668 | 0.0000000 |
2 | 178 | 0 | 0.0000000 | 0.0661733 | 0.0661733 | 0.1323466 | 0.1323466 | 0.1323466 | 0.1323466 | 0.0661733 | 0.0000000 |
6 | 937 | 0 | 0.0000000 | 0.2175637 | 0.4789167 | 0.2939072 | 0.6136261 | 0.3166383 | 0.4497411 | 0.1790893 | 0.0000000 |
3 | 327 | 0 | 0.0000000 | 0.0881284 | 0.1321925 | 0.1233966 | 0.2144569 | 0.1446746 | 0.2144569 | 0.0714856 | 0.0000000 |
5 | 758 | 0 | 0.0000000 | 0.1195030 | 0.2001376 | 0.2362445 | 0.3306889 | 0.1708749 | 0.2501720 | 0.1097619 | 0.0000000 |
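To make the deviation columns above concrete, here is a hedged sketch of how they could be computed from block positions. It assumes one horizontal center per block, unit block width, and equal masses; the function name is hypothetical, and the example positions are back-solved from the 3-block row with image_index 345 rather than supplied by the dataset.

```python
from statistics import mean
from typing import Dict, List


def deviation_features(centers: List[float], width: float = 1.0) -> Dict[str, float]:
    """Compute deviation features from block centers (bottom block first)."""
    n = len(centers)
    # Local (pairwise) deviation: each block versus the block directly above it.
    pairwise = [abs(centers[i + 1] - centers[i]) for i in range(n - 1)]
    # Configural deviation: each block versus the centroid of ALL blocks above it.
    configural = [abs(mean(centers[i + 1:]) - centers[i]) for i in range(n - 1)]
    # Global deviation: each block versus the centroid of the whole tower.
    tower_centroid = mean(centers)
    offsets = [abs(c - tower_centroid) for c in centers]
    # A junction is unstable when its configural deviation exceeds half a block.
    n_unstable = sum(d > width / 2 for d in configural)
    return {
        "number_of_instabilities": n_unstable,
        # Divided by stack size, which is what the sample rows show
        # (one unstable junction in a 2-block tower gives 0.5).
        "percent_unstable": n_unstable / n,
        "mean_offset": mean(offsets),
        "max_offset": max(offsets),
        "mean_pairwise": mean(pairwise),
        "max_pairwise": max(pairwise),
        "mean_configural": mean(configural),
        "max_configural": max(configural),
    }


# Positions inferred (not given) from the row with image_index 345.
row_345 = deviation_features([0.0, 0.0869082, 0.3690266])
```

With those inferred positions, the six deviation values agree with the tabulated row to within rounding, which suggests the column definitions are being read correctly. The alignment and minimum-to-stable distances are determined numerically in the original and are not attempted here.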
Our first task here was to test a pool of human subjects to see how well they fared in judging the stability of the block towers. 81 subjects on Mechanical Turk were shown a series of towers (of one tower size) and asked to decide whether the tower shown was stable or unstable. Here’s a pirate plot showing their performance.
On the x axis is the number of blocks in the tower; on the y axis is the classification accuracy (% of trials in which the subject correctly chose stable or unstable). The translucent white rectangles are the 95% confidence intervals across subjects.
Overall, this plot shows the human subjects do pretty well, but notice the slight (albeit statistically significant) decrease in performance as the number of blocks in the tower increases.
Beyond raw accuracy, there remains the question of how (computationally) humans are able to make the judgments they do. This is where our feature analysis comes in.
To get a sense of which features seem to be guiding human judgments, we run a series of variable importance analyses, in each case attempting to predict human judgments with our features and seeing, in effect, which feature is the ‘most important’ in terms of predicting human behavior. This results in 8 different metrics of feature importance, the descriptions of which can be found in the table below.
metric | description |
---|---|
ridge coefficients | the largest coefficient from a ridge regression following a cross-validated regularization procedure |
lasso coefficients | the largest coefficient from a lasso regression following a cross-validated regularization procedure |
PLS coefficients | the largest coefficient from a partial least squares (pls) regression following a cross-validated regularization procedure |
PLS projection influence | the largest influence on the dimensionality reduced projection in the partial least squares regression (effectively equivalent to the largest loading on the first component in a principal components analysis) |
random forest mean decrease in accuracy | the decrease in accuracy suffered by the classifier when removing the variable of interest |
random forest mean decrease in Gini | the decrease in the Gini coefficient (a measure of how much the variable of interest streamlines the final decision tree) |
ROC area under the curve (AUC) | the area under the curve of a logistic regression using only the variable of interest |
ROC information criterion (AIC) | the Akaike information criterion of a logistic regression using only the variable of interest (smaller is better) |
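As an illustration of the single-variable AUC metric in the table above: because a one-predictor logistic regression is a monotone function of that predictor, its AUC equals the rank-based AUC of the raw feature (or one minus it, if the fitted coefficient is negative). The sketch below computes that rank-based quantity directly. The function name and the response values are hypothetical; only the feature values are taken from the sample rows.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual ROC AUC definition.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# max_configural values from six sample rows ...
max_configural = [0.9380046, 0.2825117, 0.4472900, 0.5710783, 0.6685342, 0.3617466]
# ... paired with HYPOTHETICAL responses (1 = judged unstable), for illustration only.
judged_unstable = [1, 0, 0, 1, 0, 0]

score = auc(max_configural, judged_unstable)
```

Here seven of the eight positive/negative pairs are ordered correctly, so `score` is 0.875. Repeating this for each feature and ranking the results gives one of the eight importance orderings.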
More important than any single metric, though, is the overall gestalt these feature importance analyses provide us – a sense of the overall most important feature for predicting human behavior. That gestalt looks a bit like this:
On the x axis of this plot is the score for a given variable importance metric; on the y axis are the features. The facets show each of the 8 feature importance metrics. Apart from the AIC metric (for which lower scores are better), increasing score means increasing ability to predict human judgments. As you can see, the most predictive feature in 7 out of these 8 metrics is the ‘max configural deviation’ – the maximum of the distances of each block from the centroid of all the blocks above it. Keep this feature in mind, as we’ll later use it to compare the performances of our human participants to those of the machines at a finer grain.
For our machine subjects, we use two different models, each with the same architecture (Resnet18), but two slightly different training regimens. The first, which we call Resnet18-Imagenet, is a standard Resnet18 trained to label the 1000 image classes in the ImageNet dataset. The second, which we call Resnet18-Autoencoder, is Resnet18 repurposed as the encoder in an autoencoder with a latent space of 128 dimensions, whose sole purpose is efficiently encoding in a lower dimensional space a set of block tower images, then reconstructing the images back to their original dimensions on the other end. (We include the autoencoder as a test of unsupervised learning’s ability to create feature spaces equally sufficient for intuitive physics as those created by supervised learning).
What’s worth reiterating here is that neither of these models is trained on the task of judging stability, per se. A number of models in recent years have been built specifically for the purpose of physical reasoning [3–5], and trained end to end in a fully supervised fashion on whatever physical reasoning task they’re designed to tackle. Neither of our models in this case ever has its weights updated by learning to classify ‘stable’ versus ‘unstable’, nor by explicitly learning the latent physical parameters relevant to the task (the mass of the blocks, for example). Even the Resnet18-Autoencoder, which is trained only to reconstruct images of the block towers, is never given additional information outside the pixel values of the input image.
So how, without training, do we evaluate how well the feature spaces of these deep nets can be used to predict stability? In this case, we affix what’s called a pooling linear classifier to the final layer of each network, performing what is, in effect, a logistic regression (with the outputs of 0 and 1 in this case assigned to “stable” or “unstable”) on the model’s outputs. “Linear separability” (the ability to draw a line between categories) is a hallmark characteristic of feature spaces that can be said to have “disentangled” certain properties of the input – in this case, the stability of the block towers – and it’s precisely what we’re testing with our linear classifier.
If, as we’ve hypothesized, the pattern recognition performed by deep neural networks is sufficient to serve as the basis for intuitive physical inference, our models should show a ‘linear separability’ of ‘stable’ versus ‘unstable’ that is comparable, if not superior, to the kind evident in human judgments.
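The idea of a linear readout on frozen features can be sketched as follows. This is a toy stand-in, not the pooling classifier used in the study: the "features" are synthetic random vectors playing the role of a network's final-layer activations, the pooling step is omitted, and every name and hyperparameter is an assumption. The point is only that the feature extractor is never updated; the weights `w` and bias `b` of the readout are the sole fitted parameters.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_linear_readout(features: Sequence[Sequence[float]], labels: Sequence[int],
                       lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Logistic regression by full-batch gradient descent on fixed features."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(features, labels, w, b) -> float:
    """Fraction of items on the correct side of the fitted hyperplane."""
    hits = 0
    for x, y in zip(features, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        hits += int((p > 0.5) == bool(y))
    return hits / len(labels)


def make_toy_features(n: int = 200, dim: int = 4, seed: int = 0):
    """Synthetic 'frozen' features; label 1 depends linearly on two dimensions."""
    rng = random.Random(seed)
    feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    labels = [1 if x[0] - x[1] > 0 else 0 for x in feats]
    return feats, labels
```

If a readout like this reaches high accuracy, the category is linearly separable in the feature space; if it cannot, no amount of readout training will help, because the features themselves are held fixed.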
To ensure our linear classifiers are consistent, and not subject to some statistical fluke, we fit them 5 times for each model. The performance of those classifiers, juxtaposed to the human data, can be seen in the plot below:
On the x axis and y axis again is the number of blocks and the accuracy, respectively. Confidence intervals (represented by the translucent white rectangle) are the bootstrapped 95% confidence intervals across the subjects (in the case of the human data) or the number of classifier fits (in the case of the machine data).
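For readers unfamiliar with bootstrapped intervals, here is a minimal percentile-bootstrap sketch. The exact bootstrap variant used for the plot is not stated, so the percentile method, the function name, and the accuracy values below are all assumptions for illustration.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# HYPOTHETICAL accuracies from five classifier fits (or five subjects).
accuracies = [0.80, 0.75, 0.90, 0.85, 0.70]
low, high = bootstrap_ci(accuracies)
```

With only five classifier fits per model, an interval like this will be coarse, which is worth keeping in mind when comparing the machine intervals to the much larger human sample.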
Though the level of classification accuracy can likely be pushed slightly upwards or downwards by the number of training iterations used in the fitting of the classifier, this first pass suggests the overall trend in performance is comparable across human and machine, including with respect to the dropoff in accuracy for greater numbers of blocks.
Just as with human behavior, our machine behavior is only superficially described by overall performance. To more veridically test the hypothesis that pattern recognition captures some aspect of human behavior, we need to see if our pattern recognizing neural networks make the same kinds of choices humans do. The first step in this is our feature importance analysis – seeing which of the features computed for each of the stimuli best predict machine performance.
As with humans, the ‘max configural deviation’ feature seems dominant, ranking highest in 6 out of 8 metrics for both Resnet18-Imagenet and Resnet18-Autoencoder.
The second step, in the true spirit of psychophysics, is to compare the pattern of success and failure across our two subject pools.
Given the overlap in its importance across human and machine, we can use the maximum configural deviation as a sort of psychophysical proxy, seeing how the judgments of our (human and machine) subjects change as we manipulate it higher or lower:
On the x axis in this plot is the max configural deviation, ranging from 0 (perfectly stable) to 1 (perfectly unstable). Because this feature is an optimal proxy of groundtruth, we can actually demarcate the divide of ‘stable’ or ‘unstable’ directly on the axis (at 0.5 – the dashed line). On the y axis is one of two things (depending on whether the subjects are humans or machines): either the proportion of human subjects choosing ‘stable’ for a given stimulus, or the average confidence of the machine in classifying ‘stable’ for a given stimulus (approximated by output probabilities from the classifier; see footnote 1). The points are individual stimuli, and the lines are psychophysical curves from a mixed effects logistic regression in which we attempt to predict a human’s or machine’s choice for a given stimulus from that stimulus’ max configural deviation. The closer to 0.5 the maximum configural deviation is, the more difficult the judgment should be.
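One way to write down the kind of curve described above, offered as a sketch rather than the exact model fitted in the paper. The random-effects structure is not specified here, so the per-subject intercept $u_j$ is an assumption, as are all the symbol names: $d_i$ is the max configural deviation of stimulus $i$, and $j$ indexes a human subject or a classifier fit.

```latex
\Pr(\text{stable}_{ij}) = \frac{1}{1 + \exp\!\big(-(\beta_0 + \beta_1 d_i + u_j)\big)},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

For an average subject ($u_j = 0$) the curve crosses 0.5 where $\beta_0 + \beta_1 d = 0$, that is at $d^\ast = -\beta_0 / \beta_1$. An observer whose judgments track groundtruth would have $\beta_1 < 0$ and $d^\ast$ close to 0.5, and a steeper slope $|\beta_1|$ would indicate sharper discrimination near that boundary.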
What this plot shows (in conjunction with the feature importance analysis) is a remarkable overlap in the psychophysical profiles of humans and machines – almost to the extent that they are indistinguishable.
Taken together, the results of this study suggest pattern recognition may indeed be a mechanism for human physical inference in a paradigmatic intuitive physics task. The features learned by two networks never trained on physics provide sufficient linear separability to classify a stable versus unstable configuration of block towers, and the choices made by our machines closely mirror those of our human subjects. While these results do not exclude the possibility that other mechanisms are at play, they do suggest that the kind of pattern recognition done by deep neural networks is sufficient for making moderately complex judgments about the physical world.
For questions, please contact Colin Conwell: conwell[at]g[dot]harvard[dot]edu
For the homepage of this and other projects, visit: https://colinconwell.github.io/
1. Battaglia PW, Hamrick JB, Tenenbaum JB. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences [Internet]. 2013;110(45):18327–32. Available from: https://www.pnas.org/content/110/45/18327
2. Ullman TD, Spelke E, Battaglia P, Tenenbaum JB. Mind games: Game engines as an architecture for intuitive physics. Trends in Cognitive Sciences. 2017;21(9):649–65.
3. Zhang R, Wu J, Zhang C, Freeman WT, Tenenbaum JB. A comparative evaluation of approximate probabilistic simulation and deep neural networks as accounts of human physical scene understanding. arXiv preprint arXiv:1605.01138. 2016.
4. Conwell C, Alvarez G. Modeling the intuitive physics of stability judgments using deep hierarchical convolutional neural networks. In: 2018 conference on cognitive computational neuroscience [Internet]. Cognitive Computational Neuroscience; 2018. Available from: https://doi.org/10.32470/ccn.2018.1206-0
5. Lerer A, Gross S, Fergus R. Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312. 2016.
Footnote 1: It’s worth noting there’s a significant amount of debate about the interpretation of these values; more recent work on the difference between epistemic and aleatory uncertainty in machines suggests the interpretation we use here may be a bit outdated.
For attribution, please cite this work as
Conwell, et al. (2019, Sept. 26). BlockBuster. Retrieved from https://colinconwell.github.io/pages/blockbuster.html
BibTeX citation
@misc{conwell2019blockbuster, author = {Conwell, Colin and Doshi, Fenil and Alvarez, George}, title = {BlockBuster}, url = {https://colinconwell.github.io/pages/blockbuster.html}, year = {2019} }