People excel at understanding their physical environment. How do state-of-the-art neural networks compare? To find out, we developed Physion, a large dataset of physical interactions, and used it to compare human and AI physical scene understanding.
How well do today's AI models understand the physical structure and dynamics of visual scenes?— Daniel Bear (@recursus) June 18, 2021
Introducing #Physion: a dataset/benchmark featuring objects that roll, slide, fall, fold, collide, connect, contain, and more!
preprint: https://t.co/RaHHJB2tQR pic.twitter.com/skbdQcdtxP
Drawing on ideas from psychology, vision research, and AI, this was a dream collaboration with @eliaszwang, Damian Mrowca, @flxbinder, Fish Tung, @PramodRT9, @choldawa, @SiruiTao, @realkevinsmith, Fan-Yun Sun, @drfeifei, @Nancy_Kanwisher, @MITCoCoSci, @dyamins, and @judyefan. pic.twitter.com/xK7bJfGLQU
There's been a lot of great work testing AI models on particular scenarios (e.g. block towers) or in simpler environments (e.g. 2D worlds).
But a model with true physical scene understanding should be able to make predictions across diverse, realistically complex environments.
We use our general physical knowledge all the time. I see how much food I can fit on my plate. I avoid some debris rolling in front of my car. I try to hang my jacket on the back of a chair.
We want to build systems that interact with the physical world as effectively as we do.
The #Physion benchmark asks: how close are we to achieving that goal? What are promising avenues for making progress?
We designed eight scenarios to showcase different physical phenomena. Each is a set of realistic simulations of how events unfold given an initial configuration. pic.twitter.com/7CBZnY2EOg
We then asked people and models to do a hard task on these stimuli: after seeing the beginning of a movie, predict whether two cued objects will touch.
People did remarkably well (~75% accurate) with very little training, suggesting that they rapidly apply general physical knowledge. pic.twitter.com/PgJP0n22wB
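To make the task concrete, here is a minimal sketch of how a touch / no-touch prediction could be scored. The function name `contact_accuracy` and the four toy trials are invented for illustration; they are not taken from the Physion dataset or its code.

```python
from typing import Sequence


def contact_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of trials where the predicted touch/no-touch label matches the outcome."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one trial")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy, invented labels for four trials (True = the two cued objects touch).
actual = [True, False, True, False]
predicted = [True, False, False, False]
print(contact_accuracy(predicted, actual))  # 0.75
```

With a binary outcome like this, guessing gives about 50% when the two outcomes are equally common, which is the sense of "chance" assumed here.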
What about models? We tested a diverse set of architectures, input types, learning objectives, training datasets, and prediction mechanisms.
These included ConvNets, Transformers, object-centric models, and Graph Networks optimized and tested under multiple protocols. pic.twitter.com/Xecgr0WzrZ
It turns out only one model class comes close to making human-like predictions: Graph Networks that take the *simulator's true physical state* as input, treating it as a collection of "particles" as in https://t.co/akAoHRjVp2. They're supervised to learn the particles' dynamics. pic.twitter.com/uebCKaewCE
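As a rough illustration of what "a Graph Network over particles" means, the sketch below runs one message-passing step in NumPy. It is not the architecture from the paper: the radius-based edges, the single linear layers, the `tanh` nonlinearity, the random untrained weights, and every function and parameter name are assumptions chosen to keep the example short.

```python
import numpy as np


def radius_edges(positions, radius):
    """Return (senders, receivers) for each ordered pair of distinct particles closer than `radius`."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    near = (dists < radius) & ~np.eye(len(positions), dtype=bool)
    senders, receivers = np.nonzero(near)
    return senders, receivers


def message_passing_step(positions, velocities, w_edge, w_node, radius=0.5, dt=0.01):
    """One step: build edges, compute messages, sum them per particle, integrate."""
    senders, receivers = radius_edges(positions, radius)
    # Edge features: relative position and relative velocity, shape (E, 6).
    edge_feats = np.concatenate(
        [positions[senders] - positions[receivers],
         velocities[senders] - velocities[receivers]],
        axis=-1,
    )
    messages = np.tanh(edge_feats @ w_edge)  # shape (E, H)
    # Sum incoming messages at each receiving particle, shape (N, H).
    aggregated = np.zeros((len(positions), w_edge.shape[1]))
    np.add.at(aggregated, receivers, messages)
    # Node update: map own velocity plus aggregated messages to an acceleration.
    accel = np.concatenate([velocities, aggregated], axis=-1) @ w_node  # (N, 3)
    new_velocities = velocities + dt * accel
    new_positions = positions + dt * new_velocities
    return new_positions, new_velocities


rng = np.random.default_rng(0)
hidden = 8  # hypothetical message size
positions = rng.uniform(0.0, 1.0, size=(5, 3))
velocities = rng.normal(size=(5, 3))
w_edge = 0.1 * rng.normal(size=(6, hidden))       # untrained, random weights
w_node = 0.1 * rng.normal(size=(3 + hidden, 3))   # untrained, random weights
new_positions, new_velocities = message_passing_step(positions, velocities, w_edge, w_node)
print(new_positions.shape, new_velocities.shape)
```

In a trained model the weights would be fit so that the predicted next state matches the simulator's, which is what "supervised to learn the particles' dynamics" refers to.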
This accords well with earlier work (e.g. https://t.co/ZovcAAFxp8) suggesting "mental simulation" of objects and particles can account for physical judgments (though see https://t.co/wyyuSY2oJk). Of course, particle-based models cheat in a major way: they don't take visual input!
When we test *vision-based* models on #Physion, we find that none make human-level predictions; they range from chance to ~60% accurate overall, and make very different mistakes from the ones people make! pic.twitter.com/rp9EcXYe3V
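Accuracy alone does not show whether a model errs on the same stimuli people do. One possible way to check, sketched below, is to correlate per-stimulus responses: the fraction of people who answered "yes, they touch" against the model's predicted probability of contact. This measure, the function name `response_correlation`, and the toy numbers are assumptions for illustration and may differ from the comparison used in the paper.

```python
import numpy as np


def response_correlation(human_yes_rate, model_yes_prob):
    """Pearson correlation between per-stimulus human and model 'yes' responses."""
    human = np.asarray(human_yes_rate, dtype=float)
    model = np.asarray(model_yes_prob, dtype=float)
    if human.ndim != 1 or human.shape != model.shape:
        raise ValueError("inputs must be 1-D arrays of the same length")
    if human.std() == 0 or model.std() == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return float(np.corrcoef(human, model)[0, 1])


# Toy, invented values for five stimuli (not Physion data).
human_yes_rate = [0.9, 0.2, 0.7, 0.1, 0.6]
model_yes_prob = [0.4, 0.6, 0.5, 0.5, 0.3]
print(round(response_correlation(human_yes_rate, model_yes_prob), 3))
```

A correlation near 1 would mean the model finds the same stimuli easy and hard as people do; a value near 0 would mean its pattern of answers is unrelated to theirs, regardless of overall accuracy.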
This suggests that vision models have a long way to go before they capture human-like intuitive physics. But our results hint at promising directions: the better vision models either learned *object-centric representations* or were pretrained on (supervised) object recognition.
Putting it all together, we think one way forward is to build models that (A) extract physically explicit, object-centric representations of scenes from visual input (perhaps via pretraining); then (B) simulate the dynamics of physically diverse scenarios with a Graph Network.
But more importantly, we hope #Physion illuminates and motivates progress on a critical unsolved problem in Computer Vision and AI: the prediction of key physical events in a complex world.
And if you think we should test another model or build a new scenario, let us know! pic.twitter.com/cAgwtU2oz8