Physion: Evaluating Physical Prediction from Vision in Humans and Machines

People excel at understanding their physical environment. How do they compare to state-of-the-art neural networks? To test this, we developed Physion, a large dataset of physical interactions, and used it to compare human and AI physical scene understanding.

How well do today's AI models understand the physical structure and dynamics of visual scenes?

Introducing #Physion: a dataset/benchmark featuring objects that roll, slide, fall, fold, collide, connect, contain, and more!

preprint: https://t.co/RaHHJB2tQR
— Daniel Bear (@recursus) June 18, 2021

Drawing on ideas from psychology, vision research, and AI, this was a dream collaboration with @eliaszwang, Damian Mrowca, @flxbinder, Fish Tung, @PramodRT9, @choldawa, @SiruiTao, @realkevinsmith, Fan-Yun Sun, @drfeifei, @Nancy_Kanwisher, @MITCoCoSci, @dyamins, and @judyefan.

There's been a lot of great work testing AI models on particular scenarios (e.g. block towers) or in simpler environments (e.g. 2D worlds).

But a model with true physical scene understanding should be able to make predictions across diverse, realistically complex environments.

We use our general physical knowledge all the time. I see how much food I can fit on my plate. I avoid some debris rolling in front of my car. I try to hang my jacket on the back of a chair.

We want to build systems that interact with the physical world as effectively as we do.

The #Physion benchmark asks: how close are we to achieving that goal? What are promising avenues for making progress?

We designed eight scenarios to showcase different physical phenomena. Each is a set of realistic simulations of how events unfold given an initial configuration.

We then asked people and models to do a hard task on these stimuli: after seeing only the beginning of a movie, predict whether two cued objects will touch.

People did remarkably well (~75% accurate) with very little training, suggesting that they rapidly apply general physical knowledge.
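
To make the task concrete, here is a minimal sketch of how yes/no contact predictions could be scored against what actually happened in the simulation. The `Trial` record, the `accuracy` helper, and the stimulus names are hypothetical, invented for illustration; they are not the Physion evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    stimulus_id: str          # hypothetical stimulus name
    predicted_contact: bool   # observer's answer after seeing the start of the movie
    actual_contact: bool      # whether the cued objects touched in the full simulation


def accuracy(trials):
    """Fraction of trials where the prediction matches the simulated outcome."""
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(t.predicted_contact == t.actual_contact for t in trials)
    return correct / len(trials)


# Made-up responses, for illustration only.
trials = [
    Trial("scenario_a_01", True, True),
    Trial("scenario_a_02", False, True),
    Trial("scenario_b_01", False, False),
    Trial("scenario_c_01", True, True),
]
print(accuracy(trials))  # 3 of 4 correct -> 0.75
```

The same scoring applies to people and to models, which is what makes the two directly comparable on this task.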

What about models? We tested a diverse set of architectures, input types, learning objectives, training datasets, and prediction mechanisms.

These included ConvNets, Transformers, object-centric models, and Graph Networks optimized and tested under multiple protocols.

It turns out only one model class comes close to making human-like predictions: Graph Networks that take the *simulator's true physical state* as input, treating it as a collection of "particles" as in https://t.co/akAoHRjVp2. They're supervised to learn the particles' dynamics.
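
For readers unfamiliar with particle-based Graph Networks, the sketch below shows the general shape of one message-passing step: build edges between nearby particles, compute a message per edge, sum messages at each receiver, and update the state. It is an illustration under strong assumptions, not the models we tested: in a real Graph Network the `message` and update functions are learned neural networks, whereas here a hand-written repulsion stands in for them, and the `radius` and `dt` values are arbitrary.

```python
import math


def build_edges(positions, radius):
    """Connect every ordered pair of distinct particles closer than `radius`."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) < radius:
                edges.append((i, j))  # message flows from sender i to receiver j
    return edges


def message(sender_pos, receiver_pos, radius):
    """Stand-in for a learned edge function: a push away from the sender
    that grows as the two particles get closer."""
    d = math.dist(sender_pos, receiver_pos)
    if d == 0.0:
        return [0.0] * len(sender_pos)  # coincident particles: no defined direction
    strength = (radius - d) / radius
    return [strength * (r - s) / d for s, r in zip(sender_pos, receiver_pos)]


def step(positions, velocities, radius=1.0, dt=0.1):
    """One message-passing step followed by a simple Euler update."""
    dim = len(positions[0])
    aggregated = [[0.0] * dim for _ in positions]
    for i, j in build_edges(positions, radius):
        m = message(positions[i], positions[j], radius)
        for k in range(dim):
            aggregated[j][k] += m[k]  # sum incoming messages at the receiver
    new_vel = [[v[k] + dt * a[k] for k in range(dim)]
               for v, a in zip(velocities, aggregated)]
    new_pos = [[p[k] + dt * v[k] for k in range(dim)]
               for p, v in zip(positions, new_vel)]
    return new_pos, new_vel


# Two nearby particles and one far away (made-up 2D state).
positions = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
velocities = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_pos, new_vel = step(positions, velocities)
print(new_vel)
```

In this toy state the two close particles push each other apart, while the distant one receives no messages and stays at rest. Supervised training would replace the hand-written functions with networks fit to the simulator's ground-truth particle trajectories.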

This accords well with earlier work (e.g. https://t.co/ZovcAAFxp8) suggesting "mental simulation" of objects and particles can account for physical judgments (though see https://t.co/wyyuSY2oJk). Of course, particle-based models cheat in a major way: they don't take visual input!

When we test *vision-based* models on #Physion, we find that none make human-level predictions; they range from chance to ~60% accurate overall, and make very different mistakes from the ones people make!
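
Overall accuracy does not reveal whether a model errs on the same stimuli as people. One common way to check, sketched here as an assumption rather than as the exact analysis in the preprint, is to correlate each model's per-stimulus output with the fraction of people who answered "yes" on that stimulus. The `pearson` helper and all numbers below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Fraction of people answering "yes, they will touch" on four stimuli (made up).
human_yes_rate = [0.9, 0.2, 0.7, 0.4]

# Predicted contact probabilities from two hypothetical models.
model_a = [0.8, 0.3, 0.6, 0.5]  # rises and falls with the human responses
model_b = [0.3, 0.8, 0.5, 0.6]  # moves in the opposite direction

print(pearson(human_yes_rate, model_a))
print(pearson(human_yes_rate, model_b))
```

A model whose outputs track human responses stimulus by stimulus gets a positive correlation; one that is confident exactly where people are doubtful gets a negative one, even if both models reach similar overall accuracy.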

This suggests that vision models have a long way to go before they capture human-like intuitive physics. But our results hint at promising directions: the better vision models either learned *object-centric representations* or were pretrained on (supervised) object recognition.

Putting it all together, we think one way forward is to build models that (A) extract physically explicit, object-centric representations of scenes from visual input (perhaps via pretraining); then (B) simulate the dynamics of physically diverse scenarios with a Graph Network.

But more importantly, we hope #Physion illuminates and motivates progress on a critical unsolved problem in Computer Vision and AI: the prediction of key physical events in a complex world.

And if you think we should test another model or build a new scenario, let us know!

You can read more about the work, download the #Physion dataset, and learn how to create physical scenarios here: https://t.co/sCup47RjX6