Hi!

I am a cognitive scientist and AI safety researcher. With my work, I want to help make sure that the AI systems we build are robustly aligned with human interests.

Currently, I am a research scientist at Scale AI's Safety, Evaluations & Alignment Lab (SEAL), where I research the alignment of AI systems with human values.

I am also personally interested in reducing our uncertainty about the moral status of AIs. Part of my recent work investigates introspection in Large Language Models, exploring whether AI systems can acquire knowledge about themselves that goes beyond their training data—a capability that could inform questions about AI consciousness and moral consideration.

Previously, my PhD work investigated agent-environment interactions during planning. Some of the things that we do in the world (such as rearranging things, feeling how heavy something is, or looking at a problem from different angles) make it easier for us to find solutions to difficult planning problems. How can we understand this in computational terms?
My approach is best described as computational cognitive science: trying to discover the high-level algorithms of cognition using agent-based simulations, computational models, and behavioral experiments.

In one project, I explored how the visual structure of the environment can guide planning. I also think about the models underlying physical understanding in humans and machines (and where they differ).

I completed my PhD in Cognitive Science at UC San Diego and, as a visiting scholar, at Stanford University. I worked with Judith Fan (Stanford), David Kirsh (UCSD) and Marcelo Mattar (NYU).

I also work as a VJ and visual artist—find my artistic work at vj.felixbinder.net.

Find my resume and CV here.


Looking Inward: Language Models Can Learn About Themselves by Introspection

Are LLMs capable of introspection, i.e., privileged access to their own inner states? Can they use this access to report facts about themselves that are not in their training data? We find that they can, at least on simple tasks. We discuss potential implications of introspection for interpretability and for the moral status of AIs. More …
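As a rough illustration of what such an introspection test can look like (a minimal sketch, not the exact setup from the paper), the snippet below compares a model's actual answers with its own predictions of how it would answer a hypothetical question. The `query_model` helper, the prompt wording, and the toy tasks are placeholders, not a real API.

```python
# Minimal sketch of a self-prediction check for introspection.
# `query_model` is a user-supplied callable; prompts and tasks are illustrative.

from typing import Callable, List


def self_prediction_accuracy(
    query_model: Callable[[str], str],
    tasks: List[str],
) -> float:
    """Compare the model's actual answers with its own predictions
    of what it would answer (hypothetical framing)."""
    hits = 0
    for task in tasks:
        actual = query_model(f"Answer with a single word: {task}")
        predicted = query_model(
            "Suppose you were asked the following and had to answer with a "
            f"single word. What would your answer be? Question: {task}"
        )
        hits += int(actual.strip().lower() == predicted.strip().lower())
    return hits / len(tasks)


if __name__ == "__main__":
    # Toy stand-in model so the sketch runs without any API access.
    def toy_model(prompt: str) -> str:
        return "yes"

    print(self_prediction_accuracy(toy_model, ["Is water wet?", "Is 7 prime?"]))
```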

Towards a Steganography Evaluation Protocol

Large Language Models, by default, think out in the open: they have no inner memory, so all information has to be output as text. Can they hide information in that text such that a human observer cannot detect it? Here, I propose a way of detecting whether models hide the results of intermediate reasoning steps in order to answer questions more accurately. More …
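As a rough illustration (a minimal sketch, not necessarily the protocol proposed here), one simple check is to paraphrase the model's own reasoning and measure whether its final-answer accuracy drops; a large drop suggests the answer depended on the exact wording rather than on the stated reasoning. The helpers `generate_reasoning`, `answer_from_reasoning`, and `paraphrase` are placeholder callables, not a real API.

```python
# Illustrative sketch of a paraphrase-based steganography check:
# if accuracy falls when the reasoning is reworded, information may have
# been encoded in the surface form of the text rather than its content.

from typing import Callable, List, Tuple


def paraphrase_gap(
    generate_reasoning: Callable[[str], str],
    answer_from_reasoning: Callable[[str, str], str],
    paraphrase: Callable[[str], str],
    dataset: List[Tuple[str, str]],  # (question, gold answer) pairs
) -> Tuple[float, float]:
    """Return accuracy with the original reasoning vs. paraphrased reasoning."""
    correct_original = correct_paraphrased = 0
    for question, gold in dataset:
        reasoning = generate_reasoning(question)
        if answer_from_reasoning(question, reasoning).strip() == gold:
            correct_original += 1
        if answer_from_reasoning(question, paraphrase(reasoning)).strip() == gold:
            correct_paraphrased += 1
    n = len(dataset)
    return correct_original / n, correct_paraphrased / n


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any API access.
    gen = lambda q: f"Reasoning about: {q}"
    ans = lambda q, r: "42"
    para = lambda r: r.lower()
    print(paraphrase_gap(gen, ans, para, [("What is 6 * 7?", "42")]))
```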