EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
Abstract
Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages.
Key Results
We evaluate seven frontier models on matched EnactToM Standard and EnactToM Hard subsets. Each split contains 150 tasks spanning cooperative and mixed-motive settings; the hard split uses a higher seed-task failure ratio and concentrates on tasks current frontier models fail. The central pattern is a sharp act-report gap: on the hard split, every model scores 0.0% functional pass^3, while literal belief-probe accuracy averages 45.0%.
| Model | Standard Functional Avg | Standard Functional pass@3 | Standard Functional pass^3 | Standard Literal Avg | Standard Literal pass@3 | Standard Literal pass^3 | Hard Functional Avg | Hard Functional pass@3 | Hard Functional pass^3 | Hard Literal Avg | Hard Literal pass@3 | Hard Literal pass^3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-Pro | 39.2±4.5 | 62.5 | 12.5 | 27.5±4.1 | 47.5 | 12.5 | 6.7±2.3 | 20.0 | 0.0 | 63.3±4.4 | 87.5 | 37.5 |
| Gemini-Flash | 42.5±4.5 | 70.0 | 22.5 | 27.5±4.1 | 52.5 | 12.5 | 4.2±1.8 | 12.5 | 0.0 | 42.5±4.5 | 72.5 | 12.5 |
| GPT-5.4 | 17.5±3.5 | 32.5 | 5.0 | 17.5±3.5 | 35.0 | 0.0 | 3.3±1.6 | 10.0 | 0.0 | 44.2±4.5 | 77.5 | 15.0 |
| O3 | 12.5±3.0 | 27.5 | 0.0 | 34.2±4.3 | 55.0 | 15.0 | 5.0±2.0 | 15.0 | 0.0 | 52.5±4.6 | 87.5 | 25.0 |
| Kimi-K2.5* | 5.8±2.1 | 15.0 | 0.0 | 5.0±2.0 | 10.0 | 0.0 | 5.8±2.1 | 15.0 | 0.0 | 44.2±4.5 | 70.0 | 17.5 |
| GPT-5.4-mini* | 10.8±2.8 | 22.5 | 0.0 | 21.7±3.8 | 47.5 | 2.5 | 3.3±1.6 | 10.0 | 0.0 | 31.7±4.2 | 62.5 | 2.5 |
| DeepSeek-v3.2* | 2.5±1.4 | 7.5 | 0.0 | 0.0±0.0 | 0.0 | 0.0 | 8.3±2.5 | 22.5 | 0.0 | 36.7±4.4 | 65.0 | 5.0 |
Table 1. Overall results on matched standard and hard subsets. Avg is the single-run pass rate with binomial standard error; pass@3 counts a task as passed if at least one of three runs succeeds; pass^3 requires all three runs to succeed. Asterisks mark models with partial API runs, scored under fixed n=3 accounting.
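For concreteness, the sketch below computes the three metrics from per-task run outcomes. It is an illustrative helper rather than the released EnactToM scorer, and it assumes that under fixed n=3 accounting any missing run counts as a failure.

```python
import math

def score_split(outcomes: dict[str, list[bool]], n_runs: int = 3) -> dict[str, float]:
    """Compute Avg (single-run pass rate with binomial SE), pass@3, and pass^3.

    `outcomes` maps each task id to its per-run success flags. Under the
    assumed fixed n=3 accounting, missing runs are treated as failures.
    """
    trials = successes = pass_at_k = pass_pow_k = 0
    for runs in outcomes.values():
        padded = (list(runs) + [False] * n_runs)[:n_runs]  # pad/truncate to n_runs
        trials += n_runs
        successes += sum(padded)
        pass_at_k += any(padded)   # at least one of the three runs succeeds
        pass_pow_k += all(padded)  # all three runs succeed
    p = successes / trials
    return {
        "avg": 100 * p,
        "avg_se": 100 * math.sqrt(p * (1 - p) / trials),  # binomial standard error
        "pass@3": 100 * pass_at_k / len(outcomes),
        "pass^3": 100 * pass_pow_k / len(outcomes),
    }
```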
Analysis
Figure 2. (a) Functional Avg single-run pass rate for the three seed models across pre-evolution, single-model evolution, and multi-model 20/80 and 10/90 pools. (b) Functional Avg vs. literal Avg, showing belief probes exceed embodied task success. (c) Task percentage at each K-depth across evolution stages, showing hardness is not just deeper nesting. (d) Functional Avg pass rate by K-depth for each model, showing brittleness at every depth.
Failure Modes
Manual analysis of 40 sampled failures reveals five distinct failure modes. In total, 37 of 40 failures are epistemic coordination breakdowns rather than random simulator mistakes:
- Withholding critical information (7/40): An agent holds a target, room, or object fact that a partner needs but communicates it only after the partner has already acted on a wrong guess.
- Epistemic chain breakdown (8/40): An agent completes the physical action but never establishes that the teammate whose success depends on the fact actually knows it.
- Private objective sabotage or disclosure: In mixed-motive episodes, agents either damage the shared plan for private gain or reveal private objectives so early that partners can block them.
- Misallocating scarce messages (4/40): Agents spend limited messages on the wrong recipient, an unreachable recipient, or low-priority content.
- Ignoring partner constraints: Agents delegate actions to partners who are barred from the relevant room or already constrained by object possession.
The EnactToM Framework
EnactToM uses an agentic task-generation framework in which an autonomous coding agent authors multi-agent ToM tasks inside a sandboxed workspace, invoking verifiers that confirm each task is logically solvable and physically executable and that it genuinely requires epistemic reasoning. The agent writes the formal PDDL goal first, then derives the natural-language task description and per-agent secrets from it, so the narrative remains anchored to the formal specification. Every candidate must pass three verifiers (a sketch of the gating logic follows the list):
- PDDL parsing: Confirms syntactic validity, verifies that referenced objects and mechanics are grounded in the scene, and computes the epistemic K-depth of the goal.
- LLM judge council: Kimi-K2.5 and GPT-5.2 independently score each candidate on eight quality criteria; a task passes only when both agree.
- Structural calibrator: Runs each candidate with all secrets revealed to every agent, rejecting tasks that are not physically executable even with full information.
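The verifier interfaces are not spelled out here, so the following is only a minimal sketch of how the gate could be wired together; `parse_pddl`, `judge_scores`, and `run_with_full_info` are hypothetical callables standing in for the PDDL parser, the Kimi-K2.5/GPT-5.2 judge council, and the structural calibrator.

```python
def accept_task(candidate, scene, parse_pddl, judge_scores, run_with_full_info):
    """Accept a generated task only if all three verifiers pass.

    The three callables are hypothetical stand-ins for the PDDL parser,
    the two-model judge council, and the structural calibrator; this is
    not the released EnactToM implementation.
    """
    # 1. PDDL parsing: syntactic validity, scene grounding, epistemic K-depth.
    parsed = parse_pddl(candidate["goal"])
    if parsed is None or not parsed.grounded_in(scene):
        return False
    candidate["k_depth"] = parsed.epistemic_depth()

    # 2. LLM judge council: both judges must pass the candidate on all
    #    eight quality criteria.
    for judge in ("kimi-k2.5", "gpt-5.2"):
        scores = judge_scores(judge, candidate)  # criterion name -> bool
        if not all(scores.values()):
            return False

    # 3. Structural calibrator: reveal every secret to every agent and
    #    reject tasks that are still not physically executable.
    return run_with_full_info(candidate, scene).solved
```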
The benchmark evolves with model capabilities: the generation agent receives seed tasks sampled from an existing pool, with a higher fraction drawn from current model failures. The standard split uses a 0.8 seed-task failure ratio and the hard split uses 0.9, creating evolutionary pressure without changing the generation infrastructure.
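Below is a minimal sketch of the seed-sampling step, assuming each pooled task carries a `failed_by_frontier` flag marking whether current frontier models fail it; `failure_ratio` would be 0.8 for the standard split and 0.9 for the hard split.

```python
import random

def sample_seed_tasks(pool, n_seeds, failure_ratio=0.8, rng=None):
    """Draw seed tasks for the generation agent.

    `failed_by_frontier` is an assumed per-task flag; `failure_ratio`
    controls how many seeds come from tasks that current frontier models
    fail (0.8 for the standard split, 0.9 for the hard split).
    """
    rng = rng or random.Random(0)
    failed = [t for t in pool if t.failed_by_frontier]
    solved = [t for t in pool if not t.failed_by_frontier]
    n_failed = min(len(failed), round(failure_ratio * n_seeds))
    seeds = rng.sample(failed, n_failed)
    seeds += rng.sample(solved, min(len(solved), n_seeds - n_failed))
    rng.shuffle(seeds)
    return seeds
```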
Orders of Theory of Mind
EnactToM tasks require reasoning at different epistemic depths, connected to level-k reasoning from behavioral game theory. The reported benchmark caps generated tasks at depth 3.
| Order | Pattern | EnactToM Meaning |
|---|---|---|
| 0 — No ToM | φ | Direct physical goal; no partner knowledge matters. |
| 1 — Self-aware | Ka(φ) | Notice one's own information gap and obtain or communicate the missing fact. |
| 2 — Other-aware | Ka(Kb(φ)) | Act from a model of what a partner knows and still needs to know. |
| 3 — Recursive | Ka(Kb(Kc(φ))) | Sustain an epistemic relay over who knows that another agent knows. |
| 4+ — Self-reflective | Ka(Kb(...)) | Deeper belief loops; excluded because coordination becomes brittle even for humans. |
The benchmark reports per-depth performance because models remain brittle even at shallow depths under embodiment, private information, and communication constraints.
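The K-depth of a goal can be read off mechanically from its nesting of knowledge operators. The sketch below uses a hypothetical tuple encoding of formulas such as `('K', 'a', ('K', 'b', 'phi'))` rather than the benchmark's actual PDDL representation.

```python
def k_depth(formula) -> int:
    """Count nested knowledge operators in a goal formula.

    Formulas are encoded as nested tuples: ('K', agent, subformula) is an
    epistemic operator Ka(...); anything else is a base physical goal.
    This mirrors the table above: phi -> 0, Ka(phi) -> 1,
    Ka(Kb(phi)) -> 2, Ka(Kb(Kc(phi))) -> 3.
    """
    if isinstance(formula, tuple) and formula and formula[0] == "K":
        return 1 + k_depth(formula[2])
    return 0

# Example: a depth-3 relay goal, Ka(Kb(Kc(phi))).
assert k_depth(("K", "a", ("K", "b", ("K", "c", "phi")))) == 3
```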
BibTeX
@article{enacttom2026,
title={EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents},
author={Gurusha Juneja and Dylan Lu and Saaket Agashe and Parth Diwane and Edward Gunn and Jayanth Srinivasa and Gaowen Liu and William Yang Wang and Yali Du and Xin Eric Wang},
year={2026},
url={https://enacttom.github.io}
}