ICML 2026 Accept

GeoSense

Internalizing Geometric Necessity Perception for Multimodal Reasoning

Ruiheng Liu1,*, Haihong Hao1,*, Mingfei Han1,2,*, Xin Gu3, Kecheng Zhang1, Changlin Li4, Xiaojun Chang1,2,✉

1 University of Science and Technology of China · 2 Mohamed bin Zayed University of Artificial Intelligence · 3 University of Chinese Academy of Sciences · 4 Stanford University

* Equal contribution · Corresponding author

GeoSense adaptive geometric reasoning teaser
GeoSense learns when visual reasoning needs 3D geometry and when the geometry channel should remain closed.

Motivation

Geometry should be requested, not blindly injected.

Existing spatial MLLMs often fuse 3D signals into every input, even when the task can be solved from standard 2D visual cues. GeoSense separates geometry into an on-demand channel and trains the model to emit an internal trigger, <vggt>, only when geometric information is necessary for the reasoning step.

56.6 Spatial benchmark average reported for GeoSense.
55.9 Overall average across spatial and general visual tasks.
35.68% Average geometry activation rate in the adaptive ablation setting.
117K Task-aware samples used to teach geometry activation and suppression.

Method

A decoupled architecture with an internal sense decision.

Separate geometry channel

GeoSense keeps the 2D visual encoder and the 3D geometry encoder as separate feature sources, avoiding always-on element-wise fusion.

Two-stage training

Alignment training first makes projected geometry tokens interpretable, then spatial-aware SFT teaches the model when to request them.

Adaptive inference

If the model emits <vggt>, it performs a geometry-aware second pass. Otherwise, it preserves native 2D visual reasoning.

GeoSense architecture overview
Architecture overview. The model first reasons from text and 2D visual tokens, then requests 3D geometry embeddings only when the internal decision indicates insufficiency.

Results

Spatial gains without collapsing general visual reasoning.

GeoSense is evaluated across spatial reasoning benchmarks and general multimodal reasoning benchmarks. The reported comparison emphasizes both spatial improvement and robustness when geometry is unnecessary.

The adaptive design outperforms always-on geometry fusion while using geometric features for only part of the evaluation samples.

Model Spatial Avg. General Avg. Overall Avg.
Qwen2.5-VL-3B 43.4 53.3 48.3
Qwen2.5-VL-7B 50.5 57.8 54.1
VG-LLM 49.7 52.0 50.9
GeoSense 56.6 55.2 55.9

Visualizations

Qualitative behavior across geometry-needed and geometry-free cases.

GeoSense qualitative examples

Qualitative examples show that GeoSense can trigger geometry for directional and metric reasoning while keeping standard reasoning paths for cases that do not need 3D cues.

Adaptive use of 3D geometry with GeoSense

Compared with rigid geometry usage, GeoSense treats 3D features as a conditional resource selected by the model's internal sense decision.

Resources

Code, model checkpoint, and citation.

Paper

Read the current manuscript on arXiv.

Open arXiv

Code

Training, evaluation, and data-preparation scripts are available on GitHub.

Open GitHub

Model

The released GeoSense checkpoint is hosted on Hugging Face.

Open Hugging Face
@article{geosense2026,
  title={GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning},
  author={Ruiheng Liu and Haihong Hao and Mingfei Han and Xin Gu and Kecheng Zhang and Changlin Li and Xiaojun Chang},
  journal={Under review by the International Conference on Machine Learning (ICML)},
  year={2026}
}