Pixels, Patterns, but No Poetry: To See the World like Humans

A Preliminary Version

1University of Chinese Academy of Sciences, 2Nanjing University, 3National University of Singapore,
4Beijing University of Posts and Telecommunications, 5Nankai University, 6The Pennsylvania State University, 7Peking University, 8Beijing Jiaotong University

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: ***Can Multimodal Large Language Models truly perceive the world as humans do?*** This paper shifts the focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the [Turing Eye Test (TET)](https://huggingface.co/datasets/HongchengGao/TuringEyeTest), a **challenging perception-oriented benchmark** comprising four diagnostic tasks that evaluate MLLMs on synthetic images that **humans process intuitively**. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training on the language backbone—effective for previous benchmarks—fail to improve performance on our tasks, whereas fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone—a key gap between current MLLMs and human perception. ***This is a preliminary version that contains only a subset of TET tasks. We will release the full set of TET with more diverse tasks and explore methods to improve visual generalization in the next version.***
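For readers who want to inspect the benchmark images directly, the following is a minimal sketch for loading TET from the Hugging Face Hub with the `datasets` library. The config/split structure and column names are assumptions here; check the dataset card linked above for the exact schema.

```python
# Minimal sketch: load the TET benchmark and inspect one sample.
# The config/split structure and field names are assumptions; consult the dataset
# card at https://huggingface.co/datasets/HongchengGao/TuringEyeTest.
from datasets import load_dataset

tet = load_dataset("HongchengGao/TuringEyeTest")  # a per-task config name may be required
print(tet)                                        # lists the available splits

first_split = next(iter(tet.values()))
sample = first_split[0]
print(sample.keys())                              # e.g., an image plus its ground-truth text
```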

Four Diagnostic Tasks

We create four specialized datasets to probe the limits of VLM perception, each designed to target a specific aspect of visual understanding that humans excel at but current models struggle with:
🔍 HiddenText

Scale-variant images in which text is rendered as shapes within a larger figure: the content reads as text when viewed at a reduced scale and resolves into a complete image when magnified. (150 images)

🎯 3DCaptcha

Recognition challenges constructed from curved characters rendered in three-dimensional space, testing spatial reasoning capabilities. (150 captchas)

🌈 ColorBlind

Similar to Ishihara color-vision tests, but augmented with confounding colored dots that are chromatically similar to the central characters to increase difficulty. (150 test images; a toy construction sketch follows the task list below.)

📖 ChineseLigatures

Intricate glyphs synthesized through character decomposition, morphological transformation, and the fusion of multiple Chinese characters into different words or phrases.
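As an illustration of the kind of stimuli the ColorBlind task targets, below is a toy generator for an Ishihara-style image with chromatically similar confounding dots. This is not the authors' data pipeline; the character, colors, dot density, and font path are arbitrary choices for demonstration.

```python
# Toy Ishihara-style image: a character rendered as dots whose colors are deliberately
# close to those of the surrounding confounder dots. Not the authors' pipeline.
import random
from PIL import Image, ImageDraw, ImageFont

W, H = 512, 512
img = Image.new("RGB", (W, H), "white")

# Render the target character onto a mask so dots inside/outside it use different palettes.
mask = Image.new("L", (W, H), 0)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 360)     # assumed-available font
ImageDraw.Draw(mask).text((110, 60), "7", fill=255, font=font)

draw = ImageDraw.Draw(img)
char_colors = [(205, 95, 60), (215, 110, 70)]             # dots forming the character
noise_colors = [(195, 120, 80), (185, 100, 90)]           # chromatically similar confounders
for _ in range(6000):
    x, y = random.randrange(W), random.randrange(H)
    r = random.randint(3, 7)
    palette = char_colors if mask.getpixel((x, y)) > 128 else noise_colors
    draw.ellipse((x - r, y - r, x + r, y + r), fill=random.choice(palette))

img.save("colorblind_toy.png")
```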

The Great Vision Blind Spot

We conduct extensive experiments on 15 models across different architectures and types. Our comprehensive evaluation reveals a striking pattern: ***all VLMs fail consistently across all four diagnostic tasks***. This universal failure across different model architectures, scales, and training paradigms suggests fundamental limitations in current approaches to visual understanding.
| Model | HiddenText Pass@1 | HiddenText Pass@32 | 3DCaptcha Pass@1 | 3DCaptcha Pass@32 | ColorBlind Pass@1 | ColorBlind Pass@32 | ChineseLigatures Pass@1 | ChineseLigatures Pass@32 |
|---|---|---|---|---|---|---|---|---|
| OpenAI o1 | 0 | 0 | 0 | 0 | 0 | 1.33 | 0 | 0 |
| Claude-4-Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 5.0 |
| Seed-1-6-250615 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 2.5 |
| Qwen2.5VL-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Qwen2.5VL-7B | 0 | 0.67 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| QVQ-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Qwen2.5-Omni-7B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| InternVL3-78B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MiniCPM-o-2.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| Show-o2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Bagel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Janus-Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kimi-VL-A3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0 |
| Kimi-VL-A3B-Thinking | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Pass@1 and Pass@32 performance (%) of 15 MLLMs on the four TET tasks. The results show uniformly poor performance across all models and tasks, highlighting the severity of the vision blind spot.
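For reference, here is a minimal sketch of how Pass@1 and Pass@32 scores of this form can be computed, under the assumption that an item counts as solved when any of its first k sampled responses exactly matches the ground truth; the authors' exact grading rule (e.g., answer normalization or partial matching) may differ.

```python
# Minimal Pass@k sketch: an item is "solved" if any of its first k responses is correct.
# Exact-match grading is an assumption; the paper's grading rule may differ.
from typing import List

def pass_at_k(responses_per_item: List[List[str]], answers: List[str], k: int) -> float:
    solved = 0
    for responses, gold in zip(responses_per_item, answers):
        if any(r.strip() == gold.strip() for r in responses[:k]):
            solved += 1
    return 100.0 * solved / len(answers)   # reported as a percentage

# Pass@1 uses the first sample per item; Pass@32 uses all 32 samples.
# score_1 = pass_at_k(responses, answers, k=1)
# score_32 = pass_at_k(responses, answers, k=32)
```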

Attention Analysis

To explore why models cannot perceive images accurately, we conduct an analysis of attention patterns using Grad-CAM visualization. Our systematic examination of two representative models from the Qwen2.5-VL series (7B and 72B parameters) reveals **fundamental limitations in visual attention mechanisms** that explain the catastrophic failures on our perceptual tasks.
🔍 Information Flow in Visual Encoder

While the ViT allocates attention across various regions of the image, this attention is often directed outside the target character regions or captures only partial segments of them. The image encoder struggles to focus on the textural features corresponding to character regions and instead prioritizes object-level features within the image. Such disparities in visual attention prevent the model from truly comprehending the image content.

🧠 Information Flow in Language Decoder

The LLM decoders consistently fail to focus on the precise regions containing text or character information; instead, they scatter attention over irrelevant regions or ignore critical visual elements entirely. This mismatch between the attention patterns and the actual locations of important visual features indicates fundamental limitations in visual perception ability.
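To make the analysis procedure concrete, below is a rough Grad-CAM-style sketch that computes a per-patch relevance map for a generic ViT classifier from `timm`. It is only illustrative: the paper's analysis targets the Qwen2.5-VL vision tower and language decoder, which require hooking that model's own modules, and the model name, hook point, and image path here are placeholders.

```python
# Illustrative Grad-CAM-style relevance map for a generic ViT classifier.
# Model, hook point, and image path are placeholders, not the paper's exact setup.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

feats, grads = {}, {}
target = model.blocks[-1].norm1              # assumed hook point in the last block

def fwd_hook(_, __, output):
    feats["act"] = output                    # (B, 1 + N, D): CLS + patch tokens
    output.register_hook(lambda g: grads.update(act=g))

handle = target.register_forward_hook(fwd_hook)

img = Image.open("hiddentext_example.png").convert("RGB")   # hypothetical TET image
x = transform(img).unsqueeze(0)
logits = model(x)
logits[0, logits.argmax(-1)].backward()      # back-propagate the top-class score

act = feats["act"][0, 1:].detach()           # drop the CLS token -> (N, D)
grad = grads["act"][0, 1:]
weights = grad.mean(dim=0)                   # Grad-CAM channel weights
cam = torch.relu(act @ weights)              # per-patch relevance, shape (N,)
side = int(cam.numel() ** 0.5)
cam = cam.reshape(side, side)                # 14x14 map for a 224/16 ViT
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
handle.remove()
```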

Attention Visualization Analysis

Perception vs. Reasoning

To investigate whether a lack of domain knowledge causes these failures, we conduct SFT experiments on Qwen2.5-VL-7B using five training configurations that target different model components.
Training Loss Comparison

Training loss curves for different fine-tuning configurations, on both our tasks and traditional tasks.

Our results reveal a fundamental distinction between perception-oriented and reasoning-oriented tasks:

- ***TET tasks***: only configurations that include vision-encoder fine-tuning achieve effective learning, while language-only training plateaus quickly.
- ***Traditional benchmarks*** (OCRVQA, GEOQA, CLEVR): all training configurations converge efficiently to similar performance.

This demonstrates that our benchmark poses **perception challenges** that require genuine visual understanding, rather than reasoning challenges that can be solved through language enhancement alone. The image domains of existing benchmarks are typically covered by pre-training data, so they require only improvements in semantic understanding—a gap our benchmark is designed to address.
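As a concrete illustration of the kind of configuration being compared, the sketch below toggles which components of a Qwen2.5-VL checkpoint are trainable when loaded with `transformers`. The parameter-name check (`"visual"` in the name) follows the publicly released Qwen2.5-VL module layout but should be verified against the installed version; the dataset, optimizer, and training loop are omitted.

```python
# Minimal sketch: choose which components of Qwen2.5-VL to fine-tune.
# Module naming ("visual" for the vision tower) is an assumption based on the public
# implementation; verify against your installed transformers version.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

def set_trainable(model, train_vision: bool, train_llm: bool) -> None:
    """Freeze or unfreeze parameters by component."""
    for name, param in model.named_parameters():
        is_vision = "visual" in name           # vision tower (ViT + patch merger)
        param.requires_grad = train_vision if is_vision else train_llm

# Language-only SFT (the configuration that plateaus quickly on TET):
set_trainable(model, train_vision=False, train_llm=True)
# Vision-tower fine-tuning (the configuration that adapts rapidly on TET):
set_trainable(model, train_vision=True, train_llm=False)
```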

Citation

@misc{gao2025pixelspatternspoetryworld,
      title={Pixels, Patterns, but No Poetry: To See The World like Humans}, 
      author={Hongcheng Gao and Zihao Huang and Lin Xu and Jingyi Tang and Xinhao Li and Yue Liu and Haoyang Li and Taihang Hu and Minhua Lin and Xinlong Yang and Ge Wu and Balong Bi and Hongyu Chen and Wentao Zhang},
      year={2025},
      eprint={2507.16863},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16863}
}