Pixels, Patterns, but No Poetry: To See the World like Humans

A Preliminary Version

1University of Chinese Academy of Sciences, 2Nanjing University, 3National University of Singapore,
4Beijing University of Posts and Telecommunications, 5Nankai University, 6The Pennsylvania State University, 7Peking University, 8Beijing Jiaotong University

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: ***Can Multimodal Large Language Models truly perceive the world as humans do?*** This paper shifts the focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the [Turing Eye Test (TET)](https://huggingface.co/datasets/HongchengGao/TuringEyeTest), a **challenging perception-oriented benchmark** comprising four diagnostic tasks that evaluate MLLMs on synthetic images that **humans process intuitively**. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training on the language backbone—effective for previous benchmarks—fail to improve performance on our tasks, whereas fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone—a key gap between current MLLMs and human perception. ***This is a preliminary version that contains only a subset of TET tasks. We will release the full set of TET with more diverse tasks and explore methods to improve visual generalization in the next version.***
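For readers who want to inspect the benchmark images directly, the following is a minimal sketch for loading TET from the Hugging Face Hub with the `datasets` library. The config/split structure and column names are assumptions here; check the dataset card linked above for the exact schema.

```python
# Minimal sketch: load the TET benchmark and inspect one sample.
# The config/split structure and field names are assumptions; consult the dataset
# card at https://huggingface.co/datasets/HongchengGao/TuringEyeTest.
from datasets import load_dataset

tet = load_dataset("HongchengGao/TuringEyeTest")  # a per-task config name may be required
print(tet)                                        # lists the available splits

first_split = next(iter(tet.values()))
sample = first_split[0]
print(sample.keys())                              # e.g., an image plus its ground-truth text
```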

Four Diagnostic Tasks

We create four specialized datasets to probe the limits of VLM perception, each designed to target a specific aspect of visual understanding that humans excel at but current models struggle with:
🔍 HiddenText

Scale-variant images in which text is rendered as shapes within a larger figure: the content reads as text when viewed at a reduced scale and resolves into a complete image when magnified. (150 images)

🎯 3DCaptcha

Recognition challenges constructed from curved characters rendered in three-dimensional space, testing spatial reasoning capabilities. (150 captchas)

🌈 ColorBlind

Similar to Ishihara color-vision tests, but augmented with confounding colored dots that are chromatically similar to the central characters to increase difficulty. (150 test images; a toy construction sketch follows the task list below.)

📖 ChineseLigatures

Intricate glyphs synthesized through character decomposition, morphological transformation, and the fusion of multiple Chinese characters into different words or phrases.
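As an illustration of the kind of stimuli the ColorBlind task targets, below is a toy generator for an Ishihara-style image with chromatically similar confounding dots. This is not the authors' data pipeline; the character, colors, dot density, and font path are arbitrary choices for demonstration.

```python
# Toy Ishihara-style image: a character rendered as dots whose colors are deliberately
# close to those of the surrounding confounder dots. Not the authors' pipeline.
import random
from PIL import Image, ImageDraw, ImageFont

W, H = 512, 512
img = Image.new("RGB", (W, H), "white")

# Render the target character onto a mask so dots inside/outside it use different palettes.
mask = Image.new("L", (W, H), 0)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 360)     # assumed-available font
ImageDraw.Draw(mask).text((110, 60), "7", fill=255, font=font)

draw = ImageDraw.Draw(img)
char_colors = [(205, 95, 60), (215, 110, 70)]             # dots forming the character
noise_colors = [(195, 120, 80), (185, 100, 90)]           # chromatically similar confounders
for _ in range(6000):
    x, y = random.randrange(W), random.randrange(H)
    r = random.randint(3, 7)
    palette = char_colors if mask.getpixel((x, y)) > 128 else noise_colors
    draw.ellipse((x - r, y - r, x + r, y + r), fill=random.choice(palette))

img.save("colorblind_toy.png")
```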

The Great Vision Blind Spot

We conduct extensive experiments on 15 models across different architectures and types. Our comprehensive evaluation reveals a striking pattern: ***all VLMs fail consistently across all four diagnostic tasks***. This universal failure across different model architectures, scales, and training paradigms suggests fundamental limitations in current approaches to visual understanding.
| Model | HiddenText Pass@1 | HiddenText Pass@32 | 3DCaptcha Pass@1 | 3DCaptcha Pass@32 | ColorBlind Pass@1 | ColorBlind Pass@32 | ChineseLigatures Pass@1 | ChineseLigatures Pass@32 |
|---|---|---|---|---|---|---|---|---|
| OpenAI o1 | 0 | 0 | 0 | 0 | 0 | 1.33 | 0 | 0 |
| Claude-4-Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 5.0 |
| Seed-1-6-250615 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 2.5 |
| Qwen2.5VL-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Qwen2.5VL-7B | 0 | 0.67 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| QVQ-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Qwen2.5-Omni-7B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| InternVL3-78B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MiniCPM-o-2.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
| Show-o2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Bagel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Janus-Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kimi-VL-A3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0 |
| Kimi-VL-A3B-Thinking | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Pass@1 and Pass@32 performance (%) of 15 MLLMs on the four TET tasks. The results show uniformly poor performance across all models and tasks, highlighting the severity of the vision blind spot.
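For reference, here is a minimal sketch of how Pass@1 and Pass@32 scores of this form can be computed, under the assumption that an item counts as solved when any of its first k sampled responses exactly matches the ground truth; the authors' exact grading rule (e.g., answer normalization or partial matching) may differ.

```python
# Minimal Pass@k sketch: an item is "solved" if any of its first k responses is correct.
# Exact-match grading is an assumption; the paper's grading rule may differ.
from typing import List

def pass_at_k(responses_per_item: List[List[str]], answers: List[str], k: int) -> float:
    solved = 0
    for responses, gold in zip(responses_per_item, answers):
        if any(r.strip() == gold.strip() for r in responses[:k]):
            solved += 1
    return 100.0 * solved / len(answers)   # reported as a percentage

# Pass@1 uses the first sample per item; Pass@32 uses all 32 samples.
# score_1 = pass_at_k(responses, answers, k=1)
# score_32 = pass_at_k(responses, answers, k=32)
```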

Attention Analysis

To explore why models cannot perceive images accurately, we conduct an analysis of attention patterns using Grad-CAM visualization. Our systematic examination of two representative models from the Qwen2.5-VL series (7B and 72B parameters) reveals **fundamental limitations in visual attention mechanisms** that explain the catastrophic failures on our perceptual tasks.
🔍 Information Flow in Visual Encoder

While the ViT allocates attention across various regions of the image, this attention is often directed outside the target character regions or captures only partial segments of them. The image encoder struggles to focus on the textural features corresponding to character regions and instead prioritizes object-level features within the image. Such disparities in visual attention prevent the model from truly comprehending the image content.

🧠 Information Flow in Language Decoder

The LLM decoders consistently fail to focus on the precise regions containing text or character information; instead, they scatter attention over irrelevant regions or ignore critical visual elements entirely. This mismatch between the attention patterns and the actual locations of important visual features indicates fundamental limitations in visual perception ability.
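To make the analysis procedure concrete, below is a rough Grad-CAM-style sketch that computes a per-patch relevance map for a generic ViT classifier from `timm`. It is only illustrative: the paper's analysis targets the Qwen2.5-VL vision tower and language decoder, which require hooking that model's own modules, and the model name, hook point, and image path here are placeholders.

```python
# Illustrative Grad-CAM-style relevance map for a generic ViT classifier.
# Model, hook point, and image path are placeholders, not the paper's exact setup.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

feats, grads = {}, {}
target = model.blocks[-1].norm1              # assumed hook point in the last block

def fwd_hook(_, __, output):
    feats["act"] = output                    # (B, 1 + N, D): CLS + patch tokens
    output.register_hook(lambda g: grads.update(act=g))

handle = target.register_forward_hook(fwd_hook)

img = Image.open("hiddentext_example.png").convert("RGB")   # hypothetical TET image
x = transform(img).unsqueeze(0)
logits = model(x)
logits[0, logits.argmax(-1)].backward()      # back-propagate the top-class score

act = feats["act"][0, 1:].detach()           # drop the CLS token -> (N, D)
grad = grads["act"][0, 1:]
weights = grad.mean(dim=0)                   # Grad-CAM channel weights
cam = torch.relu(act @ weights)              # per-patch relevance, shape (N,)
side = int(cam.numel() ** 0.5)
cam = cam.reshape(side, side)                # 14x14 map for a 224/16 ViT
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
handle.remove()
```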

Attention Visualization Analysis

Perception vs. Reasoning

To investigate whether a lack of domain knowledge causes these failures, we conduct SFT experiments on Qwen2.5-VL-7B using five training configurations that target different model components.
Training Loss Comparison

Training loss curves for different fine-tuning configurations, on both our tasks and traditional tasks.

Our results reveal a fundamental distinction between perception-oriented and reasoning-oriented tasks:

- ***TET tasks***: only configurations that include vision-encoder fine-tuning achieve effective learning, while language-only training plateaus quickly.
- ***Traditional benchmarks*** (OCRVQA, GEOQA, CLEVR): all training configurations converge efficiently to similar performance.

This demonstrates that our benchmark poses **perception challenges** that require genuine visual understanding, rather than reasoning challenges that can be solved through language enhancement alone. The image domains of existing benchmarks are typically covered by pre-training data, so they require only improvements in semantic understanding—a gap our benchmark is designed to address.
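As a concrete illustration of the kind of configuration being compared, the sketch below toggles which components of a Qwen2.5-VL checkpoint are trainable when loaded with `transformers`. The parameter-name check (`"visual"` in the name) follows the publicly released Qwen2.5-VL module layout but should be verified against the installed version; the dataset, optimizer, and training loop are omitted.

```python
# Minimal sketch: choose which components of Qwen2.5-VL to fine-tune.
# Module naming ("visual" for the vision tower) is an assumption based on the public
# implementation; verify against your installed transformers version.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

def set_trainable(model, train_vision: bool, train_llm: bool) -> None:
    """Freeze or unfreeze parameters by component."""
    for name, param in model.named_parameters():
        is_vision = "visual" in name           # vision tower (ViT + patch merger)
        param.requires_grad = train_vision if is_vision else train_llm

# Language-only SFT (the configuration that plateaus quickly on TET):
set_trainable(model, train_vision=False, train_llm=True)
# Vision-tower fine-tuning (the configuration that adapts rapidly on TET):
set_trainable(model, train_vision=True, train_llm=False)
```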

Citation

@misc{gao2025pixelspatternspoetryworld,
      title={Pixels, Patterns, but No Poetry: To See The World like Humans}, 
      author={Hongcheng Gao and Zihao Huang and Lin Xu and Jingyi Tang and Xinhao Li and Yue Liu and Haoyang Li and Taihang Hu and Minhua Lin and Xinlong Yang and Ge Wu and Balong Bi and Hongyu Chen and Wentao Zhang},
      year={2025},
      eprint={2507.16863},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16863}
}