**HiddenText**: Scale-variant items in which text is rendered as shapes within figures, appearing as text when the image is reduced and resolving into a complete picture when magnified. 150 images
**3DCaptcha**: Recognition challenges built from characters curved through three-dimensional space, testing spatial reasoning capabilities. 150 captchas
**ColorBlind**: Ishihara-style plates augmented with confounding colored dots that are chromatically similar to the central characters, increasing difficulty. 150 test images
**ChineseLigatures**: Intricate glyphs synthesized through character decomposition, morphological transformation, and the fusion of multiple Chinese characters into composite forms representing words or phrases.
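To make the ColorBlind construction concrete, the sketch below shows one way such confusable-color stimuli could be generated. This is a minimal illustration, not the authors' actual pipeline: the bitmap mask, the palette, and the jitter amount are all invented for the example. Character cells and background cells receive deliberately close base colors, so the character is only separable by fine chromatic discrimination.

```python
import random

# Tiny 5x5 bitmap mask for the digit "1" (1 = character pixel).
# A real generator would rasterize an actual glyph at high resolution.
MASK = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

def jitter(rgb, amount=12):
    """Perturb a color slightly so dots vary but stay chromatically close."""
    return tuple(min(255, max(0, c + random.randint(-amount, amount))) for c in rgb)

def ishihara_dots(char_color=(200, 80, 60), bg_color=(180, 95, 70)):
    """Assign each grid cell a dot color: character cells get a jittered
    char_color, background cells a jittered bg_color chosen to be close
    to char_color -- the proximity is what makes the plate hard."""
    dots = []
    for y, row in enumerate(MASK):
        for x, inside in enumerate(row):
            base = char_color if inside else bg_color
            dots.append(((x, y), jitter(base)))
    return dots

dots = ishihara_dots()
```

Rendering these dot positions and colors as filled circles (e.g. with Pillow) would produce the plate itself; the key design choice is that difficulty is tuned entirely by the distance between `char_color` and `bg_color`.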
Example target words: World, China, France, Music, Cat, Water, Sun, Child, Coffee, Snow, Phone, Money, Train, Bridge, Door, Window, Boston, Key, Love, Tooth, Rice, Card, Magic, Berlin, Beach, Clean, Juice, Sound, Happy, Fresh, Mexico, Peace, Sydney, Smoke, Pillow.
Models | HiddenText Pass@1 | HiddenText Pass@32 | 3DCaptcha Pass@1 | 3DCaptcha Pass@32 | ColorBlind Pass@1 | ColorBlind Pass@32 | ChineseLigatures Pass@1 | ChineseLigatures Pass@32
---|---|---|---|---|---|---|---|---
OpenAI o1 | 0 | 0 | 0 | 0 | 0 | 1.33 | 0 | 0 |
Claude-4-Sonnet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Gemini 2.5 Pro | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 5.0 |
Seed-1-6-250615 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 | 2.5 |
Qwen2.5VL-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Qwen2.5VL-7B | 0 | 0.67 | 0 | 0 | 0 | 0 | 0 | 2.5 |
QVQ-72B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Qwen2.5-Omni-7B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
InternVL3-78B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
MiniCPM-o-2.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.5 |
Show-o2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Bagel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Janus-Pro | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Kimi-VL-A3B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5.0
Kimi-VL-A3B-Thinking | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Pass@1 and Pass@32 performance of 15 MLLMs on the four TET tasks. Performance is universally poor across all models and tasks, underscoring the severity of this vision blind spot.
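For reference, Pass@k is conventionally computed with the unbiased estimator introduced for code generation by Chen et al. (2021): given n sampled answers per problem of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch (the sample counts in the usage lines are illustrative, not taken from the table):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations (c correct)
    is correct."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 32 samples with exactly 1 correct:
print(pass_at_k(32, 1, 1))   # 0.03125
print(pass_at_k(32, 1, 32))  # 1.0
```

Per-problem estimates are then averaged over the dataset, which is why table entries like 2.5 appear even when most problems score zero.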
While the ViT allocates attention across various regions of the image, that attention often falls outside the target character regions or captures only partial segments. The image encoder struggles to focus on the textural features of character regions and instead prioritizes object-level features, and this mismatch in visual attention prevents the model from truly comprehending the image content.
The LLM decoders likewise fail to attend to the regions that actually contain text or character information; instead they scatter attention over irrelevant regions or ignore critical visual elements entirely. This inconsistency between attention patterns and the true locations of important visual features points to fundamental limitations in visual perception.
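A common way to produce the kind of attention visualizations described above is attention rollout (Abnar & Zuidema, 2020): average the heads in each layer, mix in the identity to account for residual connections, renormalize rows, and compose the layers by matrix multiplication. The pure-Python sketch below shows the computation on toy row-stochastic matrices; the shapes and the 0.5 residual weight are illustrative, not the paper's settings.

```python
def rollout(attn_layers):
    """attn_layers: list of per-layer, head-averaged attention matrices,
    each an n x n row-stochastic list of lists (layer 0 first).
    Returns the rolled-out attention from output tokens to input tokens."""
    n = len(attn_layers[0])
    # Start from the identity: before any layer, each token "attends" to itself.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a in attn_layers:
        # Residual connection: mix attention with the identity, then renormalize rows.
        aug = [[0.5 * a[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        for row in aug:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        # Compose with the layers already rolled out: result = aug @ result.
        result = [[sum(aug[i][k] * result[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
    return result
```

Overlaying a row of the result on the input image (one weight per patch) shows whether mass lands on the character regions or, as observed here, elsewhere.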
Training loss curves under different finetuning-parameter settings, for both our tasks and traditional tasks.
@misc{gao2025pixelspatternspoetryworld,
title={Pixels, Patterns, but No Poetry: To See The World like Humans},
author={Hongcheng Gao and Zihao Huang and Lin Xu and Jingyi Tang and Xinhao Li and Yue Liu and Haoyang Li and Taihang Hu and Minhua Lin and Xinlong Yang and Ge Wu and Balong Bi and Hongyu Chen and Wentao Zhang},
year={2025},
eprint={2507.16863},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16863}
}