On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Mukul Ranjan¹, Prince Jha¹, Khushboo Kumari², Zhiqiang Shen¹
¹MBZUAI, Abu Dhabi, UAE   |   ²Inception, UAE
1,600 Indian artifacts
600 evaluation questions
50.4% best model accuracy (GPT-4o)
8 state-of-the-art VLMs tested

Abstract

Vision-language models (VLMs) often exhibit "cultural anachronism" when interpreting historical artifacts. We introduce TAB-VLM, a benchmark that evaluates temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Across eight state-of-the-art VLMs, the best model (GPT-4o) reaches only 50.4% accuracy, revealing significant deficiencies in current models, particularly for non-Western visual culture.

The TAB-VLM Pipeline

Figure 1: Systematic reduction and curation from 220,000 raw artifacts to 1,600 high-quality samples across 8 historical periods.

Temporal Reasoning Tasks

Figure 2: Examples of the six task types in our benchmark.

Quantitative Results

Table 1: Zero-shot performance comparison across eight state-of-the-art Vision-Language Models.
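
As a rough illustration of how per-task and overall accuracy figures like those in Table 1 could be computed, here is a minimal sketch. It is not the authors' evaluation code: the record keys (`task`, `prediction`, `answer`), the task names, the case-insensitive exact-match scoring rule, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per-task accuracy dict, overall accuracy).

    Each record is a dict with the hypothetical keys 'task',
    'prediction', and 'answer'. Scoring is case-insensitive exact
    match after trimming whitespace, which is an assumption here.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_task, overall


# Made-up records for illustration only; not benchmark data.
toy_records = [
    {"task": "dating", "prediction": "Mughal", "answer": "mughal"},
    {"task": "dating", "prediction": "Gupta", "answer": "Maurya"},
    {"task": "ordering", "prediction": "B", "answer": "B"},
    {"task": "ordering", "prediction": " a ", "answer": "A"},
]

per_task, overall = accuracy_by_task(toy_records)
print(per_task)
print(overall)
```

Reporting the per-task breakdown alongside the overall number makes it easier to see whether a model's errors concentrate in one of the task types rather than being spread evenly.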

Citation

@article{ranjan2026cultural,
  title={On the Cultural Anachronism and Temporal Reasoning in Vision Language Models},
  author={Ranjan, Mukul and Jha, Prince and Kumari, Khushboo and Shen, Zhiqiang},
  journal={ACL 2026},
  year={2026}
}