[ACCV 2024 Oral]
Vision Language Models are Blind
This paper examines the limitations of vision language models (VLMs), such as GPT-4o, on extremely simple, abstract visual tasks. Despite their high scores on multimodal benchmarks, these models often fail at tasks that are trivially easy for humans, such as counting line intersections or determining whether two circles overlap.
This research has been featured by OpenAI, TechCrunch, and Ars Technica.