Paper - "Vision Language Models are Blind"
This paper examines the limitations of current vision language models (VLMs), such as GPT-4o and Claude 3.5 Sonnet, on low-level vision tasks. Despite their high scores on many multimodal benchmarks, these models often fail on tasks that are trivial for humans, such as deciding whether two circles overlap or counting how many times two lines intersect. This raises a crucial question: are we evaluating these models accurately?