Salesforce AI Research Proposes Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries

Vision-Language Models (VLMs) are increasingly used for generating responses to queries about visual content. Despite their progress, they often suffer from a major issue: generating plausible but incorrect responses, also known as hallucinations. These hallucinations can lead to a lack of trust in these systems, especially in real-world, high-stakes applications. Evaluating the helpfulness and truthfulness of VLM-generated responses is challenging because it requires not only understanding visual content but also verifying each claim made in the response. Traditional benchmarks have not been adequate for addressing this challenge, either because they limit evaluations to simplistic, binary questions or because they rely on incomplete context to judge open-ended responses.

Researchers from Salesforce AI Research have proposed Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm that evaluates VLM responses to open-ended visual queries. In PROVE, researchers use a high-fidelity scene graph representation constructed from hyper-detailed image captions and employ a large language model (LLM) to generate diverse question-answer (QA) pairs along with executable programs to verify each QA pair. This approach allows the creation of a benchmark dataset of 10.5k visually grounded and challenging QA pairs. The evaluation strategy involves measuring both the helpfulness and truthfulness of VLM responses using a unified framework based on scene graph comparisons. This programmatic evaluation provides a more reliable and interpretable assessment of VLM performance compared to previous benchmarks.
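To make the idea of a verification program concrete, here is a minimal Python sketch. It is not the PROVE codebase: the `SceneGraph` class, its `related` method, and the `verify_qa` helper are hypothetical names invented for illustration, and the scene graph is a hand-written toy rather than one built from image captions. It only shows the general pattern of keeping a QA pair when an executable program can reproduce its answer from the scene graph.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entity attributes plus relation triples."""
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def related(self, subject: str, relation: str) -> list[str]:
        """Return the sorted objects linked to `subject` by `relation`."""
        return sorted(o for s, r, o in self.relations
                      if s == subject and r == relation)


def verify_qa(graph: SceneGraph,
              program: Callable[[SceneGraph], str],
              expected_answer: str) -> bool:
    """A QA pair passes only if its program runs and reproduces the answer."""
    try:
        return program(graph) == expected_answer
    except Exception:
        # The program referenced something the scene graph does not contain.
        return False


# Hand-written toy graph; in the article it is built from detailed captions.
graph = SceneGraph(
    attributes={"dog": {"brown", "small"}, "sofa": {"red"}},
    relations={("dog", "on", "sofa")},
)

# Each entry: (question, answer, verification program). All are made up.
qa_pairs = [
    ("What is the dog sitting on?", "sofa",
     lambda g: g.related("dog", "on")[0]),
    ("What is the cat sitting on?", "chair",
     lambda g: g.related("cat", "on")[0]),  # no cat in the graph: IndexError
]

# Keep only the QA pairs that can be programmatically verified.
kept = [q for q, a, prog in qa_pairs if verify_qa(graph, prog, a)]
print(kept)
```

In this sketch the second question is dropped because its program fails against the graph, which mirrors the filtering step described above in a very simplified form.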

The PROVE benchmark uses detailed scene graph representations and executable programs to verify the correctness of VLM responses. Scene graphs, constructed from detailed image captions, contain entities, attributes, and relationships that represent the visual scene. By prompting an LLM, researchers generate open-ended QA pairs and corresponding verification programs that ensure the questions are challenging yet verifiable. Only QA pairs that can be programmatically verified are retained in the benchmark, resulting in a high-quality dataset. The evaluation involves extracting scene graph representations from both the model responses and ground truth answers, and then calculating scores based on the recall and precision of these representations, measuring how helpful and truthful the responses are.
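The recall and precision scoring described above can be sketched in a few lines of Python. This is an illustrative simplification, not the benchmark's implementation: the function names are hypothetical, tuples are compared by exact match (a real system would likely need softer matching to handle paraphrases), and checking the response against the full scene graph for truthfulness is an assumption made here so that correct extra detail is not penalized.

```python
Triple = tuple[str, str, str]


def helpfulness(response: set[Triple], answer: set[Triple]) -> float:
    """Recall: share of ground-truth answer tuples that the response covers."""
    if not answer:
        return 0.0
    return len(response & answer) / len(answer)


def truthfulness(response: set[Triple], scene: set[Triple]) -> float:
    """Precision: share of response tuples supported by the scene graph.

    Assumption: support is checked against the full scene graph of the image,
    not only against the tuples of the reference answer.
    """
    if not response:
        return 0.0
    return len(response & scene) / len(response)


# Toy tuples, written by hand for illustration only.
scene = {("dog", "is", "brown"), ("dog", "on", "sofa"), ("sofa", "is", "red")}
answer = {("dog", "on", "sofa")}
response = {("dog", "on", "sofa"), ("sofa", "is", "blue")}  # one wrong claim

h = helpfulness(response, answer)
t = truthfulness(response, scene)
print(f"helpfulness={h:.2f} truthfulness={t:.2f}")
```

Here the response fully answers the question (helpfulness 1.0) but half of its claims are unsupported (truthfulness 0.5), which is the kind of gap between the two scores that the benchmark is designed to expose.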

The results show that current VLMs struggle to balance helpfulness and truthfulness. Models such as GPT-4o, Phi-3.5-Vision, and Pixtral achieved higher helpfulness scores but not necessarily higher truthfulness. The study also found that increasing model size tends to improve helpfulness but does not always improve truthfulness. More broadly, recent advances in VLM training have raised helpfulness without consistently translating into more truthful outputs. Notably, the LLaVA-1.5 model series achieved the best truthfulness scores, suggesting that smaller, more focused models might outperform larger ones in maintaining accuracy.

In conclusion, PROVE presents a significant advancement in evaluating the helpfulness and truthfulness of VLM-generated responses. By leveraging detailed scene graph representations and programmatic verification, this benchmark provides a more reliable and interpretable evaluation framework. The findings underscore the need for VLMs that strike a balance between generating informative and accurate responses, especially as their use in real-world applications continues to grow. Future research is expected to focus on improving both the helpfulness and truthfulness of these models through advanced training techniques and new evaluation strategies.


Check out the Paper and Dataset Card. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
