VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
[📜 Paper] [🤗 Model] [🤗 Dataset] [🤗 Benchmark]
The overall Best-of-8 evaluation results across seven multimodal reasoning benchmarks with different critic models. Our VisualPRM greatly enhances the overall performance, while InternVL2.5-8B struggles to be an effective critic model.
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs.
Our main contributions are as follows:
- VisualPRM-8B: an advanced multimodal Process Reward Model (PRM) with 8B parameters. Specifically, VisualPRM improves the overall reasoning performance of MiniCPM-V2.6-8B, Qwen2.5-VL-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks. Additionally, we compare PRMs with Outcome Reward Models and Self-Consistency in BoN evaluation, finding that PRMs consistently outperform both approaches.
- VisualPRM400K: a dataset comprising approximately 400K multimodal process supervision samples, generated with an automatic data pipeline. The key idea is to estimate the expected accuracy $mc_i$ of a given step $s_i$ based on Monte Carlo sampling and to consider the step correct if $mc_i > 0$.
- VisualProcessBench: a benchmark designed to measure the abilities of PRMs and MLLMs to identify erroneous steps in multimodal reasoning tasks. This benchmark comprises 2,866 samples with a total of 26,950 human-annotated step-wise correctness labels.
VisualPRM-8B
During the training process, we formulate the process supervision problem as a multi-turn chat task so that we can effectively leverage the generation ability of MLLMs. The image $I$, the question $q$, and the first step $s_1$ are provided in the first turn, while each subsequent step $s_i$ is provided in a following turn. Following Math-Shepherd, we require the model to predict the correctness of each step, generating $+$ for a correct step and $-$ for an incorrect one.
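To make this multi-turn formulation concrete, here is a minimal sketch that assembles one training conversation from a labelled solution. The message schema, role names, and prompt wording are illustrative assumptions rather than the exact template used to train VisualPRM-8B.

```python
# Illustrative sketch (assumed schema): turn a labelled solution into a
# multi-turn chat sample for PRM training. "+" marks a correct step
# (expected accuracy mc_i > 0) and "-" marks an incorrect one.
def build_prm_chat_sample(image_path, question, steps, step_labels):
    assert len(steps) == len(step_labels)
    messages = []
    # First turn: image, question, and the first step.
    messages.append({
        "role": "user",
        "content": f"<image>\nQuestion: {question}\nStep 1: {steps[0]}\n"
                   "Is this step correct? Answer with + or -.",
    })
    messages.append({"role": "assistant", "content": step_labels[0]})
    # Each subsequent step is appended as a new turn.
    for i, (step, label) in enumerate(zip(steps[1:], step_labels[1:]), start=2):
        messages.append({"role": "user", "content": f"Step {i}: {step}"})
        messages.append({"role": "assistant", "content": label})
    return {"image": image_path, "messages": messages}
```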
During the inference stage, we first compute the score of each step and then merge them to obtain the response score. Specifically, the score of each step is defined as the weighted sum of the generation probabilities of the discretized scores $\{+, -\}$, with weights $\{1, 0\}$, so the step score reduces to the probability of generating $+$. Unless otherwise specified, we average the step scores to obtain the response score.
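As a minimal sketch of this scoring rule, assuming the PRM exposes per-step probabilities for the $+$ and $-$ judgments (the input format below is hypothetical):

```python
# Sketch of response scoring with weights {1, 0} for the discretized
# scores {+, -}: each step score is the probability assigned to "+",
# and the response score is the average over steps.
def score_response(step_token_probs):
    """step_token_probs: list of dicts like {"+": p_plus, "-": p_minus}."""
    weights = {"+": 1.0, "-": 0.0}
    step_scores = [
        sum(weights[tok] * probs.get(tok, 0.0) for tok in weights)
        for probs in step_token_probs
    ]
    return sum(step_scores) / len(step_scores)  # average-merge by default

# Example: a 3-step response whose middle step looks dubious.
print(score_response([{"+": 0.95, "-": 0.05},
                      {"+": 0.40, "-": 0.60},
                      {"+": 0.90, "-": 0.10}]))  # -> 0.75
```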
Model | MMMU | MathVista | MathVision | MathVerse-VO | DynaMath | WeMath | LogicVista | Overall |
---|---|---|---|---|---|---|---|---|
MiniCPM-V2.6-8B | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
+VisualPRM | 56.8(+7.0) | 65.7(+4.9) | 24.7(+1.3) | 35.8(+16.9) | 11.2(+1.4) | 31.0(+14.6) | 37.4(+9.8) | 37.5(+8.0) |
Qwen2.5-VL-7B | 55.0 | 67.8 | 25.4 | 41.1 | 21.0 | 35.2 | 44.1 | 41.4 |
+VisualPRM | 58.6(+3.6) | 70.3(+2.5) | 31.3(+5.9) | 44.3(+3.2) | 23.0(+2.0) | 39.8(+4.6) | 48.3(+4.2) | 45.1(+3.7) |
InternVL2.5-8B | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
+VisualPRM | 60.2(+4.0) | 68.5(+4.0) | 25.7(+8.7) | 35.8(+13.0) | 18.0(+8.6) | 36.5(+13.0) | 43.8(+7.8) | 41.2(+8.4) |
InternVL2.5-26B | 60.7 | 68.2 | 23.4 | 24.0 | 11.4 | 30.9 | 39.6 | 36.9 |
+VisualPRM | 63.9(+3.2) | 73.1(+4.9) | 29.6(+6.2) | 39.1(+15.2) | 23.2(+11.8) | 40.8(+9.9) | 51.0(+11.4) | 45.8(+8.9) |
InternVL2.5-38B | 63.9 | 71.9 | 32.2 | 36.9 | 20.0 | 38.3 | 47.9 | 44.4 |
+VisualPRM | 69.0(+5.1) | 73.9(+2.0) | 35.2(+3.0) | 46.7(+9.8) | 30.5(+10.5) | 46.2(+7.9) | 53.7(+5.8) | 50.7(+6.3) |
InternVL2.5-78B | 70.0 | 72.3 | 32.2 | 39.2 | 19.2 | 39.8 | 49.0 | 46.0 |
+VisualPRM | 70.7(+0.7) | 75.1(+2.8) | 35.9(+3.7) | 47.1(+7.9) | 31.3(+12.1) | 49.1(+9.3) | 53.9(+4.9) | 51.9(+5.9) |
We evaluate the reasoning abilities of MLLMs across seven benchmarks, including MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista. The evaluation samples include subject-based, mathematical, and logical reasoning problems. We report the worst-case accuracy for DynaMath and the overall accuracy for the remaining benchmarks. For MathVerse, we report the performance on the Vision-Only split.
As shown in the table above, VisualPRM greatly enhances the reasoning abilities of MLLMs across different model scales and families. Specifically, for models with fewer than 10 billion parameters, the overall performance of InternVL2.5-8B, MiniCPM-V2.6-8B, and Qwen2.5-VL-7B improves by 8.4, 8.0, and 3.7 points, respectively, demonstrating the effectiveness of test-time scaling (TTS) across different model families. For larger models, InternVL2.5-26B, InternVL2.5-38B, and InternVL2.5-78B also achieve substantial performance gains over their counterparts without TTS, further validating the scalability and effectiveness of TTS across different model sizes.
We increase the number of response candidates sampled from InternVL2.5-8B and select the final response using Self-Consistency (SC), Outcome Reward Model (ORM), and PRM. The training data for ORM are nearly identical to those used for PRM, except that all steps are concatenated into a single step and step-wise correctness annotations are converted into a single correctness label for the outcome.
As shown in the figures above, increasing the number of response candidates $N$ improves the reasoning performance of InternVL2.5-8B and MiniCPM-V2.6-8B when using SC, ORM, or PRM, with PRM yielding the most significant improvements. Specifically, when using InternVL2.5-8B as the policy model, PRM outperforms SC and ORM by 2.4 and 1.5 points, respectively, under the Best-of-8 evaluation setting. Moreover, this performance gap widens as $N$ increases, reaching 3.1 and 4.3 points when $N$ is set to 128. Notably, when using ORM as the critic model, although performance improves during Best-of-8 evaluation, further increasing $N$ does not lead to consistent gains for InternVL2.5-8B. For example, the Best-of-128 performance is inferior to the Best-of-64 performance. These results highlight the effectiveness of PRM in TTS.
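The three selection rules compared above operate on the same pool of $N$ sampled responses. The sketch below shows simplified versions of each; `orm_score`, `prm_score`, and `final_answer` are stand-ins for the actual critic-model forward passes and answer extraction, not released APIs.

```python
from collections import Counter

# Best-of-N selection sketch: pick one response out of N candidates.
# SC votes over final answers; ORM/PRM rank candidates by a critic score.

def select_self_consistency(responses, final_answer):
    # Majority vote over final answers, ignoring intermediate steps.
    votes = Counter(final_answer(r) for r in responses)
    best_answer, _ = votes.most_common(1)[0]
    return next(r for r in responses if final_answer(r) == best_answer)

def select_best_of_n(responses, score_fn):
    # With an ORM, score_fn judges the whole response once; with a PRM,
    # it averages step-level scores (see score_response above).
    return max(responses, key=score_fn)
```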
VisualPRM400K
Definition. As shown in the figures above, each data sample in our VisualPRM400K consists of an image $I$, a question $q$, a step-by-step solution $S = \{s_1, s_2, \ldots, s_n\}$, and the expected accuracy $mc_i$ of each step $s_i$.
Process Supervision Generation. Given an image $I$, a question $q$, and the first $i$ steps of a solution $s_{\le i}$, we sample a number of solution continuations starting from $s_{\le i}$ and estimate the expected accuracy $mc_i$ of step $s_i$ as the proportion of continuations whose final answers are correct.
Besides, the expected accuracy of a step determines its correctness label: $s_i$ is considered correct if $mc_i > 0$ and incorrect otherwise.
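A minimal sketch of this Monte Carlo labelling is given below, assuming a `sample_continuations` helper that lets a policy MLLM continue from a solution prefix and returns the resulting final answers; both the helper and the number of rollouts are illustrative, not the exact pipeline settings.

```python
# Sketch of Monte Carlo process supervision: the expected accuracy mc_i of
# step s_i is the fraction of sampled continuations (starting from the
# prefix s_{<=i}) that reach the correct final answer; a step counts as
# correct if mc_i > 0.
def annotate_solution(image, question, steps, gt_answer,
                      sample_continuations, num_rollouts=8):
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        # sample_continuations is assumed to return num_rollouts final
        # answers obtained by continuing the solution from the prefix.
        answers = sample_continuations(image, question, prefix, num_rollouts)
        mc_i = sum(a == gt_answer for a in answers) / num_rollouts
        labels.append({"step": steps[i - 1],
                       "mc": mc_i,
                       "label": "+" if mc_i > 0 else "-"})
    return labels
```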
VisualProcessBench
Definition. Each sample in our benchmark consists of a multimodal reasoning question, a step-by-step solution, and correctness annotations for each step. Considering that recent models begin to demonstrate reflection abilities to rectify their own reasoning process, the evaluation setting used in previous works, which only requires the model to find the first erroneous step, may lead to a false negative estimation. Therefore, our benchmark requires the model to identify all erroneous steps in the given solution instead of only the first erroneous step.
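For concreteness, a benchmark sample can be pictured as the following structure; the field names and contents are illustrative assumptions, not the released schema.

```python
# Illustrative (assumed) shape of a VisualProcessBench sample. A model
# under evaluation must judge every step, not just locate the first error.
sample = {
    "image": "example.png",  # hypothetical image path
    "question": "What is the area of the shaded region?",
    "steps": [
        "Step 1: The square has side length 4, so its area is 16.",
        "Step 2: The circle has radius 4, so its area is 16*pi.",   # radius is wrong
        "Step 3: The shaded area is therefore 16 - 16*pi.",         # error propagates
    ],
    "labels": ["positive", "negative", "negative"],  # per-step correctness
}
```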
Data Source. Our benchmark focuses on multimodal reasoning tasks, collecting images and questions from existing representative multimodal reasoning benchmarks, including MMMU, MathVision, MathVerse, DynaMath, and WeMath. Given these questions, we generate step-by-step solutions using leading MLLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-2.0-Flash, QvQ-72B-Preview, and InternVL2.5-78B. The solutions are sampled from different MLLMs to ensure their diversity.
Step Correctness Annotation. We employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions. Specifically, 13 people worked for 3 days, resulting in a workload of 39 person-days. The cost per person-day is approximately 37 dollars. During the annotation process, annotators are provided with the image, question, ground truth answer, and each step of the solution. Their task is to assign each step in the solution a label of positive, negative, or neutral. A positive label indicates that the step is correct, while a negative label signifies an incorrect step. The neutral label is assigned to steps that do not involve any reasoning process or provide no additional information. To ensure annotation quality, annotators are permitted to skip questions they do not understand. For quality control, the dataset is divided into 10 splits, each containing approximately 300 samples. For each split, the authors of this paper manually review about 10% of the samples. Splits with erroneous annotations are sent back for re-annotation.
Model | MMMU | MathVision | MathVerse-VO | DynaMath | WeMath | Overall |
---|---|---|---|---|---|---|
InternVL2.5-8B | 47.1 | 45.5 | 47.8 | 50.3 | 50.8 | 48.0 |
InternVL2.5-26B | 48.8 | 47.4 | 49.2 | 50.4 | 51.4 | 49.2 |
InternVL2.5-38B | 51.5 | 48.4 | 50.9 | 51.8 | 52.5 | 50.8 |
InternVL2.5-78B | 52.0 | 51.7 | 53.7 | 50.8 | 52.5 | 52.6 |
VisualPRM (ours) | 58.5 | 62.1 | 61.0 | 62.7 | 61.8 | 62.0 |
As shown in the table above, the InternVL2.5 series struggles to accurately assess the correctness of each step. Specifically, the overall F1 score for random guessing is 50.0, while most open-source MLLMs achieve scores close to this baseline, highlighting their limitations as critic models. We manually check the judgments of these open-source MLLMs and observe that they tend to produce positive analyses and label most steps as correct. For example, InternVL2.5-8B achieves an F1 score of 76.8 for positive steps, while its F1 score for negative steps is only 19.2, indicating that InternVL2.5-8B rarely identifies steps as incorrect.
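As a reference for how such step-level F1 scores can be computed, here is a hedged sketch; the benchmark's exact aggregation (e.g. micro vs. macro averaging, treatment of neutral steps) may differ from the macro average assumed below.

```python
# Sketch: F1 per class over step-wise judgments, plus an assumed macro
# average of the positive-step and negative-step F1 as an "overall" score.
def f1_for_class(preds, golds, cls):
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def step_f1(preds, golds):
    pos_f1 = f1_for_class(preds, golds, "positive")
    neg_f1 = f1_for_class(preds, golds, "negative")
    return {"positive": pos_f1, "negative": neg_f1,
            "overall": (pos_f1 + neg_f1) / 2}  # assumed aggregation
```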
Citation
@article{wang2025visualprm,
title={VisualPRM: An Effective Process Reward Model for Multimodal Reasoning},
author={Wang, Weiyun and Gao, Zhangwei and Chen, Lianjie and Chen, Zhe and Zhu, Jinguo and Zhao, Xiangyu and Liu, Yangzhou and Cao, Yue and Ye, Shenglong and Zhu, Xizhou and Lu, Lewei and Duan, Haodong and Qiao, Yu and Dai, Jifeng and Wang, Wenhai},
journal={arXiv preprint arXiv:2503.10291},
year={2025}
}