DeepPrune: Parallel Scaling without Inter-trace Redundancy

Introduction

Parallel scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy: our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling, that predicts answer equivalence from partial reasoning traces with 0.87 AUROC, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces token consumption by over 80% compared to conventional consensus sampling in most cases, while maintaining accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning substantially cheaper at inference time.

🔍 With DeepPrune, we set out to answer how to achieve high performance with low token cost in inference-time scaling scenarios.

Method

Motivation

Generally, there are two types of inference-time scaling: sequential scaling and parallel scaling. Sequential scaling increases the computation spent on a single reasoning trace, for example by expanding the output length to 128k tokens, while parallel scaling (e.g., best-of-n sampling) generates multiple reasoning traces simultaneously, pushing the total token cost to 100M or higher. Beneath these advances lies a practical question: how can we achieve high performance with low token cost?

Existing efficient-reasoning methods mainly target the over-thinking problem of sequential scaling. Few works are designed for parallel scaling, and those that exist typically rely on the LLM's internal signals, such as confidence, to stop sampling early and improve efficiency. However, these confidence-based methods suffer from two fundamental limitations: (1) they fail to reduce redundancy between parallel reasoning paths, and (2) they risk prematurely terminating correct reasoning traces.

Preliminary Experiment


(a) Distribution of same- vs. different-answer pairs of reasoning traces, revealing severe redundancy;
(b) ROC curve for shallow semantic similarity (SentenceBERT) at distinguishing traces with the same answer from those with different ones, showing limited predictive power;
(c) ROC curve for LLM-based deep comparison (Qwen3-4B-Instruct), which achieves a moderate improvement.
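
For readers who want to probe this gap themselves, here is a minimal sketch of the shallow-similarity baseline in panel (b). It assumes the sentence-transformers and scikit-learn packages and a list of labeled trace pairs; the encoder name and the pair format are illustrative assumptions, not the exact setup behind the figure.

```python
# Sketch: how well does shallow semantic similarity predict answer equivalence?
# `pairs` is assumed to be a list of (trace_a, trace_b, same_answer) tuples
# built from sampled reasoning traces; the encoder name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any SentenceBERT-style encoder

def similarity_auroc(pairs):
    """AUROC of cosine similarity as a predictor of 'same final answer'."""
    emb_a = encoder.encode([a for a, _, _ in pairs], normalize_embeddings=True)
    emb_b = encoder.encode([b for _, b, _ in pairs], normalize_embeddings=True)
    scores = np.sum(emb_a * emb_b, axis=1)  # cosine similarity per pair
    labels = [int(same) for _, _, same in pairs]
    return roc_auc_score(labels, scores)
```

An AUROC near 0.5 would mean surface similarity carries little signal about final-answer agreement, which is what motivates training a dedicated judge model instead.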

Method Framework


Overview of the DeepPrune framework. The offline training phase (top) constructs trace-pair datasets with binary labels indicating answer equivalence, then trains a judge model with focal loss and oversampling to address class imbalance. The online pruning phase (bottom) uses the trained judge to perform dynamic pruning via greedy clustering, assigning each trace to an existing cluster or a new one based on the judge's equivalence predictions, and concludes with majority voting over the selected traces to determine the final answer.
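
To make the two phases concrete, below are minimal sketches of each; the hyperparameters (gamma, alpha), the `judge_same_answer` wrapper, and the cluster-representative strategy are illustrative assumptions, not the exact design used for the results below.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for the pairwise judge: down-weights easy (majority-class)
    pairs so the rarer 'different answer' pairs are not drowned out.
    gamma/alpha are illustrative defaults, not the paper's exact settings."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

And a sketch of the online phase, assuming a hypothetical `judge_same_answer(partial_a, partial_b)` wrapper around the trained judge that returns the predicted probability that two partial traces will end with the same answer:

```python
from collections import Counter

def greedy_prune(partial_traces, judge_same_answer, threshold=0.5):
    """Greedily cluster partial traces by predicted answer equivalence.

    Each trace is compared against the representative (first member) of every
    existing cluster; if the judge predicts the same final answer, the trace
    is absorbed (pruned), otherwise it opens a new cluster and keeps generating.
    """
    clusters = []                                   # each cluster is a list of trace indices
    for idx, trace in enumerate(partial_traces):
        for cluster in clusters:
            representative = partial_traces[cluster[0]]
            if judge_same_answer(trace, representative) >= threshold:
                cluster.append(idx)                 # predicted redundant: stop this trace early
                break
        else:
            clusters.append([idx])                  # predicted novel: keep this trace alive
    survivors = [cluster[0] for cluster in clusters]
    return clusters, survivors

def majority_vote(final_answers):
    """Majority voting over the answers of the surviving, fully generated traces."""
    return Counter(final_answers).most_common(1)[0][0]
```

The token savings come from the early `break`: once a trace is predicted to repeat an existing cluster's answer it stops consuming tokens, while one representative per distinct predicted answer runs to completion and votes.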

Leaderboard

📢 The leaderboard is constantly updating as we welcome new submissions! Please email us!

We report the token consumption and accuracy of three widely used reasoning models, DeepSeek-8B, Qwen3-32B, and GPT-OSS-20B, on three benchmarks:

AIME24 (30 problems), AIME25 (30 problems), and GPQA-Diamond (198 problems).

To sort the leaderboard by a different column, click on the corresponding cell.

DeepSeek-8B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  3.55 / 86.7%         4.01 / 82.3%         9.92 / 72.5%
2  DeepConf-high 📝   Meta      2025-08-21  1.45 / 86.7%         2.37 / 81.4%         6.90 / 72.4%
3  DeepConf-low 📝    Meta      2025-08-21  0.78 / 92.5%         1.24 / 86.4%         3.46 / 71.7%
4  cons@512           Tsinghua  2025-10-10  3.62 / 86.7%         4.19 / 83.3%         10.9 / 66.2%
5  DeepPrune          Tsinghua  2025-10-10  0.42 / 86.7%         0.35 / 83.3%         2.54 / 63.1%

Qwen3-32B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  2.00 / 84.8%         2.43 / 80.1%         7.44 / 72.2%
2  DeepConf-high 📝   Meta      2025-08-21  0.88 / 86.4%         1.61 / 80.2%         4.16 / 72.9%
3  DeepConf-low 📝    Meta      2025-08-21  0.66 / 89.5%         1.14 / 80.2%         3.21 / 73.0%
4  cons@512           Tsinghua  2025-10-10  1.93 / 86.7%         2.64 / 80.0%         6.94 / 70.7%
5  DeepPrune          Tsinghua  2025-10-10  0.26 / 90.0%         0.23 / 90.0%         1.00 / 70.2%

GPT-OSS-20B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  5.57 / 96.7%         6.26 / 95.4%         - / -
2  DeepConf-high 📝   Meta      2025-08-21  3.07 / 96.7%         3.18 / 95.3%         - / -
3  DeepConf-low 📝    Meta      2025-08-21  1.11 / 95.7%         1.21 / 96.1%         - / -
4  cons@512           Tsinghua  2025-10-10  2.05 / 93.3%         2.10 / 90.0%         4.60 / 70.7%
5  DeepPrune          Tsinghua  2025-10-10  0.42 / 90.0%         0.38 / 93.3%         2.20 / 68.7%

๐Ÿ“ indicates the result is taken from the method's corresponding paper

1*. The comparison is between end-to-end reasoning systems that combine a reasoning model with a parallel scaling method such as DeepConf or self-consistency (e.g., cons@512).
2*. Token consumption is reported in units of 10^8 tokens.
3*. cons@512 means sampling 512 parallel traces and taking a majority vote over their final answers.
4*. The system with the lowest token consumption is shown in bold, and the one with the highest accuracy is underlined.
5*. If you want to develop new methods, you can use our released reasoning-trace dataset for the three reasoning models. Based on these existing traces, you can build your own answer aggregation strategies or early stopping methods without running the base reasoning model again (a minimal sketch follows below).
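
As a concrete starting point for note 5*, the sketch below implements an offline cons@n baseline over pre-generated traces. The JSONL layout with `question_id` and `answer` fields is a hypothetical schema for illustration, not the documented format of the released files.

```python
import json
import random
from collections import Counter

def cons_at_n(trace_file, n=512, seed=0):
    """Offline cons@n: for each question, sample up to n pre-generated traces
    and majority-vote their final answers. Assumes one JSON object per line
    with 'question_id' and 'answer' fields (hypothetical schema)."""
    random.seed(seed)
    answers_by_question = {}
    with open(trace_file) as f:
        for line in f:
            record = json.loads(line)
            answers_by_question.setdefault(record["question_id"], []).append(record["answer"])
    predictions = {}
    for qid, answers in answers_by_question.items():
        sampled = random.sample(answers, min(n, len(answers)))
        predictions[qid] = Counter(sampled).most_common(1)[0][0]
    return predictions
```

Swapping the sampling or voting step here is one way to prototype a new aggregation or early-stopping strategy without re-running the base reasoning model.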

Last Update: 2025-10-10

Experiment Results

Citation

If you find our work useful, please cite:


      @article{tu2025deepprune,
        title={DeepPrune: Parallel Scaling without Inter-trace Redundancy},
        author={Shangqing Tu and Yaxuan Li and Yushi Bai and Lei Hou and Juanzi Li},
        journal={arXiv preprint arXiv:2510.08483},
        year={2025}
      }