DeepPrune: Parallel Scaling without Inter-trace Redundancy

Introduction

Parallel scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy: our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling, that predicts answer equivalence from partial reasoning traces with 0.87 AUROC, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces token consumption by over 80% compared to conventional consensus sampling in most cases, while maintaining accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning substantially cheaper at inference time.

🔍 With DeepPrune, we set out to answer how to achieve high performance with low token cost in inference-time scaling scenarios.

Method

Motivation

Generally, there are two types of inference-time scaling: sequential scaling and parallel scaling. Sequential scaling increases the computation spent on a single reasoning trace, for example by expanding the output length to 128k tokens, while parallel scaling (e.g., best-of-n sampling) generates multiple reasoning traces simultaneously, pushing the total token cost to 100M or higher. Beneath these advances lies a practical question: how can we achieve high performance with low token cost?

Existing efficient-reasoning methods mainly target the over-thinking problem of sequential scaling. Few works are designed for parallel scaling, and those that exist typically rely on the LLM's internal signals, such as confidence, to stop sampling early and improve efficiency. However, these confidence-based methods suffer from two fundamental limitations: (1) they fail to reduce redundancy between parallel reasoning paths, and (2) they risk prematurely terminating correct reasoning traces.

Preliminary Experiment


(a) Distribution of same- vs. different-answer pairs of reasoning traces, revealing severe redundancy;
(b) ROC curve for shallow semantic similarity (SentenceBERT) at distinguishing traces with the same answer from those with different ones, showing limited predictive power;
(c) ROC curve for LLM-based deep comparison (Qwen3-4B-Instruct), which achieves a moderate improvement.
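
For readers who want to probe this gap themselves, here is a minimal sketch of the shallow-similarity baseline in panel (b). It assumes the sentence-transformers and scikit-learn packages and a list of labeled trace pairs; the encoder name and the pair format are illustrative assumptions, not the exact setup behind the figure.

```python
# Sketch: how well does shallow semantic similarity predict answer equivalence?
# `pairs` is assumed to be a list of (trace_a, trace_b, same_answer) tuples
# built from sampled reasoning traces; the encoder name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any SentenceBERT-style encoder

def similarity_auroc(pairs):
    """AUROC of cosine similarity as a predictor of 'same final answer'."""
    emb_a = encoder.encode([a for a, _, _ in pairs], normalize_embeddings=True)
    emb_b = encoder.encode([b for _, b, _ in pairs], normalize_embeddings=True)
    scores = np.sum(emb_a * emb_b, axis=1)  # cosine similarity per pair
    labels = [int(same) for _, _, same in pairs]
    return roc_auc_score(labels, scores)
```

An AUROC near 0.5 would mean surface similarity carries little signal about final-answer agreement, which is what motivates training a dedicated judge model instead.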

Method Framework


Overview of the DeepPrune framework. The offline training phase (top) constructs trace-pair datasets with binary labels indicating answer equivalence, then trains a judge model with focal loss and oversampling to address class imbalance. The online pruning phase (bottom) uses the trained judge to perform dynamic pruning via greedy clustering, assigning each trace to an existing cluster or a new one based on the judge's equivalence predictions, and concludes with majority voting over the selected traces to determine the final answer.
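
To make the two phases concrete, below are minimal sketches of each; the hyperparameters (gamma, alpha), the `judge_same_answer` wrapper, and the cluster-representative strategy are illustrative assumptions, not the exact design used for the results below.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for the pairwise judge: down-weights easy (majority-class)
    pairs so the rarer 'different answer' pairs are not drowned out.
    gamma/alpha are illustrative defaults, not the paper's exact settings."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

And a sketch of the online phase, assuming a hypothetical `judge_same_answer(partial_a, partial_b)` wrapper around the trained judge that returns the predicted probability that two partial traces will end with the same answer:

```python
from collections import Counter

def greedy_prune(partial_traces, judge_same_answer, threshold=0.5):
    """Greedily cluster partial traces by predicted answer equivalence.

    Each trace is compared against the representative (first member) of every
    existing cluster; if the judge predicts the same final answer, the trace
    is absorbed (pruned), otherwise it opens a new cluster and keeps generating.
    """
    clusters = []                                   # each cluster is a list of trace indices
    for idx, trace in enumerate(partial_traces):
        for cluster in clusters:
            representative = partial_traces[cluster[0]]
            if judge_same_answer(trace, representative) >= threshold:
                cluster.append(idx)                 # predicted redundant: stop this trace early
                break
        else:
            clusters.append([idx])                  # predicted novel: keep this trace alive
    survivors = [cluster[0] for cluster in clusters]
    return clusters, survivors

def majority_vote(final_answers):
    """Majority voting over the answers of the surviving, fully generated traces."""
    return Counter(final_answers).most_common(1)[0][0]
```

The token savings come from the early `break`: once a trace is predicted to repeat an existing cluster's answer it stops consuming tokens, while one representative per distinct predicted answer runs to completion and votes.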

Leaderboard

📢 The leaderboard is constantly updating as we welcome new submissions! Please email us!

We report the token consumption and accuracy of three widely used reasoning models, DeepSeek-8B, Qwen3-32B, and GPT-OSS-20B, on three benchmarks:

AIME24 (30 problems), AIME25 (30 problems), and GPQA-Diamond (198 problems).

To sort the leaderboard by a different column, click on the corresponding cell.

DeepSeek-8B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  3.55 / 86.7%         4.01 / 82.3%         9.92 / 72.5%
2  DeepConf-high 📝   Meta      2025-08-21  1.45 / 86.7%         2.37 / 81.4%         6.90 / 72.4%
3  DeepConf-low 📝    Meta      2025-08-21  0.78 / 92.5%         1.24 / 86.4%         3.46 / 71.7%
4  cons@512           Tsinghua  2025-10-10  3.62 / 86.7%         4.19 / 83.3%         10.9 / 66.2%
5  DeepPrune          Tsinghua  2025-10-10  0.42 / 86.7%         0.35 / 83.3%         2.54 / 63.1%

Qwen3-32B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  2.00 / 84.8%         2.43 / 80.1%         7.44 / 72.2%
2  DeepConf-high 📝   Meta      2025-08-21  0.88 / 86.4%         1.61 / 80.2%         4.16 / 72.9%
3  DeepConf-low 📝    Meta      2025-08-21  0.66 / 89.5%         1.14 / 80.2%         3.21 / 73.0%
4  cons@512           Tsinghua  2025-10-10  1.93 / 86.7%         2.64 / 80.0%         6.94 / 70.7%
5  DeepPrune          Tsinghua  2025-10-10  0.26 / 90.0%         0.23 / 90.0%         1.00 / 70.2%

GPT-OSS-20B
#  Method             Team      Date        AIME24 Token / Acc   AIME25 Token / Acc   GPQA Token / Acc
1  cons@512 📝        Meta      2025-08-21  5.57 / 96.7%         6.26 / 95.4%         - / -
2  DeepConf-high 📝   Meta      2025-08-21  3.07 / 96.7%         3.18 / 95.3%         - / -
3  DeepConf-low 📝    Meta      2025-08-21  1.11 / 95.7%         1.21 / 96.1%         - / -
4  cons@512           Tsinghua  2025-10-10  2.05 / 93.3%         2.10 / 90.0%         4.60 / 70.7%
5  DeepPrune          Tsinghua  2025-10-10  0.42 / 90.0%         0.38 / 93.3%         2.20 / 68.7%

๐Ÿ“ indicates the result is taken from the method's corresponding paper

1*. The comparison is between end-to-end reasoning systems that combine a reasoning model with a parallel scaling method such as DeepConf or self-consistency (e.g., cons@512).
2*. Token consumption is reported in units of 10^8 tokens.
3*. cons@512 means sampling 512 parallel traces and taking a majority vote over their final answers.
4*. The system with the lowest token consumption is shown in bold, and the one with the highest accuracy is underlined.
5*. If you want to develop new methods, you can use our released reasoning-trace dataset for the three reasoning models. Based on these existing traces, you can build your own answer aggregation strategies or early stopping methods without running the base reasoning model again (a minimal sketch follows below).
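
As a concrete starting point for note 5*, the sketch below implements an offline cons@n baseline over pre-generated traces. The JSONL layout with `question_id` and `answer` fields is a hypothetical schema for illustration, not the documented format of the released files.

```python
import json
import random
from collections import Counter

def cons_at_n(trace_file, n=512, seed=0):
    """Offline cons@n: for each question, sample up to n pre-generated traces
    and majority-vote their final answers. Assumes one JSON object per line
    with 'question_id' and 'answer' fields (hypothetical schema)."""
    random.seed(seed)
    answers_by_question = {}
    with open(trace_file) as f:
        for line in f:
            record = json.loads(line)
            answers_by_question.setdefault(record["question_id"], []).append(record["answer"])
    predictions = {}
    for qid, answers in answers_by_question.items():
        sampled = random.sample(answers, min(n, len(answers)))
        predictions[qid] = Counter(sampled).most_common(1)[0][0]
    return predictions
```

Swapping the sampling or voting step here is one way to prototype a new aggregation or early-stopping strategy without re-running the base reasoning model.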

Last Update: 2025-10-10

Experiment Results

Citation

If you find our work useful, please cite:


      @article{tu2025deepprune,
        title={DeepPrune: Parallel Scaling without Inter-trace Redundancy},
        author={Shangqing Tu and Yaxuan Li and Yushi Bai and Lei Hou and Juanzi Li},
        journal={arXiv preprint arXiv:2510.08483},
        year={2025}
      }