MARSHAL: Incentivizing Multi-Agent Reasoning
via Self-Play with Strategic LLMs

Huining Yuan1*, Zelai Xu1*, Zheyue Tan2, Xiangmin Yi1,
Mo Guang3, Kaiwen Long3, Haojia Hui3, Boxun Li4, Xinlei Chen1, Bo Zhao2,
Xiao-Ping Zhang1†, Chao Yu1†, Yu Wang1†

1Tsinghua University, 2Aalto University, 3Li Auto Inc., 4Infinigence-AI

Overview

We introduce MARSHAL, an end-to-end reinforcement learning framework designed to incentivize Multi-Agent Reasoning through Self-play witH strAtegic LLMs in a diverse range of competitive and cooperative games.

MARSHAL addresses the challenge of credit assignment in multi-agent multi-turn self-play through two core mechanisms:

  • Turn-level Advantage Estimator: Enables fine-grained credit assignment, allowing the model to accurately attribute long-term outcomes to individual actions.
  • Agent-specific Advantage Normalization: Stabilizes the training process by calibrating advantage estimates relative to the performance of each agent.

Key Results

By leveraging self-play across strategic games, MARSHAL (based on Qwen3-4B) demonstrates notable generalization capabilities:

  • Strategic Games: Achieves up to 28.7% performance improvement on held-out games.
  • Reasoning Benchmarks: When integrated into leading multi-agent systems (MASs), MARSHAL yields consistent gains:
    • +10.0% on AIME
    • +7.6%* on GPQA-Diamond
    • +3.5% on average across all tested benchmarks.

    *Note: Due to a typo, this value is reported as 6.6% in the current arXiv preprint (v2). It will be corrected in the camera-ready version.

MARSHAL Performance Radar Chart

Evaluation of MARSHAL and two baselines on strategic games and reasoning benchmarks.

Game Replay

A self-play demonstration of the MARSHAL agent in Tic-Tac-Toe: the agent controls both Player 0 (X) and Player 1 (O), and its reasoning about game states and strategies is shown at each step.

Method

Algorithmic Design

We introduce two modifications to GRPO [1] for multi-agent self-play: a turn-level advantage estimator for fine-grained credit assignment, and agent-specific advantage normalization to stabilize training across heterogeneous roles.

MARSHAL Method Overview

Overview of MARSHAL. Left column: generating player trajectories through self-play in strategic games. Middle column: naive advantage estimation by GRPO. Right column: advantage estimation by MARSHAL for accurate credit assignment in the multi-turn, multi-agent setting.
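
To make these two modifications concrete, the sketch below shows one way turn-level returns and agent-specific normalization could be combined into per-turn advantages. It is our own illustration under assumed trajectory and reward conventions (including the discount factor), not the released training code.

```python
# Illustrative sketch only: the estimator below is our reading of the two
# modifications, not the paper's reference implementation. Trajectory format,
# function names, and the discount factor are assumptions.
from collections import defaultdict
import numpy as np

def marshal_style_advantages(games, gamma=1.0, eps=1e-8):
    """games: list of episodes; each episode is a list of turns, where a turn
    is a dict with 'agent_id' (which player acted) and 'reward' (the game
    outcome credited on that player's final turn, 0 elsewhere)."""
    turns = []

    # 1) Turn-level credit: sweep each episode backwards and accumulate a
    #    per-agent discounted return, so every turn carries its own estimate
    #    of the long-term outcome it contributed to.
    for episode in games:
        future = defaultdict(float)
        for turn in reversed(episode):
            a = turn["agent_id"]
            future[a] = turn["reward"] + gamma * future[a]
            turns.append({**turn, "return": future[a]})

    # 2) Agent-specific normalization: standardize returns within each agent's
    #    own group (rather than over the whole batch), so asymmetric roles
    #    such as first vs. second mover are measured against their own baseline.
    per_agent = defaultdict(list)
    for t in turns:
        per_agent[t["agent_id"]].append(t["return"])
    stats = {a: (np.mean(r), np.std(r)) for a, r in per_agent.items()}

    for t in turns:
        mu, sigma = stats[t["agent_id"]]
        t["advantage"] = (t["return"] - mu) / (sigma + eps)
    return turns
```

In a GRPO-style update, each turn's normalized advantage would then weight the policy-gradient loss of the tokens generated in that turn.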

Game Selection

We use a portfolio of six strategic games, split into training and held-out testing sets to evaluate generalization:

Game Selection Icons

Game selection: Tic-Tac-Toe, Kuhn Poker, and Mini Hanabi for training; Connect Four, Leduc Hold'em, and Simple Hanabi for held-out testing.
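
For concreteness, all six games are available in OpenSpiel. The paper does not state which environment stack it uses, so the rollout loop below is purely illustrative, with a random policy standing in for the LLM agent.

```python
# Illustrative only: the paper does not state which game engine it uses.
# OpenSpiel implements all six games; this shows a generic self-play rollout
# with a random policy standing in for the LLM agent.
import random
import pyspiel

def llm_policy(observation: str, legal_actions: list) -> int:
    """Placeholder for an LLM agent that reads a text observation and picks
    one of the legal action ids (here: uniformly at random)."""
    return random.choice(legal_actions)

def self_play_episode(game_name: str = "tic_tac_toe"):
    game = pyspiel.load_game(game_name)   # e.g. "kuhn_poker", "connect_four", "leduc_poker"
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.is_chance_node():
            # Sample chance outcomes such as card deals in Kuhn/Leduc poker.
            actions, probs = zip(*state.chance_outcomes())
            state.apply_action(random.choices(actions, weights=probs)[0])
            continue
        player = state.current_player()
        action = llm_policy(str(state), state.legal_actions(player))
        state.apply_action(action)
    return state.returns()   # per-player final returns, e.g. [1.0, -1.0]

print(self_play_episode())
```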

Reward Design

The reward signal consists of three components:

Main Results

Based on Qwen3-4B [2], we train two model types: specialist agents on single games (Tic-Tac-Toe, Kuhn Poker, Mini Hanabi) and a generalist agent on all three simultaneously.

Strategic Ability

Specialist agents generalize effectively to their more complex, held-out counterparts (e.g., from Tic-Tac-Toe to Connect Four). The generalist model achieves high performance across all settings, with a 28.7% improvement on Leduc Hold'em and 22.9% on Simple Hanabi, demonstrating robust skill transfer.

Game Performance

Average normalized game returns.

Generalization to Multi-Agent Systems

We evaluate how these capabilities transfer to reasoning benchmarks within multi-agent systems, including the competitive MAD [3] and the cooperative AutoGen [4]. In MAD, the generalist agent improves by 3.51% on average. Notably, it achieves striking gains on challenging benchmarks, including +7.57% on GPQA-Diamond in MAD and +10.00% on AIME in AutoGen.

Setting / Model            Average  MATH     GSM8K    AQUA     AIME     AMC      MMLU     GPQA
Single Agent
  Qwen3-4B                 60.74    87.60    94.60    39.80    36.70    70.00    57.10    39.39
  SPIRAL                   63.75    87.50    94.80    51.20    36.70    80.00    58.70    37.37
  MARSHAL (Tic-Tac-Toe)    63.54    89.10    95.20    46.50    40.00    77.50    57.60    38.89
  MARSHAL (Kuhn Poker)     61.38    87.80    94.50    48.40    33.30    72.50    59.30    33.84
  MARSHAL (Mini Hanabi)    62.05    88.10    94.70    48.00    43.30    65.00    58.90    36.36
  MARSHAL (Generalist)     62.79    89.90    94.60    52.00    33.30    75.00    59.90    34.85
MAD (Competitive)
  Qwen3-4B                 72.45    90.20    95.91    80.71    40.00    75.00    87.42    37.88
  SPIRAL                   73.41    91.60    95.45    81.89    40.00    77.50    87.01    40.40
  MARSHAL (Tic-Tac-Toe)    75.01    92.20    96.06    83.07    43.33    82.50    86.76    41.12
  MARSHAL (Kuhn Poker)     74.54    91.60    96.21    82.68    40.00    82.50    87.39    41.41
  MARSHAL (Mini Hanabi)    73.70    91.40    95.60    82.68    43.33    77.50    87.04    38.38
  MARSHAL (Generalist)     75.96    92.80    95.60    83.86    46.67    80.00    87.36    45.45
AutoGen (Cooperative)
  Qwen3-4B                 79.14    93.40    94.69    85.04    56.67    87.50    89.21    47.47
  SPIRAL                   80.05    94.20    94.47    86.61    60.00    87.50    91.60    45.96
  MARSHAL (Tic-Tac-Toe)    80.15    94.40    94.69    87.01    60.00    90.00    89.53    45.45
  MARSHAL (Kuhn Poker)     81.54    95.80    94.39    86.61    63.33    92.50    89.65    48.48
  MARSHAL (Mini Hanabi)    81.54    94.40    94.54    86.22    66.67    95.00    88.98    44.95
  MARSHAL (Generalist)     82.15    95.20    94.54    86.61    66.67    92.50    89.53    50.00

Evaluation results on downstream reasoning benchmarks within multi-agent systems. MATH, GSM8K, AQUA, AIME, and AMC are math benchmarks; MMLU and GPQA (GPQA-Diamond) are QA benchmarks.
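
As an illustration of the cooperative AutoGen setting, a MARSHAL checkpoint served through an OpenAI-compatible endpoint could be wired into a standard assistant / user-proxy pair roughly as follows. The endpoint URL, model name, and task prompt are placeholders, and this sketch is not the paper's evaluation harness.

```python
# Hypothetical wiring of a MARSHAL checkpoint into AutoGen's two-agent loop.
# The base_url / model name are placeholders for an OpenAI-compatible server
# (e.g., a local vLLM instance); the paper's actual setup may differ.
from autogen import AssistantAgent, UserProxyAgent

config_list = [{
    "model": "marshal-qwen3-4b",              # placeholder model name
    "base_url": "http://localhost:8000/v1",   # placeholder local endpoint
    "api_key": "EMPTY",
}]

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)

# The user proxy ends the chat when the assistant emits 'TERMINATE',
# mirroring the termination signal discussed in the qualitative analysis.
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
)

user_proxy.initiate_chat(
    assistant,
    message="Compute the sum of the first 100 positive integers.",  # placeholder task
)
```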

Pattern Analysis

Qualitative

Analysis of reasoning traces reveals emergent multi-agent skills cultivated in games and transferred to general-purpose systems.

Role Understanding
  In game-play: The Tic-Tac-Toe specialist recognizes its role as the second player (O) and adopts a defensive strategy.
  <think>
  Okay, so I am playing the game of Tic-Tac-Toe as the mark O... As the second-move player, I should prioritize blocking the X marks from forming a horizontal, vertical, or diagonal line...
  </think>
  In multi-agent systems: The same agent, acting as the "negative" debater in MAD, adapts its strategy to refute the opponent.
  <think>
  Okay, so I need to figure out the minimum number of... The affirmative side just argued for 4. As the negative side assistant, I should try to find an alternative solution and convince the judge...
  </think>

Intent Recognition
  In game-play: The Hanabi specialist infers the intent behind a teammate's ambiguous hint.
  <think>
  Okay, so I need to figure out the next move. Player 0 just revealed my only red card. Wait, but the rank remains unknown. Maybe they want me to play this card to the stack? ...
  </think>
  In multi-agent systems: The same agent, acting as a user proxy in AutoGen, infers uncertainty from a collaborator's missing 'TERMINATE' token.
  <think>
  Okay, let's see. The assistant gave the answer 17.5 m/s, but did not end the conversation with 'TERMINATE'. Maybe the assistant is not sure with this answer? ...
  </think>

Qualitative analysis of emergent reasoning patterns.
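
The MAD traces above involve an affirmative debater, a negative debater, and a judge. For readers unfamiliar with the protocol, a minimal debate loop might look like the sketch below; llm() is a placeholder for any chat-completion call, and the prompts and round structure of the actual MAD implementation [3] may differ.

```python
# Toy sketch of a MAD-style debate loop (roles as in the trace above).
# llm() is a stand-in: swap in a real chat-completion client to use it.
def llm(system: str, prompt: str) -> str:
    # Placeholder reply so the sketch runs end to end without an API key.
    return f"[{system}] (model reply to: {prompt[-60:]!r})"

def debate(question: str, rounds: int = 2) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(rounds):
        # Affirmative proposes and defends an answer.
        aff = llm("affirmative debater", transcript)
        transcript += f"Affirmative: {aff}\n"
        # Negative tries to refute it and offer an alternative.
        neg = llm("negative debater", transcript)
        transcript += f"Negative: {neg}\n"
    # The judge reads the whole transcript and commits to a final answer.
    return llm("judge", transcript)

print(debate("What is the minimum number of colors needed to color a planar map?"))
```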

Quantitative

Following the taxonomy of Cemri et al. (2025) [5], failure mode analysis on GPQA-Diamond shows an 11.5% reduction in Inter-Agent Misalignment. This confirms that MARSHAL agents actively listen to peers and maintain focus, validating that game-theoretic self-play cultivates transferable multi-agent reasoning skills.

Failure Mode Analysis

Percentage of different failure modes in GPQA-Diamond.

Ablation

Self-Play vs. Fixed-Opponent

Training against fixed opponents leads to overfitting and poor generalization. Self-play provides an adaptive curriculum essential for developing robust, generalizable policies.

Model                    Tic-Tac-Toe    Kuhn Poker     Mini Hanabi  Connect Four   Leduc Hold'em  Simple Hanabi
                         (training)     (training)     (training)   (held-out)     (held-out)     (held-out)
MARSHAL (Tic-Tac-Toe)    75.30 / 32.10  74.15 / 3.42   50.48        30.65 / 14.85  58.36 / 27.65  29.75
  w/ fixed opponent      88.00 / 41.95  63.15 / 28.84  34.93        20.35 / 5.65   47.38 / 35.55  12.22
MARSHAL (Kuhn Poker)     69.85 / 25.50  79.04 / 22.49  44.98        27.60 / 12.70  63.94 / 62.10  29.35
  w/ fixed opponent      0.00 / 0.00    76.19 / 15.64  0.00         0.00 / 0.00    0.00 / 0.00    0.00

Generalization comparison between MARSHAL (self-play) and its fixed-opponent variant. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.
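
To make the distinction concrete, the toy sketch below, our assumption rather than the paper's code, shows how trajectory collection differs between self-play and the fixed-opponent variant; OpenSpiel is used only as a stand-in environment, and chance nodes (e.g., card deals) are omitted.

```python
# Toy sketch: self-play vs. fixed-opponent trajectory collection (assumed setup).
import random
import pyspiel

def random_policy(state):
    # Stand-in for an LLM policy: pick any legal action for the current player.
    return random.choice(state.legal_actions())

def collect_turns(learner, frozen_opponent=None, game_name="tic_tac_toe"):
    """With frozen_opponent=None the learner controls every seat (self-play),
    so its opponent improves alongside it and every turn is trainable.
    Otherwise a frozen checkpoint plays seat 1; its turns only advance the
    game and are excluded from the training batch."""
    game = pyspiel.load_game(game_name)
    state = game.new_initial_state()
    batch = []
    while not state.is_terminal():
        seat = state.current_player()
        use_learner = frozen_opponent is None or seat == 0
        policy = learner if use_learner else frozen_opponent
        action = policy(state)
        if use_learner:
            batch.append((seat, str(state), action))
        state.apply_action(action)
    return batch, state.returns()

# Self-play: both seats trainable.  Fixed opponent: only seat 0 trainable.
print(len(collect_turns(random_policy)[0]),
      len(collect_turns(random_policy, frozen_opponent=random_policy)[0]))
```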

Analysis of Algorithmic Design

Ablation studies confirm that both the turn-level advantage estimator and agent-specific normalization are critical. Removing either component significantly degrades performance, especially in long-horizon and competitive games.

Model                    Tic-Tac-Toe    Kuhn Poker     Mini Hanabi  Connect Four   Leduc Hold'em  Simple Hanabi
                         (training)     (training)     (training)   (held-out)     (held-out)     (held-out)
MARSHAL (Tic-Tac-Toe)    75.30 / 32.10  74.15 / 3.42   50.48        30.65 / 14.85  58.36 / 27.65  29.75
  w/o Turn-Level         74.60 / 24.15  80.26 / 28.35  34.80        26.75 / 12.30  48.34 / 41.34  19.05
  w/o Agent-Specific     82.70 / 31.20  70.89 / 11.24  44.10        25.40 / 10.50  51.04 / 49.88  21.72
MARSHAL (Kuhn Poker)     69.85 / 25.50  79.04 / 22.49  44.98        27.60 / 12.70  63.94 / 62.10  29.35
  w/o Turn-Level         63.35 / 19.65  92.49 / 21.02  41.65        29.60 / 10.85  32.26 / 31.23  22.98
  w/o Agent-Specific     69.55 / 24.55  75.37 / 19.55  40.18        27.00 / 10.50  35.73 / 21.50  22.42
MARSHAL (Mini Hanabi)    71.90 / 7.35   72.52 / 9.29   55.55        26.75 / 5.75   37.36 / 55.12  33.93
  w/o Turn-Level         67.55 / 10.60  68.45 / 31.78  53.20        25.25 / 3.05   54.79 / 47.77  30.68
  w/o Agent-Specific     68.15 / 13.40  74.15 / 10.27  52.50        32.10 / 5.10   44.30 / 56.41  32.08

Ablation results for algorithmic design. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.

Scaling to Larger Models

We extend training to the larger Qwen3-8B model. The results show that MARSHAL scales stably, unlocking both cooperative and competitive reasoning capabilities at this larger scale.

Strategic Ability (8B)

Model          Tic-Tac-Toe  Kuhn Poker  Mini Hanabi  Connect Four  Leduc Hold'em  Simple Hanabi
Qwen3-8B       48.38        33.12       27.00        10.48         7.26           4.55
MARSHAL (8B)   54.05        44.49       55.28        21.55         53.89          37.27

Strategic ability comparison on the larger Qwen3-8B.

Generalization to MAS (8B)

MAS       Model          Avg     MATH    GSM8K   AQUA    AIME    AMC     MMLU    GPQA
MAD       Qwen3-8B       82.49   95.00   96.36   83.46   70.00   90.00   89.59   53.03
          MARSHAL (8B)   85.09   96.40   96.59   83.46   80.00   95.00   90.70   53.54
AutoGen   Qwen3-8B       79.68   88.80   95.91   83.07   60.00   89.19   89.30   51.52
          MARSHAL (8B)   83.58   94.40   95.00   85.04   70.00   95.00   90.04   55.56

Generalization to multi-agent systems using the larger Qwen3-8B.

Citation

@misc{yuan2025marshal,
      title={MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs}, 
      author={Huining Yuan and Zelai Xu and Zheyue Tan and Xiangmin Yi and Mo Guang and Kaiwen Long and Haojia Hui and Boxun Li and Xinlei Chen and Bo Zhao and Xiao-Ping Zhang and Chao Yu and Yu Wang},
      year={2025},
      eprint={2510.15414},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.15414}, 
}

References

[1] Shao, Zhihong, et al. "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models." arXiv:2402.03300.

[2] Yang, An, et al. "Qwen3 technical report." arXiv:2505.09388.

[3] Liang, Tian, et al. "Encouraging divergent thinking in large language models through multi-agent debate." EMNLP 2024.

[4] Wu, Qingyun, et al. "AutoGen: Enabling next-gen LLM applications via multi-agent conversation." COLM 2024.

[5] Cemri, Mert, et al. "Why do multi-agent LLM systems fail?" NeurIPS 2025.