Overview
We introduce MARSHAL, an end-to-end reinforcement learning framework designed to incentivize Multi-Agent Reasoning through Self-play witH strAtegic LLMs in a diverse range of competitive and cooperative games.
MARSHAL addresses the challenge of credit assignment in multi-agent multi-turn self-play through two core mechanisms:
- Turn-level Advantage Estimator: Enables fine-grained credit assignment, allowing the model to accurately attribute long-term outcomes to individual actions.
- Agent-specific Advantage Normalization: Stabilizes the training process by calibrating advantage estimates relative to the performance of each agent.
Key Results
By leveraging self-play across strategic games, MARSHAL (based on Qwen3-4B) demonstrates notable generalization capabilities:
- Strategic Games: Achieves up to 28.7% performance improvement on held-out games.
- Reasoning Benchmarks: When integrated into leading multi-agent systems (MASs), MARSHAL yields consistent gains:
  - +10.0% on AIME
  - +7.6% on GPQA-Diamond
  - +3.5% on average across all tested benchmarks
Figure 1. Evaluation of MARSHAL and two baselines on strategic games and reasoning benchmarks.
Game Replay
Click 'Next Step' to view a self-play demonstration of the MARSHAL agent in Tic-Tac-Toe. Observe the agent's reasoning process regarding game states and strategies as it controls both Player 0 (X) and Player 1 (O).
Method
Algorithmic Design
We introduce two modifications to GRPO for multi-agent self-play: a turn-level advantage estimator for fine-grained credit assignment, and agent-specific advantage normalization to stabilize training across heterogeneous roles.
Figure 2. Overview of MARSHAL. Left column: generating player trajectories through self-play in strategic games. Middle column: naive advantage estimation by GRPO. Right column: advantage estimation by MARSHAL for accurate credit assignment in the multi-turn, multi-agent setting.
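To make the two modifications concrete, below is a minimal sketch of how turn-level advantages with agent-specific normalization could be computed over a group of self-play games. This is one plausible instantiation under our own assumptions (Monte Carlo turn-level returns, per-agent standardization over the group); the names `marshal_advantages`, `groups`, and the discount `gamma` are illustrative and not taken from the paper's code.

```python
import numpy as np

def marshal_advantages(groups, gamma=1.0):
    """Turn-level advantages with agent-specific normalization (sketch).

    `groups` is a list of self-play games from the same group; each game is
    a list of turns, and each turn is a dict with:
      - "agent":  index of the player that produced the turn (e.g. 0 or 1)
      - "reward": turn-level reward (game outcome credited at the final
                  turn, plus any format / length shaping at that turn)
    Adds "return" and "advantage" keys to every turn in place.
    """
    # 1. Turn-level credit: each turn receives the (discounted) sum of the
    #    rewards that follow it in the same game, so long-term outcomes are
    #    attributed to the individual actions that produced them.
    for game in groups:
        future = 0.0
        for turn in reversed(game):
            future = turn["reward"] + gamma * future
            turn["return"] = future

    # 2. Agent-specific normalization: standardize returns with statistics
    #    pooled over each agent's own turns across the whole group, so the
    #    two (possibly asymmetric) roles are calibrated independently.
    returns_by_agent = {}
    for game in groups:
        for turn in game:
            returns_by_agent.setdefault(turn["agent"], []).append(turn["return"])
    stats = {a: (np.mean(r), np.std(r) + 1e-8) for a, r in returns_by_agent.items()}

    for game in groups:
        for turn in game:
            mean, std = stats[turn["agent"]]
            turn["advantage"] = (turn["return"] - mean) / std
    return groups
```

In this sketch, every token generated within a turn would share that turn's advantage in the policy-gradient update, whereas naive GRPO broadcasts a single outcome-level advantage over the entire multi-turn trajectory.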
Game Selection
We use a portfolio of six strategic games, split into training and held-out testing sets to evaluate generalization (summarized in the sketch below):
- Perfect-information competitive: For deterministic planning and role adaptation, train on Tic-Tac-Toe; test on Connect Four.
- Imperfect-information competitive: For robust reasoning under uncertainty, train on Kuhn Poker; test on Leduc Hold'em.
- Imperfect-information cooperative: For intent recognition and theory of mind, train on Mini Hanabi; test on Simple Hanabi.
Figure 3. Game selection. From left to right, Tic-Tac-Toe, Kuhn Poker, Hanabi, Connect Four, Leduc Hold'em.
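The train/held-out pairing can be summarized as a small configuration mapping. The identifiers below are ours for illustration, not the codebase's actual environment names.

```python
# Illustrative train / held-out split per game category
# (identifiers are placeholders, not the codebase's environment names).
GAME_PORTFOLIO = {
    "perfect_info_competitive":   {"train": "tic_tac_toe", "test": "connect_four"},
    "imperfect_info_competitive": {"train": "kuhn_poker",  "test": "leduc_holdem"},
    "imperfect_info_cooperative": {"train": "mini_hanabi", "test": "simple_hanabi"},
}
```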
Reward Design
The reward signal consists of three components, combined per turn (see the sketch after the list):
- Intrinsic game rewards: +/-1 for win/loss/draw or chips won; shared reward for cooperative games.
- Action format regularization: Small reward for valid actions; large penalty for invalid ones.
- Response length penalty: Turn-level penalty for verbosity to encourage conciseness.
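As a rough illustration of how these components might combine into a single turn-level scalar, here is a sketch; the coefficients (`valid_bonus`, `invalid_penalty`, `length_coef`, `max_free_tokens`) are placeholder assumptions, not the values used in the paper.

```python
def shaped_turn_reward(game_reward, action_is_valid, num_response_tokens,
                       valid_bonus=0.1, invalid_penalty=-1.0,
                       length_coef=1e-3, max_free_tokens=256):
    """Combine the three reward components for a single turn (sketch)."""
    # Intrinsic game reward: +/-1 outcome or chips won, supplied by the
    # environment (typically nonzero only once the game ends); in
    # cooperative games the same team reward is shared by all agents.
    reward = game_reward
    # Action format regularization: small bonus for a parseable, legal
    # action; large penalty for an invalid one.
    reward += valid_bonus if action_is_valid else invalid_penalty
    # Response length penalty: per-turn charge for verbosity beyond a
    # token budget, encouraging concise reasoning.
    reward -= length_coef * max(0, num_response_tokens - max_free_tokens)
    return reward
```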
Main Results
Based on Qwen3-4B, we train two model types: specialist agents on single games (Tic-Tac-Toe, Kuhn Poker, Mini Hanabi) and a generalist agent on all three simultaneously.
Strategic Ability
Specialist agents generalize effectively to their more complex, held-out counterparts (e.g., from Tic-Tac-Toe to Connect Four). The generalist model achieves high performance across all settings, with 28.7% improvement on Leduc Hold'em and 22.9% on Simple Hanabi, demonstrating robust skill transfer.
Figure 4. Average normalized game returns.
Generalization to Multi-Agent Systems
We evaluate how these capabilities transfer to reasoning benchmarks within multi-agent systems, including the competitive MAD (Multi-Agent Debate) and the cooperative AutoGen. In MAD, the generalist agent improves by 3.51% on average. Notably, it achieves striking gains on challenging benchmarks, including +6.57% on GPQA-Diamond in MAD and +10.00% on AIME in AutoGen.
| Setting | Model | Average | MATH | GSM8K | AQUA | AIME | AMC | MMLU | GPQA |
|---|---|---|---|---|---|---|---|---|---|
| Single Agent | Qwen3-4B | 60.74 | 87.60 | 94.60 | 39.80 | 36.70 | 70.00 | 57.10 | 39.39 |
| | SPIRAL | 63.75 | 87.50 | 94.80 | 51.20 | 36.70 | 80.00 | 58.70 | 37.37 |
| | MARSHAL (Tic-Tac-Toe) | 63.54 | 89.10 | 95.20 | 46.50 | 40.00 | 77.50 | 57.60 | 38.89 |
| | MARSHAL (Kuhn Poker) | 61.38 | 87.80 | 94.50 | 48.40 | 33.30 | 72.50 | 59.30 | 33.84 |
| | MARSHAL (Mini Hanabi) | 62.05 | 88.10 | 94.70 | 48.00 | 43.30 | 65.00 | 58.90 | 36.36 |
| | MARSHAL (Generalist) | 62.79 | 89.90 | 94.60 | 52.00 | 33.30 | 75.00 | 59.90 | 34.85 |
| MAD (Competitive) | Qwen3-4B | 72.45 | 90.20 | 95.91 | 80.71 | 40.00 | 75.00 | 87.42 | 37.88 |
| | SPIRAL | 73.41 | 91.60 | 95.45 | 81.89 | 40.00 | 77.50 | 87.01 | 40.40 |
| | MARSHAL (Tic-Tac-Toe) | 75.01 | 92.20 | 96.06 | 83.07 | 43.33 | 82.50 | 86.76 | 41.12 |
| | MARSHAL (Kuhn Poker) | 74.54 | 91.60 | 96.21 | 82.68 | 40.00 | 82.50 | 87.39 | 41.41 |
| | MARSHAL (Mini Hanabi) | 73.70 | 91.40 | 95.60 | 82.68 | 43.33 | 77.50 | 87.04 | 38.38 |
| | MARSHAL (Generalist) | 75.96 | 92.80 | 95.60 | 83.86 | 46.67 | 80.00 | 87.36 | 45.45 |
| AutoGen (Cooperative) | Qwen3-4B | 79.14 | 93.40 | 94.69 | 85.04 | 56.67 | 87.50 | 89.21 | 47.47 |
| | SPIRAL | 80.05 | 94.20 | 94.47 | 86.61 | 60.00 | 87.50 | 91.60 | 45.96 |
| | MARSHAL (Tic-Tac-Toe) | 80.15 | 94.40 | 94.69 | 87.01 | 60.00 | 90.00 | 89.53 | 45.45 |
| | MARSHAL (Kuhn Poker) | 81.54 | 95.80 | 94.39 | 86.61 | 63.33 | 92.50 | 89.65 | 48.48 |
| | MARSHAL (Mini Hanabi) | 81.54 | 94.40 | 94.54 | 86.22 | 66.67 | 95.00 | 88.98 | 44.95 |
| | MARSHAL (Generalist) | 82.15 | 95.20 | 94.54 | 86.61 | 66.67 | 92.50 | 89.53 | 50.00 |
Table 1. Evaluation results on downstream reasoning benchmarks within multi-agent systems.
Pattern Analysis
Qualitative
Analysis of reasoning traces reveals emergent multi-agent skills cultivated in games and transferred to general-purpose systems.
| Skill | Manifestation in Game-Play | Generalization to Multi-Agent Systems |
|---|---|---|
| Role Understanding | The Tic-Tac-Toe specialist recognizes its role as the second player (O) and adopts a defensive strategy. | The same agent, acting as the "negative" debater in MAD, adapts its strategy to refute the opponent. |
| Intent Recognition | The Hanabi specialist infers the intent behind a teammate's ambiguous hint. | The same agent, acting as a user proxy in AutoGen, infers uncertainty from a collaborator's missing 'TERMINATE' token. |
Table 2. Qualitative analysis of emergent reasoning patterns.
Quantitative
Following the taxonomy of Cemri et al. (2025), failure-mode analysis on GPQA-Diamond shows an 11.5% reduction in Inter-Agent Misalignment. This confirms that MARSHAL agents actively listen to peers and maintain focus, validating that game-theoretic self-play cultivates transferable multi-agent reasoning skills.
Figure 5. Percentage of different failure modes in GPQA-Diamond.
Ablation
Self-Play vs. Fixed-Opponent
Training against fixed opponents leads to overfitting and poor generalization. Self-play provides an adaptive curriculum essential for developing robust, generalizable policies.
| Model | Tic-Tac-Toe (train) | Kuhn Poker (train) | Mini Hanabi (train) | Connect Four (test) | Leduc Hold'em (test) | Simple Hanabi (test) |
|---|---|---|---|---|---|---|
| MARSHAL (Tic-Tac-Toe) | 75.30 / 32.10 | 74.15 / 3.42 | 50.48 | 30.65 / 14.85 | 58.36 / 27.65 | 29.75 |
| w/ fixed opponent | 88.00 / 41.95 | 63.15 / 28.84 | 34.93 | 20.35 / 5.65 | 47.38 / 35.55 | 12.22 |
| MARSHAL (Kuhn Poker) | 69.85 / 25.50 | 79.04 / 22.49 | 44.98 | 27.60 / 12.70 | 63.94 / 62.10 | 29.35 |
| w/ fixed opponent | 0.00 / 0.00 | 76.19 / 15.64 | 0.00 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 |
Table 3. Generalization comparison between MARSHAL (self-play) and its fixed-opponent variant. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.
Analysis of Algorithmic Design
Ablation studies confirm that both the turn-level advantage estimator and agent-specific normalization are critical. Removing either component significantly degrades performance, especially in long-horizon and competitive games.
| Model | Tic-Tac-Toe (train) | Kuhn Poker (train) | Mini Hanabi (train) | Connect Four (test) | Leduc Hold'em (test) | Simple Hanabi (test) |
|---|---|---|---|---|---|---|
| MARSHAL (Tic-Tac-Toe) | 75.30 / 32.10 | 74.15 / 3.42 | 50.48 | 30.65 / 14.85 | 58.36 / 27.65 | 29.75 |
| w/o Turn-Level | 74.60 / 24.15 | 80.26 / 28.35 | 34.80 | 26.75 / 12.30 | 48.34 / 41.34 | 19.05 |
| w/o Agent-Specific | 82.70 / 31.20 | 70.89 / 11.24 | 44.10 | 25.40 / 10.50 | 51.04 / 49.88 | 21.72 |
| MARSHAL (Kuhn Poker) | 69.85 / 25.50 | 79.04 / 22.49 | 44.98 | 27.60 / 12.70 | 63.94 / 62.10 | 29.35 |
| w/o Turn-Level | 63.35 / 19.65 | 92.49 / 21.02 | 41.65 | 29.60 / 10.85 | 32.26 / 31.23 | 22.98 |
| w/o Agent-Specific | 69.55 / 24.55 | 75.37 / 19.55 | 40.18 | 27.00 / 10.50 | 35.73 / 21.50 | 22.42 |
| MARSHAL (Hanabi) | 71.90 / 7.35 | 72.52 / 9.29 | 55.55 | 26.75 / 5.75 | 37.36 / 55.12 | 33.93 |
| w/o Turn-Level | 67.55 / 10.60 | 68.45 / 31.78 | 53.20 | 25.25 / 3.05 | 54.79 / 47.77 | 30.68 |
| w/o Agent-Specific | 68.15 / 13.40 | 74.15 / 10.27 | 52.50 | 32.10 / 5.10 | 44.30 / 56.41 | 32.08 |
Table 4. Ablation results for algorithmic design. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.
Scaling to Larger Models
We extended training to the larger Qwen3-8B model. Results demonstrate that MARSHAL scales stably, consistently unlocking cooperative and competitive reasoning capabilities at larger scales.
Strategic Ability (8B)
| Model | Tic-Tac-Toe | Kuhn Poker | Mini Hanabi | Connect Four | Leduc Hold'em | Simple Hanabi |
|---|---|---|---|---|---|---|
| Qwen3-8B | 48.38 | 33.12 | 27.00 | 10.48 | 7.26 | 4.55 |
| MARSHAL (8B) | 54.05 | 44.49 | 55.28 | 21.55 | 53.89 | 37.27 |
Table 5. Strategic ability comparison on the larger Qwen3-8B.
Generalization to MAS (8B)
| MAS | Model | Avg | MATH | GSM8K | AQUA | AIME | AMC | MMLU | GPQA |
|---|---|---|---|---|---|---|---|---|---|
| MAD | Qwen3-8B | 82.49 | 95.00 | 96.36 | 83.46 | 70.00 | 90.00 | 89.59 | 53.03 |
| | MARSHAL (8B) | 85.09 | 96.40 | 96.59 | 83.46 | 80.00 | 95.00 | 90.70 | 53.54 |
| AutoGen | Qwen3-8B | 79.68 | 88.80 | 95.91 | 83.07 | 60.00 | 89.19 | 89.30 | 51.52 |
| | MARSHAL (8B) | 83.58 | 94.40 | 95.00 | 85.04 | 70.00 | 95.00 | 90.04 | 55.56 |
Table 6. Generalization to multi-agent systems using the larger Qwen3-8B.
BibTeX
@article{yuan2025marshal,
title={MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs},
author={Yuan, Huining and Xu, Zelai and Tan, Zheyue and Yi, Xiangmin and Guang, Mo and Long, Kaiwen and Hui, Haojia and Li, Boxun and Chen, Xinlei and Zhao, Bo and others},
journal={arXiv preprint arXiv:2510.15414},
year={2025}
}
Models