Overview
We introduce MARSHAL, an end-to-end reinforcement learning framework designed to incentivize Multi-Agent Reasoning through Self-play witH strAtegic LLMs in a diverse range of competitive and cooperative games.
MARSHAL addresses the challenge of credit assignment in multi-agent multi-turn self-play through two core mechanisms:
- Turn-level Advantage Estimator: Enables fine-grained credit assignment, allowing the model to accurately attribute long-term outcomes to individual actions.
- Agent-specific Advantage Normalization: Stabilizes the training process by calibrating advantage estimates relative to the performance of each agent.
Key Results
By leveraging self-play across strategic games, MARSHAL (based on Qwen3-4B) demonstrates notable generalization capabilities:
- Strategic Games: Achieves up to 28.7% performance improvement on held-out games.
- Reasoning Benchmarks: When integrated into leading multi-agent systems (MASs), MARSHAL yields consistent gains:
  - +10.0% on AIME
  - +7.6% on GPQA-Diamond
  - +3.5% on average across all tested benchmarks
Figure 1. Evaluation of MARSHAL and two baselines on strategic games and reasoning benchmarks.
Game Replay
Click 'Next Step' to view a self-play demonstration of the MARSHAL agent in Tic-Tac-Toe. Observe the agent's reasoning process regarding game states and strategies as it controls both Player 0 (X) and Player 1 (O).
Method
Algorithmic Design
We introduce two modifications to GRPO for multi-agent self-play: a turn-level advantage estimator for fine-grained credit assignment, and agent-specific advantage normalization to stabilize training across heterogeneous roles.
Figure 2. Overview of MARSHAL. Left column: generating player trajectories through self-play in strategic games. Middle column: naive advantage estimation by GRPO. Right column: advantage estimation by MARSHAL for accurate credit assignment in the multi-turn, multi-agent setting.
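To make the two modifications concrete, below is a minimal sketch of how turn-level advantages with agent-specific normalization could be computed over a group of self-play games. This is one plausible instantiation under our own assumptions (Monte Carlo turn-level returns, per-agent standardization over the group); the names `marshal_advantages`, `groups`, and the discount `gamma` are illustrative and not taken from the paper's code.

```python
import numpy as np

def marshal_advantages(groups, gamma=1.0):
    """Turn-level advantages with agent-specific normalization (sketch).

    `groups` is a list of self-play games from the same group; each game is
    a list of turns, and each turn is a dict with:
      - "agent":  index of the player that produced the turn (e.g. 0 or 1)
      - "reward": turn-level reward (game outcome credited at the final
                  turn, plus any format / length shaping at that turn)
    Adds "return" and "advantage" keys to every turn in place.
    """
    # 1. Turn-level credit: each turn receives the (discounted) sum of the
    #    rewards that follow it in the same game, so long-term outcomes are
    #    attributed to the individual actions that produced them.
    for game in groups:
        future = 0.0
        for turn in reversed(game):
            future = turn["reward"] + gamma * future
            turn["return"] = future

    # 2. Agent-specific normalization: standardize returns with statistics
    #    pooled over each agent's own turns across the whole group, so the
    #    two (possibly asymmetric) roles are calibrated independently.
    returns_by_agent = {}
    for game in groups:
        for turn in game:
            returns_by_agent.setdefault(turn["agent"], []).append(turn["return"])
    stats = {a: (np.mean(r), np.std(r) + 1e-8) for a, r in returns_by_agent.items()}

    for game in groups:
        for turn in game:
            mean, std = stats[turn["agent"]]
            turn["advantage"] = (turn["return"] - mean) / std
    return groups
```

In this sketch, every token generated within a turn would share that turn's advantage in the policy-gradient update, whereas naive GRPO broadcasts a single outcome-level advantage over the entire multi-turn trajectory.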
Game Selection
We use a portfolio of six strategic games, split into training and held-out testing sets to evaluate generalization (summarized in the sketch below):
- Perfect-information competitive: For deterministic planning and role adaptation, train on Tic-Tac-Toe; test on Connect Four.
- Imperfect-information competitive: For robust reasoning under uncertainty, train on Kuhn Poker; test on Leduc Hold'em.
- Imperfect-information cooperative: For intent recognition and theory of mind, train on Mini Hanabi; test on Simple Hanabi.
Figure 3. Game selection. From left to right, Tic-Tac-Toe, Kuhn Poker, Hanabi, Connect Four, Leduc Hold'em.
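The train/held-out pairing can be summarized as a small configuration mapping. The identifiers below are ours for illustration, not the codebase's actual environment names.

```python
# Illustrative train / held-out split per game category
# (identifiers are placeholders, not the codebase's environment names).
GAME_PORTFOLIO = {
    "perfect_info_competitive":   {"train": "tic_tac_toe", "test": "connect_four"},
    "imperfect_info_competitive": {"train": "kuhn_poker",  "test": "leduc_holdem"},
    "imperfect_info_cooperative": {"train": "mini_hanabi", "test": "simple_hanabi"},
}
```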
Reward Design
The reward signal consists of three components, combined per turn (see the sketch after the list):
- Intrinsic game rewards: +/-1 for win/loss/draw or chips won; shared reward for cooperative games.
- Action format regularization: Small reward for valid actions; large penalty for invalid ones.
- Response length penalty: Turn-level penalty for verbosity to encourage conciseness.
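As a rough illustration of how these components might combine into a single turn-level scalar, here is a sketch; the coefficients (`valid_bonus`, `invalid_penalty`, `length_coef`, `max_free_tokens`) are placeholder assumptions, not the values used in the paper.

```python
def shaped_turn_reward(game_reward, action_is_valid, num_response_tokens,
                       valid_bonus=0.1, invalid_penalty=-1.0,
                       length_coef=1e-3, max_free_tokens=256):
    """Combine the three reward components for a single turn (sketch)."""
    # Intrinsic game reward: +/-1 outcome or chips won, supplied by the
    # environment (typically nonzero only once the game ends); in
    # cooperative games the same team reward is shared by all agents.
    reward = game_reward
    # Action format regularization: small bonus for a parseable, legal
    # action; large penalty for an invalid one.
    reward += valid_bonus if action_is_valid else invalid_penalty
    # Response length penalty: per-turn charge for verbosity beyond a
    # token budget, encouraging concise reasoning.
    reward -= length_coef * max(0, num_response_tokens - max_free_tokens)
    return reward
```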
Main Results
Based on Qwen3-4B, we train two model types: specialist agents on single games (Tic-Tac-Toe, Kuhn Poker, Mini Hanabi) and a generalist agent on all three simultaneously.
Strategic Ability
Specialist agents generalize effectively to their more complex, held-out counterparts (e.g., from Tic-Tac-Toe to Connect Four). The generalist model achieves high performance across all settings, with 28.7% improvement on Leduc Hold'em and 22.9% on Simple Hanabi, demonstrating robust skill transfer.
Figure 4. Average normalized game returns.
Generalization to Multi-Agent Systems
We evaluate how these capabilities transfer to reasoning benchmarks within multi-agent systems, including the competitive MAD (Multi-Agent Debate) and the cooperative AutoGen. In MAD, the generalist agent improves by 3.51% on average. Notably, it achieves striking gains on challenging benchmarks, including +6.57% on GPQA-Diamond in MAD and +10.00% on AIME in AutoGen.
| Setting | Model | Average | MATH | GSM8K | AQUA | AIME | AMC | MMLU | GPQA |
|---|---|---|---|---|---|---|---|---|---|
| Single Agent | Qwen3-4B | 60.74 | 87.60 | 94.60 | 39.80 | 36.70 | 70.00 | 57.10 | 39.39 |
| | SPIRAL | 63.75 | 87.50 | 94.80 | 51.20 | 36.70 | 80.00 | 58.70 | 37.37 |
| | MARSHAL (Tic-Tac-Toe) | 63.54 | 89.10 | 95.20 | 46.50 | 40.00 | 77.50 | 57.60 | 38.89 |
| | MARSHAL (Kuhn Poker) | 61.38 | 87.80 | 94.50 | 48.40 | 33.30 | 72.50 | 59.30 | 33.84 |
| | MARSHAL (Mini Hanabi) | 62.05 | 88.10 | 94.70 | 48.00 | 43.30 | 65.00 | 58.90 | 36.36 |
| | MARSHAL (Generalist) | 62.79 | 89.90 | 94.60 | 52.00 | 33.30 | 75.00 | 59.90 | 34.85 |
| MAD (Competitive) | Qwen3-4B | 72.45 | 90.20 | 95.91 | 80.71 | 40.00 | 75.00 | 87.42 | 37.88 |
| | SPIRAL | 73.41 | 91.60 | 95.45 | 81.89 | 40.00 | 77.50 | 87.01 | 40.40 |
| | MARSHAL (Tic-Tac-Toe) | 75.01 | 92.20 | 96.06 | 83.07 | 43.33 | 82.50 | 86.76 | 41.12 |
| | MARSHAL (Kuhn Poker) | 74.54 | 91.60 | 96.21 | 82.68 | 40.00 | 82.50 | 87.39 | 41.41 |
| | MARSHAL (Mini Hanabi) | 73.70 | 91.40 | 95.60 | 82.68 | 43.33 | 77.50 | 87.04 | 38.38 |
| | MARSHAL (Generalist) | 75.96 | 92.80 | 95.60 | 83.86 | 46.67 | 80.00 | 87.36 | 45.45 |
| AutoGen (Cooperative) | Qwen3-4B | 79.14 | 93.40 | 94.69 | 85.04 | 56.67 | 87.50 | 89.21 | 47.47 |
| | SPIRAL | 80.05 | 94.20 | 94.47 | 86.61 | 60.00 | 87.50 | 91.60 | 45.96 |
| | MARSHAL (Tic-Tac-Toe) | 80.15 | 94.40 | 94.69 | 87.01 | 60.00 | 90.00 | 89.53 | 45.45 |
| | MARSHAL (Kuhn Poker) | 81.54 | 95.80 | 94.39 | 86.61 | 63.33 | 92.50 | 89.65 | 48.48 |
| | MARSHAL (Mini Hanabi) | 81.54 | 94.40 | 94.54 | 86.22 | 66.67 | 95.00 | 88.98 | 44.95 |
| | MARSHAL (Generalist) | 82.15 | 95.20 | 94.54 | 86.61 | 66.67 | 92.50 | 89.53 | 50.00 |
Table 1. Evaluation results on downstream reasoning benchmarks within multi-agent systems.
Pattern Analysis
Qualitative
Analysis of reasoning traces reveals emergent multi-agent skills cultivated in games and transferred to general-purpose systems.
| Skill | Manifestation in Game-Play | Generalization to Multi-Agent Systems |
|---|---|---|
| Role Understanding | The Tic-Tac-Toe specialist recognizes its role as the second player (O) and adopts a defensive strategy. | The same agent, acting as the "negative" debater in MAD, adapts its strategy to refute the opponent. |
| Intent Recognition | The Hanabi specialist infers the intent behind a teammate's ambiguous hint. | The same agent, acting as a user proxy in AutoGen, infers uncertainty from a collaborator's missing 'TERMINATE' token. |
Table 2. Qualitative analysis of emergent reasoning patterns.
Quantitative
Following the taxonomy of Cemri et al. (2025), failure-mode analysis on GPQA-Diamond shows an 11.5% reduction in Inter-Agent Misalignment. This confirms that MARSHAL agents actively listen to peers and maintain focus, validating that game-theoretic self-play cultivates transferable multi-agent reasoning skills.
Figure 5. Percentage of different failure modes in GPQA-Diamond.
Ablation
Self-Play vs. Fixed-Opponent
Training against fixed opponents leads to overfitting and poor generalization. Self-play provides an adaptive curriculum essential for developing robust, generalizable policies.
| Model | Tic-Tac-Toe (train) | Kuhn Poker (train) | Mini Hanabi (train) | Connect Four (test) | Leduc Hold'em (test) | Simple Hanabi (test) |
|---|---|---|---|---|---|---|
| MARSHAL (Tic-Tac-Toe) | 75.30 / 32.10 | 74.15 / 3.42 | 50.48 | 30.65 / 14.85 | 58.36 / 27.65 | 29.75 |
| w/ fixed opponent | 88.00 / 41.95 | 63.15 / 28.84 | 34.93 | 20.35 / 5.65 | 47.38 / 35.55 | 12.22 |
| MARSHAL (Kuhn Poker) | 69.85 / 25.50 | 79.04 / 22.49 | 44.98 | 27.60 / 12.70 | 63.94 / 62.10 | 29.35 |
| w/ fixed opponent | 0.00 / 0.00 | 76.19 / 15.64 | 0.00 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 |
Table 3. Generalization comparison between MARSHAL (self-play) and its fixed-opponent variant. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.
Analysis of Algorithmic Design
Ablation studies confirm that both the turn-level advantage estimator and agent-specific normalization are critical. Removing either component significantly degrades performance, especially in long-horizon and competitive games.
| Model | Tic-Tac-Toe (train) | Kuhn Poker (train) | Mini Hanabi (train) | Connect Four (test) | Leduc Hold'em (test) | Simple Hanabi (test) |
|---|---|---|---|---|---|---|
| MARSHAL (Tic-Tac-Toe) | 75.30 / 32.10 | 74.15 / 3.42 | 50.48 | 30.65 / 14.85 | 58.36 / 27.65 | 29.75 |
| w/o Turn-Level | 74.60 / 24.15 | 80.26 / 28.35 | 34.80 | 26.75 / 12.30 | 48.34 / 41.34 | 19.05 |
| w/o Agent-Specific | 82.70 / 31.20 | 70.89 / 11.24 | 44.10 | 25.40 / 10.50 | 51.04 / 49.88 | 21.72 |
| MARSHAL (Kuhn Poker) | 69.85 / 25.50 | 79.04 / 22.49 | 44.98 | 27.60 / 12.70 | 63.94 / 62.10 | 29.35 |
| w/o Turn-Level | 63.35 / 19.65 | 92.49 / 21.02 | 41.65 | 29.60 / 10.85 | 32.26 / 31.23 | 22.98 |
| w/o Agent-Specific | 69.55 / 24.55 | 75.37 / 19.55 | 40.18 | 27.00 / 10.50 | 35.73 / 21.50 | 22.42 |
| MARSHAL (Hanabi) | 71.90 / 7.35 | 72.52 / 9.29 | 55.55 | 26.75 / 5.75 | 37.36 / 55.12 | 33.93 |
| w/o Turn-Level | 67.55 / 10.60 | 68.45 / 31.78 | 53.20 | 25.25 / 3.05 | 54.79 / 47.77 | 30.68 |
| w/o Agent-Specific | 68.15 / 13.40 | 74.15 / 10.27 | 52.50 | 32.10 / 5.10 | 44.30 / 56.41 | 32.08 |
Table 4. Ablation results for algorithmic design. For competitive games, entries indicate first-move / second-move returns. Underlined scores indicate performance degradation compared to the standard MARSHAL model.
Scaling to Larger Models
We extended training to the larger Qwen3-8B model. Results demonstrate that MARSHAL scales stably, consistently unlocking cooperative and competitive reasoning capabilities at larger scales.
Strategic Ability (8B)
| Model | Tic-Tac-Toe | Kuhn Poker | Mini Hanabi | Connect Four | Leduc Hold'em | Simple Hanabi |
|---|---|---|---|---|---|---|
| Qwen3-8B | 48.38 | 33.12 | 27.00 | 10.48 | 7.26 | 4.55 |
| MARSHAL (8B) | 54.05 | 44.49 | 55.28 | 21.55 | 53.89 | 37.27 |
Table 5. Strategic ability comparison on the larger Qwen3-8B.
Generalization to MAS (8B)
| MAS | Model | Avg | MATH | GSM8K | AQUA | AIME | AMC | MMLU | GPQA |
|---|---|---|---|---|---|---|---|---|---|
| MAD | Qwen3-8B | 82.49 | 95.00 | 96.36 | 83.46 | 70.00 | 90.00 | 89.59 | 53.03 |
| | MARSHAL (8B) | 85.09 | 96.40 | 96.59 | 83.46 | 80.00 | 95.00 | 90.70 | 53.54 |
| AutoGen | Qwen3-8B | 79.68 | 88.80 | 95.91 | 83.07 | 60.00 | 89.19 | 89.30 | 51.52 |
| | MARSHAL (8B) | 83.58 | 94.40 | 95.00 | 85.04 | 70.00 | 95.00 | 90.04 | 55.56 |
Table 6. Generalization to multi-agent systems using the larger Qwen3-8B.
BibTeX
@article{yuan2025marshal,
title={MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs},
author={Yuan, Huining and Xu, Zelai and Tan, Zheyue and Yi, Xiangmin and Guang, Mo and Long, Kaiwen and Hui, Haojia and Li, Boxun and Chen, Xinlei and Zhao, Bo and others},
journal={arXiv preprint arXiv:2510.15414},
year={2025}
}
Models