PRISM

Policy Reuse via Interpretable Strategy Mapping

A reinforcement learning framework for zero-shot strategy transfer between agents. Train a PPO agent, transfer its strategy to a DQN agent — no retraining required.

[Figure: PRISM architecture diagram showing zero-shot strategy transfer between PPO and DQN agents]
  • 92.0% zero-shot win rate (PPO to DQN transfer)
  • 1.35x curriculum speedup (5x5 to 7x7 Go transfer)
  • 75.6% causal intervention rate (concepts drive behavior)
  • 0.87+ concept stability (ARI across seeds)

The Problem

RL agents spend millions of steps learning strategy, but that knowledge is locked inside continuous neural network weights. You can't read it, extract it, or move it to a different agent.

Training a second agent on a related task means starting from scratch — no borrowing, no shortcuts. PRISM fixes that by forcing agents to reason through a shared layer of discrete, interpretable concepts.

How It Works

  1. Baseline training. Train an RL agent (PPO or DQN) with a CNN or MLP encoder that outputs 128-D feature vectors.
  2. Concept discovery. Freeze the encoder and run K-means on observations from 500 episodes. Each state maps to one of K discrete concept IDs.
  3. Bottleneck policy. Train a micro-policy that takes only concept IDs as input — no raw observations, just integers.
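The discovery step (2) and the input to the bottleneck step (3) can be sketched with scikit-learn. This is a minimal illustration, not PRISM's actual code: the random feature array stands in for frozen-encoder outputs, and the variable names and K value are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for 128-D features produced by a frozen encoder
# over observations collected from rollout episodes.
features = rng.normal(size=(5000, 128)).astype(np.float32)

# Step 2: concept discovery — K-means over the frozen-encoder features.
K = 32
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)

# Each state now maps to a discrete concept ID in [0, K).
concept_ids = kmeans.predict(features)

# Step 3: the bottleneck policy would consume only these integer IDs
# (e.g. via an embedding table) — never the raw observations.
```

At inference time, the same `kmeans.predict` call converts any new observation's features into a concept ID before it reaches the policy.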

Zero-Shot Transfer

To transfer between agents, PRISM aligns their concept spaces. It computes cosine similarity between concept centroids and runs the Hungarian algorithm to find the optimal 1:1 matching. Once aligned, a source policy's embeddings are remapped to the target agent's concept space — and run immediately, zero-shot.
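The alignment step described above can be sketched as follows, assuming concept centroids are stored as row vectors. The `align_concepts` function name and the toy permutation check are ours, not from the PRISM codebase.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_concepts(src_centroids, tgt_centroids):
    """Match source concepts to target concepts 1:1 by cosine similarity."""
    a = src_centroids / np.linalg.norm(src_centroids, axis=1, keepdims=True)
    b = tgt_centroids / np.linalg.norm(tgt_centroids, axis=1, keepdims=True)
    sim = a @ b.T                              # K x K cosine similarity
    rows, cols = linear_sum_assignment(-sim)   # Hungarian: maximize total sim
    mapping = np.empty(len(rows), dtype=int)
    mapping[rows] = cols                       # source ID -> target ID
    return mapping

# Toy check: if the target centroids are a permutation of the source
# centroids, alignment should recover the inverse permutation.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 128))
perm = rng.permutation(8)
tgt = src[perm]
mapping = align_concepts(src, tgt)
```

After alignment, remapping a source policy's concept IDs through `mapping` lets the target agent run the source strategy without retraining.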

Transfer Pair    Win Rate       Notes
PPO to DQN       92.0% ± 2.6    Best same-task result
PPO to DAgger    90.6% ± 2.2    Strong transfer
DQN to PPO       64.4% ± 6.1    DQN concepts less universal
DAgger to any    ~54%           BC sources don't transfer well

Tech Stack

Python · PyTorch · Stable-Baselines3 · scikit-learn · scipy · PettingZoo · Gymnasium

Environments

  • Go 7x7 — CNN encoder, K=64 concepts, action masking
  • Go 5x5 — Curriculum source for 7x7 transfer
  • CartPole — MLP encoder, K=32 concepts
  • LunarLander — Cross-domain transfer target
  • Acrobot — Cross-domain control task

Key Findings

Encoder quality beats alignment method

The source agent's encoder determines whether transfer works. The choice of alignment method (Hungarian, Procrustes, greedy) makes almost no difference: all achieve roughly 92% win rate, while random alignment drops to 73%.

Transitive transfer

Routing through a high-quality intermediary (A to B to C) sometimes beats direct transfer (A to C) by 11 to 13 percentage points.

Concepts are causally real

Overriding the concept IDs the policy receives changes its chosen action 75.6% of the time (p ≈ 10^-151). The concepts actually drive behavior.
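The measurement behind that number can be illustrated with a toy stand-in policy. Everything here is hypothetical (a random embedding table plus linear head in place of the trained micro-policy); only the intervention logic mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_actions = 32, 16, 4

# Toy stand-in for the bottleneck micro-policy:
# an embedding table over concept IDs feeding a linear action head.
embed = rng.normal(size=(K, d))
head = rng.normal(size=(d, n_actions))

def action_for(concept_id):
    return int(np.argmax(embed[concept_id] @ head))

# Sample original concept IDs and random override IDs.
ids = rng.integers(0, K, size=1000)
overrides = rng.integers(0, K, size=1000)

# Intervention rate: how often swapping the concept ID flips the action.
changed = np.mean([action_for(i) != action_for(j)
                   for i, j in zip(ids, overrides)])
```

A rate near zero would mean the policy ignores its concept input; a high rate means the concept IDs causally determine the action.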

Stable across seeds

Running K-means twice with different random seeds produces nearly identical clusters (ARI > 0.87). The concepts are genuine structure in the representation.
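A minimal sketch of that stability check, using synthetic well-separated features in place of real encoder outputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for encoder features: 8 well-separated clusters in 128-D.
centers = rng.normal(size=(8, 128)) * 5
features = np.vstack([c + rng.normal(size=(200, 128)) for c in centers])

# Cluster the same features twice with different random seeds.
labels_a = KMeans(n_clusters=8, n_init=10, random_state=1).fit_predict(features)
labels_b = KMeans(n_clusters=8, n_init=10, random_state=2).fit_predict(features)

# ARI is label-permutation-invariant: 1.0 means identical partitions,
# ~0.0 means no better than chance agreement.
ari = adjusted_rand_score(labels_a, labels_b)
```

ARI is the right metric here because the two runs may assign different cluster numbers to the same groups; it scores the partitions, not the labels.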

Read the Paper

Full treatment with proofs, ablations, five alignment methods compared, and results across all three domains — 23 pages.