The Problem
RL agents are impressive. Train one on Go long enough and it figures out strategy the hard way — through millions of games against itself or a fixed opponent.
But once training is done, all that knowledge is buried in neural network weights. You can't read it. You can't extract it. If you want a second agent to play the same game with a different algorithm, it starts completely from scratch.
That bothered me. PRISM is my attempt to fix it.
What PRISM Does
PRISM (Policy Reuse via Interpretable Strategy Mapping) forces RL agents to compress what they've learned into a small set of discrete concepts, then uses those concepts as a shared transfer interface.
Instead of strategies being buried in continuous weight space, PRISM makes them explicit. A PPO agent and a DQN agent, trained independently, can share strategy through 64 concept IDs.
How It Works
Training happens in three stages.
Stage 1: Baseline training. Train a standard RL agent. The agent has a CNN encoder that maps board positions to a 128-dimensional feature vector, and a policy head that maps that vector to actions.
Stage 2: Concept discovery. Freeze the encoder. Run it on gameplay episodes and cluster the resulting 128D vectors with K-means into K discrete concepts (K=64). Each observation now maps to a single integer: its concept ID.
Stage 3: Bottleneck policy. Train a new micro-policy that takes only concept IDs as input — not raw observations, not continuous features, just integers. It learns to play Go knowing only "I'm in concept 12 right now."
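Stages 2 and 3 can be sketched in a few lines. This is a minimal, illustrative version: random vectors stand in for the frozen encoder's outputs, and the `ConceptPolicy` class name and its dimensions are my own choices, not necessarily the repo's API.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K, FEAT_DIM, N_ACTIONS = 64, 128, 50  # 7x7 board + pass ~ 50 actions

# Stage 2: cluster frozen-encoder features into K discrete concepts.
features = np.random.randn(5000, FEAT_DIM).astype(np.float32)  # stand-in for encoder outputs
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
concept_ids = kmeans.labels_          # one integer per observation
centroids = kmeans.cluster_centers_   # (K, FEAT_DIM), reused later for alignment

# Stage 3: a micro-policy that sees only the concept ID, nothing else.
class ConceptPolicy(nn.Module):
    def __init__(self, k, n_actions, emb_dim=32):
        super().__init__()
        self.embed = nn.Embedding(k, emb_dim)  # the only "state" representation
        self.head = nn.Linear(emb_dim, n_actions)

    def forward(self, concept_id):
        return self.head(self.embed(concept_id))  # action logits

policy = ConceptPolicy(K, N_ACTIONS)
logits = policy(torch.tensor(concept_ids[:8]))
```

The embedding table is the whole point: the policy's input space is just K integers, so its "knowledge" is K rows of an embedding matrix, which is what gets remapped at transfer time.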
Transfer works by aligning concepts across agents. Compute cosine similarity between concept centroids from two different agents, run the Hungarian algorithm to find the optimal 1:1 matching, remap the source policy's embedding table into the target agent's concept space, and run — zero-shot, no gradient updates.
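The alignment step above maps directly onto `scipy.optimize.linear_sum_assignment`. A sketch with random centroids standing in for real ones; `align_concepts` is an illustrative name:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_concepts(src_centroids, tgt_centroids):
    """Match source concepts to target concepts by centroid cosine similarity."""
    s = src_centroids / np.linalg.norm(src_centroids, axis=1, keepdims=True)
    t = tgt_centroids / np.linalg.norm(tgt_centroids, axis=1, keepdims=True)
    sim = s @ t.T                                  # (K, K) cosine similarities
    # Hungarian solves a minimization, so negate to maximize total similarity.
    src_idx, tgt_idx = linear_sum_assignment(-sim)
    mapping = np.empty(len(src_idx), dtype=int)
    mapping[tgt_idx] = src_idx                     # target concept j -> matched source concept
    return mapping

rng = np.random.default_rng(0)
src_c, tgt_c = rng.normal(size=(64, 128)), rng.normal(size=(64, 128))
mapping = align_concepts(src_c, tgt_c)

# Zero-shot transfer: reorder the source policy's embedding rows into the
# target agent's concept space. No gradient updates anywhere.
src_embed = rng.normal(size=(64, 32))      # source micro-policy embedding table
remapped_embed = src_embed[mapping]
```

Because Hungarian enforces a 1:1 matching, `mapping` is always a full permutation of the 64 concepts, which is exactly what the greedy baseline below fails to guarantee.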
The Results
I tested six source→target pairs among three independently trained agents: PPO, DQN, and a behavioral cloning agent (DAgger on GnuGo expert demonstrations). All evaluated against GnuGo on Go 7×7, 10 seeds × 100 games each.
Two transfers succeed:
- BC → DQN: 76.4% ± 3.4% — The BC agent was trained on GnuGo expert demos, so its policy head closely matches expert play. That transfers well to DQN's concept space.
- PPO → DQN: 69.5% ± 3.2% — A strong RL-trained source transferring to a functional target encoder.
The other four fail. DQN → PPO sits at 49.8%, right at the 50% null and not significantly above it. BC → PPO is 41.5%, actually below chance. DQN → BC is 38.7%. PPO → BC is 0% and completely degenerate (more on why below).
For context, a random agent achieves 3.5%, and identity mapping (no alignment) achieves 9.2%.
Fine-tuning. The zero-shot result is just an initialization. I fine-tuned the transferred PPO→DQN policy with REINFORCE on the concept bottleneck, and ran the same REINFORCE from a randomly initialized bottleneck. The transferred policy crosses 60% win rate at generation 5 (50K steps). The from-scratch policy reaches 27% at generation 40 (400K steps) without ever crossing 60%. That's an 8× step advantage to the threshold — though this is one seed, so treat the magnitude as rough.
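The fine-tuning step is plain REINFORCE over the concept bottleneck. Here is a minimal sketch: synthetic ten-step episodes with a toy reward stand in for Go games against GnuGo, and the learning rate, episode length, and update cadence are my assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

K, N_ACTIONS = 64, 50
torch.manual_seed(0)
# Bottleneck policy: embedding table + linear head, same shape as the micro-policy.
policy = nn.Sequential(nn.Embedding(K, 32), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(policy):
    """Stand-in for a Go game: random concept sequence, toy reward signal."""
    log_probs, reward = [], 0.0
    for _ in range(10):
        cid = torch.randint(0, K, (1,))
        dist = torch.distributions.Categorical(logits=policy(cid).squeeze(0))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        reward += float(a == cid % N_ACTIONS)  # toy: reward matching the concept
    return torch.stack(log_probs), reward

for _ in range(50):
    log_probs, reward = run_episode(policy)
    loss = -(reward * log_probs).sum()   # REINFORCE: reward-weighted log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()
```

The transferred policy starts this loop from the remapped embedding table instead of random initialization, which is where the 8× step advantage comes from.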
What Didn't Work
The degenerate target. PPO → BC produces a 0% win rate because the BC encoder collapses during gameplay: it maps every board position to the same single concept. The transferred agent outputs pass on every move and loses by forfeit. The BC agent's own bottleneck achieves 98% with this setup, because its policy learned a strong constant-action strategy during training that GnuGo can't beat on a 7×7 board. As a transfer target, though, that collapsed concept space breaks completely.
The weak source. DQN received about a third as much training as PPO (57 training generations vs. 169). Its native win rate is roughly 64% vs. PPO's 99%. None of the DQN-as-source transfers succeed. An undertrained source has degenerate centroids for positions it never encountered, and no alignment method can recover information that was never encoded.
Atari Breakout. I ran the identical pipeline on Atari Breakout to see where the approach breaks. PPO reaches 15.1 reward/life. The bottleneck collapses to 0.3 — same as a random agent. Zero-shot transfer also lands at the random floor.
The failure mode is diagnostic: Breakout requires continuous ball tracking. The correct action depends on ball velocity and trajectory, which K-means can't capture from static frame features. The bottleneck ends up learning the marginal action distribution. That confirms the Go results reflect something real about the domain — discrete board positions genuinely call for categorically different strategies. Breakout doesn't have that structure.
The Surprising Part: Alignment Quality Predicts Nothing
I compared five alignment methods on PPO → DQN (5 seeds each):
| Method | Win Rate |
|---|---|
| Greedy NN (degenerate†) | 97.2% ± 1.6% |
| Hungarian (PRISM) | 68.8% ± 4.7% |
| Procrustes | 51.6% ± 6.4% |
| Random permutation | 48.4% ± 33.7% |
| Identity (no alignment) | 9.2% ± 3.0% |
†Greedy maps all 64 source concepts to a single target concept. The policy outputs one fixed action for every board state — high apparent win rate, not actual transfer.
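The greedy failure mode in the table is easy to detect mechanically: count how many distinct target concepts the mapping actually covers. A sanity check along these lines (illustrative, not the repo's code; the 0.5 threshold is my choice):

```python
import numpy as np

def alignment_is_degenerate(sim, threshold=0.5):
    """Greedy nearest-neighbor: each source concept independently picks its
    most similar target concept. Nothing forces the picks to be distinct."""
    greedy = sim.argmax(axis=1)                  # source i -> nearest target
    coverage = len(np.unique(greedy)) / sim.shape[1]
    return coverage < threshold, coverage

# A similarity matrix where one target column dominates collapses the mapping.
rng = np.random.default_rng(0)
sim = rng.normal(size=(64, 64))
sim[:, 3] += 100.0                               # target concept 3 dominates
degenerate, cov = alignment_is_degenerate(sim)   # True, coverage 1/64
```

Hungarian sidesteps this by construction: a 1:1 assignment always has coverage 1.0.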
Alignment method matters: Hungarian beats random, and both beat identity. But alignment quality (normalized centroid similarity after matching) predicts nothing across transfer pairs; R² ≈ 0. Two concept spaces can be geometrically distant and still admit a consistent one-to-one mapping. The alignment method's job is to find that pairing; geometric proximity is irrelevant to whether it exists.
Random permutation's 33.7% standard deviation tells its own story. Individual seeds range from 0% to 96%. It's a lottery, not a method.
The Causal Check
I wanted to confirm that concepts actually drive behavior rather than just correlate with it. The check: override each state's assigned concept with 5 alternative concepts and measure how often the chosen action changes. Answer: 69.4% of the time (p = 8.6 × 10⁻⁸⁶, 2500 interventions). The concepts are causally real.
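The intervention protocol fits in a few lines. A toy policy stands in for the trained bottleneck here; `intervention_flip_rate` is an illustrative name, and only `n_alternatives=5` comes from the setup above.

```python
import numpy as np
import torch
import torch.nn as nn

K, N_ACTIONS = 64, 50
torch.manual_seed(0)
policy = nn.Sequential(nn.Embedding(K, 32), nn.Linear(32, N_ACTIONS))

def intervention_flip_rate(policy, concept_ids, n_alternatives=5, seed=0):
    """Swap each state's concept for random alternatives; count action changes."""
    rng = np.random.default_rng(seed)
    flips = total = 0
    with torch.no_grad():
        for cid in concept_ids:
            base = policy(torch.tensor([cid])).argmax().item()
            alts = rng.choice([c for c in range(K) if c != cid],
                              size=n_alternatives, replace=False)
            for alt in alts:
                alt_action = policy(torch.tensor([int(alt)])).argmax().item()
                flips += (alt_action != base)
                total += 1
    return flips / total

rate = intervention_flip_rate(policy, concept_ids=list(range(K)))
```

If concepts were mere correlates, the flip rate would sit near zero; the paper's 69.4% is what "the concept ID is doing the work" looks like under this test.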
The ablation results are the most interesting part. C16 (assigned in 15.4% of states) collapses win rate from 100% to 51.8% when removed. C47, the most frequently used concept at 33% of states, causes only a 9.4% drop. Frequency and strategic importance are not aligned.
Stability
Cross-seed ARI is 0.214, NMI is 0.587. ARI is harsh — it penalizes any pairwise disagreement including boundary cases. NMI measures shared structure between partitions. The 0.587 NMI means points that cluster together in one initialization tend to cluster together in others. The concept structure is stable even though the integer labels shift. Perturbation robustness (Gaussian noise σ=0.1) is 0.999 within a fixed run.
Fix your K-means seed and the results are reproducible. The specific integer assigned to a concept is an initialization artifact — its content is not.
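The stability numbers are standard clustering metrics. Checking them across two K-means initializations looks roughly like this; the features here are synthetic well-separated blobs, so the exact scores will differ from the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
# Blobby synthetic features: stand-in for frozen-encoder outputs.
features = np.concatenate([rng.normal(loc=i, scale=0.5, size=(200, 16))
                           for i in range(8)])

# Same data, two different K-means initializations.
labels_a = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
labels_b = KMeans(n_clusters=8, n_init=10, random_state=1).fit_predict(features)

# ARI penalizes any pairwise disagreement; NMI measures shared partition
# structure regardless of how the integer labels happen to be permuted.
ari = adjusted_rand_score(labels_a, labels_b)
nmi = normalized_mutual_info_score(labels_a, labels_b)
```

Both metrics are permutation-invariant, which is exactly why "the integer shifts but the content doesn't" shows up as high NMI despite relabeling.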
K Sensitivity
I swept K ∈ {8, 16, 32, 64, 128} at reduced training to see how concept count affects both direct and transfer performance. Transfer peaks at K=32 (76%). Above that, individual cluster centroids become less reproducible across runs and alignment degrades. Direct performance is highest at K=8 and K=64.
The tradeoff: fewer concepts generalize better for transfer, more concepts capture finer-grained structure for direct play. K=64 is the paper's operating point based on direct performance at full training. If you're targeting transfer specifically, K=32 looks better.
Tech Stack
| Tool | Role |
|---|---|
| PyTorch | Neural networks, training loops |
| Stable-Baselines3 | PPO and DQN implementations |
| scikit-learn | K-means clustering |
| scipy | Hungarian algorithm, Procrustes analysis |
| GnuGo | Go engine for curriculum training and evaluation |
The Paper
The full writeup is 13 pages and covers the three-stage pipeline, causal intervention protocol, ablation study, alignment method comparison, fine-tuning curves, K sensitivity sweep, and the Atari Breakout boundary condition.
The most useful takeaway: spend your compute budget on training the source agent, not on your alignment method. Source quality drives transfer success. The alignment algorithm barely matters as long as it isn't random or identity.
Code is on GitHub. The interesting files are src/concept_aligner.py, src/concept_manager.py, and src/concept_policy.py.