PRISM
Policy Reuse via Interpretable Strategy Mapping
A reinforcement learning framework for zero-shot strategy transfer between agents. Train a PPO agent, transfer its strategy to a DQN agent — no retraining required.
The Problem
RL agents spend millions of steps learning strategy, but that knowledge is locked inside continuous neural network weights. You can't read it, extract it, or move it to a different agent.
Training a second agent on the same task means starting from scratch. PRISM fixes that by forcing agents to reason through a shared layer of discrete, interpretable concepts — and using those concepts as a transfer interface.
How It Works
1. Baseline training. Train an RL agent (PPO, DQN, or DAgger) with a CNN encoder that outputs 128-dimensional feature vectors.
2. Concept discovery. Freeze the encoder and run K-means on its features across gameplay episodes. Each board position maps to one of 64 discrete concept IDs.
3. Bottleneck policy. Train a micro-policy on concept IDs only — no raw observations, just integers. Transfer by aligning the two agents' concept spaces via Hungarian matching.
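Step 3's transfer interface can be sketched on toy data: treat each agent's K-means centroids as its concept space, then find the minimum-cost one-to-one correspondence between them with Hungarian matching. Sizes here are toy (the paper uses K=64 concepts and 128-D features), and all variable names are illustrative, not from the PRISM codebase.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
K, D = 8, 16                                     # toy concept count and feature dim

src_centroids = rng.normal(size=(K, D))          # source agent's K-means centroids
true_perm = rng.permutation(K)                   # hidden concept correspondence
# Target centroids: same concepts, relabeled, plus small encoder noise.
tgt_centroids = src_centroids[true_perm] + 0.01 * rng.normal(size=(K, D))

# Cost matrix: squared Euclidean distance between every (source, target) pair.
cost = ((src_centroids[:, None, :] - tgt_centroids[None, :, :]) ** 2).sum(axis=-1)

# Hungarian matching: minimum-cost one-to-one assignment of concepts.
row_ind, col_ind = linear_sum_assignment(cost)

# Map each target concept ID to its matched source concept ID.
tgt_to_src = np.empty(K, dtype=int)
tgt_to_src[col_ind] = row_ind                    # recovers true_perm when noise is small
```

With the mapping in hand, the target agent relabels its concept IDs through `tgt_to_src` and reuses the source's bottleneck policy directly, which is what makes the transfer zero-shot.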
Zero-Shot Transfer Results
Six source→target pairs among three independently trained agents, evaluated against GnuGo on Go 7×7, 10 seeds × 100 games. Random agent baseline: 3.5%. No-alignment baseline: 9.2%.
| Source | Target | Win Rate | Notes |
|---|---|---|---|
| BC | DQN | 76.4% ± 3.4% | Expert-trained source transfers well |
| PPO | DQN | 69.5% ± 3.2% | Strong RL source, functional target |
| DQN | PPO | 49.8% ± 2.6% | Undertrained source, not above 50% |
| DAgger | PPO | 41.5% ± 6.0% | Below 50% — imitation source fails |
| DQN | BC | 38.7% ± 4.9% | Degenerate target encoder |
| PPO | BC | 0.0% | BC encoder collapses — always passes |
Scope
- Go 7×7 — primary domain, CNN encoder, K=64 concepts, action masking
- Atari Breakout — boundary condition: bottleneck collapses to random floor
PRISM requires domains where strategic state is naturally discrete. Continuous-dynamics environments like Breakout fall outside scope — confirmed empirically.
Key Findings
Source quality is what matters
Alignment quality (centroid similarity after matching) does not predict transfer success — R² ≈ 0 across all transfer pairs. The operative variable is whether the source policy is strong. Spend compute on training the source, not on the alignment method.
Concepts are causally real
Overriding a state's concept assignment changes the chosen action 69.4% of the time (p = 8.6 × 10⁻⁸⁶, 2500 interventions). The concepts drive behavior — they're not just correlated with it.
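The intervention protocol can be sketched with a stand-in for the bottleneck policy. Because the micro-policy sees only a discrete concept ID, a deterministic version reduces to a concept-to-action lookup table; the `policy` table and action count below are hypothetical placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 64          # concept vocabulary size, as in the paper
N_ACTIONS = 50  # illustrative: 7x7 board (49 moves) + pass

# Hypothetical stand-in for the trained bottleneck micro-policy:
# a deterministic map from concept ID to action.
policy = rng.integers(0, N_ACTIONS, size=K)

def intervene_once(rng):
    """Override a state's concept with a random different one; report whether
    the chosen action changed."""
    c = int(rng.integers(K))
    c_new = int(rng.integers(K))
    while c_new == c:                 # force a genuinely different concept
        c_new = int(rng.integers(K))
    return policy[c_new] != policy[c]

flips = [intervene_once(rng) for _ in range(2500)]  # paper ran 2500 interventions
flip_rate = sum(flips) / len(flips)
```

A flip rate well above what correlation alone would produce is the evidence that concepts causally drive the action choice rather than merely co-occurring with it.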
Frequency ≠ importance
C47 (most-used at 33% of states) causes only a 9.4% win-rate drop when ablated. C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Concept usage does not predict strategic importance.
K tradeoff
Transfer win rate peaks at K=32 (76% in the ablation sweep). Direct performance favors K=64 at full training. Lower K generalizes better for transfer; higher K captures finer-grained structure for direct play.
Read the Paper
13 pages covering the three-stage pipeline, causal intervention, concept ablation, alignment method comparison, fine-tuning curves, K sensitivity sweep, and the Atari Breakout boundary condition.