# PRISM

**Policy Reuse via Interpretable Strategy Mapping**
A reinforcement learning framework for zero-shot strategy transfer between agents. Train a PPO agent, transfer its strategy to a DQN agent — no retraining required.
## The Problem
RL agents spend millions of steps learning strategy, but that knowledge is locked inside continuous neural network weights. You can't read it, extract it, or move it to a different agent.
Training a second agent on a related task means starting from scratch — no borrowing, no shortcuts. PRISM fixes that by forcing agents to reason through a shared layer of discrete, interpretable concepts.
## How It Works
1. **Baseline training.** Train an RL agent (PPO or DQN) with a CNN or MLP encoder that outputs 128-D feature vectors.
2. **Concept discovery.** Freeze the encoder, then run K-means on observations from 500 episodes. Each state maps to one of K discrete concept IDs.
3. **Bottleneck policy.** Train a micro-policy that takes only concept IDs as input: no raw observations, just integers.
## Zero-Shot Transfer
To transfer between agents, PRISM aligns their concept spaces. It computes cosine similarity between concept centroids and runs the Hungarian algorithm to find the optimal 1:1 matching. Once aligned, a source policy's embeddings are remapped to the target agent's concept space — and run immediately, zero-shot.
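The alignment described above can be sketched with SciPy's Hungarian solver; `align_concepts` is a hypothetical helper written for this sketch, not PRISM's API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_concepts(src_centroids: np.ndarray,
                   tgt_centroids: np.ndarray) -> np.ndarray:
    """Return mapping[i] = target concept matched to source concept i,
    chosen as the 1:1 assignment maximizing total cosine similarity."""
    src = src_centroids / np.linalg.norm(src_centroids, axis=1, keepdims=True)
    tgt = tgt_centroids / np.linalg.norm(tgt_centroids, axis=1, keepdims=True)
    sim = src @ tgt.T                          # (K, K) cosine similarities
    rows, cols = linear_sum_assignment(-sim)   # Hungarian algorithm (maximize)
    mapping = np.empty(len(src_centroids), dtype=int)
    mapping[rows] = cols
    return mapping

# Sanity check: the target's centroids are a noisy permutation of the
# source's, and the alignment should recover that permutation.
rng = np.random.default_rng(0)
src = rng.normal(size=(64, 128))
perm = rng.permutation(64)
tgt = src[perm] + 0.01 * rng.normal(size=(64, 128))
mapping = align_concepts(src, tgt)
```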
| Transfer Pair | Win Rate (%) | Notes |
|---|---|---|
| PPO → DQN | 92.0 ± 2.6 | Best same-task result |
| PPO → DAgger | 90.6 ± 2.2 | Strong transfer |
| DQN → PPO | 64.4 ± 6.1 | DQN concepts less universal |
| DAgger → any | ~54 | Behavior-cloned (BC) sources don't transfer well |
## Tech Stack

### Environments
- Go 7x7 — CNN encoder, K=64 concepts, action masking
- Go 5x5 — Curriculum source for 7x7 transfer
- CartPole — MLP encoder, K=32 concepts
- LunarLander — Cross-domain transfer target
- Acrobot — Cross-domain control task
## Key Findings
### Encoder quality beats alignment method

The source agent's encoder determines whether transfer works. The alignment method (Hungarian, Procrustes, greedy) makes almost no difference: all achieve a ~92% win rate. Random alignment drops to 73%.
### Transitive transfer
Routing through a high-quality intermediary (A to B to C) sometimes beats direct transfer (A to C) by 11 to 13 percentage points.
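Once pairwise alignments exist as lookup tables, routing A → B → C reduces to composing the two mappings. A tiny sketch (the `compose` helper and the 3-concept tables are made up for illustration):

```python
import numpy as np

def compose(map_ab: np.ndarray, map_bc: np.ndarray) -> np.ndarray:
    """Chain two concept alignments into one A→C lookup table."""
    return map_bc[map_ab]

map_ab = np.array([2, 0, 1])   # hypothetical A→B alignment over 3 concepts
map_bc = np.array([1, 2, 0])   # hypothetical B→C alignment
map_ac = compose(map_ab, map_bc)
```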
### Concepts are causally real

Overriding a state's concept ID with a different one changes the policy's chosen action 75.6% of the time (p ≈ 10^-151). The concepts actually drive behavior.
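An intervention test of this kind can be sketched as follows; the lookup-table policy and the `intervention_effect` helper are hypothetical stand-ins for the actual bottleneck policy.

```python
import numpy as np

def intervention_effect(policy, concept_ids, n_concepts, seed=0):
    """Fraction of states where overriding the concept ID with a
    random *different* ID changes the policy's greedy action."""
    rng = np.random.default_rng(seed)
    changed = 0
    for c in concept_ids:
        c_new = rng.integers(n_concepts - 1)
        c_new += c_new >= c            # uniform over concepts != c
        changed += policy(c_new) != policy(c)
    return changed / len(concept_ids)

# Hypothetical lookup-table policy: 8 concepts, 4 actions.
table = np.array([0, 1, 2, 3, 0, 1, 2, 3])
rate = intervention_effect(lambda c: table[c], np.arange(8), n_concepts=8)
```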
### Stable across seeds
Running K-means twice with different random seeds produces nearly identical clusters (ARI > 0.87). The concepts are genuine structure in the representation.
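A stability check like this can be reproduced with scikit-learn's adjusted Rand index; the well-separated synthetic clusters below stand in for real encoder features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic 16-D features with 8 clear clusters (50 points each).
centers = rng.normal(scale=5.0, size=(8, 16))
feats = np.vstack([c + rng.normal(size=(50, 16)) for c in centers])

# Same data, two different K-means seeds.
ids_a = KMeans(n_clusters=8, n_init=10, random_state=1).fit_predict(feats)
ids_b = KMeans(n_clusters=8, n_init=10, random_state=2).fit_predict(feats)
ari = adjusted_rand_score(ids_a, ids_b)  # 1.0 = identical partitions
```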
## Read the Paper
Full treatment with proofs, ablations, five alignment methods compared, and results across all three domains — 23 pages.