โ† Back to blog

Building PRISM: Zero-Shot Strategy Transfer for RL Agents

2/24/2026 · 5 min read · Python · PyTorch · Reinforcement Learning · Research

The Problem

Reinforcement learning agents are impressive. Train one on Go for long enough and it figures out strategy the hard way: through millions of games against itself.

But once it's trained, all that knowledge is stuck inside neural network weights. You can't read it. You can't extract it. If you want a second agent to play the same game with a different algorithm, it starts completely from scratch. No borrowing, no shortcuts.

That bothered me. So I built PRISM.

What PRISM Does

PRISM stands for Policy Reuse via Interpretable Strategy Mapping. The core idea: force RL agents to compress what they've learned into a small set of discrete concepts, then use those concepts as a common transfer interface.

Instead of strategies being buried in continuous weight space, PRISM makes them explicit. A PPO agent and a DQN agent, trained independently, can share strategy through 64 shared concept IDs.

How It Works

Training happens in three stages.

Stage 1: Baseline training. Train a standard RL agent (PPO or DQN). The agent has an encoder that maps observations to a 128-dimensional vector, and a policy head that maps that vector to actions.

Stage 2: Concept discovery. Freeze the encoder. Run it on 500 episodes of observations and cluster the resulting 128D vectors with K-means into K discrete concepts (K=64 for Go, K=32 for control tasks). Each observation now maps to a single integer: its concept ID.
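
Stage 2 can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `encode` stands in for the frozen encoder, and the episode data is random.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def encode(observations):
    # Placeholder for the frozen 128-D encoder.
    return rng.normal(size=(len(observations), 128))

observations = [None] * 5000          # stand-in for observations from 500 episodes
features = encode(observations)       # (N, 128)

K = 64                                # 64 concepts for Go, 32 for control tasks
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)

concept_ids = kmeans.labels_          # one integer concept ID per observation
centroids = kmeans.cluster_centers_   # (K, 128), reused later for alignment
```

The centroids are the important byproduct: they are what gets compared across agents during transfer.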

Stage 3: Bottleneck policy. Train a new micro-policy that takes only concept IDs as input. Not raw observations. Not continuous features. Just integers. It has to learn to play the game knowing only "I'm in concept 12 right now."
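
The bottleneck is easiest to see as a lookup table. The sketch below uses plain NumPy rather than the project's PyTorch stack, and the sizes and names are illustrative: each concept ID indexes into a small embedding table, and that embedding is all the policy ever sees.

```python
import numpy as np

rng = np.random.default_rng(1)

K, EMB_DIM, N_ACTIONS = 64, 16, 5
concept_embeddings = rng.normal(size=(K, EMB_DIM))    # learned during Stage 3
policy_weights = rng.normal(size=(EMB_DIM, N_ACTIONS))

def bottleneck_policy(concept_id):
    """Map a single integer concept ID to action probabilities."""
    z = concept_embeddings[concept_id]   # embedding lookup: the only input
    logits = z @ policy_weights
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = bottleneck_policy(12)   # "I'm in concept 12 right now"
```

Because the policy depends on nothing but the embedding table, swapping in another agent's embeddings (after alignment) is all transfer requires.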

Transfer works by aligning concepts across agents. You compute cosine similarity between concept centroids from two different agents, then run the Hungarian algorithm to find the optimal 1:1 matching. Once aligned, you take the source policy's embeddings, remap them to the target agent's concept space, and run, zero-shot.
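
The alignment step maps cleanly onto `scipy`. In this sketch the centroids are random stand-ins for the real K-means centroids; the matching logic is the same.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
src = rng.normal(size=(64, 128))   # source agent's concept centroids
tgt = rng.normal(size=(64, 128))   # target agent's concept centroids

# Cosine similarity between every source/target centroid pair.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
sim = src_n @ tgt_n.T              # (64, 64)

# Hungarian algorithm finds the 1:1 matching with maximum total similarity
# (negated because linear_sum_assignment minimizes cost).
row, col = linear_sum_assignment(-sim)
mapping = dict(zip(col, row))      # target concept ID -> source concept ID

# Zero-shot transfer: when the target encoder emits concept t, the target
# policy uses the source policy's embedding for mapping[t].
```
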

The Results

Same-task transfer on Go 7x7: PPO to DQN achieves 92.0% win rate with no retraining. This was the number I was most excited about. Two agents, trained with completely different algorithms, never sharing weights, and yet one can pick up and run the other's policy.

Curriculum transfer: Taking concepts from a Go 5x5 agent and using them to warm-start a 7x7 agent gave a 1.35x speedup to reach 95% win rate (26 training generations vs 35 from scratch, p=0.027). The encoder sizes are incompatible between 5x5 and 7x7, but both operate in the same 128D space, so the concept transfer still works.

Transitive transfer: A to B to C sometimes beats A to C directly. Routing through a PPO intermediary added 11 to 13 percentage points over direct transfer in some configurations.

Causal intervention: To check that concepts actually drive behavior and aren't just correlated with it, I overrode the policy's concept IDs with different ones and measured how often the chosen action changed. Answer: 75.6% of the time (p ~ 10^-151). The concepts are real.
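
The intervention test itself is simple to sketch. Here the policy is reduced to a random concept-to-action lookup table purely for illustration; the measurement logic is the interesting part.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N_ACTIONS, N_SAMPLES = 64, 5, 1000

# Stand-in policy: greedy action per concept ID (random for this sketch).
action_table = rng.integers(N_ACTIONS, size=K)

original = rng.integers(K, size=N_SAMPLES)
# Override each concept with a guaranteed-different one.
override = (original + rng.integers(1, K, size=N_SAMPLES)) % K

# Fraction of interventions that flip the chosen action.
flip_rate = np.mean(action_table[original] != action_table[override])
# PRISM's real experiment measured 75.6%; for a random 5-action table the
# rate hovers around 1 - 1/5 = 0.8.
```

A flip rate near zero would mean the policy ignores the concept input; a high rate means the concept causally determines the action.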

What Didn't Work

Behavioral cloning sources don't transfer. DAgger-trained agents top out at about 54% win rate as sources, versus 76 to 92% for RL-trained agents. The issue is exploration: RL agents see a wide variety of states during training, which makes their concepts general. BC agents overfit to expert demonstrations and produce specialist features that don't generalize.

Cross-domain zero-shot transfer mostly fails. CartPole and LunarLander have completely different action semantics. "Push left" means different things in each environment. Fine-tuning helps a lot (CartPole to LunarLander picks up +21% with a short fine-tune), but dropping a zero-shot policy into a different domain rarely works.

Strategy composition experiments also didn't pan out. I tried averaging concept embeddings from specialist agents hoping to combine their strategies. The bottleneck constraint was too tight; averaging just destroyed both strategies.

The Surprising Part

The alignment method barely matters.

Hungarian algorithm, Procrustes analysis, greedy nearest-neighbor: all of them achieve roughly 92% win rate, within 1 to 2 percentage points of each other. Random alignment drops to 73%. So the structure is real and any reasonable method can find it, but they're all about equally good at finding it.

What actually matters is the source agent's encoder quality. The correlation between alignment quality metrics and final transfer performance is nearly zero (R^2 = 0.004). You could spend a lot of time optimizing your alignment method and it would barely move the needle. The source encoder is what determines whether transfer will work.

Concept stability was also higher than expected. Running K-means twice on the same encoder with different random seeds produces nearly identical clusters (Adjusted Rand Index > 0.87). The concepts are genuine structure in the learned representation, not just artifacts of wherever K-means happened to land.
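The stability check is a two-line experiment. This sketch substitutes well-separated synthetic blobs for the real encoder features, so a high ARI is expected by construction; on real features, a high score is evidence the clusters reflect genuine structure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for encoder features: 32 well-separated clusters in 128-D.
features, _ = make_blobs(n_samples=2000, centers=32, n_features=128,
                         cluster_std=1.0, random_state=0)

# Cluster the same features twice with different seeds.
labels_a = KMeans(n_clusters=32, n_init=10, random_state=0).fit_predict(features)
labels_b = KMeans(n_clusters=32, n_init=10, random_state=1).fit_predict(features)

# ARI = 1.0 means identical partitions; near 0 means no more agreement
# than chance.
ari = adjusted_rand_score(labels_a, labels_b)
```
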

Tech Stack

| Tool | Role |
| --- | --- |
| PyTorch | Neural networks, training loops |
| Stable-Baselines3 | PPO and DQN implementations |
| scikit-learn | K-means clustering |
| scipy | Hungarian algorithm, Procrustes analysis |
| PettingZoo | Multi-agent Go environment |
| Gymnasium | CartPole, LunarLander, Acrobot |

What's Next

The obvious next step is testing on harder environments. Go 7x7 is already a real challenge, but it's still a small board. The interesting question is whether the same concept structure appears in larger games, and whether it's still transferable at that scale.

Cross-domain transfer working well zero-shot would also be interesting. Right now it requires fine-tuning because action semantics don't carry over. If you could build a concept space that's truly action-agnostic, describing game state rather than intended action, that might change things.

The paper is linked on the project page if you want the full treatment: proofs, ablations, five alignment methods compared. It's 23 pages.

If you're doing anything in RL interpretability or transfer learning, feel free to dig into the code. Most of the interesting stuff lives in src/concept_aligner.py and src/concept_manager.py.