PRISM

Policy Reuse via Interpretable Strategy Mapping

A reinforcement learning framework for zero-shot strategy transfer between agents. Train a PPO agent, transfer its strategy to a DQN agent — no retraining required.

[Figure: PRISM architecture diagram showing zero-shot strategy transfer between PPO and DQN agents]
  • 76.4% — Best Transfer (BC → DQN, zero-shot)
  • Fine-Tune Speedup (steps to 60% win rate)
  • 69.4% — Causal Intervention (concepts drive behavior)
  • 0.587 — Concept Stability (NMI across seeds)

The Problem

RL agents spend millions of steps learning strategy, but that knowledge is locked inside continuous neural network weights. You can't read it, extract it, or move it to a different agent.

Training a second agent on the same task means starting from scratch. PRISM fixes that by forcing agents to reason through a shared layer of discrete, interpretable concepts — and using those concepts as a transfer interface.

How It Works

  1. Baseline training. Train an RL agent (PPO, DQN, or DAgger) with a CNN encoder that outputs 128-D feature vectors.
  2. Concept discovery. Freeze the encoder and run K-means on gameplay episodes. Each board position maps to one of 64 discrete concept IDs.
  3. Bottleneck policy. Train a micro-policy on concept IDs only: no raw observations, just integers. Transfer by aligning concept spaces via Hungarian matching.
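Step 2 above can be sketched with scikit-learn's K-means. The encoder, episode data, and array shapes here are illustrative assumptions, not PRISM's actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_concepts(features: np.ndarray, k: int = 64, seed: int = 0) -> KMeans:
    """Cluster frozen-encoder features into k discrete concept IDs.

    features: (n_states, 128) array of encoder outputs collected from
    gameplay episodes. The returned model's predict() maps any 128-D
    feature vector to a concept ID in [0, k).
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(features)
    return km

# Hypothetical stand-in for encoded board positions (random features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(2_000, 128)).astype(np.float32)
km = discover_concepts(feats, k=64)
concept_ids = km.predict(feats)  # (2_000,) integer concept IDs
```

The micro-policy in step 3 then consumes `concept_ids` instead of raw observations, which is what makes the concept space a portable interface.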

Zero-Shot Transfer Results

Six source→target pairs among three independently trained agents, evaluated against GnuGo on Go 7×7, 10 seeds × 100 games. Random agent baseline: 3.5%. No-alignment baseline: 9.2%.

Source | Target | Win Rate | Notes
--- | --- | --- | ---
BC | DQN | 76.4% ± 3.4% | Expert-trained source transfers well
PPO | DQN | 69.5% ± 3.2% | Strong RL source, functional target
DQN | PPO | 49.8% ± 2.6% | Undertrained source, not above 50%
DAgger | PPO | 41.5% ± 6.0% | Below 50%: imitation source fails
DQN | BC | 38.7% ± 4.9% | Degenerate target encoder
PPO | BC | 0.0% | BC encoder collapses: agent always passes
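The concept-space alignment behind these transfer pairs can be sketched with SciPy's Hungarian solver. The centroid shapes and the L2 cost are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def align_concepts(src_centroids: np.ndarray, tgt_centroids: np.ndarray) -> np.ndarray:
    """Match source concepts to target concepts by centroid similarity.

    Both inputs are (K, D) centroid arrays. Returns mapping[i] = the
    target concept that source concept i is relabeled to, minimizing
    total centroid distance (Hungarian matching).
    """
    cost = cdist(src_centroids, tgt_centroids)      # pairwise L2 distances
    src_idx, tgt_idx = linear_sum_assignment(cost)  # optimal one-to-one match
    mapping = np.empty(len(src_centroids), dtype=int)
    mapping[src_idx] = tgt_idx
    return mapping

# Toy check: target centroids are a shuffled copy of the source's,
# so the recovered mapping should invert the shuffle exactly.
rng = np.random.default_rng(0)
src = rng.normal(size=(64, 128))
perm = rng.permutation(64)
tgt = src[perm]                      # tgt[j] == src[perm[j]]
mapping = align_concepts(src, tgt)
```

With the mapping in hand, the source micro-policy can be run on the target agent by relabeling the target's concept IDs before lookup.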

Tech Stack

Python · PyTorch · Stable-Baselines3 · scikit-learn · SciPy · GnuGo

Scope

  • Go 7×7 — primary domain, CNN encoder, K=64 concepts, action masking
  • Atari Breakout — boundary condition: bottleneck collapses to random floor

PRISM requires domains where strategic state is naturally discrete. Continuous-dynamics environments like Breakout fall outside scope — confirmed empirically.

Key Findings

Source quality is what matters

Alignment quality (centroid similarity after matching) predicts nothing — R² ≈ 0 across all transfer pairs. Whether the source policy is strong is the operative variable. Spend compute on training the source, not on alignment method.

Concepts are causally real

Overriding a state's concept assignment changes the chosen action 69.4% of the time (p = 8.6 × 10⁻⁸⁶, 2500 interventions). The concepts drive behavior — they're not just correlated with it.
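The intervention test has a simple shape: swap a state's concept ID for a random different one and check whether the chosen action flips. The sketch below uses a toy stand-in policy (the real micro-policy is a trained network, not this modulo rule):

```python
import numpy as np

def intervention_flip_rate(policy, concept_ids, n_concepts, rng) -> float:
    """Fraction of states whose action changes under concept override.

    policy: maps a concept ID to an action (the bottleneck micro-policy).
    For each state, the concept is replaced with a random different one;
    a flip is counted when the chosen action changes.
    """
    flips = 0
    for c in concept_ids:
        c_new = rng.choice([x for x in range(n_concepts) if x != c])
        flips += policy(c) != policy(c_new)
    return flips / len(concept_ids)

# Toy policy: action depends only on concept ID (purely illustrative).
rng = np.random.default_rng(0)
toy_policy = lambda c: c % 4
ids = rng.integers(0, 64, size=500)
rate = intervention_flip_rate(toy_policy, ids, n_concepts=64, rng=rng)
```

A flip rate well above what action-concept correlation alone would produce is the evidence that concepts causally drive behavior.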

Frequency ≠ importance

C47 (most-used at 33% of states) causes only a 9.4% win-rate drop when ablated. C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Concept usage does not predict strategic importance.
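One way to run such an ablation (an assumption here, not necessarily PRISM's exact mechanism) is to remove a concept from the vocabulary by remapping its states to the nearest remaining centroid, then re-evaluate:

```python
import numpy as np

def ablate_concept(centroids: np.ndarray, concept_ids: np.ndarray, c: int):
    """Remap every occurrence of concept c to its nearest other centroid.

    Simulates deleting concept c: wherever c used to fire, the policy
    now sees the closest substitute concept instead.
    """
    dists = np.linalg.norm(centroids - centroids[c], axis=1)
    dists[c] = np.inf                    # exclude the ablated concept itself
    substitute = int(np.argmin(dists))
    out = concept_ids.copy()
    out[out == c] = substitute
    return out, substitute

# Hypothetical centroids and episode concept IDs.
rng = np.random.default_rng(0)
cents = rng.normal(size=(64, 128))
ids = rng.integers(0, 64, size=1000)
new_ids, sub = ablate_concept(cents, ids, c=16)
```

Measuring win rate before and after the remap, per concept, separates high-frequency concepts like C47 from strategically load-bearing ones like C16.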

K tradeoff

Transfer win rate peaks at K=32 (76% in the ablation sweep). Direct performance favors K=64 at full training. Lower K generalizes better for transfer; higher K captures finer-grained structure for direct play.

Read the Paper

13 pages covering the three-stage pipeline, causal intervention, concept ablation, alignment method comparison, fine-tuning curves, K sensitivity sweep, and the Atari Breakout boundary condition.