SPHERE · ICML 2026 · Accepted Paper
Mixture-of-Experts · Continual Deep Reinforcement Learning · Spectral Plasticity

SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

Continual RL trains one policy across a sequence of tasks, but MoE policies can become unable to adapt to later tasks. SPHERE keeps expert representations diverse so updates stay useful.
Peking University · BIGAI

TL;DR

SPHERE studies why MoE policies lose their ability to adapt in continual RL. It shows that learning updates collapse into too few directions, then keeps expert features diverse so later tasks remain learnable.

Phenomenon → diagnosis → regularization
Problem A policy trained across many tasks can stop adapting to later tasks.
Method A regularizer that keeps expert features diverse instead of collapsed.
Effect Better continual-control performance and healthier feature geometry across tasks.

Mechanism: Why Policies Stop Learning in Continual RL

The mechanism story starts with the observed learning slowdown, connects it to collapsed update directions, and then shows how SPHERE keeps those directions more diverse.

Phenomenon → spectral collapse → SPHERE
Phenomenon In continual RL, policies keep receiving new experience but can stop changing effectively.
Diagnosis The update geometry loses rank: learning concentrates into too few functional directions (made precise below).
SPHERE SPHERE keeps the weighted expert-feature Gram more isotropic, which helps preserve diverse update directions.
Teaser showing continual-RL plasticity loss, spectral collapse, and SPHERE regularization
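
Here "rank" refers to the effective rank of the eNTK spectrum. A standard definition, which we assume is the notion used throughout, is the exponentiated spectral entropy:

$$\operatorname{erank}(\mathbf{K}) = \exp\!\Big(-\sum_i p_i \log p_i\Big), \qquad p_i = \frac{\lambda_i(\mathbf{K})}{\sum_j \lambda_j(\mathbf{K})}.$$

It equals the ambient dimension when the spectrum is flat (isotropic) and tends to 1 as the spectrum collapses onto a single direction.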

Update Geometry: Collapse vs. SPHERE


The visualization makes the diagnosis concrete. A unit sphere of possible gradient directions becomes an ellipsoid after multiplication by the eNTK matrix $\mathbf{K}$. When $\mathbf{K}$ becomes low-rank, one axis shrinks toward zero and the ellipsoid degenerates toward a near-plane or line; SPHERE keeps the spectrum more isotropic.
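
A minimal numpy sketch of this mapping, with made-up eigenvalues standing in for the real spectra (the numbers are illustrative, not the paper's Fig. 1 data):

import numpy as np

rng = np.random.default_rng(0)

# Unit sphere of gradient directions (the top row of the visualization).
dirs = rng.normal(size=(2000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Hypothetical eNTK spectra in the visualized 3D eigen-subspace.
K_collapsed = np.diag([1.0, 0.3, 0.01])  # one axis nearly dead
K_isotropic = np.diag([1.0, 0.8, 0.6])   # well conditioned, SPHERE-like

for name, K in [("collapsed", K_collapsed), ("isotropic", K_isotropic)]:
    mapped = dirs @ K.T  # the bottom row: K applied to the directions
    # The image's extent along each eigen-axis shows the flattening.
    print(name, np.abs(mapped).max(axis=0).round(3))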

Top row ($\nabla_f L$) shows the input sphere of gradient directions; bottom row ($\mathbf{K}\nabla_f L$) shows those directions after stretching by $\mathbf{K}$.


Interactive 3D visualization comparing baseline Top-K MoE and SPHERE update geometry across HumanoidBench tasks.

Baseline (Top‑K MoE): collapsed spectrum → near low-rank
SPHERE: isotropic spectrum → diverse updates

The animation is grounded in the paper's Fig. 1 exports: the top-3 eigenvalues of $\mathbf{K}$ shape each task's ellipsoid, and sampled gradient directions are projected into the same 3D eigen-subspace.

Experiments & Analysis

The story follows the paper: first the failure mode, then the spectral diagnosis, then performance on MetaWorld and HumanoidBench, followed by the design ablation and feature proxy.

Phenomenon · Diagnosis · Performance · Design · Qualitative · Feature Proxy
Phenomenon
Continual RL Degrades Across Architectures
HumanoidBench success: RL vs CRL
Before introducing SPHERE, the same architectures reach lower HumanoidBench success when trained continually (CRL) than when trained task by task (RL).
Diagnosis
Spectral Plasticity Collapses During CRL
Effective rank during training
The performance drop comes with lower eNTK effective rank: baseline updates collapse, while SPHERE keeps the update geometry better conditioned.
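
For intuition, here is a minimal PyTorch sketch of how an empirical NTK Gram and its effective rank can be estimated on a batch. The tiny model, the scalar-output assumption, and the random batch are illustrative stand-ins, not the paper's measurement protocol.

import torch

def empirical_ntk(model: torch.nn.Module, xs: torch.Tensor) -> torch.Tensor:
    # K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>, scalar output assumed.
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)   # (batch, n_params) per-sample gradients
    return J @ J.T          # (batch, batch) eNTK Gram

def effective_rank(K: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    lam = torch.linalg.eigvalsh(K).clamp_min(0)
    p = lam / lam.sum().clamp_min(eps)
    return torch.exp(-(p * (p + eps).log()).sum())

# Toy usage: the rank of the update geometry for a small MLP on random states.
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
print(effective_rank(empirical_ntk(model, torch.randn(16, 8))).item())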
Performance
SPHERE Improves Continual Training Performance
MetaWorld
MetaWorld success rate across methods under RL and CRL
On MetaWorld, SPHERE narrows the gap between task-by-task training and continual training, giving the strongest average success among the compared methods.
HumanoidBench
HumanoidBench success rate across methods under RL and CRL
On HumanoidBench, the same pattern holds: SPHERE improves average success over the unregularized MoE and continual-learning baselines.
Design
What Matters in SPHERE?
HumanoidBench CRL design ablation
Variant                           Average success
w/o SPHERE                        0.36 ± 0.08
w/ SPHERE                         0.54 ± 0.12
All hidden expert layers          0.42 ± 0.07
Per-expert loss sum               0.40 ± 0.08
Gradient-factor regularization    0.43 ± 0.09
The ablation asks which SPHERE design choices matter. The default last-layer, routing-weighted expert-feature Gram is the strongest tested design; alternatives help, but none matches the full setup.
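
For concreteness, here is a minimal sketch of one plausible form of that default design: a penalty that pushes the routing-weighted, last-layer expert-feature Gram toward isotropy. The function name, the shapes, and the Frobenius-gap penalty are assumptions for illustration; the paper's exact loss may differ.

import torch

def sphere_penalty(phi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # phi: (batch, n_experts, dim) last-layer features of each expert
    # w:   (batch, n_experts)      routing weights (e.g. softmax gates)
    f = (w.unsqueeze(-1) * phi).reshape(-1, phi.shape[-1])  # weighted features
    G = f.T @ f / f.shape[0]                                # (dim, dim) Gram
    G = G / G.diagonal().sum().clamp_min(1e-12)             # trace-normalize
    iso = torch.eye(G.shape[0], device=G.device) / G.shape[0]
    return ((G - iso) ** 2).sum()                           # gap to isotropy

# Toy usage: added to the RL loss with a small coefficient (assumed).
phi = torch.randn(64, 4, 16, requires_grad=True)
w = torch.softmax(torch.randn(64, 4), dim=-1)
sphere_penalty(phi, w).backward()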
Qualitative
Feature Collapse Across Tasks
Qualitative analysis: collapse vs SPHERE
Using the same held-out Stair states after each task, we visualize expert features. Without SPHERE, points quickly concentrate along one dominant direction; with SPHERE, multiple directions remain active across the sequence.
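
A minimal sketch of this kind of check, assuming per-task feature matrices saved on the fixed held-out states; projecting onto the top-2 principal directions is our assumed visualization, and the paper's projection may differ:

import torch

def top2_projection(feats: torch.Tensor) -> torch.Tensor:
    # feats: (n_states, dim) expert features on the fixed held-out states
    centered = feats - feats.mean(dim=0, keepdim=True)
    _, _, Vt = torch.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T  # (n_states, 2) coordinates to scatter-plot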
Feature Proxy
Weighted Expert Features Track the eNTK Rank
Scatter: effective rank of weighted expert feature Gram vs effective rank of eNTK
The weighted expert-feature Gram follows the same trend as the eNTK effective rank, supporting it as a practical proxy for spectral plasticity.
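
In practice, this supports a cheap monitor under the same assumed shapes as the sketches above: track the effective rank of a $d \times d$ feature Gram from a single batch, with no eNTK computation.

import torch

def gram_erank(phi: torch.Tensor, w: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # phi: (batch, n_experts, dim) expert features; w: (batch, n_experts) gates
    f = (w.unsqueeze(-1) * phi).reshape(-1, phi.shape[-1])
    lam = torch.linalg.eigvalsh(f.T @ f / f.shape[0]).clamp_min(0)
    p = lam / lam.sum().clamp_min(eps)
    return torch.exp(-(p * (p + eps).log()).sum())  # tracks the eNTK erank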

BibTeX

@inproceedings{luo2026sphere,
  title     = {SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning},
  author    = {Luo, Lirui and Zhang, Guoxi and Xu, Hongming and Fang, Cong and Li, Qing},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}