Conformalized proximal policy optimization: statistical uncertainty quantification for principled exploration in reinforcement learning
1 School of Computer Science and Data Engineering, NingboTech University, Ningbo 315100, China
2 SoundKing Electro-Acoustic Co., Ltd., Ningbo 315140, China
  • Citation
    Zhang B, Chen G, Feng W. Conformalized proximal policy optimization: statistical uncertainty quantification for principled exploration in reinforcement learning. Robot Learn. 2026(2):0011, https://doi.org/10.55092/rl20260011. 
  • DOI
    10.55092/rl20260011
  • Copyright
    Copyright © 2026 by the authors. Published by ELSP.
Abstract

Reinforcement learning (RL) excels in diverse domains, yet its deployment in safety-critical applications is hindered by the lack of principled uncertainty quantification. Existing methods either lack formal statistical guarantees or incur significant computational costs. We propose Conformalized Proximal Policy Optimization (CP-PPO), which integrates conformal prediction into the PPO framework to provide finite-sample, distribution-free prediction intervals. CP-PPO employs sliding-window conformal calibration to maintain approximate statistical validity despite non-stationary policy optimization, and leverages value-function uncertainty to drive adaptive entropy regularization for principled exploration. We evaluate CP-PPO on five diverse Gymnasium environments spanning discrete and continuous control. CP-PPO achieves a 63% performance improvement on Pendulum-v1 while maintaining empirical coverage that precisely matches the theoretical target (90.0% for α = 0.1) across all environments. Comprehensive ablation studies demonstrate that both conformal calibration and adaptive entropy contribute to the performance gains, and that coverage guarantees hold reliably across a wide range of hyperparameters (α ∈ [0.01, 0.3], N_cal ∈ [100, 2000]). CP-PPO establishes conformal prediction as a promising framework for uncertainty quantification in online RL, offering empirical validation of finite-sample coverage properties in non-stationary sequential decision-making.
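To make the sliding-window conformal calibration concrete, here is a minimal, hedged sketch of split conformal prediction over a rolling window of value-prediction residuals. The class name, window handling, and choice of absolute residual as the nonconformity score are illustrative assumptions, not the paper's exact procedure:

```python
from collections import deque
import math

class SlidingWindowConformal:
    """Illustrative sliding-window split conformal calibrator (an
    assumption of this sketch, not CP-PPO's exact algorithm).

    Keeps the most recent n_cal nonconformity scores |y - y_hat| and
    returns a symmetric prediction interval targeting ~(1 - alpha)
    coverage; the window bounds memory under non-stationarity."""

    def __init__(self, n_cal=500, alpha=0.1):
        self.scores = deque(maxlen=n_cal)  # rolling calibration window
        self.alpha = alpha

    def update(self, y_true, y_pred):
        # Absolute residual as the nonconformity score (one common choice).
        self.scores.append(abs(y_true - y_pred))

    def interval(self, y_pred):
        n = len(self.scores)
        if n == 0:
            return (-math.inf, math.inf)  # no calibration data yet
        # Conformal quantile rank ceil((n + 1) * (1 - alpha)), clipped to n.
        k = min(n, math.ceil((n + 1) * (1 - self.alpha)))
        q = sorted(self.scores)[k - 1]
        return (y_pred - q, y_pred + q)
```

Under an exchangeability assumption within the window, the rank `ceil((n + 1)(1 - alpha))` yields finite-sample marginal coverage of at least 1 − α; the sliding window trades exact exchangeability for adaptivity to policy drift, which is why the abstract's validity claim is "approximate".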

Keywords

reinforcement learning; conformal prediction; uncertainty quantification; safe exploration; statistical learning theory
