Reinforcement learning (RL) excels across diverse domains, yet its deployment in safety-critical applications is hindered by the lack of principled uncertainty quantification. Existing methods either lack formal statistical guarantees or incur substantial computational costs. We propose Conformalized Proximal Policy Optimization (CP-PPO), which integrates conformal prediction into the PPO framework to provide finite-sample, distribution-free prediction intervals. CP-PPO employs sliding-window conformal calibration to maintain approximate statistical validity under non-stationary policy optimization, and leverages value-function uncertainty to drive adaptive entropy regularization for principled exploration. We evaluate CP-PPO on five diverse Gymnasium environments spanning discrete and continuous control. CP-PPO achieves a 63% performance improvement on Pendulum-v1 while maintaining empirical coverage that precisely matches the theoretical target (90.0% for α = 0.1) across all environments. Comprehensive ablation studies demonstrate that both conformal calibration and adaptive entropy contribute to performance gains, and that coverage guarantees hold reliably across a wide range of hyperparameters (α ∈ [0.01, 0.3], N_cal ∈ [100, 2000]). CP-PPO establishes conformal prediction as a promising framework for uncertainty quantification in online RL, offering empirical validation of finite-sample coverage properties in non-stationary sequential decision-making.
reinforcement learning; conformal prediction; uncertainty quantification; safe exploration; statistical learning theory
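To make the calibration mechanism concrete, the following is a minimal sketch of sliding-window split conformal calibration for value predictions, under assumptions not fixed by the abstract: the nonconformity score is taken to be the absolute value-prediction residual, the window holds the N_cal most recent scores, and the mapping from interval width to entropy coefficient is illustrative. The names (ConformalCalibrator, window_size, entropy_coef) are hypothetical, not taken from the paper.

```python
# Sketch: sliding-window split conformal calibration for value predictions,
# with an illustrative width-driven entropy coefficient (assumed mapping).
from collections import deque
import numpy as np

class ConformalCalibrator:
    def __init__(self, alpha: float = 0.1, window_size: int = 1000):
        self.alpha = alpha
        # Sliding calibration window of the most recent nonconformity scores.
        self.scores = deque(maxlen=window_size)

    def update(self, v_pred: float, v_target: float) -> None:
        # Assumed nonconformity score: absolute residual of the value prediction.
        self.scores.append(abs(v_pred - v_target))

    def interval(self, v_pred: float) -> tuple[float, float]:
        # Standard split conformal quantile with finite-sample correction:
        # the ceil((n + 1)(1 - alpha)) / n empirical quantile of the window.
        n = len(self.scores)
        if n == 0:
            return (-np.inf, np.inf)  # no calibration data yet
        level = np.ceil((n + 1) * (1 - self.alpha)) / n
        if level > 1.0:
            return (-np.inf, np.inf)  # too few scores for requested coverage
        q = np.quantile(np.asarray(self.scores), level)
        return (v_pred - q, v_pred + q)

    def entropy_coef(self, v_pred: float, base_coef: float = 0.01) -> float:
        # Illustrative adaptive entropy: scale the PPO entropy bonus by the
        # interval half-width so that higher value uncertainty encourages
        # more exploration. The exact mapping used by CP-PPO is an assumption.
        lo, hi = self.interval(v_pred)
        half_width = (hi - lo) / 2.0
        if not np.isfinite(half_width):
            return base_coef
        return base_coef * (1.0 + half_width)
```

In an online loop of this kind, update() would be called with each newly observed value target after a rollout, while interval() and entropy_coef() would be queried per state; because the window discards old scores, the calibration tracks the drifting residual distribution induced by policy updates, which is what makes the validity approximate rather than exact.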