DQN in Practice: Reaching the Maximum Score of 200 on CartPole-v0 Within 1000 Training Episodes
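DQN in Practice: A Complete Guide to Building a CartPole Agent from Scratch

1. Environment Setup and Core Concepts

Before building the DQN agent, we need a few core concepts. CartPole-v0 is a classic control problem in OpenAI Gym: the goal is to keep the pole mounted on the cart upright. The environment has four state variables (cart position, cart velocity, pole angle, and pole angular velocity) and two actions (push the cart left or push it right).

First, install the required Python libraries:

```bash
pip install gym numpy torch matplotlib
```

DQN (Deep Q-Network) combines deep learning with Q-learning by using a neural network to approximate the Q-function. Unlike tabular Q-learning, which stores Q-values in a table, DQN can handle high-dimensional state spaces. DQN has three core components:

- Experience replay: store past experiences and sample them at random, which breaks the correlation between consecutive samples
- Target network: a second network that stabilizes training
- Neural network approximation: a deep neural network replaces the Q-table

2. DQN Implementation in Detail

2.1 Network Architecture

We use PyTorch to build a simple network with three fully connected layers:

```python
import torch
import torch.nn as nn
import torch.optim as optim


class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # No activation on the last layer: the outputs are raw Q-values.
        return self.fc3(x)
```

The network takes the 4-dimensional state vector and outputs one Q-value for each of the 2 actions. The hidden layers use ReLU activations, and the last layer outputs the Q-values directly.

2.2 Experience Replay

Experience replay is key to stable DQN training. It stores the agent's interactions with the environment as tuples of (state, action, reward, next state, done flag):

```python
from collections import deque
import random


class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest experience once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Experience replay has two main advantages:

- It breaks the temporal correlation between samples
- It improves data efficiency, because each experience can be used many times

2.3 Training Code

The complete training procedure covers environment interaction, experience storage, and network updates:

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Hyperparameters
BATCH_SIZE = 64
GAMMA = 0.99
EPS_START = 1.0
EPS_END = 0.01
EPS_DECAY = 0.995
TARGET_UPDATE = 10
MEMORY_CAPACITY = 10000

policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
# Start the target network as an exact copy of the policy network.
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters())
memory = ReplayBuffer(MEMORY_CAPACITY)


def select_action(state, eps):
    # Epsilon-greedy: explore with probability eps, otherwise act greedily.
    if random.random() < eps:
        return random.randint(0, action_size - 1)
    with torch.no_grad():
        return policy_net(state).argmax().item()


def optimize_model():
    # Wait until the buffer holds at least one full batch.
    if len(memory) < BATCH_SIZE:
        return
    batch = memory.sample(BATCH_SIZE)
    # torch.cat expects each stored state to be a tensor of shape (1, state_size).
    state_batch = torch.cat([s for (s, a, r, ns, d) in batch])
    action_batch = torch.tensor([a for (s, a, r, ns, d) in batch], dtype=torch.int64)
    reward_batch = torch.tensor([r for (s, a, r, ns, d) in batch], dtype=torch.float32)
    next_state_batch = torch.cat([ns for (s, a, r, ns, d) in batch])
    # Cast done flags to float: PyTorch does not support `1 - x` on bool tensors.
    done_batch = torch.tensor([d for (s, a, r, ns, d) in batch], dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken.
    current_q = policy_net(state_batch).gather(1, action_batch.unsqueeze(1))
    # max_a' Q_target(s', a'), with no gradient flowing into the target network.
    next_q = target_net(next_state_batch).max(1)[0].detach()
    # TD target; the bootstrap term is zeroed for terminal transitions.
    expected_q = reward_batch + GAMMA * next_q * (1 - done_batch)

    loss = nn.MSELoss()(current_q.squeeze(1), expected_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

3. Advanced Tuning Techniques

3.1 Double DQN

Vanilla DQN tends to overestimate Q-values. Double DQN mitigates this by decoupling action selection from Q-value evaluation: the policy network picks the next action, and the target network evaluates it.

```python
# Replace the next_q computation in optimize_model with:
next_actions = policy_net(next_state_batch).max(1)[1].unsqueeze(1)
next_q = target_net(next_state_batch).gather(1, next_actions).squeeze(1).detach()
```

Compared with vanilla DQN, Double DQN has two advantages:

- It reduces Q-value overestimation
- It improves policy stability

3.2 Prioritized Experience Replay

Instead of sampling experiences uniformly, prioritized experience replay assigns each experience a priority based on the magnitude of its TD error:

```python
class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.alpha = alpha  # how strongly priorities skew sampling (0 = uniform)
        self.buffer = []
        self.priorities = np.zeros((capacity,), dtype=np.float32)
        self.pos = 0
        self.capacity = capacity

    def push(self, state, action, reward, next_state, done):
        # New experiences get the current maximum priority so they are sampled at least once.
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
        else:
            self.buffer[self.pos] = (state, action, reward, next_state, done)
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) == self.capacity:
            prios = self.priorities
        else:
            prios = self.priorities[:self.pos]
        probs = prios ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        samples = [self.buffer[idx] for idx in indices]
        # Importance-sampling weights correct the bias from non-uniform sampling.
        total = len(self.buffer)
        weights = (total * probs[indices]) ** (-beta)
        weights /= weights.max()
        return samples, indices, np.array(weights, dtype=np.float32)

    def update_priorities(self, batch_indices, batch_priorities):
        for idx, prio in zip(batch_indices, batch_priorities):
            self.priorities[idx] = prio
```

Prioritized replay can improve learning efficiency, especially on sparse-reward tasks.

3.3 Hyperparameter Tuning Guide

The ranges below are recommended values reported to work well in practice:

| Hyperparameter | Recommended value | Role |
| --- | --- | --- |
| Learning rate | 1e-4 to 1e-3 | Controls the size of parameter updates |
| Discount factor γ | 0.95 to 0.99 | Balances immediate and future rewards |
| Replay buffer size | 1e4 to 1e6 | Number of stored experiences |
| Batch size | 32 to 128 | Number of samples per update |
| Initial ε | 1.0 | Starting exploration rate |
| Final ε | 0.01 to 0.1 | Lower bound of the exploration rate |
| ε decay rate | 0.99 to 0.999 | How fast the exploration rate decays |
| Target network update frequency | every 100 to 1000 steps | Key to stable training |

4. Training Monitoring and Result Analysis

4.1 Visualizing the Training Curves

Three metrics are worth monitoring during training:

- Total reward per episode
- Average Q-value
- Loss value

```python
import matplotlib.pyplot as plt


def plot_training(rewards, losses, q_values):
    plt.figure(figsize=(12, 5))

    plt.subplot(131)
    plt.plot(rewards)
    plt.title("Episode Rewards")
    plt.xlabel("Episode")

    plt.subplot(132)
    plt.plot(losses)
    plt.title("Training Loss")
    plt.xlabel("Step")

    plt.subplot(133)
    plt.plot(q_values)
    plt.title("Average Q Value")
    plt.xlabel("Step")

    plt.tight_layout()
    plt.show()
```

4.2 Performance Evaluation and Baseline Comparison

We compared several algorithms on CartPole-v0:

| Algorithm | Average training episodes to reach 200 | Final stability |
| --- | --- | --- |
| Vanilla DQN | 800-1200 episodes | Occasionally collapses |
| Double DQN | 600-900 episodes | More stable |
| Prioritized replay DQN | 500-800 episodes | Most stable |

In practice, the complete implementation usually reaches the maximum score of 200 stably within 1000 episodes. If training does not go well, check the following:

- Reward does not increase: the learning rate may be too high, or the network architecture may be unsuitable
- Reward fluctuates heavily: try reducing the batch size or enlarging the replay buffer
- Early collapse: adjust the ε decay rate so that the agent explores enough

4.3 Deployment Notes

Once training is finished, the model can be saved and loaded for deployment:

```python
# Save the model
torch.save(policy_net.state_dict(), "dqn_cartpole.pth")

# Load the model
loaded_net = DQN(state_size, action_size)
loaded_net.load_state_dict(torch.load("dqn_cartpole.pth"))
loaded_net.eval()
```

Recommendations for deployment:

- Turn off exploration (ε = 0)
- Add exception handling to guard against unexpected states
- Consider model quantization to reduce the deployment size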
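The first two deployment recommendations (no exploration, guarding against unexpected states) can be combined in a small wrapper around the Q-function. The following is only a sketch under stated assumptions: `safe_greedy_action`, `q_fn`, and `fallback_action` are hypothetical names that do not appear in the code above, and returning a fixed fallback action is just one possible way to handle a malformed state.

```python
import math
from typing import Callable, Sequence


def safe_greedy_action(
    q_fn: Callable[[Sequence[float]], Sequence[float]],
    state: Sequence[float],
    state_size: int = 4,
    fallback_action: int = 0,
) -> int:
    """Return the greedy action, or fallback_action if anything looks wrong."""
    # Reject states with the wrong dimensionality or with NaN/inf entries.
    if len(state) != state_size or not all(math.isfinite(x) for x in state):
        return fallback_action
    try:
        q_values = list(q_fn(state))
    except Exception:
        # Assumption: any inference failure is handled by the fixed fallback.
        return fallback_action
    if not q_values or not all(math.isfinite(q) for q in q_values):
        return fallback_action
    # Pure argmax over Q-values: epsilon is effectively 0, so no random exploration.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With the PyTorch model, `q_fn` could be a small function that converts the state to a float tensor of shape (1, 4), runs `loaded_net` inside `torch.no_grad()`, and returns the two Q-values as a list.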
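Section 2.3 defines `select_action` and `optimize_model`, but it does not show the episode loop that ties them together. One way that loop might look is sketched below. It is not the author's implementation: `run_training` and its callback parameters are hypothetical names, the classic Gym API (`reset()` returns an observation, `step()` returns four values, as in gym versions before 0.26) is assumed, and so are the choices to decay ε once per episode and to count the target-network update interval in episodes.

```python
from typing import Callable, List


def run_training(
    env,
    select_action: Callable,
    store: Callable,
    optimize: Callable,
    sync_target: Callable,
    num_episodes: int = 1000,
    eps_start: float = 1.0,
    eps_end: float = 0.01,
    eps_decay: float = 0.995,
    target_update: int = 10,
) -> List[float]:
    """Run the episode loop and return the total reward of every episode."""
    eps = eps_start
    episode_rewards = []
    for episode in range(num_episodes):
        state = env.reset()  # assumption: classic Gym API (gym < 0.26)
        total, done = 0.0, False
        while not done:
            action = select_action(state, eps)  # epsilon-greedy choice
            next_state, reward, done, _ = env.step(action)
            store(state, action, reward, next_state, done)  # fill the replay buffer
            optimize()  # one gradient step on a sampled batch
            state = next_state
            total += reward
        # Assumption: epsilon decays once per episode, clipped at eps_end.
        eps = max(eps_end, eps * eps_decay)
        # Assumption: the target network is synced every target_update episodes.
        if (episode + 1) % target_update == 0:
            sync_target()
        episode_rewards.append(total)
    return episode_rewards
```

To use this with the code from section 2.3, `store` would wrap `memory.push`, `optimize` would be `optimize_model`, and `sync_target` would copy `policy_net.state_dict()` into `target_net`. Observations would also need converting to float tensors of shape (1, 4) before they reach `select_action` and the buffer, since `optimize_model` concatenates stored states with `torch.cat`.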