Background
The update rule of classic stochastic gradient descent (SGD) is:
\[ \theta_{t+1} \leftarrow \theta_{t} - \alpha \nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \]
where \(J\) is the loss function, \(\theta_{t}\) is the parameter vector at step t, \(\alpha\) is the learning rate, and \(\nabla_{\theta_{t}}J_{minibatch}(\theta_{t})\) is the minibatch gradient at step t.
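As a concrete illustration, here is a minimal NumPy sketch of one SGD step. The function name `sgd_step` and the toy objective \(J(\theta)=\tfrac{1}{2}\lVert\theta\rVert^{2}\) are illustrative assumptions, not from the references:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One plain SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy example: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    grad = theta  # stands in for the minibatch gradient at step t
    theta = sgd_step(theta, grad, lr=0.1)
print(theta)  # close to the minimizer [0, 0]
```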
Adam Algorithm: Preliminary Optimization
To make the gradient smoother, consider the following update:
$$ m_{t+1} \leftarrow \beta_{1} m_{t} + (1 - \beta_{1}) \nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \\
\theta_{t+1} \leftarrow \theta_{t} - \alpha m_{t+1} $$
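A minimal sketch of this momentum update in NumPy, reusing the same toy objective (the function name and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, m, grad, lr=0.1, beta1=0.9):
    """Momentum update: m <- beta1*m + (1-beta1)*grad, then theta <- theta - lr*m."""
    m = beta1 * m + (1 - beta1) * grad
    theta = theta - lr * m
    return theta, m

# Toy example: minimize J(theta) = 0.5 * ||theta||^2 (gradient is theta).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)  # smoothed gradient, initialized to zero
for _ in range(200):
    grad = theta
    theta, m = momentum_step(theta, m, grad)
print(theta)  # close to the minimizer [0, 0]
```

Because m is a running average of past gradients, components that keep flipping sign largely cancel out, which is what reduces the oscillation noted below.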
Advantages:
- The gradient is smoother, which reduces oscillation
- A larger learning rate can be used at the start of training
Adam Algorithm: Final Optimization
By further extending the idea of "momentum", we arrive at the new update rule:
$$ m_{t+1} \leftarrow \beta_{1} m_{t} + (1 - \beta_{1}) \nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \\
v_{t+1} \leftarrow \beta_{2} v_{t} + (1 - \beta_{2}) \left( \nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \odot \nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \right) \\
\theta_{t+1} \leftarrow \theta_{t} - \alpha \frac{m_{t+1}}{\sqrt{v_{t+1}}} $$
where \(\beta_{1},\beta_{2}\) are hyperparameters between 0 and 1, \(\odot\) denotes elementwise (Hadamard) multiplication, so \(\nabla_{\theta_{t}} J_{minibatch}(\theta_{t}) \odot \nabla_{\theta_{t}} J_{minibatch}(\theta_{t})\) is the elementwise square of the gradient, and \(\alpha\) is the learning rate.
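The same toy setup with the full Adam update above, written as a sketch that follows the formulas in this post (no bias correction); the small `eps` term is an extra assumption added only to avoid division by zero:

```python
import numpy as np

def adam_step(theta, m, v, grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update as written above: smoothed gradient m, smoothed squared gradient v."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad * grad)   # grad * grad is the elementwise square
    theta = theta - lr * m / (np.sqrt(v) + eps)   # eps is not part of the formula above
    return theta, m, v

# Toy example: minimize J(theta) = 0.5 * ||theta||^2 (gradient is theta).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for _ in range(500):
    grad = theta
    theta, m, v = adam_step(theta, m, v, grad)
print(theta)  # theta moves close to the minimizer [0, 0]
```

Dividing by \(\sqrt{v_{t+1}}\) rescales each coordinate individually, which is what gives rarely updated parameters their relatively larger steps.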
Advantages:
- The gradient is smoother, which reduces oscillation
- A larger learning rate can be used at the start of training
- Model parameters that receive small or infrequent updates get larger updates (the step size adapts per parameter)