Making training more stable

Goal: keep gradient values within a reasonable range, e.g. $[1\times10^{-6},\,1\times10^{3}]$

  • Common techniques:
    • Turn multiplication into addition (ResNet, LSTM)
    • Normalization (gradient normalization, gradient clipping)
    • Sensible weight initialization and activation functions
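As a concrete illustration of one technique from the list, gradient clipping by global norm can be sketched as follows (a minimal NumPy sketch of my own; the threshold `theta` and the toy gradients are made up for illustration):

```python
import numpy as np

def clip_gradients(grads, theta):
    """Clip a list of gradient arrays so their global L2 norm is at most theta."""
    # Global L2 norm over all parameter gradients
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if norm > theta:
        # Scale every gradient by the same factor, preserving the direction
        grads = [g * (theta / norm) for g in grads]
    return grads

# Toy example: a gradient that has exploded (global norm = 13)
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_gradients(grads, theta=1.0)  # global norm is now exactly 1
```

Because all gradients are rescaled by one shared factor, the update direction is unchanged; only its magnitude is bounded.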

Keeping each layer's variance constant

Treat each layer's outputs and gradients as random variables, and keep their means and variances consistent across layers

Forward:

$$\mathbb{E}\left[h_{i}^{t}\right]=0 \quad \operatorname{Var}\left[h_{i}^{t}\right]=a \quad \forall i, t$$

Backward:

$$\mathbb{E}\left[\frac{\partial \ell}{\partial h_{i}^{t}}\right]=0 \quad \operatorname{Var}\left[\frac{\partial \ell}{\partial h_{i}^{t}}\right]=b \quad \forall i, t$$

Weight initialization

Randomly initialize parameters within a reasonable range of values

  • Numerical instability is more likely at the start of training
    • Far from the optimum, the loss surface can be complicated
    • Near the optimum, the surface is relatively flat

Initializing with $\mathcal{N}(0,0.01)$ may be fine for small networks, but it gives no guarantee for deep neural networks
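A quick experiment makes this concrete (a NumPy sketch; the layer width, depth, and the reading of 0.01 as the standard deviation are my assumptions): with such small weights, the activations of a deep linear network collapse toward zero layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                      # width of every layer (arbitrary choice)
h = rng.standard_normal(n)   # input with variance ~1

for layer in range(10):
    W = 0.01 * rng.standard_normal((n, n))  # weights with std 0.01
    h = W @ h
    # Each linear layer multiplies the variance by roughly n * 0.01^2 = 0.01,
    # so the activation scale shrinks exponentially with depth.

print(h.std())  # orders of magnitude smaller than the input scale
```

The same argument run in reverse shows gradients vanishing on the backward pass, which is exactly the instability the initialization schemes below are designed to avoid.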

Example: an MLP

Assume the $w_{i,j}^{t}$ are i.i.d. with $\mathbb{E}\left[w_{i,j}^{t}\right]=0$ and $\operatorname{Var}\left[w_{i,j}^{t}\right]=\gamma_{t}$

Assume $w_{i,j}^{t}$ is independent of $h_{j}^{t-1}$

Assume there is no activation function, so $\mathbf{h}^{t}=\mathbf{W}^{t}\mathbf{h}^{t-1}$, where $\mathbf{W}^{t}\in\mathbb{R}^{n_{t}\times n_{t-1}}$

$$\mathbb{E}\left[h_{i}^{t}\right]=\mathbb{E}\left[\sum_{j} w_{i,j}^{t} h_{j}^{t-1}\right]=\sum_{j}\mathbb{E}\left[w_{i,j}^{t}\right]\mathbb{E}\left[h_{j}^{t-1}\right]=0$$

Forward variance

$$\begin{aligned}\operatorname{Var}\left[h_{i}^{t}\right] &=\mathbb{E}\left[\left(h_{i}^{t}\right)^{2}\right]-\mathbb{E}\left[h_{i}^{t}\right]^{2}=\mathbb{E}\left[\left(\sum_{j} w_{i,j}^{t} h_{j}^{t-1}\right)^{2}\right] \\&=\mathbb{E}\left[\sum_{j}\left(w_{i,j}^{t}\right)^{2}\left(h_{j}^{t-1}\right)^{2}+\sum_{j \neq k} w_{i,j}^{t} w_{i,k}^{t} h_{j}^{t-1} h_{k}^{t-1}\right] \\&=\sum_{j} \mathbb{E}\left[\left(w_{i,j}^{t}\right)^{2}\right] \mathbb{E}\left[\left(h_{j}^{t-1}\right)^{2}\right] \\&=\sum_{j} \operatorname{Var}\left[w_{i,j}^{t}\right] \operatorname{Var}\left[h_{j}^{t-1}\right]=n_{t-1} \gamma_{t} \operatorname{Var}\left[h_{j}^{t-1}\right]\end{aligned}$$

(The cross terms vanish because the weights are independent with zero mean, and $\mathbb{E}[x^2]=\operatorname{Var}[x]$ since all means are zero.)

This requires $n_{t-1}\gamma_{t}=1$
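The condition can be checked empirically (a NumPy sketch; the widths are arbitrary): choosing $\gamma_t = 1/n_{t-1}$ keeps the output variance of a linear layer close to the input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 200, 300    # arbitrary layer widths
gamma = 1.0 / n_in        # weight variance chosen so that n_in * gamma = 1

h_prev = rng.standard_normal(n_in)                  # input with variance ~1
W = rng.normal(0.0, np.sqrt(gamma), (n_out, n_in))  # Var[w] = gamma
h = W @ h_prev

print(h_prev.var(), h.var())  # empirically both stay on the same scale
```

With $\gamma_t$ much larger or smaller than $1/n_{t-1}$, the same experiment repeated over many layers shows the variance exploding or vanishing instead.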

Backward mean and variance

$$\frac{\partial \ell}{\partial \mathbf{h}^{t-1}}=\frac{\partial \ell}{\partial \mathbf{h}^{t}} \mathbf{W}^{t} \quad\Rightarrow\quad \left(\frac{\partial \ell}{\partial \mathbf{h}^{t-1}}\right)^{T}=\left(\mathbf{W}^{t}\right)^{T}\left(\frac{\partial \ell}{\partial \mathbf{h}^{t}}\right)^{T}$$

$$\mathbb{E}\left[\frac{\partial \ell}{\partial h_{i}^{t-1}}\right]=0$$

$$\operatorname{Var}\left[\frac{\partial \ell}{\partial h_{i}^{t-1}}\right]=n_{t} \gamma_{t} \operatorname{Var}\left[\frac{\partial \ell}{\partial h_{j}^{t}}\right]$$

This requires $n_{t}\gamma_{t}=1$

Xavier initialization

It is hard to satisfy $n_{t-1}\gamma_{t}=1$ and $n_{t}\gamma_{t}=1$ at the same time (both hold only when $n_{t-1}=n_{t}$)

Xavier takes the compromise $\gamma_{t}\left(n_{t-1}+n_{t}\right)/2=1 \;\Rightarrow\; \gamma_{t}=2/\left(n_{t-1}+n_{t}\right)$

Normal distribution: $\mathcal{N}\left(0,\sqrt{2/\left(n_{t-1}+n_{t}\right)}\right)$ (the second argument is the standard deviation, so the variance is $2/(n_{t-1}+n_{t})$)

Uniform distribution: $\mathscr{U}\left(-\sqrt{6/\left(n_{t-1}+n_{t}\right)},\sqrt{6/\left(n_{t-1}+n_{t}\right)}\right)$ (since $\operatorname{Var}[\mathscr{U}(-a,a)]=a^{2}/3$, this gives the same variance $2/(n_{t-1}+n_{t})$)

It adapts to changes in the shape of the weight matrices, in particular to varying $n_{t}$
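Both variants can be written down directly (a minimal NumPy sketch; the function names are my own): each draws weights with variance $2/(n_{t-1}+n_{t})$.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Weights from N(0, sigma^2) with sigma^2 = 2 / (n_in + n_out)."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, (n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    """Weights from U(-a, a) with a = sqrt(6 / (n_in + n_out));
    Var[U(-a, a)] = a^2 / 3 = 2 / (n_in + n_out), same as above."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, (n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(400, 600, rng)
print(W.var())  # empirical variance close to 2 / (400 + 600) = 0.002
```

Deep learning frameworks ship these under names such as `xavier_uniform_` / `xavier_normal_` (PyTorch) or "Glorot" initializers (Keras).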

Activation functions

To keep the analysis simple, assume a linear activation function $\sigma(x)=\alpha x+\beta$

$$\mathbf{h}^{\prime}=\mathbf{W}^{t} \mathbf{h}^{t-1} \quad\text{and}\quad \mathbf{h}^{t}=\sigma\left(\mathbf{h}^{\prime}\right)$$

$$\mathbb{E}\left[h_{i}^{t}\right]=\mathbb{E}\left[\alpha h_{i}^{\prime}+\beta\right]=\beta \quad\Rightarrow\quad \beta=0$$

$$\begin{aligned}\operatorname{Var}\left[h_{i}^{t}\right] &=\mathbb{E}\left[\left(h_{i}^{t}\right)^{2}\right]-\mathbb{E}\left[h_{i}^{t}\right]^{2} \\&=\mathbb{E}\left[\left(\alpha h_{i}^{\prime}+\beta\right)^{2}\right]-\beta^{2} \\&=\mathbb{E}\left[\alpha^{2}\left(h_{i}^{\prime}\right)^{2}+2\alpha\beta h_{i}^{\prime}+\beta^{2}\right]-\beta^{2} \\&=\alpha^{2} \operatorname{Var}\left[h_{i}^{\prime}\right]\end{aligned} \quad\Rightarrow\quad \alpha=1$$

Checking common activation functions

Using Taylor expansions around $x=0$:

$$\begin{aligned}\operatorname{sigmoid}(x) &=\frac{1}{2}+\frac{x}{4}-\frac{x^{3}}{48}+O\left(x^{5}\right) \\\tanh(x) &=0+x-\frac{x^{3}}{3}+O\left(x^{5}\right) \\\operatorname{relu}(x) &=0+x \quad \text{for } x \geq 0\end{aligned}$$

Near zero, $\tanh$ and ReLU already behave like the identity $\sigma(x)=x$, but sigmoid does not; adjust it to $4\times\operatorname{sigmoid}(x)-2$ (which expands to $x - x^{3}/12 + O(x^{5})$)
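These expansions are easy to verify numerically (a NumPy sketch; the test points are arbitrary): near zero, $4\times\operatorname{sigmoid}(x)-2$ and $\tanh(x)$ track the identity closely, while the raw sigmoid does not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-0.1, -0.01, 0.01, 0.1])  # small inputs near zero

scaled = 4.0 * sigmoid(x) - 2.0  # = x - x^3/12 + ..., identity-like near 0
raw = sigmoid(x)                 # = 1/2 + x/4 + ..., wrong offset and slope

print(np.max(np.abs(scaled - x)))      # small: the scaled version tracks x
print(np.max(np.abs(np.tanh(x) - x)))  # small: tanh tracks x
print(np.max(np.abs(raw - x)))         # large: raw sigmoid is offset by ~1/2
```

This is one reason $\tanh$ and ReLU tend to be more variance-friendly defaults than the raw sigmoid under the analysis above.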

Summary

  • Sensible choices of weight initialization and activation function improve numerical stability