Machine Learning Week 05

Neural Networks Learning

Cost Function and Backpropagation

  1. Cost Function

    • L = total number of layers in the network

    • \(s_l\) = number of units in layer l, not counting the bias unit

    • K = number of output units (i.e., the number of classes)

    • Binary classification: one output unit (K = 1); multi-class classification: K output units

    • Cost function for logistic regression: \[ J(\theta)=-\frac {1}{m}\sum \limits_{i=1}^m\bigg [y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\bigg ]+\frac{\lambda}{2m}\sum \limits_{j=1}^n\theta_j^2 \]

    • Cost function for the neural network: \[ J(\Theta)=-\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}\left[y_{k}^{(i)} \log \left((h_{\Theta}(x^{(i)}))_{k}\right)+\left(1-y_{k}^{(i)}\right) \log \left(1-(h_{\Theta}(x^{(i)}))_{k}\right)\right]+\frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}(\Theta_{j,i}^{(l)})^2 \]

      • The double sum simply adds up the logistic-regression cost computed for each unit in the output layer.
      • The triple sum simply adds up the squares of all the individual Θ values in the entire network; the index i here does not refer to training example i. A minimal Octave sketch of this cost follows below.
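
      • A minimal Octave sketch of this cost for a 3-layer network (assuming A3 is the m x K matrix of output activations \(h_\Theta(x^{(i)})\), Y is the m x K one-hot label matrix, and Theta1/Theta2 are the weight matrices; the names are illustrative, not the course's official code):

        % Unregularized part: logistic cost summed over all examples and output units
        J = (-1/m) * sum(sum( Y .* log(A3) + (1 - Y) .* log(1 - A3) ));

        % Regularization: squares of all weights except the bias columns
        reg = (lambda/(2*m)) * ( sum(sum(Theta1(:, 2:end).^2)) + ...
                                 sum(sum(Theta2(:, 2:end).^2)) );
        J = J + reg;
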
  2. Backpropagation Algorithm

    • Gradient computation

      • To minimize J(Θ) we need to compute J(Θ) and the partial derivatives of J(Θ) with respect to every parameter Θ

      • Procedure: run forward propagation first, then propagate the errors backwards

      • \(\delta_j^{(l)}\) = "error" of node j in layer l, used to adjust the activation values. Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial z_{j}^{(l)}} \mathrm{cost}(i)\), where \(\mathrm{cost}(i)=-\big[y^{(i)}\log(h_\Theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\Theta(x^{(i)}))\big]\) is the per-example term of the cost function above. Differentiating gives, for the output layer, \(\delta_j^{(L)}=a_j^{(L)}-y_j^{(i)}\).
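
        Filling in the differentiation step for the output layer (with sigmoid activation, \(a=g(z)\) and \(g'(z)=a(1-a)\); superscripts are dropped on the right-hand side for brevity):

        \[ \delta_j^{(L)}=\frac{\partial\,\mathrm{cost}(i)}{\partial z_j^{(L)}}=\left(-\frac{y_j}{a_j}+\frac{1-y_j}{1-a_j}\right)a_j(1-a_j)=-y_j(1-a_j)+(1-y_j)a_j=a_j^{(L)}-y_j^{(i)} \]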

      • Algorithm:

        • Set \(\Delta_{i,j}^{(l)}:=0\) for all (l, i, j)

        • For training example t =1 to m:

          1. Set \(a^{(1)} := x^{(t)}\)

          2. Perform forward propagation to compute \(a^{(l)}\) for \(l=2,3,\dots,L\)


          3. Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\)

            Where L is our total number of layers and \(a^{(L)}\) is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences between the network's actual outputs in the last layer and the correct values in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

          4. Compute \(\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}\) using \(\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \ .∗\ a^{(l)} \ .∗\ (1 - a^{(l)})\)

            The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by \(z^{(l)}\).

            The g-prime derivative terms can also be written out as:

            \(g′(z^{(l)})=a^{(l)} .∗ (1−a^{(l)})\)
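
            As a small Octave helper (a sketch, assuming the sigmoid activation):

            % g'(z) for the sigmoid; works element-wise on vectors and matrices
            g = 1 ./ (1 + exp(-z));
            gprime = g .* (1 - g);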

          5. \(\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\), or with vectorization, \(\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)

            Hence we update our new \(\Delta\) matrix.

            • \(D_{i, j}^{(l)}:=\frac{1}{m}\left(\Delta_{i, j}^{(l)}+\lambda \Theta_{i, j}^{(l)}\right), \text { if } j \neq 0\)

            • \(D_{i, j}^{(l)}:=\frac{1}{m}\Delta_{i, j}^{(l)}, \text { if } j = 0\)

              The capital-delta matrix \(\Delta\) is used as an "accumulator" to add up our values as we go along; dividing by m and adding the regularization term gives D, the matrix of partial derivatives. Thus we get \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)= D_{ij}^{(l)}\).

      • Derivation of the above procedure (a concrete Octave sketch follows below)
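
      • Putting steps 1-5 together, a minimal Octave sketch for a network with one hidden layer (X, Y, Theta1, Theta2, lambda and the explicit sigmoid are illustrative assumptions, not the course's official code):

        Delta1 = zeros(size(Theta1));
        Delta2 = zeros(size(Theta2));

        for t = 1:m
          % Steps 1-2: forward propagation for example t
          a1 = [1; X(t, :)'];               % add bias unit
          z2 = Theta1 * a1;
          a2 = [1; 1 ./ (1 + exp(-z2))];    % sigmoid activation plus bias unit
          z3 = Theta2 * a2;
          a3 = 1 ./ (1 + exp(-z3));         % output h_Theta(x^(t))

          % Step 3: output-layer error (Y is an m x K one-hot matrix)
          delta3 = a3 - Y(t, :)';

          % Step 4: hidden-layer error, dropping the bias term
          delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));
          delta2 = delta2(2:end);

          % Step 5: accumulate
          Delta1 = Delta1 + delta2 * a1';
          Delta2 = Delta2 + delta3 * a2';
        end

        % Partial derivatives D; regularize every column except the bias column (j = 0)
        D1 = Delta1 / m;  D1(:, 2:end) += (lambda/m) * Theta1(:, 2:end);
        D2 = Delta2 / m;  D2(:, 2:end) += (lambda/m) * Theta2(:, 2:end);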

Backpropagation in practice

  1. Unrolling parameters into vectors:

    • Octave indexing is 1-based, and a range a:b includes both endpoints a and b

    • Optimization routines (e.g. fminunc) expect the parameters unrolled into a single vector; a usage sketch follows the code below

      % Unroll all parameter matrices into single column vectors
      % (example sizes: Theta1 and Theta2 are 10x11, Theta3 is 1x11)
      thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
      deltaVector = [ D1(:); D2(:); D3(:) ];

      % Reshape the unrolled vector back into the original matrices
      Theta1 = reshape(thetaVector(1:110), 10, 11);
      Theta2 = reshape(thetaVector(111:220), 10, 11);
      Theta3 = reshape(thetaVector(221:231), 1, 11);
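
    • How the unrolled vector is typically fed to an optimizer (a sketch; costFunction is a hypothetical wrapper that reshapes thetaVector back into Theta1..Theta3, computes the cost, and returns the unrolled gradient, and initialThetaVector is the unrolled initial parameters):

      % costFunction(thetaVector) must return [J, gradVector] (hypothetical wrapper)
      options = optimset('GradObj', 'on', 'MaxIter', 100);
      [optTheta, J] = fminunc(@costFunction, initialThetaVector, options);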

  2. Gradient checking

    • \(\frac{d}{d\Theta}J(\Theta)\approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}\), with \(\epsilon = 10^{-4}\) a typical choice

    • \[ \dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon} \]

    • In Octave:

      % n = total number of parameters in theta
      epsilon = 1e-4;
      for i = 1:n,
        thetaPlus = theta;
        thetaPlus(i) += epsilon;
        thetaMinus = theta;
        thetaMinus(i) -= epsilon;
        gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
      end;
    • Check that gradApprox ≈ deltaVector (a concrete comparison is sketched below). Run this check only once to verify backpropagation, then disable it; numerical gradient checking is very slow.
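
    • One way to quantify the comparison (a sketch; the roughly 1e-9 threshold is a common rule of thumb, not something stated in these notes):

      % Relative difference between the numerical and backprop gradients;
      % values around 1e-9 or smaller suggest the two implementations agree
      diff = norm(gradApprox(:) - deltaVector(:)) / norm(gradApprox(:) + deltaVector(:));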

  3. Random Initialization

    • Symmetry breaking: if all weights start with the same value, every unit in a layer computes the same thing, and the parameters stay identical after each gradient-descent update

    • Initialize each \(\Theta_{ij}^{(l)}\) to a random value in \([-\epsilon,\epsilon]\), generating them all at once so that no two are identical, e.g.:

      % Each entry is an independent uniform draw from [-init_epsilon, init_epsilon]
      Theta1 = rand(10,11)*2*init_epsilon-init_epsilon;

      (Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
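
      A common heuristic for choosing init_epsilon (an assumption for illustration, not stated in these notes) is to scale it by the sizes of the layers the weight matrix connects:

      % L_in / L_out: number of units feeding into / out of this weight matrix
      L_in = 10;  L_out = 10;                       % example sizes matching rand(10,11) above
      init_epsilon = sqrt(6) / sqrt(L_in + L_out);
      Theta1 = rand(10,11) * 2 * init_epsilon - init_epsilon;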

  4. Putting it together

    • By default, use the same number of units in every hidden layer.
    • Steps to build a model:
      1. Pick a network architecture
        • Number of input units = dimension of features \(x^{(i)}\)
        • Number of output units = number of classes
        • Number of hidden units per layer = usually, the more the better (this must be balanced against the cost of computation, which increases with more hidden units)
        • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
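
        For example (illustrative sizes only, e.g. 20x20 pixel inputs and 10 classes):

          % Illustrative architecture choice, not a prescription
          input_layer_size  = 400;   % dimension of x^(i)
          hidden_layer_size = 25;    % one hidden layer with 25 units
          num_labels        = 10;    % number of output classes K
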
      2. Training a Neural Network
        1. Randomly initialize the weights
        2. Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
        3. Implement the cost function
        4. Implement backpropagation to compute partial derivatives
        5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
        6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
        7. Loop over every training example when running forward propagation and backpropagation. Note that \(J(\Theta)\) is non-convex, so gradient descent may end up at a local optimum. A sketch of the whole pipeline follows below.
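
      A minimal Octave sketch of the whole training pipeline (nnCostFunction is a hypothetical wrapper that returns [J, grad] for the unrolled parameters; the sizes follow the illustrative architecture above):

        % 1. Randomly initialize and unroll the weights
        init_epsilon = 0.12;
        initial_Theta1 = rand(25, 401) * 2 * init_epsilon - init_epsilon;
        initial_Theta2 = rand(10, 26)  * 2 * init_epsilon - init_epsilon;
        initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];

        % 2-4. Forward propagation, cost, and backpropagation happen inside the wrapper
        costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                       num_labels, X, y, lambda);

        % 5. (Run gradient checking once against costFunc here, then disable it.)

        % 6. Minimize the cost with a built-in optimizer
        options = optimset('GradObj', 'on', 'MaxIter', 400);
        [nn_params, cost] = fminunc(costFunc, initial_nn_params, options);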