Machine Learning Week5 Neural Networks Learning

Machine Learning Week5 Neural Networks Learning

Cost Function and Backpropagation

Cost Function

  • L = total number of layers in the network
  • sls = number of units (not counting bias unit) in layer l
  • K = number of output units/classes

The (regularized) logistic regression cost function is as follows:

For neural networks, the cost function is a generalization of this equation above, so instead of one output we generate k outputs:

  • First half

    For each training data example, sum of each position in the output vector

  • Second half

    This is also called a weight decay term, the lambda value determines the important of the two halves

Backpropagation Algorithm

Back propagation is an algorithm used to minimize the cost function, as it allows us to calculate partial derivatives.

  • We want to find parameters Θ which minimize J(Θ):
    • Gradient descent
    • Advanced optimization algorithms
  • We want to calculate the partial derivative Ɵ with respect to a single parameter

    • Ɵi is the matrix of weights which define the function mapping from layer i to layer i+1
    • Ɵijl
      • i here represents the unit in layer l+1 you're mapping to (destination node)
      • j is the unit in layer l you're mapping from (origin node)
      • l is the layer your mapping from (to layer l+1) (origin layer)
  • Let's get an idea regarding the intuition of the algorithm

    For each node we can calculate (δjl) - this is the error of node j in layer l

    But the problem here is, "what is this 'real' value, and how do we calculate it?!"

    • The only "real" value we have is our actual classification (our y value) - so that's where we start
  • To do so, we use the following algorithm:

    1. Set a(1) := x(t)

    2. Perform forward propagation algorithm to compute al for l=2,3,…,L:

Backpropagation Intuition

Forward propagation

Feeding input into the input layer (xi, yi), the sigmoid function applied to the z values gives the activation values

Back propagation

The cost function if there is a single output:

If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with:

Intuitively, δj(l) is the "error" for aj(l) (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some δj(l) :

Backpropagation in Practice

Implementation Note: Unrolling Parameters

With neural networks, we are working with sets of matrices:



In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the "unrolled" versions as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)

Gradient Checking

Backpropagation has a lot of details, small bugs can be present and ruin it :-(

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:

A small value for ϵ(epsilon) such as ϵ=10−4, guarantees that the math works out properly. If the value for ϵ* is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the Θj matrix. In octave we can do it as follows:

epsilon = 1e-4;
for i = 1:n,
thetaPlus = theta;
thetaPlus(i) += epsilon;
thetaMinus = theta;
thetaMinus(i) -= epsilon;
gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)

Once you have verified once that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.

Random Initialization

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:

%If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Putting It Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

  • Number of input units = dimension of features x(i)
  • Number of output units = number of classes
  • Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

  1. Randomly initialize the weights
  2. Implement forward propagation to get hΘ(x(i)) for any x(i)
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
Perform forward propagation and backpropagation using example (x(i),y(i))
(Get activations a(l) and delta terms d(l) for l = 2,...,L

The following image gives us an intuition of what is happening as we are implementing our neural network:

Ideally, you want hΘ(x(i)) ≈ y(i). This will minimize our cost function. However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.


Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now