## Motivations

### Non-linear Hypotheses

#### Why do we need neural networks?

- Consider a supervised learning classification problem
- We could apply logistic regression with polynomial terms
- g, as usual, is the sigmoid function
- If you include enough polynomial terms, you may be able to get a hypothesis that separates the classes
- However, this problem has just two features, $x_1$ and $x_2$; many machine learning problems have far more features

- e.g. our housing example
- 100 house features, predict odds of a house being sold in the next 6 months
- Here, if you included all the quadratic (second-order) terms
  - There are lots of them ($x_1^2, x_1x_2, x_1x_3, \dots, x_1x_{100}$, and so on)
    - For the case of n = 100, you have about 5000 features
    - The number of features grows as $O(n^2)$

- Not a good way to build classifiers when n is large
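
The "about 5000" figure above can be checked with a short count. This is a minimal sketch, assuming "quadratic terms" means every distinct product $x_i x_j$ with $i \le j$; the function names are illustrative and not part of the original notes.

```python
from itertools import combinations_with_replacement


def count_quadratic_terms(n):
    """Closed form: number of distinct products x_i * x_j with i <= j."""
    return n * (n + 1) // 2


def count_by_enumeration(n):
    """Brute-force check: enumerate every unordered pair (i, j), i <= j."""
    return sum(1 for _ in combinations_with_replacement(range(n), 2))


for n in (2, 10, 100):
    # Both counts should agree; for n = 100 the result is 5050.
    print(n, count_quadratic_terms(n), count_by_enumeration(n))
```

The closed form is roughly $n^2 / 2$, which is where the $O(n^2)$ growth comes from.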

#### Example: Problems where n is large - computer vision

Computer vision sees a matrix of pixel intensity values

- Look at matrix - explain what those numbers represent

To build a car detector

- Build a training set of
- Not cars
- Cars

- Then test against a car

How can we do this?

Plot two pixels (two pixel locations)

Plot car or not car on the graph

- Need a non-linear hypothesis to separate the classes
- Feature space
- If we used 50 x 50 pixel images --> 2500 pixels, so n = 2500
- If RGB, then n = 7500
- If 100 x 100 greyscale, then n = 10 000, and including all quadratic terms gives roughly 50 000 000 features

- Too big
- So - simple logistic regression here is not appropriate for large complex systems
- Neural networks are much better for a complex nonlinear hypothesis even when feature space is huge
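
The feature counts above can be reproduced by unrolling a pixel matrix into a feature vector. This is a rough sketch using NumPy with randomly generated images standing in for real data; the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 50 x 50 greyscale image: one intensity (0-255) per pixel.
grey = rng.integers(0, 256, size=(50, 50))
# Hypothetical 50 x 50 RGB image: three intensities per pixel.
rgb = rng.integers(0, 256, size=(50, 50, 3))

# Unroll each image into a single feature vector, one feature per value.
x_grey = grey.reshape(-1)
x_rgb = rgb.reshape(-1)

n_grey = x_grey.size   # 2500
n_rgb = x_rgb.size     # 7500

# Distinct quadratic terms x_i * x_j (i <= j) for the greyscale case.
n_quadratic = n_grey * (n_grey + 1) // 2

print(n_grey, n_rgb, n_quadratic)
```

Even for a small greyscale image the quadratic feature count is already in the millions, which is why adding polynomial terms to logistic regression does not scale here.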

### Neurons and the Brain

**Neural networks** (**NNs**) were originally motivated by the goal of building machines that replicate the brain's functionality

- Looked at here as a machine learning technique

- Origins
- To build learning systems, why not mimic the brain?
- Used a lot in the 80s and 90s
- Popularity diminished in late 90s
- Recent major resurgence
- NNs are computationally expensive, so only recently large scale neural networks became computationally feasible

- Brain
- Does loads of crazy things
- Hypothesis is that the brain has a single learning algorithm

- Evidence for hypothesis
- Auditory cortex --> takes sound signals
- If you cut the wiring from the ear to the auditory cortex
- Re-route optic nerve to the auditory cortex
- Auditory cortex learns to see

- With different tissue learning to see, maybe they all learn in the same way
- Brain learns by itself how to learn

- Brain can process and learn from data from any source

## Neural Networks

### Model Representation I

- Three things to notice about a biological neuron
- Cell body
- Number of input wires (dendrites)
- Output wire (axon)

- Simple level
- Neuron gets one or more inputs through dendrites
- Does processing
- Sends output down axon

- Neurons communicate through electric spikes
- Pulse of electricity via axon to another neuron

#### Artificial neural network - representation of a neuron

In an artificial neural network, a neuron is a logistic unit

- Feed input via input wires
- Logistic unit does computation
- Sends output down output wires

That logistic computation is just like our previous logistic regression hypothesis calculation

Very simple model of a neuron's computation

- Often good to include an $x_0$ input, the **bias unit**
  - This is equal to 1

This is an artificial neuron with a sigmoid (logistic) activation function

- The $\Theta$ vector may also be called the **weights** of a model

The above diagram is a single neuron
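
The single logistic unit described above can be sketched in a few lines. This is a minimal illustration, assuming three inputs plus the bias unit; the weight values and the name `neuron_output` are made up for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron_output(theta, x):
    """Single logistic unit: prepend the bias x_0 = 1, then compute g(theta^T x)."""
    x_with_bias = np.concatenate(([1.0], x))
    return sigmoid(theta @ x_with_bias)


# Hypothetical weights (theta_0 is the bias weight) and inputs x_1..x_3.
theta = np.array([-1.0, 2.0, 0.5, -0.5])
x = np.array([1.0, 0.0, 2.0])

# Here theta^T x = -1 + 2 + 0 - 1 = 0, so the output is g(0) = 0.5.
h = neuron_output(theta, x)
print(h)
```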

- Below we have a group of neurons strung together

- First layer is the **input layer**
- Final layer is the **output layer**, which produces the value computed by the hypothesis
- Middle layer(s) are called the **hidden layers**
  - You don't observe the values processed in the hidden layer
  - Can have many hidden layers

#### Neural networks - notation

- $a_i^{(j)}$: activation of unit i in layer j
  - By activation, we mean the value which is computed and output by that node
- $\Theta^{(j)}$: matrix of parameters controlling the function mapping from layer j to layer j + 1
  - If the network has $s_j$ units in layer j and $s_{j+1}$ units in layer j + 1, then $\Theta^{(j)}$ has dimensions $s_{j+1} \times (s_j + 1)$
    - The number of rows is the number of units in the following layer
    - The number of columns is the number of units in the current layer + 1 (because we have to map the bias unit)
- We have to calculate the activation for each node
  - That activation depends on
    - The input(s) to the node
    - The parameters associated with that node (from the $\Theta$ matrix associated with that layer)
- Every input/activation goes to every node in the following layer
- $\Theta_{ji}^{(l)}$
  - j (first of the two subscripts) ranges from 1 to the number of units in layer l + 1, mapping to node j in layer l + 1
  - i (second of the two subscripts) ranges from 0 to the number of units in layer l, mapping from node i in layer l
  - l is the layer you're moving FROM
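
The dimension rule above, $s_{j+1} \times (s_j + 1)$, can be illustrated by building one parameter matrix per pair of adjacent layers. This is a small sketch; the layer sizes are a hypothetical example (3 inputs, 3 hidden units, 1 output), not values fixed by the notes.

```python
import numpy as np

# Hypothetical architecture: units per layer, not counting bias units.
layer_sizes = [3, 3, 1]

# One matrix per mapping from layer j to layer j + 1.
# Rows: units in the following layer. Columns: units in the current layer + 1 (bias).
thetas = [
    np.zeros((s_next, s_curr + 1))
    for s_curr, s_next in zip(layer_sizes[:-1], layer_sizes[1:])
]

shapes = [theta.shape for theta in thetas]
print(shapes)  # [(3, 4), (1, 4)]
```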

### Model Representation II

Define some additional terms:

$$z_1^{(2)} = \Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3$$

Then $a_1^{(2)} = g(z_1^{(2)})$. Similarly, we define $z_2^{(2)}$ and $z_3^{(2)}$; these values are just linear combinations of the input values.

Define $z^{(2)}$ as the vector of z values from the second layer; it is a 3 x 1 vector. We can vectorize the computation of the neural network in two steps:

- $z^{(2)} = \Theta^{(1)}x$
- $a^{(2)} = g(z^{(2)})$

This process is also called **forward propagation**

- Start off with the activations of the input units
  - i.e. the x vector as input
- Forward propagate and calculate the activation of each layer sequentially
- This is a vectorized version of this implementation
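
Forward propagation as described above can be sketched with NumPy. This is an illustrative implementation under stated assumptions: the function name `forward_propagate` is made up, the weights are random placeholders, and the bias unit is prepended at every layer before multiplying by that layer's parameter matrix.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def forward_propagate(thetas, x):
    """Compute the output activations layer by layer.

    thetas: list of matrices, thetas[j] has shape (s_next, s_curr + 1).
    x: input features without the bias unit.
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add bias unit a_0 = 1
        z = theta @ a                   # z = Theta * a (linear combination)
        a = sigmoid(z)                  # a = g(z)
    return a


# Hypothetical 3-3-1 network with random placeholder weights.
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))
theta2 = rng.normal(size=(1, 4))

h = forward_propagate([theta1, theta2], [1.0, 0.5, -2.0])
print(h)  # a single value strictly between 0 and 1
```

With all weights set to zero every unit outputs $g(0) = 0.5$, which makes a convenient sanity check.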

#### Neural networks learning its own features

Diagram below looks a lot like logistic regression

Layer 3 is a logistic regression node

- The only difference is that, instead of taking the raw feature vector as input, it takes values calculated by the hidden layer

- The features $a_1^{(2)}$, $a_2^{(2)}$, and $a_3^{(2)}$ are calculated/learned, not original features
- A neural network can learn its own features to feed into logistic regression
- Depending on the $\Theta^{(1)}$ parameters you can learn some interesting things
  - Flexibility to learn whatever features it wants to feed into the final logistic regression calculation
    - With plain logistic regression, you would have to design your own features to define the best way to classify or describe something
    - Here, we're letting the hidden layers do that, so we feed the hidden layers our input values, and let them learn whatever gives the best final result to feed into the final output layer
- As well as the networks already seen, other architectures (topology) are possible
- More/fewer nodes per layer
- More layers

## Applications

### Examples and Intuitions I

Non-linear classification: XOR/XNOR

- $x_1$, $x_2$ are binary
- Example on the right shows a simplified version of the more complex problem we're dealing with (on the left)

#### Neural Network example 1: AND function

- Can we get a one-unit neural network to compute this logical AND function?
- Add a bias unit
- Add some weights for the network
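
A one-unit network for AND can be sketched as follows. The weights (-30 for the bias, 20 for each input) are an assumption chosen for illustration, since the notes above do not state specific values; any weights that make the sum positive only when both inputs are 1 would work.

```python
import math


def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def and_unit(x1, x2):
    """Single logistic unit with assumed weights: bias -30, inputs +20 each."""
    return sigmoid(-30 + 20 * x1 + 20 * x2)


# Round each output to 0 or 1 to read off the truth table.
and_table = {(x1, x2): round(and_unit(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
print(and_table)
```

Only the input (1, 1) gives a positive weighted sum (+10), so only that row rounds to 1.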

#### Neural Network example 2: OR function
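
A minimal sketch of a one-unit network computing logical OR. The weights (-10 for the bias, 20 for each input) are assumed for illustration and are not given in these notes.

```python
import math


def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def or_unit(x1, x2):
    """Single logistic unit with assumed weights: bias -10, inputs +20 each."""
    return sigmoid(-10 + 20 * x1 + 20 * x2)


# Round each output to 0 or 1 to read off the truth table.
or_table = {(x1, x2): round(or_unit(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
print(or_table)
```

The weighted sum is negative only when both inputs are 0, so every other row rounds to 1.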

### Examples and Intuitions II

#### Neural Network example 3: NOT function
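
A minimal sketch of a one-input unit computing logical NOT. The weights (+10 for the bias, -20 for the input) are assumed for illustration; the idea is simply to put a large negative weight on the input being negated.

```python
import math


def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def not_unit(x1):
    """Single logistic unit with assumed weights: bias +10, input -20."""
    return sigmoid(10 - 20 * x1)


# Round each output to 0 or 1 to read off the truth table.
not_table = {x1: round(not_unit(x1)) for x1 in (0, 1)}
print(not_table)
```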

#### Neural Network example 4: XNOR function

XNOR is short for NOT XOR
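
XNOR cannot be computed by a single logistic unit, but it can be built from simpler units using one hidden layer. The sketch below is one possible construction, with all weights assumed for illustration: one hidden unit computes $x_1$ AND $x_2$, the other computes (NOT $x_1$) AND (NOT $x_2$), and the output unit ORs them together.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


# Assumed weights; the first column of each matrix multiplies the bias unit.
THETA1 = np.array([
    [-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
    [10.0, -20.0, -20.0],   # hidden unit 2: (NOT x1) AND (NOT x2)
])
THETA2 = np.array([[-10.0, 20.0, 20.0]])  # output unit: a1 OR a2


def xnor(x1, x2):
    """Forward propagate through the two-layer network."""
    a1 = np.array([1.0, x1, x2])              # input layer with bias
    a2 = sigmoid(THETA1 @ a1)                 # hidden activations
    a2 = np.concatenate(([1.0], a2))          # add bias to hidden layer
    return sigmoid(THETA2 @ a2)[0]            # output activation


# Round each output to 0 or 1 to read off the truth table.
xnor_table = {
    (x1, x2): int(round(xnor(x1, x2))) for x1 in (0, 1) for x2 in (0, 1)
}
print(xnor_table)
```

The output is 1 exactly when the two inputs are equal, which is the XNOR truth table.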