Intuitively understanding derivatives
A derivative tells us how sensitive a function's output is to a small change in its input. That's the whole idea. The rest of this post is just unpacking what that means and how we calculate it.
Let's assume:
.
If we plot this function, we'll get some sort of a parabola.
So what is a derivative and what is it telling us about the function?
The general definition is:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
What this basically tells us is how a change in $x$ affects the function's output. More accurately, the derivative measures how sensitive the output is to a tiny change in the input.
I mentioned the parabola earlier because it helps to picture this: the derivative is the slope of the curve at a point. A slope carries both a direction (is the function rising or falling?) and an intensity (how steeply?), which is exactly what a rate of change is.
So we can pick a number $x$ and a function $f(x)$ in which $x$ appears as a term, and see how a change in $x$ relates to a change in the output of $f$.
Let's say .
Let's take a small step forwards, $h$. If $x$ goes up by $h$, by how much does $f(x)$ change?
So let's test it out:
So our difference is:
To get the rate of change, which is how much the output moved per unit of input movement, we divide by $h$:
$$\frac{f(x + h) - f(x)}{h} \approx 8.003$$
Meaning, around our chosen $x$, for a very small increase in $x$, the output increases by roughly $8.003$ times that increase. So if $x$ goes up by $h$, the expected change in the output is roughly $8.003 \cdot h$.
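The arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the quadratic `f`, the point `x = 1.0`, and the step `h = 0.001` are illustrative assumptions, picked because they give a rate of change of about 8.003. They are not necessarily the values used in this post.

```python
def f(x):
    # Hypothetical quadratic, assumed for illustration only.
    return 3 * x**2 + 2 * x + 1

x = 1.0    # assumed input point
h = 0.001  # assumed small step

difference = f(x + h) - f(x)     # how much the output moved
rate_of_change = difference / h  # output movement per unit of input movement

print(f"difference     = {difference:.6f}")
print(f"rate of change = {rate_of_change:.3f}")
```

Swapping in a different `f`, `x`, or `h` changes the numbers, but the two-step recipe (take the difference, divide by the step) stays the same.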
You've probably seen another way of calculating this, where you apply some rules to the formula for $f(x)$ and produce an expression for its derivative directly. That's really the same idea, just done by letting $h$ approach $0$ in general instead of picking a specific small value. Let's work through that too.
Let's now see what happens as $h$ approaches $0$, instead of choosing a specific small value for it.
We can substitute our specific function into the general definition.
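Let's first expand $f(x + h)$:
Now we compute $f(x + h) - f(x)$. The terms that don't involve $h$ appear in both, so they cancel, and we are left with:
Dividing each term by $h$ gives:
As $h$ approaches $0$, the term that still contains $h$ vanishes, leaving us with the derivative, $f'(x)$.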
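The same steps can be written out once for any quadratic. In the sketch below, $a$, $b$, and $c$ are placeholder coefficients (an assumption, standing in for whichever specific numbers the function uses):

$$
\begin{aligned}
f(x) &= ax^2 + bx + c \\
f(x + h) &= a(x + h)^2 + b(x + h) + c = ax^2 + 2axh + ah^2 + bx + bh + c \\
f(x + h) - f(x) &= 2axh + ah^2 + bh \\
\frac{f(x + h) - f(x)}{h} &= 2ax + ah + b \\
\lim_{h \to 0} \frac{f(x + h) - f(x)}{h} &= 2ax + b
\end{aligned}
$$

After the division, $ah$ is the only term that still contains $h$, which is why it is the one that disappears in the limit.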
Plugging in our chosen $x$ gives a single number. That's the exact derivative at that point, and it's very close to our earlier approximation of $8.003$!
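To watch the approximation close in on the exact value, one can shrink the step and recompute the quotient each time. A minimal sketch, assuming a hypothetical quadratic `f` and input `x = 1.0` (neither is taken from this post), for which the exact derivative at that point is 8:

```python
def f(x):
    # Hypothetical quadratic, assumed for illustration only.
    return 3 * x**2 + 2 * x + 1

def difference_quotient(func, x, h):
    # Forward difference: (f(x + h) - f(x)) / h
    return (func(x + h) - func(x)) / h

x = 1.0  # assumed input point
steps = [0.1, 0.01, 0.001, 0.0001]
quotients = [difference_quotient(f, x, h) for h in steps]

for h, q in zip(steps, quotients):
    print(f"h = {h:<7} quotient = {q:.6f}")
```

For this particular `f`, each quotient works out to 8 plus three times the step (up to floating-point rounding), so the gap to the exact value shrinks in proportion to `h`.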
And $f'(x)$ is a function in its own right. Plug in any $x$ and it tells you the rate of change of $f$ at that point, so different inputs can give different slopes. The derivative of a function with respect to a variable is itself a function, telling us how sensitive the original function's output is to changes in that variable.
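The "function in, function out" idea maps naturally onto code. The sketch below is an illustration under assumptions: `derivative` is a hypothetical helper (not a library function) that approximates the derivative with a forward difference, and `f` is an assumed quadratic.

```python
def derivative(func, h=1e-6):
    # Returns a NEW function that approximates func's derivative,
    # using the forward difference quotient with a fixed small step h.
    def d_func(x):
        return (func(x + h) - func(x)) / h
    return d_func

def f(x):
    # Hypothetical quadratic, assumed for illustration only.
    return 3 * x**2 + 2 * x + 1

f_prime = derivative(f)  # f_prime is itself callable

for x in [0.0, 1.0, 2.0]:
    print(f"x = {x}: slope is about {f_prime(x):.3f}")
```

Because `f_prime` is an ordinary function, it can be evaluated at any input, plotted, or passed to other code just like `f`.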
This is exactly the tool we need to train a neural network. The network is a function of its weights, the loss is a function of the network's output, so the loss is ultimately a function of every weight. For each weight, the derivative tells us how nudging it would move the loss, and gradient descent uses that to step every weight in the direction that reduces the loss. See "What are neural networks and how do they work" for the surrounding context.
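As a rough illustration of that last step, here is a toy gradient descent loop. Everything in it is an assumption made for the example: a "network" with a single weight `w`, one training pair, a squared-error loss, a numerically estimated derivative, and a learning rate of 0.1. Real networks have many weights and compute derivatives with backpropagation rather than finite differences.

```python
def loss(w, x=2.0, target=6.0):
    # Squared error of a one-weight "network": prediction = w * x
    prediction = w * x
    return (prediction - target) ** 2

def d_loss_dw(w, h=1e-6):
    # Numerical estimate of how the loss changes when w is nudged by h
    return (loss(w + h) - loss(w)) / h

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # assumed step size

for step in range(50):
    # Step against the slope: if the loss rises with w, decrease w, and vice versa
    w -= learning_rate * d_loss_dw(w)

print(f"w = {w:.4f}, loss = {loss(w):.8f}")
```

The weight settles near 3, the value at which `w * 2.0` matches the target of 6.0 and the loss is close to zero.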