Intuitively understanding derivatives

A derivative tells us how sensitive a function's output is to a small change in its input. That's the whole idea. The rest of this post is just unpacking what that means and how we calculate it.

Let's assume:

$$f(x) = 3x^2 - 4x + 5$$

For example, $f(3) = 3 \cdot 3^2 - 4 \cdot 3 + 5 = 20$.

If we plot this function, we get a parabola that opens upward.

So what is a derivative and what is it telling us about the function?

The general definition is:

$$L = \lim_{h \to 0} \frac{f(X+h) - f(X)}{h}$$

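This tells us how a change in $X$ affects the function's output. More precisely, the derivative measures how sensitive the output is to a tiny change in the input: it is the ratio of the change in output to the change in input, in the limit where the input change $h$ shrinks toward $0$.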

I mentioned the parabola earlier because it helps to picture this: the derivative is the slope of the curve at a point. A slope carries both a direction (is the function rising or falling?) and an intensity (how steeply?), which is exactly what a rate of change is.
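To make "direction and intensity" concrete, here is a minimal Python sketch that estimates the slope of our example function at a few points. The helper name `slope_near` and the step size `h = 0.001` are my own choices for illustration, not anything standard.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def slope_near(x, h=0.001):
    # Difference quotient from the definition, with a small fixed h.
    # h = 0.001 is an arbitrary choice; this is an estimate, not the limit.
    return (f(x + h) - f(x)) / h

for x in (-1.0, 0.0, 2.0):
    s = slope_near(x)
    direction = "rising" if s > 0 else "falling"
    print(f"x = {x}: slope ~ {s:.3f} ({direction})")
```

The sign of each estimate gives the direction, and its magnitude gives the steepness.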

So we can pick a function $f$ and a specific input $X$, then see how a change in $X$ relates to the change in the output $f(X)$.

Let's say $X = 2$.

$$f(2) = 3 \cdot 2^2 - 4 \cdot 2 + 5 = 9$$

Let's take a small step forward, $h = 0.001$. If $X$ goes up by $0.001$, by how much does $f(X)$ change?

So let's test it out:

$$f(2+h) = f(2.001) = 3 \cdot 2.001^2 - 4 \cdot 2.001 + 5 = 9.008003$$

So our difference is:

$$f(2+h) - f(2) = 0.008003$$

To get the rate of change, which is how much the output moved per unit of input movement, we divide by $h$:

$$\frac{0.008003}{h} = \frac{0.008003}{0.001} = 8.003$$

Meaning, around $X = 2$, for a very small increase in $X$, the output increases by roughly $8.003$ times that increase. So with $h = 0.001$, the expected change in the output is roughly $0.008003$.
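The same arithmetic can be checked in a few lines of Python. This is just a sketch of the steps above; the variable names `change` and `rate` are mine, and the rounding is only there to hide floating-point noise.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

x = 2
h = 0.001

change = f(x + h) - f(x)   # how much the output moved
rate = change / h          # movement per unit of input movement

print(round(change, 6))    # 0.008003
print(round(rate, 3))      # 8.003
```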

You've probably seen another way of calculating this, where you apply some rules to produce an expression like $6x - 4$ from $3x^2 - 4x + 5$. That's really the same idea, just done by letting $h$ approach $0$ in general instead of picking a specific small value such as $0.001$. Let's work through that too.

We can substitute our specific function into the general definition.

Let's first expand $f(x+h)$:

$$\begin{aligned} f(x+h) &= 3(x+h)^2 - 4(x+h) + 5 \\ &= 3(x^2 + 2xh + h^2) - 4x - 4h + 5 \\ &= 3x^2 + 6xh + 3h^2 - 4x - 4h + 5 \end{aligned}$$

Now we compute $f(x+h) - f(x)$. The $3x^2$, $-4x$, and $+5$ terms appear in both, so they cancel, and we are left with:

$$f(x+h) - f(x) = 6xh + 3h^2 - 4h$$

Dividing each term by $h$ gives:

$$\frac{f(x+h) - f(x)}{h} = 6x + 3h - 4$$

As $h$ approaches $0$, the $3h$ term vanishes, leaving us with $6x - 4$.

Plugging in $x = 2$ gives $12 - 4 = 8$. That's the exact derivative at $x = 2$, and it's very close to our earlier approximation of $8.003$! The leftover $0.003$ is exactly the $3h$ term with $h = 0.001$.
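One way to watch the limit happen is to shrink $h$ step by step and see the quotient close in on $8$. The sketch below does that in Python; the particular list of step sizes is an arbitrary choice of mine.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

x = 2.0
steps = (0.1, 0.01, 0.001, 0.0001)

# Algebraically each quotient equals 6x + 3h - 4, i.e. 8 + 3h at x = 2,
# so the gap to 8 should shrink in proportion to h.
quotients = [(f(x + h) - f(x)) / h for h in steps]

for h, q in zip(steps, quotients):
    print(f"h = {h}: quotient = {q:.4f}, gap to 8 = {q - 8:.4f}")
```

Each tenfold reduction in $h$ cuts the gap by roughly a factor of ten, which is what the $3h$ term predicts.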

And $6x - 4$ is a function in its own right. Plug in any $x$ and it tells you the rate of change of $f$ at that point. At $x = 2$ it's $8$, at $x = 0$ it's $-4$, at $x = 5$ it's $26$. The derivative of a function with respect to a variable is itself a function, telling us how sensitive the original function's output is to changes in that variable.
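Written as code, the derivative is just another function you can call. In this small sketch the name `f_prime` is my own label for $6x - 4$.

```python
def f_prime(x):
    # Derivative of f(x) = 3x^2 - 4x + 5, as derived above.
    return 6 * x - 4

for x in (2, 0, 5):
    print(x, f_prime(x))
# 2 8
# 0 -4
# 5 26
```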

This is exactly the tool we need to train a neural network. The network is a function of its weights, the loss is a function of the network's output, so the loss is ultimately a function of every weight. For each weight, the derivative tells us how nudging it would move the loss, and gradient descent uses that to step every weight in the direction that reduces the loss. See "What are neural networks and how do they work" for the surrounding context.
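As a toy illustration of that last idea, the sketch below runs gradient descent on our example function, treating it as a stand-in "loss" with a single "weight" `w`. This is not a neural network: the starting point, the learning rate of `0.1`, and the 50 steps are all assumptions picked for the example.

```python
def loss(w):
    # Our example function, reused as a stand-in loss with one weight.
    return 3 * w**2 - 4 * w + 5

def d_loss(w):
    # Its derivative, 6w - 4, as derived earlier.
    return 6 * w - 4

w = 2.0               # assumed starting weight
learning_rate = 0.1   # assumed step size

for step in range(50):
    # Step against the derivative: a positive slope means decreasing w
    # lowers the loss, a negative slope means increasing w does.
    w -= learning_rate * d_loss(w)

print(w, loss(w))
```

The weight settles near $2/3$, the point where $6w - 4 = 0$ and the parabola bottoms out.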