Intuitively understanding derivatives

2026-04-28 · math · machine-learning

A derivative tells us how sensitive a function's output is to a small change in its input. That's the whole idea. The rest of this post is just unpacking what that means and how we calculate it.

Let's assume:

$f(x) = 3x^2 - 4x + 5$

For example, $f(3) = 3 \cdot 3^2 - 4 \cdot 3 + 5 = 20$.

If we plot this function, we get a parabola that opens upward.

So what is a derivative and what is it telling us about the function?

The general definition is:

$L = \lim_{h \to 0} \frac{f(X+h) - f(X)}{h}$

This tells us how a change in $X$ affects the function's output. More precisely, the derivative measures how sensitive the output is to a tiny change in the input.

I mentioned the parabola earlier because it helps to picture this: the derivative is the slope of the curve at a point. A slope carries both a direction (is the function rising or falling?) and an intensity (how steeply?), which is exactly what a rate of change is.

So we can pick a function $f$ and a specific input $X$, then see how a change in $X$ relates to the change in the output $f(X)$.

Let's say $X = 2$ .

$f(2) = 3 \cdot 2^2 - 4 \cdot 2 + 5 = 9$

Let's take a small step forwards, $h = 0.001$ . If $X$ goes up by $0.001$ , by how much does $f(X)$ change?

So let's test it out:

$f(2+h) = f(2.001) = 3 \cdot 2.001^2 - 4 \cdot 2.001 + 5 = 9.008003$

So our difference is:

$f(2+h) - f(2) = 0.008003$

To get the rate of change, which is how much the output moved per unit of input movement, we divide by $h$ :

$\frac{0.008003}{h} = \frac{0.008003}{0.001} = 8.003$

Meaning, around $X = 2$, a very small increase in $X$ increases the output by roughly $8.003$ times that increase. So for $h = 0.001$, the expected change is roughly $0.008003$.
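The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name `difference_quotient` is my own choice, not something from a library.

```python
def f(x):
    """The example function from this post: 3x^2 - 4x + 5."""
    return 3 * x**2 - 4 * x + 5


def difference_quotient(func, x, h):
    """Change in output divided by the change in input, for a step of size h."""
    return (func(x + h) - func(x)) / h


x = 2
h = 0.001

# How much the output moves per unit of input movement, near x = 2.
slope_estimate = difference_quotient(f, x, h)
print(round(slope_estimate, 6))  # approximately 8.003, up to floating-point noise
```

Floating-point arithmetic adds a tiny amount of noise, which is why the result is rounded before printing.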

You've probably seen another way of calculating this, where you apply some rules to produce an expression like $6x - 4$ from $3x^2 - 4x + 5$ . That's really the same idea, just done by letting $h$ approach $0$ in general instead of picking a specific small value. Let's work through that too.

We can substitute our specific function into the general definition and see what happens as $h$ approaches $0$.

Let's first expand $f(x+h)$ :

$\begin{aligned} f(x+h) &= 3(x+h)^2 - 4(x+h) + 5 \\ &= 3(x^2 + 2xh + h^2) - 4x - 4h + 5 \\ &= 3x^2 + 6xh + 3h^2 - 4x - 4h + 5 \end{aligned}$

Now we compute $f(x+h) - f(x)$ . The $3x^2$ , $-4x$ , and $+5$ terms appear in both, so they cancel, and we are left with:

$6xh + 3h^2 - 4h$

Dividing each term by $h$ gives:

$6x + 3h - 4$

As $h$ approaches $0$ , the $3h$ term vanishes, leaving us with $6x - 4$ .

Plugging in $x = 2$ gives $12 - 4 = 8$ . That's the exact derivative at $x = 2$ , and it's very close to our earlier approximation of $8.003$ ! The leftover $0.003$ is exactly the $3h$ term with $h = 0.001$ .
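To watch the limit happen numerically, here is a small sketch that shrinks $h$ and compares the measured quotient with the algebraic expression $6x + 3h - 4$. The particular step sizes are arbitrary choices for illustration.

```python
def f(x):
    """The example function from this post: 3x^2 - 4x + 5."""
    return 3 * x**2 - 4 * x + 5


def difference_quotient(func, x, h):
    """Change in output divided by the change in input, for a step of size h."""
    return (func(x + h) - func(x)) / h


x = 2
step_sizes = [0.1, 0.01, 0.001, 0.0001]  # arbitrary, just getting smaller
estimates = []

for h in step_sizes:
    estimate = difference_quotient(f, x, h)
    estimates.append(estimate)
    # The algebra says the quotient equals 6x + 3h - 4 before taking the limit,
    # so the two columns should agree up to floating-point noise.
    print(f"h={h:<7} measured={estimate:.6f}  6x+3h-4={6 * x + 3 * h - 4:.6f}")
```

As $h$ shrinks, both columns close in on $8$, and the gap to $8$ is the $3h$ term.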

And $6x - 4$ is a function in its own right. Plug in any $x$ and it tells you the rate of change of $f$ at that point. At $x = 2$ it's $8$ , at $x = 0$ it's $-4$ , at $x = 5$ it's $26$ . The derivative of a function with respect to a variable is itself a function, telling us how sensitive the original function's output is to changes in that variable.
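Since the derivative is itself a function, we can write it as one. A minimal sketch, where `f_prime` is just my name for $6x - 4$ and the tiny step `1e-6` is an arbitrary choice for the numerical check:

```python
def f(x):
    """The example function from this post: 3x^2 - 4x + 5."""
    return 3 * x**2 - 4 * x + 5


def f_prime(x):
    """The derivative worked out above: 6x - 4."""
    return 6 * x - 4


# Rate of change of f at the three points mentioned in the text.
slopes = {x: f_prime(x) for x in [0, 2, 5]}
print(slopes)  # {0: -4, 2: 8, 5: 26}

# Sanity check: a difference quotient with a tiny step should land near each slope.
tiny_step = 1e-6
for x, slope in slopes.items():
    estimate = (f(x + tiny_step) - f(x)) / tiny_step
    print(f"x={x}: exact={slope}, estimated={estimate:.4f}")
```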

This is exactly the tool we need to train a neural network. The network is a function of its weights, and the loss is a function of the network's output, so the loss is ultimately a function of every weight. For each weight, the derivative tells us how nudging it would move the loss, and gradient descent uses that to step every weight in the direction that reduces the loss. See "What are neural networks and how do they work" for the surrounding context.
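To make the gradient descent idea concrete, here is a toy sketch that reuses our parabola as a stand-in "loss" with a single "weight" `w`. This is not a real neural network: the starting point, the learning rate of `0.1`, and the 50 steps are all assumptions picked for illustration.

```python
def loss(w):
    """Stand-in loss: the same parabola as before, 3w^2 - 4w + 5."""
    return 3 * w**2 - 4 * w + 5


def loss_gradient(w):
    """Derivative of the stand-in loss with respect to w: 6w - 4."""
    return 6 * w - 4


w = 2.0              # assumed starting weight
learning_rate = 0.1  # assumed step size
num_steps = 50       # assumed number of updates

for step in range(num_steps):
    # A positive slope means increasing w raises the loss, so we step the other way.
    w = w - learning_rate * loss_gradient(w)

print(f"w={w:.4f}, loss={loss(w):.4f}")
```

The weight settles near $w = 2/3$, which is where $6w - 4 = 0$: the slope is flat there, so there is no direction left that lowers the loss.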