The jumble of rules for taking derivatives never truly clicked for me. The addition rule, product rule, quotient rule -- how do they fit together? What are we even trying to *do*?

Here's my take on derivatives:

- We have a system to analyze, our function f
- The derivative f' (aka df/dx) is the moment-by-moment behavior
- It turns out f is part of a bigger system (h = f + g)
- Using the behavior of the parts, can we figure out the behavior of the whole?

Yes. **Every part has a "point of view" about how much change it added. Combine every point of view to get the overall behavior.** Each derivative rule is an example of merging various points of view.

And why don't we analyze the entire system at once? For the same reason you don't eat a hamburger in one bite: small parts are easier to wrap your head around.

Instead of memorizing separate rules, let's see how they fit together:

The goal is to really grok the notion of "combining perspectives". This installment covers addition, multiplication, powers and the chain rule. Onward!

## Functions: Anything, Anything But Graphs

The default calculus explanation writes "f(x) = x^2" and shoves a graph in your face. Does this really help our intuition?

Not for me. Graphs squash input and output into a single curve, and hide the machinery that turns one into the other. But the derivative rules are *about* the machinery, so let's see it!

I visualize a function as the process "input(x) => f => output(y)".

It's not just me. Check out this incredible mechanical targeting computer (the beginning of a YouTube series).

The machine computes functions like addition and multiplication with gears -- you can *see the mechanics* unfolding!

Think of function f as a machine with an input lever "x" and an output lever "y". As we adjust x, f sets the height for y. Another analogy: x is the input signal, f receives it, does some magic, and spits out signal y. Use whatever analogy helps it click.

## Wiggle Wiggle Wiggle

The derivative is the "moment-by-moment" behavior of the function. What does that mean? (And don't mindlessly mumble "The derivative is the slope". *See any graphs around these parts, fella?*)

The derivative is how much we wiggle. The lever is at x, we "wiggle" it, and see how y changes. "Oh, we moved the input lever 1mm, and the output moved 5mm. Interesting."

The result can be written "output wiggle per input wiggle" or "dy/dx" (5mm / 1mm = 5, in our case). This is usually a formula, not a static value, because it can depend on your current input setting.

For example, when f(x) = x^2, the derivative is 2x. Yep, you've memorized that. What does it mean?

If our input lever is at x = 10 and we wiggle it slightly (moving it by dx=0.1 to 10.1), the output should change by dy. How much, exactly?

- We know f'(x) = dy/dx = 2 * x
- At x = 10 the "output wiggle per input wiggle" is = 2 * 10 = 20. The output moves 20 units for every unit of input movement.
- If dx = 0.1, then dy = 20 * dx = 20 * .1 = 2

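And indeed, the difference between 10^2 = 100 and (10.1)^2 = 102.01 is about 2. The derivative estimated how far the output lever would move: a perfect, infinitely small wiggle moves the output at exactly 20 units per unit of input, which predicts 2 units for our 0.1 step; we actually moved 2.01.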
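To watch the estimate tighten as the wiggle shrinks, here's a minimal Python sketch. The helper name `output_wiggle` and the list of step sizes are made up for illustration; the only thing taken from the text is f(x) = x^2 at x = 10.

```python
def f(x):
    """The squaring machine: input lever x, output lever y."""
    return x ** 2


def output_wiggle(func, x, dx):
    """How far the output lever moves when the input moves from x to x + dx."""
    return func(x + dx) - func(x)


x = 10
predicted_rate = 2 * x  # the derivative of x^2, evaluated at x = 10

for dx in (0.1, 0.01, 0.001):
    dy = output_wiggle(f, x, dx)
    # dy / dx drifts toward the predicted rate as the wiggle shrinks
    print(f"dx={dx}: dy={dy:.6f}, dy/dx={dy / dx:.4f}, predicted={predicted_rate}")
```

The measured "output wiggle per input wiggle" is never exactly 20 for a finite step, but the gap shrinks along with dx.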

The key to understanding the derivative rules:

- Set up your system
- Wiggle each part of the system separately, see how far the output moves
- Combine the results

The total wiggle is the sum of wiggles from each part.

## Addition and Subtraction

Time for our first system:

h(x) = f(x) + g(x)

What happens when the input (x) changes?

In my head, I think "Function h takes a single input. It feeds the same input to f and g and adds the output levers. f and g wiggle independently, and don't even know about each other!"

Function f knows it will contribute some wiggle (df), g knows it will contribute some wiggle (dg), and we, the prowling overseers that we are, know their individual moment-by-moment behaviors are added:

dh = df + dg

Again, let's describe each "point of view":

- The overall system has behavior dh
- From f's perspective, it contributes df to the whole [it doesn't know about g]
- From g's perspective, it contributes dg to the whole [it doesn't know about f]

Every change to a system is due to some part changing (f and g). If we add the contributions from each possible variable, we've described the entire system.
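Here's a small Python sketch of the "add the contributions" idea. The particular parts (f squares its input, g triples it) and the step size are assumptions chosen for illustration; any two functions would do.

```python
def f(x):
    return x ** 2      # hypothetical part f


def g(x):
    return 3 * x       # hypothetical part g


def h(x):
    return f(x) + g(x)  # the whole system just adds the two output levers


x, dx = 10, 0.1

df = f(x + dx) - f(x)   # f's point of view
dg = g(x + dx) - g(x)   # g's point of view
dh = h(x + dx) - h(x)   # what the overall system actually did

print(f"df={df:.4f}, dg={dg:.4f}, df+dg={df + dg:.4f}, dh={dh:.4f}")
```

For addition, the two points of view account for the whole change (up to floating-point rounding), even for a finite wiggle.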

## df vs df/dx

Sometimes we use df, other times df/dx -- what gives? (This confused me for a while)

- **df** is a general notion of "however much f changed"
- **df/dx** is a specific notion of "however much f changed, in terms of how much x changed"

The generic "df" helps us see the overall behavior.

An analogy: Imagine you're driving cross-country and want to measure the fuel efficiency of your car. You'd measure the distance traveled, check your tank to see how much gas you used, and finally do the division to compute "miles per gallon". You measured distance and gasoline separately -- you didn't jump into the gas tank to get the rate on the go!

In calculus, sometimes we want to think about the actual change, not the ratio. Working at the "df" level gives us room to think about how the function wiggles overall. We can *eventually* scale it down in terms of a specific input.

And we'll do that now. The addition rule above can be written, on a "per dx" basis, as:

dh/dx = df/dx + dg/dx

## Multiplication (Product Rule)

Next puzzle: suppose our system multiplies parts "f" and "g". How does it behave?

Hrm, tricky -- the parts are interacting more closely. But the strategy is the same: see how each part contributes from its own point of view, and combine them:

- total change in h = f's contribution (from f's point of view) + g's contribution (from g's point of view)

Picture a rectangle with sides f and g, so its area is h, growing a thin sliver along each side:

What's going on?

- We have our system: f and g are multiplied, giving h (the area of the rectangle)
- Input "x" changes by dx off in the distance. f changes by some amount df (think absolute change, not the rate!). Similarly, g changes by its own amount dg. Because f and g changed, the area of the rectangle changes too.
- What's the area change from f's point of view? Well, f knows he changed by df, but has *no idea* what happened to g. From f's perspective, he's the only one who moved and will add a slice of area = df * g
- Similarly, g doesn't know how f changed, but knows he'll add a slice of area "dg * f"

The overall change in the system (dh) is the two slices of area:

dh = f * dg + g * df

Now, like our miles per gallon example, we "divide by dx" to write this in terms of how much x changed:

dh/dx = f * dg/dx + g * df/dx

(Aside: Divide by dx? Engineers will nod, mathematicians will frown. Technically, df/dx is not a fraction: it's the entire operation of taking the derivative (with the limit and all that). But infinitesimal-wise, intuition-wise, we are "scaling by dx". I'm a smiler.)

The key to the product rule: add two "slivers of area", one from each point of view.

**Gotcha:** But isn't there some effect from both f and g changing simultaneously (df * dg)?

Yep. However, this area is an infinitesimal * infinitesimal (a "2nd-order infinitesimal") and invisible at the current level. It's a tricky concept, but (df * dg) / dx vanishes compared to normal derivatives like df/dx. We vary f and g independently and combine the results, and ignore results from them moving together.
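To see the corner piece fade while the two slices hold steady, here's a Python sketch. The parts f(x) = x^2 and g(x) = 3x + 1, the point x = 2, and the helper name `area_pieces` are all assumptions picked for illustration.

```python
def f(x):
    return x ** 2          # hypothetical part f


def g(x):
    return 3 * x + 1       # hypothetical part g


def area_pieces(x, dx):
    """Split the change in h = f * g into f's slice, g's slice, and the corner."""
    df = f(x + dx) - f(x)
    dg = g(x + dx) - g(x)
    f_slice = df * g(x)    # f's point of view: only f moved
    g_slice = dg * f(x)    # g's point of view: only g moved
    corner = df * dg       # both moved at once (the 2nd-order piece)
    return f_slice, g_slice, corner


x = 2.0
for dx in (0.1, 0.01, 0.001):
    f_slice, g_slice, corner = area_pieces(x, dx)
    print(f"dx={dx}: slices/dx={(f_slice + g_slice) / dx:.4f}, "
          f"corner/dx={corner / dx:.4f}")
```

On a "per dx" basis the two slices settle near 40 (the product rule's answer for these parts at x = 2), while the corner's share keeps shrinking toward zero.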

## The Chain Rule: It's Not So Bad

Let's say g depends on f, which depends on x:

y = g(f(x))

The chain rule lets us "zoom into" a function and see how an initial change (x) can affect the final result down the line (g).

**Interpretation 1: Convert the rates**

A common interpretation is to multiply the rates:

dg/dx = dg/df * df/dx

x wiggles f. This creates a rate of change of df/dx, which wiggles g by dg/df. The entire wiggle is then:

dg/df * df/dx

This is similar to the "factor-label" method in chemistry class:

(miles / second) * (60 seconds / 1 minute) * (60 minutes / 1 hour) = miles / hour

If your "miles per second" rate changes, multiply by the conversion factor to get the new "miles per hour". The second doesn't know about the hour directly -- it goes through the second => minute conversion.

Similarly, g doesn't know about x directly, only f. Function g knows it should scale its input by dg/df to get the output. The initial rate (df/dx) gets modified as it moves up the chain.
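The conversion chain can be sketched in a few lines of Python. The starting speed of 0.01 miles per second is a made-up value; the point is that each stage only knows its own conversion factor.

```python
SECONDS_PER_MINUTE = 60
MINUTES_PER_HOUR = 60

miles_per_second = 0.01  # hypothetical starting rate

# Each step scales the incoming rate by its own factor, like one link in the chain
miles_per_minute = miles_per_second * SECONDS_PER_MINUTE
miles_per_hour = miles_per_minute * MINUTES_PER_HOUR

print(f"{miles_per_second} mi/s -> {miles_per_minute:.2f} mi/min -> {miles_per_hour:.2f} mi/h")
```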

**Interpretation 2: Convert the wiggle**

I prefer to see the chain rule on the "per-wiggle" basis:

- x wiggles by dx, so
- f wiggles by df, so
- g wiggles by dg

Cool. But how are they actually related? Oh yeah, the derivative! (It's the output wiggle per input wiggle):

df = (df/dx) * dx

Remember, the derivative of f (df/dx) is how much to scale the initial wiggle. And the same happens to g:

dg = (dg/df) * df

It will scale whatever wiggle comes along its input lever (f) by dg/df. If we write the df wiggle in terms of dx:

dg = (dg/df) * (df/dx) * dx

We have another version of the chain rule: dx starts the chain, which results in some final result dg. If we want the final wiggle in terms of dx, divide both sides by dx:

dg/dx = (dg/df) * (df/dx)

The chain rule isn't just factor-label unit cancellation -- it's the propagation of a wiggle, which gets adjusted at each step.

The chain rule works for several variables (a depends on b, which depends on c): just propagate the wiggle as you go.

Try to imagine "zooming into" each variable's point of view. Starting from dx and looking up, you see the entire chain of transformations needed before the impulse reaches g.

## Chain Rule: Example Time

Let's say we put a "squaring machine" in front of a "cubing machine":

input(x) => f:x^2 => g:f^3 => output(y)

f:x^2 means f squares its input. g:f^3 means g cubes its input, the value of f. For example:

input(2) => f(2) => g(4) => output:64

Start with 2, f squares it (2^2 = 4), and g cubes this (4^3 = 64). It's a 6th power machine:

(x^2)^3 = x^6

And what's the derivative?

- f changes its input wiggle by df/dx = 2x
- g changes its input wiggle by dg/df = 3f^2

The final change is:

dg/dx = dg/df * df/dx = 3f^2 * 2x = 3(x^2)^2 * 2x = 6x^5
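As a numeric check on the squaring-then-cubing machine, here's a Python sketch that pushes a small wiggle through both stages. Multiplying the two scaling factors gives 3f^2 * 2x, which is 6x^5 once f is rewritten as x^2. The helper names (`df_dx`, `dg_df`, `propagate_wiggle`) and the sample values x = 2, dx = 0.001 are assumptions for illustration.

```python
def f(x):
    return x ** 2            # squaring machine


def g(f_value):
    return f_value ** 3      # cubing machine: treats its input as a single blob


def df_dx(x):
    return 2 * x             # how f scales an incoming wiggle


def dg_df(f_value):
    return 3 * f_value ** 2  # how g scales an incoming wiggle


def propagate_wiggle(x, dx):
    """Pass a wiggle dx through f, then through g, scaling at each step."""
    df = df_dx(x) * dx
    dg = dg_df(f(x)) * df
    return dg


x, dx = 2, 0.001
estimated_dg = propagate_wiggle(x, dx)
actual_dg = g(f(x + dx)) - g(f(x))

print(f"estimated dg={estimated_dg:.6f}, actual dg={actual_dg:.6f}")
print(f"estimated dg/dx={estimated_dg / dx:.4f}, 6x^5={6 * x ** 5}")
```

The estimate and the measured change agree closely for a small dx, and the per-dx rate matches 6x^5 at the chosen x.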

## Chain Rule: Gotchas

**Functions treat their inputs like a blob**

In the example, g's derivative (the derivative of x^3 is 3x^2) doesn't refer to the original "x", just whatever the input was (the derivative of foo^3 is 3*foo^2). The input was f, and it treats f as a single value. Later on, we scurry in and rewrite f in terms of x. But g has no involvement with that -- it doesn't care that f can be rewritten in terms of smaller pieces.

**In many examples, the variable "x" is the "end of the line".**

Questions ask for df/dx, i.e. "Give me changes from x's point of view". Now, x could depend on some deeper variable, but that's not being asked for. It's like saying "I want miles per hour. I don't care about miles per minute or miles per second. Just give me miles per hour". df/dx means "stop looking at inputs once you get to x".

**How come we multiply derivatives with the chain rule, but add them for the others?**

The regular rules are about *combining points of view* to get an overall picture. What change does f see? What change does g see? Add them up for the total.

The chain rule is about going deeper into a single part (like f) and seeing if it's controlled by another variable. It's like looking inside a clock and saying "Hey, the minute hand is controlled by the second hand!". We're staying inside the same part.

Sure, eventually this "per-second" perspective of f could be added to some perspective from g. Great. But the chain rule is about diving deeper into "f's" root causes.

## Power Rule: Oft Memorized, Seldom Understood

What's the derivative of x^4? 4x^3? Great. You brought down the exponent and subtracted one. Now explain why!

Hrm. There are a few approaches, but here's my new favorite: x^4 is really x * x * x * x. It's the multiplication of 4 "independent" variables. Each x doesn't know about the others; it might as well be x * u * v * w.

Now think about the first x's point of view:

- It changes from x to x + dx
- The change in the overall function is [(x + dx) - x][u * v * w] = dx[u * v * w]
- The change on a "per dx" basis is [u * v * w]

Similarly,

- From u's point of view, it changes by du. It contributes (du/dx)*[x * v * w] on a "per dx" basis
- v contributes (dv/dx) * [x * u * w]
- w contributes (dw/dx) * [x * u * v]

The curtain is unveiled: x, u, v, and w are the same! The "point of view" conversion factor is 1 (du/dx = dv/dx = dw/dx = dx/dx = 1), and the total change (per dx) is:

[u * v * w] + [x * v * w] + [x * u * w] + [x * u * v] = x^3 + x^3 + x^3 + x^3 = 4x^3

In a sentence: the derivative of x^4 is 4x^3 because x^4 has four identical "points of view" which are being combined. Booyeah!
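The four points of view can be tallied in a short Python sketch. The helper name `per_dx_contributions`, the sample point x = 3, and the step size are assumptions for illustration; the final lines compare the tally against a directly measured wiggle.

```python
def per_dx_contributions(x, u, v, w):
    """Each factor's contribution to the change in x * u * v * w, per unit of its own wiggle."""
    return [u * v * w, x * v * w, x * u * w, x * u * v]


x = 3.0

# All four "independent" variables are secretly the same x
contributions = per_dx_contributions(x, x, x, x)
total = sum(contributions)

# Compare against a directly measured wiggle of x^4
dx = 1e-6
measured = ((x + dx) ** 4 - x ** 4) / dx

print(f"contributions={contributions}, total={total}, 4x^3={4 * x ** 3}")
print(f"measured dy/dx={measured:.4f}")
```

Each point of view contributes x^3, and the four of them together give 4x^3, which the measured rate approaches for a small dx.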

## Take A Breather

I hope you're seeing the derivative in a new light: we have a system of parts, we wiggle our input and see how the whole thing moves. It's about combining perspectives: what does each part add to the whole?

In the follow-up article, we'll look at even more powerful rules (exponents, quotients, and friends). Happy math.

43 Comments on "How To Understand Derivatives: The Product, Power & Chain Rules"

I finally understand the product rule! That was amazing.

@Alex: Awesome! I’m the same way, the product rule never clicked until it was visualized as area.

You’ve come a long way since you left us, haven’t you, Kalid?

Kalid, you are the embodiment of enlightenment. You have been such an inspiration on my journey through math and I cannot thank you enough. Please keep doing what you’re doing!

@Bill: It’s a big, bright world out there!

@N: Wow, thank you for the heartfelt comment! I appreciate the encouragement, I plan to keep cranking :).

Awesome stuff as usual! I love this website it is more entertainment than education for me. You get to learn how interesting things work without the effort.

Thanks Denis, awesome to hear. I think any subject can be naturally entertaining, as long as we’re focused on really building our intuition for it (otherwise, you’re right, it feels like a lot of effort).

Great explanation. Especially about the chain rule. Thanks a lot.

@Hitoshi: Thanks, glad it helped!

Great as usual. In the spirit of Deming I offer this. The f'(df/dx) in the first two lines will confuse some…perhaps f’ (aka df/dx).

Keep figuring it all out and sharing with the rest of us.

Thanks Mark — great suggestion, just updated :).

Awesome article. The description of the product rule really changed how I think about them.

Out of curiosity, how do you think your idea of the power rule extends to negative, fractional and irrational powers? It’s a bit harder to think about since you can’t just split them into linear parts.

Hi Gourav, thanks for the note. Great question about the negative, fractional and irrational powers. To follow the analogy, we could use the chain rule; suppose we have f(x) = x^-3. See x^-3 as shorthand for 1/x^3. We can do:

d/dx x^-3 = d/dx 1/x^3 = d/dx 1/u = -1/u^2 * du/dx

du/dx can be understood intuitively (3x^2), and we divide it by (x^3)^2. We can see the x powers fight it out as (x-1) – 2x = -x – 1 [The (x-1) power is from du/dx, and -2x is from 1/u^2. With x=3, get -3 – 1 = -4 as the power]. Notice how we still brought down the “3” (which was in du/dx). Hope this part made sense.

Once we get to fractional and irrational powers, it’s probably easiest to rewrite things in terms of e: x^3.4 = e^[ln(x)*3.4]. From here, using the chain rule, product rule, and exponent rule (to be explained next time), we can get the result. Essentially, even a complex idea like a fractional exponent can be further broken down. It’s something I’d like to write more about, since it’s helping to really test my intuition :).

I found the following website useful for understanding the product rule using what I already know.

http://woobiola.net/math/calc2b.htm

Nice post Kalid. I’ve spent the last couple of hours trying to develop this ‘machine-like’ intuition. Any chance that you could also post some examples on the quotient rule? I’ve been trying to work it out on my own but haven’t managed to get there. Honestly this is slightly worrying. I feel that if I truly understood what you are saying then the quotient rule should be no big deal. Thanks.

@John: Thanks for the comment. Intuitively, the quotient rule can be seen as a variation of the product rule since division is a variation of multiplication (in my head, “multiplying by a quantity that is getting smaller”). So, the quotient rule should look a lot like the product rule (two “slices” to take into account), but one of the slices is a shrinking one. I’ll be posting a follow-up soon.

And great gut-check by the way. If a concept isn’t clicking deep down it means there’s more intuition to build (and, probably, the explanation can use some refinement :) ).

@Jisoo: Great, I like the simple diagrams!

First of all, great initiative and material. Loved the way u analyse things. I read this page a couple of months ago.

Recently I also read about binomial series and somehow I was able to narrate how the power rule was actually derived. So here it goes.

Let’s take the simplest function y = x^2. Now what do we mean by derivative? It is simply the change in the output when we tweak the input a little.

Now let’s take two numbers, x and x+1. Now I want to find out how much y changes when we change x.

Change = (x+1)^2 – x^2

Conventional calculus tells us that it is 2*x.

But the actual value can be obtained by using binomial theorem.

We all know (a+b)^2 formula, = a^2 + b^2 + 2*a*b

Now (x+1)^2 = x^2 + 1 + 2*x

Change = x^2 + 1 + 2*x – x^2 = 2*x + 1

Haha, we have arrived at the answer. The calculus value and the actual value differ by 1. To remove that, we apply the rule that the change in input is very small compared to the input value: x>>1.

Applying the above, we can approximate 2*x + 1 as 2*x.

In the same way, I applied this to x^3, and the difference is 1 + 3*x*1*(x+1), which can be approximated as 1 + 3*x*(x), which reduces to 3*x^2 since x>>1.

In general, for x^n, we have n+1 terms in the series. Of those, we omit all powers of x up to n-2. We take only the x^n and x^(n-1) terms. The coefficient of x^(n-1) is n, and hence the power rule is given as

(d/dx) of x^n = n*(x^(n-1)).

Thanks to the admin for invoking an interest in me to solve this. Hope it helps.

@Phoenix: Thanks for the comment. Yep, that’s the essence of it — to get more particular, turn the 1 into “dx” (the amount of change, so it’s (x+dx)^2), then do the binomial theorem and throw away the dx at the end (i.e., assume your change was “perfect”).

We know f'(x) = dy/dx = 2 * x

At x = 10 the “output wiggle per input wiggle” is = 2 * 10 = 20. The output moves 20 units for every unit of input movement.

If dx = 0.1, then dy = 20 * dx = 20 * .1 = 2

Umm what?

If F(x) = x^2 and x=10 then the result of that would be 100, not 20.

If 2*10=2 then the output would move 2 units for every unit of input movement, not 10, this doesn’t make any sense at all.

That said, I speak as someone who passed a college level Calc course but never understood a lick of anything I was doing (dunno how that’s possible). Simply memorizing everything, I just remember an overwhelming urge to bash my brains out on my school desk. Actually getting that same urge now.

Crazy ehh?

If our input lever is at x = 10 and we wiggle it slightly (moving it by dx=0.1 to 10.1), the output should change by dy. How much, exactly?

We know f'(x) = dy/dx = 2 * x

At x = 10 the “output wiggle per input wiggle” is = 2 * 10 = 20. The output moves 20 units for every unit of input movement.

If dx = 0.1, then dy = 20 * dx = 20 * .1 = 2

To be clear, let me explain what I’m confused with. “The output moves 20 units for every unit of input movement.” What are we calling a unit? An integer? A dozen? twenty? (.1? 1? 20?)

The other thing that confuses me is that you go from dy/dx = 5 to a whole different formula/equation. Consistency would help those not as literate in mathematics out (like myself) quite a bit.

What’s the payoff in learning all of this… the bank account metaphor was insightful, and population models might serve as a good example, but also remember that learning is different for each individual depending on what resides in their subconscious…

@andy: The payoff is understanding something you didn’t before! :) Yep, if the analogy works for you, it works.

Thank you! I really wanted to understand how and why these rules work, not just how to apply them :)

Kalid,

Thanks for your extraordinarily simple explanations of calc! I’m currently a sophomore in high school, and I could have just waited until next year to take the class, but I’ve wanted to learn for too long already! It’s amazing how simple and easy the math of change is! Now I get to make the semi-intelligent juniors feel dumb for someone a grade below them knowing more about the subject than they do…

To use calculus on any changing system, is it mandatory, that the system “MUST” follow a particular rule of change.

For example, when some system is changing by ratio of 1:2, then one can find out the change as 2:4 or 100:200 finally. What is the rule of change in calculus ? If not, will calculus be able to find an accurate answer every time ?

@Matthias: Glad it helped.

@JJ: Awesome, glad you’re getting a head start! You got it, the math isn’t much more than algebra, it’s just seeing how to put the variables together.

@Vishwas: Calculus is made for instantaneous rates of change, i.e. the rate of change at a certain moment in time. As you move away from that moment, the rate of change varies and is no longer accurate [and you use integration to add-up these constantly-shifting moments].

Still does not make any sense unfortunately – including the mechanical computer video and wiggles. Perhaps it’s hopeless and I will never understand calculus despite wanting to. I was lost earlier. If f(x) = x^2 and input is 10. Wiggle of 0.1 gives wiggle of 2.01 and wiggle of 0.01 gives 0.2001. How do these relate to the derivative?

Hi, am working through the tutorials from naught, to gather an understanding of calculus and within 4 weeks when my course will start. Is there anyone who might help me excel more than is possible on my own via skype or phone? Based in Melb, 24.01.14.

Hi Silrak, there’s a full series on calculus here: http://betterexplained.com/calculus/ which might help.

why you neglect df * dg area?

if rate of change is more then error produces in the way of intuition,

i understand 95% about product rule except this part

What an intuition for the product rule! I really like it, except for neglecting df * dg.

@raju

Great question! This question, and a few others very much like it, gave me a bit of trouble until I performed a bit of mental ju jitsu to convince myself I understood it. I’ll have a crack at answering this, if it helps great, if not feel free to ignore me, I won’t be offended.

I can answer it with a bit of simple, but creative, algebra. To do so I’m going to need to explain two different ‘sizing’ operations: the integral ( ∫ ) and the differential element (d). I don’t want to throw a bunch of integration at you in trying to explain derivatives, but if you blur out the strict definitions and just look at the ideas, then ∫ and d are just two different ideas that are a kind of complement to each other.

∫ – add up a whole bunch of things

d – take a thing and cut it into a bunch of ‘itty bitty pieces’, or i.b.p.

Taking a differential element of a pizza, or d(pizza), is just shaving off a little bit, it’s lifting a pepperoni and licking the bottom then replacing it while nobody’s looking. The small amount of pepperoni grease that’s missing is so small compared to the whole size of the pizza that no one notices it’s missing.

In that vein I’m going to talk about taking some i.b.p.’s of a few things. Start with:

h(x) = f(x) * g(x)

h is the full size of h, the output

dh is an i.b.p. of the output

x is the full size of an input

dx is an i.b.p. of an input

I’m going to take a few liberties and do some ‘algebra magic’ with quantities like dx, please understand that what I’m about to insinuate is not technically allowed, and if we followed the ‘mathematically correct’ path we would perform a whole lot of weird calculus operations just to come up with the same result, see Kalid’s ‘Aside’ note about the Engineers nodding and the mathematician frowning.

I’m going to assume you followed Kalid’s logic and got to this part:

dh = f*dg + g*df

dh is just an i.b.p. of h, it is the little bit of rectangle you add on to the full size rectangle of f*g.

f*dg is the vertical sliver rectangle

g*df is the horizontal sliver rectangle

So it seems your question becomes the following: if I want to describe all of that i.b.p. of h that I’m adding I can see I need to add the two slivers, but don’t I also need to add that little square? Sure it’s teeny, but the slivers are teeny too, aren’t they?

How about we don’t disregard it, how about we add it in then see why it vanishes and the slivers are big enough to stay.

We should have:

dh = f*dg + g*df

but logic tells us we have:

dh = f*dg + g*df + (df*dg) that pesky little square

dh = f*dg + g*df + (df*dg) [now divide both sides by dx]

dh/dx = f * dg/dx + g * df/dx + (df*dg)/dx [re-arrange the () in the 3rd term]

dh/dx = f * dg/dx + g * df/dx + df * (dg/dx) [put that 3rd term right after the 2nd]

= f * dg/dx + df * (dg/dx) + g * df/dx [factor out the dg/dx]

= (f + df) (dg/dx) + g * df/dx [almost there, compare it to what we should have:]

= (f) (dg/dx) + g * df/dx

What happens to (f + df) as df gets ‘eensy weensy’? It gets really close to f. The full size of f on its own is indistinguishable from the full size of f plus an i.b.p. of f. No one sees the little bit of pepperoni grease missing compared to the full size of the pie (I live in New England, we call pizza ‘pie’ up here, we also take ‘r’s out of words and stick ’em other places they don’t belong: pahk the cah, Delter airlines).

With that pesky little square we have:

dh/dx = (f + df) (dg/dx) + g * df/dx

Without we have:

dh/dx = (f ) (dg/dx) + g * df/dx

But these two are the same!

———————————

If I haven’t confused you yet maybe this will throw you off guard (I jest, I really do want you to understand)!

Here’s another reason that little square ( df * dg ) vanishes but the slivers remain. Let’s use an analogy that all of us understand so well on an intuitive level: Boolean Algebra! (Bang head against wall now)

+ means OR

* means AND

P(a) means probability event a happens

P(a) + P(b) means probability of a or b happening

P(a) * P(b) means probability of a happening AND THEN b happening

That little square is df * dg, it’s kind of like taking a little chunk of f, then a little chunk of g. It’s like licking the pepperoni, AND THEN a little flea jumps on your tongue and licks a little grease from your tongue. You don’t notice it compared to the ‘large’ little bit of grease you got, which itself is small compared to the pie. Just df on its own is you licking the pepperoni. Just dg on its own is the flea licking your tongue. The sum df + dg is either you licking the pepperoni or the flea licking your tongue. But df * dg is licking the pepperoni AND THEN getting licked by the flea, it means comparing the flea’s very little bit of grease to the whole pie.

Hope this helps, or at least that you got the chance to laugh at Calculus for a little while.

Excelsior,

Eric V