I cringe when hearing "Math teaches you to think".
It's a well-meaning but ineffective appeal that only satisfies existing fans (see: "Reading takes you anywhere!"). What activity, from crossword puzzles to memorizing song lyrics, doesn't help you think?
Math seems different, and here's why: it's a specific, powerful vocabulary for ideas.
Imagine a cook who only knows the terms "yummy" and "yucky". He makes a bad meal. What's wrong? Hrm. There's no way to describe it! Too mild? Salty? Sweet? Sour? Cold? These specific critiques become hazy variations of the "yucky" bucket. He probably wouldn't think "Needs more umami".
Words are handholds that latch onto thoughts. You (yes, you!) think with extreme mathematical sophistication. Your common-sense understanding of quantity includes concepts refined over millennia (zero, decimals, negatives).
What we call "Math" are just the ideas we haven't yet internalized.
Let's explore our idea of quantity. It's a funny notion, and some languages only have words for one, two and many. They never thought to subdivide "many", and you never thought to refer to your East and West hands.
Our concept of numbers shapes our world. Why do ancient years go from BC to AD? We needed separate labels for "before" and "after", which weren't on a single scale.
Why did the stock market set prices in increments of 1/8 until 2000 AD? We were based on centuries-old systems. Ask a modern trader if they'd rather go back.
Why is the decimal system useful for categorization? You can always find room for a decimal between two other ones, and progressively classify an item (1, 1.3, 1.38, 1.386).
Why do we accept the idea of a vacuum, empty space? Because you understand the notion of zero. (Maybe true vacuums don't exist -- you get the theory)
Why is anti-matter or anti-gravity palatable? Because you accept that positives could have negatives that act in opposite ways.
How could the universe come from nothing? Well, how can 0 be split into 1 and -1?
Our math vocabulary shapes what we're capable of thinking about. Multiplication and division, which eluded geniuses a few thousand years ago, are now homework for grade schoolers. All because we have better ways to think about numbers.
We have decent knowledge of one noun, quantity. Imagine improving our vocabulary for structure, shape, change, and chance. (Oh, I mean, the important-sounding Algebra, Geometry, Calculus and Statistics.)
Caveman Chef Og doesn't think he needs more than yummy/yucky. But you know it'd blow his mind, and his cooking, to understand sweet/sour/salty/spicy/tangy.
We're still cavemen when thinking about new ideas, and that's why we study math.
I’ve studied probability and statistics without experiencing them. What’s the difference? What are they trying to do?
This analogy helped:
Probability is starting with an animal, and figuring out what footprints it will make.
Statistics is seeing a footprint, and guessing the animal.
Probability is straightforward: you have the bear. Measure the foot size, the leg length, and you can deduce the footprints. “Oh, Mr. Bubbles weighs 400lbs and has 3-foot legs, and will make tracks like this.” More academically: “We have a fair coin. After 10 flips, here are the possible outcomes.”
Statistics is harder. We measure the footprints and have to guess what animal it could be. A bear? A human? If we get 6 heads and 4 tails, what’re the chances of a fair coin?
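To make the two directions concrete, here is a minimal Python sketch using the coin example. The function name `prob_heads` and the 60/40 "suspect" coin are my own illustrative assumptions, and comparing likelihoods is only one simple way to guess the animal, not a full statistical procedure.

```python
from math import comb

def prob_heads(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability: we have the animal (a fair coin) and predict the footprints.
fair = prob_heads(6, 10, 0.5)

# Statistics, in one simple flavor: we have the footprints (6 heads, 4 tails)
# and ask which candidate animal explains them better.
# The 60/40 coin is a hypothetical second suspect.
biased = prob_heads(6, 10, 0.6)

print(f"fair coin makes these tracks with probability  {fair:.3f}")
print(f"60/40 coin makes these tracks with probability {biased:.3f}")
print(f"likelihood ratio (60/40 vs fair): {biased / fair:.2f}")
```

Six heads out of ten is slightly better explained by the 60/40 coin, but not by much: with so few footprints, the fair coin is still a very plausible animal.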
The Usual Suspects
Here’s how we “find the animal” with statistics:
Get the tracks. Each piece of data is a point in “connect the dots”. The more data, the clearer the shape (1 spot in connect-the-dots isn’t helpful. One data point makes it hard to find a trend.)
Measure the basic characteristics. Every footprint has a depth, width, and height. Every data set has a mean, median, standard deviation, and so on. These universal, generic descriptions give a rough narrowing: “The footprint is 6 inches wide: a small bear, or a large man?”
Find the species. There are dozens of possible animals (probability distributions) to consider. We narrow it down with prior knowledge of the system. In the woods? Think horses, not zebras. Dealing with yes/no questions? Consider a binomial distribution.
Look up the specific animal. Once we have the distribution (“bears”), we look up our generic measurements in a table. “A 6-inch wide, 2-inch deep pawprint is most likely a 3-year-old, 400-lbs bear”. The lookup table is generated from the probability distribution, i.e. making measurements when the animal is in the zoo.
Make additional predictions. Once we know the animal, we can predict future behavior and other traits (“According to our calculations, Mr. Bubbles will poop in the woods.”). Statistics helps us get information about the origin of the data, from the data itself.
Ok! The metaphor isn’t perfect, but more palatable than “Statistics is the study of the collection, organization, analysis, and interpretation of data”. Need proof? Let’s see if we can ask intuitive “I tasted it!” questions:
Can we predict the next footprint? (Extrapolation)
Are the tracks following a path? (Regression / trend line)
Here’s two tracks, which animal was faster? Bigger? (Data from two drug trials: which was more effective?)
Is one animal moving in the same direction as another? (Correlation)
Are two animals tracking a common source? (Causation: two bears chasing the same rabbit)
These questions are much deeper than what I pondered when first learning stats. Every dry procedure now has a context: are we learning a new species? How to take the generic footprint measurements? How to make a table from a probability distribution? How to look up measurements in a table?
Having an analogy for the statistics process makes later data crunching click. Happy math.
PS. The forwards-backwards difference between probability and statistics shows up all over math. Some procedures are easy to do (derivatives) but difficult to undo (integrals). (Thanks Denis)
What’s algebra about? When learning about variables (x, y, z), they seem to “hide” a number:
What number could be hiding inside of x? 2, in this case.
It seems that arithmetic still works, even when we don’t have the exact numbers up front. Later on, we might arrange these “hidden numbers” in complex ways:
x^2 + x = 6
Whoa — a bit harder to solve, but it’s possible. Today let’s figure out how factoring works and why it’s useful.
When we write a polynomial like “x^2 + x = 6”, we can think at a higher level.
We have an unknown number, x, which interacts with itself (x * x = x^2). We add in the original number (+ x) and the result is 6.
x^2, x and 6 are all “numbers”, but now we’re keeping track of how they’re made:
x^2 is a component interacting with itself
x is a component on its own
6 is the desired state we want the entire system to become
After the interactions are finished, we should get 6. What number could be hiding inside of x to make this true?
Hrm — this is tricky. So let’s fight with a trick of our own: we can make a different system to track the error in our original one (this is mind-bending, so hang on).
Our original system is x^2 + x. The desired state is 6. A new system:
x^2 + x - 6
will track the difference between the original system and the desired state. When are we happiest? When there’s no difference:
x^2 + x - 6 = 0
Ah! That’s why we’re so interested in setting polynomials to zero! If we have a system and the desired state, we can make a new equation to track the difference, and try to make it zero. (This is deeper than just “subtract 6 from both sides”: we’re trying to describe the error!)
But… how do we actually get the error to zero? It’s still a jumble of components: x^2, x and 6 are flying everywhere.
Factor That Mamma Jamma
Factoring to the rescue. My intuition: factoring lets us re-arrange a complex system (x^2 + x – 6) as a bunch of linked, smaller systems.
Imagine taking a pile of sticks (our messy, disorganized system) and standing them up so they support each other, like a teepee:
(That’s a 2-d example, with two sticks).
Remove any stick and the entire structure collapses. If we can rewrite our system:
x^2 + x - 6 = 0
as a series of multiplications:
Component A * Component B = 0
we’ve put the sticks in a “teepee”. If Component A or Component B becomes 0, the structure collapses, and we get 0 as a result.
Neat! That is why factoring rocks: we re-arrange our error-system into a fragile teepee, so we can break it. We’ll find what obliterates our errors and puts our system in the ideal state.
Remember: We’re breaking the error in the system, not the system itself.
Onto The Factoring
Learning to “factor an equation” is the process of arranging your teepee. In this case:
x^2 + x - 6 = (x + 3)(x - 2) = 0, with Component A = (x + 3) and Component B = (x - 2)
If x = -3 then Component A falls down. If x = 2, Component B falls down. Either value causes the error to collapse, which means our original system (x^2 + x, the one we almost forgot about!) meets our requirements:
When x = -3, the error collapses, and we get (-3)^2 + (-3) = 6
When x = 2, the error collapses, and we get 2^2 + 2 = 6
Putting It All Together
I’ve wondered about the real purpose of factoring for a long, long time. In algebra class, equations are conveniently set to zero, and we’re not sure why. Here’s what happens in the real world:
Define the model: Write how your system behaves (x^2 + x)
Define the desired state: What should it equal? (6)
Define the error: The error is its own system: Error = actual – desired (i.e., x^2 + x – 6)
Factor the error: Rewrite the error as interlocking components: (x + 3)(x – 2)
Reduce the error to zero: Zero out one component or the other (x = -3, or x = 2).
When error = 0, our system must be in the desired state. We’re done!
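The steps above can be sketched in a few lines of Python. The function names (`model`, `error`, `factored_error`) are mine, chosen only for illustration; the numbers come straight from the running example.

```python
def model(x):
    """How the system behaves: x^2 + x."""
    return x**2 + x

DESIRED = 6  # the state we want the system to reach

def error(x):
    """The error is its own system: actual - desired."""
    return model(x) - DESIRED

def factored_error(x):
    """The same error, rearranged as a 'teepee' of two components."""
    return (x + 3) * (x - 2)

# The factored form is the same system, just arranged differently.
for x in range(-10, 11):
    assert error(x) == factored_error(x)

# Knock over either component and the error collapses to zero.
roots = [-3, 2]
for r in roots:
    print(f"x = {r}: error = {error(r)}, model = {model(r)}")
```

Both values print an error of 0 and a model value of 6, which is the desired state.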
Algebra is pretty darn useful:
Our system is a trajectory, the “desired state” is the target. What trajectory hits the target?
Our system is our widget sales, the “desired state” is our revenue target. What amount of earnings hits the goal?
Our system is the probability of our game winning, the “desired state” is a 50-50 (fair) outcome. What settings make it a fair game?
The idea of “matching a system to its desired state” is just one interpretation of why factoring is useful. If you have more, I’d like to hear them!
A cheatsheet for the process:
Some more food for thought:
Multiplication is often seen as AND. Component A must be there AND Component B must be there. If either condition is false, the system breaks.
The Fundamental Theorem of Algebra proves you have as many “components” as the highest power in the polynomial. If your highest term is x^4, then you can factor into 4 interlocked components (discussion for another day). But this should make sense: if you rewrite an “x^4 system” into multiplications, shouldn’t there be 4 individual “x components” being multiplied? If there were 3, you could never get to x^4, and if there were 5, you’d overshoot and get an x^5 term.
Do you have a real-world system in a “teepee” arrangement, where a single failing component collapses the entire structure?
The quadratic formula can “autobreak” any system with x^2, x and constant components. There’s formulas for complex systems (with x^3, x^4, or even some x^5 components) but they start to get a bit crazy.
Is there any way to prevent a system from having these weak points? (Unfactorable? Non-zeroable?). Don’t forget, we thought systems like x^2 + 1 were “non-zeroable” until imaginary numbers came along.
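As a rough illustration of the quadratic formula “autobreaking” a system, here is a small Python sketch. The helper `quadratic_roots` is a name I made up; it uses the standard library's `cmath.sqrt` so that systems like x^2 + 1 still get their (imaginary) zeros.

```python
import cmath

def quadratic_roots(a, b, c):
    """Zeros of a*x^2 + b*x + c via the quadratic formula (a must be nonzero)."""
    disc = cmath.sqrt(b * b - 4 * a * c)  # complex square root, so negatives are fine
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

# Our error system, x^2 + x - 6: the zeros are the familiar 2 and -3.
print(quadratic_roots(1, 1, -6))

# x^2 + 1 looked "non-zeroable" until imaginary numbers came along.
print(quadratic_roots(1, 0, 1))
```

The results come back as complex numbers even when the imaginary part is zero, which is a side effect of using `cmath` rather than anything deep.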
I usually avoid current events, but recent skirmishes in the math world prompted me to chime in. To recap, there’ve been heated discussions about math education and the role of online resources like Khan Academy.
As fun as a good math showdown may appear, there’s a bigger threat: Apathy. And Justin Bieber.
Educators, online or not, don’t compete with each other. They struggle to be noticed in our math-phobic society, where we casually wonder “Should algebra be taught at all?” not “Can algebra be taught better?”.
Entertainment is great; I love Starcraft. But it’s alarming when a prominent learning initiative gets less attention than a throwaway pop song (Super Bass: 268M views in a year; Khan Academy: 175M views in 5 years). Online learning is a rounding error next to Justin Bieber — “Baby” has 700M views alone.
What do we need? The Math Avengers. Different heroes, different tactics, and not without differences… but everyone fighting on the same side. Against Bieber.
I could be walking into a knife fight with an ice cream cone, but I’d like to approach each side with empathy and offer specific suggestions to bridge the gap.
The Big Misunderstanding
Superheroes need a misunderstanding before working together. It’s inevitable, and here’s ours (as a math relationship, of course):
Bad Teacher < Online Learning < Good teacher
The problem is in considering each part separately.
Is Khan Academy (free, friendly, always available) better than a mean, uninformed, or absent teacher? Yes!
Is an engaging human experience better than learning from a computer? Yes!
But, really, the ultimate solution is Online learning + Good Teachers.
Tactics differ, but we can agree on the mission: give students great online resources, and give teachers tools to augment their classroom.
Why Do I Care?
I love learning. Here’s my brief background so you can root out my biases.
I was a good student. I was on the math team and hummed songs like “Life is a sine-wave, I want to de-rive it all night long…”. I drew comics about sine & cosine, the crimefighting duo. You might say I enjoyed math.
I entered college and was slapped in the face by my freshman year math class.
Professors at big universities must know everything, right? If I didn’t get a concept, something must be wrong with me, right?
I had a WWII-era, finish-half-a-proof-in-class, grouch of a teacher. I bombed the midterm and was distressed. Math… I loved math! I didn’t mind difficulties in Physics or Spanish. But math? What I used to sing and draw cartoons about?
Finals came. While cramming, I found notes online, far more helpful than my book and teacher. I sent an email to the class, gingerly suggesting BY EUCLID YOU NEED TO READ THESE WEBSITES THEY ARE SO MUCH BETTER THAN THE PROFESSOR. The websites turned up on an index card in the computer lab that evening. How many of us were struggling?
I was studying, staring at a blue book when an aha! moment struck. I could see the Matrix: equations were a description of twists, turns and rotations. Their meaning became “obvious” in the way a circle must be round. What else could it be?
I was elated and furious: “Why didn’t they explain it like that the first time?!”
Paranoid I’d forget, I put my notes online and they evolved into this site: insights that actually worked for me. Articles on e, imaginary numbers, and calculus became popular — I think we all crave deep understanding. Bad teaching was a burst of gamma rays: I’m normally mild mannered, but enter Hulk Mode when recalling how my passion nearly died.
My core beliefs:
A bad experience can undo years of good ones. Students need resources to sidestep bad teaching.
Hard-won insights, sometimes found after years of teaching, need to be shared.
Learning “success” means having basic skills and the passion to learn more. A year, 5 years from now, do people seek out math? Or at least not hate it? (Compare #ihatemath to #ihategeography)
(Oh, I had great teachers too, like Prof. Kulkarni. The bad one just unlocked the Hulk.)
An Open letter to Khan Academy and Teachers
I recently heard a quote about constructive dialog: “Don’t argue the exact point a person made. Consider their position and respond to the best point they could have made.”
Here’s the concerns I see:
Packaging and presentation matters
Yes, other resources and tutorials exist, but there’s power in a giant, organized collection. We visit Wikipedia because we know what to expect, and it’s consistent.
Khan Academy provides consistent, non-judgmental tutorials. There are exercises and discussions for every topic. You don’t need to scour YouTube, digest hour-long calculus lectures, or open up PDF worksheets for practice.
So, let’s use the magic of friendly, exploratory, bite-sized learning of topics.
Teachers and online tools don’t “compete” any more than Mr. Rogers and Sesame Street did. They’re both ways to help.
I do think the name “Khan Academy” presents a challenge to community building. Would you rather write for Wikipedia or the Jimmy-Wales-o-pedia?
Wikipedia really feels like a community effort, and though there are alternatives, in general it’s a well-loved resource.
I think teachers may hesitate to use Khan Academy, not out of jealousy, but concern that a single pedagogical approach could overpower all others. Let’s build an online resource that can take input from the math community.
Human interaction matters
It’s easy to misunderstand Khan Academy’s goal. I’ve seen many of their blog posts and videos, and believe Khan Academy wants to work with teachers to promote deep understanding.
But, some news coverage shows students working silently in front of computers in class, not watching at home to free up class time for personal discussions.
The teacher doesn’t appear to be involved or interacting, and that misuse of a learning tool is a nightmare for teachers who want a personal connection. Let’s have an online resource that directly contributes to offline interactions also.
I’ve seen that insights emerge hours (or years) after learning a subject. For example, we’ve “known” since 4th grade what a million and billion are: 1,000,000 and 1,000,000,000.
But do we feel it? How long is a million seconds, roughly? C’mon, guess. Ready? It’s 12 days.
Ok, now how long is a billion seconds? It’s… wait for it… 31 years. 31 years!
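If you'd like to check the arithmetic yourself, here is a tiny Python sketch. The 365.25-day year is my assumption (an average that accounts for leap years); the results land at roughly 11.6 days and 31.7 years, matching the rounded figures above.

```python
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included

million_in_days = 1_000_000 / SECONDS_PER_DAY
billion_in_years = 1_000_000_000 / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"A million seconds is about {million_in_days:.1f} days")
print(f"A billion seconds is about {billion_in_years:.1f} years")
```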
That’s the difference between knowing and feeling an idea. Passion comes from feeling.
Teachers draw on years of experience to get ideas to click — let’s feed this back into the online lessons.
We teach for the same reason: to help students. Here’s a few specific situations to consider.
For many, Khan Academy is their only positive math experience: not teachers, or peers, or parents, but a video. Sure, it’s not the same as an in-person teacher, but it’s miles beyond an absent or hostile one. If an education experience gets someone excited to learn, and coming back to math, we should celebrate.
Remember, despite years of positive experiences and acing tests, a sufficiently bad class nearly drove me away from math. Resources like Khan Academy offer a lifeline: “Even with a bad teacher, I can still learn”.
When someone is interested, we need to feed their curiosity. I get a lot of traffic from Khan Academy comments — how can we help students dive deeper, without making them trudge randomly through the internet?
Lastly, we all learn differently. I generally prefer text to videos (faster to read, and I can “pause” with my eyes and think). Some like the homemade feel of Khan’s videos. Others might like the polished overviews in MinutePhysics. You might prefer 3-act math stories or modeling instruction.
Let’s offer several types of resources for students to enjoy.
Calling the Math Avengers
Still here? Fantastic. To all teachers, online and non:
What specific steps can we take to align our efforts?
One idea: Make a curated, collaborative, easy-to-explore teaching resource.
Khan Academy is well-organized: each topic has a video and sample problems. How about sections for complementary teaching styles, projects, and misconceptions?
Imagine a student could select their “Math hero” as Khan Academy or PatrickJMT or James Tanton and see lessons in the style they prefer (like Wikipedia, curate the list to “notable” resources).
Imagine teachers could explore the best in-class activities (“What projects work well for negative numbers?”).
Whatever the style, make it easy for other educators to contribute. Want project-based videos? Sure. Need step-by-step tutorials? Great. Prefer a conceptual overview? No problem.
Each teacher keeps their house style. Let Hulk smash, and Captain America handle the hostage negotiations. Use the hero that suits you.
Hi all! I’m starting an email list for BetterExplained readers and everyone interested in deep, intuitive learning. Sign up with the form below or click here to sign up.
Why Should I Sign Up?
If you like the blog, you’ll enjoy the email list too. I’ll be sharing the insights & techniques that took me from “huh?” to “aha!” on topics in math, programming, and communication. These periodic emails will include:
Exclusive content & previews
Short learning tips / essays / additional material that weren’t the right fit for the blog
Q&A on topics that are bothering you
Announcements & discounts for BetterExplained products I think you’ll enjoy
The blog explains ideas as I wish they were taught. The email list shares information I wish I’d seen.
Why Start an Email List?
A few reasons:
Email has better interaction. I can write, you can read & reply at your leisure. Social media seems noisy and non-personal — I want a conversation. (I credit Scott Young and patio11 for jump-starting the email idea).
The medium shapes the message. Blog posts are great for long-form, single-topic articles. Email favors shorter, bite-sized pieces. I found myself holding back material because it wasn’t a “blog-fit” (but still valuable!).
Better sustainability. The goal of BetterExplained is to be a lifelong project. Keeping in touch ensures the material stays useful and enjoyable, and that products (like the existing ebook + screencasts) dramatically increase your understanding.
I love blogging and will always write. Email is another method to get quick feedback, with blog posts to distill the final results.
I’m so thankful to have a little corner of the internet to share ideas, and I love hearing about and discussing the aha! moments that made things click. Let’s raise the bar for our teaching and learning: keep in touch!
Last time we tackled derivatives with a “machine” metaphor. Functions are a machine with an input (x) and output (y) lever. The derivative, dy/dx, is how much “output wiggle” we get when we wiggle the input:
Now, we can make a bigger machine from smaller ones (h = f + g, h = f * g, etc.). The derivative rules (addition rule, product rule) give us the “overall wiggle” in terms of the parts. The chain rule is special: we can “zoom into” a single derivative and rewrite it in terms of another input (like converting “miles per hour” to “miles per minute” — we’re converting the “time” input).
And with that recap, let’s build our intuition for the advanced derivative rules. Onward!
Division (Quotient Rule)
Ah, the quotient rule — the one nobody remembers. Oh, maybe you memorized it with a song like “Low dee high, high dee low…”, but that’s not understanding!
It’s time to visualize the division rule (who says “quotient” in real life?). The key is to see division as a type of multiplication:
h = f / g = f * (1 / g)
We have a rectangle, we have area, but the sides are “f” and “1/g”. Input x changes off on the side (by dx), so f and g change (by df and dg)… but how does 1/g behave?
Chain rule to the rescue! We can wrap up 1/g into a nice, clean variable and then “zoom in” to see that yes, it has a division inside.
So let’s pretend 1/g is a separate function, m. Inside function m is a division, but ignore that for a minute. We just want to combine two perspectives:
f changes by df, contributing area df * m = df * (1 / g)
m changes by dm, contributing area dm * f = ?
We turned m into 1/g easily. Fine. But what is dm (how much 1/g changed) in terms of dg (how much g changed)?
We want the difference between neighboring values of 1/g: 1/g and 1/(g + dg). For example:
What’s the difference between 1/4 and 1/3? 1/12
How about 1/5 and 1/4? 1/20
How about 1/6 and 1/5? 1/30
How does this work? We get the common denominator: for 1/3 and 1/4, it’s 12. And the difference between “neighbors” (like 1/3 and 1/4) will be 1 / common denominator, aka 1 / (x * (x + 1)). See if you can work out why!
If we make our derivative model perfect, and assume there’s no difference between neighbors, the +1 goes away and we get:
d/dx (1/x) = -1/x^2
(This is useful as a general fact: The change from 1/100 to 1/101 = one ten thousandth)
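Here is a quick Python check of the “neighbors” pattern, using exact fractions so no rounding sneaks in. The helper name `neighbor_gap` is hypothetical; the comparison column is the idealized rate -1/x^2 applied to a step of 1.

```python
from fractions import Fraction

def neighbor_gap(x: int) -> Fraction:
    """Exact change when moving from 1/x to 1/(x + 1)."""
    return Fraction(1, x + 1) - Fraction(1, x)

for x in (3, 4, 5, 100):
    gap = neighbor_gap(x)             # always -1 / (x * (x + 1))
    ideal = Fraction(-1, x * x)       # the idealized rate, -1/x^2, times a step of 1
    print(f"x = {x}: actual change {gap}, idealized change {ideal}")
```

For small x the two columns differ noticeably (-1/12 versus -1/9), but by x = 100 they are nearly identical (-1/10100 versus -1/10000). The smaller the step relative to x, the better the idealized rate works.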
The difference is negative, because the new value (1/4) is smaller than the original (1/3). So what’s the actual change?
g changes by dg, so 1/g becomes 1/(g + dg)
The instant rate of change is -1/g^2 [as we saw earlier]
The total change = dg * rate, or dg * (-1/g^2)
A few gut checks:
Why is the derivative negative? As dg increases, the denominator gets larger, the total value gets smaller, so we’re actually shrinking (1/3 to 1/4 is a shrink of 1/12).
Why do we have -1/g^2 * dg and not just -1/g^2? (This confused me at first). Remember, -1/g^2 is the chain rule conversion factor between the “g” and “1/g” scales (like saying 1 hour = 60 minutes). Fine. You still need to multiply by how far you went on the “g” scale, aka dg! An hour may be 60 minutes, but how many do you want to convert?
Where does dm fit in? m is another name for 1/g. dm represents the total change in 1/g, which as we saw, was -1/g^2 * dg. This substitution trick is used all over calculus to help split up gnarly calculations. “Oh, it looks like we’re doing a straight multiplication. Whoops, we zoomed in and saw one variable is actually a division — change perspective to the inner variable, and multiply by the conversion factor”.
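Phew. To convert our “dg” wiggle into a “dm” wiggle we do:
dm = -1/g^2 * dg
and combining the two slivers of area gives the total change:
d(f/g) = df * (1/g) + f * (-1/g^2 * dg)
Yay! Now, your overeager textbook may simplify this to:
(f/g)' = (g * f' - f * g') / g^2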
and it burns! It burns! This “simplification” hides how the division rule is just a variation of the product rule. Remember, there’s still two slivers of area to combine:
The “f” (numerator) sliver grows as expected
The “g” (denominator) sliver is negative (as g increases, the area gets smaller)
Using your intuition, you know it’s the denominator that’s contributing the negative change.
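To convince yourself the “two slivers” view and the textbook formula are the same thing, here is a small numeric sketch in Python. The particular f and g (sin(x) and x^2 + 1) are arbitrary choices of mine, and `numeric_slope` is a plain finite-difference estimate, not an exact derivative.

```python
import math

# Hypothetical parts of the machine, picked only for illustration.
def f(x): return math.sin(x)
def g(x): return x**2 + 1          # never zero, so f/g is always defined
def df(x): return math.cos(x)      # f's wiggle per unit of x
def dg(x): return 2 * x            # g's wiggle per unit of x

def two_slivers(x):
    """Product-rule view: df * (1/g) plus f * (-1/g^2) * dg."""
    numerator_sliver = df(x) * (1 / g(x))
    denominator_sliver = f(x) * (-1 / g(x) ** 2) * dg(x)
    return numerator_sliver + denominator_sliver

def textbook(x):
    """The 'simplified' quotient rule."""
    return (g(x) * df(x) - f(x) * dg(x)) / g(x) ** 2

def numeric_slope(x, dx=1e-6):
    """Wiggle the input a tiny amount and measure the output wiggle."""
    h = lambda t: f(t) / g(t)
    return (h(x + dx) - h(x - dx)) / (2 * dx)

x = 1.5
print(two_slivers(x), textbook(x), numeric_slope(x))
```

All three numbers agree to many decimal places, which is reassuring: the textbook version is just the two slivers put over a common denominator.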
The derivative of e^x is e^x, which means, in English, “e changes by 100% of its current amount”.
The “current amount” assumes x is the exponent, and we want changes from x’s point of view (df/dx). What if u(x)=x^2 is the exponent, but we still want changes from x’s point of view?
It’s the chain rule again — we want to zoom into u, get to x, and see how a wiggle of dx changes the whole system:
x changes by dx
u changes by du/dx, or d(x^2)/dx = 2x
How does e^u change?
Now remember, e^u doesn’t know we want changes from x’s point of view. e only knows its derivative is 100% of the current amount, e^u, as seen from its exponent u:
d(e^u)/du = e^u
The overall change, on a per-x basis is:
d(e^u)/dx = d(e^u)/du * du/dx = e^u * 2x = 2x * e^(x^2)
This confused me at first. I originally thought the derivative would require us to bring down “u”. No — the derivative of e^foo is e^foo. No more.
But if foo is controlled by anything else, then we need to multiply the rate of change by the conversion factor (d(foo)/dx) when we jump into that inner point of view.
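Here is a short Python sketch of that idea for the u = x^2 example: take e's own rate (100% of the current amount) and multiply by the conversion factor du/dx. The function names are mine, and the finite-difference check is only an approximation.

```python
import math

def h(x):
    """The whole system: e raised to u, where u = x^2."""
    return math.exp(x**2)

def chain_rule_slope(x):
    u = x**2
    e_rate = math.exp(u)   # e changes by 100% of its current amount
    du_dx = 2 * x          # conversion factor from u's scale to x's scale
    return e_rate * du_dx

def numeric_slope(x, dx=1e-6):
    """Wiggle x slightly and measure the output wiggle directly."""
    return (h(x + dx) - h(x - dx)) / (2 * dx)

x = 1.0
print(chain_rule_slope(x), numeric_slope(x))
```

At x = 1 both come out near 2e (about 5.44). Notice we never “brought down” u: the only extra factor is du/dx.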
The derivative of ln(x) is 1/x. It’s usually given as a matter-of-fact.
My intuition is to see ln(x) as the time needed to grow to x:
ln(10) is the time to grow from 1 to 10, assuming 100% continuous growth
Ok, fine. How long does it take to grow to the “next” value, like 11? (x + dx, where dx = 1)
When we’re at x=10, we’re growing exponentially at 10 units per second. It takes roughly 1/10 of a second (1/x) to get to the next value. And when we’re at x=11, it takes 1/11 of a second to get to 12. And so on: the time to the next value is 1/x.
The rule d/dx ln(x) = 1/x is mainly a fact to memorize, but it makes sense with a “time to grow” interpretation.
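The “time to the next value” idea is easy to test with Python's `math.log`. The helper name `time_to_next` is hypothetical; it measures the exact extra time to grow from x to x + 1 and compares it with the 1/x estimate.

```python
import math

def time_to_next(x):
    """Exact extra time to grow from x to x + 1 at 100% continuous growth."""
    return math.log(x + 1) - math.log(x)

for x in (10, 11, 100):
    print(f"x = {x}: actual {time_to_next(x):.5f}, estimate 1/x = {1 / x:.5f}")
```

The estimate is a little high each time because growth speeds up along the way, and a full step of 1 is a big “wiggle”. With a smaller step relative to x (as at x = 100) the match gets tighter.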
A Hairy Example: x^x
Time to test our intuition: what’s the derivative of x^x?
This is a bad mamma jamma. There’s two approaches:
Approach 1: Rewrite everything in terms of e.
Oh e, you’re so marvelous:
x^x = [e^ln(x)]^x = e^(ln(x) * x)
Any exponent (a^b) is really just e in different clothing: [e^ln(a)]^b. We’re just asking for the derivative of e^foo, where foo = ln(x) * x.
But wait! Since we want the derivative in terms of “x”, not foo, we need to jump into x’s point of view and multiply by d(foo)/dx:
d/dx e^foo = e^foo * d(foo)/dx = e^(ln(x) * x) * d/dx [ln(x) * x]
The derivative of “ln(x) * x” is just a quick application of the product rule. If h=x^x, the final result is:
h' = e^(ln(x) * x) * (x * 1/x + ln(x) * 1) = x^x * (1 + ln(x))
We wrote e^[ln(x)*x] in its original notation, x^x. Yay! The intuition was “rewrite in terms of e and follow the chain rule”.
Approach 2: Independent Points Of View
Remember, derivatives assume each part of the system works independently. Rather than seeing x^x as a giant glob, assume it’s made from two interacting functions: u^v. We can then add their individual contributions. We’re sneaky though, u and v are the same (u = v = x), but don’t let them know!
From u’s point of view, v is just a static power (i.e., if v=3, then it’s u^3) so we have:
d(u^v)/du = v * u^(v-1)
And from v’s point of view, u is just some static base (if u=5, we have 5^v). We rewrite into base e, and we get:
d(u^v)/dv = d(e^(ln(u) * v))/dv = ln(u) * u^v
We add each point of view for the total change:
d(u^v) = v * u^(v-1) * du + ln(u) * u^v * dv
And the reveal: u = v = x! There’s no conversion factor for this new viewpoint (du/dx = dv/dx = dx/dx = 1), and we have:
d(x^x)/dx = x * x^(x-1) + ln(x) * x^x = x^x * (1 + ln(x))
It’s the same as before! I was pretty excited to approach x^x from a few different angles.
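Here is a small Python sketch that checks both approaches against a direct numeric wiggle. It assumes the standard power rule for u's point of view and the exponential rule for v's point of view; the function names are my own.

```python
import math

def h(x):
    return x**x

def u_view(x):
    """Treat the exponent as static: power rule, v * u^(v-1), with u = v = x."""
    return x * x ** (x - 1)

def v_view(x):
    """Treat the base as static: exponential rule, ln(u) * u^v, with u = v = x."""
    return math.log(x) * x**x

def approach_two(x):
    """Add the two independent points of view."""
    return u_view(x) + v_view(x)

def approach_one(x):
    """Rewrite in terms of e and follow the chain rule: x^x * (ln(x) + 1)."""
    return x**x * (math.log(x) + 1)

def numeric_slope(x, dx=1e-6):
    return (h(x + dx) - h(x - dx)) / (2 * dx)

x = 2.0
print(approach_one(x), approach_two(x), numeric_slope(x))
```

At x = 2 all three values come out around 6.77: 4 from u's point of view plus about 2.77 from v's.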
By the way, use Wolfram Alpha to check your work on derivatives (click “show steps”).
Question: If u were more complex, where would we use du/dx?
Imagine u was a more complex function like u=x^2 + 3: where would we multiply by du/dx?
Let’s think about it: du/dx only comes into play from u’s point of view (when v is changing, u is a static value, and it doesn’t matter that u can be further broken down in terms of x). u’s contribution is:
d(u^v)/du = v * u^(v-1)
If we wanted the “dx” point of view, we’d include du/dx here:
v * u^(v-1) * du/dx
We’re multiplying by the “du/dx” conversion factor to get things from x’s point of view. Similarly, if v were more complex, we’d have a dv/dx term when computing v’s point of view.
Look what happened: we figured out the generic d/du and converted it into a more specific d/dx when needed.
It’s Easier With Infinitesimals
Separating dy from dx in dy/dx is “against the rules” of limits, but works great with infinitesimals. You can figure out the derivative rules really quickly:
d(f * g) = (f + df) * (g + dg) - f * g = f * dg + g * df + df * dg
We set “df * dg” to zero when jumping out of the infinitesimal world and back to our regular number system.
Think in terms of “How much did g change? How much did f change?” and derivatives snap into place much easier. “Divide through” by dx at the end.
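To see why it's safe to drop “df * dg”, here is a rough Python sketch for the product rule. The starting values (f = 3, g = 5) and the helper `product_change` are arbitrary illustrations of mine, and ordinary floats stand in for true infinitesimals.

```python
def product_change(f, g, df, dg):
    """Return (exact change in f*g, the part that is linear in the wiggles)."""
    exact = (f + df) * (g + dg) - f * g
    linear = f * dg + g * df
    return exact, linear

f, g = 3.0, 5.0   # hypothetical current values of the two parts
for step in (0.1, 0.01, 0.001):
    exact, linear = product_change(f, g, step, step)
    leftover = exact - linear   # this is df * dg, up to rounding
    print(f"step {step}: exact {exact:.6f}, linear {linear:.6f}, leftover {leftover:.6f}")
```

Each time the wiggle shrinks by 10x, the linear part shrinks by 10x but the leftover shrinks by 100x. In the limit, the leftover is negligible next to the part we keep.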
Summary: See the Machine
Our goal is to understand calculus intuition, not memorization. I need a few analogies to get me thinking:
Functions are machines, derivatives are the “wiggle” behavior
Derivative rules find the “overall wiggle” in terms of the wiggles of each part
The chain rule zooms into a perspective (hours => minutes)
The product rule adds area
The quotient rule adds area (but one area contribution is negative)
e changes by 100% of the current amount (d/dx e^x = 100% * e^x)
natural log is the time for e^x to reach the next value (x units/sec means 1/x to the next value)
With practice, ideas start clicking. Don’t worry about getting tripped up — I still tried to overuse the chain-rule when working with exponents. Learning is a process!
Appendix: Partial Derivatives
Let’s say our function depends on two inputs:
f(x, y)
The derivative of f can be seen from x’s point of view (how does f change with x?) or y’s point of view (how does f change with y?). It’s the same idea: we have two “independent” perspectives that we combine for the overall behavior (it’s like combining the point of view of two Solipsists, who think they’re the only “real” people in the universe).
If x and y depend on the same variable (like t, time), we can write the following:
df/dt = ∂f/∂x * dx/dt + ∂f/∂y * dy/dt
It’s a bit of the chain rule — we’re combining two perspectives, and for each perspective, we dive into its root cause (time).
If x and y are otherwise independent, we represent the derivative along each axis in a vector:
(∂f/∂x, ∂f/∂y)
This is the gradient, a way to represent “From this point, if you travel in the x or y direction, here’s how you’ll change”. We combined our 1-dimensional “points of view” to get an understanding of the entire 2d system. Whoa.
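As a rough sketch of both ideas, here is some Python that wiggles each input separately to estimate the gradient, then combines the perspectives when x and y both depend on time. The machine f(x, y) = x^2 * y and the paths x = t, y = t^2 are hypothetical examples I picked so the answers are easy to verify by hand.

```python
def f(x, y):
    """A hypothetical two-input machine."""
    return x**2 * y

def gradient(func, x, y, h=1e-6):
    """Wiggle each input on its own and record the output wiggle per unit."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

# Independent inputs: the gradient at (3, 2) should be close to (2xy, x^2) = (12, 9).
print(gradient(f, 3.0, 2.0))

# Shared root cause: x = t and y = t^2, so dx/dt = 1 and dy/dt = 2t.
t = 2.0
df_dx, df_dy = gradient(f, t, t**2)
df_dt = df_dx * 1 + df_dy * (2 * t)
print(df_dt)   # f is really t^4 here, so we expect about 4 * t^3 = 32
```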
The machine computes functions like addition and multiplication with gears — you can see the mechanics unfolding!
Think of function f as a machine with an input lever “x” and an output lever “y”. As we adjust x, f sets the height for y. Another analogy: x is the input signal, f receives it, does some magic, and spits out signal y. Use whatever analogy helps it click.
Wiggle Wiggle Wiggle
The derivative is the “moment-by-moment” behavior of the function. What does that mean? (And don’t mindlessly mumble “The derivative is the slope”. See any graphs around these parts, fella?)
The derivative is how much we wiggle. The lever is at x, we “wiggle” it, and see how y changes. “Oh, we moved the input lever 1mm, and the output moved 5mm. Interesting.”
The result can be written “output wiggle per input wiggle” or “dy/dx” (5mm / 1mm = 5, in our case). This is usually a formula, not a static value, because it can depend on your current input setting.
For example, when f(x) = x^2, the derivative is 2x. Yep, you’ve memorized that. What does it mean?
If our input lever is at x = 10 and we wiggle it slightly (moving it by dx=0.1 to 10.1), the output should change by dy. How much, exactly?
We know f’(x) = dy/dx = 2 * x
At x = 10 the “output wiggle per input wiggle” is = 2 * 10 = 20. The output moves 20 units for every unit of input movement.
If dx = 0.1, then dy = 20 * dx = 20 * .1 = 2
And indeed, the difference between 10^2 and (10.1)^2 is about 2. The derivative estimated how far the output lever would move (a perfect, infinitely small wiggle would move 2 units; we moved 2.01).
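The lever example can be replayed in a few lines. This is only a sketch; the variable names are mine:

```python
def f(x):
    return x**2

x, dx = 10, 0.1
predicted_dy = 2 * x * dx        # derivative (2x) times the input wiggle
actual_dy = f(x + dx) - f(x)     # what the output lever really did

print(predicted_dy, actual_dy)   # roughly 2 and 2.01
```

The small gap between the two numbers is the part of the movement that comes from dx not being infinitely small.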
The key to understanding the derivative rules:
Set up your system
Wiggle each part of the system separately, see how far the output moves
Combine the results
The total wiggle is the sum of wiggles from each part.
Addition and Subtraction
Time for our first system:
h(x) = f(x) + g(x)
What happens when the input (x) changes?
In my head, I think “Function h takes a single input. It feeds the same input to f and g and adds the output levers. f and g wiggle independently, and don’t even know about each other!”
Function f knows it will contribute some wiggle (df), g knows it will contribute some wiggle (dg), and we, the prowling overseers that we are, know their individual moment-by-moment behaviors are added:
dh = df + dg
Again, let’s describe each “point of view”:
The overall system has behavior dh
From f’s perspective, it contributes df to the whole [it doesn't know about g]
From g’s perspective, it contributes dg to the whole [it doesn't know about f]
Every change to a system is due to some part changing (f and g). If we add the contributions from each possible variable, we’ve described the entire system.
df vs df/dx
Sometimes we use df, other times df/dx — what gives? (This confused me for a while)
df is a general notion of “however much f changed”
df/dx is a specific notion of “however much f changed, in terms of how much x changed”
The generic “df” helps us see the overall behavior.
An analogy: Imagine you’re driving cross-country and want to measure the fuel efficiency of your car. You’d measure the distance traveled, check your tank to see how much gas you used, and finally do the division to compute “miles per gallon”. You measured distance and gasoline separately — you didn’t jump into the gas tank to get the rate on the go!
In calculus, sometimes we want to think about the actual change, not the ratio. Working at the “df” level gives us room to think about how the function wiggles overall. We can eventually scale it down in terms of a specific input.
And we’ll do that now. The addition rule above can be written, on a “per dx” basis, as:
dh/dx = df/dx + dg/dx
Multiplication (Product Rule)
Next puzzle: suppose our system multiplies parts “f” and “g”. How does it behave?
h(x) = f(x) * g(x)
Hrm, tricky — the parts are interacting more closely. But the strategy is the same: see how each part contributes from its own point of view, and combine them:
total change in h = f’s contribution (from f’s point of view) + g’s contribution (from g’s point of view)
Check out this diagram:
What’s going on?
We have our system: f and g are multiplied, giving h (the area of the rectangle)
Input “x” changes by dx off in the distance. f changes by some amount df (think absolute change, not the rate!). Similarly, g changes by its own amount dg. Because f and g changed, the area of the rectangle changes too.
What’s the area change from f’s point of view? Well, f knows he changed by df, but has no idea what happened to g. From f’s perspective, he’s the only one who moved and will add a slice of area = df * g
Similarly, g doesn’t know how f changed, but knows he’ll add a slice of area “dg * f”
The overall change in the system (dh) is the two slices of area:
dh = f * dg + g * df
Now, like our miles per gallon example, we “divide by dx” to write this in terms of how much x changed:
dh/dx = f * dg/dx + g * df/dx
(Aside: Divide by dx? Engineers will nod, mathematicians will frown. Technically, df/dx is not a fraction: it’s the entire operation of taking the derivative (with the limit and all that). But infinitesimal-wise, intuition-wise, we are “scaling by dx”. I’m a smiler.)
The key to the product rule: add two “slivers of area”, one from each point of view.
Gotcha: But isn’t there some effect from both f and g changing simultaneously (df * dg)?
Yep. However, this area is an infinitesimal * infinitesimal (a “2nd-order infinitesimal”) and invisible at the current level. It’s a tricky concept, but (df * dg) / dx vanishes compared to normal derivatives like df/dx. We vary f and g independently and combine the results, and ignore results from them moving together.
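To see the two slivers and the tiny corner piece with real numbers, here is a minimal sketch. The parts f and g are hypothetical, picked only for illustration:

```python
# Hypothetical parts, chosen only for illustration.
def f(x):
    return x**2

def g(x):
    return 3 * x + 1

x, dx = 2.0, 0.001
df = f(x + dx) - f(x)                # how much f changed
dg = g(x + dx) - g(x)                # how much g changed

dh_actual = f(x + dx) * g(x + dx) - f(x) * g(x)
two_slices = f(x) * dg + g(x) * df   # one sliver from each point of view
corner = df * dg                     # both parts moving at once

print(dh_actual - two_slices, corner)  # the leftover is the corner piece
```

With dx = 0.001 the corner is already about a thousand times smaller than the slivers, and it keeps shrinking faster than they do as dx shrinks.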
The Chain Rule: It’s Not So Bad
Let’s say g depends on f, which depends on x:
y = g(f(x))
The chain rule lets us “zoom into” a function and see how an initial change (x) can affect the final result down the line (g).
Interpretation 1: Convert the rates
A common interpretation is to multiply the rates:
x wiggles f. This creates a rate of change of df/dx, which wiggles g by dg/df. The entire wiggle is then:
dg/dx = dg/df * df/dx
This is similar to the “factor-label” method in chemistry class:
miles/hour = miles/second * 60 seconds/minute * 60 minutes/hour
If your “miles per second” rate changes, multiply by the conversion factor to get the new “miles per hour”. The second doesn’t know about the hour directly — it goes through the second => minute conversion.
Similarly, g doesn’t know about x directly, only f. Function g knows it should scale its input by dg/df to get the output. The initial rate (df/dx) gets modified as it moves up the chain.
Interpretation 2: Convert the wiggle
I prefer to see the chain rule on the “per-wiggle” basis:
x wiggles by dx, so
f wiggles by df, so
g wiggles by dg
Cool. But how are they actually related? Oh yeah, the derivative! (It’s the output wiggle per input wiggle):
df = (df/dx) * dx
Remember, the derivative of f (df/dx) is how much to scale the initial wiggle. And the same happens to g:
dg = (dg/df) * df
It will scale whatever wiggle comes along its input lever (f) by dg/df. If we write the df wiggle in terms of dx:
dg = (dg/df) * (df/dx) * dx
We have another version of the chain rule: dx starts the chain, which results in some final result dg. If we want the final wiggle in terms of dx, divide both sides by dx:
dg/dx = (dg/df) * (df/dx)
The chain rule isn’t just factor-label unit cancellation — it’s the propagation of a wiggle, which gets adjusted at each step.
The chain rule works for several variables (a depends on b depends on c): just propagate the wiggle as you go.
Try to imagine “zooming into” each variable’s point of view. Starting from dx and looking up, you see the entire chain of transformations needed before the impulse reaches g.
Chain Rule: Example Time
Let’s say we put a “squaring machine” in front of a “cubing machine”:
input(x) => f:x^2 => g:f^3 => output(y)
f:x^2 means f squares its input. g:f^3 means g cubes its input, the value of f. For example:
input(2) => f(2) => g(4) => output:64
Start with 2, f squares it (2^2 = 4), and g cubes this (4^3 = 64). It’s a 6th power machine:
g(f(x)) = (x^2)^3 = x^6
And what’s the derivative?
f changes its input wiggle by df/dx = 2x
g changes its input wiggle by dg/df = 3f^2
The final change is:
dg/dx = dg/df * df/dx = 3f^2 * 2x = 3(x^2)^2 * 2x = 6x^5
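A quick way to check the squaring-then-cubing machine is to push a wiggle through both stages and compare against the 6th-power shortcut 6x^5. This is a sketch; the starting point x = 2 is arbitrary:

```python
def f(x):
    return x**2          # squaring machine

def g(u):
    return u**3          # cubing machine; u is "whatever came in"

x = 2.0
df_dx = 2 * x            # f scales its input wiggle by 2x
dg_df = 3 * f(x)**2      # g scales its input wiggle by 3f^2
dg_dx = dg_df * df_dx    # the wiggle after both stages

print(dg_dx, 6 * x**5)   # the two values agree
```

Notice g never looks at x: it only sees the value f handed over.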
Chain Rule: Gotchas
Functions treat their inputs like a blob
In the example, g’s derivative (“the derivative of x^3 is 3x^2”) doesn’t refer to the original “x”, just whatever the input was (the derivative of foo^3 is 3*foo^2). The input was f, and it treats f as a single value. Later on, we scurry in and rewrite f in terms of x. But g has no involvement with that: it doesn’t care that f can be rewritten in terms of smaller pieces.
In many examples, the variable “x” is the “end of the line”.
Questions ask for df/dx, i.e. “Give me changes from x’s point of view”. Now, x could depend on some deeper variable, but that’s not being asked for. It’s like saying “I want miles per hour. I don’t care about miles per minute or miles per second. Just give me miles per hour”. df/dx means “stop looking at inputs once you get to x”.
How come we multiply derivatives with the chain rule, but add them for the others?
The regular rules are about combining points of view to get an overall picture. What change does f see? What change does g see? Add them up for the total.
The chain rule is about going deeper into a single part (like f) and seeing if it’s controlled by another variable. It’s like looking inside a clock and saying “Hey, the minute hand is controlled by the second hand!”. We’re staying inside the same part.
Sure, eventually this “per-second” perspective of f could be added to some perspective from g. Great. But the chain rule is about diving deeper into “f’s” root causes.
Power Rule: Oft Memorized, Seldom Understood
What’s the derivative of x^4? 4x^3? Great. You brought down the exponent and subtracted one. Now explain why!
Hrm. There’s a few approaches, but here’s my new favorite: x^4 is really x * x * x * x. It’s the multiplication of 4 “independent” variables. Each x doesn’t know about the others, it might as well be x * u * v * w.
Now think about the first x’s point of view:
It changes from x to x + dx
The change in the overall function is [(x + dx) - x][u * v * w] = dx[u * v * w]
The change on a “per dx” basis is [u * v * w]
From u’s point of view, it changes by du. It contributes (du/dx)*[x * v * w] on a “per dx” basis
v contributes (dv/dx) * [x * u * w]
w contributes (dw/dx) * [x * u * v]
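The curtain is unveiled: x, u, v, and w are the same! The “point of view” conversion factor is 1 (du/dx = dv/dx = dw/dx = dx/dx = 1), and the total change is
[u * v * w] + [x * v * w] + [x * u * w] + [x * u * v] = [x * x * x] + [x * x * x] + [x * x * x] + [x * x * x] = 4x^3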
In a sentence: the derivative of x^4 is 4x^3 because x^4 has four identical “points of view” which are being combined. Booyeah!
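The "four points of view" idea can be spelled out as a loop: wiggle one factor at a time, hold the other three still, and add up the contributions. This is only a sketch, with x = 3 picked arbitrarily:

```python
x = 3.0

# x^4 seen as four separate factors that happen to share a value.
factors = [x, x, x, x]

total = 0.0
for i in range(len(factors)):
    others = 1.0
    for j, value in enumerate(factors):
        if j != i:
            others *= value    # the three factors that stay put
    total += others            # contribution "per dx" from factor i

print(total, 4 * x**3)         # the two values agree
```

Swap in a longer list of factors and the same loop gives n * x^(n-1).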
Take A Breather
I hope you’re seeing the derivative in a new light: we have a system of parts, we wiggle our input and see how the whole thing moves. It’s about combining perspectives: what does each part add to the whole?
In the follow-up article, we’ll look at even more powerful rules (exponents, quotients, and friends). Happy math.
Why do analogies work so well? They’re building blocks for our thoughts, written in the associative language of our brains.
At first, I thought analogies had to be perfect models of the idea they explained. Nope.
“All models are wrong, but some are useful” – George Box
Analogies are handles to grasp a larger, more slippery idea. They’re a raft to cross a river, and can be abandoned once on the other side. Unempathetic experts may think the raft is useless, since they no longer use it, or perhaps they were such marvelous swimmers it was never needed!
Analogies are perfectly fine. But why do they work so well?
Our brains are association machines. Connections, relationships, patterns — we need meaning! Yet we present topics as if we could be programmed with raw information.
Consider the typical language class:
Here’s the grammar
Here’s the vocabulary
Put the vocab in the grammar and go!
(We know how well that works). The mistake is thinking direct study of the grammar and vocabulary will build fluency — it’s a tough slog. I suspect a class of 80% speaking, listening, making idioms, building pronunciation and 20% vocabulary/grammar does much better than the reverse.
Start with simple analogies you deeply understand, then attach extra details.
Here’s an example: I can casually describe i (the imaginary number) as the square root of -1 and you can blindly accept it.
But you won’t really believe me until I start down the path of “Hey, numbers can be 2 dimensional, and i is a rotation into the 2nd dimension”. The word “rotation” stretches our brain about what a number could be — the number line may not be the final step. We’re having a real discussion and can start learning!
See, you’re extremely fluent with the idea of a line, and the idea of a second dimension, and we can work “i is a rotation” into that framework. In computer terms: we are programming with the native language of the machine. Our brain thinks with connections, so explain new data in terms of existing connections!
Although a subject can be distilled into rules and facts, drinking this concentrated math isn’t the best way to enjoy it. It’s not how our brains work, and presenting raw data suffers from a painful translation step.
I don’t think of algebra, trig and other math as a table of equations. It’s a web of connections and insights. But why show facts and hope you recreate the mental model in my head, instead of describing it directly?
No, no — let’s have a brain-to-brain. Here’s the analogies in my head, I want you to have them too.
Joshua Zucker helped me refine my recent “functions are plates, derivatives are breaking plates into shards, integrals are weighing the pieces” analogy, and corrected a huge misconception about integrals and anti-derivatives.
Stan and YatharthROCK have been helping me develop ideas on aha.betterexplained.com (thanks guys).
The aha / question area is like a mini-forum to discuss analogies. (Teachers, please ransack these insights and whatever is useful — every article is free to use, print, mime, etc. for non-commercial use). The goal is to enable conversations about what is actually working.
I have a nefarious plan for the widget: gather improvements for each article! Knowing exactly what parts of the article helped (or didn’t) makes it easier to keep revising: I want the analogies to sing.
Try it out
The aha section is new, there’ll be some bugs, but I’d love your feedback anyway:
Share what really worked. As you read articles, post what analogies helped the most.
Ask questions. Have a question that’s been bothering you? Add it!
Vote on what’s helping. Just click the heart to rate an insight or question.
The aha section isn’t a replacement for comments — it’s a way to organize the best parts. If there’s other types of “aha” items you’d like organized (Followup:, Example:, etc.) let me know!
How do you wish the derivative was explained to you? Here's my take.
Psst! The derivative is the heart of calculus, buried inside this definition:
f'(x) = lim (dx -> 0) [f(x + dx) - f(x)] / dx
But what does it mean?
Let's say I gave you a magic newspaper that listed the daily stock market changes for the next few years (+1% Monday, -2% Tuesday...). What could you do?
Well, you'd apply the changes one-by-one, plot out future prices, and buy low / sell high to build your empire. You could even hire away the monkeys who currently throw darts at newspapers.
Others call the derivative "the slope of a function" -- it's so bland! Like the stock list, the derivative is a total, predictive understanding of a system. You can plot the past/present/future, find minimums/maximums, and yes, staff your simian workforce.
Step away from the gnarly equation. Equations exist to convey ideas: understand the idea, not the grammar.
Derivatives create a perfect model of change from an imperfect guess.
This result came over thousands of years of thinking, from Archimedes to Newton. Let's look at the analogies behind it.
We all live in a shiny continuum
Infinity is a constant source of paradoxes ("headaches"):
A line is made up of points? Sure.
So there's an infinite number of points on a line? Yep.
How do you cross a room when there's an infinite number of points to visit? (Gee, thanks Zeno).
And yet, we move. My intuition is to fight infinity with infinity. Sure, there's infinity points between 0 and 1. But I move two infinities of points per second (somehow!) and I cross the gap in half a second.
Distance has infinite points, motion is possible, therefore motion is in terms of "infinities of points per second".
Instead of thinking of differences ("How far to the next point?") we can compare rates ("How fast are you moving through this continuum?").
It's strange, but you can see 10/5 as "I need to travel 10 'infinities' in 5 segments of time. To do this, I travel 2 'infinities' for each unit of time".
Analogy: See division as a rate of motion through a continuum of points
What's after zero?
Another brain-buster: What number comes after zero? .01? .0001?
Hrm. Anything you can name, I can name smaller (I'll just halve your number... nyah!).
Even though we can't calculate the number after zero, it must be there, right? Like demons of yore, it's the "number that cannot be written, lest ye be smitten".
Call the gap to the next number "dx". I don't know exactly how big it is, but it's there!
Analogy: dx is a "jump" to the next number in the continuum.
Measurements depend on the instrument
The derivative predicts change. Ok, how do we measure speed (change in distance)?
Officer: Do you know how fast you were going?
Driver: I have no idea.
Officer: 95 miles per hour.
Driver: But I haven't been driving for an hour!
We clearly don't need a "full hour" to measure your speed. We can take a before-and-after measurement (over 1 second, let's say) and get your instantaneous speed. If you moved 140 feet in one second, you're going ~95mph. Simple, right?
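The unit conversion behind that "~95mph" figure, as a small sketch (the constant names are mine):

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

feet_moved = 140         # before-and-after distance from the example
seconds_elapsed = 1

feet_per_second = feet_moved / seconds_elapsed
miles_per_hour = feet_per_second * SECONDS_PER_HOUR / FEET_PER_MILE
print(round(miles_per_hour, 1))
```

One second of observation was enough to state a "per hour" rate.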
Not exactly. Imagine a video camera pointed at Clark Kent (Superman's alter-ego). The camera records 24 pictures/sec (roughly 40ms per photo) and Clark seems still. On a second-by-second basis, he's not moving, and his speed is 0mph.
Wrong again! Between each photo, within that 40ms, Clark changes to Superman, solves crimes, and returns to his chair for a nice photo. We measured 0mph but he's really moving -- he goes too fast for our instruments!
Analogy: Like a camera watching Superman, the speed we measure depends on the instrument!
Running the Treadmill
We're nearing the chewy, slightly tangy center of the derivative. We need before-and-after measurements to detect change, but our measurements could be flawed.
Imagine a shirtless Santa on a treadmill (go on, I'll wait). We're going to measure his heart rate in a stress test: we attach dozens of heavy, cold electrodes and get him jogging.
Santa huffs, he puffs, and his heart rate shoots to 190 beats per minute. That must be his "under stress" heart rate, correct?
Nope. See, the very presence of stern scientists and cold electrodes increased his heart rate! We measured 190bpm, but who knows what we'd see if the electrodes weren't there! Of course, if the electrodes weren't there, we wouldn't have a measurement.
What to do? Well, look at the system:
measurement = actual amount + measurement effect
Ah. After lots of studies, we may find "Oh, each electrode adds 10bpm to the heartrate". We make the measurement (imperfect guess of 190) and remove the effect of electrodes ("perfect estimate").
Analogy: Remove the "electrode effect" after making your measurement
By the way, the "electrode effect" shows up everywhere. Research studies have the Hawthorne Effect where people change their behavior because they are being studied. Gee, it seems everyone we scrutinize sticks to their diet!
Understanding the derivative
Armed with these insights, we can see how the derivative models change:
Start with some system to study, f(x):
Change by the smallest amount possible (dx)
Get the before-and-after difference: f(x + dx) - f(x)
We don't know exactly how small "dx" is, and we don't care: get the rate of motion through the continuum: [f(x + dx) - f(x)] / dx
This rate, however small, has some error (our cameras are too slow!). Predict what happens if the measurement were perfect, if dx wasn't there.
The magic's in the final step: how do we remove the electrodes? We have two approaches:
Limits: what happens when dx shrinks to nothingness, beyond any error margin?
Infinitesimals: What if dx is a tiny number, undetectable in our number system?
Both are ways to formalize the notion of "How do we throw away dx when it's not needed?".
My pet peeve: Limits are a modern formalism, they didn't exist in Newton's time. They help make dx disappear "cleanly". But teaching them before the derivative is like showing a steering wheel without a car! It's a tool to help the derivative work, not something to be studied in a vacuum.
An Example: f(x) = x^2
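Let's shake loose the cobwebs with an example. How does the function f(x) = x^2 change as we move through the continuum?
f(x) = x^2
f(x + dx) = (x + dx)^2 = x^2 + 2x*dx + dx^2
f(x + dx) - f(x) = 2x*dx + dx^2
[f(x + dx) - f(x)] / dx = 2x + dx
f'(x) = 2x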
Note the difference in the last 2 equations:
One has the error built in (dx)
The other has the "true" change, where dx = 0 (our measurements have no effect on the outcome)
Time for real numbers. Here's the values for f(x) = x^2, with intervals of dx = 1:
1, 4, 9, 16, 25, 36, 49, 64...
The absolute change between each result is:
3, 5, 7, 9, 11, 13, 15...
(Here, the absolute change is the "speed" between each step, where the interval is 1)
Consider the jump from x=2 to x=3 (3^2 - 2^2 = 5). What is "5" made of?
Measured rate = Actual Rate + Error
5 = 2x + dx
5 = 2(2) + 1
Sure, we measured a "5 units moved per second" because we went from 4 to 9 in one interval. But our instruments trick us! 4 units of speed came from the real change, and 1 unit was due to shoddy instruments (1.0 is a large jump, no?).
If we restrict ourselves to integers, 5 is the perfect speed measurement from 4 to 9. There's no "error" in assuming dx = 1 because that's the true interval between neighboring points.
But in the real world, measuring every 1.0 seconds is too slow. What if our dx was 0.1? What speed would we measure at x=2?
Well, we examine the change from x=2 to x=2.1:
2.1^2 - 2^2 = 0.41
Remember, 0.41 is what we changed in an interval of 0.1. Our speed-per-unit is 0.41 / .1 = 4.1. And again we have:
Measured rate = Actual Rate + Error
4.1 = 2x + dx
Interesting. With dx=0.1, the measured and actual rates are close (4.1 to 4, 2.5% error). When dx=1, the rates are pretty different (5 to 4, 25% error).
Following the pattern, we see that throwing out the electrodes (letting dx=0) reveals the true rate of 2x.
In plain English: We analyzed how f(x) = x^2 changes, found an "imperfect" measurement of 2x + dx, and deduced a "perfect" model of change as 2x.
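To watch the "electrode effect" fade, here is a minimal sketch that repeats the measurement at x = 2 with smaller and smaller jumps. The helper name measured_rate is mine:

```python
def measured_rate(f, x, dx):
    # Before-and-after difference, scaled by the size of the jump.
    return (f(x + dx) - f(x)) / dx

def square(x):
    return x**2

x = 2
for dx in (1, 0.1, 0.01, 0.001):
    rate = measured_rate(square, x, dx)
    error = rate - 2 * x       # whatever the "electrodes" added
    print(dx, rate, error)
```

Up to floating-point noise, the error column is just dx itself: 1, then 0.1, then 0.01, and so on.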
The derivative is "better division", where you get the speed through the continuum at every instant. Something like 10/5 = 2 says "you have a constant speed of 2 through the continuum".
When your speed changes as you go, you need to describe your speed at each instant. That's the derivative.
If you apply this changing speed to each instant (take the integral of the derivative), you recreate the original behavior, just like applying the daily stock market changes to recreate the full price history. But this is a big topic for another day.
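As a small taste of that idea, here is a sketch that replays the speed 2x in tiny steps and lands close to x^2. The function name rebuild and the step size are my own choices:

```python
def rebuild(start, speed, x_end, dx):
    # Apply each tiny change in turn, like replaying daily stock moves.
    steps = round(x_end / dx)
    value = start
    for i in range(steps):
        x = i * dx
        value += speed(x) * dx   # change = speed at this instant * step size
    return value

rebuilt = rebuild(start=0.0, speed=lambda x: 2 * x, x_end=3.0, dx=0.0001)
print(rebuilt)  # close to 3^2 = 9
```

The result is slightly off because each step uses the speed at its start; smaller steps shrink the gap.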
Gotcha: The Many meanings of "Derivative"
You'll see "derivative" in many contexts:
"The derivative of x^2 is 2x" means "At every point, we are changing by a speed of 2x (twice the current x-position)". (General formula for change)
"The derivative is 44" means "At our current location, our rate of change is 44." When f(x) = x^2, at x=22 we're changing at 44 (Specific rate of change).
"The derivative is dx" may refer to the tiny, hypothetical jump to the next position. Technically, dx is the "differential" but the terms get mixed up. Sometimes people will say "derivative of x" and mean dx.
Gotcha: Our models may not be perfect
We found the "perfect" model by making a measurement and improving it. Sometimes, this isn't good enough -- we're predicting what would happen if dx wasn't there, but added dx to get our initial guess!
Some ill-behaved functions defy the prediction: there's a difference between removing dx with the limit and what actually happens at that instant. These are called "discontinuous" functions, which is essentially "cannot be modeled with limits". As you can guess, the derivative doesn't work on them because we can't actually predict their behavior.
Discontinuous functions are rare in practice, and often exist as "Gotcha!" test questions ("Oh, you tried to take the derivative of a discontinuous function, you fail"). Realize the theoretical limitation of derivatives, and then realize their practical use in measuring every natural phenomenon. Nearly every function you'll see (sine, cosine, e^x, polynomials, etc.) is continuous.
Gotcha: Integration doesn't really exist
The relationship between derivatives, integrals and anti-derivatives is nuanced (and I got it wrong originally). Here's a metaphor. Start with a plate, your function to examine:
Differentiation is breaking the plate into shards. There is a specific procedure: take a difference, find a rate of change, then assume dx isn't there.
Integration is weighing the shards: your original function was "this" big. There's a procedure, cumulative addition, but it doesn't tell you what the plate looked like.
Anti-differentiation is figuring out the original shape of the plate from the pile of shards.
There's no general step-by-step recipe to find the anti-derivative; we have to guess. We make a lookup table with a bunch of known derivatives (original plate => pile of shards) and look at our existing pile to see if it's similar. "Let's find the integral of 10x. Well, it looks like 2x is the derivative of x^2. So... scribble scribble... 10x is the derivative of 5x^2."
Finding derivatives is mechanics; finding anti-derivatives is an art. Sometimes we get stuck: we take the changes, apply them piece by piece, and mechanically reconstruct a pattern. It might not be the "real" original plate, but is good enough to work with.
Another subtlety: aren't the integral and anti-derivative the same? (That's what I originally thought)
Yes, but this isn't obvious: it's the fundamental theorem of calculus! (It's like saying "Aren't a^2 + b^2 and c^2 the same? Yes, but this isn't obvious: it's the Pythagorean theorem!"). Thanks to Joshua Zucker for helping sort me out.
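The two routes can be compared directly. In this sketch, weigh_shards does the cumulative addition and guessed_plate is the lookup-table guess for 10x; both names and the interval 0 to 2 are my own:

```python
def weigh_shards(speed, a, b, pieces=100_000):
    # Integration as cumulative addition: sum up many thin slices.
    dx = (b - a) / pieces
    return sum(speed(a + (i + 0.5) * dx) * dx for i in range(pieces))

def guessed_plate(x):
    # Anti-derivative guessed from the lookup table: 10x comes from 5x^2.
    return 5 * x**2

by_addition = weigh_shards(lambda x: 10 * x, 0.0, 2.0)
by_guess = guessed_plate(2.0) - guessed_plate(0.0)
print(by_addition, by_guess)  # both close to 20
```

The first function never learns the shape 5x^2; it only reports a total. That the totals match is the fundamental theorem at work.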
Math is a language, and I want to "read" calculus (not "recite" calculus, i.e. like we can recite medieval German hymns). I need the message behind the definitions.
My biggest aha! was realizing the transient role of dx: it makes a measurement, and is removed to make a perfect model. Limits/infinitesimals are a formalism, we can't get caught up in them. Newton seemed to do ok without them.
Armed with these analogies, other math questions become interesting:
How do we measure different sizes of infinity? (In some sense they're all "infinite", in other senses the range (0,1) is smaller than (0,2))
What are the real rules about making "dx go away"? (How do infinitesimals and limits really work?)
How do we describe numbers without writing them down? "The next number after 0" is the beginnings of analysis (which I want to learn).
The fundamentals are interesting when you see why they exist. Happy math.
I see the dot product as directional multiplication. But multiplication goes beyond repeated counting: it’s applying the essence of one item to another.
Normal multiplication combines growth rates: “3 x 4” can mean “Take your 3x growth and make it 4x larger (i.e., 12x)”. Complex multiplication lets us combine rotations. Integrals let us do piece-by-piece multiplication.
A vector is “growth in a direction”. The dot product lets us apply the directional growth of one vector to another: the result is how much we went along the original path (positive progress, negative, or zero).
Today let’s build our intuition for how the dot product works.
Getting the Formula Out of the Way
You’ve seen the dot product equation everywhere:
a · b = ax * bx + ay * by = |a| |b| cos(θ)
And also the justification: “Well Billy, the Law of Cosines (you remember that, don’t you?) says the following calculations are the same, so they are.” Not good enough — it doesn’t click! Beyond the computation, what does it mean?
The goal is to apply one vector to another. Each computation examines this from a rectangular perspective (x- and y-coordinates) or a polar one (magnitudes and angles). The “blah = foo” equation above really means “Here’s two equivalent ways to ‘directionally multiply’ vectors”.
(Similarly, we can show that Euler’s formula (e^ix = cos(x) + i*sin(x)) is true because the Taylor series is the same on both sides. Accurate but unsatisfying! Instead, see how both sides can describe the same motion.)
Seeing Numbers as vectors
Let’s start simple, and see 3 x 4 as a dot product:
The number 3 is “directional growth” in a single dimension (x-axis, let’s say), and 4 is “directional growth” in that same direction. 3 x 4 = 12 means 12x growth in that single dimension. Ok.
Now, suppose each number refers to a different dimension. Let’s say 3 means “triple your bananas” (sigh… or “x-axis”) and 4 means “quadruple your oranges” (y-axis). They’re not the same type of number: what happens when we apply growth, aka use the dot product, in our “bananas, oranges” universe?
(3,0) is “Triple your bananas, destroy your oranges”
(0,4) is “Destroy your bananas, quadruple your oranges”
Applying (0,4) to (3,0) means “Destroy your banana growth, quadruple your orange growth”. But (3, 0) had no orange growth to begin with, so the net result is 0 (“Destroy all your fruit, buddy”).
See how we’re “applying” and not adding. With addition, we sort of smush the items together: (3,0) + (0, 4) = (3, 4) [a vector which triples your bananas and quadruples your oranges].
“Application” is different. We’re mutating the original vector according to the rules in the second. And the rules are “Destroy your banana growth rate, and quadruple your orange growth rate”. And, sadly, this leaves us with nothing.
The final result of this process can be:
zero: we don’t have any growth in the original direction
positive number: we have some growth in the original direction
negative number: we have negative (reverse) growth in the original direction
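All three outcomes show up with the bananas-and-oranges vectors. This sketch assumes the usual component-wise formula (multiply matching dimensions, then add):

```python
def dot(a, b):
    # Pair up matching dimensions, multiply, and add the results.
    return sum(ai * bi for ai, bi in zip(a, b))

print(dot((3, 0), (4, 0)))    # same dimension: 12
print(dot((3, 0), (0, 4)))    # different dimensions: 0
print(dot((3, 0), (-4, 0)))   # opposed directions: -12
```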
Understanding the Calculation
“Applying vectors” is still a bit abstract. I think “How much energy/push is one vector giving to the other?”. Here’s how I visualize it:
Like multiplying complex numbers, see how each x- and y-component interacts:
We list out all four combinations (x&x, y&x, x&y, y&y). Since the x- and y-coordinates don’t affect each other (like holding a bucket sideways under a waterfall: nothing falls in), the total energy absorption is absorption(x) + absorption(y):
a · b = ax * bx + ay * by
Polar coordinates: Projection
The word “projection” is so sterile: I prefer “along the path”. How much energy is actually going in our original direction?
Here’s one way to see it:
Take two vectors, a and b. Rotate our coordinates so b is horizontal: it becomes (|b|, 0), and everything is on this new x-axis. What’s the dot product now? (It shouldn’t change just because we tilted our head).
Well, vector a has new coordinates (a1, a2), and we get:
a · b = a1 * |b| + a2 * 0 = a1 * |b|
a1 is really “What is the x-coordinate of a, assuming b is the x-axis?”. That is |a|cos(θ), aka the “projection”:
a · b = |a| |b| cos(θ)
Analogies for the Dot Product
The common interpretation is “geometric projection”, but it’s so bland. Here’s some analogies that click for me:
One vector is the solar rays, the other is where the solar panel is pointing (yes, yes, the normal vector). Larger numbers mean stronger rays or a larger panel. How much energy is absorbed?
Energy = Overlap in direction * Strength of rays * Size of panel
If you hold your panel sideways to the sun, no rays hit (cos(θ) = 0).
But… but… solar rays are leaving the sun, and the panel is facing the sun, and the dot product is negative when vectors are opposed! Take a deep breath, and remember the goal is to embrace the analogy (besides, physicists lose track of negative signs all the time).
Mario-Kart Speed Boost
In Mario Kart, there are “boost pads” on the ground that increase your speed (Never played? I’m sorry.)
Imagine the red vector is your speed (x and y direction), and the blue vector is the orientation of the boost pad (x and y direction). Larger numbers are more power.
How much boost will you get? For the analogy, imagine the pad multiplies your speed:
If you come in going 0, you’ll get nothing
If you cross the pad perpendicularly, you’ll get 0 [just like the banana obliteration, it will give you 0x boost in the perpendicular direction]
But, if we have some overlap, our x-speed will get an x-boost, and our y-speed gets a y-boost:
Neat, eh? Another way to see it: your incoming speed is |a|, and the max boost is |b|. The amount of boost you actually get (for being lined up with it) is cos(θ), for the total |a||b|cos(θ).
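The three boost-pad cases can be sketched in Python. The pad strength, the speeds, and the function names `boost` and `boost_polar` are made up for the analogy:

```python
import math

def boost(speed, pad):
    """Component view: x-speed gets the x-boost, y-speed gets the y-boost."""
    return speed[0] * pad[0] + speed[1] * pad[1]

def boost_polar(speed, pad):
    """Same boost, seen as |a| |b| cos(theta)."""
    theta = math.atan2(speed[1], speed[0]) - math.atan2(pad[1], pad[0])
    return math.hypot(*speed) * math.hypot(*pad) * math.cos(theta)

pad = (2.0, 0.0)  # hypothetical pad pointing along x with strength 2

print(boost((0.0, 0.0), pad))  # 0.0 (coming in at speed 0: nothing to boost)
print(boost((0.0, 3.0), pad))  # 0.0 (crossing perpendicularly: no overlap)
print(boost((3.0, 4.0), pad))  # 6.0 (partial overlap: only the x-speed counts)
print(round(boost_polar((3.0, 4.0), pad), 6))  # 6.0 (5 * 2 * cos(theta), cos = 0.6)
```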
Physics Physics Physics
The dot product appears all over physics: some field (electric, gravitational) is pulling on some particle. We’d love to multiply, and we could if everything were lined up. But that’s never the case, so we take the dot product to account for potential differences in direction.
It’s all a useful generalization: Integrals are “multiplication, taking changes into account” and the dot product is “multiplication, taking direction into account”.
Logarithms are everywhere. Ever use any of the following phrases?
Order of magnitude
You’re describing numbers in terms of their powers of 10 — a logarithm. Ever mention an interest rate or rate of return? It’s the logarithm of your growth.
Surprised that logarithms are so common? Me too. Many attempts at Math In the Real World are attempts to point out logarithms in some arcane formula or pretending we’re geologists fascinated by the Richter Scale. “Scientists care about logs, and you should too. Also, can you imagine a world without zinc?”
No, no, no, no no, no no! (Mama mia!)
Math expresses concepts with notation like “ln” or “log”. Finding “math in the real world” means encountering ideas in life and seeing how they could be written with notation. Don’t look for the literal symbols! When was the last time you wrote a division sign? When was the last time you chopped up some food?
Ok, ok, we get it: what are logarithms about?
Logarithms find the cause for an effect, i.e. the input for some output
A common “effect” is seeing something grow, like going from $100 to $150 in 5 years. How did this happen? We’re not sure, but the logarithm finds a possible cause: A continuous return of ln(150/100) / 5 = 8.1% would account for that change. It might not be the actual cause (did all the growth happen in the final year?), but it’s a smooth average we can compare to other changes.
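Here's that calculation as a quick Python sketch (the function name `continuous_rate` is mine), along with a check that the rate really does reproduce the effect:

```python
import math

def continuous_rate(start, end, years):
    """Smooth continuous growth rate that turns `start` into `end` over `years`."""
    return math.log(end / start) / years

rate = continuous_rate(100, 150, 5)
print(f"{rate:.1%}")  # 8.1%

# Check: growing continuously at that rate for 5 years lands on $150 again.
print(round(100 * math.exp(rate * 5), 2))  # 150.0
```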
By the way, the notion of “cause and effect” is nuanced. Why is 1000 bigger than 100?
100 is 10 which grew by itself for 2 time periods (10 * 10)
1000 is 10 which grew by itself for 3 time periods (10 * 10 * 10)
We can think of numbers as outputs (1000 is “1000 outputs”) and inputs (“How many times does 10 need to grow to make those outputs?”). So,
1000 outputs > 100 outputs
3 inputs > 2 inputs [i.e., because log(1000) > log(100)]
Why is this useful?
Logarithms put numbers on a human-friendly scale.
Large numbers break our brains. Millions and trillions are “really big” even though a million seconds is 12 days and a trillion seconds is 30,000 years. It’s the difference between an American vacation year and the entirety of human civilization.
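Don't take my word for the seconds math. Here's a rough check in Python, assuming a 365.25-day year:

```python
SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25  # assumption: average year length

million_in_days = 1e6 / SECONDS_PER_DAY
trillion_in_years = 1e12 / SECONDS_PER_YEAR

print(round(million_in_days, 1))  # 11.6 days, i.e. roughly 12
print(round(trillion_in_years))   # a bit over 30,000 years
```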
The trick to overcoming “huge number blindness” is to write numbers in terms of “inputs” (i.e. their power base 10). This smaller scale (0 to 100) is much easier to grasp:
power of 0 = 10^0 = 1 (single item)
power of 1 = 10^1 = 10
power of 3 = 10^3 = thousand
power of 6 = 10^6 = million
power of 9 = 10^9 = billion
power of 12 = 10^12 = trillion
power of 23 = 10^23 = roughly the number of atoms in a dozen grams of carbon
power of 80 = 10^80 = rough estimate of the number of atoms in the universe
A 0 to 80 scale took us from a single item to the number of things in the universe. Not too shabby.
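To see the “inputs” view in action, here's a small Python sketch that turns a few of the numbers above back into their powers of 10 (the dictionary is just my own sample):

```python
import math

quantities = {
    "single item": 1,
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
    "trillion": 1_000_000_000_000,
}

# log10 answers: "how many times did 10 need to grow to make this output?"
for name, value in quantities.items():
    print(f"{name}: power of {math.log10(value):.0f}")
```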
Logarithms count multiplication as steps
Logarithms describe changes in terms of multiplication: in the examples above, each step is 10x bigger. With the natural log, each step is “e” (2.71828…) times more.
When dealing with a series of multiplications, logarithms help “count” them, just like addition counts for us when effects are added.
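A tiny Python sketch of “counting multiplications”: each 10x step adds 1 on the log scale, so a chain of multiplications becomes a running sum.

```python
import math

steps = [10, 10, 10]        # three 10x multiplications
product = math.prod(steps)  # 1000

# Each multiplication becomes one addition on the log scale.
count = sum(math.log10(s) for s in steps)
print(count, math.log10(product))  # 3.0 3.0
```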
Show me the math
Time for the meat: let’s see where logarithms show up!
Six-figure salary or 2-digit expense
We’re describing numbers in terms of their digits, i.e. how many powers of 10 they have (are they in the tens, hundreds, thousands, ten-thousands, etc.). Adding a digit means “multiplying by 10”, i.e. 1 * 10 * 10 * 10 * 10 * 10 = 10^5 = 100,000.
Logarithms count the number of multiplications added on, so starting with 1 (a single digit) we add 5 more digits (10^5) and get 100,000, a 6-figure result. Talking about “6” instead of “One hundred thousand” is the essence of logarithms. It gives a rough sense of scale without jumping into details.
Bonus question: How would you describe 500,000? Saying “6 figure” is misleading because 6-figures often implies something closer to 100,000. Would “6.5 figure” work?
Not really. In our heads, 6.5 means “halfway” between 6 and 7 figures, but that’s an adder’s mindset. With logarithms a “.5” means halfway in terms of multiplication, i.e. the square root (9^.5 means the square root of 9; 3 is halfway in terms of multiplication because going from 1 to 3 and from 3 to 9 are both 3x steps).
Taking log(500,000) we get 5.7, add 1 for the extra digit, and we can say “500,000 is a 6.7 figure number”.
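Here's a minimal Python version of that “fractional figures” idea. The name `figures` is my own label, not standard terminology:

```python
import math

def figures(n):
    """Fractional digit count: log10(n), plus 1 for the leading digit."""
    return math.log10(n) + 1

print(round(figures(100_000), 1))  # 6.0
print(round(figures(500_000), 1))  # 6.7

# A true "6.5 figure" number is halfway in terms of multiplication:
print(round(10 ** 5.5))            # 316228, i.e. 100,000 * sqrt(10)
```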
Order of magnitude
We geeks love this phrase. It means roughly “10x difference” but just sounds cooler than “1 digit larger”.
In computers, where everything is counted with bits (1 or 0), each bit has a doubling effect (not 10x). So going from 8 to 16 bits is “8 orders of magnitude” or 2^8 = 256 times larger. (These bit sizes refer to the amount of memory available, not the processor speed). Going from 16 to 32 bits means 16 orders of magnitude, or 2^16 = 65,536 times larger.
Isn’t “16 extra bits of memory” better than “65,536 times more memory”?
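The bit version of “orders of magnitude” in a few lines of Python (the helper name `times_larger` is just for illustration):

```python
import math

def times_larger(old_bits, new_bits):
    """Each extra bit doubles the number of values we can address."""
    return 2 ** (new_bits - old_bits)

print(times_larger(8, 16))   # 256
print(times_larger(16, 32))  # 65536

# Going the other way: log base 2 recovers the number of doublings.
print(math.log2(65536))      # 16.0
```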
How do we figure out growth rates? A country doesn’t intend to grow at 8.56% per year. You look at the GDP one year and the GDP the next, and take the logarithm to find the implicit growth rate.
My two favorite interpretations of the natural logarithm (ln(x)), using the natural log of 1.5 as an example:
Assuming 100% growth, how long do you need to grow to get to 1.5? (.405, less than half the time period)
Assuming 1 unit of time, how fast do you need to grow to get to 1.5? (40.5% per year, continuously compounded)
Logarithms are how we figure out how fast we’re growing.
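Both readings of ln(1.5) can be checked in Python. The GDP figures at the end are invented, purely to show reading a rate off two measurements:

```python
import math

x = math.log(1.5)
print(round(x, 3))  # 0.405

# Interpretation 1: 100% continuous growth for 0.405 units of time.
print(round(math.exp(1.00 * x), 3))  # 1.5
# Interpretation 2: 40.5% continuous growth for 1 unit of time.
print(round(math.exp(x * 1), 3))     # 1.5

# Hypothetical GDP, one year apart: the log gives the implicit growth rate.
gdp_last_year, gdp_this_year = 1000.0, 1050.0
implied = math.log(gdp_this_year / gdp_last_year)
print(f"{implied:.2%}")  # 4.88% (continuously compounded)
```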
Measurement Scale: Google PageRank
Google gives every page on the web a score (PageRank) which is a rough measure of authority / importance. This is a logarithmic scale, which in my head means “PageRank counts the number of digits in your score”.
So, a site with PageRank 2 (“2 digits”) is 10x more popular than a PageRank 1 site. My site is PageRank 5 and CNN has PageRank 9, so there’s a difference of 4 orders of magnitude (10^4 = 10,000).
Roughly speaking, I get about 7000 visits / day. Using my envelope math, I can guess CNN gets about 7000 * 10,000 = 70 million visits / day. (How’d I do that? In my head, I think 7k * 10k = 70 * k * k = 70 * M). They might have a few times more than that (100M, 200M) but probably not up to 700M.
Google conveys a lot of information with a very rough scale (1-10).
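The envelope math above, as a Python sketch. The 10x-per-point rule is my rough mental model from this section, not a published Google formula, and `estimated_visits` is a made-up helper:

```python
def estimated_visits(my_visits, my_rank, their_rank):
    """Scale traffic by 10x per PageRank point (rough model, not Google's formula)."""
    return my_visits * 10 ** (their_rank - my_rank)

print(estimated_visits(7000, 5, 9))  # 70000000, i.e. about 70 million visits / day
```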
Measurement Scale: Richter, Decibel, etc.
Sigh. We’re at the typical “logarithms in the real world” example: Richter scale and Decibel. The idea is to put events which can vary drastically (earthquakes) on a single 1 – 10 scale. Just like PageRank, each 1-point increase is a 10x jump, here in measured shaking (and roughly 32x in energy released).
Decibels are similar, though they can be negative. Sounds can go from intensely quiet (pindrop) to extremely loud (airplane) and our brains can process it all. In reality, the sound of an airplane’s engine is millions (billions, trillions) of times more powerful than a pindrop, and it’s inconvenient to have a scale that goes from 1 to a gazillion. Logs keep everything on a reasonable scale.
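For the curious, here's a sketch using the standard decibel definition for power ratios, 10 * log10(ratio). The ratios are round numbers I picked, not measurements of real planes or pins:

```python
import math

def decibels(power_ratio):
    """Decibels for a power ratio relative to some reference: 10 * log10(ratio)."""
    return 10 * math.log10(power_ratio)

print(round(decibels(1e6), 1))   # 60.0   (a million times more powerful)
print(round(decibels(1e12), 1))  # 120.0  (a trillion times more powerful)
print(round(decibels(0.1), 1))   # -10.0  (quieter than the reference: negative)
```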
You’ll often see items plotted on a “log scale”. In my head, this means one side is counting “number of digits” or “number of multiplications”, not the value itself. Again, this helps show wildly varying events on a single scale (going from 1 to 10, not 1 to billions).
Moore’s law is a great example: we double the number of transistors every 18 months (image courtesy Wikipedia).
The neat thing about log-scale graphs is exponential changes (processor speed) appear as a straight line. Growing 10x per year means you’re steadily marching up the “digits” scale.
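You can see the straight line without drawing a graph. This Python sketch uses a hypothetical chip that starts at 1000 transistors and doubles every 18 months; the raw counts explode, but the gaps between their logs stay constant:

```python
import math

def transistors(years, start=1000, doubling_years=1.5):
    """Hypothetical chip whose transistor count doubles every 18 months."""
    return start * 2 ** (years / doubling_years)

counts = [transistors(y) for y in range(0, 13, 3)]  # sample every 3 years
log_counts = [math.log10(c) for c in counts]

# Equal gaps on the log scale are what a straight line looks like.
gaps = [round(b - a, 3) for a, b in zip(log_counts, log_counts[1:])]
print([round(c) for c in counts])  # [1000, 4000, 16000, 64000, 256000]
print(gaps)                        # [0.602, 0.602, 0.602, 0.602]
```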
Onward and upward
If a concept is well-known but not well-loved, it means we need to build our intuition. Find the analogies that work, and don’t settle for the slop a textbook will trot out. In my head:
Logarithms find the root cause for an effect (see growth, find interest rate)
They help count multiplications or digits, with the bonus of partial counts (500k is a 6.7 digit number)