
An Intuitive Introduction To Limits

Limits, the Foundations Of Calculus, seem so artificial and weasely: “Let x approach 0, but not get there, yet we’ll act like it’s there… ” Ugh.

Here’s how I learned to enjoy them:

  • What is a limit? Our best prediction of a point we didn’t observe.
  • How do we make a prediction? Zoom into the neighboring points. If our prediction is always in-between neighboring points, no matter how much we zoom, that’s our estimate.
  • Why do we need limits? Math has “black hole” scenarios (dividing by zero, going to infinity), and limits give us a reasonable estimate.
  • How do we know we’re right? We don’t. Our prediction, the limit, isn’t required to match reality. But for most natural phenomena, it sure seems to.

Limits let us ask “What if?”. If we can directly observe a function at a value (like x=0, or x growing infinitely), we don’t need a prediction. The limit wonders, “If you can see everything except a single value, what do you think is there?”.

When our prediction is consistent and improves the closer we look, we feel confident in it. And if the function behaves smoothly, like most real-world functions do, the limit is where the missing point must be.

Key Analogy: Predicting A Soccer Ball

Pretend you’re watching a soccer game. Unfortunately, the connection is choppy:

(Image: a choppy soccer video, showing the ball at 3:59 and 4:01 but missing 4:00)

Ack! We missed what happened at 4:00. Even so, what’s your prediction for the ball’s position?

Easy. Just grab the neighboring instants (3:59 and 4:01) and predict the ball to be somewhere in-between.

And… it works! Real-world objects don't teleport; they move through intermediate positions along their path from A to B. Our prediction is "At 4:00, the ball was between its positions at 3:59 and 4:01". Not bad.

With a slow-motion camera, we might even say “At 4:00, the ball was between its positions at 3:59.999 and 4:00.001″.

Our prediction is feeling solid. Can we articulate why?

  • The predictions agree at increasing zoom levels. Imagine the 3:59-4:01 range was 9.9-10.1 meters, but after zooming into 3:59.999-4:00.001, the range widened to 9-12 meters. Uh oh! Zooming should narrow our estimate, not make it worse! Not every zoom level needs to be accurate (imagine seeing the game every 5 minutes), but to feel confident, there must be some threshold where subsequent zooms only strengthen our range estimate.

  • The before-and-after agree. Imagine at 3:59 the ball was at 10 meters, rolling right, and at 4:01 it was at 50 meters, rolling left. What happened? We had a sudden jump (a camera change?) and now we can’t pin down the ball’s position. Which one had the ball at 4:00? This ambiguity shatters our ability to make a confident prediction.

With these requirements in place, we might say “At 4:00, the ball was at 10 meters. This estimate is confirmed by our initial zoom (3:59-4:01, which estimates 9.9 to 10.1 meters) and the following one (3:59.999-4:00.001, which estimates 9.999 to 10.001 meters)”.

Limits are a strategy for making confident predictions.

Exploring The Intuition

Let’s not bring out the math definitions just yet. What things, in the real world, do we want an accurate prediction for but can’t easily measure?

What’s the circumference of a circle?

Finding pi “experimentally” is tough: bust out a string and a ruler?

We can’t measure a shape with seemingly infinite sides, but we can wonder “Is there a predicted value for pi that is always accurate as we keep increasing the sides?”

Archimedes figured out that pi had a range of

\displaystyle{3 \frac{10}{71} < \pi < 3 \frac{1}{7} }

using polygons drawn inside and outside the circle, with more and more sides.

It was the precursor to calculus: he determined that pi was a number that stayed between his ever-shrinking boundaries. Nowadays, we have modern limit definitions of pi.
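
Here's a tiny sketch of that squeeze (Python isn't part of the original argument, and it leans on the library's trig functions, so treat it as an illustration of the shrinking bounds rather than a derivation):

import math

# Half-perimeters of regular polygons drawn inside (sin) and outside (tan) a unit circle.
# More sides => a tighter squeeze around pi.
for sides in [6, 12, 24, 48, 96]:
    lower = sides * math.sin(math.pi / sides)   # inscribed polygon
    upper = sides * math.tan(math.pi / sides)   # circumscribed polygon
    print(f"{sides:3d} sides: {lower:.5f} < pi < {upper:.5f}")

# 96 sides gives roughly 3.14103 < pi < 3.14271, matching Archimedes' range.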

What does perfectly continuous growth look like?

e, one of my favorite numbers, can be defined like this:

\displaystyle{e = \lim_{n\to\infty} \left( 1 + \frac{1}{n} \right)^n}

We can’t easily measure the result of infinitely-compounded growth. But, if we could make a prediction, is there a single rate that is ever-accurate? It seems to be around 2.71828…
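
A quick numerical check (a few lines of Python, just to watch the prediction settle; not part of any formal proof):

# Compute (1 + 1/n)^n for ever-larger n and watch it home in on one value.
for n in [1, 10, 100, 10_000, 1_000_000]:
    print(n, (1 + 1/n) ** n)

# 2.0, 2.59374..., 2.70481..., 2.71815..., 2.71828... -- settling toward e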

Can we use simple shapes to measure complex ones?

Circles and curves are tough to measure, but rectangles are easy. If we could use an infinite number of rectangles to simulate curved area, can we get a result that withstands infinite scrutiny? (Maybe we can find the area of a circle.)

Can we find the speed at an instant?

Speed is funny: it needs a before-and-after measurement (distance traveled / time taken), but can’t we have a speed at individual instants? Hrm.

Limits help answer this conundrum: predict your speed when traveling to a neighboring instant. Then ask the “impossible question”: what’s your predicted speed when the gap to the neighboring instant is zero?

Note: The limit isn’t a magic cure-all. We can’t assume one exists, and there may not be an answer to every question. For example: Is the number of integers even or odd? The quantity is infinite, and neither the “even” nor “odd” prediction stays accurate as we count higher. No well-supported prediction exists.

For pi, e, and the foundations of calculus, smart minds did the proofs to determine that “Yes, our predicted values get more accurate the closer we look.” Now I see why limits are so important: they’re a stamp of approval on our predictions.

The Math: The Formal Definition Of A Limit

Limits are well-supported predictions. Here’s the official definition:

\displaystyle{ \lim_{x \to c}f(x) = L } means for all real ε > 0 there exists a real δ > 0 such that for all x with 0 < |x − c| < δ, we have |f(x) − L| < ε

Let’s make this readable:

Math English → Human English:

  • \lim_{x \to c} f(x) = L means → When we “strongly predict” that f(c) = L, we mean
  • for all real ε > 0 → for any error margin we want (+/- .1 meters)
  • there exists a real δ > 0 → there is a zoom level (+/- .1 seconds)
  • such that for all x with 0 < |x − c| < δ, we have |f(x) − L| < ε → where the prediction stays accurate to within the error margin

There are a few subtleties here:

  • The zoom level (delta, δ) is the function input, i.e. the time in the video
  • The error margin (epsilon, ε) is the most the function output (the ball’s position) can differ from our prediction throughout the entire zoom level
  • The absolute value condition (0 < |x − c| < δ) means positive and negative offsets must work, and we’re skipping the black hole itself (when |x – c| = 0).

We can’t evaluate the black hole input, but we can say “Except for the missing point, the entire zoom level confirms the prediction f(c) = L.” And because f(c) = L holds for any error margin we can find, we feel confident.

Could we have multiple predictions? Imagine we predicted L1 and L2 for f(c). There’s some difference between them (call it .1), therefore there’s some error margin (.01) that would reveal the more accurate one. Every function output in the range can’t be within .01 of both predictions. We either have a single, infinitely-accurate prediction, or we don’t.

Yes, we can get cute and ask for the “left hand limit” (prediction from before the event) and the “right hand limit” (prediction from after the event), but we only have a real limit when they agree.

A function is continuous when it always matches the predicted value (and discontinuous if not):

\displaystyle{\lim_{x \to c}{f(x)} = f(c)}

Calculus typically studies continuous functions, playing the game “We’re making predictions, but only because we know they’ll be correct.”

The Math: Showing The Limit Exists

We have the requirements for a solid prediction. Questions asking you to “Prove the limit exists” ask you to justify your estimate.

For example: Prove the limit at x=2 exists for

\displaystyle{f(x) = \frac{(2x+1)(x-2)}{(x - 2)}}

The first check: do we even need a limit? Unfortunately, we do: just plugging in “x=2″ means we have a division by zero. Drats.

But intuitively, we see the same “zero” (x – 2) could be cancelled from the top and bottom. Here’s how to dance this dangerous tango:

  • Assume x is anywhere except 2 (It must be! We’re making a prediction from the outside.)
  • We can then cancel (x – 2) from the top and bottom, since it isn’t zero.
  • We’re left with f(x) = 2x + 1. This function can be used outside the black hole.
  • What does this simpler function predict? That f(2) = 2*2 + 1 = 5.

So f(2) = 5 is our prediction. But did you see the sneakiness? We pretended x wasn’t 2 [to divide out (x-2)], then plugged in 2 after that troublesome item was gone! Think of it this way: we used the simple behavior from outside the event to predict the gnarly behavior at the event.

We can prove these shenanigans give a solid prediction, and that f(2) = 5 is infinitely accurate.

For any accuracy threshold (ε), we need to find the “zoom range” (δ) where we stay within the given accuracy. For example, can we keep the estimate between +/- 1.0?

Sure. We need to find out where

\displaystyle{|f(x) - 5| < 1.0}

so


\begin{align*}
|2x + 1 - 5| &< 1.0 \\
|2x - 4| &< 1.0 \\
|2(x - 2)| &< 1.0 \\
2|(x - 2)| &< 1.0 \\
|x - 2| &< 0.5
\end{align*}

In other words, x must stay within 0.5 of 2 to maintain the initial accuracy requirement of 1.0. Indeed, when x is between 1.5 and 2.5, f(x) goes from f(1.5) = 4 to f(2.5) = 6, staying +/- 1.0 from our predicted value of 5.

We can generalize to any error tolerance (ε) by plugging it in for 1.0 above. We get:

\displaystyle{|x - 2| < 0.5 \cdot \epsilon}

If our zoom level is “δ = 0.5 * ε”, we’ll stay within the original error. If our error is 1.0 we need to zoom to .5; if it’s 0.1, we need to zoom to 0.05.
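
If you'd like to see that relationship numerically, here's a quick sanity check (a Python sketch; it samples points rather than proving anything, so it supplements the algebra above rather than replacing it):

def f(x):
    # The original function, undefined exactly at x = 2.
    return (2*x + 1) * (x - 2) / (x - 2)

c, L = 2, 5
for eps in [1.0, 0.1, 0.01]:
    delta = eps / 2                     # the zoom level we solved for
    samples = [c - 0.99*delta, c - 0.5*delta, c + 0.5*delta, c + 0.99*delta]
    assert all(abs(f(x) - L) < eps for x in samples)
    print(f"error margin {eps}: every sample within {delta} of 2 stays inside the margin")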

This simple function was a convenient example. The idea is to start with the initial constraint (|f(x) – L| < ε), plug in f(x) and L, and solve for the distance away from the black-hole point (|x – c| < ?). It’s often an exercise in algebra.

Sometimes you’re asked to simply find the limit (plug in 2 and get f(2) = 5), other times you’re asked to prove a limit exists, i.e. crank through the epsilon-delta algebra.

Flipping Zero and Infinity

Infinity, when used in a limit, means “grows without stopping”. The symbol ∞ is no more a number than the sentence “grows without stopping” or “my supply of underpants is dwindling”. They are concepts, not numbers (for our level of math, Aleph me alone).

When using ∞ in a limit, we’re asking: “As x grows without stopping, can we make a prediction that remains accurate?”. If there is a limit, it means the predicted value is always confirmed, no matter how far out we look.

But, I still don’t like infinity because I can’t see it. But I can see zero. With limits, you can rewrite

\displaystyle{\lim_{x \to \infty}}

as

\displaystyle{\lim_{\frac{1}{x} \to 0}}

You can get sneaky and define y = 1/x, replace items in your formula, and then use

\displaystyle{\lim_{y \to 0^+}}

so it looks like a normal problem again! (Note from Tim in the comments: the limit is coming from the right, since x was going to positive infinity). I prefer this arrangement, because I can see the location we’re narrowing in on (we’re always running out of paper when charting the infinite version).

Why Aren’t Limits Used More Often?

Imagine a kid who figured out that “Putting a zero on the end” made a number 10x larger. Have 5? Write down “5″ then “0″ or 50. Have 100? Make it 1000. And so on.

He didn’t figure out why multiplication works, why this rule is justified… but, you’ve gotta admit, he sure can multiply by 10. Sure, there are some edge cases (Would 0 become “00″?), but it works pretty well.

The rules of calculus were discovered informally (by modern standards). Newton deduced that “The derivative of x^3 is 3x^2” without rigorous justification. Yet engines whir and airplanes fly based on his unofficial results.

The calculus pedagogy mistake is creating a roadblock like “You must know Limits™ before appreciating calculus”, when it’s clear the inventors of calculus didn’t. I’d prefer this progression:

  • Calculus asks seemingly impossible questions: When can rectangles measure a curve? Can we detect instantaneous change?
  • Limits give a strategy for answering “impossible” questions (“If you can make a prediction that withstands infinite scrutiny, we’ll say it’s ok.”)
  • They’re a great tag-team: Calculus explores, limits verify. We memorize shortcuts for the results we verified with limits (d/dx x^3 = 3x^2), just like we memorize shortcuts for the rules we verified with multiplication (adding a zero means times 10). But it’s still nice to know why the shortcuts are justified.

Limits aren’t the only tool for checking the answers to impossible questions; infinitesimals work too. The key is understanding what we’re trying to predict, then learning the rules of making predictions.

Happy math.

Understanding Bayes Theorem With Ratios

My first intuition about Bayes Theorem was “take evidence and account for false positives”. Does a lab result mean you’re sick? Well, how rare is the disease, and how often do healthy people test positive? Misleading signals must be considered.

This helped me muddle through practice problems, but I couldn’t think with Bayes. The big obstacles:

Percentages are hard to reason with. Odds compare the relative frequency of scenarios (A:B) while percentages use a part-to-whole “global scenario” [A/(A+B)]. A coin has equal odds (1:1) or a 50% chance of heads. Great. What happens when heads are 18x more likely? Well, the odds are 18:1, can you rattle off the decimal percentage? (I’ll wait…) Odds require less computation, so let’s start with them.

Equations miss the big picture. Here’s Bayes Theorem, as typically presented:

\displaystyle{\Pr(\mathrm{A}|\mathrm{X}) = \frac{\Pr(\mathrm{X}|\mathrm{A})\Pr(\mathrm{A})}{\Pr(\mathrm{X|A})\Pr(A)+ \Pr(\mathrm{X|\sim A})\Pr(\sim A)}}

It reads right-to-left, with a mess of conditional probabilities. How about this version:

original odds * evidence adjustment = new odds

Bayes is about starting with a guess (1:3 odds for rain:sunshine), taking evidence (it’s July in the Sahara, sunshine 1000x more likely), and updating your guess (1:3000 chance of rain:sunshine). The “evidence adjustment” is how much better, or worse, we feel about our odds now that we have extra information (if it was December in Seattle, you might say rain was 1000x as likely).

Let’s start with ratios and sneak up to the complex version.

Caveman Statistician Og

Og just finished his CaveD program, and runs statistical research for his tribe:

  • He saw 50 deer and 5 bears overall (50:5 odds)
  • At night, he saw 10 deer and 4 bears (10:4 odds)

What can he deduce? Well,

original odds * evidence adjustment = new odds

or

evidence adjustment = new odds / original odds

At night, he realizes deer are 1/4 as likely as they were previously:

10:4 / 50:5 = 2.5 / 10 = 1/4

(Put another way, bears are 4x as likely at night)

Let’s cover ratios a bit. A:B describes how much A we get for every B (imagine miles per gallon as the ratio miles:gallon). Compare values with division: going from 25:1 to 50:1 means you doubled your efficiency (50/25 = 2). Similarly, we just discovered how our “deers per bear” amount changed.

Og happily continues his research:

  • By the river, bears are 20x more likely (he saw 2 deer and 4 bears, so 2:4 / 50:5 = 1:20)
  • In winter, deer are 3x as likely (30 deer and 1 bear, 30:1 / 50:5 = 3:1)

He takes a scenario, compares it to the baseline, and computes the evidence adjustment.

Caveman Clarence subscribes to Og’s journal, and wants to apply the findings to his forest (where deer:bears are 25:1). Suppose Clarence hears an animal approaching:

  • His general estimate is 25:1 odds of deer:bear
  • It’s at night, with bears 4x as likely => 25:4
  • It’s by the river, with bears 20x as likely => 25:80
  • It’s in the winter, with deer 3x more likely => 75:80

Clarence guesses “bear” with near-even odds (75:80) and tiptoes out of there.
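
Here's the same chain of updates as a few lines of Python (nothing new, just the multiplication above done with exact fractions):

from fractions import Fraction

odds = Fraction(25, 1)      # prior: 25 deer for every bear
odds *= Fraction(1, 4)      # night: bears 4x as likely   -> 25:4
odds *= Fraction(1, 20)     # river: bears 20x as likely  -> 25:80
odds *= Fraction(3, 1)      # winter: deer 3x as likely   -> 75:80
print(odds)                 # 15/16, i.e. 75:80 in lowest terms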

That’s Bayes. In fancy language:

  • Start with a prior probability, the general odds before evidence
  • Collect evidence, and determine how much it changes the odds
  • Compute the posterior probability, the odds after updating

Bayesian Spam Filter

Let’s build a spam filter based on Og’s Bayesian Bear Detector.

First, grab a collection of regular and spam email. Record how often a word appears in each:

             spam      normal
hello          3         3
darling        1         5
buy            3         2
viagra         3         0
...

(“hello” appears equally, but “buy” skews toward spam)

We compute odds just like before. Let’s assume incoming email has 9:1 chance of spam, and we see “hello darling”:

  • A generic message has 9:1 odds of spam:regular
  • Adjust for “hello” => keep the 9:1 odds (“hello” is equally-likely in both sets)
  • Adjust for “darling” => 9:5 odds (“darling” appears 5x as often in normal emails)
  • Final chances => 9:5 odds of spam

We’re leaning towards spam (9:5 odds). However, it’s less spammy than our starting odds (9:1), so we let it through.

Now consider a message like “buy viagra”:

  • Prior belief: 9:1 chance of spam
  • Adjust for “buy”: 27:2 (3:2 adjustment towards spam)
  • Adjust for (“viagra”): …uh oh!

“Viagra” never appeared in a normal message. Is it a guarantee of spam?

Probably not: we should intelligently adjust for new evidence. Let’s assume there’s a regular email, somewhere, with that word, and make the “viagra” odds 3:1. Our chances become 27:2 * 3:1 = 81:2.

Now we’re getting somewhere! Our initial 9:1 guess shifts to 81:2. Now is it spam?

Well, how horrible is a false positive?

81:2 odds imply for every 81 spam messages like this, we’ll incorrectly block 2 normal emails. That ratio might be too painful. With more evidence (more words or other characteristics), we might wait for 1000:1 odds before calling a message spam.
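
Here's a toy version of the filter in Python (the word counts are the made-up table above, and the "pretend every word appeared at least once" smoothing is my own crude stand-in for a real smoothing scheme):

# (spam count, normal count) for each word in the training set above.
counts = {"hello": (3, 3), "darling": (1, 5), "buy": (3, 2), "viagra": (3, 0)}

def spam_odds(message, prior=(9, 1)):
    spam, normal = prior
    for word in message.split():
        s, n = counts.get(word, (1, 1))
        # Smoothing: never let a count be zero, so one unseen word
        # can't force infinite (or zero) odds.
        spam *= max(s, 1)
        normal *= max(n, 1)
    return spam, normal

print(spam_odds("hello darling"))   # (27, 15), i.e. 9:5
print(spam_odds("buy viagra"))      # (81, 2), matching the odds above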

Exploring Bayes Theorem

We can check our intuition by seeing if we naturally ask leading questions:

  • Is evidence truly independent? Are there links between animal behavior at night and in the winter, or words that appear together? Sure. We “naively” assume evidence is independent (and yet, in our bumbling, create effective filters anyway).

  • How much evidence is enough? Is seeing 2 deer & 1 bear the same 2:1 evidence adjustment as 200 deer and 100 bears?

  • How accurate were the starting odds in the first place? Prior beliefs change everything. (“A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”)

  • Do absolute probabilities matter? We usually need the most-likely theory (“Deer or bear?”), not the global chance of this scenario (“What’s the probability of deers at night in the winter by the river vs. bears at night in the winter by the river?”). Many Bayesian calculations ignore the global probabilities, which cancel when dividing, and essentially use an odds-centric approach.

  • Can our filter be tricked? A spam message might add chunks of normal text to appear innocuous and “poison” the filter. You’ve probably seen this yourself.

  • What evidence should we use? Let the data speak. Email might have dozens of characteristics (time of day, message headers, country of origin, HTML tags…). Give every characteristic a likelihood factor and let Bayes sort ‘em out.

Thinking With Ratios and Percentages

The ratio and percentage approaches ask slightly different questions:

Ratios: Given the odds of each outcome, how does evidence adjust them?

The evidence adjustment just skews the initial odds, piece-by-piece.

Percentages: What is the chance of an outcome after supporting evidence is found?

In the percentage case,

  • “% Bears” is the overall chance of a bear appearing anywhere
  • “% Bears Going to River” is how likely a bear is to trigger the “river” data point
  • “% Bear at River” is the combined chance of having a bear, and it going to the river. In stats terms, P(event and evidence) = P(event) * P(event implies evidence) = P(event) * P(evidence|event). I see conditional probabilities as “Chances that X implies Y” not the twisted “Chances of Y, given X happened”.

Let’s redo the original cancer example:

  • 1% of the population has cancer
  • 9.6% of healthy people test positive, 80% of people with cancer do

If you see a positive result, what’s the chance of cancer?

Ratio Approach:

  • Cancer:Healthy ratio is 1:99
  • Evidence adjustment: 80/100 : 9.6/100 = 80:9.6 (80% of sick people are “at the river”, and 9.6% of healthy people are).
  • Final odds: 1:99 * 80:9.6 = 80:950.4 (roughly 1:12 odds of cancer, ~7.7% chance)

The intuition: the initial 1:99 odds are pretty skewed. Even with an 8.3x (80:9.6) boost from a positive test result, cancer remains unlikely.

Percentage Approach:

  • Cancer chance is 1%
  • Chance of true positive = 1% * 80% = .008
  • Chance of false positive = 99% * 9.6% = .09504
  • Chance of having cancer = .008 / (.008 + .09504) = 7.7%

When written with percentages, we start from absolute chances. There’s a global 0.8% chance of finding a sick patient with a positive result, and a global 9.504% chance of a healthy patient with a positive result. We then compute the chance these global percentages indicate something useful.
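
Both routes are a couple lines of arithmetic; here they are side-by-side in Python (just re-running the numbers above):

# Ratio approach: skew the odds, convert to a percentage at the end.
cancer, healthy = 1 * 80, 99 * 9.6
print(cancer / (cancer + healthy))        # ~0.0776, about 7.7%

# Percentage approach: absolute chances the whole way through.
true_pos  = 0.01 * 0.80                   # sick AND positive
false_pos = 0.99 * 0.096                  # healthy AND positive
print(true_pos / (true_pos + false_pos))  # same ~7.7%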

Let the approaches be complements: percentages for a bird’s-eye view, and ratios for seeing how individual odds are adjusted. We’ll save the myriad other interpretations for another day.

Happy math.

An Interactive Guide To The Fourier Transform

The Fourier Transform is one of the deepest insights ever made. Unfortunately, the meaning is buried within dense equations:

\displaystyle{X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-i 2 \pi k n / N}}

\displaystyle{x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cdot e^{i 2 \pi k n / N}}

Yikes. Rather than jumping into the symbols, let's experience the key idea firsthand. Here's a plain-English metaphor:

  • What does the Fourier Transform do? Given a smoothie, it finds the recipe.
  • How? Run the smoothie through filters to extract each ingredient.
  • Why? Recipes are easier to analyze, compare, and modify than the smoothie itself.
  • How do we get the smoothie back? Blend the ingredients.

Next, we'll refine the analogy into "math-English":

  • The Fourier Transform takes a time-based pattern, measures each "cycle ingredient" (cycle strength, offset, & rotation speed), and returns the overall "cycle recipe" (frequency graph)

Time for the equations? No! Let's get our hands dirty and experience cycles making patterns with live simulations.

If all goes well, we'll have an aha! moment and intuitively realize why the Fourier Transform is possible. We'll save the detailed math analysis for the follow-up.

This isn't a forced march through the equations, it's the casual stroll I wish I had. Onward!

From Smoothie to Recipe

A math transformation is a change of perspective. We change our notion of quantity from "single items" (lines in the sand, tally system) to "groups of 10" (decimal) depending on what we're counting. Scoring a game? Tally it up. Multiplying? Decimals, please.

The Fourier Transform changes our perspective from consumer to producer, turning What did I see? into How was it made?

In other words: given a smoothie, let's find the recipe.

Why? Well, recipes are great descriptions of drinks. You wouldn't share a drop-by-drop analysis, you'd say "I had an orange/banana smoothie". A recipe is more easily categorized, compared, and modified than the object itself.

So... given a smoothie, how do we find the recipe?

Well, imagine you had a few filters lying around:

  • Pour through the "banana" filter. 1 oz of bananas are extracted.
  • Pour through the "orange" filter. 2 oz of oranges.
  • Pour through the "milk" filter. 3 oz of milk.
  • Pour through the "water" filter. 3 oz of water.

We can reverse-engineer the recipe by filtering each ingredient. The catch?

  • Filters must be independent. The banana filter needs to capture bananas, and nothing else. Adding more oranges should never affect the banana reading.

  • Filters must be complete. We won't get the real recipe if we leave out a filter ("There were mangoes too!"). Our collection of filters must catch every last ingredient.

  • Ingredients must be combine-able. Smoothies can be separated and re-combined without issue (A cookie? Not so much. Who wants crumbs?). The ingredients, when separated and combined in any order, must behave the same.

Seeing The World As Cycles

The Fourier Transform takes a specific viewpoint: What if any signal could be filtered into a bunch of circular paths?

Whoa. This concept is mind-blowing, and poor Joseph Fourier had his idea rejected at first. (Really Joe, even a staircase pattern can be made from circles?)

And despite decades of debate in the math community, we expect students to internalize the idea without issue. Ugh. Let's walk through the intuition.

The Fourier Transform finds the recipe for a signal, like our smoothie process:

  • Start with a time-based signal
  • Apply filters to measure each possible "circular ingredient"
  • Collect the full recipe, listing the amount of each "circular ingredient"

Stop. Here's where most tutorials excitedly throw engineering applications at your face. Don't get scared; think of the examples as "Wow, we're finally seeing the source code (DNA) behind previously confusing ideas".

  • If earthquake vibrations can be separated into "ingredients" (vibrations of different speeds & strengths), buildings can be designed to avoid interacting with the strongest ones.

  • If sound waves can be separated into ingredients (bass and treble frequencies), we can boost the parts we care about, and hide the ones we don't. The crackle of random noise can be removed. Maybe similar "sound recipes" can be compared (music recognition services compare recipes, not the raw audio clips).

  • If computer data can be represented with oscillating patterns, perhaps the least-important ones can be ignored. This "lossy compression" can drastically shrink file sizes (and why JPEG and MP3 files are much smaller than raw .bmp or .wav files).

  • If a radio wave is our signal, we can use filters to listen to a particular channel. In the smoothie world, imagine each person paid attention to a different ingredient: Adam looks for apples, Bob looks for bananas, and Charlie gets cauliflower (sorry bud).

The Fourier Transform is useful in engineering, sure, but it's a metaphor about finding the root causes behind an observed effect.

Think With Circles, Not Just Sinusoids

One of my giant confusions was separating the definitions of "sinusoid" and "circle".

  • A "sinusoid" is a specific back-and-forth pattern (a sine or cosine wave), and 99% of the time, it refers to motion in one dimension.
  • A "circle" is a round, 2d pattern you probably know. If you enjoy using 10-dollar words to describe 10-cent ideas, you might call a circular path a "complex sinusoid".

Labeling a circular path as a "complex sinusoid" is like describing a word as a "multi-letter". You zoomed into the wrong level of detail. Words are about concepts, not the letters they can be split into!

The Fourier Transform is about circular paths (not 1-d sinusoids) and Euler's formula is a clever way to generate one:

(Image: Euler's formula generating a circular path)

Must we use imaginary exponents to move in a circle? Nope. But it's convenient and compact. We can separate the path into real and imaginary parts, but don't forget the big picture: we move in circles.

Following Circular Paths

Let's say we're chatting on the phone and, like usual, I want us to draw the same circular path simultaneously. (You promised!) What should I say?

  • How big is the circle? (Amplitude, i.e. size of radius)
  • How fast do we draw it? (Frequency. 1 circle/second is a frequency of 1 Hertz (Hz) or 2*pi radians/sec)
  • Where do we start? (Phase angle, where 0 degrees is the x-axis)

I could say "2-inch radius, start at 45 degrees, 1 circle per second, go!". After half a second, we should each be pointing to: starting point + amount traveled = 45 + 180 = 225 degrees (on a 2-inch circle).

Every circular path needs a size, speed, and starting angle (amplitude/frequency/phase). We can even combine paths: imagine tiny motorcars, driving in circles at different speeds.

The combined position of all the cycles is our signal, just like the combined flavor of all the ingredients is our smoothie.

Here's a simulation of a basic circular path:

(Based on this animation, here's the source code. Modern browser required. Click the graph to pause/unpause.)

The magnitude of each cycle is listed in order, starting at 0Hz. Cycles [0 1] means

  • 0 strength for the 0Hz cycle (0Hz = a constant cycle, stuck on the x-axis at zero degrees)
  • 1 strength for the 1Hz cycle (completes 1 cycle per time interval)

Now the tricky part:

  • The blue graph measures the real part of the cycle. Another lovely math confusion: the real axis of the circle, which is usually horizontal, has its magnitude shown on the vertical axis. You can mentally rotate the circle 90 degrees if you like.
  • The time points are spaced at the fastest frequency. A 1Hz signal needs 2 time points for a start and stop (a single data point doesn't have a frequency). The time values [1 -1] show the amplitude at these equally-spaced intervals.

With me? [0 1] is a pure 1Hz cycle.

Now let's add a 2Hz cycle to the mix. [0 1 1] means "Nothing at 0Hz, 1Hz of strength 1, 2Hz of strength 1":

Whoa. The little motorcars are getting wild: the green lines are the 1Hz and 2Hz cycles, and the blue line is the combined result. Try toggling the green checkbox to see the final result clearly. The combined "flavor" is a sway that starts at the max and dips low for the rest of the interval.

The yellow dots are when we actually measure the signal. With 3 cycles defined (0Hz, 1Hz, 2Hz), each dot is 1/3 of the way through the signal. In this case, cycles [0 1 1] generate the time values [2 -1 -1], which starts at the max (2) and dips low (-1).

Oh! We can't forget phase, the starting angle! Use magnitude:angle to set the phase. So [0 1:45] is a 1Hz cycle that starts at 45 degrees:

This is a shifted version of [0 1]. On the time side we get [.7 -.7] instead of [1 -1], because our cycle isn't exactly lined up with our measuring intervals, which are still at the halfway point (this could be desired!).

The Fourier Transform finds the set of cycle speeds, strengths and phases to match any time signal.

Our signal becomes an abstract notion that we consider as "observations in the time domain" or "ingredients in the frequency domain".

Enough talk: try it out! In the simulator, type any time or cycle pattern you'd like to see. If it's time points, you'll get a collection of cycles (that combine into a "wave") that matches your desired points.

But… doesn't the combined wave have strange values between the yellow time intervals? Sure. But who's to say whether a signal travels in straight lines, or curves, or zips into other dimensions when we aren't measuring it? It behaves exactly as we need at the equally-spaced moments we asked for.

Making A Spike In Time

Can we make a spike in time, like (4 0 0 0), using cycles? I'll use parentheses () for a sequence of time points, and brackets [] for a sequence of cycles.

Although the spike seems boring to us time-dwellers (that's it?), think about the complexity in the cycle world. Our cycle ingredients must start aligned (at the max value, 4) and then "explode outwards", each cycle with partners that cancel it in the future. Every remaining point is zero, which is a tricky balance with multiple cycles running around (we can't just "turn them off").

Let's walk through each time point:

  • At time 0, the first instant, every cycle ingredient is at its max. Ignoring the other time points, (4 ? ? ?) can be made from 4 cycles (0Hz 1Hz 2Hz 3Hz), each with a magnitude of 1 and phase of 0 (i.e., 1 + 1 + 1 + 1 = 4).

  • At every future point (t = 1, 2, 3), the sum of all cycles must cancel.

Here's the trick: when two cycles are on opposites sides of the circle (North & South, East & West, etc.) their combined position is zero (3 cycles can cancel if they're spread evenly at 0, 120, and 240 degrees).

Imagine a constellation of points moving around the circle. Here's the position of each cycle at every instant:

Time 0 1 2 3 
------------
0Hz: 0 0 0 0 
1Hz: 0 1 2 3
2Hz: 0 2 0 2
3Hz: 0 3 2 1

Notice how the 3Hz cycle starts at 0, gets to position 3, then position "6" (with only 4 positions, 6 modulo 4 = 2), then position "9" (9 modulo 4 = 1).

When our cycle is 4 units long, cycle speeds a half-cycle apart (2 units) will either be lined up (difference of 0, 4, 8…) or on opposite sides (difference of 2, 6, 10…).

OK. Let's drill into each time point:

  • Time 0: All cycles at their max (total of 4)
  • Time 1: 1Hz and 3Hz cancel (positions 1 & 3 are opposites), 0Hz and 2Hz cancel as well. The net is 0.
  • Time 2: 0Hz and 2Hz line up at position 0, while 1Hz and 3Hz line up at position 2 (the opposite side). The total is still 0.
  • Time 3: 0Hz and 2Hz cancel. 1Hz and 3Hz cancel.
  • Time 4 (repeat of t=0): All cycles line up.

The trick is having individual speeds cancel (0Hz vs 2Hz, 1Hz vs 3Hz), or having the lined-up pairs cancel (0Hz + 2Hz vs 1Hz + 3Hz).

When every cycle has equal power and 0 phase, we start aligned and cancel afterwards. (I don't have a nice proof yet -- any takers? -- but you can see it yourself. Try [1 1], [1 1 1], [1 1 1 1] and notice the time spikes: (2 0), (3 0 0), (4 0 0 0)).
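
Here's one way to "see it yourself" numerically (a Python sketch treating each cycle as a point on the unit circle; it checks the pattern rather than proving it):

import cmath

# Each equal-strength, zero-phase cycle is a point on the unit circle; add their
# positions at each time step and watch the spike appear at t=0.
for N in (2, 3, 4):
    totals = [sum(cmath.exp(2j * cmath.pi * k * t / N) for k in range(N))
              for t in range(N)]
    print(N, [round(z.real, 10) for z in totals])
# Each N gives a spike of height N at t=0 and (near-)zero everywhere else,
# i.e. the time patterns (2 0), (3 0 0), (4 0 0 0) from above.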

Here's how I visualize the initial alignment, followed by a net cancellation:

Moving The Time Spike

Not everything happens at t=0. Can we change our spike to (0 4 0 0)?

It seems the cycle ingredients should be similar to (4 0 0 0), but the cycles must align at t=1 (one second in the future). Here's where phase comes in.

Imagine a race with 4 runners. Normal races have everyone lined up at the starting line, the (4 0 0 0) time pattern. Boring.

What if we want everyone to finish at the same time? Easy. Just move people forward or backwards by the appropriate distance. Maybe granny can start 2 feet in front of the finish line, Usain Bolt can start 100m back, and they can cross the tape holding hands.

Phase shifts, the starting angle, are delays in the cycle universe. Here's how we adjust the starting position to delay every cycle 1 second:

  • A 0Hz cycle doesn't move, so it's already aligned
  • A 1Hz cycle goes 1 revolution in the entire 4 seconds, so a 1-second delay is a quarter-turn. Phase shift it 90 degrees backwards (-90) and it gets to phase=0, the max value, at t=1.
  • A 2Hz cycle is twice as fast, so give it twice the angle to cover (-180 or 180 phase shift -- it's across the circle, either way).
  • A 3Hz cycle is 3x as fast, so give it 3x the distance to move (-270 or +90 phase shift)

If time points (4 0 0 0) are made from cycles [1 1 1 1], then time points (0 4 0 0) are made from [1 1:-90 1:180 1:90]. (Note: I'm using "1Hz", but I mean "1 cycle over the entire time period").

Whoa -- we're working out the cycles in our head!

The interference visualization is similar, except the alignment is at t=1.

Test your intuition: Can you make (0 0 4 0), i.e. a 2-second delay? 0Hz has no phase. 1Hz has 180 degrees, 2Hz has 360 (aka 0), and 3Hz has 540 (aka 180), so it's [1 1:180 1 1:180].

Discovering The Full Transform

The big insight: our signal is just a bunch of time spikes! If we merge the recipes for each time spike, we should get the recipe for the full signal.

The Fourier Transform builds the recipe frequency-by-frequency:

  • Separate the full signal (a b c d) into "time spikes": (a 0 0 0) (0 b 0 0) (0 0 c 0) (0 0 0 d)
  • For any frequency (like 2Hz), the tentative recipe is "a/4 + b/4 + c/4 + d/4" (the strength of each spike is split among all frequencies)
  • Wait! We need to offset each spike with a phase delay (the angle for a "1 second delay" depends on the frequency).
  • Actual recipe for a frequency = a/4 (no offset) + b/4 (1 second offset) + c/4 (2 second offset) + d/4 (3 second offset).

We can then loop through every frequency to get the full transform.

Here's the conversion from "math English" to full math. Keeping the 1/N factor in the forward transform (my preference, explained in the notes below), the recipe for each frequency is:

\displaystyle{X_k = \frac{1}{N} \sum_{n=0}^{N-1} x_n \cdot e^{-i 2 \pi k n / N}}

A few notes:

  • N = number of time samples we have
  • n = current sample we're considering (0 .. N-1)
  • x_n = value of the signal at time n
  • k = current frequency we're considering (0 Hertz up to N-1 Hertz)
  • X_k = amount of frequency k in the signal (amplitude and phase, a complex number)
  • The 1/N factor is usually moved to the reverse transform (going from frequencies back to time). This is allowed, though I prefer 1/N in the forward transform since it gives the actual sizes for the time spikes. You can get wild and even use 1/sqrt(N) on both transforms (going forward and back still has the 1/N factor).
  • n/N is the percent of the time we've gone through. 2 * pi * k is our speed in radians / sec. e^-ix is our backwards-moving circular path. The combination is how far we've moved, for this speed and time.
  • The raw equations for the Fourier Transform just say "add the complex numbers". Many programming languages cannot handle complex numbers directly, so you convert everything to rectangular coordinates and add those.
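
Those notes translate almost line-for-line into code. Here's a tiny, direct version in Python (with the 1/N in the forward transform, as I prefer; it's for poking at small examples like (2 -1 -1) above, not a fast FFT):

import cmath

def dft(signal):
    # Recipe for each frequency k: average the signal while spinning it backwards at speed k.
    N = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
                for n, x in enumerate(signal)) / N
            for k in range(N)]

def idft(cycles):
    # Reverse: add the cycles back up (no 1/N here, since the forward transform took it).
    N = len(cycles)
    return [sum(X * cmath.exp(2j * cmath.pi * k * n / N)
                for k, X in enumerate(cycles))
            for n in range(N)]

print([round(abs(X), 3) for X in dft([4, 0, 0, 0])])       # [1.0, 1.0, 1.0, 1.0]
print([round(x.real, 3) for x in idft(dft([2, -1, -1]))])  # recovers [2.0, -1.0, -1.0]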

Onward

This was my most challenging article yet. The Fourier Transform has several flavors (discrete/continuous/finite/infinite), covers deep math (Dirac delta functions), and it's easy to get lost in details. I was constantly bumping into the edge of my knowledge.

But there's always simple analogies out there -- I refuse to think otherwise. Whether it's a smoothie or Usain Bolt & Granny crossing the finish line, take a simple understanding and refine it. The analogy is flawed, and that's ok: it's a raft to use, and leave behind once we cross the river.

I realized how feeble my own understanding was when I couldn't work out the transform of (1 0 0 0) in my head. For me, it was like saying I "knew" addition but, gee whiz, I'm not sure what "1 + 1 + 1 + 1" would be. Why not? Shouldn't we have an intuition for the simplest of operations?

That discomfort led me around the web to build my intuition. In addition to the references in the article, I'd like to thank:

Today's goal was to experience the Fourier Transform. We'll save the advanced analysis for next time.

Happy math.

Appendix: Projecting Onto Cycles

Stuart Riffle has a great interpretation of the Fourier Transform:

Imagine spinning your signal in a centrifuge and checking for a bias. I have a correction: we must spin backwards (the exponent in the equation above needs a negative sign). You already know why: we need a phase delay so spikes appear in the future.

Appendix: Another Awesome Visualization

Lucas Vieira, author of excellent Wikipedia animations, was inspired to make this interactive animation:

(Detailed list of control options)

The Fourier Transform is about cycles added to cycles added to cycles. Try making a "time spike" by setting a strength of 1 for every component (press Enter after inputting each number). Fun fact: with enough terms, you can draw any shape, even Homer Simpson.

Appendix: Using the code

All the code and examples are open source (MIT licensed, do what you like).

An Intuitive Guide to Linear Algebra

Despite two linear algebra classes, my knowledge consisted of “Matrices, determinants, eigen something something”.

Why? Well, let’s try this course format:

  • Name the course “Linear Algebra” but focus on things called matrices and vectors
  • Label items with similar-looking letters (i/j), and even better, similar-looking-and-sounding ones (m/n)
  • Teach concepts like Row/Column order with mnemonics instead of explaining the reasoning
  • Favor abstract examples (2d vectors! 3d vectors!) and avoid real-world topics until the final week

The survivors are physicists, graphics programmers and other masochists. We missed the key insight:

Linear algebra gives you mini-spreadsheets for your math equations.

We can take a table of data (a matrix) and create updated tables from the original. It’s the power of a spreadsheet written as an equation.

Here’s the linear algebra introduction I wish I had, with a real-world stock market example.

What’s in a name?

“Algebra” means, roughly, “relationships”. Grade-school algebra explores the relationship between unknown numbers. Without knowing x and y, we can still work out that (x + y)^2 = x^2 + 2xy + y^2.

“Linear Algebra” means, roughly, “line-like relationships”. Let’s clarify a bit.

Straight lines are predictable. Imagine a rooftop: move forward 3 horizontal feet (relative to the ground) and you might rise 1 foot in elevation (The slope! Rise/run = 1/3). Move forward 6 feet, and you’d expect a rise of 2 feet. Contrast this with climbing a dome: each horizontal foot forward raises you a different amount.

Lines are nice and predictable:

  • If 3 feet forward has a 1-foot rise, then going 10x as far should give a 10x rise (30 feet forward is a 10-foot rise)
  • If 3 feet forward has a 1-foot rise, and 6 feet has a 2-foot rise, then (3 + 6) feet should have a (1 + 2) foot rise

In math terms, an operation F is linear if scaling inputs scales the output, and adding inputs adds the outputs:


\begin{align*}
F(ax) &= a \cdot F(x) \\
F(x + y) &= F(x) + F(y)
\end{align*}

In our example, F(x) calculates the rise when moving forward x feet, and the properties hold:

\displaystyle{F(10 \cdot 3) = 10 \cdot F(3) = 10}

\displaystyle{F(3+6) = F(3) + F(6) = 3}

Linear Operations

An operation is a calculation based on some inputs. Which operations are linear and predictable? Multiplication, it seems.

Exponents (F(x) = x^2) aren’t predictable: 10^2 is 100, but 20^2 is 400. We doubled the input but quadrupled the output.

Surprisingly, regular addition isn’t linear either. Consider the “add three” function:


\begin{align*}
F(x) &= x + 3 \\
F(10) &= 13 \\
F(20) &= 23
\end{align*}

We doubled the input and did not double the output. (Yes, F(x) = x + 3 happens to be the equation for an offset line, but it’s still not “linear” because F(10) isn’t 10 * F(1). Fun.)

Our only hope is to multiply by a constant: F(x) = ax (in our roof example, a=1/3). However, we can still combine linear operations to make a new linear operation:

\displaystyle{G(x, y, z) = F(x + y + z) = F(x) + F(y) + F(z)}

G is made of 3 linear subpieces: if we double the inputs, we’ll double the output.

We have “mini arithmetic”: multiply inputs by a constant, and add the results. It’s actually useful because we can split inputs apart, analyze them individually, and combine the results:

\displaystyle{G(x,y,z) = G(x,0,0) + G(0,y,0) + G(0,0,z)}

If the inputs interacted like exponents, we couldn’t separate them — we’d have to analyze everything at once.
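
To make it concrete, here's the roof example in a few lines of Python (the specific numbers are mine, picked just to exercise the two properties):

def F(x):
    return x / 3                      # rise for x horizontal feet (slope 1/3)

def G(x, y, z):
    return F(x) + F(y) + F(z)         # three linear subpieces

# Inputs can be analyzed separately and recombined:
assert G(3, 6, 9) == G(3, 0, 0) + G(0, 6, 0) + G(0, 0, 9)
# Doubling every input doubles the output:
assert G(2*3, 2*6, 2*9) == 2 * G(3, 6, 9)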

Organizing Inputs and Operations

Most courses hit you in the face with the details of a matrix. “Ok kids, let’s learn to speak. Select a subject, verb and object. Next, conjugate the verb. Then, add the prepositions…”

No! Grammar is not the focus. What’s the key idea?

  • We have a bunch of inputs to track
  • We have predictable, linear operations to perform (our “mini-arithmetic”)
  • We generate a result, perhaps transforming it again

Ok. First, how should we track a bunch of inputs? How about a list:

x
y
z

Not bad. We could write it (x, y, z) too — hang onto that thought.

Next, how should we track our operations? Remember, we only have “mini arithmetic”: multiplications, with a final addition. If our operation F behaves like this:

\displaystyle{F(x, y, z) = 3x + 4y + 5z}

We could abbreviate the entire function as (3, 4, 5). We know to multiply the first input by the first value, the second input by the second value, etc., and add the result.

Only need the first input?

\displaystyle{G(x, y, z) = 3x + 0y + 0z = (3, 0, 0)}

Let’s spice it up: how should we handle multiple sets of inputs? Let’s say we want to run operation F on both (a, b, c) and (x, y, z). We could try this:

\displaystyle{F(a, b, c, x, y, z) = ?}

But it won’t work: F expects 3 inputs, not 6. We should separate the inputs into groups:

1st Input  2nd Input
--------------------
a          x
b          y
c          z

Much neater.

And how could we run the same input through several operations? Have a row for each operation:

F: 3 4 5
G: 3 0 0

Neat. We’re getting organized: inputs in vertical columns, operations in horizontal rows.

Visualizing The Matrix

Words aren’t enough. Here’s how I visualize inputs, operations, and outputs:

(Image: reference diagram of inputs, operations, and outputs)

Imagine “pouring” each input along each operation:

(Image: each input "pouring" through each operation)

As an input passes an operation, it creates an output item. In our example, the input (a, b, c) goes against operation F and outputs 3a + 4b + 5c. It goes against operation G and outputs 3a + 0 + 0.

Time for the red pill. A matrix is a shorthand for our diagrams:

\text{Inputs} = A = \begin{bmatrix} \text{input1}&\text{input2}\end{bmatrix} = \begin{bmatrix}a & x\\b & y\\c & z\end{bmatrix}

\text{Operations} = M = \begin{bmatrix}\text{operation1}\\ \text{operation2}\end{bmatrix} = \begin{bmatrix}3 & 4 & 5\\3 & 0 & 0\end{bmatrix}

A matrix is a single variable representing a spreadsheet of inputs or operations.

Trickiness #1: The reading order

Instead of an input => matrix => output flow, we use function notation, like y = f(x) or f(x) = y. We usually write a matrix with a capital letter (F), and a single input column with lowercase (x). Because we have several inputs (A) and outputs (B), they’re considered matrices too:

\displaystyle{MA = B}


\begin{bmatrix}3 & 4 & 5\\3 & 0 & 0\end{bmatrix} \begin{bmatrix}a & x\\b & y\\c & z\end{bmatrix}
= \begin{bmatrix}3a + 4b + 5c & 3x + 4y + 5z\\ 3a & 3x\end{bmatrix}
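
Here's the same multiplication in NumPy (NumPy isn't part of the article; the sample numbers are mine, plugged in for (a, b, c) and (x, y, z)):

import numpy as np

M = np.array([[3, 4, 5],       # operation F: 3x + 4y + 5z
              [3, 0, 0]])      # operation G: 3x only
A = np.array([[1, 10],         # inputs as columns: (a,b,c) = (1,2,3), (x,y,z) = (10,20,30)
              [2, 20],
              [3, 30]])

print(M @ A)
# [[ 26 260]
#  [  3  30]]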

Trickiness #2: The numbering

Matrix size is measured as RxC: row count, then column count, and abbreviated “m x n” (I hear ya, “r x c” would be easier to remember). Items in the matrix are referenced the same way: a_ij is the item in the ith row and jth column (I hear ya, “i” and “j” are easily confused on a chalkboard). Mnemonics are ok with context, and here’s what I use:

  • RC, like Roman Centurion or RC Cola
  • Use an “L” shape. Count down the L, then across

Why does RC ordering make sense? Our operations matrix is 2×3 and our input matrix is 3×2. Writing them together:

[Operation Matrix] [Input Matrix]
[operation count x operation size] [input size x input count]
[m x n] [p x q] = [m x q]
[2 x 3] [3 x 2] = [2 x 2]

Notice the matrices touch at the “size of operation” and “size of input” (n = p). They should match! If our inputs have 3 components, our operations should expect 3 items. In fact, we can only multiply matrices when n = p.

The output matrix has m operation rows for each input, and q inputs, giving an “m x q” matrix.

Fancier Operations

Let’s get comfortable with operations. Assuming 3 inputs, we can whip up a few 1-operation matrices:

  • Adder: [1 1 1]
  • Averager: [1/3 1/3 1/3]

The “Adder” is just a + b + c. The “Averager” is similar: (a + b + c)/3 = a/3 + b/3 + c/3.

Try these 1-liners:

  • First-input only: [1 0 0]
  • Second-input only: [0 1 0]
  • Third-input only: [0 0 1]

And if we merge them into a single matrix:

[1 0 0]
[0 1 0]
[0 0 1]

Whoa — it’s the “identity matrix”, which copies 3 inputs to 3 outputs, unchanged. How about this guy?

[1 0 0]
[0 0 1]
[0 1 0]

He reorders the inputs: (x, y, z) becomes (x, z, y).

And this one?

[2 0 0]
[0 2 0]
[0 0 2]

He’s an input doubler. We could rewrite him to 2*I (the identity matrix) if we were so inclined.

And yes, when we decide to treat inputs as vector coordinates, the operations matrix will transform our vectors. Here’s a few examples:

  • Scale: make all inputs bigger/smaller
  • Skew: make certain inputs bigger/smaller
  • Flip: make inputs negative
  • Rotate: make new coordinates based on old ones (East becomes North, North becomes West, etc.)

These are geometric interpretations of multiplication, and how to warp a vector space. Just remember that vectors are examples of data to modify.

A Non-Vector Example: Stock Market Portfolios

Let’s practice linear algebra in the real world:

  • Input data: stock portfolios with dollars in Apple, Google and Microsoft stock
  • Operations: the changes in company values after a news event
  • Output: updated portfolios

And a bonus output: let’s make a new portfolio listing the net profit/loss from the event.

Normally, we’d track this in a spreadsheet. Let’s learn to think with linear algebra:

  • The input vector could be ($Apple, $Google, $Microsoft), showing the dollars in each stock. (Oh! These dollar values could come from another matrix that multiplied the number of shares by their price. Fancy that!)

  • The 4 output operations should be: Update Apple value, Update Google value, Update Microsoft value, Compute Profit

Visualize the problem. Imagine running through each operation:

(Image: each portfolio run through the four update operations)

The key is understanding why we’re setting up the matrix like this, not blindly crunching numbers.

Got it? Let’s introduce the scenario.

Suppose a secret iDevice is launched: Apple jumps 20%, Google drops 5%, and Microsoft stays the same. We want to adjust each stock value, using something similar to the identity matrix:

New Apple     [1.2  0      0]
New Google    [0    0.95   0]
New Microsoft [0    0      1]

The new Apple value is the original, increased by 20% (Google = 5% decrease, Microsoft = no change).

Oh wait! We need the overall profit:

Total change = (.20 * Apple) + (-.05 * Google) + (0 * Microsoft)

Our final operations matrix:

New Apple       [1.2  0      0]
New Google      [0    0.95   0]
New Microsoft   [0    0      1]
Total Profit    [.20  -.05   0]

Making sense? Three inputs enter, four outputs leave. The first three operations are a “modified copy” and the last brings the changes together.

Now let’s feed in the portfolios for Alice ($1000, $1000, $1000) and Bob ($500, $2000, $500). We can crunch the numbers by hand, or use a Wolfram Alpha (calculation):

(Image: Wolfram Alpha computing the operations matrix times the portfolio matrix)

(Note: Inputs should be in columns, but it’s easier to type rows. The transpose operation, indicated by a superscript t, converts rows to columns.)

The final numbers: Alice has $1200 in AAPL, $950 in GOOG, $1000 in MSFT, with a net profit of $150. Bob has $600 in AAPL, $1900 in GOOG, and $500 in MSFT, with a net profit of $0.
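
A quick NumPy check of the same numbers (again, NumPy is just my stand-in calculator here):

import numpy as np

operations = np.array([
    [1.20, 0.00, 0],     # new Apple value
    [0.00, 0.95, 0],     # new Google value
    [0.00, 0.00, 1],     # new Microsoft value
    [0.20, -0.05, 0],    # total profit
])
portfolios = np.array([
    [1000,  500],        # $Apple      (Alice, Bob)
    [1000, 2000],        # $Google
    [1000,  500],        # $Microsoft
])

print(operations @ portfolios)
# [[1200.  600.]
#  [ 950. 1900.]
#  [1000.  500.]
#  [ 150.    0.]]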

What’s happening? We’re doing math with our own spreadsheet. Linear algebra emerged in the 1800s yet spreadsheets were invented in the 1980s. I blame the gap on poor linear algebra education.

Historical Notes: Solving Simultaneous equations

An early use of tables of numbers (not yet a “matrix”) was bookkeeping for linear systems:


\begin{align*}
x + 2y + 3z &= 3 \\
2x + 3y + 1z &= -10 \\
5x - y + 2z &= 14
\end{align*}

becomes


\begin{bmatrix}1 & 2 & 3\\2 & 3 & 1\\5 & -1 & 2\end{bmatrix} \begin{bmatrix}x \\y \\ z \end{bmatrix}
= \begin{bmatrix}3 \\ -10 \\ 14 \end{bmatrix}

We can avoid hand cramps by adding/subtracting rows in the matrix and output, vs. rewriting the full equations. As the matrix evolves into the identity matrix, the values of x, y and z are revealed on the output side.

This process, called Gauss-Jordan elimination, saves time. However, linear algebra is mainly about matrix transformations, not solving large sets of equations (it’d be like using Excel for your shopping list).
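
If you just want the answer rather than the elimination practice, a library call does the bookkeeping (a quick NumPy check of the system above; the article itself sticks to hand elimination):

import numpy as np

coefficients = np.array([[1,  2, 3],
                         [2,  3, 1],
                         [5, -1, 2]])
outputs = np.array([3, -10, 14])

print(np.linalg.solve(coefficients, outputs))
# roughly [ 0.167 -4.833  4.167], i.e. x = 1/6, y = -29/6, z = 25/6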

Terminology, Determinants, and Eigenstuff

Words have technical categories to describe their use (nouns, verbs, adjectives). Matrices can be similarly subdivided.

Descriptions like “upper-triangular”, “symmetric”, “diagonal” are the shape of the matrix, and influence their transformations.

The determinant is the “size” of the output transformation. If the input was a unit vector (representing area or volume of 1), the determinant is the size of the transformed area or volume. A determinant of 0 means the matrix is “destructive” and cannot be reversed (similar to multiplying by zero: information was lost).

The eigenvector and eigenvalue represent the “axes” of the transformation.

Consider spinning a globe: every location faces a new direction, except the poles.

An “eigenvector” is an input that doesn’t change direction when it’s run through the matrix (it points “along the axis”). And although the direction doesn’t change, the size might. The eigenvalue is the amount the eigenvector is scaled up or down when going through the matrix.

(My intuition here is weak, and I’d like to explore more. Here’s a nice diagram and video.)

Matrices As Inputs

A funky thought: we can treat the operations matrix as inputs!

Think of a recipe as a list of commands (Add 2 cups of sugar, 3 cups of flour…).

What if we want the metric version? Take the instructions, treat them like text, and convert the units. The recipe is “input” to modify. When we’re done, we can follow the instructions again.

An operations matrix is similar: commands to modify. Applying one operations matrix to another gives a new operations matrix that applies both transformations, in order.

If N is “adjust for portfolio for news” and T is “adjust portfolio for taxes” then applying both:

TN = X

means “Create matrix X, which first adjusts for news, and then adjusts for taxes”. Whoa! We didn’t need an input portfolio, we applied one matrix directly to the other.

The beauty of linear algebra is representing an entire spreadsheet calculation with a single letter. Want to apply the same transformation a few times? Use N^2 or N^3.

Can We Use Regular Addition, Please?

Yes, because you asked nicely. Our “mini arithmetic” seems limiting: multiplications, but no addition? Time to expand our brains.

Imagine adding a dummy entry of 1 to our input: (x, y, z) becomes (x, y, z, 1).

Now our operations matrix has an extra, known value to play with! If we want x + 1 we can write:

[1 0 0 1]

And x + y - 3 would be:

[1 1 0 -3]

Huzzah!

Want the geeky explanation? We’re pretending our input exists in a 1-higher dimension, and put a “1″ in that dimension. We skew that higher dimension, which looks like a slide in the current one. For example: take input (x, y, z, 1) and run it through:

[1 0 0 1]
[0 1 0 1]
[0 0 1 1]
[0 0 0 1]

The result is (x + 1, y + 1, z + 1, 1). Ignoring the 4th dimension, every input got a +1. We keep the dummy entry, and can do more slides later.
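
Here's that slide in NumPy (my own sample point; the matrix is the one just above):

import numpy as np

slide = np.array([[1, 0, 0, 1],
                  [0, 1, 0, 1],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]])

point = np.array([5, 7, 9, 1])   # (x, y, z) with the dummy 1 appended
print(slide @ point)             # [ 6  8 10  1] -- every coordinate slid by +1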

Mini-arithmetic isn’t so limited after all.

Onward

I’ve overlooked some linear algebra subtleties, and I’m not too concerned. Why?

These metaphors are helping me think with matrices, more than the classes I “aced”. I can finally respond to “Why is linear algebra useful?” with “Why are spreadsheets useful?”

They’re not, unless you want a tool used to attack nearly every real-world problem. Ask a businessman if they’d rather donate a kidney or be banned from Excel forever. That’s the impact of linear algebra we’ve overlooked: efficient notation to bring spreadsheets into our math equations.

Happy math.

Math As Language: Understanding the Equals Sign

It’s easy to forget math is a language for communicating ideas. As words, “two and three is equal to five” is cumbersome. Replacing numbers and operations with symbols helps: “2 + 3 is equal to 5″.

But we can do better. In 1557, Robert Recorde invented the equals sign, written with two parallel lines (=), because “noe 2 thynges, can be moare equalle”.

“2 + 3 = 5” is much easier to read. Unfortunately, the meaning of “equals” changes with the context — just ask programmers who have to distinguish =, == and ===.

A “equals” B is a generic conclusion: what specific relationship are we trying to convey?

Simplification

I see “2 + 3 = 5″ as “2 + 3 can be simplified to 5″. The equals sign transitions a complex form on the left to an equivalent, simpler form on the right.

Temporary Assignment

Statements like “speed = 50″ mean “the speed is 50, for this scenario”. It’s only good for the problem at hand, and there’s no need to remember this “fact”.

Fundamental Connection

Consider a mathematical truth like a^2 + b^2 = c^2, where a, b, and c are the sides of a right triangle.

I read this equals sign as “must always be equal to” or “can be seen as” because it states a permanent relationship, not a coincidence. The arithmetic of 3^2 + 4^2 = 5^2 is a simplification; the geometry of a^2 + b^2 = c^2 is a deep mathematical truth.

The formula for the sum 1 + 2 + ... + n is:

\displaystyle{\frac{n(n+1)}{2}}

which can be seen as a type of geometric rearrangement, combinatorics, averaging, or even list-making.

Factual Definition

Statements like

\displaystyle{e = \lim_{n\to\infty} \left( 1 + \frac{100\%}{n} \right)^n}

are definitions of our choosing; the left hand side is a shortcut for the right hand side. It’s similar to temporary assignment, but reserved for “facts” that won’t change between scenarios (e always has the same value in every equation, but “speed” can change).

Constraints

Here’s a tricky one. We might write

x + y = 5

x - y = 3

which indicates conditions we want to be true. I read this as “x + y should be 5, if possible” and “x - y should be 3, if possible”. If we satisfy the constraints (x=4, y=1), great!

If we can’t meet both goals (x + y = 5; 2x + 2y = 9) then the equations could be true individually but not together.
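As a small aside (my own sketch, not part of the original discussion), the “satisfy these constraints if possible” reading is exactly what a linear solver does:

import numpy as np

# Constraints as a matrix equation: A @ [x, y] = b
A = np.array([[1,  1],    # x + y should be 5
              [1, -1]])   # x - y should be 3
b = np.array([5, 3])

x, y = np.linalg.solve(A, b)
print(x, y)   # 4.0 1.0 -- both constraints can be satisfied

Feed it the impossible pair (x + y = 5, 2x + 2y = 9) and the matrix is singular: the solver raises an error, its way of saying the goals can’t both be true.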

Example: Demystifying Euler’s Formula

Untangling the equals sign helped me decode Euler’s formula:

\displaystyle{e^{i \cdot \pi} = -1}

A strange beast, indeed. What type of “equals” is it?

A pedant might say it’s just a simplification and break out the calculus to show it. This isn’t enlightening: there’s a fundamental relationship to discover.

e^i*pi refers to the same destination as -1. Two fingers pointing at the same moon.

They are both ways to describe “the other side of the unit circle, 180 degrees away”. -1 walks there, treading straight through the grass, while e^i*pi takes the scenic route and rotates through the imaginary dimension. This works for any point on the circle: rotate there, or move in straight lines.
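A quick numeric check of the “same destination” reading (a Python sketch, my addition):

import cmath, math

rotated = cmath.exp(1j * math.pi)   # take the scenic route through the imaginary dimension
print(rotated)                      # roughly -1 + 0j, up to floating-point dust
print(cmath.isclose(rotated, -1, abs_tol=1e-9))   # True: both paths end at the same spot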

euler's formula

Two paths with the same destination: that’s what their equality means. Move beyond a generic equals and find the deeper, specific connection (“simplifies to”, “has been chosen to be”, “refers to the same concept as”).

Happy math.

Why Do We Learn Math?

I cringe when hearing "Math teaches you to think".

It's a well-meaning but ineffective appeal that only satisfies existing fans (see: "Reading takes you anywhere!"). What activity, from crossword puzzles to memorizing song lyrics, doesn't help you think?

Math seems different, and here's why: it's a specific, powerful vocabulary for ideas.

Imagine a cook who only knows the terms "yummy" and "yucky". He makes a bad meal. What's wrong? Hrm. There's no way to describe it! Too mild? Salty? Sweet? Sour? Cold? These specific critiques become hazy variations of the "yucky" bucket. He probably wouldn't think "Needs more umami".

Words are handholds that latch onto thoughts. You (yes, you!) think with extreme mathematical sophistication. Your common-sense understanding of quantity includes concepts refined over millennia (zero, decimals, negatives).

What we call "Math" are just the ideas we haven't yet internalized.

Let's explore our idea of quantity. It's a funny notion, and some languages only have words for one, two and many. They never thought to subdivide "many", and you never thought to refer to your East and West hands.

Here's how we've refined quantity over the years:

  • We have "number words" for each type of quantity ("one, two, three... five hundred seventy nine")
  • The "number words" can be written with symbols, not regular letters, like lines in the sand. The unary (tally) system has a line for each object.
  • Shortcuts exist for large counts (Roman numerals: V = five, X = ten, C = hundred)
  • We even have a shortcut to represent emptiness: 0
  • The position of a symbol is a shortcut for other numbers. 123 means 100 + 20 + 3.
  • Numbers can have incredibly small, fractional differences: 1.1, 1.01, 1.001...
  • Numbers can be negative, less than nothing (Wha?). This represents "opposite" or "reverse", e.g., negative height is underground, negative savings is debt.
  • Numbers can be 2-dimensional (or more). This isn't yet commonplace, so it's called "Math" (scary M).
  • Numbers can be undetectably small, yet still not zero. This is also called "Math".

Our concept of numbers shapes our world. Why do ancient years go from BC to AD? We needed separate labels for "before" and "after", which weren't on a single scale.

Why did the stock market set prices in increments of 1/8 until 2000 AD? Prices were stuck on centuries-old systems. Ask a modern trader if they'd rather go back.

Why is the decimal system useful for categorization? You can always find room for a decimal between two other ones, and progressively classify an item (1, 1.3, 1.38, 1.386).

Why do we accept the idea of a vacuum, empty space? Because you understand the notion of zero. (Maybe true vacuums don't exist -- you get the theory)

Why is anti-matter or anti-gravity palatable? Because you accept that positives could have negatives that act in opposite ways.

How could the universe come from nothing? Well, how can 0 be split into 1 and -1?

Our math vocabulary shapes what we're capable of thinking about. Multiplication and division, which eluded geniuses a few thousand years ago, are now homework for grade schoolers. All because we have better ways to think about numbers.

We have decent knowledge of one noun: quantity. Imagine improving our vocabulary for structure, shape, change, and chance. (Oh, I mean, the important-sounding Algebra, Geometry, Calculus and Statistics.)

Caveman Chef Og doesn't think he needs more than yummy/yucky. But you know it'd blow his mind, and his cooking, to understand sweet/sour/salty/spicy/tangy.

We're still cavemen when thinking about new ideas, and that's why we study math.

A Brief Introduction to Probability & Statistics

I’ve studied probability and statistics without experiencing them. What’s the difference? What are they trying to do?

This analogy helped:

  • Probability is starting with an animal, and figuring out what footprints it will make.
  • Statistics is seeing a footprint, and guessing the animal.

Probability is straightforward: you have the bear. Measure the foot size, the leg length, and you can deduce the footprints. “Oh, Mr. Bubbles weighs 400lbs and has 3-foot legs, and will make tracks like this.” More academically: “We have a fair coin. After 10 flips, here are the possible outcomes.”

Statistics is harder. We measure the footprints and have to guess what animal it could be. A bear? A human? If we get 6 heads and 4 tails, what’re the chances of a fair coin?

The Usual Suspects

Here’s how we “find the animal” with statistics:

Get the tracks. Each piece of data is a point in “connect the dots”. The more data, the clearer the shape (a single dot isn’t much help: one data point makes it hard to find a trend).

Measure the basic characteristics. Every footprint has a depth, width, and height. Every data set has a mean, median, standard deviation, and so on. These universal, generic descriptions give a rough narrowing: “The footprint is 6 inches wide: a small bear, or a large man?”

Find the species. There are dozens of possible animals (probability distributions) to consider. We narrow it down with prior knowledge of the system. In the woods? Think horses, not zebras. Dealing with yes/no questions? Consider a binomial distribution.

Look up the specific animal. Once we have the distribution (“bears”), we look up our generic measurements in a table. “A 6-inch wide, 2-inch deep pawprint is most likely a 3-year-old, 400-lb bear”. The lookup table is generated from the probability distribution, i.e. making measurements when the animal is in the zoo.

Make additional predictions. Once we know the animal, we can predict future behavior and other traits (“According to our calculations, Mr. Bubbles will poop in the woods.”). Statistics helps us get information about the origin of the data, from the data itself.
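To make the footprint-reading concrete, here’s a tiny sketch (my own example, with a made-up set of flips): run probability forwards for an assumed fair coin, then take the generic measurements of the observed tracks.

from math import comb

# Probability (animal -> footprints): assuming a fair coin, how likely are 6 heads in 10 flips?
def binom_prob(k, n=10, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_prob(6), 3))   # 0.205 -- tracks like these are quite ordinary for a fair coin

# Statistics (footprints -> animal): generic measurements of the observed data
flips = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]          # 6 heads, 4 tails
mean = sum(flips) / len(flips)
var = sum((f - mean)**2 for f in flips) / (len(flips) - 1)
print(mean, round(var, 3))       # 0.6 0.267 -- a fair coin (p = 0.5) is still a plausible animal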

Ok! The metaphor isn’t perfect, but it’s more palatable than “Statistics is the study of the collection, organization, analysis, and interpretation of data”. Need proof? Let’s see if we can ask intuitive “I tasted it!” questions:

  • What are the most common species? (Common distributions)
  • Are new ones being discovered?
  • Can we predict the next footprint? (Extrapolation)
  • Are the tracks following a path? (Regression / trend line)
  • Here are two tracks: which animal was faster? Bigger? (Data from two drug trials: which was more effective?)
  • Is one animal moving in the same direction as another? (Correlation)
  • Are two animals tracking a common source? (Causation: two bears chasing the same rabbit)

These questions are much deeper than what I pondered when first learning stats. Every dry procedure now has a context: are we learning a new species? How to take the generic footprint measurements? How to make a table from a probability distribution? How to lookup measurements in a table?

Having an analogy for the statistics process makes later data crunching click. Happy math.

PS. The forwards-backwards difference between probability and statistics shows up all over math. Some procedures are easy to do (derivatives) but difficult to undo (integrals). (Thanks Denis)

Understanding Algebra: Why do we factor equations?

What’s algebra about? When learning about variables (x, y, z), they seem to “hide” a number:

\displaystyle{x + 3 = 5}

What number could be hiding inside of x? 2, in this case.

It seems that arithmetic still works, even when we don’t have the exact numbers up front. Later on, we might arrange these “hidden numbers” in complex ways:

\displaystyle{x^2 + x = 6}

Whoa — a bit harder to solve, but it’s possible. Today let’s figure out how factoring works and why it’s useful.

Polynomials

When we write a polynomial like “x^2 + x = 6″, we can think at a higher level.

We have an unknown number, x, which interacts with itself (x * x = x^2). We add in the original number (+ x) and the result is 6.

x^2, x and 6 are all “numbers”, but now we’re keeping track of how they’re made:

  • x^2 is a component interacting with itself
  • x is a component on its own
  • 6 is the desired state we want the entire system to become

After the interactions are finished, we should get 6. What number could be hiding inside of x to make this true?

Hrm — this is tricky. So let’s fight with a trick of our own: we can make a different system to track the error in our original one (this is mind-bending, so hang on).

Our original system is x^2 + x. The desired state is 6. A new system:

\displaystyle{x^2 + x - 6}

will track the difference between the original system and the desired state. When are we happiest? When there’s no difference:

\displaystyle{x^2 + x - 6 = 0}

Ah! That’s why we’re so interested in setting polynomials to zero! If we have a system and the desired state, we can make a new equation to track the difference — and try to make it zero. (This is deeper than just “subtract 6 from both sides” — we’re trying to describe the error!)

But… how do we actually get the error to zero? It’s still a jumble of components: x^2, x and 6 are flying everywhere.

Factor That Mamma Jamma

Factoring to the rescue. My intuition: factoring lets us re-arrange a complex system (x^2 + x - 6) as a bunch of linked, smaller systems.

Imagine taking a pile of sticks (our messy, disorganized system) and standing them up so they support each other, like a teepee:

/\

(That’s a 2-d example, with two sticks).

Remove any stick and the entire structure collapses. If we can rewrite our system:

\displaystyle{x^2 + x - 6 = 0}

as a series of multiplications:

\displaystyle{Component \ A \cdot Component \ B = 0}

we’ve put the sticks in a “teepee”. If Component A or Component B becomes 0, the structure collapses, and we get 0 as a result.

Neat! That is why factoring rocks: we re-arrange our error-system into a fragile teepee, so we can break it. We’ll find what obliterates our errors and puts our system in the ideal state.

Remember: We’re breaking the error in the system, not the system itself.

Onto The Factoring

Learning to “factor an equation” is the process of arranging your teepee. In this case:


\begin{align*}
x^2 + x - 6 &= (x + 3)(x - 2) \\
&= Component \ A \cdot Component \ B
\end{align*}

If x = -3 then Component A falls down. If x = 2, Component B falls down. Either value causes the error to collapse, which means our original system (x^2 + x, the one we almost forgot about!) meets our requirements:

  • When x = -3, the error collapses, and we get (-3)^2 + (-3) = 6
  • When x = 2, the error collapses, and we get 2^2 + 2 = 6

Putting It All Together

I’ve wondered about the real purpose of factoring for a long, long time. In algebra class, equations are conveniently set to zero, and we’re not sure why. Here’s what happens in the real world:

  • Define the model: Write how your system behaves (x^2 + x)
  • Define the desired state: What should it equal? (6)
  • Define the error: The error is its own system: Error = actual - desired (i.e., x^2 + x - 6)
  • Factor the error: Rewrite the error as interlocking components: (x + 3)(x - 2)
  • Reduce the error to zero: Zero out one component or the other (x = -3, or x = 2).

When error = 0, our system must be in the desired state. We’re done!
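Here’s the whole process as a short sketch (using sympy, my choice of tool; the article itself doesn’t prescribe one):

import sympy as sp

x = sp.symbols('x')

system  = x**2 + x          # define the model
desired = 6                 # define the desired state
error   = system - desired  # the error is its own system

print(sp.factor(error))     # (x - 2)*(x + 3): the error rearranged into a fragile teepee
print(sp.solve(error, x))   # [-3, 2]: either value collapses the error to zero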

Algebra is pretty darn useful:

  • Our system is a trajectory, the “desired state” is the target. What trajectory hits the target?
  • Our system is our widget sales, the “desired state” is our revenue target. What amount of earnings hits the goal?
  • Our system is the probability of our game winning, the “desired state” is a 50-50 (fair) outcome. What settings make it a fair game?

The idea of “matching a system to its desired state” is just one interpretation of why factoring is useful. If you have more, I’d like to hear them!

Appendix

A cheatsheet for the process:

Some more food for thought:

  • Multiplication is often seen as AND. Component A must be there AND Component B must be there. If either condition is false, the system breaks.

  • The Fundamental Theorem of Algebra proves you have as many “components” as the degree of the polynomial (its highest exponent). If your highest term is x^4, then you can factor into 4 interlocked components (discussion for another day). But this should make sense: if you rewrite an “x^4 system” into multiplications, shouldn’t there be 4 individual “x components” being multiplied? If there were 3, you could never get to x^4, and if there were 5, you’d overshoot and get an x^5 term.

  • Do you have a real-world system in a “teepee” arrangement, where a single failing component collapses the entire structure?

  • The quadratic formula can “autobreak” any system with x^2, x and constant components. There are formulas for more complex systems (with x^3, x^4, or even some x^5 components) but they start to get a bit crazy.

  • Is there any way to prevent a system from having these weak points? (Unfactorable? Non-zeroable?). Don’t forget, we thought systems like x^2 + 1 were “non-zeroable” until imaginary numbers came along.

Happy math.

Finding Unity in the Math Wars

I usually avoid current events, but recent skirmishes in the math world prompted me to chime in. To recap, there’ve been heated discussions about math education and the role of online resources like Khan Academy.

As fun as a good math showdown may appear, there’s a bigger threat: Apathy. And Justin Bieber.

Educators, online or not, don’t compete with each other. They struggle to be noticed in our math-phobic society, where we casually wonder “Should algebra be taught at all?” not “Can algebra be taught better?”.

Entertainment is great; I love Starcraft. But it’s alarming when a prominent learning initiative gets less attention than a throwaway pop song (Super Bass: 268M views in a year; Khan Academy: 175M views in 5 years). Online learning is a rounding error next to Justin Bieber — “Baby” has 700M views alone.

What do we need? The Math Avengers. Different heroes, different tactics, and not without differences… but everyone fighting on the same side. Against Bieber.

I could be walking into a knife fight with an ice cream cone, but I’d like to approach each side with empathy and offer specific suggestions to bridge the gap.

The Big Misunderstanding

Superheroes need a misunderstanding before working together. It’s inevitable, and here’s ours (as a math relationship, of course):

Bad Teacher < Online Learning < Good Teacher

The problem is in considering each part separately.

  • Is Khan Academy (free, friendly, always available) better than a mean, uninformed, or absent teacher? Yes!

  • Is an engaging human experience better than learning from a computer? Yes!

But, really, the ultimate solution is Online learning + Good Teachers.

Tactics differ, but we can agree on the mission: give students great online resources, and give teachers tools to augment their classroom.

Why Do I Care?

I love learning. Here’s my brief background so you can root out my biases.

I was a good student. I was on the math team and hummed songs like “Life is a sine-wave, I want to de-rive it all night long…”. I drew comics about sine & cosine, the crimefighting duo. You might say I enjoyed math.

I entered college and was slapped in the face by my freshman year math class.

Professors at big universities must know everything, right? If I didn’t get a concept, something must be wrong with me, right?

I had a WWII-era, finish-half-a-proof-in-class, grouch of a teacher. I bombed the midterm and was distressed. Math… I loved math! I didn’t mind difficulties in Physics or Spanish. But math? What I used to sing and draw cartoons about?

Finals came. While cramming, I found notes online, far more helpful than my book and teacher. I sent an email to the class, gingerly suggesting BY EUCLID YOU NEED TO READ THESE WEBSITES THEY ARE SO MUCH BETTER THAN THE PROFESSOR. The websites turned up on an index card in the computer lab that evening. How many of us were struggling?

I was studying, staring at a blue book when an aha! moment struck. I could see the Matrix: equations were a description of twists, turns and rotations. Their meaning became “obvious” in the way a circle must be round. What else could it be?

I was elated and furious: “Why didn’t they explain it like that the first time?!”

Paranoid I’d forget, I put my notes online and they evolved into this site: insights that actually worked for me. Articles on e, imaginary numbers, and calculus became popular — I think we all crave deep understanding. Bad teaching was a burst of gamma rays: I’m normally mild mannered, but enter Hulk Mode when recalling how my passion nearly died.

My core beliefs:

  • A bad experience can undo years of good ones. Students need resources to sidestep bad teaching.

  • Hard-won insights, sometimes found after years of teaching, need to be shared

  • Learning “success” means having basic skills and the passion to learn more. A year, 5 years from now, do people seek out math? Or at least not hate it? (Compare #ihatemath to #ihategeography)

(Oh, I had great teachers too, like Prof. Kulkarni. The bad one just unlocked the Hulk.)

An Open letter to Khan Academy and Teachers

I recently heard a quote about constructive dialog: “Don’t argue the exact point a person made. Consider their position and respond to the best point they could have made.”

Here are the concerns I see:

Packaging and presentation matters

Yes, other resources and tutorials exist, but there’s power in a giant, organized collection. We visit Wikipedia because we know what to expect, and it’s consistent.

Khan Academy provides consistent, non-judgmental tutorials. There are exercises and discussions for every topic. You don’t need to scour YouTube, digest hour-long calculus lectures, or open up PDF worksheets for practice.

So, let’s use the magic of friendly, exploratory, bite-sized learning of topics.

Community matters

Teachers and online tools don’t “compete” any more than Mr. Rogers and Sesame Street did. They’re both ways to help.

I do think the name “Khan Academy” presents a challenge to community building. Would you rather write for Wikipedia or the Jimmy-Wales-o-pedia?

Wikipedia really feels like a community effort, and though there are alternatives, in general it’s a well-loved resource.

I think teachers may hesitate to use Khan Academy, not out of jealousy, but concern that a single pedagogical approach could overpower all others. Let’s build an online resource that can take input from the math community.

Human interaction matters

It’s easy to misunderstand Khan Academy’s goal. I’ve seen many of their blog posts and videos, and believe Khan Academy wants to work with teachers to promote deep understanding.

But, some news coverage shows students working silently in front of computers in class, not watching at home to free up class time for personal discussions.

The teacher doesn’t appear to be involved or interacting, and that misuse of a learning tool is a nightmare for teachers who want a personal connection. Let’s have an online resource that directly contributes to offline interactions also.

Experience matters

I’ve seen that insights emerge hours (or years) after learning a subject. For example, we’ve “known” since 4th grade what a million and billion are: 1,000,000 and 1,000,000,000.

But do we feel it? How long is a million seconds, roughly? C’mon, guess. Ready? It’s 12 days.

Ok, now how long is a billion seconds? It’s… wait for it… 31 years. 31 years!
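(The back-of-envelope arithmetic, if you’d like to check my figures:)

seconds_per_day  = 60 * 60 * 24
seconds_per_year = seconds_per_day * 365

print(round(1_000_000 / seconds_per_day, 1))       # ~11.6 days
print(round(1_000_000_000 / seconds_per_year, 1))  # ~31.7 years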

That’s the difference between knowing and feeling an idea. Passion comes from feeling.

Teachers draw on years of experience to get ideas to click — let’s feed this back into the online lessons.

Students matter

We teach for the same reason: to help students. Here are a few specific situations to consider.

For many, Khan Academy is their only positive math experience: not teachers, or peers, or parents, but a video. Sure, it’s not the same as an in-person teacher, but it’s miles beyond an absent or hostile one. If an education experience gets someone excited to learn, and coming back to math, we should celebrate.

Remember, despite years of positive experiences and acing tests, a sufficiently bad class nearly drove me away from math. Resources like Khan Academy offer a lifeline: “Even with a bad teacher, I can still learn”.

When someone is interested, we need to feed their curiosity. I get a lot of traffic from Khan Academy comments — how can we help students dive deeper, without making them trudge randomly through the internet?

Lastly, we all learn differently. I generally prefer text to videos (faster to read, and I can “pause” with my eyes and think). Some like the homemade feel of Khan’s videos. Others might like the polished overviews in MinutePhysics. You might prefer 3-act math stories or modeling instruction.

Let’s offer several types of resources for students to enjoy.

Calling the Math Avengers

Still here? Fantastic. To all teachers, online and non:

  • What specific steps can we take to align our efforts?

One idea: Make a curated, collaborative, easy-to-explore teaching resource.

Khan Academy is well-organized: each topic has a video and sample problems. How about sections for complementary teaching styles, projects, and misconceptions?

Imagine a student could select their “Math hero” as Khan Academy or PatrickJMT or James Tanton and see lessons in the style they prefer (like Wikipedia, curate the list to “notable” resources).

Imagine teachers could explore the best in-class activities (“What projects work well for negative numbers?”).

Whatever the style, make it easy for other educators to contribute. Want project-based videos? Sure. Need step-by-step tutorials? Great. Prefer a conceptual overview? No problem.

Each teacher keeps their house style. Let Hulk smash, and Captain America handle the hostage negotiations. Use the hero that suits you.

(It’s a public google doc you can copy and edit)

Perfect? Nope. But it’s a starting point to think about how we can work together.

Let’s focus on the overlap and align our efforts: different heroes, different tactics, and on the same side.

Sign Up for the BetterExplained Email List

Hi all! I’m starting an email list for BetterExplained readers and everyone interested in deep, intuitive learning. Sign up with the form below or click here to sign up.

Why Should I Sign Up?

If you like the blog, you’ll enjoy the email list too. I’ll be sharing the insights & techniques that took me from “huh?” to “aha!” on topics in math, programming, and communication. These periodic emails will include:

  • Exclusive content & previews
  • Short learning tips / essays / additional material that weren’t the right fit for the blog
  • Q&A on topics that are bothering you
  • Announcements & discounts for BetterExplained products I think you’ll enjoy

The blog explains ideas as I wish they were taught. The email list shares information I wish I’d seen.

Why Start an Email List?

A few reasons:

  • Email has better interaction. I can write, you can read & reply at your leisure. Social media seems noisy and non-personal — I want a conversation. (I credit Scott Young and patio11 for jump-starting the email idea).

  • The medium shapes the message. Blog posts are great for long-form, single-topic articles. Email favors shorter, bite-sized pieces. I found myself holding back material because it wasn’t a “blog-fit” (but still valuable!).

  • Better sustainability. The goal of BetterExplained is to be a lifelong project. Keeping in touch ensures the material stays useful and enjoyable, and that products (like the existing ebook + screencasts) dramatically increase your understanding.

I love blogging and will always write. Email is another method to get quick feedback, with blog posts to distill the final results.

I’m so thankful to have a little corner of the internet to share ideas, and I love hearing about and discussing the aha! moments that made things click. Let’s raise the bar for our teaching and learning: keep in touch!

-Kalid

How To Understand Derivatives: The Quotient Rule, Exponents, and Logarithms

Last time we tackled derivatives with a “machine” metaphor. Functions are a machine with an input (x) and output (y) lever. The derivative, dy/dx, is how much “output wiggle” we get when we wiggle the input:

simple function

Now, we can make a bigger machine from smaller ones (h = f + g, h = f * g, etc.). The derivative rules (addition rule, product rule) give us the “overall wiggle” in terms of the parts. The chain rule is special: we can “zoom into” a single derivative and rewrite it in terms of another input (like converting “miles per hour” to “miles per minute” — we’re converting the “time” input).

And with that recap, let’s build our intuition for the advanced derivative rules. Onward!

Division (Quotient Rule)

Ah, the quotient rule — the one nobody remembers. Oh, maybe you memorized it with a song like “Low dee high, high dee low…”, but that’s not understanding!

It’s time to visualize the division rule (who says “quotient” in real life?). The key is to see division as a type of multiplication:

\displaystyle{h = \frac{f}{g} = f \cdot \frac{1}{g}}

derivative product rule

We have a rectangle, we have area, but the sides are “f” and “1/g”. Input x changes off on the side (by dx), so f and g change (by df and dg)… but how does 1/g behave?

Chain rule to the rescue! We can wrap up 1/g into a nice, clean variable and then “zoom in” to see that yes, it has a division inside.

So let’s pretend 1/g is a separate function, m. Inside function m is a division, but ignore that for a minute. We just want to combine two perspectives:

  • f changes by df, contributing area df * m = df * (1 / g)
  • m changes by dm, contributing area dm * f = ?

We turned m into 1/g easily. Fine. But what is dm (how much 1/g changed) in terms of dg (how much g changed)?

We want the difference between neighboring values of 1/g: 1/g and 1/(g + dg). For example:

  • What’s the difference between 1/4 and 1/3? 1/12
  • How about 1/5 and 1/4? 1/20
  • How about 1/6 and 1/5? 1/30

How does this work? We get the common denominator: for 1/3 and 1/4, it’s 12. And the difference between “neighbors” (like 1/3 and 1/4) will be 1 / (common denominator), aka 1 / (x * (x + 1)). See if you can work out why!

\displaystyle{\frac{1}{x + 1} - \frac{1}{x} = \frac{-1}{x(x+1)}}

If we make our derivative model perfect, assuming the step between neighbors is infinitesimally small, the +1 becomes negligible and we get:

\displaystyle{\frac{-1}{x(x+1)} \sim \frac{-1}{x^2}}

(This is useful as a general fact: the change from 1/100 to 1/101 is roughly one ten-thousandth.)

The difference is negative, because the new value (1/4) is smaller than the original (1/3). So what’s the actual change?

  • g changes by dg, so 1/g becomes 1/(g + dg)
  • The instant rate of change is -1/g^2 [as we saw earlier]
  • The total change = dg * rate, or dg * (-1/g^2)

A few gut checks:

  • Why is the derivative negative? As dg increases, the denominator gets larger, the total value gets smaller, so we’re actually shrinking (1/3 to 1/4 is a shrink of 1/12).

  • Why do we have -1/g^2 * dg and not just -1/g^2? (This confused me at first). Remember, -1/g^2 is the chain rule conversion factor between the “g” and “1/g” scales (like saying 1 hour = 60 minutes). Fine. You still need to multiply by how far you went on the “g” scale, aka dg! An hour may be 60 minutes, but how many do you want to convert?

  • Where does dm fit in? m is another name for 1/g. dm represents the total change in 1/g, which as we saw, was -1/g^2 * dg. This substitution trick is used all over calculus to help split up gnarly calculations. “Oh, it looks like we’re doing a straight multiplication. Whoops, we zoomed in and saw one variable is actually a division — change perspective to the inner variable, and multiply by the conversion factor”.

Phew. To convert our “dg” wiggle into a “dm” wiggle we do:

\displaystyle{dm = \frac{-1}{g^2} \cdot dg}

And get:


\begin{align*}
dh &= (df \cdot m) + (f \cdot dm) \\
dh &= (df \cdot \frac{1}{g}) + (f \cdot \frac{-1}{g^2} \cdot dg)
\end{align*}

derivative product rule

Yay! Now, your overeager textbook may simplify this to:

\displaystyle{ dh = \frac{df \cdot g - f \cdot dg}{g^2}}

and it burns! It burns! This “simplification” hides how the division rule is just a variation of the product rule. Remember, there are still two slivers of area to combine:

  • The “f” (numerator) sliver grows as expected
  • The “g” (denominator) sliver is negative (as g increases, the area gets smaller)

Using your intuition, you know it’s the denominator that’s contributing the negative change.
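If you’d like to double-check the two slivers with numbers, here’s a quick sketch (the particular f and g are just ones I made up):

# h = f/g: does df*(1/g) + f*(-1/g^2)*dg really account for the change?
f = lambda x: x**3 + 1
g = lambda x: x**2 + 2

x, dx = 1.5, 1e-6
df = f(x + dx) - f(x)
dg = g(x + dx) - g(x)

actual_dh    = f(x + dx) / g(x + dx) - f(x) / g(x)
predicted_dh = df * (1 / g(x)) + f(x) * (-1 / g(x)**2) * dg

print(actual_dh, predicted_dh)   # both ~ 8.6e-07: the two slivers (one negative) add up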

Exponents (e^x)

e is my favorite number. It has the property

\displaystyle{\frac{d}{dx} e^x = e^x}

which means, in English, “e^x changes by 100% of its current amount” (read more).

The “current amount” assumes x is the exponent, and we want changes from x’s point of view (df/dx). What if u(x)=x^2 is the exponent, but we still want changes from x’s point of view?


\begin{align*}
u &= x^2 \\
\frac{d}{dx} e^u &= ?
\end{align*}

It’s the chain rule again — we want to zoom into u, get to x, and see how a wiggle of dx changes the whole system:

  • x changes by dx
  • u changes by du/dx, or d(x^2)/dx = 2x
  • How does e^u change?

Now remember, e^u doesn’t know we want changes from x’s point of view. e only knows its derivative is 100% of the current amount, which is the exponent u:

\displaystyle{ \frac{d(e^u)}{du} = e^u }

The overall change, on a per-x basis is:

\displaystyle{ \frac{d(e^u)}{dx} = \frac{du}{dx} e^u = 2x \cdot e^u = 2x \cdot e^{x^2} }

This confused me at first. I originally thought the derivative would require us to bring down “u”. No — the derivative of e^foo is e^foo. No more.

But if foo is controlled by anything else, then we need to multiply the rate of change by the conversion factor (d(foo)/dx) when we jump into that inner point of view.
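A quick numeric check (my own sketch) that the conversion factor du/dx = 2x really is needed:

import math

x, dx = 1.2, 1e-7
actual    = (math.exp((x + dx)**2) - math.exp(x**2)) / dx   # wiggle x, watch e^(x^2) respond
predicted = 2 * x * math.exp(x**2)                          # chain rule: du/dx * e^u

print(actual, predicted)   # both ~ 10.13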

Natural Logarithm

The derivative of ln(x) is 1/x. It’s usually given as a matter of fact.

My intuition is to see ln(x) as the time needed to grow to x:

  • ln(10) is the time to grow from 1 to 10, assuming 100% continuous growth

Ok, fine. How long does it take to grow to the “next” value, like 11? (x + dx, where dx = 1)

When we’re at x=10, we’re growing exponentially at 10 units per second. It takes roughly 1/10 of a second (1/x) to get to the next value. And when we’re at x=11, it takes 1/11 of a second to get to 12. And so on: the time to the next value is 1/x.

The derivative

\displaystyle{\frac{d}{dx}ln(x) = \frac{1}{x}}

is mainly a fact to memorize, but it makes sense with a “time to grow” interpretation.

A Hairy Example: x^x

Time to test our intuition: what’s the derivative of x^x?

\displaystyle{\frac{d}{dx} x^x = ? }

This is a bad mamma jamma. There are two approaches:

Approach 1: Rewrite everything in terms of e.

Oh e, you’re so marvelous:


\begin{align*}
h(x) &= x^x \\
 &= [e^{ln(x)}]^x \\
 &= e^{ln(x) \cdot x}
\end{align*}

Any exponent (a^b) is really just e in different clothing: [e^ln(a)]^b. We’re just asking for the derivative of e^foo, where foo = ln(x) * x.

But wait! Since we want the derivative in terms of “x”, not foo, we need to jump into x’s point of view and multiply by d(foo)/dx:


\begin{align*}
\frac{d}{dx} ln(x) \cdot x &= x \cdot \frac{1}{x} + ln(x) \cdot 1 \\
&= 1 + ln(x)
\end{align*}

The derivative of “ln(x) * x” is just a quick application of the product rule. If h=x^x, the final result is:

\displaystyle{h'(x) = (1 + ln(x)) \cdot e^{ln(x) \cdot x} = (1 + ln(x)) \cdot x^x}

We wrote e^[ln(x)*x] in its original notation, x^x. Yay! The intuition was “rewrite in terms of e and follow the chain rule”.

Approach 2: Independent Points Of View

Remember, derivatives assume each part of the system works independently. Rather than seeing x^x as a giant glob, assume it’s made from two interacting functions: u^v. We can then add their individual contributions. We’re sneaky though: u and v are the same (u = v = x), but don’t let them know!

From u’s point of view, v is just a static power (i.e., if v=3, then it’s u^3) so we have:

\displaystyle{\frac{d}{du} u^v = v \cdot u^{v - 1}}

And from v’s point of view, u is just some static base (if u=5, we have 5^v). We rewrite into base e, and we get


\begin{align*}
\frac{d}{dv} u^v &= \frac{d}{dv} [e^{ln(u)}]^v \\
&= \frac{d}{dv} e^{ln(u) \cdot v} \\ 
&=  ln(u) \cdot e^{ln(u) \cdot v}
\end{align*}

We add each point of view for the total change:

\displaystyle{ln(u) \cdot e^{ln(u) \cdot v} + v \cdot u^{v - 1} }

And the reveal: u = v = x! There’s no conversion factor for this new viewpoint (du/dx = dv/dx = dx/dx = 1), and we have:


\begin{align*}
h' &= ln(x) \cdot e^{ln(x) \cdot x} + x \cdot x^{x - 1} \\
 &= ln(x) \cdot x^x + x^{x - 1 + 1} \\
 &= ln(x) \cdot x^x + x^x \\
 &= (1 + ln(x)) \cdot x^x
\end{align*}

It’s the same as before! I was pretty excited to approach x^x from a few different angles.

By the way, use Wolfram Alpha (like so) to check your work on derivatives (click “show steps”).
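Or spot-check it numerically yourself (a sketch; the sample point is arbitrary):

import math

x, dx = 2.0, 1e-7
actual    = ((x + dx)**(x + dx) - x**x) / dx
predicted = (1 + math.log(x)) * x**x

print(actual, predicted)   # both ~ 6.77: (1 + ln(x)) * x^x holds up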

Question: If u were more complex, where would we use du/dx?

Imagine u was a more complex function like u=x^2 + 3: where would we multiply by du/dx?

Let’s think about it: du/dx only comes into play from u’s point of view (when v is changing, u is a static value, and it doesn’t matter that u can be further broken down in terms of x). u’s contribution is

\displaystyle{\frac{d}{du} u^v = v \cdot u^{v - 1}}

if we wanted the “dx” point of view, we’d include du/dx here:

\displaystyle{\frac{d}{du} \frac{du}{dx} u^v = v \cdot u^{v - 1} \frac{du}{dx}}

We’re multiplying by the “du/dx” conversion factor to get things from x’s point of view. Similarly, if v were more complex, we’d have a dv/dx term when computing v’s point of view.

Look what happened — we figured out the generic d/du and converted it into a more specific d/dx when needed.

It’s Easier With Infinitesimals

Separating dy from dx in dy/dx is “against the rules” of limits, but works great with infinitesimals. You can figure out the derivative rules really quickly:

Product rule:


\begin{align*}
d(fg) &= (f + df)(g + dg) - fg \\
&= [fg + f dg + g df + df dg] - fg \\
&= f dg + g df + df dg
\end{align*}

We set “df * dg” to zero when jumping out of the infinitesimal world and back to our regular number system.

Think in terms of “How much did g change? How much did f change?” and derivatives snap into place much easier. “Divide through” by dx at the end.

Summary: See the Machine

Our goal is to understand calculus intuition, not memorization. I need a few analogies to get me thinking:

  • Functions are machines, derivatives are the “wiggle” behavior
  • Derivative rules find the “overall wiggle” in terms of the wiggles of each part
  • The chain rule zooms into a perspective (hours => minutes)
  • The product rule adds area
  • The quotient rule adds area (but one area contribution is negative)
  • e changes by 100% of the current amount (d/dx e^x = 100% * e^x)
  • natural log is the time for e^x to reach the next value (x units/sec means 1/x to the next value)

With practice, ideas start clicking. Don’t worry about getting tripped up — I still catch myself overusing the chain rule when working with exponents. Learning is a process!

Happy math.

Appendix: Partial Derivatives

Let’s say our function depends on two inputs:

\displaystyle{f(x,y)}

The derivative of f can be seen from x’s point of view (how does f change with x?) or y’s point of view (how does f change with y?). It’s the same idea: we have two “independent” perspectives that we combine for the overall behavior (it’s like combining the point of view of two Solipsists, who think they’re the only “real” people in the universe).

If x and y depend on the same variable (like t, time), we can write the following:

\displaystyle{\frac{df}{dt} = \frac{df}{dx} \cdot \frac{dx}{dt} + \frac{df}{dy} \cdot \frac{dy}{dt}}

It’s a bit of the chain rule — we’re combining two perspectives, and for each perspective, we dive into its root cause (time).

If x and y are otherwise independent, we represent the derivative along each axis in a vector:

\displaystyle{(\frac{df}{dx}, \frac{df}{dy})}

This is the gradient, a way to represent “From this point, if you travel in the x or y direction, here’s how you’ll change”. We combined our 1-dimensional “points of view” to get an understanding of the entire 2d system. Whoa.
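Here’s the “combine two points of view” formula checked numerically (a sketch with functions I picked for illustration):

import math

# f(x, y) = x * y, with x(t) = t^2 and y(t) = sin(t)
# df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt) = y * 2t + x * cos(t)
t, dt = 1.0, 1e-7

def f_of_t(t):
    return (t**2) * math.sin(t)

actual    = (f_of_t(t + dt) - f_of_t(t)) / dt
predicted = math.sin(t) * (2 * t) + (t**2) * math.cos(t)

print(actual, predicted)   # both ~ 2.22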

How To Understand Derivatives: The Product, Power & Chain Rules

The jumble of rules for taking derivatives never truly clicked for me. The addition rule, product rule, quotient rule — how do they fit together? What are we even trying to do?

Here’s my take on derivatives:

  • We have a system to analyze, our function f
  • The derivative f’ (aka df/dx) is the moment-by-moment behavior
  • It turns out f is part of a bigger system (h = f + g)
  • Using the behavior of the parts, can we figure out the behavior of the whole?

Yes. Every part has a “point of view” about how much change it added. Combine every point of view to get the overall behavior. Each derivative rule is an example of merging various points of view.

And why don’t we analyze the entire system at once? For the same reason you don’t eat a hamburger in one bite: small parts are easier to wrap your head around.

Instead of memorizing separate rules, let’s see how they fit together:

table

The goal is to really grok the notion of “combining perspectives”. This installment covers addition, multiplication, powers and the chain rule. Onward!

Functions: Anything, Anything But Graphs

The default calculus explanation writes “f(x) = x^2″ and shoves a graph in your face. Does this really help our intuition?

Not for me. Graphs squash input and output into a single curve, and hide the machinery that turns one into the other. But the derivative rules are about the machinery, so let’s see it!

I visualize a function as the process “input(x) => f => output(y)”.

simple function

It’s not just me. Check out this incredible, mechanical targeting computer (beginning of a YouTube series).

The machine computes functions like addition and multiplication with gears — you can see the mechanics unfolding!

simple function

Think of function f as a machine with an input lever “x” and an output lever “y”. As we adjust x, f sets the height for y. Another analogy: x is the input signal, f receives it, does some magic, and spits out signal y. Use whatever analogy helps it click.

Wiggle Wiggle Wiggle

The derivative is the “moment-by-moment” behavior of the function. What does that mean? (And don’t mindlessly mumble “The derivative is the slope”. See any graphs around these parts, fella?)

The derivative is how much we wiggle. The lever is at x, we “wiggle” it, and see how y changes. “Oh, we moved the input lever 1mm, and the output moved 5mm. Interesting.”

The result can be written “output wiggle per input wiggle” or “dy/dx” (5mm / 1mm = 5, in our case). This is usually a formula, not a static value, because it can depend on your current input setting.

For example, when f(x) = x^2, the derivative is 2x. Yep, you’ve memorized that. What does it mean?

If our input lever is at x = 10 and we wiggle it slightly (moving it by dx=0.1 to 10.1), the output should change by dy. How much, exactly?

  • We know f’(x) = dy/dx = 2 * x
  • At x = 10, the “output wiggle per input wiggle” is 2 * 10 = 20. The output moves 20 units for every unit of input movement.
  • If dx = 0.1, then dy = 20 * dx = 20 * .1 = 2

And indeed, the difference between 10^2 and (10.1)^2 is about 2. The derivative estimated how far the output lever would move (a perfect, infinitely small wiggle would move 2 units; we moved 2.01).
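The lever experiment, in a few lines (a sketch; any x and dx will do):

f = lambda x: x**2

x, dx = 10, 0.1
dy = f(x + dx) - f(x)

print(dy)          # ~2.01: the actual output move
print(2 * x * dx)  # 2.0: the derivative's estimate (20 output units per input unit, times dx)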

The key to understanding the derivative rules:

  • Set up your system
  • Wiggle each part of the system separately, see how far the output moves
  • Combine the results

The total wiggle is the sum of wiggles from each part.

Addition and Subtraction

Time for our first system:

\displaystyle{h(x) = f(x) + g(x) }

derivative addition

What happens when the input (x) changes?

In my head, I think “Function h takes a single input. It feeds the same input to f and g and adds the output levers. f and g wiggle independently, and don’t even know about each other!”

Function f knows it will contribute some wiggle (df), g knows it will contribute some wiggle (dg), and we, the prowling overseers that we are, know their individual moment-by-moment behaviors are added:

\displaystyle{dh = df + dg}

derivative addition

Again, let’s describe each “point of view”:

  • The overall system has behavior dh
  • From f’s perspective, it contributes df to the whole [it doesn't know about g]
  • From g’s perspective, it contributes dg to the whole [it doesn't know about f]

Every change to a system is due to some part changing (f and g). If we add the contributions from each possible variable, we’ve described the entire system.

df vs df/dx

Sometimes we use df, other times df/dx — what gives? (This confused me for a while)

  • df is a general notion of “however much f changed”
  • df/dx is a specific notion of “however much f changed, in terms of how much x changed”

The generic “df” helps us see the overall behavior.

An analogy: Imagine you’re driving cross-country and want to measure the fuel efficiency of your car. You’d measure the distance traveled, check your tank to see how much gas you used, and finally do the division to compute “miles per gallon”. You measured distance and gasoline separately — you didn’t jump into the gas tank to get the rate on the go!

In calculus, sometimes we want to think about the actual change, not the ratio. Working at the “df” level gives us room to think about how the function wiggles overall. We can eventually scale it down in terms of a specific input.

And we’ll do that now. The addition rule above can be written, on a “per dx” basis, as:

\displaystyle{\frac{dh}{dx} = \frac{df}{dx} + \frac{dg}{dx}}

Multiplication (Product Rule)

Next puzzle: suppose our system multiplies parts “f” and “g”. How does it behave?

\displaystyle{h(x) = f(x) \cdot g(x)}

Hrm, tricky — the parts are interacting more closely. But the strategy is the same: see how each part contributes from its own point of view, and combine them:

  • total change in h = f’s contribution (from f’s point of view) + g’s contribution (from g’s point of view)

Check out this diagram:

derivative product rule

What’s going on?

  • We have our system: f and g are multiplied, giving h (the area of the rectangle)
  • Input “x” changes by dx off in the distance. f changes by some amount df (think absolute change, not the rate!). Similarly, g changes by its own amount dg. Because f and g changed, the area of the rectangle changes too.
  • What’s the area change from f’s point of view? Well, f knows he changed by df, but has no idea what happened to g. From f’s perspective, he’s the only one who moved and will add a slice of area = df * g
  • Similarly, g doesn’t know how f changed, but knows he’ll add a slice of area “dg * f”

The overall change in the system (dh) is the two slices of area:

\displaystyle{dh = f \cdot dg + g \cdot df}

Now, like our miles per gallon example, we “divide by dx” to write this in terms of how much x changed:

\displaystyle{\frac{dh}{dx} = f \cdot \frac{dg}{dx} + g \cdot \frac{df}{dx}}

(Aside: Divide by dx? Engineers will nod, mathematicians will frown. Technically, df/dx is not a fraction: it’s the entire operation of taking the derivative (with the limit and all that). But infinitesimal-wise, intuition-wise, we are “scaling by dx”. I’m a smiler.)

The key to the product rule: add two “slivers of area”, one from each point of view.

Gotcha: But isn’t there some effect from both f and g changing simultaneously (df * dg)?

Yep. However, this area is an infinitesimal * infinitesimal (a “2nd-order infinitesimal”) and invisible at the current level. It’s a tricky concept, but (df * dg) / dx vanishes compared to normal derivatives like df/dx. We vary f and g independently and combine the results, ignoring the effect of them moving together.
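You can watch that corner vanish with numbers (a sketch with arbitrary f and g):

f = lambda x: x**2 + 1
g = lambda x: 3 * x

x, dx = 2.0, 1e-6
df = f(x + dx) - f(x)
dg = g(x + dx) - g(x)

dh         = f(x + dx) * g(x + dx) - f(x) * g(x)   # the true change in area
two_slices = f(x) * dg + g(x) * df                 # the two points of view, added
corner     = df * dg                               # the sliver where both moved at once

print(dh, two_slices)   # both ~ 3.9e-05
print(corner)           # ~ 1.2e-11: a 2nd-order infinitesimal, invisible at this level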

The Chain Rule: It’s Not So Bad

Let’s say g depends on f, which depends on x:

\displaystyle{y = g(f(x))}

derivative product rule

The chain rule lets us “zoom into” a function and see how an initial change (x) can affect the final result down the line (g).

Interpretation 1: Convert the rates

A common interpretation is to multiply the rates:

\displaystyle{\frac{dg}{dx} = \frac{dg}{df} \cdot \frac{df}{dx}}

x wiggles f. This creates a rate of change of df/dx, which wiggles g by dg/df. The entire wiggle is then:

\displaystyle{\frac{dg}{df} \cdot \frac{df}{dx}}

This is similar to the “factor-label” method in chemistry class:

\displaystyle{\frac{miles}{second} = \frac{miles}{hour} \cdot \frac{1 \ hour}{60 \ minutes} \cdot \frac{1 \ minute}{60 \ seconds} = \frac{miles}{hour} \cdot \frac{1}{3600}}

If your “miles per second” rate changes, multiply by the conversion factor to get the new “miles per hour”. The second doesn’t know about the hour directly — it goes through the second => minute conversion.

Similarly, g doesn’t know about x directly, only f. Function g knows it should scale its input by dg/df to get the output. The initial rate (df/dx) gets modified as it moves up the chain.

Interpretation 2: Convert the wiggle

I prefer to see the chain rule on the “per-wiggle” basis:

  • x wiggles by dx, so
  • f wiggles by df, so
  • g wiggles by dg

Cool. But how are they actually related? Oh yeah, the derivative! (It’s the output wiggle per input wiggle):

\displaystyle{df = dx \cdot \frac{df}{dx}}

Remember, the derivative of f (df/dx) is how much to scale the initial wiggle. And the same happens to g:

\displaystyle{dg = df \cdot \frac{dg}{df}}

It will scale whatever wiggle comes along its input lever (f) by dg/df. If we write the df wiggle in terms of dx:

\displaystyle{dg = (dx \cdot \frac{df}{dx}) \cdot \frac{dg}{df}}

We have another version of the chain rule: dx starts the chain, which results in some final result dg. If we want the final wiggle in terms of dx, divide both sides by dx:

\displaystyle{\frac{dg}{dx} = \frac{df}{dx} \cdot \frac{dg}{df}}

The chain rule isn’t just factor-label unit cancellation — it’s the propagation of a wiggle, which gets adjusted at each step.

The chain rule works for several variables (a depends on b depends on c), just propagate the wiggle as you go.

Try to imagine “zooming into” a different variable’s point of view. Starting from dx and looking up, you see the entire chain of transformations needed before the impulse reaches g.

Chain Rule: Example Time

Let’s say we put a “squaring machine” in front of a “cubing machine”:

input(x) => f:x^2 => g:f^3 => output(y)

f:x^2 means f squares its input. g:f^3 means g cubes its input, the value of f. For example:

input(2) => f(2) => g(4) => output:64

Start with 2, f squares it (2^2 = 4), and g cubes this (4^3 = 64). It’s a 6th power machine:

\displaystyle{g(f(x)) = (x^2)^3}

And what’s the derivative?

\displaystyle{ \frac{dg}{dx} = \frac{dg}{df} \cdot \frac{df}{dx}}

  • f changes its input wiggle by df/dx = 2x
  • g changes its input wiggle by dg/df = 3f^2

The final change is:

\displaystyle{3f^2 \cdot 2x = 3(x^2)^2 \cdot 2x = 3x^4 \cdot 2x = 6x^5}
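And a numeric check that the squaring-then-cubing machine really wiggles like a 6th-power machine (sample point chosen arbitrarily):

f = lambda x: x**2           # the squaring machine
g = lambda f_val: f_val**3   # the cubing machine

x, dx = 2.0, 1e-7
actual    = (g(f(x + dx)) - g(f(x))) / dx
predicted = 6 * x**5

print(actual, predicted)   # both ~ 192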

Chain Rule: Gotchas

Functions treat their inputs like a blob

In the example, g’s derivative rule (“the derivative of x^3 is 3x^2″) doesn’t refer to the original “x”, just whatever the input was (the derivative of foo^3 is 3*foo^2). The input was f, and it treats f as a single value. Later on, we scurry in and rewrite f in terms of x. But g has no involvement with that — it doesn’t care that f can be rewritten in terms of smaller pieces.

In many examples, the variable “x” is the “end of the line”.

Questions ask for df/dx, i.e. “Give me changes from x’s point of view”. Now, x could depend on some deeper variable, but that’s not what’s being asked for. It’s like saying “I want miles per hour. I don’t care about miles per minute or miles per second. Just give me miles per hour”. df/dx means “stop looking at inputs once you get to x”.

How come we multiply derivatives with the chain rule, but add them for the others?

The regular rules are about combining points of view to get an overall picture. What change does f see? What change does g see? Add them up for the total.

The chain rule is about going deeper into a single part (like f) and seeing if it’s controlled by another variable. It’s like looking inside a clock and saying “Hey, the minute hand is controlled by the second hand!”. We’re staying inside the same part.

Sure, eventually this “per-second” perspective of f could be added to some perspective from g. Great. But the chain rule is about diving deeper into “f’s” root causes.

Power Rule: Oft Memorized, Seldom Understood

What’s the derivative of x^4? 4x^3? Great. You brought down the exponent and subtracted one. Now explain why!

Hrm. There’s a few approaches, but here’s my new favorite: x^4 is really x * x * x * x. It’s the multiplication of 4 “independent” variables. Each x doesn’t know about the others, it might as well be x * u * v * w.

Now think about the first x’s point of view:

  • It changes from x to x + dx
  • The change in the overall function is [(x + dx) - x][u * v * w] = dx[u * v * w]
  • The change on a “per dx” basis is [u * v * w]

Similarly,

  • From u’s point of view, it changes by du. It contributes (du/dx)*[x * v * w] on a “per dx” basis
  • v contributes (dv/dx) * [x * u * w]
  • w contributes (dw/dx) * [x * u * v]

The curtain is unveiled: x, u, v, and w are the same! The “point of view” conversion factor is 1 (du/dx = dv/dx = dw/dx = dx/dx = 1), and the total change is

\displaystyle{(x \cdot x \cdot x) + (x \cdot x \cdot x) + (x \cdot x \cdot x) + (x \cdot x \cdot x) = 4 x^3}

In a sentence: the derivative of x^4 is 4x^3 because x^4 has four identical “points of view” which are being combined. Booyeah!

Take A Breather

I hope you’re seeing the derivative in a new light: we have a system of parts, we wiggle our input and see how the whole thing moves. It’s about combining perspectives: what does each part add to the whole?

In the follow-up article, we’ll look at even more powerful rules (exponents, quotients, and friends). Happy math.