You'll spin yourself dizzy trying to reconcile the ideas. So just rename it:

**How about "forwards" and "backwards" numbers?**

The invention of the number line *vastly* improved our understanding. What is "backwards from backwards"? It's forwards again! (I.e., why negative x negative = positive — more visual arithmetic.)

What is "halfway backwards"? Sideways!

The wrong words limit our thoughts.

That this subject [imaginary numbers] has hitherto been surrounded by mysterious obscurity, is to be attributed largely to an ill adapted notation. If, for example, +1, -1, and the square root of -1 had been called direct, inverse and lateral units, instead of positive, negative and imaginary (or even impossible), such an obscurity would have been out of the question. - Carl Gauss

When we want a "count" of something, we aren't being specific enough. We're using too general a name.

There are two types of counts:

- The points that determine the boundaries
- The spans between the boundaries

Assuming there's a single, universal "count" sets us up for off-by-one errors. Was that a "point count" or a "span count"?

"There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors." - Phil Karlton

Even the name "off by one" isn't that helpful. That's the symptom, but what's the root cause? The "Fencepost error" helps identify why the issue happened.
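The point-vs-span distinction is easy to sketch. A minimal Python example, using a made-up fence:

```python
# Fencepost check: a 30-meter fence with a post every 10 meters.
fence_length = 30
post_spacing = 10

span_count = fence_length // post_spacing   # spans between the boundaries: 3
point_count = span_count + 1                # points (posts) at the boundaries: 4
```

Naming the variables `span_count` and `point_count`, rather than a generic `count`, is the whole trick: the off-by-one can't hide.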

How should we describe an angle? There are two ways to see it:

- Degrees: the *swivel* an observer went through to follow an object
- Radians: the *distance* the object moved on its path

When physics formulas (sine, cosine, etc.) ask for "angle in radians" they mean "distance the object moved". Aha!

(The laws of motion don't particularly care how much you, the observer, had to tilt your head. Sorry.)
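A quick Python sketch of the distinction, with a hypothetical radius and angle:

```python
import math

# An object circles at radius 2; the observer swivels 90 degrees to follow it.
radius = 2.0
swivel_deg = 90.0

# Radians measure distance along a unit circle: 90 degrees -> pi/2.
swivel_rad = math.radians(swivel_deg)

# Arc length the object actually traveled: radius * angle-in-radians.
distance_moved = radius * swivel_rad   # 2 * pi/2 = pi
```

The physics formula wants `distance_moved`, not how far your head turned.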

Integrals are usually described as the inverse of differentiation, finding the area under the curve, and so on.

How about this renaming: integrals are fancy multiplication.

That's it. You try to multiply two quantities, but you can't — one of the critters is scurrying around — so you use an integral.

"If I drive an unwavering 30mph for 3 hours, I'd just multiply it out. But since my speed changes, I'll integrate."

Think "fancy multiplication" not "inverse of differentiation". (Unless you're actually solving differential equations. In that case, have fun.)
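The steady-vs-changing speed example can be sketched with a crude numeric integral (a left Riemann sum, not a real integration library; the changing speed function is made up):

```python
# Plain multiplication: steady 30 mph for 3 hours.
steady_distance = 30 * 3   # 90 miles

# "Fancy multiplication": speed changes, so sum many tiny speed * time pieces.
def integrate(f, a, b, steps=100_000):
    dt = (b - a) / steps
    return sum(f(a + i * dt) * dt for i in range(steps))

def speed(t):
    return 30 + 5 * t   # hypothetical speed that ramps up over time

changing_distance = integrate(speed, 0, 3)   # exact answer: 30*3 + 5*9/2 = 112.5
```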

Most formulas are named after their inventor. It's good to give credit, but it's not helpful for the student. For the Pythagorean Theorem, there are a few alternate names we can try:

- Triangle Theorem: Given two sides of a right triangle, we can know the third.
- Distance Theorem: Get the distance between points in any number of dimensions. (By imagining they're on a sequence of triangles.)
- Tradeoff Theorem: Find the tradeoff as you move in any direction (how much "x distance" you give up to gain "y distance"). Using this, we can follow the gradient for the optimal direction.

Math gets so much easier with the right phrase in your head. Just use the "distance theorem" for distance, the "tradeoff theorem" to find the best direction to move.

Even if you forget the formula, you know how it's applied. Otherwise, you have the phrase "Pythagorean Theorem" without the understanding of why you'd need it.
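A minimal sketch of the "Distance Theorem" in Python, with made-up points:

```python
import math

def distance(p, q):
    # "Distance Theorem": chain right triangles through each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d2 = distance((0, 0), (3, 4))          # classic 3-4-5 triangle: 5.0
d3 = distance((0, 0, 0), (3, 4, 12))   # same idea in 3D: 13.0
```

The same `sum of squared differences` works in any number of dimensions, which is why the "distance" name travels better than "Pythagorean".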

A super-common question is "Why is the number e important?" It's all in the name.

e is the "universal growth constant" like c is the "speed of light constant". c is perfect speed (can't improve it further), e^x is perfect growth (can't compound it further).

e^x is what we see when compounding 100% with no delay. Why 100%? Symmetry, baby: our growth rate matches our current amount. (Similarly, we do trig on the unit circle, use 1.0 as our base increment for counting, and so on.)

Once we have perfection, we can modify it for our scenario. Maybe we're growing for more time periods, or at a different rate; that's fine, just modify e^x.
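A quick Python sketch of both ideas: compounding 100% growth in finer steps to approach e, then modifying perfect growth for a hypothetical rate and time:

```python
import math

# Compounding 100% growth in finer and finer steps approaches e.
estimates = [(1 + 1 / n) ** n for n in (1, 10, 100, 10_000)]
# 2.0, then 2.59..., 2.70..., 2.718...

# Once we have "perfect growth", modify it: rate r for time t is e**(r*t).
amount = math.exp(0.5 * 3)   # 50% continuous rate for 3 time periods
```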

Naming e "Euler's constant" isn't descriptive. The "universal growth constant" is more helpful.

I'm on a linear algebra kick. Instead of "Matrix multiplication" (which is a very drab description of what we're doing), how about "running data through operations" (or "running a spreadsheet").

One matrix represents the operations, one matrix represents the data, and we are running the data through a pipeline.

Yes, in *some* cases you can think about "linearly transforming a vector space", but often we're just transforming data.

And if you have an error? Instead of "You're multiplying it wrong" (gee, thanks) how about "The operations expect different-sized data." Ah! I know what to fix.
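A minimal sketch of the "operations and data" view, using plain Python lists and made-up numbers (not a real linear algebra library):

```python
# One matrix holds the operations (each row is a function),
# the other holds the data (each column is a record).
ops = [[1, 0, 2],      # operation 1
       [0, 3, 1]]      # operation 2
data = [[5],           # a single data column with 3 entries
        [6],
        [7]]

# "Run the data through the operations": plain matrix multiplication.
result = [[sum(o * d for o, d in zip(op_row, col)) for col in zip(*data)]
          for op_row in ops]
# result: [[19], [25]]

# The better error message: each operation expects 3 inputs, so the data
# must have 3 rows. Otherwise "the operations expect different-sized data."
```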

They never tell you how much easier math gets when you create your own names for things. I use whatever analogies, metaphors, or plain-English descriptions help. They may feel silly, but it's a much better feeling than confusion.

There's an ancient concept that knowing the name of a thing gives you power over it. It seems to be true for math.

A famous mathematician had a great quote on naming:

"Mathematics is the art of giving the same name to different things." - Henri Poincare

Math often finds what things have in common (two birds, two fish => "twoness"). But to internalize a concept, we need several ways to describe the same thing. Use as many names as it takes: each is an antibiotic that can treat our confusion, and maybe one will stick.

I have an intuition cheatsheet with rewordings of many concepts.

Hope you enjoy it!

**1) Matrix multiplication scales/rotates/skews a geometric plane.**

This is useful when first learning about vectors: vectors go in, new ones come out. Unfortunately, this can lead to an over-reliance on geometric visualization.

If 20 families are coming to your BBQ, how do you estimate the hotdogs you need? (*Hrm… 20 families, call it 3 people per family, 2 hotdogs each… about 20 * 3 * 2 = 120 hotdogs.*)

You probably don't think "Oh, I need the volume of an invitation-familysize-hunger prism!". With large matrices I don't think about 500-dimensional vectors, just data to be modified.

**2) Matrix multiplication composes linear operations.**

This is the technically accurate definition: yes, matrix multiplication results in a new matrix that composes the original functions. However, sometimes the matrix being operated on is not a linear operation, but a set of vectors or data points. We need another intuition for what's happening.

I'll put a programmer's viewpoint into the ring:

**3) Matrix multiplication is about information flow, converting data to code and back.**

I think of linear algebra as "math spreadsheets" (if you're new to linear algebra, read this intro):

- We store information in various spreadsheets ("matrices")
- Some of the data are seen as functions to apply, others as data points to use
- We can swap between the vector and function interpretation as needed

Sometimes I'll think of data as geometric vectors, and sometimes I'll see a matrix as composing functions. But mostly I think about information flowing through a system. (Some purists cringe at reducing beautiful algebraic structures into frumpy spreadsheets; I sleep OK at night.)

Take your favorite recipe. If you interpret the words as *instructions*, you'll end up with a pie, muffin, cake, etc.

If you interpret the words as *data*, the text is prose that can be tweaked:

- Convert measurements to metric units
- Swap ingredients due to allergies
- Adjust for altitude or different equipment

The result is a new recipe, which can be further tweaked, or executed as instructions to make a different pie, muffin, cake, etc. (Compilers treat a program as text, modify it, and eventually output "instructions" — which could be text for another layer.)

That's Linear Algebra. We take raw information like "3 4 5" and treat it as a vector or function, depending on how it's written.

By convention, a vertical column is usually a vector, and a horizontal row is typically a function:

- `[3; 4; 5]` means `x = (3, 4, 5)`. Here, `x` is a vector of data (I'm using `;` to separate each row).
- `[3 4 5]` means `f(a, b, c) = 3a + 4b + 5c`. This is a function taking three inputs and returning a single result.

And the aha! moment: data is code, code is data!

The row containing a horizontal function could really be three data points (each with a single element). The vertical column of data could really be three distinct functions, each taking a single parameter.

Ah. This is getting neat: depending on the desired outcome, we can combine data and code in a different order.

The matrix transpose swaps rows and columns. Here's what it means in practice.

If `x` was a column vector with 3 entries (`[3; 4; 5]`), then `x'` can mean:

- A function taking 3 arguments (`[3 4 5]`)
- Still a data vector, but as three separate entries. The transpose "split it up".

Similarly, if `f = [3 4 5]` is our row vector, then `f'` can mean:

- A single data vector, in a vertical column
- `f'` separated into three functions (each taking a single input)

Let's use this in practice.

When we see `x' * x` we mean: `x'` (as a single function) is working on `x` (a single vector). The result is the **dot product** (read more). In other words, we've applied the data to itself.

When we see `x * x'` we mean: `x` (as a set of functions) is working on `x'` (a set of individual data points). The result is a grid where we've applied each function to each data point. Here, we've mixed the data with itself in every possible permutation.

I think of `xx` as `x(x)`. It's the "function x" working on the "vector x". (This helps compute the covariance matrix, a measure of self-similarity in the data.)
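Both patterns are easy to sketch in plain Python with the `[3, 4, 5]` example:

```python
x = [3, 4, 5]

# x' * x: one function working on one vector -> a single number (the dot product).
dot = sum(a * b for a, b in zip(x, x))   # 9 + 16 + 25 = 50

# x * x': single-input functions applied to each data point -> a grid
# mixing the data with itself in every permutation.
grid = [[a * b for b in x] for a in x]
# [[ 9, 12, 15],
#  [12, 16, 20],
#  [15, 20, 25]]
```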

Phew! How does this help us? When we see an equation like this (from the Machine Learning class):

I now have an instant feel of what's happening. In the first equation, we're treating theta (which is normally a set of data parameters) as a function, and passing in x as an argument. This should give us a single value.
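A sketch of that first pattern with made-up numbers (`theta` and `x` here are hypothetical, not the course's actual values):

```python
# theta is a parameter vector we *treat* as a function;
# x is one data point we pass in as the argument.
theta = [2.0, 0.5, -1.0]
x = [1.0, 3.0, 2.0]

# theta' * x: a single function on a single vector -> one prediction value.
prediction = sum(t, 0) if False else sum(t * xi for t, xi in zip(theta, x))
# 2.0 + 1.5 - 2.0 = 1.5
```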

More complex derivations like this:

can be worked through. In some cases it gets tricky because we store the data as rows (not columns) in the matrix, but now I have much better tools to follow along. You can start estimating when you'll get a single value, or when you'll get a "permutation grid" as a result.

Geometric scaling and linear composition have their place, but here I want to think about information. "The information in x is becoming a function, and we're passing itself as the parameter."

Long story short, don't get locked into a single intuition. Multiplication evolved from repeated addition, to scaling (decimals), to rotations (imaginary numbers), to "applying" one number to another (integrals), and so on. Why not the same for matrix multiplication?

Happy math.

You may be curious why we can't use the other combinations, like `x x` or `x' x'`. Simply put, the parameters don't line up: we'd have functions expecting 3 inputs only being passed a single parameter, or functions expecting single inputs getting passed 3.

The dot product `x' * x` could be seen as the following javascript command:

`(function(a,b,c){ return 3*a + 4*b + 5*c; })(3,4,5)`

We define a function of 3 arguments and pass it the 3 parameters. This returns 50 (the dot product).

The math notation is super-compact, so we can simply write (in Octave/Matlab) `>> [3 4 5] * [3 4 5]'`, which returns `ans = 50`.

(Remember that `[3 4 5]` is the function and `[3; 4; 5]` or `[3 4 5]'` is how we'd write the data vector.)

This article came about from a TODO in my class notes:

I wanted to explain to myself — in plain English — why we wanted `x' x` and not the reverse. Now, in plain English: we're treating the information as a function, and passing the same info as the parameter.

- Build a **lasting intuition** for the key ideas.
- During the course, understand it enough to solve problems.
- After the course, enjoy it enough to revisit.

That's why I learn things. Non-goals are transcribing what a teacher says, or cramming only to forget everything. (Yeah, it's a game we play, but we're stepping off the treadmill and only cheating ourselves. Most subjects have useful insights buried somewhere.)

So, here's my strategy when studying:

- If an idea clicks, write down the *Aha!* moment in language you'd use yourself.
- If it doesn't, write down the *Huh?* moment. Move on and try again later (such as with the ADEPT method).

Keep it simple, like the KonMari method of organizing: *Look at everything in your house.* *Does it spark joy? Keep what does, thank and donate what doesn't.*

**A simple study plan: Go through the material. Did it click? Write down what helped, otherwise look for a better explanation.**

My current learning project is the Machine Learning Class on Coursera. I've read a smattering of blog posts, the subject is growing, and after my friend asked me to join the class, I had to sign up. (It's great.)

Here's where I'm keeping my notes, Aha, and Huh moments:

Machine Learning Notes on Google Docs

This is one of the best learning experiences I can remember. A few examples:

For the major concepts the course depends on, I keep a 5-second summary in mind. Why does this underlying concept exist? In plain English, what does it mean?

- Linear Algebra: spreadsheets for your equations. We "pour" data through various operations.
- Natural log: time needed to grow. Helps normalize widely varying numbers.
- e^x: models continuous growth, has a simple derivative.
- Gradient: direction of greatest change, helps optimize.
- Calculus: art of breaking a system into steps. With the gradient, we can move in the best direction.

I reference these snippets as I encounter new formulas.

There was a formula that I expected to be positive ("cost" should be positive), yet it had a negative sign out front. What gives?

It turns out I had forgotten a part of the derivation, where we expected the natural log to be negative. (This happens when we take the logarithm of numbers less than 1 — in other words, we are going "back in time" and shrinking.)

I would have preferred the equation written another way, and I made a note of this Huh? moment.

Early in the course, we define a "cost" function which tracks the difference between our predictions and the real value.

Why not call this difference something normal, like error?

It turns out "cost" is used because later in the course, we have items to minimize (like the number of variables in our model) which are not directly related to the error. The "cost" captures things outside the model, like the complexity we have. (If two models make equally accurate predictions, prefer the simpler one.)

Ah, "cost" can include fuzzier concepts. (I'd still prefer that laid out up-front.)

As I go through the course, I have a plain-English definition in mind. What's it all about?

**Machine Learning: Create models with Linear Algebra, then improve them with Calculus.**

- Linear Algebra lets us use many (tens, hundreds, thousands) of variables in a "math spreadsheet".
- Calculus lets us improve our spreadsheet via feedback on how well it's working. Using functions like e^x, ln(x), x^2, etc. makes it easy to take derivatives. Absolute value, if/then statements, etc. aren't easy to work with.
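A toy sketch of why easy derivatives matter: gradient descent on a cost of x^2, whose derivative 2x supplies the feedback. (The learning rate, starting point, and step count here are made up.)

```python
# Follow the derivative's feedback downhill until we reach the minimum.
def gradient_descent(dcost_dx, x0, learning_rate=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= learning_rate * dcost_dx(x)   # step against the gradient
    return x

# d/dx of x**2 is 2*x; descending it converges near the minimum at 0.
best_x = gradient_descent(lambda x: 2 * x, x0=5.0)
```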

Now my thinking becomes: What types of predictive models can I make? If Linear Algebra can describe it, let's use it.

After the course is done, you're left with a set of notes that make sense to you: the Ahas, Huhs, and other gotchas. (This website is a running collection of mine.)

Future learning gets that much easier. Remember how you were confused about a topic a few years ago? Well, let's read the explanation *you wrote to yourself* on how to overcome it. Over time you build up a massive collection.

Other tips:

Embrace your confusion. The hesitation you feel when you see a formula is ok. Try to break down each part of the equation, ask what it means, make note of what is confusing and return over time. Every positive sign, every variable, why are they there?

It's ok to forget things - I do all the time. I just want a list of intuitions to load up when needed. Often a single phrase or diagram will bring it all back.

These notes are meant for you. Make them fast and quick. (My notes eventually become articles, but they stay informal and for my own use till then.)

The textbook already exists. Don't simply copy what the teacher/book said, add what *you need* to make it clear.

This course is among the most fun I've had -- this is what learning should feel like, exploration with constant refinement. I'm curious to see if this approach helps you too.

For your next course, try keeping your notes in a single Google doc. Write down your Aha! and Huh? moments. Send me a link and I'll add them to this list:

- Kalid Azad - Coursera Machine Learning
- [you go here]

I'm curious to see what works for you, feedback is always welcome.

Happy math.

However, the numbers follow a grid, with rules nobody told me (image source, click to enlarge):

Even numbers go East/West (I-90, I-10), and odd numbers go North/South (I-5, I-95). Think "Even" goes "East".

Numbers increase towards the Northeast. (Hey, NYC thinks it's the center of the world, right?) I-5 is on the West coast, I-95 on the East coast. I-10 must be in Texas, I-90 must be in Massachusetts.

Auxiliary interstates connect to the primary ones, and have 3 digits: 290 connects to 90, 495 connects to 95, etc.

- Odd prefixes (190) connect once into the city from the interstate ("spur").
- Even prefixes (495) typically loop around a city. (Being a man-made system, there are exceptions.)

Whoa. There's so much information conveyed in a simple numbering scheme! Without looking at a map, I know I can drive from Seattle to Boston on I-90. Maybe I'll take I-95 South when I'm there and make my way to Florida. On the way I'll take I-10 West, over to LA, then drive up I-5 North back to Seattle.

How does this work?

- We have a concept of a number, and all its properties (even/odd, size, number of digits...)
- We noticed a real-world object (a highway) that had various properties (North/South, position, major/minor)
- We associated the properties of the number with the properties of the object

*This* is thinking mathematically. It's not about doing arithmetic quickly, or memorizing formulas, it's about connecting patterns. Math is a zoo of made-up objects that we relate to ones in the real world. The "usefulness" of the made-up objects depends on our imagination.

Have we used all the interesting properties of a number? How about whether it's a prime number.

Suppose local routes used small prime numbers: Route 2, 3, 5, 7, 11. (Yep, remember that 2 is prime.)

Once the main routes are numbered, smaller roads that *connect* them can follow this rule:

If you connect two routes, use their product. 3 * 11 = 33, so Route 33 connects Route 3 and 11.

If you loop back to the same route, just square it. 3 * 3 = 9, so Route 9 connects Route 3 to itself.

If you connect three roads, it could be Route 66 (connecting routes 2, 3 and 11).

Will this always work? You bet. Any two primes, when multiplied, give a *unique* number. 33 will never be reached by any other combination of primes. (The fancy math phrase: every number has a unique prime factorization.)

See how we're trying to cram a bunch of information into a little number? That's the essence of binary data.

An eight-bit binary number like `01000100` is essentially eight true/false questions:

- Are you East/West? (1 if yes, 0 otherwise)
- Are you a local connection? (1 if yes...)
- Are you a spur road?
- Treating your route number as a set of binary digits...
- Anything in the ones digit?
- Anything in the twos digit?
- Anything in the fours digit?
- Anything in the eights digit?
- Anything in the sixteens digit?

An 8-bit binary number packs a bunch of related questions into a single byte, which is what makes binary so efficient.
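A sketch of the packing idea in Python, with hypothetical flag names:

```python
# Hypothetical route flags packed into one byte, one bit per yes/no question.
EAST_WEST = 1 << 0   # bit 0: runs East/West?
LOCAL     = 1 << 1   # bit 1: local connection?
SPUR      = 1 << 2   # bit 2: spur road?

flags = EAST_WEST | SPUR   # 0b101 -> an East/West spur, not a local connection

is_east_west = bool(flags & EAST_WEST)   # True
is_local     = bool(flags & LOCAL)       # False
```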

Numbers have a bunch of properties, right? Aren't we curious to discover more, like the remainder (modular arithmetic)? Maybe Route 12 (which is one set of 11, remainder 1) has some connection to Route 11.

Happy math.
