Admittedly, the hyperbolic functions were tucked into a dark part of my attic. They were defined with strained motivations (*"Need yet another way to build a hyperbola?"*) then crammed into tables of integrals, soon to be forgotten. I couldn't think with them.

After much struggle, I found their purpose:

- **What are the hyperbolic functions ($\cosh$ and $\sinh$)?** The even/odd parts of the exponential function ($e^x$) that, funny enough, can build a hyperbola.
- **Why are parts of the exponential called hyperbolic?** That's the modern name. These functions are *so darn good* at making hyperbolas that they're typecast for that role. (Similarly, sine isn't just about circles, and we shouldn't name it "circular sine"!)
- **Why are hyperbolic functions useful?** A better framing is: Why are parts of $e^x$ useful? We now have "mini logarithms" and "mini exponentials", with partial versions of $e$'s famous properties.
- **I can handle it: how do hyperbolas connect to exponentials?** Hyperbolas come from inversions ($xy = 1$ or $y = \frac{1}{x}$). The area under an inversion grows logarithmically, and the corresponding coordinates grow exponentially. If we rotate the hyperbola, we rotate the formula to $(x-y)(x+y) = x^2 - y^2 = 1$. The area/coordinates now follow modified logarithms/exponentials: the hyperbolic functions.
- **Actually, I couldn't handle it.** That's ok. We'll build up to it. These functions took many years to be discovered, and their behavior is hardly obvious.

This post is fairly technical: we're studying hydrogen, not water. If hyperbolic functions appear in class, you don't have much choice, and may as well get an intuition. If you're studying for fun, don't sweat the details, that's what calculus students are for.

As a prerequisite, have these insights in mind:

- $e^x$ is the process of continuous, 100% growth
- natural log is the time for $e^x$ to grow to a given value

Let's dive in.

Chemical compounds can be separated into constituent atoms; math objects are similar.

The number 13 can be split into an even part (12) and odd part (1). They combine to the whole: 13 = 12 + 1.

This even/odd split works for functions, too:

Functions are tricky to separate because they have multiple values. The separation we look for is between the future values ($x > 0$) and the past ones ($x < 0$).

To see what the future and past have in common, take their average:

$f_\text{even}(x) = \frac{f(x) + f(-x)}{2}$

To see how the future and past differ, average their gap:

$f_\text{odd}(x) = \frac{f(x) - f(-x)}{2}$

Can we combine these parts to get the original?

$f_\text{even}(x) + f_\text{odd}(x) = \frac{f(x) + f(-x)}{2} + \frac{f(x) - f(-x)}{2} = \frac{f(x) + f(-x) + f(x) - f(-x)}{2} = \frac{2f(x)}{2} = f(x)$

Neat trick. We can split any pattern into its even and odd parts.
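Here's a minimal sketch of the split in code. The helper names (`even_part`, `odd_part`) and the sample polynomial are made up for illustration; any function defined for both $x$ and $-x$ should work the same way.

```python
def even_part(f):
    """Even part of f: the average of the future f(x) and the past f(-x)."""
    return lambda x: (f(x) + f(-x)) / 2


def odd_part(f):
    """Odd part of f: half the gap between f(x) and f(-x)."""
    return lambda x: (f(x) - f(-x)) / 2


# Hypothetical sample function, chosen only for illustration.
def sample(x):
    return x**3 + 2 * x**2 + 5


sample_even = even_part(sample)  # should behave like 2x^2 + 5
sample_odd = odd_part(sample)    # should behave like x^3

for x in [-2.0, 0.5, 3.0]:
    recombined = sample_even(x) + sample_odd(x)
    print(x, sample_even(x), sample_odd(x), recombined, sample(x))
```

The last two columns should agree: adding the parts back together recovers the original function.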

The Taylor Series (Math DNA) expresses a function as a polynomial:

The even exponents ($x^0, x^2, ...$) are symmetric in the past and future (for example: $x^2 = (-x)^2$), and the odd exponents are anti-symmetric ($x^3 = - (-x)^3$).

We can quickly extract the even/odd parts by separating the function's Taylor series into even/odd exponents:

Ok, we have our trick. Why not try to split up the famous exponential function?

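Instead of the awkward $e^x_{\text{even}}$ and $e^x_{\text{odd}}$, we call the even part $\cosh$ (hyperbolic cosine) and the odd part $\sinh$ (hyperbolic sine). (The pronunciation varies.)

$\cosh(x) = \frac{e^x + e^{-x}}{2} \qquad \sinh(x) = \frac{e^x - e^{-x}}{2} \qquad e^x = \cosh(x) + \sinh(x)$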

Now, why the adjective "hyperbolic"? Euler used the quantity $(e^{x} + e^{-x})$ without giving it a special name. Lambert later called them "transcendental logarithmic functions", and even later the ability to build hyperbolas was seen. That use case has stuck.

Call me old-fashioned, but parts of $e^x$ are interesting for that reason alone. I don't need a hyperbola to justify their utility.

Let's graph $\cosh$ and $\sinh$ with their parent:

- $e^x$ is our standard exponential, with Taylor series: $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...$
- $\cosh(x)$ is a gentle bowl. It's roughly parabolic, then expands exponentially, with Taylor series: $\cosh(x) = 1 + \frac{x^2}{2!} + \frac{x^4}{4!} + ...$
- $\sinh(x)$ looks linear at first, then grows exponentially also: $\sinh(x) = x + \frac{x^3}{3!} + ...$

Now for a usually-confusing question: **What does the parameter $x$ in $\cosh(x)$ really mean?**

It's nothing special! It's the same $x$ we feed to any exponential, usually the time to grow: $e^x = e^{rt} = e^{100\% t}$.

We can see the hyperbolic trig functions as:

- $\cosh(x)$: What is the *even part* of $e^x$ after $x$ units of time?
- $\sinh(x)$: What is the *odd part* of $e^x$ after $x$ units of time?

A few gut checks:

- What's $\cosh(0)$? At $x=0$ (aka $\text{time} = 0$), we haven't moved from 1.0 (the exponential starting point). The future and past are both 1, so their average is $\cosh(0) = 1$, and there's no separation, so $\sinh(0) = 0$.
- As time goes on, the future grows and the past ($e^{-x}$) vanishes. The average value becomes $\frac{e^x + 0}{2}$, and the average difference becomes $\frac{e^x - 0}{2}$. For large $x$, we'd expect $\cosh(x) \sim \sinh(x) \sim 0.5 e^x$.
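The gut checks above can be tried numerically. This is a quick sketch using only Python's standard `math` module; the function names `cosh_from_exp` and `sinh_from_exp` are my own labels for the even and odd parts.

```python
import math


def cosh_from_exp(x):
    """Even part of e^x: average of the future e^x and the past e^-x."""
    return (math.exp(x) + math.exp(-x)) / 2


def sinh_from_exp(x):
    """Odd part of e^x: half the gap between e^x and e^-x."""
    return (math.exp(x) - math.exp(-x)) / 2


for x in [0.0, 1.0, 5.0, 10.0]:
    c, s = cosh_from_exp(x), sinh_from_exp(x)
    half_exp = 0.5 * math.exp(x)
    print(f"x={x:5.1f}  cosh={c:12.4f}  sinh={s:12.4f}  0.5*e^x={half_exp:12.4f}")
```

At $x = 0$ the parts are 1 and 0, and by $x = 10$ all three columns are nearly identical.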

Now, how about the *inverse* hyperbolic functions like $\text{acosh}$ (also called $\text{arccosh}$ or $\text{arcosh}$)? "Inverse hyperbolic cosine" sounds scary, but think of it like this:

- $\ln(a)$: How long until $e^x$ reaches value $a$?
- $\text{acosh}(a)$: How long until the *even* part of $e^x$ reaches $a$? This must take longer than $\ln(a)$, since we're only using the *even* powers in our exponential growth Taylor series.
- $\text{asinh}(a)$: How long until the *odd* part of $e^x$ reaches $a$? Similarly, this must require more time than $\ln(a)$.

In the graphs above, $\text{acosh}$ and $\text{asinh}$ require more time than (i.e., are above) the natural log.

Intuitively, I see the various hyperbolic functions as modified exponentials and logarithms:

- $\cosh$ and $\sinh$ are partial/delayed exponential curves (with new behavior near zero, where the past still has influence).
- $\text{acosh}$ and $\text{asinh}$ are slower versions of the natural log. It takes $\ln(3) \approx 1.10$ units for $e^x$ to grow from 1 to 3, but it takes $\text{acosh}(3) \approx 1.76$ units for just the even part of $e^x$ to do the same.

Since $\cosh$ and $\sinh$ are mini exponentials (those little cherubs!), we'd guess they still have interesting properties.

$e^x$ is the function equal to its own derivative ($f' = f$).

Ok. And sine and cosine are functions equal to their own *fourth* derivatives ($f'''' = f$).

Repeating after one derivative, repeating after four derivatives... that's a big gap. Anything in-between?

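You bet. The hyperbolic functions equal their own *second* derivative ($f'' = f$):

$\frac{d^2}{dx^2}\cosh(x) = \cosh(x) \qquad \frac{d^2}{dx^2}\sinh(x) = \sinh(x)$

Unlike $\sin$ and $\cos$, there are no awkward negative signs in the cycle, just a toggle:

$\frac{d}{dx}\cosh(x) = \sinh(x) \qquad \frac{d}{dx}\sinh(x) = \cosh(x)$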

Neat. The hyperbolic functions are like "half exponentials" because it takes two derivatives to complete the cycle. This is why they're useful in calculus -- not because we care about the coordinates on a hyperbola!
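If you'd like to see the two-derivative cycle without doing the calculus, here's a rough numerical check. It uses a central-difference estimate (my choice of method, with an arbitrary step size `h`), so expect agreement to a handful of decimal places rather than exactly.

```python
import math


def second_derivative(f, x, h=1e-4):
    """Central-difference estimate of f''(x); h is an arbitrary small step."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


x = 1.3  # arbitrary sample point
for name, f in [("cosh", math.cosh), ("sinh", math.sinh), ("sin", math.sin)]:
    print(f"{name:5s} f(x)={f(x):10.6f}  f''(x)={second_derivative(f, x):10.6f}")
```

For $\cosh$ and $\sinh$ the two columns match. For $\sin$ the second derivative comes back negated, which is why it needs four derivatives to complete its cycle.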

You'll see hyperbolic functions in tables of tricky integrals and derivatives:

Ignore the specifics. Let's see the general pattern without getting lost in the details.

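We have our standard relationship:

$\frac{d}{dx}\ln(x) = \frac{1}{x}$

Since the hyperbolic functions are variations of the exponentials, we'd expect $\frac{d}{dx}\text{asinh}$ to resemble $\frac{1}{x}$. From the table:

$\frac{d}{dx}\text{asinh}(x) = \frac{1}{\sqrt{x^2 + 1}}$

When $x$ is large, the "$+ 1$" doesn't matter much, and the derivative becomes:

$\frac{1}{\sqrt{x^2 + 1}} \approx \frac{1}{\sqrt{x^2}} = \frac{1}{x}$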

Ah! The function $\text{asinh}(x)$ behaves like $\ln(x)$, except for small $x$ where the $+1$ term still matters. Without this exponential perspective, I'd have *no clue* what the derivative of $\text{asinh}$ should resemble. (*You want the rate of change of the inverse function that determines the y-coordinate a in a hyperbola? What?*)

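Remember, we can now split the exponential function whenever we want:

$e^x = \cosh(x) + \sinh(x)$

So, what if we plug in $ix$? We'd get

$e^{ix} = \cosh(ix) + \sinh(ix)$

But Euler's formula says:

$e^{ix} = \cos(x) + i \sin(x)$

Whoa. The even/odd parts of each function must be the same, so:

$\cos(x) = \cosh(ix) \qquad i \sin(x) = \sinh(ix)$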

If we feed the imaginary axis to our everyday exponential function, we get the trig functions which live on a circle. The rotation is happening via the parameter $ix$, vs. in the function $e^{ix}$. (Instead of rotating the steering wheel, we're rotating the engine, so to speak.)

In sine's case, we have an awkward imaginary term to shuffle around:

$\sin(x) = \frac{\sinh(ix)}{i} = -i \sinh(ix)$

These connections are more useful in complex analysis (still learning...). Normally you'd prefer to pass real parameters into functions.

A few new identities don't hurt. With some quick algebra, we can turn regular trig identities into their hyperbolic versions. Starting with:

$\sin^2(x) + \cos^2(x) = 1$

We swap in $\cos(x) = \cosh(ix)$ and $\sin(x) = -i \sinh(ix)$ to get:

$(-i \sinh(ix))^2 + \cosh^2(ix) = 1 \quad \Rightarrow \quad \cosh^2(ix) - \sinh^2(ix) = 1$

Now, the term $ix$ is just a parameter. To simplify things, just say $z = ix$ and write:

$\cosh^2(z) - \sinh^2(z) = 1$

We can leave out the specific parameter and write:

$\cosh^2 - \sinh^2 = 1$

This pattern works in general. To convert a regular trig formula to its hyperbolic equivalent (Osborne's Rule), swap:

- $\cos^2 \Rightarrow \cosh^2$
- $\sin^2 \Rightarrow -\sinh^2$ (due to the $i$ when converting $\sin$ into $\sinh$)
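Python's standard `cmath` module accepts complex arguments, so we can poke at these swaps directly. This is only a spot check at a few arbitrary sample values, not a proof; the helper names are my own.

```python
import cmath
import math

x = 0.7  # arbitrary sample angle

# cos(x) = cosh(ix): feeding the imaginary axis to cosh gives cosine.
print(cmath.cosh(1j * x), math.cos(x))

# sin(x) = -i sinh(ix): sine needs the extra factor of -i.
print(-1j * cmath.sinh(1j * x), math.sin(x))


def trig_identity(t):
    """cos^2 + sin^2, which should be 1."""
    return math.cos(t) ** 2 + math.sin(t) ** 2


def hyperbolic_identity(t):
    """cosh^2 - sinh^2: the same identity with sin^2 swapped for -sinh^2."""
    return math.cosh(t) ** 2 - math.sinh(t) ** 2


for t in [0.0, 0.7, 2.0]:
    print(t, trig_identity(t), hyperbolic_identity(t))
```

Both identities should print values within rounding error of 1.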

While looking for applications of $\cosh$, I ran across this function, used in Machine Learning:

$\text{logcosh}(x) = \ln(\cosh(x))$

A direct interpretation is confusing: "Take the natural log of the x-coordinate in a hyperbola". Huh? What does that represent?

The exponential perspective makes it simpler:

- We don't want to take the natural log of the regular exponential, since $\ln(e^x) = x$. A function like $x$ (or rather, the absolute value $|x|$) is too pointy, and doesn't have a clean derivative at zero.
- However, $\cosh(x)$ only *resembles* the regular exponential function. Its natural log will only *resemble* $f(x) = x$. The resulting function $\ln(\cosh(x))$ looks parabolic near zero, and linear as we grow:

How does this work? For small $x$:

- $\cosh(x) \approx 1 + \frac{x^2}{2}$. These are the first few terms of its Taylor series.
- $\ln(1 + x) \approx x$. The time to grow from 1.0 to 1 + .01 at 100% interest is only .01 units (not enough time for compounding)
- Combining the approximations, we get: $\ln(\cosh(x)) \approx \ln(1 + \frac{x^2}{2}) \approx \frac{x^2}{2}$.

As $x$ increases, we approach an offset line:

$\ln(\cosh(x)) \approx \ln\left(\frac{e^x}{2}\right) = x - \ln(2)$

Cool! We have a parabola-line hybrid. No need for hyperbolas, we're dealing with variations of the exponential function.
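To see the parabola-line hybrid in numbers, here's a small sketch comparing $\ln(\cosh(x))$ against $\frac{x^2}{2}$ for small inputs and against $x - \ln(2)$ for large ones. The sample points are arbitrary, and `logcosh` is just a local helper, not a library call.

```python
import math


def logcosh(x):
    """Natural log of the even part of e^x."""
    return math.log(math.cosh(x))


print("Small x: compare with the parabola x^2 / 2")
for x in [0.01, 0.1, 0.5]:
    print(f"  x={x:5.2f}  logcosh={logcosh(x):.6f}  x^2/2={x * x / 2:.6f}")

print("Large x: compare with the offset line x - ln(2)")
for x in [3.0, 5.0, 10.0]:
    print(f"  x={x:5.2f}  logcosh={logcosh(x):.6f}  x-ln2={x - math.log(2):.6f}")
```

The parabola match loosens as $x$ approaches 1, while the line match tightens as $x$ grows.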

**Bonus: Lining Things Up**

Maybe we can get $\text{logcosh}$ to turn into the line $y = x$, instead of the offset line $y = x - \ln(2)$. We can add a term to undo the $-\ln(2)$:

Notes:

- Our new function (purple) still looks like a parabola for small $x$.
- The $x^2$ in the $e^{-0.1x^2}$ term makes the function symmetric.
- Adjusting the constant $0.1$ changes how fast we approach the line $y=x$.

A short while back, I'd have no idea how to make a parabola gently transition into a line. But seeing $\cosh$ as "parabolic short term, exponential long term" gives us a clue: use the natural log to undo the "exponential long term" behavior, giving us "parabolic short term, linear long term".

**Hyperbolic Tangent**

The hyperbolic tangent is used as a machine learning activation function:

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

What's its meaning? It's the ratio between the odd and even parts of the exponential function after a given amount of time.

- At $x=0$, the even part dominates (full symmetry between past and future)
- As $x$ grows, the odd part catches up, and the ratio approaches 1 (equal parts symmetry and anti-symmetry)

The function $\tanh$ is nicely centered at 0 and smoothly varies between -1 and 1. As $x$ increases, $\cosh(x) \sim \sinh(x) \sim 0.5e^x$ and $\frac{0.5e^x}{0.5e^x} = 1$.
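Here's a minimal sketch that builds $\tanh$ as "odd part over even part" and compares it with the standard library's `math.tanh`. The name `tanh_from_parts` is my own.

```python
import math


def tanh_from_parts(x):
    """Ratio of the odd part of e^x to its even part."""
    even = (math.exp(x) + math.exp(-x)) / 2
    odd = (math.exp(x) - math.exp(-x)) / 2
    return odd / even


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x={x:5.1f}  odd/even={tanh_from_parts(x):9.6f}  math.tanh={math.tanh(x):9.6f}")
```

The ratio starts at 0, and by $x = \pm 5$ it is already very close to $\pm 1$.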

Ok. We've gone pretty far without talking about hyperbolas. Why?

Well, just look at them. They aren't an everyday shape. What's their purpose? Catenaries, orbital mechanics? Those are *applications*, but not the reason they exist.

Time to build an intuition.

Hyperbolas are built from inverse functions, like $y = \frac{1}{x}$. It's a simple, useful relationship: what's the inverse of $x=2$? $y=1/2$. Multiply them and we undo all scaling: $xy=1$.

A burning math question might be: sure, we have this inverse relationship, but what's the area underneath?

In calculus terms, this is:

$\text{Area} = \int_1^x \frac{1}{x} dx$

Where

- $\int_1^x$ specifies the boundaries. We start at $x=1$ and go to some upper value for $x$. (We can't start trapping area at $x=0$, since it shoots up to infinity.)
- $\frac{1}{x} dx$ is the area we collect at each x-coordinate as we march along

This integral is hard. Thankfully, one definition of the natural log is:

$\ln(x) = \int_1^x \frac{1}{x} dx$

so we have a ready-made solution. Starting from $x=1$, what upper x-coordinate will trap 5 units of area? We want to solve:

$\ln(x) = 5$

Which means

$x = e^5 \approx 148.4$

Wow! That's a large coordinate to trap 5 measly units of area.

In general, we have

$\text{x-coordinate} = e^{\text{area}}$

And the y-coordinate (inverse) at that position is $y = \frac{1}{x} = \frac{1}{e^{\text{area}}} = e^{-\text{area}}$.
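If the logarithm definition feels like a magic trick, we can brute-force the area instead. This sketch adds up thin rectangles under $y = \frac{1}{x}$ (a midpoint rule, with an arbitrary number of steps) and checks that stopping at $x = e^5$ traps about 5 units.

```python
import math


def area_under_inverse(upper, steps=200_000):
    """Midpoint-rule estimate of the area under y = 1/x from 1 to upper."""
    width = (upper - 1) / steps
    return sum(width / (1 + (i + 0.5) * width) for i in range(steps))


upper = math.exp(5)
print("x-coordinate:", upper)
print("trapped area:", area_under_inverse(upper))
print("y-coordinate:", 1 / upper, "vs e^-5:", math.exp(-5))
```

The trapped area lands very close to 5, so the coordinates really do grow exponentially with the area.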

There's multiple ways to make a hyperbolic curve. If we rotate 45 degrees, we get something like this:

How do we rotate the equation $xy=1$? The standard way is with a rotation matrix, but let's do the rotation with complex numbers.

Let's treat points as complex numbers: $(x, y) \rightarrow x + yi$. A sample point $(a + bi)$ is on the *rotated* hyperbola if, after undoing the rotation, we see our original $xy = 1$ relationship.

- Candidate point: ($a + bi$)
- 45-degree counterclockwise rotation: $\frac{(1 + i)}{\sqrt{2}}$

Ok, we turned our candidate back to its original (pre-rotation) point, which is at

When does it have our $xy=1$ relationship?

Ok! Our candidate is on the rotated hyperbola if: $a^2 - b^2 = 2$ or $\text{real magnitude}^2 - \text{imaginary magnitude}^2 = 2$. We can express this requirement in $(x, y)$ notation as:

Almost there. Our original hyperbola ($xy = 1$) contains the point $(1, 1)$, which is a distance of $\sqrt{x^2 + y^2} = \sqrt{1^2 + 1^2} = \sqrt{2}$ from the origin. The constraint equation is really $x^2 - y^2 = r^2$.

If we set the radius to 1, we get a formula for the unit hyperbola:

Tada! A nice, clean equation, but I also see it as $(x - y)(x + y) = 1$, which hints at the inverse relationship.

Ok. Time for the scary diagram you'll see in most textbooks:

We have our rotated hyperbola, and want to trap $a/2$ units of area in red. (The full $a$ units would include the area under the x-axis.) What functions determine the coordinates that trap this area?

The solution turns out to be the even and odd parts of the exponential, $\cosh$ and $\sinh$. There's a 1950s pamphlet, "Hyperbolic Functions" by V. G. Shervatov, that goes through the derivation. The key intuition is realizing that hyperbolas (generally speaking) trap area logarithmically, so the necessary coordinates grow exponentially:

- $xy = 1$ hyperbola (trap area under curve)
  - $\text{x-coordinate} = e^{\text{area}}$
  - $\text{y-coordinate} = \frac{1}{e^{\text{area}}} = e^{-\text{area}}$
- $x^2 - y^2 = 1$ hyperbola (trap $\frac{\text{area}}{2}$ per diagram)
  - $\text{x-coordinate} = \frac{e^{\text{area}} + e^{-\text{area}}}{2} = \cosh(\text{area})$
  - $\text{y-coordinate} = \frac{e^{\text{area}} - e^{-\text{area}}}{2} = \sinh(\text{area})$

In case it needs to be said: it's *not obvious* that the even/odd parts of the exponential function determine the coordinates that trap area in a rotated hyperbola.

(Aside: Hyperbolas can be defined in terms of distance to fixed points or a conic section, but this gives no intuition for why exponentials are involved.)

One more confusion is why we need *new* functions to parameterize the hyperbola, when existing trig functions do the trick:

Start with a circle

For any angle $t$, we have coordinates $(x, y) = (\cos(t), \sin(t))$

- If we invert the x coordinate (hey, it's what hyperbolas do), we get $x = \frac{1}{\cos(t)} = \sec(t)$
- If we scale the y coordinate by that *same inversion*, we get $y = \sin(t) \cdot \frac{1}{\cos(t)} = \tan(t)$

Does this really make a hyperbola? It meets the requirements:

$x^2 - y^2 = 1$ (relationship needed for rotated hyperbola)

$\sec^2(t) - \tan^2(t) = 1$ (rearrangement of the famous $\sec^2(t) = 1 + \tan^2(t)$)

This video shows how the various parameterizations behave (open the calculator):

Our familiar trig functions $(\sec(t), \tan(t))$ trace the *same hyperbola* as the fancy new $(\cosh(t), \sinh(t))$. They just go at a different speed. And the parameter $t$ is just an everyday angle we plug into trig functions.
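A quick sanity check that both parameterizations land on $x^2 - y^2 = 1$. The helper `on_unit_hyperbola` and its tolerance are my own choices for this sketch, and the sample values of $t$ are arbitrary (kept below $\pi/2$ so $\sec$ stays finite).

```python
import math


def on_unit_hyperbola(x, y, tol=1e-9):
    """True if (x, y) satisfies x^2 - y^2 = 1 within a small tolerance."""
    return abs(x * x - y * y - 1) < tol


for t in [0.0, 0.5, 1.0]:
    trig_point = (1 / math.cos(t), math.tan(t))    # (sec t, tan t)
    hyper_point = (math.cosh(t), math.sinh(t))     # (cosh t, sinh t)
    print(f"t={t}")
    print("  sec/tan  :", trig_point, on_unit_hyperbola(*trig_point))
    print("  cosh/sinh:", hyper_point, on_unit_hyperbola(*hyper_point))
```

Both points sit on the hyperbola for every $t$, but at different positions: same track, different speed.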

Although there are multiple parameterizations for the hyperbola, $\cosh$ and $\sinh$ are defined with exponentials and are the analog of $\sin$ and $\cos$ in Euler's Formula. They can wear the Official Hyperbolic Parameterization crown.

The inverse hyperbolic functions go from coordinates back to area. Let's say I'm on the unit hyperbola, with an x-coordinate of 5. How much area have I trapped? $\text{acosh}(5) = 2.29$.

The inverse functions are sometimes called $\text{arcosh}$ ("area hyperbolic cosine"). This forces us to think about the coordinate-to-area conversion. I prefer to think about exponentials, and use $\text{acosh}$ ("inverse hyperbolic cosine"). Area is one interpretation, don't force me into it.

We can derive the formulas for the inverse functions by solving $x = \cosh(y)$ and $x = \sinh(y)$:

$\text{acosh}(x) = \ln(x + \sqrt{x^2 - 1}) \qquad \text{asinh}(x) = \ln(x + \sqrt{x^2 + 1})$

As expected, these look like modified logarithms. As $x$ grows, they approach $\ln(x + x) = \ln(2x) = \ln(x) + \ln(2)$, or the natural logarithm with an offset.
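The standard logarithm forms of the inverse functions, $\ln(x + \sqrt{x^2 - 1})$ and $\ln(x + \sqrt{x^2 + 1})$, can be compared against Python's built-in `math.acosh` and `math.asinh`. The function names below are local to this sketch.

```python
import math


def acosh_log(x):
    """Inverse hyperbolic cosine written as a modified logarithm (x >= 1)."""
    return math.log(x + math.sqrt(x * x - 1))


def asinh_log(x):
    """Inverse hyperbolic sine written as a modified logarithm."""
    return math.log(x + math.sqrt(x * x + 1))


for x in [3.0, 5.0, 100.0]:
    print(f"x={x:6.1f}  acosh={acosh_log(x):.4f} ({math.acosh(x):.4f})"
          f"  asinh={asinh_log(x):.4f} ({math.asinh(x):.4f})"
          f"  ln(2x)={math.log(2 * x):.4f}")
```

By $x = 100$, all three columns agree to about four decimal places, matching the "natural log with an offset" picture.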

The main application of the geometric view is that $\cosh(x)$ is the shape a rope takes when hanging between two fixed points. It's not quite a parabola, it's a catenary curve, with the St. Louis Arch as a famous example. Here's a few more curves (source) that follow $\cosh(x)$:

The process to build this curve is fairly subtle:

- First, create a rotated hyperbola with $x^2 - y^2 = 1$
- Instead of using the hyperbola, make a graph of *just the x-coordinate*.
- This graph of *just the x-coordinate* makes a new curve, which models how the rope hangs

This convoluted process isn't how $\cosh$ was discovered. There's a differential equation that models the forces inside a hanging rope:

To solve the differential equation, we need the convenient exponential properties of $\cosh$, and wind up with:

It's cute that $\cosh$ parameterizes a hyperbola, but that interpretation has nothing to do with why it's the solution. I think "the catenary follows the even part of the exponential function" not "the catenary follows the x-coordinate of the hyperbola".

The area under the exponential $e^x$ equals the current value (plus a constant). Consider the region from $x=0$ to $x=2$:

- Area under curve: $\int_0^2 e^x = e^2 - e^0 = 7.389 - 1 = 6.389$
- Current value: $e^2 = 7.389$
- Pattern: $\text{current value} = \text{area under curve} + 1$

A pretty clean connection, right? (Don't forget that $+C$)

Now how about $\cosh$?

- Current value: $\cosh(x)$
- Area under curve: $\int \cosh(x) = \sinh(x)$
- Arc length of curve: $\int \sqrt{1 + (\cosh'(x))^2} = \int \sqrt{1 + \sinh^2(x)} = \int \cosh(x) = \sinh(x)$
- Pattern: $\text{area under curve} = \text{arc length} = \sqrt{\text{current value}^2 - 1}$

The current value of $\cosh$ can be swapped in using the identity $\sqrt{\cosh^2 - 1} = \sinh$.

For large $x$, the $-1$ is negligible and $\sqrt{\text{current value}^2 - 1} \sim \text{current value}$. So, for large $x$, we get equality between area, arc length, and current value (imagine the green rope hanging down and just touching the x-axis). It's more connected than regular $e^x$, not bad!
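Here's a rough numerical check of the area/arc-length pattern for $\cosh$, measured from $0$ to $x$. The `integrate` helper is a plain midpoint rule I wrote for this sketch, and $x = 2$ is an arbitrary endpoint.

```python
import math


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule estimate of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


x = 2.0  # arbitrary right endpoint
area = integrate(math.cosh, 0.0, x)
arc_length = integrate(lambda t: math.sqrt(1 + math.sinh(t) ** 2), 0.0, x)
current_value = math.cosh(x)

print("area under curve      :", area)
print("arc length            :", arc_length)
print("sqrt(current^2 - 1)   :", math.sqrt(current_value ** 2 - 1))
print("current value cosh(x) :", current_value)
```

The first three numbers agree (they're all $\sinh(2)$), and the current value is only slightly larger.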

(Intuition for another day: Math deals with unitless quantities. $13 \ \text{cm}$ is not directly comparable with $13 \ \text{cm}^2$. Yet in math class, we can solve $x = 1 + x^2$ and nobody cares that constant, linear and squared terms are used in conjunction.)

The shape of the universe may be a hyperbola, and hyperbolic geometry is used in special relativity (beyond my pay grade). If we do live in a giant hyperbola, I, uh, may be forced to recant my "exponentials first" stance.

The hyperbolic functions can be seen as exponential functions (relating time and growth) or geometric functions (relating area and coordinates). Hyperbolas, generally speaking, have logarithmic area and exponential coordinates.

It's been a long journey, but these functions don't haunt my attic any more.

Happy math.

Convolution is usually introduced with its formal definition:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau$

Yikes. Let's start *without* calculus: **Convolution is fancy multiplication.**

Imagine you manage a hospital treating patients with a single disease. You have:

- **A treatment plan:** `[3]` Every patient gets 3 units of the cure on their first day.
- **A list of patients:** `[1 2 3 4 5]` Your patient schedule for the week (1 person Monday, 2 people on Tuesday, etc.).

Question: How much medicine do you use each day? Well, that's just a quick multiplication:

```
Plan * Patients = Daily Usage
[3] * [1 2 3 4 5] = [3 6 9 12 15]
```

Multiplying the plan by the patient list gives the usage for the upcoming days: `[3 6 9 12 15]`. Everyday multiplication (`3 x 4`) means using the plan with a single day of patients: `[3] * [4] = [12]`.

Let's say the disease mutates and requires a multi-day treatment. You create a new plan: `Plan: [3 2 1]`

That means 3 units of the cure on the first day, 2 on the second, and 1 on the third. Ok. Given the same patient schedule (`[1 2 3 4 5]`), what's our medicine usage each day?

Uh... shoot. It's not a quick multiplication:

- On Monday, 1 patient comes in. It's her first day, so she gets 3 units.
- On Tuesday, the Monday gal gets 2 units (her second day), but two new patients arrive, who get 3 each (2 * 3 = 6). The total is 2 + (2 * 3) = 8 units.
- On Wednesday, it's trickier: The Monday gal finishes (1 unit, her last day), the Tuesday people get 2 units (2 * 2), and there are 3 new Wednesday people... argh.

The patients are overlapping and it's hard to track. How can we organize this calculation?

An idea: imagine *flipping* the patient list, so the first patient is on the right:

```
Start of line
5 4 3 2 1
```

Next, imagine we have 3 separate rooms where we apply the proper dose:

```
Rooms 3 2 1
```

On your first day, you walk into the first room and get 3 units of medicine. The next day, you walk into room #2 and get 2 units. On the last day, you walk into room #3 and get 1 unit. There are no rooms afterwards, and you're done.

To calculate the total medicine usage, line up the patients and walk them through the rooms:

```
Monday
----------------------------
Rooms                     3  2  1
Patients      5  4  3  2  1
Usage                     3
```

On Monday (our first day), we have a single patient in the first room. She gets 3 units, for a total usage of 3. Makes sense, right?

On Tuesday, everyone takes a step forward:

```
Tuesday
----------------------------
Rooms                     3  2  1
Patients ->      5  4  3  2  1
Usage                     6  2     = 8
```

The first patient is now in the second room, and there's 2 new patients in the first room. We multiply each room's dose by the patient count, then combine.

Every day we just walk the list forward:

```
Wednesday
----------------------------
Rooms                     3  2  1
Patients ->         5  4  3  2  1
Usage                     9  4  1  = 14

Thursday
-----------------------------
Rooms                     3  2  1
Patients ->            5  4  3  2  1
Usage                    12  6  2  = 20

Friday
-----------------------------
Rooms                     3  2  1
Patients ->               5  4  3  2  1
Usage                    15  8  3  = 26
```

Whoa! It's intricate, but we figured it out, right? We can find the usage for any day by reversing the list, sliding it to the desired day, and combining the doses.

The total day-by-day usage looks like this (don't forget Sat and Sun, since some patients began on Friday):

```
Plan    * Patient List  = Total Daily Usage
[3 2 1] * [1 2 3 4 5] = [3 8 14 20 26 14 5]
           M T W T F     M T W  T  F  S  S
```

This calculation is the *convolution* of the plan and patient list. It's a fancy multiplication between a set of numbers and a "program".
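The room-by-room bookkeeping can be sketched in a few lines of Python. This is one straightforward way to write a discrete convolution, not the only one; the function name `convolve` and its argument names are mine, chosen to match the hospital analogy.

```python
def convolve(plan, patients):
    """Discrete convolution of a treatment plan with a patient list.

    A patient admitted on day `start` receives plan[k] units on day start + k.
    """
    usage = [0] * (len(plan) + len(patients) - 1)
    for k, dose in enumerate(plan):               # k = days since admission
        for start, count in enumerate(patients):  # start = admission day
            usage[start + k] += dose * count
    return usage


print(convolve([3], [1, 2, 3, 4, 5]))        # [3, 6, 9, 12, 15]
print(convolve([3, 2, 1], [1, 2, 3, 4, 5]))  # [3, 8, 14, 20, 26, 14, 5]
```

Notice there's no explicit "flip" in the code: adding each dose to day `start + k` does the same job as reversing the list and sliding it through the rooms.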

Here's a live demo. Try changing `F` (the plan) or `G` (the patient list). The convolution $c(t)$ matches our manual calculation above.

(We define functions $f(x)$ and $g(x)$ to pad each list with zero, and adjust for the list index starting at 1).

You can do a quick convolution with Wolfram Alpha:

```
ListConvolve[{3, 2, 1}, {1, 2, 3, 4, 5}, {1, -1}, 0]
{3, 8, 14, 20, 26, 14, 5}
```

(The extra `{1, -1}, 0` aligns the lists and pads with zero.)

I started this article 5 years ago (intuition takes a while...), but unfortunately the analogy is relevant to today.

Let's use convolution to estimate ventilator usage for incoming patients.

- Set $f(x)$ as the percent of patients needing ventilators. For example, `[.05 .03 .01]` means 5% of patients need ventilators the first week, 3% the second week, and 1% the third week.
- Set $g(x)$ as the weekly incoming patients, in thousands.
- The convolution $c(t) = f * g$ shows how many ventilators are needed each week (in thousands). $c(5)$ is how many ventilators are needed 5 weeks from now.

Let's try it out:

- `F = [.05, .03, .01]` is the ventilator use percentage by week
- `G = [10, 20, 30, 20, 10, 10, 10]` is the incoming hospitalized patients. It starts at 10k per week, rises to 30k, then decays to 10k.

With these numbers, we expect a max ventilator use of 2.2k in 2 weeks:

The convolution drops to 0 after 9 weeks because the patient list has run out. In this example, we're interested in the peak value the convolution hits, not the long-term total.
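Here's a sketch of the ventilator estimate with the numbers above. The `convolve` helper is a plain nested-loop version written for this example, and the variable names are my own. Weeks are counted from 0, so index 2 is "2 weeks from now".

```python
def convolve(kernel, inputs):
    """Discrete convolution: inputs[start] contributes kernel[k] at start + k."""
    result = [0.0] * (len(kernel) + len(inputs) - 1)
    for k, rate in enumerate(kernel):
        for start, count in enumerate(inputs):
            result[start + k] += rate * count
    return result


ventilator_rate = [0.05, 0.03, 0.01]      # F: share of patients ventilated, by week
incoming = [10, 20, 30, 20, 10, 10, 10]   # G: new patients per week, in thousands

needed = convolve(ventilator_rate, incoming)
peak = max(needed)
peak_week = needed.index(peak)

print("ventilators by week (thousands):", [round(v, 2) for v in needed])
print(f"peak of about {peak:.1f}k at week {peak_week}")
```

The list has 9 entries (weeks 0 through 8), after which no patients remain in treatment.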

Other plans to convolve may be drug doses, vaccine appointments (one today, another a month from now), reinfections, and other complex interactions.

The hospital analogy is the mental model I wish I had when learning. Now that we've tried it with actual numbers, let's pour in the Math Juice and turn the analogy into calculus.

So, what happened in our example? We had a list of patients and a plan. If the plan were simple (single day `[3]`), regular multiplication would have worked. Because the plan was complex, we had to "convolve" it.

Time for some Fun Facts™:

- Convolution is written $f * g$, with an asterisk. Yes, an asterisk usually indicates multiplication, but in advanced calculus class, it indicates a convolution. Regular multiplication is just implied ($fg$).
- The result of a convolution is a new *function* that gives the total usage for any day ("What was the total usage on day $t=3$?"). We can graph the convolution over time to see the day-by-day totals.

Now the big aha: **Convolution reverses one of the lists!** Here's why.

Let's call our treatment plan $f(x)$. In our example, we used `[3 2 1]`.

The list of patients (inputs) is $g(x)$. However, we need to *reverse* this list as we slide it, so the earliest patient (Monday) enters the hospital first (first in, first out). This means we need to use $g(-x)$, the horizontal reflection of $g(x)$. `[1 2 3 4 5]` becomes `[5 4 3 2 1]`.

Now that we have the reversed list, pick a day to compute ($t = 0, 1, 2...$). To slide our patient list by this much, we use: $g(-x + t)$. That is, we reverse the list ($-x$) and jump to the correct day ($+t$).

We have our scenario:

- $f(x)$ is the plan to use
- $g(-x + t)$ is the list of inputs (flipped and slid to the right day).

To get the total usage on day $t$, we multiply each patient with the plan, and sum the results (an integral). To account for any possible length, we go from -infinity to +infinity.

Now we can describe convolution formally using calculus:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau$


Phew! That's quite a few symbols. Some notes:

- We use a dummy variable $\tau$ (tau) for the intermediate computation. Imagine $\tau$ as knocking on each room ($\tau = 0, 1, 2, 3...$), finding the dosage [$f(\tau)$], the number of patients [$g(t - \tau)$], multiplying them, and totaling things in the integral. Yowza. The so-called "dummy" variable $\tau$ is like `i` in a `for` loop: it's temporary, but does the work. (By analogy, $t$ is a global variable that has a fixed value during the loop.)
- In the official definition, you'll see $g(t - \tau)$ instead of $g(- \tau+ t)$. The second version shows the flip ($-\tau$) and slide ($+t$). Writing $g(t - \tau)$ makes it seem like we're interested in the difference between the variables, which confused me.
- The treatment plan (program to run) is called the *kernel*: you convolve a kernel with an input.

Not too bad, right? The equation is a formal description of the analogy.

We can't discover a new math operation without taking it for a spin. Let's see how it behaves.

In our computation, we flipped the patient list and kept the plan the same. Could we have flipped the plan instead?

You bet. Imagine the patients are immobile, and stay in their rooms: `[1 2 3 4 5]`. To deliver the medicine, we have 3 medical carts that go to each room and deliver the dose. Each day, they slide forward one position.

```
Carts ->
1 2 3
1 2 3 4 5
Patients
```

As before, though our plan is written `[3 2 1]` (3 units on the first day), we flip the order of the carts to `[1 2 3]`. That way, a patient gets 3 units on their first day, as we expect. Checking with Wolfram Alpha, the calculation is the same.

```
ListConvolve[{1, 2, 3, 4, 5}, {3, 2, 1}, {1, -1}, 0]
{3, 8, 14, 20, 26, 14, 5}
```

Cool! It looks like convolution is commutative:

$f * g = g * f$

and we can decide to flip either $f$ or $g$ when calculating the integral. Surprising, right?

When all treatments are finished, what was the *total* medicine usage? This is the **integral of the convolution**. (A few minutes ago, that phrase would have you jumping out of a window.)

But it's a simple calculation. Our plan gives each patient `sum([3 2 1]) = 6` units of medicine. And we have `sum([1 2 3 4 5]) = 15` patients. The total usage is just `6 x 15 = 90` units.

Wow, that was easy: the usage for the *entire* convolution is just the product of the subtotals!
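We can double-check the "product of the subtotals" shortcut against the long way. The `convolve` helper below is a simple nested-loop sketch written for this check.

```python
def convolve(plan, patients):
    """Discrete convolution: a patient admitted on day `start` gets plan[k] on day start + k."""
    usage = [0] * (len(plan) + len(patients) - 1)
    for k, dose in enumerate(plan):
        for start, count in enumerate(patients):
            usage[start + k] += dose * count
    return usage


plan = [3, 2, 1]
patients = [1, 2, 3, 4, 5]

total_usage = sum(convolve(plan, patients))   # add up every day's usage
shortcut = sum(plan) * sum(patients)          # units per patient x number of patients

print(total_usage, shortcut)  # 90 90
```

Swapping the arguments, `convolve(patients, plan)`, gives the same daily list, which is the commutative property in action.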

I hope this clicks intuitively. Note that this trick works for convolution, but not integrals in general. For example:

If we separate $x \cdot x$ into two integrals we get:

- $ \int (x \cdot x) = \int x^2 = \frac{1}{3} x^3 $
- $\int x \cdot \int x = \frac{1}{2}x^2 \cdot \frac{1}{2}x^2 = \frac{1}{4}x^4$

and those aren't the same. (Calculus would be much easier if we could split integrals like this.) It's strange, but $\int (f * g)$ is probably easier to solve than $\int (fg)$.

What happens if we send a single patient through the hospital? The convolution would just be that day's plan.

```
Plan * Patients = Convolution
[3 2 1] * [1] = [3 2 1]
```

In other words, convolving with `[1]` gives us the original plan.

In calculus terms, a spike of `[1]` (and 0 otherwise) is the Dirac Delta Function. In terms of convolutions, this function acts like the number 1 and returns the original function:

$f(t) * \delta(t) = f(t)$

We can delay the delta function by $T$, which delays the resulting convolution function too. Imagine our single patient shows up a week late ($\delta(t - T)$), so our medicine usage gets delayed for a week too:

$f(t) * \delta(t - T) = f(t - T)$
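In list form, the spike and its delayed version look like this. It's a discrete stand-in for the delta function (a single `1` surrounded by zeros), and I've used a 2-day delay rather than a full week to keep the lists short.

```python
def convolve(plan, patients):
    """Discrete convolution: a patient admitted on day `start` gets plan[k] on day start + k."""
    usage = [0] * (len(plan) + len(patients) - 1)
    for k, dose in enumerate(plan):
        for start, count in enumerate(patients):
            usage[start + k] += dose * count
    return usage


plan = [3, 2, 1]

on_time = convolve(plan, [1])        # single patient on day 0
delayed = convolve(plan, [0, 0, 1])  # same patient, arriving 2 days late

print(on_time)  # [3, 2, 1]
print(delayed)  # [0, 0, 3, 2, 1]
```

The plan comes back unchanged in the first case and shifted by the delay in the second.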

The Fourier Transform (written with a fancy $\mathscr{F}$) converts a function $f(t)$ into a list of cyclical ingredients $F(s)$:

As an operator, this can be written $\mathscr{F}\lbrace f \rbrace = F$.

In our analogy, we convolved the plan and patient list with a fancy multiplication. Since the Fourier Transform gives us lists of ingredients, could we get the same result by mixing the *ingredient lists*?

Yep, we can: **Fancy multiplication in the regular world is regular multiplication in the fancy world.**

In math terms, "Convolution in the time domain is multiplication in the frequency (Fourier) domain."

Mathematically, this is written:

or

where $f(x)$ and $g(x)$ are functions to convolve, with transforms $F(s)$ and $G(s)$.
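To watch the theorem work on our hospital numbers, here's a sketch using a hand-rolled discrete Fourier transform (so nothing beyond the standard library is needed). Two assumptions worth flagging: this is the discrete version of the transform, and both lists are zero-padded to the full output length (7) so the wrap-around of the discrete transform doesn't mix the first and last days.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a list."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for s in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * s * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


plan = [3, 2, 1]
patients = [1, 2, 3, 4, 5]
size = len(plan) + len(patients) - 1  # 7 days of usage

# Zero-pad both lists to the output length.
padded_plan = plan + [0] * (size - len(plan))
padded_patients = patients + [0] * (size - len(patients))

F = dft(padded_plan)
G = dft(padded_patients)

# Regular multiplication in the fancy world, one frequency at a time...
product = [a * b for a, b in zip(F, G)]

# ...then translate back to the regular world.
usage = [round(c.real) for c in dft(product, inverse=True)]
print(usage)  # [3, 8, 14, 20, 26, 14, 5]
```

The result matches the room-by-room calculation, even though no list was flipped or slid along the way.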

We can prove this theorem with advanced calculus, using theorems I don't quite understand, but let's think through the meaning.

Because $F(s)$ is the Fourier Transform of $f(t)$, we can ask for a specific frequency ($s = 2\text{Hz}$) and get the *combined interaction* of every data point with that frequency. Let's suppose:

That means after every data point has been multiplied against the 2Hz cycle, the result is $3 + i$. But we could have kept each interaction separate:

Where $c_t$ is the contribution to the 2Hz frequency from datapoint $t$. Similarly, we can expand $G(s)$ into a list of interactions with the 2Hz ingredient. Let's suppose $G(2) = 7 - i$:

The Convolution Theorem is really saying:

Our convolution in the regular domain involves a lot of cross-multiplications. In the fancy frequency domain, we *still* have a bunch of interactions, but $F(s)$ and $G(s)$ have consolidated them. We can just multiply $F(2)G(2) = (3 + i)(7-i)$ to find the 2Hz ingredient in the convolved result.

By analogy, suppose you want to calculate:

$(1 + 2 + 3 + 4)(5 + 6 + 7 + 8)$

It's a lot of work to cross-multiply every term: $(1 \cdot 5) + (1\cdot 6) + (1\cdot 7) + ...$

It's better to consolidate the groups into $(1 + 2 + 3 + 4) = 10$ and $(5 + 6 + 7 + 8) = 26$, and *then* multiply to get $10 \cdot 26 = 260$.
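A quick check of the arithmetic (a throwaway sketch, the variable names are mine):

```python
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

# The slow way: all 16 cross-multiplications, then add.
cross = sum(x * y for x in a for y in b)

# The consolidated way: add each group first, then multiply once.
consolidated = sum(a) * sum(b)

print(cross, consolidated)  # 260 260
```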

This nuance caused me a lot of confusion. It seems like $FG$ is a single multiplication, while $f * g$ involves a bunch of intermediate terms. I forgot that $F$ already did the work of merging a bunch of entries into a single one.

Now, we aren't *quite* done.

We can convert $f * g$ in the time domain into $FG$ in the frequency domain, but we probably need it back in the time domain for a usable result:

$f * g = \mathscr{F}^{-1}\lbrace F \cdot G \rbrace$

You have a riddle in English ($f * g$), you translate it to French ($FG$), get your smart French friend to work out that calculation, then convert it back to English ($\mathscr{F}^{-1}$).

The convolution theorem works this way too:

**Regular multiplication in the regular world is fancy multiplication in the fancy world.**

Cool, eh? Instead of multiplying two functions like some cave dweller, put on your monocle, convolve the Fourier Transforms, and convert to the time domain:

$f \cdot g = \mathscr{F}^{-1}\lbrace F * G \rbrace$

I'm not saying this is fun, just that it's possible. If your French friend has a gnarly calculation they're struggling with, it might look like arithmetic to you.

Remember how we said the integral of a convolution was a multiplication of the individual integrals?

Well, the Fourier Transform is just a very specific integral, right?

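So (handwaving), it seems we could swap the general-purpose integral $\int$ for $\mathscr{F}$ and get

$\mathscr{F}\lbrace f * g \rbrace = \mathscr{F}\lbrace f \rbrace \cdot \mathscr{F}\lbrace g \rbrace$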

which is the convolution theorem. I need a deeper intuition for the proof, but this helps things click.

The trick with convolution is finding a useful "program" (kernel) to apply to your input. Here are a few examples.

Let's say you want a moving average between neighboring items in a list. That is half of each element, added together:

This is a "multiplication program" of `[0.5 0.5]` convolved with our list:

```
ListConvolve[{1, 4, 9, 16, 25}, {0.5, 0.5}, {1, -1}, 0]
{0.5, 2.5, 6.5, 12.5, 20.5, 12.5}
```

We can perform a moving average with a single operation. Neat!

A 3-element moving average would be `[.33 .33 .33]`, a weighted average could be `[.5 .25 .25]`.
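If you don't have Mathematica handy, the same moving average can be sketched in plain Python. The `convolve` helper is my own stand-in for `ListConvolve` with these settings (a full convolution, padded with zeros at the edges).

```python
def convolve(f, g):
    """Full discrete convolution: every entry of f meets every entry of g."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

values = [1, 4, 9, 16, 25]
full = convolve(values, [0.5, 0.5])
print(full)        # [0.5, 2.5, 6.5, 12.5, 20.5, 12.5]

# The first and last entries only "see" one real value (the other is padding),
# so the true neighbor averages are the ones in between.
print(full[1:-1])  # [2.5, 6.5, 12.5, 20.5]
```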

The derivative finds the difference between neighboring values. Here's the plan: `[1 -1]`

```
ListConvolve[{1, 2, 3, 4, 5}, {1, -1}, {1, -1}, 0]
{1, 1, 1, 1, 1, -5} (* -5 since we ran out of entries *)
ListConvolve[{1, 4, 9, 16, 25}, {1, -1}, {1, -1}, 0]
{1, 3, 5, 7, 9, -25} (* discrete derivative is 2x + 1 *)
```

With a simple kernel, we can find a useful math property on a discrete list. And to get a second derivative, just apply the derivative convolution twice:

`F * [1 -1] * [1 -1]`

As a shortcut, we can precompute the final convolutions (`[1 -1] * [1 -1]`) and get:

```
ListConvolve[{1, -1}, {1,-1}, {1, -1}, 0]
{1, -2, 1}
```

Now we have a *single* kernel `[1, -2, 1]` that gets the second derivative of a list:

```
ListConvolve[{1, 4, 9, 16, 25}, {1, -2, 1}, {1, -1}, 0]
{1, 2, 2, 2, 2, -34, 25}
```

Excluding the boundary items, we get the expected second derivative:

$\frac{d^2}{dx^2} x^2 = 2$
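The shortcut works because convolution doesn't care how we group the steps. Here's a small Python sketch (again with a hand-rolled `convolve`, my own helper) comparing two passes of `[1 -1]` against one pass of the precomputed kernel:

```python
def convolve(f, g):
    """Full discrete convolution: every entry of f meets every entry of g."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

squares = [1, 4, 9, 16, 25]
diff = [1, -1]

second = convolve(diff, diff)                        # precomputed kernel
two_pass = convolve(convolve(squares, diff), diff)   # derivative, then derivative
one_pass = convolve(squares, second)                 # single combined kernel

print(second)          # [1, -2, 1]
print(one_pass)        # [1, 2, 2, 2, 2, -34, 25]
print(one_pass[2:-2])  # [2, 2, 2]  (entries where the kernel fully overlaps the list)
```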

An image blur is essentially a convolution of your image with some "blurring kernel":

The blur of our 2D image requires a 2D average:

Can we undo the blur? Yep! With our friend the Convolution Theorem, we can do:

Whoa! We can recover the original image by dividing out the blur. Convolution is a simple multiplication in the frequency domain, and *deconvolution* is a simple division in the frequency domain.
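To see the division trick in action, here's a 1D sketch in Python: a list stands in for the image, and a naive hand-written DFT stands in for a real FFT library. The kernel `[0.6 0.4]` is my assumption, picked because none of its frequency ingredients are zero. (An evenly weighted `[0.5 0.5]` can have a zero ingredient, and dividing by zero loses that information for good.)

```python
import cmath

def convolve(f, g):
    """Full discrete convolution: every entry of f meets every entry of g."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def dft(x, inverse=False):
    """Naive O(n^2) DFT; the inverse flips the sign and divides by n."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

signal = [1, 4, 9, 16, 25]
kernel = [0.6, 0.4]                      # hypothetical blur kernel
blurred = convolve(signal, kernel)

n = len(blurred)
B = dft(blurred)
K = dft(kernel + [0] * (n - len(kernel)))

# Deconvolution: divide out the blur, ingredient by ingredient, then go back.
restored = dft([b / k for b, k in zip(B, K)], inverse=True)
recovered = [round(v.real, 6) for v in restored[:len(signal)]]
print(recovered)  # [1.0, 4.0, 9.0, 16.0, 25.0]
```

Real photos are messier: noise gets amplified wherever the kernel's ingredients are tiny, which is why practical deblurring needs more care than a bare division.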

A short while back, the concept of "deblurring by dividing Fourier Transforms" was gibberish to me. While it can be daunting mathematically, it's getting simpler conceptually.

More reading:

- Wikipedia on deconvolution, deblurring.
- Image Restoration Lecture, another one, and example for above
- Blind deconvolution is when you don't know the blur kernel (make a guess)

What is a number? A list of digits:

```
1234 = 1000 + 200 + 30 + 4 = [1000 200 30 4]
5678 = 5000 + 600 + 70 + 8 = [5000 600 70 8]
```

And what is regular, grade-school multiplication? A digit-by-digit convolution! We sweep one list of digits by the other, multiplying and adding as we go:

We can perform the calculation by convolving the lists of digits (wolfram alpha):

```
ListConvolve[{1000, 200, 30, 4}, {8, 70, 600, 5000}, {1, -1}, 0]
{8000, 71600, 614240, 5122132, 1018280, 152400, 20000}
sum {8000, 71600, 614240, 5122132, 1018280, 152400, 20000}
7006652
```

Note that we pre-flip one of the lists (it gets swapped in the convolution later), and the intermediate calculations are a bit different. But, combining the subtotals gives the expected result.
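Here's a variation in Python that may be easier to follow: convolve the bare digits (no place values attached) and let the position in the result carry the power of ten. The helper and variable names are mine, and this is a sketch of the idea rather than a translation of the Wolfram call above.

```python
def convolve(f, g):
    """Full discrete convolution: every entry of f meets every entry of g."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

digits_a = [1, 2, 3, 4]
digits_b = [5, 6, 7, 8]

# Each entry is a column subtotal, most significant column first.
columns = convolve(digits_a, digits_b)
print(columns)  # [5, 16, 34, 60, 61, 52, 32]

# Combine the columns: shift left one place, add the next column (carries included).
total = 0
for c in columns:
    total = total * 10 + c
print(total)    # 7006652
```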

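Why convolve instead of doing a regular digit-by-digit multiplication? Well, the convolution theorem lets us substitute convolution with Fourier Transforms:

$f * g = \mathscr{F}^{-1}\lbrace \mathscr{F}\lbrace f \rbrace \cdot \mathscr{F}\lbrace g \rbrace \rbrace$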

The convolution ($f * g$) has complexity $O(n^2)$. We have $n$ positions to process, with $n$ intermediate multiplications at each position.

The right side involves:

- Two Fourier Transforms, which are normally $O(n^2)$. However, the Fast Fourier Transform (a divide-and-conquer approach) makes them $O(n\log(n))$.
- Pointwise multiplication of the two transforms (entry by entry, $c_n = a_n \cdot b_n$), which is $O(n)$
- An inverse transform, which is $O(n\log(n))$

And the total complexity is: $O(n\log(n)) + O(n\log(n)) + O(n) + O(n\log(n)) = O(n\log(n))$
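To make the three steps concrete, here's a sketch of FFT-based multiplication in pure Python. The recursive radix-2 `fft` below is a textbook-style illustration I wrote for this example (real code would use a tuned library), and it assumes non-negative integers.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized). len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def multiply(a, b):
    """Multiply two non-negative integers by convolving their digits via the FFT."""
    digits_a = [int(d) for d in str(a)]
    digits_b = [int(d) for d in str(b)]
    width = len(digits_a) + len(digits_b) - 1
    size = 1
    while size < width:          # pad up to a power of two
        size *= 2
    fa = fft(digits_a + [0] * (size - len(digits_a)))   # transform 1
    fb = fft(digits_b + [0] * (size - len(digits_b)))   # transform 2
    pointwise = [x * y for x, y in zip(fa, fb)]         # O(n) multiplication
    back = fft(pointwise, inverse=True)                 # inverse transform
    columns = [round(v.real / size) for v in back[:width]]
    total = 0
    for c in columns:            # combine column subtotals, carries included
        total = total * 10 + c
    return total

print(multiply(1234, 5678))  # 7006652
```

For 4-digit numbers this is slower than grade-school multiplication. The payoff only shows up when the digit lists get long.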

Regular multiplication in the fancy domain is *faster* than a fancy multiplication in the regular domain. Our French friend is no slouch.

Machine learning is about discovering the math functions that transform input data into a desired result (a prediction, classification, etc.).

Starting with an input signal, we could convolve it with a bunch of kernels:

Given that convolution can do complex math (moving averages, blurs, derivatives...), it seems *some* combination of kernels should turn our input into something useful, right?

Convolutional Neural Nets (CNNs) process an input with layers of kernels, optimizing their weights (plans) to reach a goal. Imagine tweaking the treatment plan to keep medicine usage below some threshold.

CNNs are often used with image classifiers, but 1D data sets work just fine.

- Nice writeup: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
- Digit classifier demo: https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

A linear, time-invariant system means:

- Linear: Scaling and combining inputs scales and combines outputs by the same amount
- Time invariant: Outputs depend on relative time, not absolute time. You get 3 units on *your* first day, and it doesn't matter if it's Wednesday or Thursday.

A fancy phrase is "An LTI system is characterized by its impulse response". Translation: If we send a *single* patient through the hospital `[1]`, we'll discover the treatment plan. Then we can predict the usage for *any* sequence of patients by convolving it with the plan.

If the system isn't LTI, we can't extrapolate based on a *single* person's experience. Scaling the inputs may not scale the outputs, and the actual calendar day (not relative day) may impact the result.
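Here's a small Python sketch of what the two properties buy us, using the `[3 2 1]` plan. The patient list `[1 2 3]` and the `convolve` helper are my own illustration.

```python
def convolve(f, g):
    """Full discrete convolution: every entry of f meets every entry of g."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

plan = [3, 2, 1]
patients = [1, 2, 3]

usage = convolve(patients, plan)
doubled = convolve([2 * p for p in patients], plan)  # linear: twice the patients
delayed = convolve([0] + patients, plan)             # time invariant: start a day late

print(usage)    # [3, 8, 14, 8, 3]
print(doubled)  # [6, 16, 28, 16, 6]
print(delayed)  # [0, 3, 8, 14, 8, 3]
```

Doubling the input doubles the usage, and delaying the input delays the usage without changing its shape.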

From David Greenspan: "Suppose you have a special laser pointer that makes a star shape on the wall. You tape together a bunch of these laser pointers in the shape of a square. The pattern on the wall now is the convolution of a star with a square."

Regular multiplication gives you a single scaled copy of an input. Convolution creates multiple overlapping copies that follow a pattern you've specified.

Real-world systems have squishy, not instantaneous, behavior: they ramp up, peak, and drop down. The convolution lets us model systems that echo, reverb and overlap.

Now it's time for the famous sliding window example. Think of a pulse of inputs (red) sliding through a system (blue), and having a combined effect (yellow): the convolution.


Convolution has an advanced technical definition, but the basics can be understood with the right analogy.

Quick rant: I study math for fun, yet it took years to find a satisfying intuition for:

- Why is one function reversed?
- Why is convolution commutative?
- Why does the integral of the convolution = product of integrals?
- Why are the Fourier Transforms multiplied point-by-point, and not overlapped?

Why'd it take so long? Imagine learning multiplication with $f \times g = z$ instead of $3 \times 5 = 15$. Without an example I can explore *in my head*, I could only memorize results, not intuit them. Hopefully this analogy can save you years of struggle.

Happy math.

]]>

- Pretend to be asleep (except not in the library again)
- Canned response: "As with any function, the integral of sine is the area under its curve."
- Geometric intuition: "The integral of sine is the **horizontal distance** along a circular path."

Option 1 is tempting, but let's take a look at the others.

Describing an integral as "area under the curve" is like describing a book as a list of words. It's technically correct, but it misses the message, and I suspect you haven't done the assigned reading.

Unless you're trapped in LegoLand, integrals mean something besides rectangles.

My calculus conundrum was not having an intuition for all the mechanics.

When we see:

$\int \sin(x) dx$

We can call on a few insights:

- The integral is just fancy multiplication. Multiplication accumulates numbers that don't change (3 + 3 + 3 + 3). Integrals add up numbers that *might* change, based on a pattern (1 + 2 + 3 + 4). But if we squint our eyes and pretend items are identical, we have a multiplication.
- $\sin(x)$ is just a percentage. Yes, it's also a fancy curve with nice properties. But at any point (like 45 degrees), it's a single *percentage* from -100% to +100%. Just regular numbers.
- $dx$ is a tiny, infinitesimal part of the path we're taking. 0 to $x$ is the full path, so $dx$ is (intuitively) a nanometer.

Ok. With those 3 intuitions, our rough (rough!) conversion to Plain English is:

The integral of sin(x) multiplies our intended path length (from 0 to x) by a percentage

We *intend* to travel a simple path from 0 to x, but we end up with a smaller percentage instead. (Why? Because $\sin(x)$ is usually less than 100%). So we'd expect something like 0.75x.

In fact, if $\sin(x)$ did have a fixed value of 0.75, our integral would be:

$\int \text{fixedsin}(x) \ dx = \int 0.75 \ dx = 0.75 \int dx = 0.75x$

But the real $\sin(x)$, that rascal, changes as we go. Let's see what fraction of our path we really get.

Now let's visualize $\sin(x)$ and its changes:

Here's the decoder key:

$x$ is our current angle in radians. On the unit circle (radius=1), the angle is the distance along the circumference.

$dx$ is a tiny change in our angle, which becomes the same change along the circumference (moving 0.01 units in our angle moves 0.01 along the circumference).

At our tiny scale, a circle is a polygon with many sides, so we're moving along a *line segment* of length $dx$. This puts us at a new position.

With me? With trigonometry, we can find the exact change in height/width as we slide along the circle by $dx$.

By similar triangles, our change is just our original triangle, rotated and scaled.

- Original triangle (hypotenuse = 1): height = $\sin(x)$, width = $\cos(x)$
- Change triangle (hypotenuse = dx): height = $\sin(x) dx$, width = $\cos(x) dx$

Now, remember that sine and cosine are functions that return percentages. (A number like 0.75 doesn't have its own orientation. It shows up and makes things 75% of their size in whatever direction they're facing.)

So, given how we've drawn our Triangle of Change, $\sin(x) dx$ is our horizontal change. Our plain-English intuition is:

The integral of sin(x) adds up the horizontal change along our path

Ok. Let's graph this bad boy to see what's happening. With our "$\sin(x) dx$ = tiny horizontal change" insight we have:

As we circle around, we have a bunch of $dx$ line segments (in red). When sine is small (around x=0) we barely get any horizontal motion. As sine gets larger (top of circle), we are moving up to 100% horizontally.

Ultimately, the various $\sin(x) dx$ segments move us horizontally from one side of the circle to the other.

A more technical description:

$\int_0^x \sin(x) dx = \text{horizontal distance traveled on arc from 0 to x}$

Aha! That's the meaning. Let's eyeball it. When moving from $x=0$ to $x=\pi$ we move exactly 2 units horizontally. It makes complete sense in the diagram.

Using the Official Calculus Fact that $\int \sin(x) dx = -\cos(x)$ we would calculate:

$ \int_0^\pi \sin(x) dx = -\cos(x) \Big|_0^\pi = -\cos(\pi) - -\cos(0) = -(-1) -(-1) = 1 + 1 = 2$

Yowza. See how awkward it is, those double negations? Why was the visual intuition so much simpler?

Our path along the circle ($x=0$ to $x=\pi$) moves from right-to-left. But the x-axis goes positive from left-to-right. When we convert distance along our path into Standard Area™, we have to flip our axes:

Our excitement to put things in the official format stamped out the intuition of what was happening.

We don't really talk about the Fundamental Theorem of Calculus anymore. (Is it something I did?)

Instead of adding up all the tiny segments, just do: end point - start point.

The intuition was staring us in the face: $\cos(x)$ is the anti-derivative, and tracks the horizontal position, so we're just taking a difference between horizontal positions! (With awkward negatives to swap the axes.)

*That's* the power of the Fundamental Theorem of Calculus. Skip the intermediate steps and just subtract endpoints.
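Both routes are easy to compare numerically. This Python sketch adds up 100,000 tiny $\sin(x) dx$ segments (the step count and the midpoint sampling are my choices) and compares the total to the difference in horizontal position:

```python
import math

steps = 100_000
dx = math.pi / steps

# The long way: add up every tiny horizontal change, sin(x) * dx,
# sampling sine in the middle of each segment.
horizontal = sum(math.sin((i + 0.5) * dx) * dx for i in range(steps))

# The shortcut: horizontal position is cos(x), so just subtract the endpoints.
# We travel from cos(0) = 1 (right side) to cos(pi) = -1 (left side).
endpoints = math.cos(0) - math.cos(math.pi)

print(round(horizontal, 6), endpoints)  # 2.0 2.0
```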

Why did I write this? Because I couldn't instantly figure out:

$ \int_0^\pi \sin(x) dx = 2$

This isn't an exotic function with strange parameters. It's like asking someone to figure out $2^3$ without a calculator. If you claim to understand exponents, it should be possible, right?

Now, we can't always visualize things. But for the *most common* functions we owe ourselves a visual intuition. I certainly can't eyeball the 2 units of area from 0 to $\pi$ under a sine curve.

Happy math.

As a fun fact, the "average" efficiency of motion around the top of a circle (0 to $\pi$) is: $ \frac{2}{\pi} = .6366 $

So on average, 63.66% of your path's length is converted to horizontal motion.
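As a sanity check, we can average the "percentage" directly. This sketch samples $\sin(x)$ evenly from 0 to $\pi$ (100,000 samples is an arbitrary choice) and compares the mean to $\frac{2}{\pi}$:

```python
import math

steps = 100_000
dx = math.pi / steps

# Average value of sin(x) across the top half of the circle.
average = sum(math.sin((i + 0.5) * dx) for i in range(steps)) / steps

print(round(average, 4), round(2 / math.pi, 4))  # 0.6366 0.6366
```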

It seems weird that height controls the width, and vice-versa, right?

If height controlled height, we'd have runaway exponential growth. But a circle needs to regulate itself.

$e^x$ is the kid who eats candy, grows bigger, and can therefore eat more candy.

$\sin(x) $ is the kid who eats candy, gets sick, waits for an appetite, and eats more candy.

The "area" in our integral isn't literal area, it's a percentage of our length. We visualized the multiplication as a 2d rectangle in our generic integral, but it can be confusing. If you earn money and are taxed, do you visualize 2d area (income * (1 - tax))? Or just a quantity helplessly shrinking?

Area primarily indicates a multiplication happened. Don't let team Integrals Are Literal Area win every battle!

]]>

Let's try a broader interpretation: **The Pythagorean Theorem explains how 2D area can be combined.**

Here's what I mean. Suppose we have two lines lying around (the creatively named *Line A* and *Line B*). We can spin them to create area:

Ok, fun enough. Where's the mystery?

Well, what happens if we *combine* the line segments before spinning them?

Whoa. The area swept out seems to change. Should simply *moving* the lines, not lengthening them, change the area?

Eyeballing the diagram above, it sure seems like the area grew. Let's work out the specifics.

As an example, suppose $a = 6$ and $b = 8$. When they're swept into circles ($\text{area} = \pi r^2$) we get:

$\text{Circle A} = \pi \cdot 6^2 = 36\pi$

$\text{Circle B} = \pi \cdot 8^2 = 64\pi$

For a total of $36\pi + 64\pi = 100\pi$.

The combined segment has length $c = a + b = 14$, and when we spin it we get:

$\pi \cdot 14^2 = 196\pi$

Uh oh. That's way more area than before.

What happened? Well, Circle A didn't change. But Circle B is much less than Ring B (just look at it!).

The issue: When Line B spins on its own, it can only reach 8 units out as it sweeps. When we attach Line B to Line A, it reaches out 6 + 8 = 14 units. Now the circular sweep covers more area, meaning Circle B is smaller than Ring B.

Mathematically, here's what happened.

Ignore $\pi$ for a moment since it's a common term. When expanding $c^2 = (a + b)^2 = a^2 + 2ab + b^2$, there's a new $2ab$ term that has to go somewhere. Because Circle A doesn't change, this extra area must appear in Ring B.
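Plugging in our numbers (a quick sketch, with everything measured in units of $\pi$ so the arithmetic stays in whole numbers):

```python
a, b = 6, 8

separate = a**2 + b**2        # Circle A + Circle B
combined = (a + b)**2         # one big circle from the end-to-end segment
extra = combined - separate   # the new area that showed up

ring_b = combined - a**2      # Circle A is unchanged, so Ring B gets the rest

print(separate, combined, extra)  # 100 196 96
print(extra == 2 * a * b)         # True
print(ring_b, b**2)               # 160 64
```

Ring B sweeps $160\pi$ where Circle B only swept $64\pi$, and the difference is exactly the $2ab = 96$ term.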

It... sort of makes sense that the area changes, but I don't like it. Just moving things around shouldn't have this effect! Can the area ever be the same?

Sure, if we remove the $2ab$ term. The easy fix is to set $a=0$, but that's cheating and you know it.

Let's find a clever solution. Intuitively, the question is: **How can Line A's length not help Line B as it spins?**

Tilt it! As we rotate Line B, there's less benefit from Line A's length. Ladders are useless when lying on the floor, right?

When we go Full Perpendicular™, the $2ab$ term disappears and Circle B = Ring B. (In vector terms, the dot product is zero: $a \cdot b = 0$).

Ah -- that's the meaning of the Pythagorean Theorem. **When line segments are perpendicular, the same area is swept whether the lines are combined or separated.**

It's not a bad idea to make sure the numbers line up.

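Since the segments are now perpendicular, we know $c^2 = a^2 + b^2$, so:

$\text{Area swept by } c = \pi c^2 = \pi a^2 + \pi b^2$

Now we can calculate:

$\text{Ring B} = \pi c^2 - \pi a^2 = \pi b^2 = \text{Circle B}$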

Tada! The Ring and Circle sweep the same area.

In our example, we have Circle A = $36\pi$, Circle B = $64 \pi$, $c = \sqrt{36 + 64} = 10$. The ring width is $10 - 6 = 4$.

The Pythagorean Theorem is about more than triangles. When components are perpendicular, the area they make is independent of how they are arranged.

- The Law of Cosines explicitly shows the $2ab$ term, which is assumed to be zero in the Pythagorean Theorem. The area of Ring B can even be "negative" if we tilt Line B to point inside.
- We can combine area from multiple dimensions ($x^2 + y^2 + z^2 + ...$). As long as they are mutually perpendicular, the areas swept by the individual dimensions add up to the area swept by the total.
- The Pythagorean Theorem is a relationship in the 2D area domain ($c^2 = a^2 + b^2$). We start here and convert this to a relationship in the 1D domain ($c = \sqrt{a^2 + b^2}$). The conversion happens so often we forget where it began.
- More on sweeping area: https://www.cut-the-knot.org/Curriculum/Geometry/PythFromRing.shtml
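To watch the $2ab$ term fade as Line B tilts, here's a Python sketch of Ring B's area at a few angles. The function name is mine, and I'm measuring the tilt from "straight out" (0 degrees means end-to-end), so the Law of Cosines shows up as $c^2 = a^2 + b^2 + 2ab\cos(\theta)$. Areas are in units of $\pi$.

```python
import math

def ring_b_area(a, b, tilt_degrees):
    """Area of Ring B (in units of pi) when Line B is tilted away from Line A."""
    theta = math.radians(tilt_degrees)
    c_squared = a**2 + b**2 + 2 * a * b * math.cos(theta)
    return c_squared - a**2   # total sweep minus the unchanged Circle A

for tilt in (0, 90, 180):
    print(tilt, round(ring_b_area(6, 8, tilt), 6))
# 0 160.0    end-to-end: Line A gives Line B its full boost
# 90 64.0    perpendicular: Ring B matches Circle B
# 180 -32.0  folded back: Line B points inside and the ring goes "negative"
```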

Happy math.

]]>