Understanding Bayes Theorem With Ratios

My first intuition about Bayes Theorem was “take evidence and account for false positives”. Does a lab result mean you’re sick? Well, how rare is the disease, and how often do healthy people test positive? Misleading signals must be considered.

This helped me muddle through practice problems, but I couldn’t think with Bayes. The big obstacles:

Percentages are hard to reason with. Odds compare the relative frequency of scenarios (A:B) while percentages use a part-to-whole “global scenario” [A/(A+B)]. A coin has equal odds (1:1) or a 50% chance of heads. Great. What happens when heads are 18x more likely? Well, the odds are 18:1; can you rattle off the decimal percentage? (I’ll wait…) Odds require less computation, so let’s start with them.

Equations miss the big picture. Here’s Bayes Theorem, as typically presented:

\displaystyle{\Pr(\mathrm{A}|\mathrm{X}) = \frac{\Pr(\mathrm{X}|\mathrm{A})\Pr(\mathrm{A})}{\Pr(\mathrm{X}|\mathrm{A})\Pr(\mathrm{A})+ \Pr(\mathrm{X}|\sim \mathrm{A})\Pr(\sim \mathrm{A})}}

It reads right-to-left, with a mess of conditional probabilities. How about this version:

original odds * evidence adjustment = new odds

Bayes is about starting with a guess (1:3 odds for rain:sunshine), taking evidence (it’s July in the Sahara, sunshine 1000x more likely), and updating your guess (1:3000 chance of rain:sunshine). The “evidence adjustment” is how much better, or worse, we feel about our odds now that we have extra information (if it were December in Seattle, you might say rain was 1000x as likely).

Let’s start with ratios and sneak up to the complex version.

Caveman Statistician Og

Og just finished his CaveD program, and runs statistical research for his tribe:

  • He saw 50 deer and 5 bears overall (50:5 odds)
  • At night, he saw 10 deer and 4 bears (10:4 odds)

What can he deduce? Well,

original odds * evidence adjustment = new odds

or

evidence adjustment = new odds / original odds

At night, he realizes deer are 1/4 as likely as they were previously:

10:4 / 50:5 = 2.5 / 10 = 1/4

(Put another way, bears are 4x as likely at night)

Let’s cover ratios a bit. A:B describes how much A we get for every B (imagine miles per gallon as the ratio miles:gallon). Compare values with division: going from 25:1 to 50:1 means you doubled your efficiency (50/25 = 2). Similarly, we just discovered how our “deer per bear” amount changed.

Og happily continues his research:

  • By the river, bears are 20x more likely (he saw 2 deer and 4 bears, so 2:4 / 50:5 = 1:20)
  • In winter, deer are 3x as likely (30 deer and 1 bear, 30:1 / 50:5 = 3:1)

He takes a scenario, compares it to the baseline, and computes the evidence adjustment.

Caveman Clarence subscribes to Og’s journal, and wants to apply the findings to his forest (where deer:bears are 25:1). Suppose Clarence hears an animal approaching:

  • His general estimate is 25:1 odds of deer:bear
  • It’s at night, with bears 4x as likely => 25:4
  • It’s by the river, with bears 20x as likely => 25:80
  • It’s in the winter, with deer 3x more likely => 75:80

Clarence guesses “bear” with near-even odds (75:80) and tiptoes out of there.

That’s Bayes. In fancy language:

  • Start with a prior probability, the general odds before evidence
  • Collect evidence, and determine how much it changes the odds
  • Compute the posterior probability, the odds after updating
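Clarence’s chain of updates is just repeated multiplication of ratios. Here’s a minimal Python sketch (the `update` helper is a made-up name for illustration):

```python
def update(odds, adjustment):
    """Multiply odds (A:B) term-by-term by an evidence adjustment (A:B)."""
    (a, b), (x, y) = odds, adjustment
    return (a * x, b * y)

# Clarence's walk: start at 25:1 deer:bear, apply each observation.
odds = (25, 1)
odds = update(odds, (1, 4))   # night: bears 4x as likely  -> 25:4
odds = update(odds, (1, 20))  # river: bears 20x as likely -> 25:80
odds = update(odds, (3, 1))   # winter: deer 3x as likely  -> 75:80

print(odds)  # (75, 80): near-even odds, slightly favoring bear
```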

Bayesian Spam Filter

Let’s build a spam filter based on Og’s Bayesian Bear Detector.

First, grab a collection of regular and spam email. Record how often a word appears in each:

             spam      normal
hello          3         3
darling        1         5
buy            3         2
viagra         3         0
...

(“hello” appears equally, but “buy” skews toward spam)

We compute odds just like before. Let’s assume incoming email has 9:1 chance of spam, and we see “hello darling”:

  • A generic message has 9:1 odds of spam:regular
  • Adjust for “hello” => keep the 9:1 odds (“hello” is equally-likely in both sets)
  • Adjust for “darling” => 9:5 odds (“darling” appears 5x as often in normal emails)
  • Final chances => 9:5 odds of spam

We’re leaning towards spam (9:5 odds). However, it’s less spammy than our starting odds (9:1), so we let it through.

Now consider a message like “buy viagra”:

  • Prior belief: 9:1 chance of spam
  • Adjust for “buy”: 27:2 (3:2 adjustment towards spam)
  • Adjust for (“viagra”): …uh oh!

“Viagra” never appeared in a normal message. Is it a guarantee of spam?

Probably not: we should intelligently adjust for new evidence. Let’s assume there’s a regular email, somewhere, with that word, and make the “viagra” odds 3:1. Our chances become 27:2 * 3:1 = 81:2.

Now we’re getting somewhere! Our initial 9:1 guess shifts to 81:2. Is it spam?

Well, how horrible is a false positive?

81:2 odds imply for every 81 spam messages like this, we’ll incorrectly block 2 normal emails. That ratio might be too painful. With more evidence (more words or other characteristics), we might wait for 1000:1 odds before calling a message spam.
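The whole filter fits in a few lines. Here’s a Python sketch using the word counts from the table above; `max(n, 1)` stands in for the “assume there’s a regular email somewhere with that word” fix, and `spam_odds` is a hypothetical helper name:

```python
# Word counts from the table above: word -> (spam count, normal count)
counts = {
    "hello":   (3, 3),
    "darling": (1, 5),
    "buy":     (3, 2),
    "viagra":  (3, 0),
}

def spam_odds(words, prior=(9, 1)):
    """Start from prior spam:normal odds and adjust for each word."""
    spam, normal = prior
    for word in words:
        s, n = counts[word]
        # max(n, 1) is the "assume a regular email exists somewhere" fix,
        # so a never-seen word can't force a guaranteed-spam verdict.
        spam, normal = spam * s, normal * max(n, 1)
    return spam, normal

print(spam_odds(["hello", "darling"]))  # (27, 15), i.e. 9:5
print(spam_odds(["buy", "viagra"]))     # (81, 2)
```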

Exploring Bayes Theorem

We can check our intuition by seeing if we naturally ask leading questions:

  • Is evidence truly independent? Are there links between animal behavior at night and in the winter, or words that appear together? Sure. We “naively” assume evidence is independent (and yet, in our bumbling, create effective filters anyway).

  • How much evidence is enough? Is seeing 2 deer & 1 bear the same 2:1 evidence adjustment as 200 deer and 100 bears?

  • How accurate were the starting odds in the first place? Prior beliefs change everything. (“A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”)

  • Do absolute probabilities matter? We usually need the most-likely theory (“Deer or bear?”), not the global chance of this scenario (“What’s the probability of deer at night in the winter by the river vs. bears at night in the winter by the river?”). Many Bayesian calculations ignore the global probabilities, which cancel when dividing, and essentially use an odds-centric approach.

  • Can our filter be tricked? A spam message might add chunks of normal text to appear innocuous and “poison” the filter. You’ve probably seen this yourself.

  • What evidence should we use? Let the data speak. Email might have dozens of characteristics (time of day, message headers, country of origin, HTML tags…). Give every characteristic a likelihood factor and let Bayes sort ’em out.

Thinking With Ratios and Percentages

The ratio and percentage approaches ask slightly different questions:

Ratios: Given the odds of each outcome, how does evidence adjust them?

[Figure: Bayes Theorem ratio examples]

The evidence adjustment just skews the initial odds, piece-by-piece.

Percentages: What is the chance of an outcome after supporting evidence is found?

[Figure: Bayes Theorem ratio example as percentages]

In the percentage case,

  • “% Bears” is the overall chance of a bear appearing anywhere
  • “% Bears Going to River” is how likely a bear is to trigger the “river” data point
  • “% Bear at River” is the combined chance of having a bear, and it going to the river. In stats terms, P(event and evidence) = P(event) * P(event implies evidence) = P(event) * P(evidence|event). I see conditional probabilities as “Chances that X implies Y” not the twisted “Chances of Y, given X happened”.

Let’s redo the original cancer example:

  • 1% of the population has cancer
  • 9.6% of healthy people test positive, 80% of people with cancer do

If you see a positive result, what’s the chance of cancer?

Ratio Approach:

  • Cancer:Healthy ratio is 1:99
  • Evidence adjustment: 80/100 : 9.6/100 = 80:9.6 (80% of sick people are “at the river”, and 9.6% of healthy people are).
  • Final odds: 1:99 * 80:9.6 = 80:950.4 (roughly 1:12 odds of cancer, ~7.7% chance)

The intuition: the initial 1:99 odds are pretty skewed. Even with an 8.3x (80:9.6) boost from a positive test result, cancer remains unlikely.

Percentage Approach:

  • Cancer chance is 1%
  • Chance of true positive = 1% * 80% = .008
  • Chance of false positive = 99% * 9.6% = .09504
  • Chance of having cancer = .008 / (.008 + .09504) = 7.7%

When written with percentages, we start from absolute chances. There’s a global 0.8% chance of finding a sick patient with a positive result, and a global 9.504% chance of a healthy patient with a positive result. We then compute the chance these global percentages indicate something useful.
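The two approaches can be checked side by side. A quick Python sketch of the numbers above (variable names are my own):

```python
# Cancer-test numbers from the article: 1% base rate,
# 80% true-positive rate, 9.6% false-positive rate.
p_cancer, p_healthy = 0.01, 0.99
true_pos, false_pos = 0.80, 0.096

# Ratio approach: prior odds 1:99, evidence adjustment 80:9.6
cancer_side, healthy_side = 1 * 80, 99 * 9.6   # 80 : 950.4
chance_from_odds = cancer_side / (cancer_side + healthy_side)

# Percentage approach: compare the two global "positive result" chances
p_true = p_cancer * true_pos       # 0.008
p_false = p_healthy * false_pos    # 0.09504
chance_from_pct = p_true / (p_true + p_false)

print(round(chance_from_odds, 3), round(chance_from_pct, 3))  # 0.078 0.078
```

Same answer either way, as expected: the global scaling factors cancel when dividing.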

Let the approaches be complements: percentages for a bird’s-eye view, and ratios for seeing how individual odds are adjusted. We’ll save the myriad other interpretations for another day.

Happy math.

Join 450k Monthly Readers

Enjoy the article? There's plenty more to help you build a lasting, intuitive understanding of math. Join the newsletter for bonus content and the latest updates.

A Brief Introduction to Probability & Statistics

I've studied probability and statistics without experiencing them. What's the difference? What are they trying to do?

This analogy helped:

  • Probability is starting with an animal, and figuring out what footprints it will make.
  • Statistics is seeing a footprint, and guessing the animal.

[Figure: Probability vs Statistics diagram]

Probability is straightforward: you have the bear. Measure the foot size, the leg length, and you can deduce the footprints. "Oh, Mr. Bubbles weighs 400lbs and has 3-foot legs, and will make tracks like this." More academically: "We have a fair coin. After 10 flips, here are the possible outcomes."

Statistics is harder. We measure the footprints and have to guess what animal it could be. A bear? A human? If we get 6 heads and 4 tails, what're the chances of a fair coin?

The Usual Suspects

Here's how we "find the animal" with statistics:

Get the tracks. Each piece of data is a point in "connect the dots". The more data, the clearer the shape (1 spot in connect-the-dots isn't helpful. One data point makes it hard to find a trend.)

Measure the basic characteristics. Every footprint has a depth, width, and height. Every data set has a mean, median, standard deviation, and so on. These universal, generic descriptions give a rough narrowing: "The footprint is 6 inches wide: a small bear, or a large man?"

Find the species. There are dozens of possible animals (probability distributions) to consider. We narrow it down with prior knowledge of the system. In the woods? Think horses, not zebras. Dealing with yes/no questions? Consider a binomial distribution.

Look up the specific animal. Once we have the distribution ("bears"), we look up our generic measurements in a table. "A 6-inch wide, 2-inch deep pawprint is most likely a 3-year-old, 400-lbs bear". The lookup table is generated from the probability distribution, i.e. making measurements when the animal is in the zoo.

Make additional predictions. Once we know the animal, we can predict future behavior and other traits ("According to our calculations, Mr. Bubbles will poop in the woods."). Statistics helps us get information about the origin of the data, from the data itself.

Ok! The metaphor isn't perfect, but more palatable than "Statistics is the study of the collection, organization, analysis, and interpretation of data". Need proof? Let's see if we can ask intuitive "I tasted it!" questions:

  • What are the most common species? (Common distributions)
  • Are new ones being discovered?
  • Can we predict the next footprint? (Extrapolation)
  • Are the tracks following a path? (Regression / trend line)
  • Here's two tracks, which animal was faster? Bigger? (Data from two drug trials: which was more effective?)
  • Is one animal moving in the same direction as another? (Correlation)
  • Are two animals tracking a common source? (Causation: two bears chasing the same rabbit)

These questions are much deeper than what I pondered when first learning stats. Every dry procedure now has a context: are we learning a new species? How to take the generic footprint measurements? How to make a table from a probability distribution? How to lookup measurements in a table?

Having an analogy for the statistics process makes later data crunching click. Happy math.

PS. The forwards-backwards difference between probability and statistics shows up all over math. Some procedures are easy to do (derivatives) but difficult to undo (integrals). (Thanks Denis)

Understanding the Monty Hall Problem

The Monty Hall problem is a counter-intuitive statistics puzzle:

  • There are 3 doors, behind which are two goats and a car.
  • You pick a door (call it door A). You’re hoping for the car of course.
  • Monty Hall, the game show host, examines the other doors (B & C) and opens one with a goat. (If both doors have goats, he picks randomly.)

Here’s the game: Do you stick with door A (original guess) or switch to the unopened door? Does it matter?

Surprisingly, the odds aren’t 50-50. If you switch doors you’ll win 2/3 of the time!

Today let’s get an intuition for why a simple game could be so baffling. The game is really about re-evaluating your decisions as new information emerges.

Play the game

You’re probably muttering that two doors mean it’s a 50-50 chance. Ok bub, let’s play the game:

Try playing the game 50 times, using a “pick and hold” strategy. Just pick door 1 (or 2, or 3) and keep clicking. Click click click. Look at your percent win rate. You’ll see it settle around 1/3.

Now reset and play it 20 times, using a “pick and switch” approach. Pick a door, Monty reveals a goat (grey door), and you switch to the other. Look at your win rate. Is it above 50%? Is it closer to 60%? To 66%?

There’s a chance the pick-and-hold strategy does decently on a small number of trials (under 20 or so). If you had a coin, how many flips would you need to convince yourself it was fair? You might get 2 heads in a row and think it was rigged. Just play the game a few dozen times to even it out and reduce the noise.
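If clicking dozens of times sounds tedious, a short simulation does it for you. This Python sketch abstracts Monty’s filtering: since he always reveals a goat from the other doors, a switcher wins exactly when the first pick was wrong.

```python
import random

def play(switch, trials=10000):
    """Win rate for the 3-door game. Monty's filtering means a
    switcher wins exactly when the first pick was wrong."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        wins += (pick != car) if switch else (pick == car)
    return wins / trials

print(play(switch=False))  # settles near 1/3
print(play(switch=True))   # settles near 2/3
```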

Understanding Why Switching Works

That’s the hard (but convincing) way of realizing switching works. Here’s an easier way:

If I pick a door and hold, I have a 1/3 chance of winning.

My first guess is 1 in 3 — there are 3 random options, right?

If I rigidly stick with my first choice no matter what, I can’t improve my chances. Monty could add 50 doors, blow the other ones up, do a voodoo rain dance — it doesn’t matter. The best I can do with my original choice is 1 in 3. The other door must have the rest of the chances, or 2/3.

The explanation may make sense, but doesn’t explain why the odds “get better” on the other side. (Several readers have left their own explanations in the comments — try them out if the 1/3 stay vs 2/3 switch doesn’t click).

Understanding The Game Filter

Let’s see why removing doors makes switching attractive. Instead of the regular game, imagine this variant:

  • There are 100 doors to pick from in the beginning
  • You pick one door
  • Monty looks at the 99 others, finds the goats, and opens all but 1

Do you stick with your original door (1/100), or the other door, which was filtered from 99? (Try this in the simulator game; use 10 doors instead of 100).

It’s a bit clearer: Monty is taking a set of 99 choices and improving them by removing 98 goats. When he’s done, he has the top door out of 99 for you to pick.

Your decision: Do you want a random door out of 100 (initial guess) or the best door out of 99? Said another way, do you want 1 random chance or the best of 99 random chances?

We’re starting to see why Monty’s actions help us. He’s letting us choose between a generic, random choice and a curated, filtered choice. Filtered is better.

But… but… shouldn’t two choices mean a 50-50 chance?

Overcoming Our Misconceptions

Assuming that “two choices means 50-50 chances” is our biggest hurdle.

Yes, two choices are equally likely when you know nothing about either choice. If I picked two random Japanese pitchers and asked “Who is ranked higher?” you’d have no guess. You pick the name that sounds cooler, and 50-50 is the best you can do. You know nothing about the situation.

Now, let’s say Pitcher A is a rookie, never been tested, and Pitcher B won the “Most Valuable Player” award the last 10 years in a row. Would this change your guess? Sure thing: you’ll pick Pitcher B (with near-certainty). Your uninformed friend would still call it a 50-50 situation.

Information matters.

The more you know…

Here’s the general idea: The more you know, the better your decision.

With the Japanese baseball players, you know more than your friend and have better chances. Yes, yes, there’s a chance the new rookie is the best player in the league, but we’re talking probabilities here. The more you test the old standard, the less likely the new choice beats it.

This is what happens with the 100 door game. Your first pick is a random door (1/100) and your other choice is the champion that beat out 98 other doors (aka the MVP of the league). The odds are the champ is better than your door, too.

Visualizing the probability cloud

Here’s how I visualize the filtering process. At the start, every door has an equal chance — I imagine a pale green cloud, evenly distributed among all the doors.

As Monty starts removing the bad candidates (in the 99 you didn’t pick), he “pushes” the cloud away from the bad doors to the good ones on that side. On and on it goes — and the remaining doors get a brighter green cloud.

After all the filtering, there’s your original door (still with a pale green cloud) and the “Champ Door” glowing nuclear green, holding the combined chances of the 99 doors you didn’t pick.

Here’s the key: Monty does not try to improve your door!

He is purposefully not examining your door and trying to get rid of the goats there. No, he is only “pulling the weeds” out of the neighbor’s lawn, not yours.

Generalizing the game

The general principle is to re-evaluate probabilities as new information is added. For example:

  • A Bayesian Filter improves as it gets more information about whether messages are spam or not. You don’t want to stay static with your initial training set of data.

  • Evaluating theories. Without any evidence, two theories are equally likely. As you gather additional evidence (and run more trials) you increase your confidence that theory A or B is correct. One aspect of statistics is determining “how much” information is needed to have confidence in a theory.

These are general cases, but the message is clear: more information means you re-evaluate your choices. The fatal flaw of the Monty Hall paradox is not taking Monty’s filtering into account, thinking the chances are the same before and after he filters the other doors.

Summary

Here’s the key points to understanding the Monty Hall puzzle:

  • Two choices are 50-50 when you know nothing about them
  • Monty helps us by “filtering” the bad choices on the other side. It’s a choice between a random guess and the “Champ Door” that’s the best on the other side.
  • In general, more information means you re-evaluate your choices.

The fatal flaw in the Monty Hall paradox is not taking Monty’s filtering into account, thinking the chances are the same before and after. But the goal isn’t to understand this puzzle — it’s to realize how subsequent actions & information challenge previous decisions. Happy math.

Appendix

Let’s think about other scenarios to cement our understanding:

Your buddy makes a guess

Suppose your friend walks into the game after you’ve picked a door and Monty has revealed a goat — but he doesn’t know the reasoning that Monty used.

He sees two doors and is told to pick one: he has a 50-50 chance! He doesn’t know why one door or the other should be better (but you do). The main confusion is that we think we’re like our buddy — we forget (or don’t realize) the impact of Monty’s filtering.

Monty goes wild

Monty reveals the goat, and then has a seizure. He closes the door and mixes all the prizes, including your door. Does switching help?

No. Monty started to filter but never completed it — you have 3 random choices, just like in the beginning.

Multiple Monty

Monty gives you 6 doors: you pick 1, and he divides the 5 others into a group of 2 and 3. He then removes goats until each group has 1 door remaining. What do you switch to?

The group that originally had 3. It has 3 doors “collapsed” into 1, for a 3/6 = 50% chance. Your original guess has 1/6 (about 17%), and the group that had 2 has a 2/6 = 33% chance of being right.
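A quick simulation of this variant confirms the split (the door labels are an arbitrary choice):

```python
import random

# 6 doors: yours is door 0; the rest split into a pair (1, 2) and a
# trio (3, 4, 5). Monty removes goats until each group has one door,
# so a group's surviving door hides the car whenever the group does.
def multiple_monty(trials=30000):
    wins = {"yours": 0, "pair": 0, "trio": 0}
    for _ in range(trials):
        car = random.randrange(6)
        if car == 0:
            wins["yours"] += 1
        elif car in (1, 2):
            wins["pair"] += 1
        else:
            wins["trio"] += 1
    return {group: count / trials for group, count in wins.items()}

print(multiple_monty())  # roughly 1/6, 1/3, and 1/2
```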

How To Analyze Data Using the Average

The average is a simple term with several meanings. The type of average to use depends on whether you’re adding, multiplying, grouping or dividing work among the items in your set.

Quick quiz: You drove to work at 30 mph, and drove back at 60 mph. What was your average speed?

Hint: It’s not 45 mph, and it doesn’t matter how far your commute is. Read on to understand the many uses of this statistical tool.

[Figure: Examples of the average: mean, median, mode, geometric mean, harmonic mean]

But what does it mean?

Let’s step back a bit: what is the “average” all about?

To most of us, it’s “the number in the middle” or a number that is “balanced”. I’m a fan of taking multiple viewpoints, so here’s another interpretation of the average:

The average is the value that can replace every existing item, and have the same result. If I could throw away my data and replace it with one “average” value, what would it be?

One goal of the average is to understand a data set by getting a “representative” sample. But the calculation depends on how the items in the group interact. Let’s take a look.

The Arithmetic Mean

The arithmetic mean is the most common type of average:

\displaystyle{\text{average} = \frac{\text{sum}}{\text{number}}}

[Figure: Arithmetic mean]

Let’s say you weigh 150 lbs, and are in an elevator with a 100lb kid and 350lb walrus. What’s the average weight?

The real question is “If you replaced this merry group with 3 identical people and want the same load in the elevator, what should each clone weigh?”

In this case, we’d swap in three people weighing 200 lbs each [(150 + 100 + 350)/3], and nobody would be the wiser.

Pros:

  • It works well for lists that are simply combined (added) together.
  • Easy to calculate: just add and divide.
  • It’s intuitive — it’s the number “in the middle”, pulled up by large values and brought down by smaller ones.

Cons:

  • The average can be skewed by outliers — it doesn’t deal well with wildly varying samples. The average of 100, 200 and -300 is 0, which is misleading.

The arithmetic mean works great 80% of the time; many quantities are added together. Unfortunately, there’s always those 20% of situations where the average doesn’t quite fit.

Median

The median is “the item in the middle”. But doesn’t the average (arithmetic mean) imply the same thing? What gives?

Humor me for a second: what’s the “middle” of these numbers?

  • 1, 2, 3, 4, 100

Well, 3 is the middle of the list. And although the average (22) is somewhere in the “middle”, 22 doesn’t really represent the distribution. We’re more likely to get a number closer to 3 than to 22. The average has been pulled up by 100, an outlier.

The median solves this problem by taking the number in the middle of a sorted list. If there’s two middle numbers (even number of items), just take their average. Outliers like 100 only tug the median along one item in the sorted list, instead of making a drastic change: the median of 1 2 3 4 is 2.5.
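Python’s stdlib `statistics` module makes the outlier effect concrete:

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]
print(mean(data))            # 22: the outlier drags the mean way up
print(median(data))          # 3: the outlier only nudges the median
print(median([1, 2, 3, 4]))  # 2.5: even-length lists average the middle two
```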

[Figure: Median]

Pros:

  • Handles outliers well — often the most accurate representation of a group
  • Splits data into two groups, each with the same number of items

Cons:

  • Can be harder to calculate: you need to sort the list first
  • Not as well-known; when you say “median”, people may think you mean “average”

Some jokes run along the lines of “Half of all drivers are below average. Scary, isn’t it?”. But really, in your head, you know they should be saying “half of all drivers are below median”.

Figures like housing prices and incomes are often given in terms of the median, since we want an idea of the middle of the pack. Bill Gates earning a few billion extra one year might bump up the average income, but it isn’t relevant to how a regular person’s wage changed. We aren’t interested in “adding” incomes or house prices together — we just want to find the middle one.

Again, the type of average to use depends on how the data is used.

Mode

The mode sounds strange, but it just means take a vote. And sometimes a vote, not a calculation, is the best way to get a representative sample of what people want.

Let’s say you’re throwing a party and need to pick a day (1 is Monday and 7 is Sunday). The “best” day would be the option that satisfies the most people: an average may not make sense. (“Bob likes Friday and Alice likes Sunday? Saturday it is!”).

Similarly, colors, movie preferences and much more can be measured with numbers. But again, the ideal choice may be the mode, not the average: the “average” color or “average” movie could be… unsatisfactory (Rambo meets Pride and Prejudice).

[Figure: Mode]

Pros:

  • Works well for exclusive voting situations (this choice or that one; no compromise)
  • Gives a choice that the most people wanted (whereas the average can give a choice that nobody wanted).
  • Simple to understand

Cons:

  • Requires more effort to compute (have to tally up the votes)
  • “Winner takes all” — there’s no middle path

The term “mode” isn’t that common, but now you know what button to look for when playing around with your favorite statistics program.

Geometric Mean

The “average item” depends on how we use our existing elements. Most of the time, items are added together and the arithmetic mean works fine. But sometimes we need to do more. When dealing with investments, area and volume, we don’t add factors, we multiply them.

Let’s try an example. Which portfolio do you prefer, i.e. which has a better typical year?

  • Portfolio A: +10%, -10%, +10%, -10%
  • Portfolio B: +30%, -30%, +30%, -30%

They look pretty similar. Our everyday average (arithmetic mean) tells us they’re both rollercoasters, but should average out to zero profit or loss. And maybe B is better because it seems to gain more in the good years. Right?

Wrongo! Talk like that will get you burned on the stock market: investment returns are multiplied, not added! We can’t be all willy-nilly and use the arithmetic mean — we need to find the actual rate of return:

  • Portfolio A:
    • Return: 1.1 * .9 * 1.1 * .9 = .98 (2% loss)
    • Year-over-year average: (.98)^(1/4) ≈ .995, a 0.5% loss per year (this happens to be about 2%/4 because the numbers are small).
  • Portfolio B:
    • 1.3 * .7 * 1.3 * .7 = .83 (17% loss)
    • Year-over-year average: (.83)^(1/4) ≈ .954, a 4.6% loss per year.

A 2% vs 17% loss? That’s a huge difference! I’d stay away from both portfolios, but would choose A if forced. We can’t just add and divide the returns — that’s not how exponential growth works.

[Figure: Geometric mean]

Some more examples:

  • Inflation rates: You have inflation of 1%, 2%, and 10%. What was the average inflation during that time? (1.01 * 1.02 * 1.10)^(1/3) ≈ 1.043, or 4.3% per year
  • Coupons: You have coupons for 50%, 25% and 35% off. Assuming you can use them all, what’s the average discount? (i.e. What coupon could be used 3 times?). You pay .5, .75 and .65 of the price, and (.5 * .75 * .65)^(1/3) ≈ .625, a 37.5% discount. Think of coupons as a “negative” return — for the store, anyway.
  • Area: You have a plot of land 40 × 60 yards. What’s the “average” side — i.e., how large would the corresponding square be? (40 * 60)^(0.5) = 49 yards.
  • Volume: You’ve got a shipping box 12 × 24 × 48 inches. What’s the “average” size, i.e. how large would the corresponding cube be? (12 * 24 * 48)^(1/3) = 24 inches.

I’m sure you can find many more examples: the geometric mean finds the “typical element” when items are multiplied together. You take a set of numbers, multiply them, and take the Nth root (where N is the number of items you're considering).
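All of these are the same computation: multiply, then take the Nth root. A tiny Python sketch (`geometric_mean` is a hand-rolled helper; Python’s `statistics` module also ships one):

```python
def geometric_mean(factors):
    """Multiply the factors, then take the Nth root."""
    product = 1.0
    for f in factors:
        product *= f
    return product ** (1 / len(factors))

# Portfolio A's yearly growth factors: +10%, -10%, +10%, -10%
print(geometric_mean([1.1, 0.9, 1.1, 0.9]))  # ~0.995, about a 0.5% yearly loss

# The 40 x 60 plot of land: side of the equivalent square
print(geometric_mean([40, 60]))  # ~49
```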

I had wondered for a long time why the geometric mean was useful — now we know.

Harmonic Mean

The harmonic mean is more difficult to visualize, but is still useful. (By the way, “harmonics” refer to numbers like 1/2, 1/3 — 1 over anything, really.) The harmonic mean helps us calculate average rates when several items are working together. Let’s take a look.

If I have a rate of 30 mph, it means I get some result (going 30 miles) for every input (driving 1 hour). When averaging the impact of multiple rates (X & Y), you need to think about outputs and inputs, not the raw numbers.

average rate = total output/total input

[Figure: Harmonic mean]

If we put both X and Y on a project, each doing the same amount of work, what is the average rate? Suppose X is 30 mph and Y is 60 mph. If we have them do similar tasks (drive a mile), the reasoning is:

  • X takes 1/X time (1 mile = 1/30 hour)
  • Y takes 1/Y time (1 mile = 1/60 hour)

Combining inputs and outputs we get:

  • Total output: 2 miles (X and Y each contribute “1”)
  • Total input: 1/X + 1/Y (each takes a different amount of time; imagine a relay race)

And the average rate, output/input, is:

\displaystyle{\frac{2}{ \frac{1}{X} + \frac{1}{Y} }}

If we had 3 items in the mix (X, Y and Z) the average rate would be:

\displaystyle{\frac{3}{ \frac{1}{X} + \frac{1}{Y} + \frac{1}{Z} }}

It’s nice to have this shortcut instead of doing the algebra each time — even finding the average of 5 rates isn’t so bad. With our example, we went to work at 30mph and came back at 60mph. To find the average speed, we just use the formula.

But don’t we need to know how far work is? Nope! No matter how long the route is, X and Y have the same output; that is, we go R miles at speed X, and another R miles at speed Y. The average speed is the same as going 1 mile at speed X and 1 mile at speed Y:

\displaystyle{\frac{2R}{\frac{R}{30} + \frac{R}{60}} = \frac{2}{\frac{1}{30} + \frac{1}{60}} = 40}

It makes sense for the average to be skewed towards the slower speed (closer to 30 than 60). After all, we spend twice as much time going 30mph than 60mph: if work is 60 miles away, it’s 2 hours there and 1 hour back.

Key idea: The harmonic mean is used when two rates contribute to the same workload. Each rate is in a relay race and contributing the same amount to the output. For example, we’re doing a round trip to work and back. Half the result (distance traveled) is from the first rate (30mph), and the other half is from the second rate (60mph).

The gotcha: Remember that the average is a single element that replaces every element. In our example, we drive 40mph on the way there (instead of 30) and drive 40 mph on the way back (instead of 60). It’s important to remember that we need to replace each “stage” with the average rate.

A few examples:

  • Data transmission: We’re sending data between a client and server. The client sends data at 10 gigabytes/dollar, and the server receives at 20 gigabytes/dollar. What’s the average cost? Well, we average 2 / (1/10 + 1/20) = 13.3 gigabytes/dollar for each part. That is, we could swap the client & server for two machines that cost 13.3 gb/dollar. Because data is both sent and received (each part doing “half the job”), our true rate is 13.3 / 2 = 6.65 gb/dollar.

  • Machine productivity: We’ve got a machine that needs to prep and finish parts. When prepping, it runs at 25 widgets/hour. When finishing, it runs at 10 widgets/hour. What’s the overall rate? Well, it averages 2 / (1/25 + 1/10) = 14.28 widgets/hour for each stage. That is, the existing times could be replaced with two phases running at 14.28 widgets/hour for the same effect. Since a part goes through both phases, the machine completes 14.28/2 = 7.14 widgets/hour.

  • Buying stocks. Suppose you buy \$1000 worth of stocks each month, no matter the price (dollar cost averaging). You pay \$25/share in Jan, \$30/share in Feb, and \$35/share in March. What was the average price paid? It is 3 / (1/25 + 1/30 + 1/35) = \$29.44 (since you bought more at the lower price, and less at the more expensive one). And you have \$3000 / 29.44 = 101.90 shares. The “workload” is a bit abstract — it’s turning dollars into shares. Some months use more dollars to buy a share than others, and in this case a high rate is bad.

Again, the harmonic mean helps measure rates working together on the same result.
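The examples above can be checked with a few lines of Python (a quick sketch; the function name is mine):

```python
def harmonic_mean(rates):
    """Average rate when each rate handles an equal share of the work."""
    return len(rates) / sum(1 / r for r in rates)

# Round trip: 30 mph out, 60 mph back
print(harmonic_mean([30, 60]))        # 40.0

# Machine with prep (25 widgets/hr) and finish (10 widgets/hr) stages
stage_rate = harmonic_mean([25, 10])  # ~14.29 widgets/hour per stage
print(stage_rate / 2)                 # ~7.14 widgets/hour overall

# Dollar-cost averaging: $25, $30, $35 per share
print(harmonic_mean([25, 30, 35]))    # ~$29.44 per share
```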

Yikes, that was tricky

The harmonic mean is tricky: if you have separate machines running at 10 parts/hour and 20 parts/hour, then your average really is 15 parts/hour since each machine is independent and you are adding the capabilities. In that case, the arithmetic mean works just fine.

Sometimes it’s good to double-check to make sure the math works out. In the machine example, we claim to produce 7.14 widgets/hour. Ok, how long would it take to make 7.14 widgets?

  • Prepping: 7.14 / 25 = .29 hours
  • Finishing: 7.14 / 10 = .71 hours

And yes, .29 + .71 = 1, so the numbers work out: it does take 1 hour to make 7.14 widgets. When in doubt, try running a few examples to make sure your average rate really is what you calculated.

Conclusion

Even a simple idea like the average has many uses — there are more uses we haven’t covered (center of gravity, weighted averages, expected value). The key point is this:

  • The “average item” can be seen as the item that could replace all the others
  • The type of average depends on how existing items are used (Added? Multiplied? Used as rates? Used as exclusive choices?)

It surprised me how useful and varied the different types of averages were for analyzing data. Happy math.

Join 450k Monthly Readers

Enjoy the article? There's plenty more to help you build a lasting, intuitive understanding of math. Join the newsletter for bonus content and the latest updates.

An Intuitive (and Short) Explanation of Bayes’ Theorem

Bayes’ theorem was the subject of a detailed article. The essay is good, but over 15,000 words long — here’s the condensed version for Bayesian newcomers like myself:

  • Tests are not the event. We have a cancer test, separate from the event of actually having cancer. We have a test for spam, separate from the event of actually having a spam message.

  • Tests are flawed. Tests detect things that don’t exist (false positive), and miss things that do exist (false negative). People often use test results without adjusting for test errors.

  • False positives skew results. Suppose you are searching for something really rare (1 in a million). Even with a good test, it’s likely that a positive result is really a false positive from one of the other 999,999 people.

  • People prefer natural numbers. Saying “100 in 10,000” rather than “1%” helps people work through the numbers with fewer errors, especially with multiple percentages (“Of those 100, 80 will test positive” rather than “80% of the 1% will test positive”).

  • Even science is a test. At a philosophical level, scientific experiments are “potentially flawed tests” and need to be treated accordingly. There is a test for a chemical, or a phenomenon, and there is the event of the phenomenon itself. Our tests and measuring equipment have a rate of error to be accounted for.

Bayes’ theorem converts the results from your test into the real probability of the event. For example, you can:

  • Correct for measurement errors. If you know the real probabilities and the chance of a false positive and false negative, you can correct for measurement errors.

  • Relate the actual probability to the measured test probability. Given mammogram test results and known error rates, you can predict the actual chance of having cancer given a positive test. In technical terms, you can find Pr(H|E), the chance that a hypothesis H is true given evidence E, starting from Pr(E|H), the chance that evidence appears when the hypothesis is true.

Anatomy of a Test

The article describes a cancer testing scenario:

  • 1% of women have breast cancer (and therefore 99% do not).
  • 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
  • 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

Put in a table, the probabilities look like this:

                    Cancer (1%)    No cancer (99%)
    Test positive   80%            9.6%
    Test negative   20%            90.4%

How do we read it?

  • 1% of people have cancer
  • If you already have cancer, you are in the first column. There’s an 80% chance you will test positive. There’s a 20% chance you will test negative.
  • If you don’t have cancer, you are in the second column. There’s a 9.6% chance you will test positive, and a 90.4% chance you will test negative.

How Accurate Is The Test?

Now suppose you get a positive test result. What are the chances you have cancer? 80%? 99%? 1%?

Here’s how I think about it:

  • Ok, we got a positive result. It means we’re somewhere in the top row of our table. Let’s not assume anything — it could be a true positive or a false positive.
  • The chances of a true positive = chance you have cancer * chance test caught it = 1% * 80% = .008
  • The chances of a false positive = chance you don’t have cancer * chance test caught it anyway = 99% * 9.6% = 0.09504

The table looks like this:

                    Cancer (1%)          No cancer (99%)
    Test positive   1% * 80% = .008      99% * 9.6% = .09504
    Test negative   1% * 20% = .002      99% * 90.4% = .89496

And what was the question again? Oh yes: what’s the chance we really have cancer if we get a positive result? The chance of an event is the number of ways it could happen given all possible outcomes:

\displaystyle{ \text{Probability} = \frac{\text{desired event}}{\text{all possibilities}} }

The chance of getting a real, positive result is .008. The chance of getting any type of positive result is the chance of a true positive plus the chance of a false positive (.008 + 0.09504 = .10304).

So, our chance of cancer is .008/.10304 = 0.0776, or about 7.8%.

Interesting — a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 9.6% of the time (quite high), so there will be many false positives in a given population. For a rare disease, most of the positive test results will be wrong.

Let’s test our intuition by drawing a conclusion from simply eyeballing the table. If you take 100 people, only 1 person will have cancer (1%), and they’re most likely going to test positive (80% chance). Of the 99 remaining people, about 10% will test positive, so we’ll get roughly 10 false positives. Considering all the positive tests, just 1 in 11 is correct, so there’s a 1/11 chance of having cancer given a positive test. The real number is 7.8% (closer to 1/13, computed above), but we found a reasonable estimate without a calculator.

Bayes’ Theorem

We can turn the process above into an equation, which is Bayes’ Theorem. It lets you take the test results and correct for the “skew” introduced by false positives. You get the real chance of having the event. Here’s the equation:

\displaystyle{\Pr(\mathrm{H}|\mathrm{E}) = \frac{\Pr(\mathrm{E}|\mathrm{H})\Pr(\mathrm{H})}{\Pr(\mathrm{E}|\mathrm{H})\Pr(\mathrm{H}) + \Pr(\mathrm{E}|\mathrm{\sim H})\Pr(\mathrm{\sim H})}}

And here’s the decoder key to read it:

  • Pr(H|E) = Chance of having cancer (H) given a positive test (E). This is what we want to know: How likely is it to have cancer with a positive result? In our case it was 7.8%.
  • Pr(E|H) = Chance of a positive test (E) given that you had cancer (H). This is the chance of a true positive, 80% in our case.
  • Pr(H) = Chance of having cancer (1%).
  • Pr(not H) = Chance of not having cancer (99%).
  • Pr(E|not H) = Chance of a positive test (E) given that you didn’t have cancer (not H). This is a false positive, 9.6% in our case.


It all comes down to the chance of a true positive divided by the chance of any positive. We can simplify the equation to:

\displaystyle{\Pr(\mathrm{H}|\mathrm{E}) = \frac{\Pr(\mathrm{E}|\mathrm{H})\Pr(\mathrm{H})}{\Pr(\mathrm{E})}}

Pr(E) tells us the chance of getting any positive result, whether a true positive in the cancer population (1%) or a false positive in the non-cancer population (99%). It acts like a weighting factor, adjusting the odds towards the more likely outcome.

Forgetting to account for false positives is what makes the low 7.8% chance of cancer (given a positive test) seem counter-intuitive. Thank you, normalizing constant, for setting us straight!
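The whole calculation fits in a few lines of Python (a minimal sketch using the article’s numbers; the function name is mine):

```python
def bayes(p_h, p_e_given_h, p_e_given_not_h):
    """Pr(H|E): chance the hypothesis is true given a positive test."""
    true_pos = p_h * p_e_given_h              # real event, test caught it
    false_pos = (1 - p_h) * p_e_given_not_h   # no event, test fired anyway
    return true_pos / (true_pos + false_pos)  # normalize by all positives

# 1% have cancer, 80% detection rate, 9.6% false positive rate
print(bayes(0.01, 0.80, 0.096))  # ~0.0776, about 7.8%
```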

Intuitive Understanding: Shine The Light

The article mentions an intuitive understanding about shining a light through your real population and getting a test population. The analogy makes sense, but it takes a few thousand words to get there :).

Consider a real population. You run a test, which “shines light” through that real population and creates some test results. If the light is completely accurate, the test probabilities and real probabilities match up. Everyone who tests positive is actually “positive”. Everyone who tests negative is actually “negative”.

But this is the real world. Tests go wrong. Sometimes the people who have cancer don’t show up in the tests, and the other way around.

Bayes’ Theorem lets us look at the skewed test results and correct for errors, recreating the original population and finding the real chance of a true positive result.

Bayesian Spam Filtering

One clever application of Bayes’ Theorem is in spam filtering. We have

  • Event A: The message is spam.
  • Test X: The message contains certain words (X).

Plugged into a more readable formula (from Wikipedia):

\displaystyle{\Pr(\mathrm{spam}|\mathrm{words}) = \frac{\Pr(\mathrm{words}|\mathrm{spam})\Pr(\mathrm{spam})}{\Pr(\mathrm{words})}}

Bayesian filtering allows us to predict the chance a message is really spam given the “test results” (the presence of certain words). Clearly, words like “viagra” have a higher chance of appearing in spam messages than in normal ones.

Spam filtering based on a blacklist is flawed — it’s too restrictive, and too many legitimate messages get flagged (false positives). Bayesian filtering gives us a middle ground — we use probabilities. As we analyze the words in a message, we can compute the chance it is spam (rather than making a yes/no decision). If a message has a 99.9% chance of being spam, it probably is. As the filter gets trained with more and more messages, it updates the probabilities that certain words lead to spam messages. Advanced Bayesian filters can examine multiple words in a row as another data point.
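A single-word version of the filter is just Bayes’ Theorem again (a toy sketch with made-up numbers; real filters combine many words and smooth their counts):

```python
def p_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Chance a message is spam given it contains a particular word."""
    p_ham = 1 - p_spam
    # Total chance the word appears in any message (the normalizer)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Hypothetical numbers: "viagra" appears in 20% of spam, 0.1% of normal
# mail, and 40% of all messages are spam.
print(p_spam_given_word(0.20, 0.001, 0.40))  # ~0.99, almost surely spam
```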

Further Reading

There’s a lot being said about Bayes:

Have fun!



Understanding the Birthday Paradox

23 people. In a room of just 23 people there’s a 50-50 chance of at least two people having the same birthday. In a room of 75 there’s a 99.9% chance of at least two people matching.

Put down the calculator and pitchfork, I don’t speak heresy. The birthday paradox is strange, counter-intuitive, and completely true. It’s only a “paradox” because our brains can’t handle the compounding power of exponents. We expect probabilities to be linear and only consider the scenarios we’re involved in (both faulty assumptions, by the way).

Let’s see why the paradox happens and how it works.

Problem 1: Exponents aren’t intuitive

We’ve taught ourselves mathematics and statistics, but let’s not kid ourselves: it’s not natural.

Here’s an example: What’s the chance of getting 10 heads in a row when flipping coins? The untrained brain might think like this:

“Well, getting one head is a 50% chance. Getting two heads is twice as hard, so a 25% chance. Getting ten heads is probably 10 times harder… so about 50%/10 or a 5% chance.”

And there we sit, smug as a bug on a rug. No dice, bub.

After pounding your head with statistics, you know not to divide, but to use exponents. The chance of 10 heads is not .5/10 but $.5^{10}$, or about .001.

[Chart: coin flip odds, the chance of consecutive heads drops exponentially]

But even after training, we get caught again. At 5% interest we’ll double our money in 14 years, rather than the “expected” 20. Did you naturally infer the Rule of 72 when learning about interest rates? Probably not. Understanding compound exponential growth with our linear brains is hard.
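You can verify the doubling time directly (a quick check; the Rule of 72 estimate would be 72/5 ≈ 14.4 years):

```python
import math

rate = 0.05  # 5% annual interest
doubling_time = math.log(2) / math.log(1 + rate)
print(round(doubling_time, 1))  # 14.2 years, not the "expected" 20
```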

Problem 2: Humans are a tad bit selfish

Take a look at the news. Notice how much of the negative news is the result of acting without considering others. I’m an optimist and do have hope for mankind, but that’s a separate discussion :).

In a room of 23, do you think of the 22 comparisons where your birthday is being compared against someone else’s? Probably.

Do you think of the 231 comparisons where someone who is not you is being checked against someone else who is not you? Do you realize there are so many? Probably not.

The fact that we neglect the 10 times as many comparisons that don’t include us helps us see why the “paradox” can happen.

Ok, fine, humans are awful: Show me the math!

The question: What are the chances that two people share a birthday in a group of 23?

Sure, we could list the pairs and count all the ways they could match. But that’s hard: there could be 1, 2, 3 or even 23 matches!

It’s like asking “What’s the chance of getting one or more heads in 23 coin flips?” There are so many possibilities: heads on the first throw, or the 3rd, or the last, or the 1st and 3rd, the 2nd and 21st, and so on.

How do we solve the coin problem? Flip it around (Get it? Get it?). Rather than counting every way to get heads, find the chance of getting all tails, our “problem scenario”.

If there’s a 1% chance of getting all tails (more like .5^23 but work with me here), there’s a 99% chance of having at least one head. I don’t know if it’s 1 head, or 2, or 15 or 23: we got heads, and that’s what matters. If we subtract the chance of a problem scenario from 1 we are left with the probability of a good scenario.

The same principle applies for birthdays. Instead of finding all the ways we match, find the chance that everyone is different, the “problem scenario”. We then take the opposite probability and get the chance of a match. It may be 1 match, or 2, or 20, but somebody matched, which is what we need to find.

Explanation: Counting Pairs (Approximate Formula)

With 23 people we have 253 pairs:

\displaystyle{\frac{23 \cdot 22}{2} = 253}

(Brush up on combinations and permutations if you like).

The chance of 2 people having different birthdays is:

\displaystyle{1 - \frac{1}{365} = \frac{364}{365} = .997260}

Makes sense, right? When comparing one person's birthday to another, in 364 out of 365 scenarios they won't match. Fine.

But making 253 comparisons and having them all be different is like getting heads 253 times in a row -- you had to dodge "tails" each time. Let's get an approximate solution by pretending birthday comparisons are like coin flips. (See Appendix A for the exact calculation.)

We use exponents to find the probability:

\displaystyle{\left(\frac{364}{365}\right)^{253} = .4995}

Our chance of getting a single miss is pretty high (99.7260%), but when you take that chance hundreds of times, the odds of keeping up that streak drop. Fast.

[Chart: the chance of all-different birthdays drops exponentially as comparisons accumulate]

The chance we find a match is: 1 – 49.95% = 50.05%, or just over half! If you want to find the probability of a match for any number of people n the formula is:

\displaystyle{p(n) = 1 - \left(\frac{364}{365}\right)^{C(n,2)} = 1 - \left(\frac{364}{365}\right)^{n(n-1)/2} }
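The approximate formula translates directly to code (a sketch; it treats each pair comparison as an independent coin flip):

```python
def p_match(n, days=365):
    """Approximate chance of at least one shared birthday among n people."""
    pairs = n * (n - 1) // 2                      # C(n, 2) comparisons
    p_all_different = ((days - 1) / days) ** pairs
    return 1 - p_all_different

print(round(p_match(23), 4))  # 0.5005 -- just over half
print(round(p_match(75), 4))  # 0.9995
```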

Interactive Example

I didn’t believe we needed only 23 people. The math works out, but is it real?

You bet. Try the example below: Pick a number of items (365), a number of people (23) and run a few trials. You’ll see the theoretical match and your actual match as you run your trials. Go ahead, click the button (or see the full page).

As you run more and more trials (keep clicking!) the actual probability should approach the theoretical one.

Examples and Takeaways

Here are a few lessons from the birthday paradox:

  • $\sqrt{n}$ is roughly the number you need to have a 50% chance of a match with n items. $\sqrt{365}$ is about 20. This comes into play in cryptography for the birthday attack.
  • Even though there are $2^{128}$ (about 1e38) GUIDs, we only have $2^{64}$ (about 1e19) to use up before a 50% chance of collision. And 50% is really, really high.
  • You only need 13 people picking letters of the alphabet to have 95% chance of a match. Try it above (people = 13, items = 26).
  • Exponential growth rapidly decreases the chance of picking unique items (aka it increases the chances of a match). Remember: exponents are non-intuitive and humans are selfish!

After thinking about it a lot, the birthday paradox finally clicks with me. But I still check out the interactive example just to make sure.

Appendix A: Repeated Multiplication Explanation (Exact Formula)

Remember how we assumed birthdays are independent? Well, they aren’t.

If Person A and Person B match, and Person B and C match, we know that A and C must match also. The outcome of matching A and C depends on their results with B, so the probabilities aren’t independent. (If truly independent, A and C would have a 1/365 chance of matching, but we know it's a 100% guaranteed match.)

When counting pairs, we treated birthday matches like coin flips, multiplying the same probability over and over. This assumption isn’t strictly true but it’s good enough for a small number of people (23) compared to the sample size (365). It’s unlikely to have multiple people match and screw up the independence, so it’s a good approximation.

It’s unlikely, but it can happen. Let’s figure out the real chances of each person picking a different number:

  • The first person has a 100% chance of a unique number (of course)
  • The second has a (1 – 1/365) chance (all but 1 number from the 365)
  • The third has a (1 – 2/365) chance (all but 2 numbers)
  • The 23rd has a (1 – 22/365) (all but 22 numbers)

The multiplication looks pretty ugly:

\displaystyle{p(\text{different}) = 1 \cdot \left(1-\frac{1}{365}\right) \cdot \left(1-\frac{2}{365}\right)  \cdots \left(1-\frac{22}{365}\right)}

But there’s a shortcut we can take. When x is close to 0, a coarse first-order Taylor approximation for $e^x$ is:

\displaystyle{e^x  \approx 1 + x}

so

\displaystyle{ 1 - \frac{1}{365} \approx e^{-1/365}}

Using our handy shortcut we can rewrite the big equation to:

\displaystyle{p(\text{different}) \approx 1 \cdot e^{-1/365} \cdot e^{-2/365} \cdots e^{-22/365}}

\displaystyle{p(\text{different}) \approx e^{(-1 -2 -3 ... -22)/365}}

\displaystyle{p(\text{different}) \approx e^{-(1 + 2 + ... 22)/365}}

But we remember that adding the numbers 1 to n = n(n + 1)/2. Don’t confuse this with n(n-1)/2, which is C(n,2) or the number of pairs of n items. They look almost the same!

Adding 1 to 22 is (22 * 23)/2 so we get:

\displaystyle{p(\text{different}) \approx e^{-((23 \cdot 22) /(2 \cdot 365))} = .499998}

Phew. This approximation is very close; try plugging in your own numbers to compare.
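Comparing the exact product against the Taylor shortcut in code (a sketch; the function names are mine):

```python
import math

def p_different_exact(n, days=365):
    """Exact chance that n people all have different birthdays."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days  # person k+1 must dodge k taken days
    return p

def p_different_approx(n, days=365):
    """Taylor shortcut: e^(-n(n-1) / (2 * days))."""
    return math.exp(-n * (n - 1) / (2 * days))

print(round(p_different_exact(23), 6))   # 0.492703
print(round(p_different_approx(23), 6))  # 0.499998
```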

Good enough for government work, as they say. If you simplify the formula a bit and swap in n for 23 you get:

\displaystyle{p(\text{different}) \approx e^{-(n^2 / (2 \cdot 365))}}

and

\displaystyle{p(\text{match}) = 1 - p(\text{different}) \approx 1 - e^{-(n^2 / (2 \cdot 365))}}

With the exact formula, 366 people give a guaranteed collision: we multiply by $1 - 365/365 = 0$, which makes $p(\text{different}) = 0$ and $p(\text{match}) = 1$. With the approximation formula, 366 gives a near-guarantee, but not exactly 1: $1 - e^{-366^2 / (2 \cdot 365)} \approx 1$.

Appendix B: The General Birthday Formula

Let’s generalize the formula to picking n people from T total items (instead of 365):

\displaystyle{p(\text{different}) \approx e^{-(n^2 / (2 \cdot T))}}

If we choose a probability (like 50% chance of a match) and solve for n:

\displaystyle{p(\text{different}) \approx e^{-(n^2 / (2 \cdot T))}}

\displaystyle{1 - p(\text{match}) \approx e^{-(n^2 / (2 \cdot T))}}

\displaystyle{1 - .5 \approx e^{-(n^2 / (2 \cdot T))}}

Taking the natural log of both sides:

\displaystyle{\ln(.5) \approx -\frac{n^2}{2T}}

\displaystyle{-2\ln(.5)\cdot T \approx n^2}

\displaystyle{n \approx 1.177 \sqrt{T}}

Voila! If you take $\sqrt{T}$ items (17% more if you want to be picky) then you have about a 50-50 chance of getting a match. If you plug in other numbers you can solve for other probabilities:

\displaystyle{n \approx \sqrt{-2\ln(1-m)} \cdot \sqrt{T}}

Remember that m is the desired chance of a match (it’s easy to get confused, I did it myself). If you want a 90% chance of matching birthdays, plug m=90% and T=365 into the equation and see that you need 41 people.
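Solving for n in code (a sketch of the formula above; rounding up since people come in whole numbers):

```python
import math

def people_needed(match_chance, items=365):
    """Approximate group size for a given chance of some pair matching."""
    n = math.sqrt(-2 * math.log(1 - match_chance)) * math.sqrt(items)
    return math.ceil(n)  # round up: you can't have a fraction of a person

print(people_needed(0.5))  # 23 people for a 50% birthday match
print(people_needed(0.9))  # 41 people for a 90% chance
```

Plugging in items = 2**128 reproduces the GUID rule of thumb: roughly $1.18 \cdot 2^{64}$ draws before a 50% chance of collision.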

Wikipedia has even more details to satisfy your inner nerd. Go forth and enjoy.
