A Simple Introduction To Computer Networking

Most networking discussions are a jumble of acronyms. Forget the configuration details — what are the insights?

  • Networking is about communication
  • Text is the simplest way to communicate
  • Protocols are standards for reading and writing text

Beneath the details, networking is an IM conversation. Here’s what I wish someone had told me when I was learning how computers communicate.

TCP: The Text Layer

The Transmission Control Protocol (TCP) provides the handy illusion that we can “just” send text between two computers. TCP relies on lower levels and can send binary data, but ignore that for now:

  • TCP lets us Instant Message between computers

We IM with Telnet, the ‘notepad’ of networking: telnet sends and receives plain text using TCP. It’s a chat client peacefully free of ads and unsolicited buddy requests.

Let’s talk to Google using telnet (or PuTTY, a better utility):

telnet google.com 80
[connecting...]
Hello Mr. Google!

We connect to google.com on port 80 (the default for web requests) and send the message “Hello Mr. Google!”. We press Enter a few times and await the reply:

<html>
...
<h1>Bad Request</h1>
Your client has issued a malformed or illegal request
...
</html>

Malformed? Illegal? The mighty Google is not pleased. It didn’t understand us, and sent back HTML saying as much.

But we had a conversation: text went in, and text came back. In other words, we just had an instant-message exchange with a web server.

Protocols: The Forms To Fill Out

Unstructured chats are too carefree — how does the server know what we want to do? We need a protocol (standard way of communicating) if we’re going to make sense.

We use protocols all the time:

  • Putting “to” and “from” addresses in special places on an envelope
  • Filling out bank forms (special place for account number, deposit amount, etc.)
  • Saying “Roger” or “10-4” to indicate a radio request was understood

Protocols make communication clear.

Case Study: The HTTP Protocol

We see HTTP in every URL: http://google.com/. What does it mean?

  • Connect to server google.com (Using TCP, port 80 by default)
  • Ask for the resource “/” (the default resource)
  • Format the request using the Hypertext Transfer Protocol

HTTP is the “form to fill out” when asking for the resource. Using the HTTP format, the above request looks like this:

GET / HTTP/1.0

Remember, it’s just text! We’re asking for a file, through an IM session, using the format: [Command] [Resource] [Protocol Name/Version].

This command is “IM’d” to the server (your browser adds extra info, a detail for another time). Google’s server returns this response:

HTTP/1.0 200 OK
Cache-Control: private, max-age=0
Date: Sun, 15 Mar 2009 03:13:39 GMT
Expires: -1
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=5cc6…
Server: gws
Connection: Close

<html>
(Google web page, search box, and cute logo)
</html>

Yowza. The bottom part is HTML for the browser to display. But why the junk up top?

Well, suppose we just got the raw HTML to display. But what about errors: what if the server crashed, the file wasn’t there, or Google just didn’t like us?

Some metadata (data about data) is useful. When we order a book from Amazon we expect a packing slip describing the order: the intended recipient, price, return information, etc. You don’t want a naked book just thrown on your doorstep.

Protocols are similar: the recipient wants to know if everything was OK. Here we see infamous status codes like 404 (resource not found) or 200 (everything OK). These headers aren’t the real data — they’re the packing slip from the server.
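
Here’s a minimal sketch of the same conversation in Python, using nothing but a raw TCP socket. The exact reply will vary (google.com now redirects plain HTTP requests), but the shape of the exchange is the point:

    import socket

    # Open a TCP connection and "IM" an HTTP request as plain text
    with socket.create_connection(("google.com", 80)) as conn:
        conn.sendall(b"GET / HTTP/1.0\r\nHost: google.com\r\n\r\n")
        reply = b""
        while chunk := conn.recv(4096):
            reply += chunk

    # The headers (the "packing slip") come before the blank line; the HTML follows
    print(reply.decode("latin-1").split("\r\n\r\n")[0])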

Insights From Protocols

Studying existing, popular systems is a great way to understand engineering decisions. Here are a few:

Binary vs Plain Text

Binary data is more efficient than text, but more difficult to debug and generate (how comfortable are you poking around in a hex editor?). Lower-level protocols, the backbone of the internet, use binary data to maintain performance. Application-level protocols (HTTP and above) use text data for ease of interoperability. You don’t have religious wars about endian issues with HTTP.

Stateful vs. Stateless

Some protocols are stateful, which means the server remembers the chat with the client. With SMTP, for example, the client opens a connection, issues commands one at a time (such as adding recipients to an email), and then closes the connection. Stateful communication is useful in transactions that have many steps or conditions.

Stateless communication is simpler: you send the entire transaction as one request. Each “instant message” stands on its own and doesn’t need the others. HTTP is stateless: you can request a webpage without introducing yourself to the server.

Extensibility

We can’t think of everything beforehand. How do we extend old protocols for new users?

HTTP has a simple and effective “header” structure: a metadata preamble of lines that look like “Header: Value”.

If you don’t recognize the header sent (new client, old server), just ignore it. If you were expecting a header but don’t see it (old client, new server), just use a default. It’s like having an “Anything else to tell us?” section in a survey.
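
As a toy illustration (the X-Shiny-Feature header is made up, and real servers are more careful), the “ignore what you don’t know, default what’s missing” rule is little more than a dictionary lookup with a fallback:

    # Headers a hypothetical new client might send to an old server
    request_headers = {
        "Host": "example.com",
        "X-Shiny-Feature": "on",   # old server doesn't know this header: it simply skips it
    }

    # New server, old client: the header it hoped for isn't there, so fall back to a default
    encoding = request_headers.get("Accept-Encoding", "identity")
    print(encoding)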

Error Correction & Reliability

It’s the job of lower-level protocols like TCP to make sure data is transmitted reliably. But higher-level protocols (like HTTP) need to make sure it’s the right data. How are errors handled and communicated? Can the client just retry or does the server need to reset state?

HTTP comes with its own set of error codes to handle a variety of situations.

Availability

The neat thing about networking is that it works even on a single computer. Memcached is a great service to cache data. And guess what? It uses plain-old text commands (over TCP) to save and retrieve data.

You don’t need complex COM objects or DLLs – you start a Memcached server, send text in, and get text out. It’s language-neutral and easy to access because any decent OS supports networking. You can even telnet into Memcached to debug it.

Wireless routers are similar: they have a control panel available through HTTP. There’s no “router configuration program” — you just connect to it with your browser. The router serves up webpages, and when you submit data it makes the necessary configuration changes.

Protocols like HTTP are so popular you can assume the user has a client.

Layering Protocols

Protocols can be layered. We might write a resume, which is part of a larger application, which is stuffed into an envelope. Each segment has its own format, blissfully unaware of the others. Your envelope doesn’t care about the resume — it just wants the to: and from: addresses written correctly.

Many protocols rely on HTTP because it’s so widely used (rather than starting from scratch, like Memcached, which needs efficiency). HTTP has well-understood methods to define resources (URLs) and commands (GET and POST), so why not use them?

Web services do just that. The SOAP protocol crams XML inside of HTTP commands. The REST approach embraces HTTP and uses the existing verbs as much as possible.

Remember: It’s All Made Up

Networking involves human conventions. Because plain text is ubiquitous and easy to use, it is the basis for most protocols. And TCP is the simplest, most-supported way to exchange text.

Remembering that everything is a plain text IM conversation helps me wrap my head around the inevitable networking issues. And sometimes you need to jump into HTTP to understand compression and caching.

Don’t just memorize the details; see protocols as strategies to solve communication problems. Happy networking.


The Quick Guide to GUIDs

Our world is numbered. Books have ISBNs and products have barcodes. Cars have VINs, even people have social security numbers.

Numbers help us reference items unambiguously. “John Smith” may be many people, but Social Security Number 123-45-6789 refers to one person exactly.

A GUID (globally unique identifier) is a bigger, badder version of this type of ID number. You may see the term UUID tossed about (universally unique identifier), a nitpicky word for those whose numbers are unique not only within the globe, but throughout the entire universe.

Any way you title it, GUIDs or UUIDs are just big, gigantic ID numbers.

The Problem With Counting

“We don’t need no stinkin’ GUIDs,” you may be thinking between gulps of Top Ramen, “I’ll just use regular numbers and start counting up from 1.”

Sure, it sounds easy. Just start with ISBN #1 and add one for each new book. But problems arise:

  • Who does the counting? A central authority?
  • Who handles simultaneous requests and eliminates duplicates?
  • Can IDs be shared between products? Is Social Security Number 1 different from ISBN 1?
  • Can people guess what the next ID will be? How many IDs have been issued?

The problem with counting is that we want to create ID numbers without the management headache.

GUIDs to the Rescue

GUIDs are large, enormous numbers that are nearly guaranteed to be unique. They are usually 128 bits long and look like this in hexadecimal:

30dd879c-ee2f-11db-8314-0800200c9a66

The format is a well-defined sequence of 32 hex digits grouped into chunks of 8-4-4-4-12. This gives us 2^128 or about 10^38 numbers.

Here’s the thinking behind GUIDs:

  • If you pick a huge random number (39 digits long), it’s really unlikely that someone will pick the same one.

  • GUIDs are not tied to a product. A GUID can be used for people, cars, files, webpages, colors, anything. With regular registration numbers, you start counting at 1 and numbers can overlap. Social Security Number 123-45-6789 is different from ISBN 123456789 which is different from barcode 123456789. This isn’t an issue with GUIDs.

  • It’s up to the person reading the GUID to figure out the context of the GUID. There are so many GUIDs that you can use them to number everything and not run out.

GUIDs give you a unique serial number that can be used on any item in the universe.

The Great GUID Shortage

When learning about GUIDs, it feels like 39 measly digits aren’t enough. Won’t we run out if people get GUID-crazy, assigning them for everything from their pets to their favorite bubble gum flavor?

Let’s see. Think about how big the Internet is: Google has billions of web pages in its index. Let’s call it a trillion (10^12) for kicks. Think about every wikipedia article, every news item on CNN, every product in Amazon, every blog post from any author. We can assign a GUID for each of these documents.

Now let’s say everyone on Earth gets their own copy of the internet, to keep track of their stuff. Even crazier, let’s say each person gets their own copy of the internet every second. How long can we go on?

Over a billion years.

Let me say that again. Each person gets a personal copy of the internet, every second, for a billion years.

It’s a mind-boggling amount of items, and it’s hard to get our heads around it. Trust me, we won’t run out of GUIDs anytime soon. And if we do? We’ll start using GUIDs with more digits.
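
A quick back-of-the-envelope check of that claim, assuming roughly a trillion documents per “copy of the internet” and about 7 billion people:

    guids_available = 2 ** 128                 # about 3.4 * 10^38
    per_second = 10 ** 12 * 7 * 10 ** 9        # a trillion documents for each of ~7 billion people
    seconds_per_year = 60 * 60 * 24 * 365
    years = guids_available / (per_second * seconds_per_year)
    print(f"{years:.1e} years")                # on the order of 10^9: over a billion years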

Using GUIDs

There are several ways to create GUIDs (RFC 4122 describes the conventions), but you want to avoid that mess and use a library. The general types of GUIDs are:

  • Random: Just use the system’s random-number generator to create a 128-bit number.
  • Time-based: Create a GUID based on the current time.
  • Hardware-based: Make a GUID with certain portions based on hardware features, such as the MAC address of a network card. This isn’t great because the GUID isn’t “anonymous” and can be partially traced to the machine that created it.
  • Content-based (MD5 or SHA-1 hash of data): Create a GUID based on a hash of the file contents. Files with the same contents will get the same GUID. You can also seed the hash with a unique namespace (like your URL).

You can mix-and-match techniques above. If you want duplicate files to have the same GUID, then use GUIDs based on the contents. If you want GUIDs to be unique, even if the contents are the same, then create them randomly or with a combination of file contents and a random number.
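
For example, Python’s standard uuid module covers most of the flavors above. Note that its name-based variants hash a name within a namespace (a URL in this sketch) rather than raw file contents, so treat them as a stand-in for the content-based approach:

    import uuid

    print(uuid.uuid4())   # random 128-bit GUID
    print(uuid.uuid1())   # time + MAC-address based (partially traceable to this machine)

    # Name-based: the same namespace + name always produces the same GUID
    print(uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/report.pdf"))  # SHA-1
    print(uuid.uuid3(uuid.NAMESPACE_URL, "https://example.com/report.pdf"))  # MD5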

GUID Examples

Here’s a few things you can do with GUIDs:

  • Unique primary key in databases. This lets database items created on separate machines be merged later without conflict, and without the need for a central server to manage IDs.
  • Unique filename for uploaded files. If each version of the file gets its own GUID, you can set a long cache expiration time.
  • Unique name for resources (del.icio.us URL for instacalc: http://del.icio.us/url/6c5ff0ed608e75724df94a52b05dd6a8)
  • Allow vendors to create and register unique IDs without contacting a central authority (like class IDs in COM)

The Tradeoffs with GUIDs

Like most things in life, GUIDs have benefits and drawbacks. Weigh the features to see if they make sense:

Pros:

  • No central authority: You avoid the need for management, but can’t keep track of what’s been assigned. A compromise is to generate GUIDs internally and then hand them out.
  • Easily combined: You can merge sets of GUIDs from different data sources with a microscopic chance of conflict.

Cons:

  • Appear random: Users cannot easily guess the ID for an object they don’t know. This is good for security, difficult for debugging.
  • GUID overhead: GUIDs are an example of the time-space tradeoff. You save time in merging but have to use space to store the large (16-byte) GUID. It may not make sense to have a 16-byte GUID keeping track of a 4-byte item in your database.

GUIDs are not a GUARantee

There’s one giant caveat for GUIDs: collisions are still possible.

First, the birthday paradox shows that the chance of a collision grows faster than you’d expect as GUIDs are assigned. It’s still very, very unlikely that two GUIDs will collide, but as more are issued, there are fewer left to choose from.

Second, a malicious user could try hijacking GUIDs that he knows will be used (assuming the user can assign their own GUIDs), or resubmitting different content to a previous GUID (submitting file A under the hash of file B).

If you are writing software, program defensively and detect cases where the GUID already exists. Give the user an error or, even better, recover: create a new GUID on the server side and try again. GUIDs are great, but they aren’t a magic bullet.

As always, we’re never done learning.


Understanding Quake’s Fast Inverse Square Root

An article and research paper describe a fast, seemingly magical way to compute the inverse square root ($1/\sqrt{x}$), used in the game Quake.

I'm no graphics expert, but I appreciate why square roots are useful. The Pythagorean theorem computes the distance between points, and dividing by distance helps normalize vectors. (Normalizing is often just a fancy term for division.)

3D games like Quake divide by distance zillions (yes zillions) of times each second, so "minor" performance improvements help immensely. We don't want to take the square root and divide the regular way: exponentiation and division are really, really expensive for the CPU.

Given these conditions, here's the magic formula to get $1/\sqrt{x}$, as found in Quake (my comments inserted):


    float InvSqrt(float x){
        float xhalf = 0.5f * x;
        int i = *(int*)&x;            // store floating-point bits in integer
        i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
        x = *(float*)&i;              // convert new bits into float
        x = x*(1.5f - xhalf*x*x);     // One round of Newton's method
        return x;
    }

Yowza! Somehow, this code gets $1/\sqrt{x}$ using only multiplication and bit-shift operations. There's no division or exponents involved -- how does it work?

My Understanding: This incredible hack estimates the inverse root using Newton's method of approximation, and starts with a great initial guess.

To make the guess, it takes the floating-point number in scientific notation, and negates & halves the exponent to get something close to the inverse square root. It then runs a round of Newton's approximation method to further refine the estimate and tada, we've got something near the inverse square root.

Newton's Method of Approximation

Newton's method can be used to find approximate roots of any function. You can keep iterating the method to get closer and closer to the root, but this function only uses 1 step! Here's a crash-course on Newton's method (it was new to me):

Let's say you have a function f(x) and you want to find its root (aka where f(x) = 0). Let's call your original guess "g". Newton's method gives you a way to get a new, better guess for the root:

\displaystyle{\text{new guess} = g - \frac{f(g)}{f'(g)}}

You can keep repeating this process (plugging in your new guess into the formula) and get closer approximations for your root. Eventually you have a "new guess" that makes f(new guess) really, really close to zero -- it's a root! (Or close enough for government work, as they say).

In our case, we want the inverse square root. Let's say we have a number $i$ (that's all we start with, right?) and want to find the inverse square root: $1/\sqrt{i}$. If we make a guess "x" at the inverse root, the error between our original number and our guess "x" is:

\displaystyle{\text{error}(x) = \frac{1}{x^2} - i}

This is because x is roughly $1/\sqrt{i}$. If we square x we get $1/i$, and if we take the inverse we should get something close to $i$. If we subtract these two values, we can find our error.

Clearly, we want to make our error as small as possible. That means finding the "x" that makes error(x) = 0, which is the same as finding the root of the error equation. If we plug error(x) into Newton's approximation formula:

\displaystyle{\text{newguess} = g - \frac{\text{error}(g)}{\text{error}'(g)}}

and take the proper derivatives:

\displaystyle{\text{error}(g)= g^{-2} - i}

\displaystyle{\text{error}'(g)= -2g^{-3}}

we can plug them in to get the formula for a better guess:

\displaystyle{\text{newguess} = g - \frac{g^{-2} - i}{-2g^{-3}} }

\displaystyle{\text{newguess} = g - (-0.5g + 0.5ig^3) }

\displaystyle{\text{newguess} = 1.5g - 0.5ig^3}

\displaystyle{\text{newguess} = g (1.5 - 0.5ig^2)}

Which is exactly the equation you see in the code above, remembering that x is our new guess (g) and "xhalf" is half of the original value ($0.5 i$):

x = x*(1.5f - xhalf*x*x);

With this formula, we can start with a guess "g" and repeat the formula to get better guesses. Try a few iterations yourself to see how quickly they converge:

We start by guessing the square root is half the number: $\sqrt{n} \sim \frac{n}{2}$, which means $\frac{1}{\sqrt{n}} \sim \frac{2}{n}$. Running a few rounds of Newton's Method quickly converges on the real result. (Try n = 2, 4, 10, etc.)
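
Here's a rough stand-in for that demo in Python, using the same 2/n starting guess (which behaves well for n ≥ 2):

    import math

    def refine(g, n):
        return g * (1.5 - 0.5 * n * g * g)   # the Newton step derived above

    n = 10
    g = 2.0 / n                              # initial guess: 1/sqrt(n) is roughly 2/n
    for step in range(1, 6):
        g = refine(g, n)
        print(step, g)                       # converges toward 1/sqrt(10)

    print("actual:", 1 / math.sqrt(n))       # about 0.316228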

So my friends, the question becomes: "How can we make a good initial guess?"

Making a Good Guess

What's a good guess for the inverse square root? It's a bit of a trick question -- our best guess for the inverse square root is the inverse square root itself!

Ok hotshot, you ask, how do we actually get $1/\sqrt{x}$?

This is where the magic kicks in. Let's say you have a number in exponent form or scientific notation:

\displaystyle{10^6 = \text{1 million}}

Now, if you want to find the regular square root, you'd just divide the exponent by 2:

\displaystyle{\sqrt{10^6} = 10^{\frac{6}{2}} = 10^3 = 1,000}

And if you want the inverse square root, divide the exponent by -2 to flip the sign:

\displaystyle{\frac{1}{\sqrt{10^6}} = 10^{\frac{6}{-2}} = 10^{-3} = \frac{1}{1,000}}

So, how can we get the exponent of a number without other expensive operations?

Floats are stored in mantissa-exponent form

Well, we're in luck. Floating-point numbers are stored by computers in mantissa-exponent form, so it's possible to extract and divide the exponent!

But instead of explicitly doing division (expensive for the CPU), the code uses another clever hack: it shifts bits. Right-shifting by one position is the same as dividing by two (for integers, with any remainder truncated). And if you want to negate a number, instead of multiplying by -1 (multiplications are expensive), just subtract the number from 0 (subtractions are cheap).

So, the code converts the floating-point number into an integer. It then shifts the bits by one, which means the exponent bits are divided by 2 (when we eventually turn the bits back into a float). And lastly, to negate the exponent, we subtract from the magic number 0x5f3759df. This does a few things: it preserves the mantissa (the non-exponent part, aka the 5 in $5 \cdot 10^6$), handles odd-even exponents, shifts bits from the exponent into the mantissa, and all sorts of funky stuff. The paper has more details and explanation; I didn't catch all of it the first time around. As always, feel free to comment if you have a better explanation of what's happening.

The result is that we get an initial guess that is really close to the real inverse square root! We can then do a single round of Newton's method to refine the guess. More rounds are possible (at additional computational expense), but one round gives all the precision that's needed here.
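
If you want to play with the hack outside of C, here's a rough Python port. The struct calls reinterpret the float's bits as a 32-bit integer and back, mimicking the pointer casts in the original code:

    import math
    import struct

    def inv_sqrt(x):
        i = struct.unpack("<I", struct.pack("<f", x))[0]   # read the float's bits as an integer
        i = 0x5F3759DF - (i >> 1)                          # magic initial guess
        y = struct.unpack("<f", struct.pack("<I", i))[0]   # turn the new bits back into a float
        return y * (1.5 - 0.5 * x * y * y)                 # one round of Newton's method

    print(inv_sqrt(10.0))          # close to...
    print(1 / math.sqrt(10.0))     # ...0.316228, within a fraction of a percent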

So, why the magic number?

The great hack is how integers and floating-point numbers are stored. Floating-point numbers like $5.4 \cdot 10^6$ store their exponent in a separate range of bits from the "5.4". When you shift the entire number, you divide the exponent by 2 and divide the mantissa (the 5.4) by 2 as well. This is where the magic number comes in -- it makes some cool corrections for this division that I don't quite understand. However, there are several magic numbers that could be used -- this one happens to minimize the error in the mantissa.

The magic number also corrects for even/odd exponents; the paper mentions you can also find other magic numbers to use.

Resources

There's further discussion on reddit (user pb_zeppelin) and slashdot.


Swap two variables using XOR

Most people would swap two variables x and y using a temporary variable, like this:

tmp = x
x = y
y = tmp

Here’s a neat programming trick to swap two values without needing a temp:

x = x xor y
y = x xor y
x = x xor y

Don’t believe me? Try it out with any initial values for x and y.
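
For instance, here’s a quick sketch in Python, which writes XOR as ^:

    x, y = 12, 7
    x = x ^ y     # x now holds the "hybrid" of both values
    y = x ^ y     # XOR with y cancels the y part, leaving the original x
    x = x ^ y     # XOR with the new y (old x) leaves the original y
    print(x, y)   # 7 12: swapped, with no temporary variable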

It almost seems like magic: the same statement is repeated 3 times and voila, the values swap. Let’s take a closer look.

How it works

To understand this trick, break the statements into unique values:

x1 = x xor y
y1 = x1 xor y
x2 = x1 xor y1

According to our code, x2 should have y’s original value. Let’s work out the details for the last step:

x2 = x1 xor y1
x2 = x1 xor (x1 xor y)   // replace y1
x2 = (x1 xor x1) xor y   // regroup parentheses - order does not matter for XOR
x2 = 0 xor y             // a xor a => 0
x2 = y                   // 0 xor a => a; x2 now has y's original value

Wow – x2 really does equal y! The swap happened. Now let’s try it out for y1:

y1 = x1 xor y
y1 = (x xor y) xor y
y1 = x xor (y xor y)
y1 = x xor 0
y1 = x                  // y1 == x's original value

And tada the trick worked again. x2 and y1 have the swapped answers.

Intuitive Understanding

Ok, sure, the boolean algebra works out great — but that’s not satisfying. I want to understand it deep down and have it make sense, not be some artifact of the properties of XOR. Let’s take another look:

1:   x = x xor y
2:   y = x xor y
3:   x = x xor y

On line 1 we combine x and y (using XOR) to get this “hybrid” and we store it back in x. XOR is a great way to save information, because you can remove it by doing an XOR again.

So, this is exactly what we do on line 2. We XOR the hybrid with y, which cancels out all the y information, leaving us only with x. We save this result back into y, so now they have swapped.

On the last line, x still has the hybrid value. We XOR it yet again with y (now with x’s original value) to remove all traces of x out of the hybrid. This leaves us with y, and the swap is complete!

Would you really use this?

No way. This is a cool trick, but don’t write this as an actual swap function. If you have to debug it in 6 months you’ll be in for some fun. Let me show you why:

Suppose x and y are pointers or references to objects, and both point to the same location. We’d expect our swapping function to just switch the values around, with no net change, right?

Well, take a look at what happens if we expand out line 1:

x = x xor y
x = x xor x  // x and y are equal
x = 0

Wow! So x becomes 0 right at the get-go. That’s ok by itself, but because x and y are at the same location, we just made y zero as well! We’ve lost the original values, a problem known as aliasing: changing one variable has an indirect effect on another.
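
In Python the aliasing problem shows up when both “variables” are really the same list slot (xor_swap here is a hypothetical helper, written just for this demo):

    def xor_swap(a, i, j):
        a[i] = a[i] ^ a[j]
        a[j] = a[i] ^ a[j]
        a[i] = a[i] ^ a[j]

    nums = [5, 9]
    xor_swap(nums, 0, 1)
    print(nums)           # [9, 5]: fine when the slots are different

    xor_swap(nums, 0, 0)
    print(nums)           # [0, 5]: same slot, and the value is wiped out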

So, would you have caught this bug? I wouldn’t have, and it would have been a nightmare figuring out why an innocuous swap function was causing data loss. Cute tricks like these can be pretty dangerous. As Brian Kernighan said:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

This applies so well here – we wrote the code pretty much as cleverly as we could :). So, treat this as an intellectual exercise that brings up a few points:

  • XOR is a great way to combine information and later extract it. XOR-based encryption uses this technique. Also, XOR can combine N items together, not just 2.
  • There are new ways to perform even the simplest operations.

Even more hairy details

Now, how does this work on a CPU-level?

The computer actually has an implicit “temp” variable that stores intermediate results before writing them back to a register. For example, if you add 3 to a register (in machine-language pseudocode):

ADD 3 A  // add 3 to register A

The ALU (Arithmetic Logic Unit) is actually what executes the instruction 3+A. It takes the inputs (3,A) and creates a result (3 + A), which the CPU then stores back into A’s original register. So, we used the ALU as temporary scratch space before we had the final answer.

We take the ALU’s implicit temporary data for granted, but it’s always there. In a similar way, the ALU can return the intermediate result of the XOR in the case of x = x xor y, at which point the CPU stores it into x’s original register.

Because we aren’t used to thinking about the poor, neglected ALU, the XOR swap seems magical because it doesn’t have an explicit temporary variable. Some machines have a 1-step exchange XCHG instruction to swap two registers.


Number Systems and Bases

Base systems like binary and hexadecimal seem a bit strange at first. The key is understanding how different systems “tick over” like an odometer when they are full. Base 10, our decimal system, “ticks over” when it gets 10 items, creating a new digit. We wait 60 seconds before “ticking over” to a new minute. Hex and binary are similar, but tick over every 16 and 2 items, respectively.


Way back when: Unary Numbers

Way back in the day, we didn’t have base systems! It was uphill both ways, through the snow and blazing heat. When you wanted to count one, you’d write:

l

When you wanted 5, you’d write

lllll

And clearly, 1 + 5 = 6

l + lllll = llllll

This is the simplest way of counting.

Enter the Romans

In Roman numerals, two was one, twice. Three was one, thrice:

one = I
two = II
three = III

However, they decided they could do better than the old tradition of lines in the sand. For five, we could use V to represent lllll and get something like

l + V = Vl

Not bad, eh? And of course, there are many more symbols (L, C, M, etc.) you can use.

The key point is that V and lllll are two ways of encoding the number 5.

Give each number a name

Another breakthrough was realizing that each number can be its own distinct concept. Rather than represent three as a series of ones, give it its own symbol: “3”. Do this from one to nine, and you get the symbols:

1 2 3 4 5 6 7 8 9

The Romans were close, so close, but only gave unique symbols to 5, 10, 50, 100, 1000, etc.

Use your position

Now clearly, you can’t give every number its own symbol. There are simply too many.

But notice one insight about Roman numerals: they use position of symbols to indicate meaning.

IV means “subtract 1 from 5”

and VI means “add 1 to 5”.

In our number system, we use position in a similar way. We always add and never subtract. And each position is worth 10 times the one before it.

So, 35 means “add 3*10 to 5*1” and 456 means 4*100 + 5*10 + 6*1. This “positional decimal” setup is the Hindu-Arabic number system we use today.

Our choice of base 10

Why did we choose to multiply by 10 each time? Most likely because we have 10 fingers.

One point to realize is you need enough digits to “fill up” until you hit the next number. Let me demonstrate.

If we want to roll the odometer over every 10, so to speak, we need symbols for numbers one through nine; we haven’t reached ten yet. Imagine numbers as ticking slowly upward – at what point do you flip over the next unit and start from nothing?

Enter zero

And what happens when we reach ten? How do we show we want exactly one “ten” and nothing in the “ones” column?

We use zero, the number that doesn’t exist. Zero is quite a concept: it’s a placeholder, a blank, a space, and a whole lot more. Suffice it to say, zero is one of the great inventions of all time.

Zero allows us to have an empty placeholder, something the Romans didn’t have. Look how unwieldy their numbers are without it.

George Orwell’s famous novel “1984” would be “MCMLXXXIV”! Rolls right off the tongue, doesn’t it? :)

Considering other bases

Remember that we chose to roll over our odometer every ten. Our counting looks like this:

1
2
3
4
5
6
7
8
9 (uh oh, I’m getting full!)
10 (ticked over – start a new digit)

What if we ticked over at 60 when we counted, like we do for seconds and minutes?

1 second
2
3
4
5
…
58
59
1:00 (60 seconds aka 1 minute. We’ve started a new digit.)

Everything OK so far, right? Note that we use the colon (:) to indicate that we are at a new “digit”. In base 10, each digit can stand on its own.

Try Base 16

If we want base 16, we could do something similar:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 (we’re getting full)
1:00 (16 – we’ve started a new digit)

However, we don’t want to write hexadecimal numbers with the colon notation (even though we could). We’d rather cook up separate symbols for 10-15 so we can just write numbers like we’re used to. We’ve run out of digits (1-9 are already used, with 0 as a placeholder), so we need some other symbols. We could use some squiggly lines or other shapes, but the convention is to use letters, Roman style. Just like 5 became V, programmers use the letters A-F to get enough digits up to 16. That is,


1
2
3
4
5
6
7
8
9
A (10 – we’re using the symbol “A”)
B (11)
C (12)
D (13)
E (14)
F (15 – uh oh, we’re getting full)
10 (16 – we start a new digit)

Ahah! Now we can use one digit per “place”, and we know that 10 actually means we’ve “ticked over to 16” once.

20 means we’ve ticked over to 16 twice (32).

25 means we’ve ticked over to 16 twice (giving us 32) and gone an extra 5. The total is 32 + 5 = 37.

Quick review

With me so far? This is pretty cool, right? We can count in any system we want. Also notice that base 16 is more “space efficient” in the sense we can write a number like 11 in a single digit: B.

Base 16 really isn’t that different from base 10, we just take longer to fill up.

The wonderful world of binary

We’ve seen plenty of base systems, from over-simple unary, to the unwieldy Roman numerals, the steady-going base 10 and the compact base 16.

What’s great about binary? In the spirit of keeping things simple, it’s the simplest number system that has the concept of “ticking over”. Unary, where we just write 1, 11, 111…, goes on forever. Binary, with two options (1 and 0), looks like this:


1: 1
2: 10 (we’re full – tick over)
3: 11
4: 100 (we’re full again – tick over)
5: 101
6: 110
7: 111
8: 1000 (tick over again)
…

and so on.

Because binary is so simple, it’s very easy to build in hardware. You just need things that can turn on or off (representing 1 and 0), rather than things that have 10 possible states (to represent decimal).

Because it’s so simple, binary is also resistant to errors. If your signal is “partially on” (let’s say 0.4), you can assume that’s a zero. And if it’s mostly on (say 0.8), then you can assume it’s a 1. If you’re using a system with 10 possible states, it’s difficult to tell when an error occurred. This is one reason digital signals are so resilient to noise.

Other examples of bases

We use other bases all the time, even dynamically changing bases. We usually don’t think of it that way:

Hours, minutes, seconds: 1:32:04

  • We know this is 1 hour, 32 minutes, 4 seconds. In seconds, this is 1*60*60 + 32*60 + 4.

Feet and inches: 3′ 5″

  • This is 3 feet, 5 inches, or 3 * 12 + 5 inches.

Pounds and ounces: 8 lbs, 5 oz

  • Since a pound is 16 oz, this is 8 * 16 + 5 oz. We’ve been using a base 16 number system all along!

Parting thoughts

“10” in any number system indicates the base, and means we’ve ticked over once. 10 in binary means two, 10 in decimal means ten, and 10 in hexadecimal is sixteen.

How do you keep these numbers apart? Programmers will often write “0b” in front of binary numbers. So 2 in binary is

0b10

Similarly, they’ll write 0x in front of hex numbers. So 16 in hex is:

0x10

If there aren’t any symbols (0b or 0x) in front, we assume it’s base 10, a regular number.
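
Python understands the same prefixes, which makes it a handy scratchpad for converting between bases:

    print(0b10, 0x10)           # 2 16: the same "10" pattern in binary and hex
    print(bin(37), hex(37))     # 0b100101 0x25: matches the "25 means 32 + 5" example above
    print(int("25", 16))        # 37: parse a string in whatever base you name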

Now go forth and enjoy your new knowledge!


Sorting Algorithms

(Still a work in progress; I want to revisit with intuitive explanations and playing-card examples)

Sorting is a key to CS theory, but easy to forget. I had an itch to review the algorithms in Wikipedia (strange, I know), and here are my notes:

High-level thoughts

  • Some algorithms (selection, bubble, heapsort) work by moving elements to their final position, one at a time. You sort an array of size N, put 1 item in place, and continue sorting an array of size N – 1 (heapsort is slightly different).
  • Some algorithms (insertion, quicksort, counting, radix) put items into a temporary position, close(r) to their final position. You rescan, moving items closer to the final position with each iteration.
  • One technique is to start with a “sorted list” of one element, and merge unsorted items into it, one at a time.
  • Complexity and running time
    • Factors: algorithmic complexity, startup costs, additional space requirements, use of recursion (function calls are expensive and eat stack space), worst-case behavior, assumptions about input data, caching, and behavior on already-sorted or nearly-sorted data
    • Worst-case behavior is important for real-time systems that need guaranteed performance. For security, you want the guarantee that data from an attacker does not have the ability to overwhelm your machine.
    • Caching — algorithms with sequential comparisons take advantage of spatial locality and prefetching, which is good for caching.
    • Algorithmic time vs. real time — The simple algorithms may be O(N^2), but have low overhead. They can be faster for sorting small data sets (< 10 items). One compromise is to use a different sorting method depending on the input size.
    • “Comparison sorts” make no assumptions on the data and compare all elements against each other (majority of sorts). O(N lg N) time is the ideal “worst-case” scenario (if that makes sense — O(N lg N) is the smallest penalty you can hope for in the worst case). Heapsort has this behavior.
    • O(N) time is possible if we make assumptions about the data and don’t need to compare elements against each other (i.e., we know the data falls into a certain range or has some distribution). O(N) clearly is the minimum sorting time possible, since we must examine every element at least once (how can you sort an item you do not even examine?).

Notes

  • Assume we are sorting a list or array of N elements
  • Once sorted, smaller items are on the left (first item) and larger items are on the right (last item)

Bubble Sort [Best: O(n), Worst:O(N^2)]

Starting on the left, compare adjacent items and keep “bubbling” the larger one to the right (it’s in its final place). Bubble sort the remaining N -1 items.

  • Though “simple” I found bubble sort nontrivial. In general, sorts where you iterate backwards (decreasing some index) were counter-intuitive for me. With bubble-sort, either you bubble items “forward” (left-to-right) and move the endpoint backwards (decreasing), or bubble items “backward” (right-to-left) and increase the left endpoint. Either way, some index is decreasing.
  • You also need to keep track of the next-to-last endpoint, so you don’t swap with a non-existent item.
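
A short Python sketch of the version where larger items bubble right and the endpoint shrinks:

    def bubble_sort(a):
        # after each pass, the largest remaining item has bubbled into position 'end'
        for end in range(len(a) - 1, 0, -1):
            for i in range(end):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a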

Selection Sort [Best/Worst: O(N^2)]

Scan all items and find the smallest. Swap it into position as the first item. Repeat the selection sort on the remaining N-1 items.

  • I found this the most intuitive and easiest to implement — you always iterate forward (i from 0 to N-1) and swap the smallest remaining element into position i.
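
A minimal sketch in Python, iterating forward and swapping the smallest remaining item into place:

    def selection_sort(a):
        n = len(a)
        for i in range(n):
            smallest = i
            for j in range(i + 1, n):      # scan the unsorted region for the smallest item
                if a[j] < a[smallest]:
                    smallest = j
            a[i], a[smallest] = a[smallest], a[i]   # swap it into its final position
        return a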

Insertion Sort [Best: O(N), Worst:O(N^2)]

Start with a sorted list of 1 element on the left, and N-1 unsorted items on the right. Take the first unsorted item (element #2) and insert it into the sorted list, moving elements as necessary. We now have a sorted list of size 2, and N -2 unsorted elements. Repeat for all elements.

  • Like bubble sort, I found this counter-intuitive because you step “backwards”
  • This is a little like bubble sort for moving items, except when you encounter an item smaller than you, you stop. If the data is reverse-sorted, each item must travel to the head of the list, and this becomes bubble-sort.
  • There are various ways to move the item leftwards — you can do a swap on each iteration, or copy each item over its neighbor
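
A sketch of the "shift items over" variant in Python (the backwards walk is the counter-intuitive part):

    def insertion_sort(a):
        for i in range(1, len(a)):
            item = a[i]
            j = i - 1
            # walk backwards through the sorted region, shifting larger items right
            while j >= 0 and a[j] > item:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = item   # drop the item into the gap
        return a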

Quicksort [Best: O(N lg N), Avg: O(N lg N), Worst:O(N^2)]

There are many versions of Quicksort, which is one of the most popular sorting methods due to its speed (O(N lg N) average, but O(N^2) worst case). Here are a few:

Using external memory:

  • Pick a “pivot” item
  • Partition the other items by adding them to a “less than pivot” sublist, or “greater than pivot” sublist
  • The pivot goes between the two lists
  • Repeat the quicksort on the sublists, until you get to a sublist of size 1 (which is sorted).
  • Combine the lists — the entire list will be sorted

Using in-place memory:

  • Pick a pivot item and swap it with the last item. We want to partition the data as above, and need to get the pivot out of the way.
  • Scan the items from left-to-right, and swap items greater than the pivot with the last item (and decrement the “last” counter). This puts the “heavy” items at the end of the list, a little like bubble sort.
  • Even if the item previously at the end is greater than the pivot, it will get swapped again on the next iteration.
  • Continue scanning the items until the “last item” counter overlaps the item you are examining – it means everything past the “last item” counter is greater than the pivot.
  • Finally, switch the pivot into its proper place. We know the “last item” counter has an item greater than the pivot, so we swap the pivot there.
  • Phew! Now, run quicksort again on the left and right subset lists. We know the pivot is in its final place (all items to left are smaller; all items to right are larger) so we can ignore it.
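
Here's a rough Python sketch of one way to realize that in-place scheme: the pivot parks at the end, heavy items get swapped toward the back, and the pivot is finally switched into place:

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return a
        mid = (lo + hi) // 2
        a[mid], a[hi] = a[hi], a[mid]       # pick a pivot and swap it out of the way
        pivot = a[hi]
        last = hi - 1                        # everything past 'last' is greater than the pivot
        i = lo
        while i <= last:
            if a[i] > pivot:
                a[i], a[last] = a[last], a[i]   # push the heavy item toward the end
                last -= 1                        # re-examine the item we just swapped in
            else:
                i += 1
        a[last + 1], a[hi] = a[hi], a[last + 1]  # switch the pivot into its final place
        quicksort(a, lo, last)
        quicksort(a, last + 2, hi)
        return a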

Using in-place memory w/two pointers:

  • Pick a pivot and swap it out of the way
  • Going left-to-right, find an oddball item that is greater than the pivot
  • Going right-to-left, find an oddball item that is less than the pivot
  • Swap the items if found, and keep going until the pointers cross — re-insert the pivot
  • Quicksort the left and right partitions
  • Note: this algorithm gets confusing when you have to keep track of the pointers and where to swap in the pivot

Notes

  • If a bad pivot is chosen, you can imagine that the “less” subset is always empty. That means we are only creating a subset of one item smaller each time, which gives us O(N^2) behavior in the worst case.
  • If you choose the first item, it may be the smallest item in a sorted list and give worst-case behavior. You can choose a random item, or median-of-three (front, middle, end).
  • Quicksort is fast because it uses spatial locality — it walks neighboring elements, comparing them to the pivot value (which can be stored in a register). It makes very effective use of caching.
  • The pivot is often swapped to the front, so it is out of the way during the pivoting. Afterwards, it is swapped into place (with a pivot item that is less than or equal to it, so the pivot is preserved).
  • The quicksort algorithm is complicated, and you have to pass left and right boundary variables

Heapsort [Best/Avg/Worst: O(N lg N)]

Add all items into a heap. Pop the largest item from the heap and insert it at the end (final position). Repeat for all items.

  • Heapsort is just like selection sort, but with a better way to get the largest element. Instead of scanning all the items to find the max, it pulls it from a heap. Heaps have properties that allow heapsort to work in-place, without additional memory.
  • Creating the heap is O(N lg N). Popping items is O(1), and fixing the heap after each pop is O(lg N). There are N pops, so there is another O(N lg N) factor, which is O(N lg N) overall.
  • Heapsort has O(N lgN) behavior, even in the worst case, making it good for real-time applications
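
A compact in-place sketch in Python, using a max-heap built directly inside the array:

    def heapsort(a):
        def sift_down(root, end):
            # push a[root] down until the max-heap property holds in a[root..end]
            while 2 * root + 1 <= end:
                child = 2 * root + 1
                if child + 1 <= end and a[child] < a[child + 1]:
                    child += 1                    # pick the larger child
                if a[root] < a[child]:
                    a[root], a[child] = a[child], a[root]
                    root = child
                else:
                    break

        n = len(a)
        for start in range(n // 2 - 1, -1, -1):   # build the max-heap
            sift_down(start, n - 1)
        for end in range(n - 1, 0, -1):           # pop the max and place it at the end
            a[0], a[end] = a[end], a[0]
            sift_down(0, end - 1)
        return a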

Counting sort [Best/Avg/Worst: O(N)]

Assume the data are integers in a range of 0 to k. Create an array of size k to keep track of how many items have each value (3 items with value 0, 4 items with value 1, etc.). Given this count, you can tell the position of an item — all the 1’s must come after the 0’s, of which there are 3, so the 1’s start at position #4. Thus, we can scan the items and insert them into their proper positions.

  • Creating the count array is O(N)
  • Inserting items into their proper position is O(N)
  • I oversimplified here — there is a summation of the counts, and a greatest-to-least ordering which keeps the sort stable.
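
A sketch of the less-simplified version in Python, with the running totals and a reverse scan to keep the sort stable:

    def counting_sort(items, k):
        counts = [0] * (k + 1)
        for x in items:                    # count how many items have each value
            counts[x] += 1
        for v in range(1, k + 1):          # running totals: counts[v] = number of items <= v
            counts[v] += counts[v - 1]
        result = [0] * len(items)
        for x in reversed(items):          # scanning backwards keeps equal items in order
            counts[x] -= 1
            result[counts[x]] = x
        return result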

Radix sort [Best/Avg/Worst: O(N)]

Get a series of numbers, and sort them one digit at a time (moving all the 1000’s ahead of the 2000’s, etc.). Repeat the sorting on each set of digits.

  • Radix sort uses counting sort for efficient O(N) sorting of the digits (k = 0…9)
  • Actually, radix sort goes from least significant digit (1’s digit) to most significant, for reasons I’ll explain later (see CLRS book)
  • Radix & counting sort are fast, but require structured data, external memory and do not have the caching benefits of quicksort.

Actually doing the sorts

For practice, I wrote most of the sorts above in C, based on the pseudocode. Findings:

  • Even “easy” sorts like bubble sort get complicated with decrements, off-by-one errors, > vs >= as you try to avoid walking off the end of the array with a swap.
  • Mocking up the problem on paper is crucial, just like writing the code to swap items in a linked list. Don’t have it all in your head.
  • I found and fixed bugs in all of my initial sorts. Create a good test harness that makes it easy to test.
    • I separated my sorting routines into a DLL (I’m learning how to do Windows programming — it’s pretty different from Unix)
    • I created a simple command-line .exe that took a list of numbers, turned them into an array, called my sorting function, and printed the result. This type of testing was encouraged by Kernighan — the tests are easy and don’t require recompiling a hard-coded “testing” program
  • Because testing was easy, I made every test case I could think of: Pre-sorted forward, backwards, 1 element, 2 elements, even and odd items, etc.
  • For debugging, I printed the intermediate array at each stage of the sort.


Unicode and You

I’m a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode.

Unicode isn’t hard to understand, but it does cover some low-level CS concepts, like byte order. Reading about Unicode is a nice lesson in design tradeoffs and backwards compatibility.

My thoughts are below. Read them alone, or as a follow-up to Joel’s unicode article above. If you’re like me, you’ll get an itch to read about the details in the Unicode specs or in Wikipedia. Really, it can be cool, I swear.

Key concepts

Let’s level set on some ideas:

Ideas and data are different. The concept of “A” is something different than marks on paper, the sound “aaay” or the number 65 stored inside a computer.

One idea has many possible encodings. An encoding is just a method to transform an idea (like the letter “A”) into raw data (bits and bytes). The idea of “A” can be encoded many different ways. Encodings differ in efficiency and compatibility.

Know thy encoding. When reading data, you must know the encoding used in order to interpret it properly. This is a simple but important concept. If you see the number 65 in binary, what does it really mean? “A” in ASCII? Your age? Your IQ? Unless there is some context, you’d never know. Imagine if someone came up to you and said “65”. You’d have no idea what they were talking about. Now imagine they came up and said “The following number is an ASCII character: 65”. Weird, yes, but see how much clearer it is?

Embrace the philosophy that a concept and the data that stores it are different. Let it rustle around in your mind…

Got it? Let’s dive in.

Back to ASCII and Code Pages

You’ve probably heard of the ASCII/ANSI character sets. They map the numeric values 0-127 to various Western characters and control codes (newline, tab, etc.). Note that values 0-127 fit in the lower 7 bits of an 8-bit byte. ASCII does not explicitly define what values 128-255 map to.

Now, ASCII encoding works great for English text (using Western characters), but the world is a big place. What about Arabic, Chinese and Hebrew?

To solve this, computer makers defined “code pages” that used the undefined space from 128-255 in ASCII, mapping it to various characters they needed. Unfortunately, 128 additional characters aren’t enough for the entire world: code pages varied by country (Russian code page, Hebrew code page, etc.).

If people with the same code page exchanged data, all was good. Character #200 on my machine was the same as Character #200 on yours. But if codepages mixed (Russian sender, Hebrew receiver), things got strange.

The character mapped to #200 was different in Russian and Hebrew, and you can imagine the confusion that caused for things like email and birthday invitations. It’s a big IF whether someone will read your message using the same codepage you authored it in. If you visit an international website, for example, your browser could try to guess the codepage if it was not specified (“Hrm… this text has a lot of character #213 and #218… probably Hebrew”). But clearly this method was error-prone: the world needed rescuing from codepages.

Unicode to the Rescue

The world had a conundrum: they couldn’t agree on what numbers mapped to what letters in ASCII. The Unicode group went back to the basics: Letters are abstract concepts. Unicode labeled each abstract character with a “code point”. For example, “A” mapped to code point U+0041 (this code point is in hex; code point 65 in decimal).

The Unicode group did the hard work of mapping each character in every language to some code point (not without fierce debate, I am sure). When all was done, the Unicode standard left room for over 1 million code points, enough for all known languages with room to spare for undiscovered civilizations. For fun, you can browse the codepoints with the charmap utility (Start Menu > Run > Charmap) or online at Unicode.org.

This brings us to our first design decision: compatibility.

For compatibility with ASCII, code points U+0000 to U+007F (0-127) were the same as ASCII. Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had 2 codepoints. Also, this put Western characters “first”, whereas Chinese, Arabic and the “nonstandard” languages were stuck in the non-sleek codepoints that require 2 bytes to store.

However, this design was necessary – ASCII was a standard, and if Unicode was to be adopted by the Western world it needed to be compatible, without question. Now, the majority of common languages fit into the first 65535 codepoints, which can be stored as 2 bytes.

Phew. The world was a better place, and everyone agreed on what codepoint mapped to what character.

But the question remained: How do we store a codepoint as data?

Encoding to the Rescue

From above, encoding turns an idea into raw data. In this case, the idea is a codepoint.

For example, let’s look at the ASCII “encoding” scheme to store Unicode codepoints. The rules are pretty simple:

  • Code points from U+0000 to U+007F are stored in a single byte
  • Code points above U+0080 are dropped on the floor, never to be seen again

Simple, right?

As you can see, ASCII isn’t great for storing Unicode – in fact, it ignores most Unicode codepoints altogether. If you have a Unicode document and save it as ASCII -wham- all your special characters are gone. You’ll often see this as a warning in some text editors when you save Unicode data in a file originally saved as ASCII.

But the example has a purpose. An encoding is a system to convert an idea into data. In this case, the conversion can be politely called “lossy”.

I did Unicode experiments with Notepad (which can read/write Unicode) and Programmer’s Notepad, a hex editor. I wanted to see the raw bytes that Notepad was saving. To try the examples for yourself:

  • Open notepad and type “Hello”
  • Save file separately as ANSI, Unicode, Unicode Big Endian, UTF-8
  • Open file with Programmer’s Notepad and do View > View Hex

All about ASCII

Let’s write “Hello” in notepad, save as ANSI (ASCII) and open it in a hex editor. It looks like this:

Byte:     48 65 6C 6C 6F
Letter:   H  e  l  l  o

ASCII is important because many tools and communication protocols only accept ASCII characters. It’s a generally accepted minimum bar for text. Because of its universal acceptance, some Unicode encodings will transform codepoints into series of ASCII characters so they can be transmitted without issue.

Now, in the example above, we know the data is text because we authored it. If we randomly found the file, we could assume it was ASCII text given its contents, but for all we know it might be an account number or other data that happens to look like “Hello” in ASCII.

Usually, we can make a good guess about what data is supposed to be, based on certain headers or “Magic Numbers” (special character sequences) that appear in certain places. But you can never be sure, and sometimes you can guess wrong.

Don’t believe me? Ok, do the following

  • Open notepad
  • Write “this program can break”
  • Save the file as “blah.txt” (or anything else)
  • Open the file in notepad

Wow… whoa… what happened? I’ll leave this as an exercise for the reader.

UCS-2 / UTF-16

This is the encoding I first thought of when I heard “Unicode” – store every character as 2 bytes (what a waste!). At a base level, this can handle codepoints 0x0000 to 0xFFFF, or 0-65535 for you humans out there. And 65,535 should be enough characters for anybody (there are ways to store codepoints above 65535, but read the spec for more details).

Storing data in multiple bytes leads to my favorite conundrum: byte order! Some computers store the little byte first, others the big byte.

To resolve the problem, we can do the following:

  • Option 1: Choose a convention that says all text data must be big or little-endian. This won’t happen – computers on the wrong side of the decision would suffer inefficiency every time they opened a file, since they would have to convert it to the other byte order.
  • Option 2: Everyone agrees to a byte order mark (BOM), a header at the top of each file. If you open a file and the BOM is backwards, it means it was encoded in a different byte order and needs to be converted.

The solution was the BOM header: UCS-2 encodings could write codepoint U+FEFF as a file header. If you open a UCS-2 string and see FEFF, the data is in the right byte order and can be used directly. If you see FFFE, the data came from another type of machine and needs to be converted to your architecture. This involves swapping each pair of bytes in the file.

But unfortunately, things are not that simple. The BOM is actually a valid Unicode character – what if someone sent a file without a header, and that character was actually part of the file?

This is an open issue in Unicode. The suggestion is to avoid U+FEFF except for headers, and use alternative characters instead (there are equivalents).

This opens up design observation #2: Multi-byte data will have byte order issues!

ASCII never had to worry about byte order – each character was a single byte, and could not be misinterpreted. But realistically, if you see bytes 0xFEFF or 0xFFFE at the start of a file, it’s a good chance it’s a BOM in a Unicode text file. It’s probably an indication of byte order. Probably.

(Aside: UCS-2 stores data in a flat 16-bit chunk. UTF-16 allows up to 20 bits split between 2 16-bit characters, known as a surrogate pair. Each character in the surrogate pair is an invalid unicode character by itself, but together a valid one can be extracted.)

UCS-2 Example

Type “Hello” in notepad and save it as Unicode (little-endian UCS-2 is the native format on Windows):

Hello-little-endian:

FF FE  4800 6500 6C00 6C00 6F00
header H    e    l    l    o

Save it again as Unicode Big Endian, and you get:

Hello-big-endian:

FE FF  0048 0065 006C 006C 006F
header H    e    l    l    o

Observations

  • The header BOM (U+FEFF) shows up as expected: FF FE for little-endian, FE FF for big
  • Letters use 2 bytes no matter what: “H” is 0x48 in ASCII, and 0x0048 in UCS-2
  • Encoding is simple. Take the codepoint in hex and write it out in 2 bytes. No extra processing is required.
  • The encoding is too simple. It wastes space for plain ASCII text that does not use the high-order byte. And ASCII text is very common.
  • The encoding inserts null bytes (0x00) which can be a problem. Old-school ASCII programs may think the Unicode string has ended when it gets to the null byte. On a little-endian machine, reading one byte at a time, you’d get to H (H = 0x4800) and then hit the null and stop. On a big endian machine, you’d hit the null first (H is 0x0048) and not even see the H in ASCII. Not good.
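
You can reproduce those hex dumps in Python; its utf-16 codecs match what Notepad writes (the plain "utf-16" codec adds a BOM in the machine's native byte order, little-endian on most PCs):

    print("Hello".encode("utf-16-le").hex())   # 480065006c006c006f00 (little-endian, no BOM)
    print("Hello".encode("utf-16-be").hex())   # 00480065006c006c006f (big-endian, no BOM)
    print("Hello".encode("utf-16").hex())      # fffe480065006c006c006f00 (BOM first, on a little-endian machine)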

Design observation #3: Consider backwards compatibility. How will an old program read new data? Ignoring new data is good. Breaking on new data is bad.

UTF-8

UCS-2 / UTF-16 is nice and simple, but boy does it waste some bits. Not only does it double the size of ASCII text, but the converted ASCII might not even be readable due to the null characters.

Enter UTF-8. Its goal is to encode Unicode characters in a single byte where possible (ASCII), and not break ASCII applications by having null characters. It is the default encoding for XML.

Read the UTF-8 specs for more detail, but at a high level:

  • Code points 0 – 007F are stored as regular, single-byte ASCII.
  • Code points 0080 and above are converted to binary and stored (encoded) in a series of bytes.
  • The first “count” byte indicates the number of bytes for the codepoint, including the count byte. These bytes start with a run of 1’s (one per byte in the sequence) followed by a 0:

    110xxxxx (The leading “11” indicates 2 bytes in the sequence, including the “count” byte)

    1110xxxx (1110 -> 3 bytes in sequence)

    11110xxx (11110 -> 4 bytes in sequence)

  • Bytes starting with 10… are “data” bytes and contain information for the codepoint. A 2-byte example looks like this

    110xxxxx 10xxxxxx

This means there are 2 bytes in the sequence. The x’s hold the binary value of the codepoint, which has to be squeezed into the remaining bits.
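To make the bit-packing concrete, here’s a small C sketch (my own, with a made-up helper name) that encodes a single codepoint into UTF-8:

#include <stdio.h>
#include <stdint.h>

/* Sketch: encode one codepoint (up to U+10FFFF) into UTF-8.
   Returns the number of bytes written to out[]. */
int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {        /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {       /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                         /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x00A3, buf);   /* U+00A3, the British pound sign */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);        /* prints: C2 A3 */
    printf("\n");
    return 0;
}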

Observations about UTF-8

  • No null bytes. All ASCII characters (0-127) are the same. Non-ASCII characters all start with “1” as the highest bit.
  • ASCII text is stored identically and efficiently.
  • Unicode characters start with “1” as the high bit, and can be ignored by ASCII-only programs (however, they may be discarded in some cases! See UTF-7 for more details).
  • There is a time-space tradeoff. There is processing to be done on every Unicode character, but this is a reasonable tradeoff.

Design principle #4

  • UTF-8 addresses the 80% case (ASCII) well, while making the other cases possible (the rest of Unicode). UCS-2 addresses all cases equally, but is inefficient in the common 80% case. On the other hand, UCS-2 is less processing-intensive than UTF-8, which requires bit manipulation on every non-ASCII character.
  • Why does XML store data in UTF-8 instead of UCS-2? Is space or processing power more important when reading XML documents?
  • Why does Windows XP store strings as UCS-2 natively? Is space or processing power more important for the OS internals?

In any case, UTF-8 can still carry a header to indicate how the text was encoded. Otherwise, it could be interpreted as straight ASCII with some codepage to handle values above 127. It still uses the U+FEFF codepoint as a BOM, but the BOM itself is encoded in UTF-8 (clever, eh?).

UTF-8 Example

Hello-UTF-8:

EF BB BF 48 65 6C 6C 6F
header   H  e  l  l  o

Again, the ASCII text is not changed in UTF-8. Feel free to use charmap to copy in some Unicode characters and see how they are stored in UTF-8. Or, you can experiment online.

UTF-7

While UTF-8 is great for ASCII, it still stores Unicode data as non-ASCII bytes with the high bit set. Some email protocols do not allow non-ASCII values, so UTF-8 data would not be sent properly. Systems that can handle data with anything in the high bit are “8-bit clean”; systems that require data to have values 0-127 (like SMTP) are not. So how do we send Unicode data through them?

Enter UTF-7. The goal is to encode Unicode data in 7 bits (0-127), which is compatible with ASCII. UTF-7 works like this:

  • Codepoints in the ASCII range are stored as ASCII, except for certain symbols (+, -) that have special meaning
  • Codepoints above ASCII are converted to binary, and stored in base64 encoding (stores binary information in ASCII)

How do you know which ASCII letters are real ASCII, and which are base64 encoded? Easy. ASCII characters between the special symbols “+” and “-“ are considered base64 encoded.

“-” acts like an escape suffix character. If it follows a character, that item is interpreted literally. So, “+-“ is interpreted as “+” without any special encoding. This is how you store an actual “+” symbol in UTF-7.

UTF-7 Example

Wikipedia has some UTF-7 examples, as Notepad can’t save as UTF-7.

“Hello” is the same as ASCII — we are using all ASCII characters and no special symbols:

Byte:     48 65 6C 6C 6F
Letter:   H  e  l  l  o

“£1” (1 British pound) becomes:

+AKM-1

The sequence “+AKM-” means AKM should be decoded as base64 and converted to a codepoint, which maps to 0x00A3, the British pound symbol. The “1” is kept the same, since it is an ASCII character.
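You can work the decoding out by hand (a sketch of my own, not from the original): in the base64 alphabet, A = 0, K = 10 and M = 12, which gives the 18 bits 000000 001010 001100; the first 16 bits are 0x00A3 and the last 2 are padding.

#include <stdio.h>

/* Sketch: decode the UTF-7 run "AKM" by hand.
   In the base64 alphabet, 'A' = 0, 'K' = 10, 'M' = 12. */
int main(void) {
    unsigned int bits = (0u << 12) | (10u << 6) | 12u;  /* 18 bits of payload */
    unsigned int codepoint = bits >> 2;                 /* keep the first 16 bits, drop 2 padding bits */
    printf("U+%04X\n", codepoint);                      /* prints: U+00A3 */
    return 0;
}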

UTF-7 is pretty clever, eh? It’s essentially a Unicode-to-ASCII conversion that avoids producing any bytes with the high bit set. Most ASCII characters look the same, except for the special characters (- and +) that need to be escaped.

Wrapping it up – what I’ve learned

I’m still a newbie but have learned a few things about Unicode:

  • Unicode does not mean 2 bytes. Unicode defines code points that can be stored in many different ways (UCS-2, UTF-8, UTF-7, etc.). Encodings vary in simplicity and efficiency.
  • Unicode has more than 65,535 (16 bits) worth of characters. Encodings can specify more characters, but the first 65535 cover most of the common languages.
  • You need to know the encoding to correctly read a file. You can often guess that a file is Unicode based on the Byte Order Mark (BOM), but confusion can still arise unless you know the exact encoding. Even text that looks like ASCII could actually be encoded with UTF-7; you just don’t know.

Unicode is an interesting study. It opened my eyes to design tradeoffs, and the importance of separating the core idea from the encoding used to save it.


A little diddy about binary file formats

Understanding the nature of file formats and escape characters has been an itch of mine. I recently found a few useful explanations that inspired me to write my understanding of binary files.

How computers represent data

Everything is bits and bytes, 1’s and 0’s to the computer. Humans understand text, so we have programs that convert a series of 1’s and 0’s into something we can understand.

In the ASCII character scheme, a single byte (a sequence of eight 1’s or 0’s, or a number from 0-255) can be converted into a character. For example, the character ‘A’ is the number 65 in decimal, 41 in hex, or 01000001 in binary. ‘B’ is the number 66 in decimal, and so on (see a full chart).

Don’t believe me? Mini-example time.

Create a file in notepad with the single letter “A” (any filename will do — “sample.txt”).

Save the file, right-click and look at the properties — it should be 1 byte: notepad stores characters in ASCII, with one byte per character. The “size on disk” may be larger because the computer allocates space in fixed blocks (of 4 kilobytes, for example).


Find a hex editor (here’s a free one) and open the file you just saved. (On Linux/Unix, use “od -x sample.txt”).

You’ll see only the single number “41” in hexadecimal (65 in decimal), and the hex editor may show the character “A” on a side screen (the ASCII representation of the byte you are examining). The “0” on the left is the address of the byte — programmers love counting from zero.


The hex editor also shows the ASCII interpretation of each byte, which happens to be real text in our case. If you open a non-ASCII file, the data inside will still be displayed as ASCII characters, though it may not always make sense.

Try opening a random .exe to see what ASCII strings are embedded inside — you can usually find a few in the beginning portions of the file. All DOS executables start with the header “MZ”, the initials of the programmer who came up with the file format.


Cool, eh? These headers or “magic numbers” are one way for a program to determine what type of file it’s seeing. If you open a PNG image you’ll see the PNG header, which includes the ASCII letters “PNG”.
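Here’s a small C sketch of a magic-number check (my own illustration; “sample.bin” is just a placeholder filename):

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char header[8];
    const unsigned char png_magic[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    FILE *f = fopen("sample.bin", "rb");
    if (!f) return 1;
    size_t n = fread(header, 1, sizeof header, f);
    fclose(f);

    if (n >= 2 && header[0] == 'M' && header[1] == 'Z')
        printf("Looks like a DOS/Windows executable\n");
    else if (n >= 8 && memcmp(header, png_magic, sizeof png_magic) == 0)
        printf("Looks like a PNG image\n");
    else
        printf("Unknown format\n");
    return 0;
}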

What’s going on?

Inside the memory of the computer, only ’65’ (41 in hex or 01000001 in binary) is stored in sample.txt. Given the context of the information (i.e., notepad is expecting a text file) the computer knows to display the ASCII character ‘A’ on the screen.

Now consider how a human would store the actual numeric value of 65 if you told them to write it down. As humans, we would write it as two characters, a ‘6’ and then a ‘5’, which takes 2 ASCII characters or 2 bytes (again, the “letter” 6 can be stored in ASCII).

A computer would store the number “65” as 65 in binary, the same as ‘A’. Except this time, software would know that the ’65’ was not the code for a letter; it was actually the number itself.

Now, suppose we wanted to store the number 4,000,000,000 (4 billion). As humans, we would write it as 4000000000, or 10 ASCII characters (10 bytes). How would a computer do it?

A single byte has 8 bits, or 2^8 (256) possible values. Four bytes give us 32 bits, or 2^32 (roughly 4 billion) possible values. So, we could store the number 4 billion in only 4 bytes.

As you can see, storing numeric data in the computer’s format saves space. It also saves computational effort — the computer does not have to convert a number between binary and ASCII.
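Here’s a tiny C sketch of the difference (my own illustration): the same number stored as a 4-byte integer versus as ASCII text.

#include <stdio.h>
#include <string.h>

/* Sketch: compare storing 4,000,000,000 as binary vs. as ASCII text. */
int main(void) {
    unsigned int value = 4000000000u;          /* binary: 4 bytes on typical machines */
    char text[16];
    snprintf(text, sizeof text, "%u", value);  /* ASCII: one byte per digit */

    printf("binary size: %zu bytes\n", sizeof value);  /* 4 */
    printf("text size:   %zu bytes\n", strlen(text));  /* 10 */
    return 0;
}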

So, why not use binary formats?

If binary formats are more efficient, why not use them all the time?

  • Binary files are difficult for humans to read. When a person sees a sequence of 4 bytes, he has no idea what it means (it could be a 4-letter word stored in ASCII). If he sees the 10 ASCII characters 4000000000, he knows it is a number.
  • Binary files are difficult to edit. In the same manner, if a person wants to change 4 billion to 2 billion, he needs to know the binary representation. With the ASCII representation, he can simply put in a “2” instead of the “4”.
  • Binary files are difficult to manipulate. The UNIX tradition has several simple, elegant tools to manipulate text. By storing files in the standard text format, you get the power of these tools without having to create special editors to modify your binary file.
  • Binary files can get confusing. Problems happen when computers have different ways of reading data. There’s something called the “NUXI” or byte-order problem, which happens when two computers with different architectures (PowerPC Macs and x86 PCs, for example) try to transfer binary data. Regular text stored in single bytes is unambiguous, but be careful with Unicode.
  • The efficiency gain usually isn’t tremendous. Representing numbers in binary can save you roughly a factor of 2-3 (a 4-byte number can replace 10 bytes of ASCII text). However, this assumes the numbers you are representing are large (a 3-digit number like 999 is better represented in ASCII than as a 4-byte number). Lastly, ASCII only uses 7 bits per byte, so you can theoretically pack ASCII together for a 1/8 (12.5%) gain, but storing text this way is typically not worth the hassle.

One reason binary files are efficient is because they can use all 8 bits in a byte, while most text is constrained to certain fixed patterns, leaving unused space. However, by compressing your text data you can reduce the amount of space used and make text more efficient.

Marshalling and Unmarshalling Data

Aside: Marshalling always makes me think of Sheriff Marshals and thus cowboys. Cowboys have nothing to do with the CS meaning of “marshal”.

Sometimes computers have complex internal data structures, with chains of linked items that need to be stored in a file. Marshalling is the process of taking the internal data of a program and saving it to a flat, linear file. Unmarshalling is the process of reading that linear data and recreating the complex internal data structure the computer originally had.

Notepad has it easy – it just needs to store the raw text so no marshalling is needed. Microsoft Word, however, must store the text along with other document information (page margins, font sizes, embedded images, styles, etc.) in a single, linear file. Later, it must read that file and recreate the original setup the user had.

You can marshal data into a binary or text format — the word “marshal” does not indicate how the data is stored.
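As a sketch (my own, with made-up field names), marshalling a tiny settings struct into a flat text line and back might look like this:

#include <stdio.h>

/* Hypothetical document settings to save and restore. */
struct settings {
    unsigned int margin;     /* page margin, in points */
    unsigned int font_size;
};

int main(void) {
    struct settings in = {72, 12}, out = {0, 0};
    char line[64];

    /* Marshal: flatten the in-memory struct to one line of text */
    snprintf(line, sizeof line, "margin=%u font_size=%u", in.margin, in.font_size);

    /* Unmarshal: rebuild the struct from the flat text */
    sscanf(line, "margin=%u font_size=%u", &out.margin, &out.font_size);

    printf("%s -> margin=%u, font_size=%u\n", line, out.margin, out.font_size);
    return 0;
}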

So when are binary file formats useful?

There are situations where you may want to use binary file formats. PNG images use a binary format because efficiency is important in creating small image files. However, PNG does binary formats right: it specifies byte orders and word lengths to avoid the NUXI problem.

There are often business reasons to use binary formats. The main reason is that they are more difficult to reverse engineer (humans have to guess how the computer is storing its data), which can help maintain a competitive advantage.


Understanding Big and Little Endian Byte Order

Problems with byte order are frustrating, and I want to spare you the grief I experienced. Here's the key:

  • Problem: Computers speak different languages, like people. Some write data "left-to-right" and others "right-to-left".
    • A machine can read its own data just fine - problems happen when one computer stores data and a different type tries to read it.
  • Solutions
    • Agree to a common format (i.e., all network traffic follows a single format), or
    • Always include a header that describes the format of the data. If the header appears backwards, it means data was stored in the other format and needs to be converted.

Numbers vs. Data

The most important concept is to recognize the difference between a number and the data that represents it.

A number is an abstract concept, such as a count of something. You have ten fingers. The idea of "ten" doesn't change, no matter what representation you use: ten, 10, diez (Spanish), ju (Japanese), 1010 (binary), X (Roman numeral)... these representations all point to the same concept of "ten".

Contrast this with data. Data is a physical concept, a raw sequence of bits and bytes stored on a computer. Data has no inherent meaning and must be interpreted by whoever is reading it.

Data is like human writing, which is simply marks on paper. There is no inherent meaning in these marks. If we see a line and a circle (like this: |O) we may interpret it to mean "ten".

But we assumed the marks referred to a number. They could have been the letters "IO", a moon of Jupiter. Or perhaps the Greek goddess. Or maybe an abbreviation for Input/Output. Or someone's initials. Or the number 2 in binary ("10"). The list of possibilities goes on.

The point is that a single piece of data (|O) can be interpreted in many ways, and the meaning is unclear until someone clarifies the intent of the author.

Computers face the same problem. They store data, not abstract concepts, and do so using a sequence of 1's and 0's. Later, they read back the 1's and 0's and try to recreate the abstract concept from the raw data. Depending on the assumptions made, the 1's and 0's can mean very different things.

Why does this problem happen? Well, there's no rule that all computers must use the same language, just like there's no rule all humans need to. Each type of computer is internally consistent (it can read back its own data), but there are no guarantees about how another type of computer will interpret the data it created.

Basic concepts

  • Data (bits and bytes, or marks on paper) is meaningless; it must be interpreted to create an abstract concept, like a number.
  • Like humans, computers have different ways to store the same abstract concept. (i.e., we have many ways to say "ten": ten, 10, diez, etc.)

Storing Numbers as Data

Thankfully, most computers agree on a few basic data formats (this was not always the case). This gives us a common starting point which makes our lives a bit easier:

  • A bit has two values (on or off, 1 or 0)
  • A byte is a sequence of 8 bits
    • The "leftmost" bit in a byte is the biggest. So, the binary sequence 00001001 is the decimal number 9. 00001001 = (23 + 20 = 8 + 1 = 9).
    • Bits are numbered from right-to-left. Bit 0 is the rightmost and the smallest; bit 7 is leftmost and largest.

We can use these basic agreements as a building block to exchange data. If we store and read data one byte at a time, it will work on any computer. The concept of a byte is the same on all machines, and the idea of which byte is first, second, third (Byte 0, Byte 1, Byte 2...) is the same on all machines.

If computers agree on the order of every byte, what's the problem?

Well, this is fine for single-byte data, like ASCII text. However, a lot of data needs to be stored using multiple bytes, like integers or floating-point numbers. And there is no agreement on how these sequences should be stored.

Byte Example

Consider a sequence of 4 bytes, named W X Y and Z - I avoided naming them A B C D because they are hex digits, which would be confusing. So, each byte has a value and is made up of 8 bits.

 Byte Name:    W       X       Y       Z
 Location:     0       1       2       3
 Value (hex):  0x12    0x34    0x56    0x78

For example, W is an entire byte, 0x12 in hex or 00010010 in binary. If W were to be interpreted as a number, it would be "18" in decimal (by the way, there's nothing saying we have to interpret it as a number - it could be an ASCII character or something else entirely).

With me so far? We have 4 bytes, W X Y and Z, each with a different value.

Understanding Pointers

Pointers are a key part of programming, especially the C programming language. A pointer is a number that references a memory location. It is up to us (the programmer) to interpret the data at that location.

In C, when you cast a pointer to a certain type (such as a char * or int *), it tells the computer how to interpret the data at that location. For example, let's declare

void *p = 0; // p is a pointer to an unknown data type
             // p is a NULL pointer -- do not dereference
char *c;     // c is a pointer to a char, usually a single byte

Note that we can't get the data from p because we don't know its type. p could be pointing at a single number, a letter, the start of a string, your horoscope, an image -- we just don't know how many bytes to read, or how to interpret what's there.

Now, suppose we write

c = (char *)p;

Ah -- now this statement tells the computer to point to the same place as p, and interpret the data as a single character (char is typically a single byte, use uint8_t if not true on your machine). In this case, c would point to memory location 0, or byte W. If we printed c, we'd get the value in W, which is hex 0x12 (remember that W is a whole byte).

This example does not depend on the type of computer we have -- again, all computers agree on what a single byte is (in the past this was not the case).

The example is helpful, even though it is the same on all computers -- if we have a pointer to a single byte (char *, a single byte), we can walk through memory, reading off a byte at a time. We can examine any memory location and the endian-ness of a computer won't matter -- every computer will give back the same information.

So, What's The Problem?

Problems happen when computers try to read multiple bytes. Some data types contain multiple bytes, like long integers or floating-point numbers. A single byte has only 256 values, so it can store 0 - 255.

Now problems start - when you read multi-byte data, where does the biggest byte appear?

  • Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte (lowest address) is the biggest.
  • Little endian machine: Stores data little-end first. When looking at multiple bytes, the first byte is smallest.

The naming makes sense, eh? Big-endian thinks the big-end is first. (By the way, the big-endian / little-endian naming comes from Gulliver's Travels, where the Lilliputians argue over whether to break eggs on the little-end or big-end. Sometimes computer debates are just as meaningful :-))

Again, endian-ness does not matter if you have a single byte. If you have one byte, it's the only data you read so there's only one way to interpret it (again, because computers agree on what a byte is).

Now suppose we have our 4 bytes (W X Y Z) stored the same way on a big-endian and a little-endian machine. That is, memory location 0 is W on both machines, memory location 1 is X, etc.

We can create this arrangement by remembering that bytes are machine-independent. We can walk memory, one byte at a time, and set the values we need. This will work on any machine:

c = 0;     // point to location 0 (won't work on a real machine!)
*c = 0x12; // Set W's value
c = 1;     // point to location 1
*c = 0x34; // Set X's value
...        // repeat for Y and Z; details left to reader

This code will work on any machine, and we have both set up with bytes W, X, Y and Z in locations 0, 1, 2 and 3.

Interpreting Data

Now let's do an example with multi-byte data (finally!). Quick review: a "short int" is a 2-byte (16-bit) number, which can range from 0 - 65535 (if unsigned). Let's use it in an example:

short *s; // pointer to a short int (2 bytes)
s = 0;    // point to location 0; *s is the value

So, s is a pointer to a short, and is now looking at byte location 0 (which has W). What happens when we read the value at s?

  • Big endian machine: I think a short is two bytes, so I'll read them off: location s is address 0 (W, or 0x12) and location s + 1 is address 1 (X, or 0x34). Since the first byte is biggest (I'm big-endian!), the number must be 256 * byte 0 + byte 1, or 256*W + X, or 0x1234. I multiplied the first byte by 256 (2^8) because I needed to shift it over 8 bits.

  • Little endian machine: I don't know what Mr. Big Endian is smoking. Yeah, I agree a short is 2 bytes, and I'll read them off just like him: location s is 0x12, and location s + 1 is 0x34. But in my world, the first byte is the littlest! The value of the short is byte 0 + 256 * byte 1, or 256*X + W, or 0x3412.

Keep in mind that both machines start from location s and read memory going upwards. There is no confusion about what location 0 and location 1 mean. There is no confusion that a short is 2 bytes.

But do you see the problem? The big-endian machine thinks s = 0x1234 and the little-endian machine thinks s = 0x3412. The same exact data gives two different numbers. Probably not a good thing.

Yet another example

Let's do another example with a 4-byte integer, for "fun":

int *i; // pointer to an int (4 bytes on 32-bit machine)
i = 0;  // points to location zero, so *i is the value there

Again we ask: what is the value at i?

  • Big endian machine: An int is 4 bytes, and the first is the largest. I read 4 bytes (W X Y Z) and W is the largest. The number is 0x12345678.
  • Little endian machine: Sure, an int is 4 bytes, but the first is smallest. I also read W X Y Z, but W belongs way in the back -- it's the littlest. The number is 0x78563412.

Same data, different results - not a good thing.
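You can watch this happen on your own machine. Here's a small runnable sketch (my own, not from the original article) that uses the char * trick to peek at the bytes of an int:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t i = 0x12345678;                 /* our W X Y Z bytes as one int */
    unsigned char *c = (unsigned char *)&i;  /* walk its memory byte by byte */

    printf("byte 0: %02X\n", c[0]);          /* 12 on big-endian, 78 on little-endian */
    if (c[0] == 0x12)
        printf("This machine is big-endian\n");
    else
        printf("This machine is little-endian\n");
    return 0;
}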

The NUXI Problem

Issues with byte order are sometimes called the NUXI problem: UNIX stored on a big-endian machine can show up as NUXI on a little-endian one.

Suppose we want to store 4 bytes (U, N, I and X) as two shorts: UN and IX. Each letter is an entire byte, like our WXYZ example above. To store the two shorts we would write:

short *s; // pointer to set shorts
s = 0;    // point to location 0
*s = UN;  // store first short: U * 256 + N (fictional code)
s = 2;    // point to next location
*s = IX;  // store second short: I * 256 + X

This code is not specific to a machine. If we store "UN" on a machine and ask to read it back, it had better be "UN"! I don't care about endian issues, if we store a value on one machine and read it back on the same machine, it must be the same value.

However, if we look at memory one byte at a time (using our char * trick), the order could vary. On a big endian machine we see:

Byte:     U N I X
Location: 0 1 2 3

Which makes sense. U is the biggest byte in "UN" and is stored first. The same goes for IX: I is the biggest, and stored first.

On a little-endian machine we would see:

Byte:     N U X I
Location: 0 1 2 3

And this makes sense also. "N" is the littlest byte in "UN" and is stored first. Again, even though the bytes are stored "backwards" in memory, the little-endian machine knows it is little-endian, and interprets them correctly when reading the values back. Also, note that we can write hex constants such as x = 0x1234 on any machine. Even a little-endian machine knows what you mean by 0x1234 and won't force you to swap the values yourself; you specify the number, and the machine figures out the details and swaps the bytes in memory, under the covers. Tricky.

This scenario is called the "NUXI" problem because byte sequence UNIX is interpreted as NUXI on the other type of machine. Again, this is only a problem if you exchange data -- each machine is internally consistent.

Exchanging Data Between Endian Machines

Computers are connected - gone are the days when a machine only had to worry about reading its own data. Big and little-endian machines need to talk and get along. How do they do this?

Solution 1: Use a Common Format

The easiest approach is to agree to a common format for sending data over the network. The standard network order is actually big-endian, but some people get uppity that little-endian didn't win... we'll just call it "network order".

To convert data to network order, machines call a function hton (host-to-network). On a big-endian machine this won't actually do anything, but we won't talk about that here (the little-endians might get mad).

But it is important to use hton before sending data, even if you are big-endian. Your program may be so popular it is compiled on different machines, and you want your code to be portable (don't you?).

Similarly, there is a function ntoh (network to host) used to read data off the network. You need this to make sure you are correctly interpreting the network data into the host's format. You need to know the type of data you are receiving to decode it properly, and the conversion functions are:

 htons() - "Host to Network Short"
 htonl() - "Host to Network Long"
 ntohs() - "Network to Host Short"
 ntohl() - "Network to Host Long"

Remember that a single byte is a single byte, and order does not matter.

These functions are critical when doing low-level networking, such as verifying the checksums in IP packets. If you don't understand endian issues correctly your life will be painful - take my word on this one. Use the translation functions, and know why they are needed.
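Here's a minimal sketch of the conversion functions in use (standard POSIX calls; the values are made up):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htons, htonl, ntohs, ntohl on POSIX systems */

int main(void) {
    uint16_t port = 80;
    uint32_t value = 0x12345678;

    uint16_t wire_port  = htons(port);   /* what actually goes on the wire */
    uint32_t wire_value = htonl(value);

    /* The receiver converts back to its own byte order: */
    printf("port round-trip:  %u\n", (unsigned)ntohs(wire_port));        /* 80 again */
    printf("value round-trip: 0x%08X\n", (unsigned)ntohl(wire_value));   /* 0x12345678 */
    return 0;
}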

Solution 2: Use a Byte Order Mark (BOM)

The other approach is to include a magic number, such as 0xFEFF, before every piece of data. If you read the magic number and it is 0xFEFF, it means the data is in the same format as your machine, and all is well.

If you read the magic number and it is 0xFFFE (it is backwards), it means the data was written in a format different from your own. You'll have to translate it.

A few points to note. First, the number isn't really magic, but programmers often use the term to describe the choice of an arbitrary number (the BOM could have been any sequence of different bytes). It's called a byte-order mark because it indicates the byte order the data was stored in.

Second, the BOM adds overhead to all data that is transmitted. Even if you are only sending 2 bytes of data, you need to include a 2-byte BOM. Ouch!

Unicode uses a BOM when storing multi-byte data (some Unicode character encodings use 2, 3 or even 4 bytes per character). XML avoids this mess by storing data in UTF-8 by default, which stores Unicode information one byte at a time. And why is this cool?

(Repeated for the 56th time) "Because endian issues don't matter for single bytes".

Right you are.

Again, other problems can arise with BOM. What if you forget to include the BOM? Do you assume the data was sent in the same format as your own? Do you read the data and see if it looks "backwards" (whatever that means) and try to translate it? What if regular data includes the BOM by coincidence? These situations are not fun.
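As a sketch of the BOM approach (my own illustration, with made-up data): read the mark, and if it looks backwards, swap each value before using it.

#include <stdio.h>
#include <stdint.h>

static uint16_t swap16(uint16_t x) {
    return (uint16_t)((x >> 8) | (x << 8));
}

int main(void) {
    /* Pretend these 16-bit values arrived from another machine:
       a BOM, followed by two data values. */
    uint16_t data[] = {0xFFFE, 0x3412, 0x7856};  /* the BOM reads backwards to us */
    int needs_swap = (data[0] == 0xFFFE);        /* 0xFEFF would mean "already our order" */

    for (int i = 1; i < 3; i++) {
        uint16_t v = needs_swap ? swap16(data[i]) : data[i];
        printf("value %d: 0x%04X\n", i, (unsigned)v);   /* prints 0x1234 and 0x5678 */
    }
    return 0;
}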

Why Are There Endian Issues at All? Can't We Just Get Along?

Ah, what a philosophical question.

Each byte-order system has its advantages. Little-endian machines let you read the lowest byte first, without reading the others. You can check whether a number is odd or even (the last bit is 0 for even) very easily, which is cool if you're into that kind of thing. Big-endian systems store data in memory the same way we humans think about data (left-to-right), which makes low-level debugging easier.

But why didn't everyone just agree to one system? Why do certain computers have to try and be different?

Let me answer a question with a question: Why doesn't everyone speak the same language? Why are some languages written left-to-right, and others right-to-left?

Sometimes communication systems develop independently, and later need to interact.

Epilogue: Parting Thoughts

Endian issues are an example of the general encoding problem - data needs to represent an abstract concept, and later the concept needs to be recreated from the data. This topic deserves its own article (or series), but you should now have a better understanding of endian issues.
