- BetterExplained - https://betterexplained.com -

A little diddy about binary file formats

Understanding the nature of file formats and escape characters has been an itch of mine. I recently found a few useful explanations that inspired me to write my understanding of binary files.

How computers represent data

Everything is bits and bytes, 1’s and 0’s to the computer. Humans understand text, so we have programs that convert a series of 1’s and 0’s into something we can understand.

In the ASCII character scheme, a single byte (a sequence of eight 1’s or 0’s, or a number from 0-255) can be converted into a character. For example, the character ‘A’ is the number 65 in decimal, 41 in hex, or 01000001 in binary. ‘B’ is the number 66 in decimal, and so on (see a full chart).
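If you have Python handy, you can check the chart yourself. This is just a quick sketch using built-in functions; the variable name is my own.

```python
# The letter 'A' and the number 65 are the same byte, seen three ways.
code = ord("A")              # character -> number

print(code)                  # 65
print(format(code, "x"))     # 41 (hex)
print(format(code, "08b"))   # 01000001 (binary, padded to 8 bits)
print(chr(66))               # B (number -> character)
```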

Don’t believe me? Mini-example time.

Create a file in notepad with the single letter “A” (any filename will do — “sample.txt”).

Save the file, right-click it, and look at the properties. It should be 1 byte: notepad stores characters in ASCII, with one byte per character. The “size on disk” may be larger because the computer allocates space in fixed blocks (of 4 kilobytes, for example).

notepad_A.png notepad_A_size.png

Find a hex editor (here’s a free one) and open the file you just saved. (On Linux/Unix, use “od -t x1 sample.txt”, which prints one byte at a time; plain “od -x” groups the bytes into 2-byte words.)

You’ll see only the single number “41” in hexadecimal (65 in decimal), and the hex editor may show the character “A” on a side screen (the ASCII representation of the byte you are examining). The “0” on the left is the address of the byte — programmers love counting from zero.
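No hex editor nearby? Here is a rough stand-in, assuming Python is installed. It writes the one-letter file to a temporary folder (the path is made up for the example) and prints the same three columns a hex editor would: address, hex value, ASCII view.

```python
import os
import tempfile

# Hypothetical stand-in for the notepad file: one letter, no trailing newline.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="ascii") as f:
    f.write("A")

with open(path, "rb") as f:    # "rb" reads the raw bytes, with no text decoding
    data = f.read()

print(os.path.getsize(path))   # 1

for address, byte in enumerate(data):
    # address, hex value, ASCII character
    print(f"{address:08x}  {byte:02x}  {chr(byte)}")   # 00000000  41  A
```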

notepad_A_hexedit.PNG

The hex editor shows every byte as a hex number and, alongside it, as the ASCII character with that code. For our file that is the right interpretation. If you open a file that isn’t ASCII text, the bytes are still displayed as ASCII characters, though the result may not make sense.

Try opening a random .exe to see what ASCII strings are embedded inside. You can usually find a few in the beginning portions of the file. All DOS .exe files start with the header “MZ”, the initials of Mark Zbikowski, the programmer who came up with the file format.

hexedit_notepad.PNG

Cool, eh? These headers or “magic numbers” are one way for a program to determine what type of file it’s seeing. If you open a PNG image you’ll see the PNG header, which includes the ASCII letters “PNG”.
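A program can do the same detective work by peeking at the first few bytes. This is only a sketch: guess_type is a made-up helper, and it knows just the two headers mentioned here. The PNG signature is the standard 8-byte one, which contains the letters “PNG” in bytes 2 through 4.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the 8 bytes that start every PNG file

def guess_type(first_bytes):
    """Guess a file type from its leading bytes (hypothetical helper)."""
    if first_bytes.startswith(b"MZ"):
        return "DOS/Windows executable"
    if first_bytes.startswith(PNG_SIGNATURE):
        return "PNG image"
    return "unknown"

print(guess_type(b"MZ\x90\x00"))             # DOS/Windows executable
print(guess_type(PNG_SIGNATURE + b"rest"))   # PNG image
print(guess_type(b"A"))                      # unknown
```

With a real file you would pass in something like open(path, "rb").read(8).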

What’s going on?

On disk, sample.txt holds only the single byte 65 (41 in hex or 01000001 in binary). Given the context of the information (i.e., notepad is expecting a text file) the computer knows to display the ASCII character ‘A’ on the screen.

Now consider how a human would store the actual numeric value of 65 if you told them to write it down. As humans, we would write it as two characters, a ‘6’ and then a ‘5’, which takes 2 ASCII characters or 2 bytes (again, the “letter” 6 can be stored in ASCII).

A computer would store the number “65” as 65 in binary, the same byte as ‘A’. Except this time, software would know that the ’65’ was not the code for a letter, it was actually the number itself.
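Here is the same idea as a small Python sketch (the variable names are mine): one byte read two ways, next to the two-character version a human would write.

```python
raw = bytes([65])                    # one byte holding the value 65

as_text = raw.decode("ascii")        # read it as a character
as_number = raw[0]                   # read it as a number

written_out = "65".encode("ascii")   # how a human writes it: '6' then '5'

print(as_text, as_number)            # A 65
print(len(raw), len(written_out))    # 1 2
print(list(written_out))             # [54, 53], the ASCII codes for '6' and '5'
```

The byte never changes. Only the software's expectation of what it means does.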

Now, suppose we wanted to store the number 4,000,000,000 (4 billion). As humans, we would write it as 4000000000, or 10 ASCII characters (10 bytes). How would a computer do it?

A single byte has 8 bits, or 2^8 (256) possible values. 4 bytes gives us 32 bits, or 2^32 (roughly 4.29 billion) possible values. So, we could store the number 4 billion in only 4 bytes.

As you can see, storing numeric data in the computer’s format saves space. It also saves computational effort — the computer does not have to convert a number between binary and ASCII.
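To see the savings, here is a sketch using Python's standard struct module. The format code "<I" means an unsigned 32-bit integer in little-endian order. I picked that order arbitrarily; the size is the same either way.

```python
import struct

n = 4_000_000_000

as_text = str(n).encode("ascii")     # the human way: one byte per digit
as_binary = struct.pack("<I", n)     # the computer way: unsigned 32-bit integer

print(len(as_text), len(as_binary))  # 10 4
print(2**32)                         # 4294967296, so 4 billion just fits

# Reading it back gives the original number.
print(struct.unpack("<I", as_binary)[0])   # 4000000000
```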

So, why not use binary formats?

If binary formats are more efficient, why not use them all the time?

One reason binary files are efficient is that they can use all 8 bits in a byte, while most text is constrained to certain fixed patterns, leaving unused space. However, by compressing your text data you can reduce the amount of space used and make text more efficient.
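Both halves of that claim are easy to poke at. The sketch below uses Python's standard zlib module on a deliberately repetitive sample sentence of my own choosing, so real-world text will compress less dramatically.

```python
import zlib

text = ("Everything is bits and bytes to the computer. " * 50).encode("ascii")

# ASCII only uses values 0-127, so the top bit of every byte goes unused.
print(max(text) < 128)               # True

compressed = zlib.compress(text)
print(len(text), len(compressed))    # the compressed form is far smaller

# Compression is lossless: we get the exact text back.
restored = zlib.decompress(compressed)
print(restored == text)              # True
```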

Marshalling and Unmarshalling Data

Aside: Marshalling always makes me think of Sheriff Marshals and thus cowboys. Cowboys have nothing to do with the CS meaning of “marshal”.

Sometimes computers have complex internal data structures, with chains of linked items that need to be stored in a file. Marshalling is the process of taking the internal data of a program and saving it to a flat, linear file. Unmarshalling is the process of reading that linear data and recreating the complex internal data structure the computer originally had.

Notepad has it easy – it just needs to store the raw text so no marshalling is needed. Microsoft Word, however, must store the text along with other document information (page margins, font sizes, embedded images, styles, etc.) in a single, linear file. Later, it must read that file and recreate the original setup the user had.

You can marshal data into a binary or text format — the word “marshal” does not indicate how the data is stored.
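As a toy illustration, here is a chain of linked items marshalled two ways, once as text (JSON) and once as binary (a count followed by 32-bit integers). Everything here is invented for the example: the Node class, the helper names, and the binary layout. Real formats like Word's are far more involved.

```python
import json
import struct

class Node:
    """One item in a linked chain (hypothetical in-memory structure)."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def to_list(head):
    """Flatten the chain into a plain list of values."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next_node
    return values

def from_list(values):
    """Rebuild the chain from a flat sequence of values."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head

chain = Node(1, Node(2, Node(3)))
values = to_list(chain)

# Marshal: same data, two flat representations.
text_form = json.dumps(values).encode("ascii")                         # b'[1, 2, 3]'
binary_form = struct.pack(f"<I{len(values)}I", len(values), *values)   # count, then items

# Unmarshal: read the flat bytes and recreate the chain.
from_text = from_list(json.loads(text_form))
count = struct.unpack_from("<I", binary_form)[0]
from_binary = from_list(struct.unpack_from(f"<{count}I", binary_form, 4))

print(to_list(from_text), to_list(from_binary))   # [1, 2, 3] [1, 2, 3]
```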

So when are binary file formats useful?

There are situations where you may want to use binary file formats. PNG images use a binary format because efficiency is important in creating small image files. PNG also does binary formats right: it specifies byte orders and word lengths to avoid the NUXI problem.
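Why does a format need to pin down byte order? A quick sketch with Python's struct module shows what happens when the writer and the reader disagree. The sample number is arbitrary. PNG stores its integers most significant byte first, which is ">" in struct's notation.

```python
import struct

n = 0x0A0B0C0D

big = struct.pack(">I", n)      # most significant byte first (the order PNG uses)
little = struct.pack("<I", n)   # least significant byte first

print(big.hex())                # 0a0b0c0d
print(little.hex())             # 0d0c0b0a

# Reading big-endian bytes as if they were little-endian gives a different number.
wrong = struct.unpack("<I", big)[0]
print(hex(wrong))               # 0xd0c0b0a

# The NUXI problem in miniature: swap the bytes inside each 2-byte word.
swapped = b"".join(b"UNIX"[i:i + 2][::-1] for i in range(0, 4, 2))
print(swapped)                  # b'NUXI'
```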

There are often business reasons to use binary formats. The main reason is that they are more difficult to reverse engineer (humans have to guess how the computer is storing its data), which can help maintain a competitive advantage.
