A little ditty about binary file formats


Understanding the nature of file formats and escape characters has been an itch of mine. I recently found a few useful explanations that inspired me to write my understanding of binary files.

How computers represent data

Everything is bits and bytes, 1s and 0s, to the computer. Humans understand text, so we have programs that convert a series of 1s and 0s into something we can understand.

In the ASCII character scheme, a single byte (a sequence of eight 1s and 0s, or a number from 0 to 255) can be converted into a character. For example, the character ‘A’ is the number 65 in decimal, 41 in hex, or 01000001 in binary. ‘B’ is the number 66 in decimal, and so on (see a full chart).
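If you'd rather not take the chart's word for it, here is a minimal Python sketch (Python is my pick for illustration, and describe is a made-up helper name) that shows a character's code in decimal, hex and binary:

```python
# Look up the ASCII code of a character and show it in three bases.
def describe(ch):
    code = ord(ch)  # character -> number
    return code, format(code, "02x"), format(code, "08b")

print(describe("A"))  # (65, '41', '01000001')
print(chr(66))        # number -> character: B
```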

Don’t believe me? Mini-example time.

Create a file in notepad with the single letter “A” (any filename will do — “sample.txt”).

Save the file, right-click it and look at the properties. It should be 1 byte: Notepad stores characters in ASCII, with one byte per character. The “size on disk” may be larger because the computer allocates space in fixed blocks (of 4 kilobytes, for example).


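Find a hex editor (here’s a free one) and open the file you just saved. (On Linux/Unix, use “od -t x1 sample.txt”, which prints one byte at a time. Plain “od -x” groups bytes into 2-byte words, so a lone byte gets padded out to a full word.)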

You’ll see only the single number “41” in hexadecimal (65 in decimal), and the hex editor may show the character “A” in a side panel (the ASCII representation of the byte you are examining). The “0” on the left is the address of the byte; programmers love counting from zero.


The hex editor displays every byte as if it were ASCII text, which it is in our case. If you open a non-ASCII file, the bytes inside will still be displayed as ASCII characters, though the result may not make sense.
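To get a feel for what the hex editor is doing, here is a rough sketch of one in Python. The hexdump function is a hypothetical helper, not how any particular editor works: it builds lines with an offset, the bytes in hex, and a side column that shows printable ASCII and a dot for everything else.

```python
def hexdump(data, width=16):
    """Return hex-editor-style lines: offset, hex bytes, ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is 32..126; everything else is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b <= 126 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return lines

# The one-letter text file from the example, held in memory here.
print("\n".join(hexdump(b"A")))
# Four non-text bytes: two happen to be letters, two are not printable.
print("\n".join(hexdump(bytes([0x4D, 0x5A, 0x90, 0x00]))))
```

To try it on the real file, pass it open("sample.txt", "rb").read() instead of the in-memory bytes.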

Try opening a random .exe to see what ASCII strings are embedded inside; you can usually find a few near the beginning of the file. DOS and Windows .exe files start with the header “MZ”, the initials of Mark Zbikowski, the programmer who came up with the file format.


Cool, eh? These headers or “magic numbers” are one way for a program to determine what type of file it’s seeing. If you open a PNG image you’ll see the PNG header, which includes the ASCII letters “PNG”.
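Here is a small sketch of how a program might use magic numbers. The MAGIC table and the sniff helper are names I made up for illustration, and a real tool would know far more signatures than these two:

```python
# Known signatures: the first bytes of a file, mapped to a label.
MAGIC = {
    b"MZ": "MZ executable",
    b"\x89PNG\r\n\x1a\n": "PNG image",  # the 8-byte PNG signature
}

def sniff(header):
    """Guess a file type from its leading bytes (hypothetical helper)."""
    for magic, label in MAGIC.items():
        if header.startswith(magic):
            return label
    return "unknown"

print(sniff(b"MZ\x90\x00"))            # MZ executable
print(sniff(b"\x89PNG\r\n\x1a\n..."))  # PNG image
print(sniff(b"A"))                     # unknown
```

For a file on disk you would read just the first few bytes, e.g. open(path, "rb").read(8), and pass them in.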

What’s going on?

On disk, sample.txt holds only the number 65 (41 in hex or 01000001 in binary). Given the context of the information (i.e., Notepad is expecting a text file), the computer knows to display the ASCII character ‘A’ on the screen.

Now consider how a human would store the actual numeric value of 65 if you told them to write it down. As humans, we would write it as two characters, a ‘6’ and then a ‘5’, which takes 2 ASCII characters or 2 bytes (again, the “letter” 6 can be stored in ASCII).

A computer would store the number 65 as 65 in binary, the same bit pattern as ‘A’. Except this time, software would know that the 65 was not the code for a letter; it was actually the number itself.
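The two readings of the same byte can be sketched in Python like this (the variable names are mine, purely for illustration):

```python
raw = bytes([65])  # one byte with the bit pattern 01000001

as_text = raw.decode("ascii")           # treat it as a character code
as_number = int.from_bytes(raw, "big")  # treat it as an unsigned integer

print(as_text, as_number)  # A 65

# Writing the number the "human" way takes one byte per digit.
digits = str(as_number).encode("ascii")
print(list(digits))        # [54, 53], the ASCII codes for '6' and '5'
```

Nothing in the byte itself says which reading is right. The program reading it has to bring that context along.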

Now, suppose we wanted to store the number 4,000,000,000 (4 billion). As humans, we would write it as 4000000000, or 10 ASCII characters (10 bytes). How would a computer do it?

A single byte has 8 bits, or 2^8 (256) possible values. 4 bytes gives us 32 bits, or 2^32 (roughly 4.3 billion) possible values. So, we could store the number 4 billion in only 4 bytes.

As you can see, storing numeric data in the computer’s format saves space. It also saves computational effort — the computer does not have to convert a number between binary and ASCII.
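To see the size difference for yourself, here is a sketch using Python's struct module. Packing as a little-endian unsigned 32-bit integer ("<I") is my assumption for the example; the point is only that 4 bytes are enough:

```python
import struct

n = 4_000_000_000

as_text = str(n).encode("ascii")  # one byte per decimal digit
as_binary = struct.pack("<I", n)  # unsigned 32-bit integer

print(len(as_text), len(as_binary))            # 10 4
print(2 ** 32)                                 # 4294967296
print(struct.unpack("<I", as_binary)[0] == n)  # True
```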

So, why not use binary formats?

If binary formats are more efficient, why not use them all the time?

  • Binary files are difficult for humans to read. When a person sees a sequence of 4 bytes, he has no idea what it means (it could be a 4-letter word stored in ASCII). If he sees the 10 ASCII letters 4000000000, he knows it is a number.
  • Binary files are difficult to edit. In the same manner, if a person wants to change 4 billion to 2 billion, he needs to know the binary representation. With the ASCII representation, he can simply put in a “2” instead of the “4”.
  • Binary files are difficult to manipulate. The UNIX tradition has several simple, elegant tools to manipulate text. By storing files in the standard text format, you get the power of these tools without having to create special editors to modify your binary file.
  • Binary files can get confusing. Problems happen when computers have different ways of reading data. There’s something called the “NUXI” or byte-order problem, which happens when 2 computers with different architectures (PowerPC Macs and x86 PCs, for example) try to transfer binary data. Regular text stored in single bytes is unambiguous, but be careful with Unicode, where multi-byte encodings such as UTF-16 bring byte order back into play.
  • The efficiency gain usually isn’t tremendous. Representing numbers in binary can ideally save you a factor of 3 (a 4 byte number can represent 10 bytes of text). However, this assumes that the numbers you are representing are large (a 3-digit number like 999 is better represented in ASCII than as a 4-byte number). Lastly, ASCII actually only uses 7 bits per byte, so you can theoretically pack ASCII together to get a 1/8 or 12.5% gain. However, storing text in this way is typically not worth the hassle.
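The byte-order problem from the list above is easier to see with real bytes. This is only a sketch: the "<" and ">" prefixes are struct's little-endian and big-endian markers, and swap16 is a hypothetical helper that mimics a machine reading 2-byte words in the opposite order.

```python
import struct

n = 4_000_000_000  # 0xEE6B2800

little = struct.pack("<I", n)  # least significant byte first (x86)
big = struct.pack(">I", n)     # most significant byte first (older PowerPC Macs)
print(little.hex(), big.hex())  # 00286bee ee6b2800

# Same 4 bytes, read with the wrong byte order: a different number comes out.
misread = struct.unpack("<I", big)[0]
print(misread == n)  # False

def swap16(data):
    """Reverse the bytes inside each 2-byte word."""
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))

print(swap16(b"UNIX"))  # b'NUXI'
```

Single-byte text has no bytes to reorder, which is why plain ASCII survives the trip between machines.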

One reason binary files are efficient is that they can use all 8 bits in a byte, while most text is constrained to certain fixed patterns, leaving unused space. However, by compressing your text data you can reduce the amount of space used and make text more efficient.
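As a rough illustration of the compression point, here is a sketch that squeezes a block of numbers stored as text. I'm using zlib only because it ships with Python, and the sample data is invented; how much you save depends entirely on the data.

```python
import zlib

# 1000 seven-digit numbers, one per line, stored as plain ASCII text.
text = "\n".join(str(n) for n in range(1_000_000, 1_001_000)).encode("ascii")
packed = zlib.compress(text)

print(len(text), len(packed))           # the compressed size is much smaller
print(zlib.decompress(packed) == text)  # True: nothing was lost
```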

Marshalling and Unmarshalling Data

Aside: Marshalling always makes me think of Sheriff Marshals and thus cowboys. Cowboys have nothing to do with the CS meaning of “marshal”.

Sometimes computers have complex internal data structures, with chains of linked items that need to be stored in a file. Marshalling is the process of taking the internal data of a program and saving it to a flat, linear file. Unmarshalling is the process of reading that linear data and recreating the complex internal data structure the computer originally had.

Notepad has it easy – it just needs to store the raw text so no marshalling is needed. Microsoft Word, however, must store the text along with other document information (page margins, font sizes, embedded images, styles, etc.) in a single, linear file. Later, it must read that file and recreate the original setup the user had.

You can marshal data into a binary or text format — the word “marshal” does not indicate how the data is stored.
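Here is a minimal sketch of marshalling the same chain of linked items both ways. Everything in it is an assumption made for illustration: the Node class, the JSON text layout, and the binary layout (a count followed by little-endian unsigned 32-bit values).

```python
import json
import struct
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    next: Optional["Node"] = None

def to_list(head):
    """Walk the chain and collect its values in order."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values

def from_list(values):
    """Rebuild the chain from a flat sequence of values."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def marshal_text(head):
    return json.dumps(to_list(head)).encode("ascii")

def unmarshal_text(data):
    return from_list(json.loads(data.decode("ascii")))

def marshal_binary(head):
    values = to_list(head)
    # Count first, then each value as an unsigned 32-bit integer.
    return struct.pack(f"<I{len(values)}I", len(values), *values)

def unmarshal_binary(data):
    (count,) = struct.unpack_from("<I", data, 0)
    return from_list(struct.unpack_from(f"<{count}I", data, 4))

chain = from_list([1, 2, 4_000_000_000])
print(marshal_text(chain))           # b'[1, 2, 4000000000]'
print(marshal_binary(chain).hex())   # 16 bytes: count + 3 values
print(to_list(unmarshal_binary(marshal_binary(chain))))  # [1, 2, 4000000000]
```

Either way, the pointers between nodes never reach the file. Only enough information to rebuild them does.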

So when are binary file formats useful?

There are situations where you may want to use binary file formats. PNG images use a binary format because efficiency is important in creating small image files. However, PNG does binary formats right: it specifies byte orders and word lengths to avoid the NUXI problem.
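To make the PNG point concrete, here is a sketch that reads an image's width and height from the start of a PNG. The png_size helper is hypothetical and the sample bytes are hand-built for a made-up 640x480 image, but the layout it relies on (an 8-byte signature, then an IHDR chunk with big-endian 4-byte integers) is what the PNG format specifies.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(data):
    """Read width and height from the IHDR chunk at the start of PNG data."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    # Chunk layout: 4-byte length, 4-byte type, then the chunk data.
    # PNG stores every multi-byte integer big-endian (">").
    length, kind = struct.unpack_from(">I4s", data, 8)
    if kind != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height = struct.unpack_from(">II", data, 16)
    return width, height

# Hand-built header: signature + IHDR chunk (the trailing CRC is left out).
sample = PNG_SIGNATURE + struct.pack(
    ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
)
print(png_size(sample))  # (640, 480)
```

Because the byte order is written into the format, this code gives the same answer on any machine.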

There are often business reasons to use binary formats. The main reason is that they are more difficult to reverse engineer (humans have to guess how the computer is storing its data), which can help maintain a competitive advantage.

Kalid Azad loves sharing Aha! moments. BetterExplained is dedicated to learning with intuition, not memorization, and is honored to serve 250k readers monthly.


23 Comments

  1. Thanks Kalid for the wonderful explanation. I wasn’t sure that the funny characters represented next to hex values were their ascii equivalents and that they are not part of the actually file.

  2. thanx for this useful and awesome article

    it’s realy useful for reduces size of jar and file also.
    thanx

  3. that was fun.. brought me back to my comp. sci days. How about an article on basic number theory. What the number 123 (base 10) really means. How to convert from say binary to hex.. why xmas (dec 25) is equal to halloween (oct 31). Why we can’t divide by zero.. what zero really means etc… that could lead into an article on what inductive proofs really mean etc.. love this stuff!!!

  4. Thanks Mr. Rose, glad you liked it :). Yep, I’m planning on covering number systems and why things turn out the way they do (from my point of view). I’m starting to view numbers as a “software simulation” that tries to model the world, but has bugs / breaks down in certain circumstances (division by zero). Of course, the bugs are with our model, so we have to come up with new and better ways to represent what’s going on in the world.

  5. Kalid, this is a really great site. Thank you! I will use it as a reference in the future and I will tell others about it.

    I appreciate your focus on enhancing the reader’s comprehension. Also, you have excellent screenshots and visuals.

  6. @Kamil: Awesome, really glad it helped! I feel that sharing information isn’t really “learning” unless you also try to share the understanding that makes them click.

  7. Hi Kalid.. Thanks again..

    Hey .. i am not able to grasp this line completely:

    ——————————————-
    Representing numbers in binary can ideally save you a factor of 3 (a 4 byte number can represent 10 bytes of text).
    ——————————————-

    A lil more explanation on this will surely help.

    Thanks.

  8. Hi Kalid. General comment. Just want to say I’ve been on your site a bunch of times now and it’s turned out to be an invaluable resource. The difference is honestly like night and day sometimes. Amazing stuff. You truly have an immeasurable gift. Thank you for sharing it with us.

  9. @Juan-Carlos: Thanks for the kind words, and for coming back to check out the site. You’re more than welcome — writing is a way for me to solidify what I “think” I’m learning… sometimes it’s not until it’s on paper and I’m working through some self-made examples does it really click.

  10. Hello

    I have a question that is a little bit related with the article, and maybe you have some literature to recommended me about.
    I have to write about the problems that this operation:

    memcpy(&str1, "\x00\x00\x00\x00\x00\x00\x00\x00", 8);
    with struct like that, where I know the size but never the concrete fields that will contain, could be arrays, chars or integers:

    struct {
        int a;
        int b;
    } str1;
    The problem is that it will not be portable into another arquitecture, now I’m in Linux,but I should explain what will happen if I change to use it in another machine. The thing is that i don’t know where to look for explain properly the problem
    Maybe you could recommended me where to look because i need to do it in a formal way.
    Thanks in advance :)

  11. Hi Romi, unfortunately I’m not that familiar with how structs are laid out on different machines. I assume that they’d be allocated in order [first bytes go to int a; next set of bytes go to int b;] but on a big or little-endian system, the order of the bytes within int a and int b could change.

  12. Cool article. I have a little software business near Redmond, Wa. I was looking up how to recover a corrupted FLA. I don’t recall how I got to this page. The wonders of the internet. Anyway, this was a refreshing little article. Thanks.

