ASCII vs Unicode

Understanding why they were created in the first place


26 May 2018
#computer #ascii #unicode #encoding

ASCII, Origins

When people first created a character representation, they needed a unique number to identify each character. That’s when they came up with ASCII, which uses 7 bits to represent each character uniquely. With 7 bits there are at most 2\(^7\) (= 128) distinct combinations, which means a maximum of 128 characters can be represented.

One might ask, why 7 bits? Why not a full byte (8 bits)? The last (8th) bit was reserved as a parity bit to detect errors in communication.

ASCII characters include:

  • alphabets such as abc, ABC
  • numbers such as 123
  • symbols such as ?&*
  • control characters such as carriage return, line feed, tab, ESC, etc.

See below the binary representations of a few example ASCII characters:

0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
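
If you want to check these values yourself, here is a minimal sketch in Python (my choice purely for illustration; the article is not tied to any language) that prints the same 7-bit patterns and decimal codes:

    # Print the 7-bit binary pattern and decimal ASCII code of a few characters.
    for ch in ['%', 'A', 'B', 'C', '\r']:
        code = ord(ch)                        # decimal code, e.g. 65 for 'A'
        print(f"{code:07b} -> {ch!r} ({code})")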

As you may have noticed, ASCII was designed to support only English, since the center of the computer industry was in America at that time. As a consequence, it didn’t need to support other Latin characters such as á, ü, ç, ñ, etc. (i.e., characters with diacritics).

ASCII Extended

As the need for other Latin characters grew, people started using the 8th bit (instead of using it as a parity bit) to encode more characters (for example, “á”). Using just one extra bit doubled the size of the original ASCII table, mapping up to 256 characters (2\(^8\) = 256) instead of 2\(^7\) (= 128) as before.

See below the binary representations of two characters from one such extended character set, which uses the 8th bit:

10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)

This “ASCII extended to 8 bits instead of the original 7” is usually just referred to as “extended ASCII” or “8-bit ASCII”.

Note that there are many variations of the 8-bit ASCII table, because different people extended it for different purposes. One example is ISO 8859-1, also called ISO Latin-1.
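
To see how these variations differ, here is a small Python sketch. It decodes the single byte 130 (the “é” example above, which matches IBM code page 437) with a few different 8-bit tables; the codec names are Python’s built-in ones.

    # The same byte value means different things in different 8-bit "extended ASCII" tables.
    raw = bytes([130])                  # one byte with the 8th bit set (binary 10000010)
    print(raw.decode('cp437'))          # 'é'  -- IBM PC code page 437
    print(raw.decode('cp1252'))         # '‚'  -- Windows-1252 (Windows Latin 1)
    print(raw.decode('latin-1'))        # an invisible control character in ISO 8859-1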

Unicode, The Rise

Extended ASCII solved the problem for languages based on the Latin alphabet, but what about languages that need a completely different-looking set of characters? Chinese, Japanese, and Korean (CJK)? Russian and the like?

To encode and display them properly, we needed an entirely new character set. That’s the rationale behind Unicode. Unicode doesn’t have every character from every language, but it contains a gigantic number of characters (see this table, especially the Chinese characters, each of which is a distinct symbol of its own).

There is no such thing as “save as Unicode”, because Unicode is an abstract representation of the text: it assigns each character a number called a code point. To store or transmit the text, you need to “encode” this abstract representation into bytes. That’s where a character encoding comes into play.
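
A quick Python sketch makes the distinction concrete: the code point is an abstract number, and only the chosen encoding turns it into bytes (the encodings below are just examples):

    # One abstract character (one code point), several possible byte representations.
    ch = 'é'                              # code point U+00E9
    print(hex(ord(ch)))                   # 0xe9 -- the code point, not yet bytes
    print(ch.encode('utf-8'))             # b'\xc3\xa9'  (two bytes)
    print(ch.encode('latin-1'))           # b'\xe9'      (one byte)
    print(ch.encode('cp437'))             # b'\x82'      (one byte -- the 130 from earlier)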

The variety of Unicode encodings today: UTF-8 vs UTF-16 vs UTF-32

UTF-8 and UTF-16 are variable-length encodings. In UTF-8, if a character can be represented with a single byte (because its code point is a very small number), it is encoded with a single byte; if it requires two bytes, two bytes are used, and so on. UTF-16 follows the same idea, except that its minimum unit is 16 bits: it starts with 16 bits and adds another 16 bits if needed. UTF-32, however, is fixed at 4 bytes per character.

Take a look at the following table; it should give you a better understanding of each.

bits                                  encoding  characters
01000001                              UTF-8     A
00000000 01000001                     UTF-16    A
00000000 00000000 00000000 01000001   UTF-32    A
11100011 10000001 10000010            UTF-8     あ
00110000 01000010                     UTF-16    あ
00000000 00000000 00110000 01000010   UTF-32    あ
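
You can reproduce the table above with a short Python sketch; it assumes the big-endian forms of UTF-16 and UTF-32 (no byte order mark) and uses あ (code point U+3042) as the non-ASCII example character:

    # Print the bit patterns of 'A' and 'あ' under the three Unicode encodings.
    for ch in ['A', 'あ']:
        for enc in ['utf-8', 'utf-16-be', 'utf-32-be']:   # big-endian, no BOM
            bits = ' '.join(f'{b:08b}' for b in ch.encode(enc))
            print(f"{bits}  {enc}  {ch}")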

The ingenious thing about UTF-8 is that it is binary-compatible with ASCII, which is the de facto baseline for all encodings. UTF-16 and UTF-32, on the other hand, always use at least 2 or 4 bytes (16 or 32 bits) respectively, which makes them incompatible with ASCII, whose characters fit in 7-8 bits.
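
A quick way to see this compatibility is to encode plain ASCII text with each encoding and compare the bytes (again just a Python sketch):

    # Plain ASCII text gives identical bytes in ASCII and UTF-8, but not in UTF-16/UTF-32.
    text = 'Hello'
    print(text.encode('ascii'))        # b'Hello'
    print(text.encode('utf-8'))        # b'Hello' -- byte-for-byte the same as ASCII
    print(text.encode('utf-16-be'))    # b'\x00H\x00e\x00l\x00l\x00o' -- not ASCII-compatible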

TL;DR

There are many encodings that are slight variations of the ASCII/Unicode encodings above. Any character can be encoded as many different bit sequences, and any particular bit sequence can represent many different characters; it all depends on which encoding is used to read or write them.

See the tables below for a few of the encodings in use out there.

  • the same bits read with different encodings:
bits                                          encoding         characters
11000100 01000010                             Windows Latin 1  ÄB
11000100 01000010                             Mac Roman        ƒB
11000100 01000010                             GB18030
  • the same characters written with different encodings:
bits                                          encoding         characters
01000110 11111000 11110110                    Windows Latin 1  Føö
01000110 10111111 10011010                    Mac Roman        Føö
01000110 11000011 10111000 11000011 10110110  UTF-8            Føö

This is why you see garbled text when you open a file in a text editor using the wrong encoding, and why bytes can appear “corrupted” and show up as weird characters. As long as you know which encoding a certain piece of text (that is, a certain byte sequence) is in, the text will be interpreted correctly under that encoding.
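
If you are curious, you can produce this kind of garbling deliberately in a few lines of Python: encode with one encoding and decode with another (using the “Føö” example from the table above):

    # Decoding bytes with the wrong encoding is exactly how garbled text appears.
    data = 'Føö'.encode('utf-8')        # b'F\xc3\xb8\xc3\xb6'
    print(data.decode('utf-8'))         # 'Føö'   -- correct: same encoding both ways
    print(data.decode('latin-1'))       # 'FÃ¸Ã¶' -- the classic garbled look
    print(data.decode('mac-roman'))     # another wrong guess, another garble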
