Digital computers represent and manipulate all forms of data as a collection of numerical values. In the case of text, alphanumerical and punctuation characters are each assigned a numerical equivalent, and this collection of character-to-number mappings is referred to as a code. Wouldn’t it be nice if there were only one such beast...

Introduction (A Morass of Confusion)
There's an old engineering joke that says: "Standards are great ... everyone should have one!" The problem is that – very often – everyone does. Consider the case of storing textual data inside a computer, where the computer regards everything as being a collection of numbers. In this case, someone has to (a) decide which characters are going to be represented and (b) decide which numbers are going to represent which characters.

Suppose, for example, that we wish to represent only the uppercase letters 'A' through 'Z', in which case we will require only 26 different codes. Now, five binary bits can represent 2^5 = 32 different patterns of 0s and 1s, so a 5-bit code would be sufficient to represent our 26 uppercase letters with some code combinations left over. And which number-to-letter mapping would we employ? Well, in this case we would probably go with the obvious choice of assigning 'A' to %00000, 'B' to %00001, 'C' to %00010, and so forth up to 'Z' being assigned to %11001 as illustrated below (where the ‘%’ character is used to indicate a binary number):
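This obvious mapping is trivial to build in code. Here's a minimal sketch in Python (any language would do; the dictionary name is our own invention):

```python
# Build the obvious 5-bit mapping: 'A' -> %00000 (0), 'B' -> %00001 (1),
# and so forth up to 'Z' -> %11001 (25).
code_of = {chr(ord('A') + i): i for i in range(26)}

print(code_of['A'])            # 0  (%00000)
print(code_of['Z'])            # 25 (%11001)
print(f"{code_of['C']:05b}")   # 00010
```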

However, what if we decide that we also wish to represent the lowercase characters 'a' through 'z'? In this case we have 26 + 26 = 52 different characters, so we'd have to use a 6-bit code, because six bits provide 2^6 = 64 different patterns of 0s and 1s. But now we have a problem: should the codes %000000 through %011001 be used to represent 'A' through 'Z' and the codes %011010 through %110011 be used to represent 'a' through 'z', or vice versa?

Alternatively, would it make more sense to use %000000 to represent 'a', %000001 to represent 'A', %000010 to represent 'b', %000011 to represent 'B', and so forth? In this scenario, even numbers represent lowercase characters and odd numbers represent their uppercase counterparts:

And there are many more options open to us; for example, %000000 through %011001 could be used to represent 'a' through 'z', while %100000 through %111001 could be used to represent 'A’ through 'Z' as illustrated below:

In this case, the most-significant (left-hand) bit tells us if we have a lowercase (0) or uppercase (1) character, while the remaining bits are used to inform us what the character is (the only difference between the codes for 'a' and 'A', for example, is a 0 or 1 in the most-significant bit).
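One attraction of this case-bit scheme is that converting between cases amounts to flipping a single bit. Here's a small sketch in Python illustrating the idea (the function names are hypothetical, not part of any real standard):

```python
# Hypothetical 6-bit scheme from the text: 'a'..'z' -> 0..25,
# 'A'..'Z' -> 32..57. The most-significant bit (%100000 = 32)
# selects the case; the low five bits identify the letter.
def encode(ch):
    if 'a' <= ch <= 'z':
        return ord(ch) - ord('a')             # MSB = 0 for lowercase
    return (ord(ch) - ord('A')) | 0b100000    # MSB = 1 for uppercase

def toggle_case(code):
    return code ^ 0b100000                    # flip only the case bit

print(f"{encode('a'):06b}")               # 000000
print(f"{encode('A'):06b}")               # 100000
print(f"{toggle_case(encode('z')):06b}")  # 111001, i.e. 'Z'
```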

Of course we will also wish to represent the decimal number characters '0' through '9', along with some punctuation characters such as commas, periods (full stops), parentheses (brackets), single and double quotes, question and exclamation marks, and so forth.

So, who actually gets to decide which characters are going to be represented and which numbers are going to represent which characters? Well, in the early days of computing, almost anyone (or any company) who designed a computer got to define their own character code mapping.

It's easy to see how this could lead to problems; for example, consider peripheral devices such as printers. These days we can wander down to our local computer store, purchase a cheap-and-cheerful printer, return home and connect it to our computer ... and it works (hurray)! And one of the reasons it works is that both the computer and the printer "talk the same language" (understand the same character code mapping).

But in the days of yore, when computer designers defined their own character code mappings, they also had to construct all of their own peripheral devices such as printers, which was a time-consuming and expensive hobby. And, of course, things only became more complicated when folks started to connect disparate computers from different manufacturers. Something had to be done, and the obvious solution was for everyone to adopt a standard code. Now read on...

The ASCII Code
Towards the end of the 1950s, the American Standards Association (ASA) began to consider the problem of defining a standard character code mapping that could be used to facilitate the representation, storage, and interchange of textual data between different computers and peripheral devices. In 1963, the ASA – which changed its name to the American National Standards Institute (ANSI) in 1969 – announced the first version of the American Standard Code for Information Interchange (ASCII).

However, this first version of the ASCII code (which is pronounced “ask-key”) left many things – such as the lowercase Latin letters – undefined, and it wasn't until 1968 that the currently used ASCII standard of 96 printing characters and 32 control characters was defined:

Note that code $20 (which is annotated "SP") is equivalent to a space. Also, as an aside, the terms uppercase and lowercase were handed down to us by the printing industry, from the compositors' practice of storing the type for capital letters and small letters in two separate trays, or cases. When working at the type-setting table, the compositors invariably kept the capital letters and small letters in the upper and lower cases, respectively; hence, "uppercase" and "lowercase." Prior to this, scholars referred to capital letters as majuscules and small letters as minuscules, while everyone else simply called them capital letters and small letters.

We should also note that one of the really nice things about ASCII is that all of the alpha characters are numbered sequentially; that is, 65 ($41 in hexadecimal) = 'A', 66 = 'B', 67 = 'C', and so on until the end of the alphabet. Similarly, 97 ($61 in hexadecimal) = 'a', 98 = 'b', 99 = 'c', and so forth. This means that we can perform cunning programming tricks like saying "char = 'A' + 23" and have a reasonable expectation of ending up with the letter 'X'. Alternatively, if we wish to test to see if a character (called "char") is lowercase and – if so – convert it into its uppercase counterpart, we could use a piece of code similar to the following:

if (char >= 'a') and (char <= 'z') then char = char - 32;

Don't worry as to what computer language this is; the important point here is that the left-hand portion of this statement is used to determine whether or not we have a lowercase character and, if we do, subtracting 32 ($20 in hexadecimal) from that character's code will convert it into its uppercase counterpart.
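For the curious, the same test-and-convert trick can be written as runnable Python (the function name is our own):

```python
# ASCII case conversion: each lowercase code is exactly 32 ($20)
# above its uppercase counterpart.
def to_upper(ch):
    if 'a' <= ch <= 'z':
        return chr(ord(ch) - 32)
    return ch  # leave everything else untouched

print(to_upper('q'))        # Q
print(to_upper('Q'))        # Q (already uppercase, left alone)

# The sequential-codes trick also works as advertised:
print(chr(ord('A') + 23))   # X
```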

As can be seen in the previous diagram, in addition to the standard alphanumeric characters ('a'...'z', 'A'...'Z' and '0'...'9'), punctuation characters (comma, period, semi-colon, ...) and special characters ('!', '#', '%', ...), there are an awful lot of strange mnemonics such as EOT, ACK, NAK, and BEL. The point is that, in addition to representing textual data, ASCII was intended for a number of purposes such as communications; hence the presence of such codes as EOT, meaning "End of transmission," and BEL, which was used to ring a physical bell on old-fashioned printers.

Some of these codes are still in use today, while others are, generally speaking, of historical interest only. For those who are interested, a more detailed breakdown of these special codes is presented in the following table:

One final point is that ASCII is a 7-bit code, which means that it only uses the binary values %00000000 through %01111111 (that is, 0 through 127 in decimal, or $00 through $7F in hexadecimal). In some systems, the unused, most-significant bit of an 8-bit byte representing an ASCII character is simply set to logic 0. Alternatively, this bit might be used to implement a form of error checking known as a parity check, in which case it would be referred to as the parity bit.

There are two forms of simple parity checking, which are known as even parity and odd parity. Let's suppose that one computer system is transmitting a series of ASCII characters to another system. In the case of even parity, the transmitting system counts the number of logic 1s in the ASCII code for the first character. If there are an even number of logic 1s, the most-significant bit of the 8-bit byte to be transmitted will be set to logic 0; but if there are an odd number of logic 1s, the most-significant bit will be set to logic 1.

This process will be repeated for each character as it's being transmitted. The end result of an even parity check is to ensure that there are always an even number of logic 1s in the transmitted value. (By comparison, an odd parity check ensures an odd number of logic 1s in the transmitted value).

Similarly, when the receiving computer is presented with a character code, it counts the number of logic 1s in the first seven bits to determine what the parity bit should be. It then compares this calculated parity bit against the transmitted parity bit to see if they agree.
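The transmit-and-check procedure described above can be sketched in a few lines of Python (the function names are our own; a real implementation would live in hardware or a communications driver):

```python
# Even-parity transmit and check over a 7-bit ASCII code.
def add_even_parity(code7):
    ones = bin(code7).count('1')
    parity = ones & 1             # 1 if the count of 1s is odd
    return (parity << 7) | code7  # parity bit goes in the MSB

def check_even_parity(byte):
    # A valid even-parity byte contains an even number of 1s overall.
    return bin(byte).count('1') % 2 == 0

sent = add_even_parity(ord('C'))        # 'C' = %1000011 has three 1s
print(f"{sent:08b}")                    # 11000011 (parity bit set)
print(check_even_parity(sent))          # True
print(check_even_parity(sent ^ 0b100))  # False: one flipped bit is caught
print(check_even_parity(sent ^ 0b110))  # True:  two flipped bits cancel out
```

The last line demonstrates the weakness discussed below: an even number of bit errors slips through undetected.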

This form of parity checking is just about the simplest form of error check there is. It will only detect a single-bit error, because two errors will cancel each other out. (Actually, to be more precise, this form of parity check will detect an odd number of errors – one bit, three bits, five bits, and so on – but an even number of errors will cancel each other out.) Furthermore, even if the receiving computer does detect an error, there's no way for it to tell which bit is incorrect (indeed, the character code itself could be correct and the parity bit could have been corrupted). A variety of more sophisticated forms of error checking can be used to detect multiple errors, and even to allow the computer to work out which bits are incorrect, but these techniques are outside the scope of these discussions.

The EBCDIC Code
The ASCII code discussed in the previous topic was quickly adopted by the majority of American computer manufacturers, and was eventually turned into an international standard (see also the discussions on ISO and Unicode later in this paper.) However, IBM already had its own six-bit code called BCDIC (Binary Coded Decimal Interchange Code). Thus, IBM decided to go its own way, and it developed a proprietary 8-bit code called the Extended Binary Coded Decimal Interchange Code (EBCDIC).

Pronounced "eb-sea-dick", EBCDIC was first used on the IBM 360 computer, which was presented to the market in 1964. As was noted in our earlier discussions, one of the really nice things about ASCII is that all of the alpha characters are numbered sequentially. In turn, this means that we can perform programming tricks like saying "char = 'A' + 23" and have a reasonable expectation of ending up with the letter 'X'. To cut a long story short, if you were thinking of doing this with EBCDIC ... don’t. The reason we say this is apparent from the following table:

A brief glance at this illustration shows just why EBCDIC can be such a pain to use – the alphabetic characters don't have sequential codes. That is, the letters 'A' through 'I' occupy codes $C1 to $C9, 'J' through 'R' occupy codes $D1 to $D9, and 'S' through 'Z' occupy codes $E2 to $E9 (and similarly for the lowercase letters). Thus, performing programming tricks such as using the expression ('A' + 23) is a real pain with EBCDIC. Another pain is that EBCDIC doesn't contain all of the ASCII codes, which makes transferring text files between the two representations somewhat problematical.
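To see the problem concretely, here's a small Python sketch that builds just the uppercase portion of the EBCDIC table from the code ranges quoted above, and shows where the 'A' + 23 trick lands:

```python
# Partial EBCDIC table for the uppercase letters, using the three
# non-contiguous ranges quoted in the text.
ebcdic = {}
for i, ch in enumerate("ABCDEFGHI"):
    ebcdic[ch] = 0xC1 + i
for i, ch in enumerate("JKLMNOPQR"):
    ebcdic[ch] = 0xD1 + i
for i, ch in enumerate("STUVWXYZ"):
    ebcdic[ch] = 0xE2 + i

# In ASCII, 'A' + 23 lands on 'X'; in EBCDIC it lands somewhere else.
print(chr(ord('A') + 23))       # X (ASCII)
by_code = {v: k for k, v in ebcdic.items()}
print(by_code[ebcdic['A'] + 23])  # Q ($C1 + 23 = $D8, not 'X' at $E7)
```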

Once again, in addition to the standard alphanumeric characters ('a'...'z', 'A'...'Z' and '0'...'9'), punctuation characters (comma, period, semi-colon, ...), and special characters ('!', '#', '%', ...), EBCDIC includes a lot of strange mnemonics, such as ACK, NAK, and BEL, which were designed for communications purposes. Some of these codes are still used today, while others are, generally speaking, of historical interest only. A slightly more detailed breakdown of these codes is presented in the following table for your edification:

As one final point of interest, different countries have different character requirements, such as the á, ê, and ü characters. Because IBM sold its computer systems around the world, it had to create multiple versions of EBCDIC. In fact, 57 different national variants eventually ended up wending their way across the planet (a "standard" with 57 variants; you can only imagine how much fun everybody had when transferring files from one country to another).

Chunky Graphics Codes
In our previous discussions, we noted that the original ASCII was a 7-bit code. As opposed to always forcing the most-significant bit of an 8-bit byte to logic 0 or using it to implement a parity check, many computer vendors opted for a third alternative, which was to use this bit to extend the character set.

That is, if the most significant bit was a logic 0, then the remaining bits were used to represent the standard ASCII codes \$00 through \$7F as discussed earlier in this paper. However, if the most significant bit was a logic 1, then the remaining bits could be used to represent additional characters of the vendor's choosing.
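In other words, a single bit test is enough to tell the two ranges apart. A minimal sketch in Python (the function name is our own):

```python
# With an 8-bit byte, the most-significant bit selects between the
# standard ASCII range ($00-$7F) and the vendor-defined range ($80-$FF).
def is_extended(code):
    return (code & 0x80) != 0   # bit 7 set -> vendor-specific character

print(is_extended(0x41))  # False: $41 is standard ASCII 'A'
print(is_extended(0xB1))  # True:  $B1 is in the vendor-defined range
```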

In this case, one option would be to use these extra codes to support the special characters required by other languages (see also the discussions on the ISO and Unicode standards later in this paper.) Another option was to use these codes to implement simple graphics. As the resulting displays were somewhat "chunky," these were often referred to as "chunky graphics."

For your delectation and delight, the console (screen) that has been added to the DIY Calculator (see the notes on the Download page) supports a set of chunky graphics characters as shown below:

ISO and Unicode
Upon its introduction, ASCII quickly became a de facto standard around the world. However, the original ASCII didn't include all of the special characters (such as á, ê, and ü) that are required by the various languages that employ the Latin alphabet. Thus, the International Organization for Standardization (ISO) in Geneva, Switzerland, undertook to adapt the ASCII code to accommodate other languages.

In 1967, the organization released its recommendation ISO 646. Essentially, this left the original 7-bit ASCII code "as was", except that ten character positions were left open to be used to code for so-called "national variants."

ISO 646 was a step along the way toward internationalization. However, it didn't satisfy everyone's requirements; in particular (as far as the ISO was concerned), it wasn't capable of handling all of the alphabets in use in and around Europe, such as Arabic, Cyrillic, Greek, and Hebrew. Thus, the ISO created its standard 2022, which described the ways in which 7-bit and 8-bit character codes were to be structured and extended.

The principles laid down in ISO 2022 were subsequently used to create the ISO 8859-1 standard. Unofficially known as Latin-1, ISO 8859-1 is widely used for passing information around the Internet in Western Europe to this day. Full of enthusiasm, the ISO then set about defining an "all-singing all-dancing" 32-bit code called the Universal Coded Character Set (UCS). Known as ISO/IEC DIS 10646 Version 1, this code was intended to employ escape sequences to switch between different character sets. The result would have been able to support up to 4,294,967,296 characters, which would have been more than sufficient to address the world's (possibly the universe's) character coding needs for the foreseeable future.

However, starting in the early 1980s, American computer companies began to consider their own solutions to the problem of supporting multilingual character sets and codes. This work eventually became known as the Unification Code, or Unicode for short. Many people preferred Unicode to the ISO 10646 offering on the basis that Unicode was simpler. After a lot of wrangling, the proponents of Unicode persuaded the ISO to drop 10646 Version 1 and to replace it with a Unicode-based scheme, which ended up being called ISO/IEC 10646 Version 2.