Data Representation
Information represented by symbols
Two general categories
Labels - formed by arbitrary characters or symbols to represent things.
Numbers - formed by a limited set of characters to represent quantities
of things.
Labels are commonly formed by a sequence of characters.
Each character symbol is assigned an arbitrary numeric value.
Decimal 65 represents A in ASCII.
The value has no intrinsic meaning to the computer.
Although the assignment is arbitrary, a standard is followed to allow
data transfer between systems.
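As a small illustration of the character-to-value mapping described above, the sketch below uses Python's built-in ord() and chr(). The variable names are only for illustration.

```python
# Each character is just a number to the machine; the mapping is a convention.
label = "A"
code = ord(label)           # numeric value assigned to the character
print(code)                 # 65
print(chr(code))            # back from the value to the character
print(format(code, "07b"))  # the same value as 7 binary digits

# A label is a sequence of such values.
codes = [ord(c) for c in "CAB"]
print(codes)
```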
IBM character sets
Earliest systems each had their own character definitions although some
form of binary coded decimal was often used for numbers.
BCD - binary coded decimal
Originally a 6 bit number/character set used on punch cards.
Used on IBM mainframe and mid-range systems.
Wastes some storage (4 bits = 16 values possible, only 0-9 represented).
Can represent an exact decimal value to the limit of available memory.
When a number consists of multiple digits, two digit nibbles can be packed into each byte.
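The packing of digit nibbles can be sketched as follows. This is a minimal illustration, not IBM's packed decimal format: it handles non-negative integers only and omits the sign nibble that real packed decimal fields carry. The function names pack_bcd and unpack_bcd are hypothetical.

```python
def pack_bcd(number: int) -> bytes:
    """Pack a non-negative integer as BCD, two decimal digits per byte."""
    if number < 0:
        raise ValueError("this sketch handles non-negative values only")
    digits = str(number)
    if len(digits) % 2:            # pad to an even number of digits
        digits = "0" + digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])  # high nibble, low nibble
        for i in range(0, len(digits), 2)
    )


def unpack_bcd(packed: bytes) -> int:
    """Recover the integer from packed BCD bytes."""
    value = 0
    for byte in packed:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:    # nibble values 10-15 are the unused ones
            raise ValueError("not a valid BCD digit")
        value = value * 100 + high * 10 + low
    return value


packed = pack_bcd(1964)
print(packed.hex())        # the hex dump reads like the decimal number
print(unpack_bcd(packed))
```

Because each nibble holds exactly one decimal digit, the hex dump of the packed bytes reads the same as the decimal number, which is part of why the format is convenient for exact decimal arithmetic.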
EBCDIC - Extended Binary Coded Decimal Interchange Code
Expanded to handle upper and lower case characters.
8 bit storage.
1963/4 - Used on the System/360 architecture and all descendant IBM enterprise
systems.
Superset of BCD, which was the recognized default for most IBM peripherals.
Lower 4 bits (digit) of each 8 bit byte represent 0-9.
Upper 4 bits (zone) hold the sign or select other characters.
Most data transfer programs are capable of translating between ASCII and
EBCDIC.
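The zone/digit split and the ASCII translation can be sketched with Python's standard codecs. The example assumes code page 037 (the "cp037" codec), which is only one of several EBCDIC variants. The notes do not say which variant is meant.

```python
# Assumption: EBCDIC code page 037, as shipped with Python's codecs.
text = "A1"
ebcdic = text.encode("cp037")

nibbles = []
for ch, byte in zip(text, ebcdic):
    zone, digit = byte >> 4, byte & 0x0F   # upper and lower 4 bits
    nibbles.append((ch, zone, digit))
    print(f"{ch!r}: byte=0x{byte:02X} zone=0x{zone:X} digit=0x{digit:X}")

# Translating EBCDIC to ASCII is a decode followed by an encode.
ascii_bytes = ebcdic.decode("cp037").encode("ascii")
print(ascii_bytes)
```

The character "1" comes out with zone 0xF and digit 0x1, so the low nibble alone carries the numeric value.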
Common standards
ASCII - American Standard Code for Information Interchange.
Non-IBM (although IBM was one of the main supporters of the standard).
Descendant of teletype (telegraph) codes.
Initially 7 bits with 8th bit used for parity error checking.
Some tape punches still formatted for 7 bit.
128 values possible.
Latin alphabet (a-zA-Z with some special punctuation - acute, etc. )
Represents 0-9 A-Z a-z and punctuation.
Several values used for controlling the teletype (later printer).
backspace, tab, linefeed, bell, etc.
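A rough sketch of the 7 bit code with a parity bit in the 8th position, together with the control codes listed above. Even parity is an assumption here, since the notes do not say whether even or odd parity was used. The function names are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Set bit 7 so the total number of 1 bits in the byte is even."""
    if not 0 <= code < 128:
        raise ValueError("ASCII codes are 7 bit (0-127)")
    ones = bin(code).count("1")
    return code | ((ones % 2) << 7)


def parity_ok(byte: int) -> bool:
    """Check an 8 bit value under the even-parity assumption."""
    return bin(byte).count("1") % 2 == 0


# Control codes from the list above, by their ASCII values.
CONTROL_CODES = {"bell": 7, "backspace": 8, "tab": 9, "linefeed": 10}

sent = add_even_parity(ord("C"))   # 'C' has three 1 bits, so bit 7 is set
print(format(sent, "08b"), parity_ok(sent))

corrupted = sent ^ 0b00000001      # flip one bit in transit
print(format(corrupted, "08b"), parity_ok(corrupted))
```

A single flipped bit is detected, but two flipped bits cancel out, so parity is only a basic error check.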
8th bit eventually used for data, allowing another 128 characters.
The extra 128 values were used for either:
Some non-English characters - still not enough for an international character set.
Graphics characters - often custom to the system or device (Epson vs. IBM)
In early systems, it was common to have the physical shape of the
"character" stored in firmware - a predefined table.
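To illustrate why the 8th bit extensions were not portable, the sketch below decodes the same byte under two 8 bit character sets that ship with Python: "latin-1" (ISO 8859-1) and "cp437" (the original IBM PC set). The choice of these two code pages is an assumption made for the example and is not taken from the notes.

```python
raw = bytes([0xE9])                  # a value above 127, so not plain ASCII

as_latin1 = raw.decode("latin-1")    # an accented letter in ISO 8859-1
as_cp437 = raw.decode("cp437")       # a different symbol in the IBM PC set
print(repr(as_latin1), repr(as_cp437))

# Strict 7 bit ASCII has no meaning for the value at all.
try:
    raw.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False
print(ascii_accepts)
```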
Unicode - stored in 8 bit, 16 bit, or 32 bit units.
16 bit - 65,536 (64K) characters.
Allows for extending the character set.
Several ranges reserved for variety of international character sets.
Group of standards - some are obsolete.
UCS - Universal Character Set (fixed-width 16 or 32 bit characters: UCS-2, UCS-4)
UTF - Unicode Transformation Format (UTF-8 uses 1 to 4 bytes per character)
Unicode reserves ranges of values (blocks) for different language character sets; block sizes vary.
Unicode does not define the "glyph" (the actual shape), only which
"grapheme" (character id) is represented by a value.
The glyph is usually assigned at the software level (OS or application),
which in turn references a font table for the actual glyph.
The first 128 values match ASCII; the first 256 match ISO 8859-1 (Latin-1).
Because 16 bits are not enough to represent all languages, certain encodings
of Unicode (UTF-8, UTF-16) use a variable length of up to 4 bytes (32 bits).
Most systems support full font definitions for the first 16 bits but may require
loading additional fonts for expanded Unicode values (such as Chinese).
There is debate over what constitutes a grapheme in East Asian languages.
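The variable-length encodings mentioned above can be observed directly. This sketch encodes four sample characters (chosen only for illustration) and compares how many bytes each needs in UTF-8, UTF-16 and UTF-32.

```python
# Sample characters, written as escapes so the source file encoding does not matter.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

utf8_lengths = []
utf16_lengths = []
utf32_lengths = []
for ch in samples:
    utf8_lengths.append(len(ch.encode("utf-8")))
    utf16_lengths.append(len(ch.encode("utf-16-be")))  # "-be" avoids a byte order mark
    utf32_lengths.append(len(ch.encode("utf-32-be")))
    print(f"U+{ord(ch):04X}", utf8_lengths[-1], utf16_lengths[-1], utf32_lengths[-1])

# Values below 128 encode to the same single byte as ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))
```

The last sample lies beyond the first 16 bits, so UTF-16 needs two 16 bit units (a surrogate pair) for it, while UTF-32 always uses 4 bytes.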