Each character is defined by a unique name and an integer value, referred to as its code point.
256 code points compose a code page.
A code point has an assigned numeric value and a name that never changes (even if it was assigned in error).
In the Unicode protocol, "code block" is the preferred term.

The visual representation of a code point is a glyph.
A font supplies specific instances of glyphs.
Glyphs are defined

A script groups the characters that make up one or more writing systems, and may span multiple blocks.
Scripts also identify common rules for use, such as left to right rather than vertical.
Also, in many writing systems, certain character sequences may be replaced by different symbols (glyphs).
Latin script supports English, French, German, etc., and uses left-to-right presentation.
A particular language may use characters found in one or more scripts.
Most language character sets use a single code page, but some span multiple pages.

Code pages/blocks are further grouped into code planes.
A plane is a grouping of 'similar' pages; each plane holds 65,536 code points (not bytes).
17 planes are identified.

Plane 0 - Basic Multilingual Plane (BMP). Contains code pages for current western languages, currency symbols, sub/superscripts, mathematical notation, etc.
Plane 0 also has 6400 code points for private use (user assignable).
* Plane 0: 0000-FFFF, 2 bytes.

Plane 1 - Supplementary Multilingual Plane. Largely historical scripts, also musical notation.
* Plane 1: 10000-1FFFF, 3 bytes.

Plane 2 - Supplementary Ideographic Plane. Chinese, Japanese, Korean (CJK).
* Plane 2: 20000-2FFFF, 3 bytes.

Planes 3-13 - unassigned.
* Planes 3-13: 30000-DFFFF, 3 bytes.

Plane 14 - Supplementary Special-purpose Plane.
Tag characters - special formatting flags.
Alternative glyphs for certain characters with no context clues.

Planes 15, 16 - private use, user assignable. But the user is expected to publish the assignment.
Example of a private-use assignment: Klingon - NOT defined or maintained by the Unicode Consortium.

Not all planes are in current use.
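The plane layout above is simple arithmetic on the code point value. A minimal Python sketch, assuming the 17-plane range 0000-10FFFF; the helper names plane_of and bytes_needed are hypothetical, chosen only for this illustration:

```python
PLANE_SIZE = 0x10000  # each plane holds 65,536 code points

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point lies outside the 17 planes")
    return code_point // PLANE_SIZE  # equivalent to code_point >> 16

def bytes_needed(code_point: int) -> int:
    """Smallest number of whole bytes that can hold the raw value."""
    return max(1, (code_point.bit_length() + 7) // 8)

# Sample code points: Latin A, a Han character, a musical symbol,
# the first code point of Plane 2, the first code point of Plane 15.
for cp in (0x0041, 0x65E5, 0x1D11E, 0x20000, 0xF0000):
    print(f"{cp:06X}: plane {plane_of(cp)}, fits in {bytes_needed(cp)} byte(s)")
```

Any value up to FFFF fits in 2 bytes, and any value from 10000 up to 10FFFF needs 3, which matches the byte counts listed per plane.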
In many planes, not all code points are used.
New code points and pages are constantly being assigned.

Universal Coded Character Set (ISO 10646, 1989-1990).
Universal Character Set ~ UCS
A standard maintained by ISO/IEC.
International Organization for Standardization (ISO) - 1926/1947
International Electrotechnical Commission (IEC) - 1906
ISO set code page standards for a number of years.
ISO code point assignments deal strictly with the 'grapheme' of the character.

UCS-2 (obsolete, 1990) defines code points as 2-byte fixed-length values.
Represented many modern printed languages, but not the full Chinese repertoire, and no historical languages.
Essentially the equivalent of the Unicode Basic Multilingual Plane (BMP).
BOM - byte order mark (big-endian vs. little-endian).

Amendments added UCS-4, a 31-bit version.
UCS-4 is defined as a 4-byte (31-bit) fixed-length value.
Contains ~136,000 'abstract' code points, with the ability to recognize > 1.1 million characters.
Fixed 4-byte units.
Highly wasteful of general storage or transmission bandwidth.
Sometimes used for a program's internal storage because of the fixed size.

UTF-1 defined variable-length code points of 1-5 bytes, but its poor implementation discouraged its use.

Unicode
Project by Joe Becker (Xerox), Lee Collins and Mark Davis (Apple) to design a universal character set. Others joined in.
Draft proposal ~1988: a 16-bit 'wide-body' character set to encode the world's living languages.
Introduced as an alternative to multiple independent code pages.
Maintained by the Unicode Consortium, incorporated in 1991.
Oct. 1991: first standard published.
First version similar to UCS-2: 16-bit, fixed size.
Tweaks in the protocol added some unused/restricted code points.
Eventually ISO and the Unicode Consortium merged their protocols, and they now work together.
Most current updates to Unicode come from the Consortium.

Differences
UCS only defines graphemes (meaning) of code points.
Unicode defines graphemes, glyphs (visual representation), and rules.
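The byte-order question noted for UCS-2 above can be illustrated with a short Python sketch. Python has no UCS-2 codec, so UTF-16 stands in here as an assumption; for BMP characters the 2-byte units are the same. The function name detect_order is hypothetical.

```python
import codecs

text = "A"  # code point 0041 (decimal 65)

# The same 2-byte value, stored in the two possible byte orders.
big = text.encode("utf-16-be")     # high byte first: 00 41
little = text.encode("utf-16-le")  # low byte first:  41 00

def detect_order(data: bytes) -> str:
    """Guess the byte order of 16-bit data from a leading BOM (FEFF)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    return "unknown (no BOM)"

print(big.hex(), little.hex())
print(detect_order(codecs.BOM_UTF16_BE + big))
print(detect_order(codecs.BOM_UTF16_LE + little))
print(detect_order(big))
```

Without the marker, a reader cannot tell 00 41 from 41 00 by inspection alone, which is why the BOM is placed at the start of the stream.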
Current code point count: 137,439 assigned code points
147 modern and historic scripts
Various symbols
Emojis

Code point definition. The Unicode protocol defines 3 aspects.

Grapheme - numeric representation, or code point, for each basic unit of spelling or formatting in the written form of the language. Decimal 65 is A.

Glyph - visual representation of a character. The glyph for A varies from typeface to typeface.
Repositories are collections of glyphs.
Repositories may span multiple code pages and even planes, but need not represent all code points in those planes.
Repositories are defined by a third party and are 'closed'; once defined they shouldn't change.
WGL4 provides recognized glyphs for Windows starting ~2004.

Rules - how certain character combinations should be represented.
Display direction: left to right, right to left, vertical.
Sort order or collating order.
BOM or byte-order-mark (#FEFF) handling. The BOM itself is a code point. Placed at the beginning of a stream of data, it allows the software to detect whether the data is big- or little-endian. Not required in UTF-8, but some software insists on it.
Normalization of forms - interpretation in context, e.g. is "vi" an editor or a Roman numeral?
In many languages, certain pairs of graphemes are represented by a unique glyph, although this can vary by context.
Etc.

Unicode currently defines several alternative encodings.

UTF-16 (UCS Transformation Format)
Defined as either a 2-byte or a 4-byte character; the 4-byte form is a pair of 2-byte units and reaches code points up to 10FFFF (not a full 31 bits).
Supersedes UCS-2.
Was the preferred format for Windows systems.

UTF-8
Designed to encode any code point (31 bit max. in the original design).
Originally encoded into 1-6 octets; since RFC 3629 (2003) limited to 1-4 octets, which covers all 17 planes.
Preferred format for the Internet (transmission). Also the current choice for most OSes.
1-octet values are reserved for US-ASCII; the high bit is set to zero.
For other code points (more than 1 octet), the 1st octet indicates the number of octets: it starts with as many 1 bits as there are octets in the sequence, followed by a zero bit, and the rest of the bits in the 1st octet are used for character bits.
All subsequent octets start with 10 binary, followed by 6 bits of the code point. The code point's bits are split into 6-bit groups starting from the low-order end, and the groups are stored most significant first.
A 2-octet sequence has the form 110xxxxx 10xxxxxx.

Given a Han character code point 65E5:
  65E5 = 0110 0101 1110 0101 binary
This requires a 3-octet encoding (1110xxxx 10xxxxxx 10xxxxxx).
  Split into 4 + 6 + 6 bits:       0110     010111     100101
  Add the prefixes:           1110 0110  10 010111  10 100101
  Regroup into 4-bit nibbles: 1110 0110  1001 0111  1010 0101
  Hex:                          E    6     9    7     A    5
The UTF-8 encoding is E6 97 A5.

UCS-4 uses a 4-byte fixed length to access all planes.
But all code values are 4 bytes wide.
+ a single fixed-length value allows usable arrays and tables in coding.
- requires a 32-bit word for every value. So a document with 64k characters takes 256k of storage.

UTF-32
4-byte encoding.
Essentially UCS-4 with Unicode semantics (glyphs and rules).
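The bit manipulation in the 65E5 example can be sketched in Python. This is a minimal illustration, not a full codec: the function name utf8_encode is hypothetical, it assumes the modern 1-4 octet limit, and it does not reject the surrogate range D800-DFFF the way a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (sketch, 1-4 octets)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:            # 0xxxxxxx (US-ASCII)
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        continuation, lead = 1, 0xC0
    elif code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        continuation, lead = 2, 0xE0
    elif code_point <= 0x10FFFF:     # 11110xxx plus three 10xxxxxx octets
        continuation, lead = 3, 0xF0
    else:
        raise ValueError("code point lies outside the 17 planes")

    tail = []
    for _ in range(continuation):
        # Take the low 6 bits and prefix them with binary 10.
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The bits left over go into the lead octet; the 6-bit groups were
    # collected low-order first, so reverse them for output.
    return bytes([lead | code_point] + tail[::-1])

encoded = utf8_encode(0x65E5)
print(encoded.hex(" ").upper())                   # E6 97 A5
print(encoded == chr(0x65E5).encode("utf-8"))     # True

# Storage cost of fixed 4-byte units for 64k ASCII characters.
sample = "A" * 65536
print(len(sample.encode("utf-8")), len(sample.encode("utf-32-be")))
```

The last line compares the two encodings on the 64k-character case: 65,536 bytes in UTF-8 against 262,144 bytes (256k) in UTF-32, the price paid for fixed-width indexing.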