
Unicode/UCCS

  Defines each character by a unique name and an integer value referred to
    as a code point.

  256 code points compose a code page.
    Code points have an assigned numeric value
    And a non-changing name (even if it was assigned in error).
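
    A minimal Python sketch of the name/value pairing, using the standard
    unicodedata module. The describe helper is a name made up for this
    illustration. U+FE18 is a character whose published name carries a
    misspelling (BRAKCET) that has been kept.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))        # U+0041 LATIN CAPITAL LETTER A
print(describe("\u65e5"))   # U+65E5 CJK UNIFIED IDEOGRAPH-65E5
print(describe("\ufe18"))   # name ends in the misspelled word BRAKCET
```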

  In Unicode protocol, code block is the preferred term.

  The visual representation of a code point is a glyph.
    A font is a specific collection of glyphs.
    Glyphs are defined 

  A script recognizes the characters that make up one or more writing systems,
    and may span multiple blocks.

    Scripts also identify common rules for use, such as left to right 
      rather than vertical.

      Also, in many writing systems, certain character sequences may be
        replaced by different symbols (glyphs).

    Latin script supports English, French, German, etc.
      And uses left to right presentation.

    A particular language may use characters found in one or more scripts.


  Most language character sets use a single code page,
    but some can span multiple pages.

  Code pages/blocks are further grouped into code planes.

  A plane is a grouping of 'similar' pages; each plane holds 65,536 (2^16)
    code points.

      17 planes identified.

        Plane 0 - Basic Multilingual Plane, (BMP)
          contains code pages for current Western languages, currency symbols,
          sub/super scripts, mathematical notation, etc.

          Plane 0 also has 6400 code points for private use.
            (user assignable).

          * Plane 0 0000-FFFF, 2 bytes.

        Plane 1 - Supplementary Multilingual Plane
          largely historical scripts, also musical notation.

          * Plane 1 10000-1FFFF, 3 bytes.

        Plane 2 - Supplementary Ideographic Plane.
          Chinese, Japanese, Korean (CJK).

          * Plane 2 20000-2FFFF, 3 bytes.

        Planes 3-13 - unassigned.

          * Planes 3-13 30000-DFFFF, 3 bytes.

        Plane 14 - Supplementary Special-purpose Plane.
          Tag characters - special formatting flags.
          Alternative glyphs for certain characters with no context clues.

        Plane 15,16 - private use, user assignable.
          But requires user to publish assignment.
 
          Example : Klingon (Plane 15) - 
            NOT defined or maintained by Unicode consortium.

      Not all planes are in current use.

      In many planes, not all code points are used.

      New code points and pages are constantly being assigned.
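
      As a sketch of how the plane layout falls out of the numbers: the
      plane is simply the code point divided by 65536 (shifted right 16
      bits). plane_of is a hypothetical helper name; the private use range
      E000-F8FF is the BMP range behind the 6400 figure above.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that holds a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the 17 Unicode planes")
    return code_point >> 16   # same as code_point // 65536

print(plane_of(0x0041))    # 0  (BMP)
print(plane_of(0x1F600))   # 1  (Supplementary Multilingual Plane)
print(plane_of(0x20000))   # 2  (Supplementary Ideographic Plane)
print(plane_of(0x10FFFF))  # 16 (last private use plane)

# Size of the BMP private use range E000-F8FF
bmp_private_use = 0xF8FF - 0xE000 + 1
print(bmp_private_use)     # 6400
```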


Universal Coded Character Set (ISO 10646, 1989-1990).
  Universal Character set ~ UCS

  A standard maintained by ISO/IEC.
   International Organization for Standardization - 1926/1947
   International Electrotechnical Commission - 1906

  ISO set code page standards for a number of years.

  ISO code point assignments deal strictly with the 'grapheme' of the
    character. 

  UCS-2 (obsolete, 1990) defines code points as 2 byte fixed length values.
    Represented many modern printed languages, but only part of the Chinese
      (Han) character repertoire.
    And no historical languages.

    Essentially equivalent to the Unicode Basic Multilingual Plane (BMP).

    BOM - Byte order mark (big-endian vs. little-endian)

    Amendments added UCS-4, a 31 bit version.

  UCS-4 defined as a 4 byte (31-bit) fixed length.

    Contains ~ 136,000 'abstract' code points.
      ability to recognize > 1.1 million characters

    Fixed 4-byte units.

    Highly wasteful of general storage or transmission bandwidth.

    Sometimes used for a program's internal storage because of fixed size.

  UTF-1 defined variable length encodings of 1-5 bytes per code point, but
    its poor design discouraged its use.

Unicode

  Project by Joe Becker (Xerox), Lee Collins and Mark Davis (Apple) to
    design a universal character set. Others joined in.

  Draft proposal ~ 1988
    16-bit wide-body character set to encode world's living languages.

  Introduced as an alternative to multiple independent code pages.

  Maintained by the Unicode Consortium, incorporated in 1991.

  Oct. 1991 first standard published.

  First version similar to UCS-2, 16-bit, fixed size.
    Tweaks in protocol added some unused/restricted code points.

  Eventually ISO and the Unicode Consortium merged their protocols and work
    together.

    Most current updates to Unicode come from the Consortium.

    Differences

    UCCS only defines graphemes (meaning) of code points.

    Unicode defines graphemes, glyphs (visual representation), and rules.
 
  Current code point count :
    137,439 assigned code points
      147 modern and historic scripts
      Various symbols
      Emojis

  Code point definition.

  Unicode protocol defines 3 aspects.

    Grapheme - numeric representation or code point for each basic unit of 
      spelling or formatting in the written form of the language.
      Decimal 65 is A.

    Glyph - Visual representation of character.
      Glyphs for A: A A A A (same character, different typefaces).

      Repositories are collections of glyphs. Repositories may span
        multiple code pages and even planes but not represent all code points
        in those planes.

        Repositories are defined by a third party and are 'closed', once
          defined shouldn't change. 

        WGL4
          provides recognized glyphs for Windows starting ~2004.

    Rules - how certain character combinations should be represented,
      Display direction, left to right, right to left, vertical.

      Sort order or collating order.

      BOM or byte-order-mark (U+FEFF) handling.
        The BOM itself is a code point.

        Placed at the beginning of a stream of data, it allows the software
          to detect if data is big or little endian.

        Not required in UTF-8, but some software insists on it.
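
        A rough Python sketch of BOM detection, using the BOM constants
        from the standard codecs module. sniff_bom is a hypothetical helper
        name, and real software usually falls back to other heuristics
        when no BOM is present.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom("\ufeffhi".encode("utf-16-le")))  # utf-16-le
print(sniff_bom("\ufeffhi".encode("utf-16-be")))  # utf-16-be
print(sniff_bom(b"hi"))                           # None
```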

      Normalization of forms - interpretation in context.
        e.g., is vi an editor or a Roman numeral?

        In many languages, certain pairs of graphemes are represented by a
          unique glyph, although this can vary by context.
      Etc.
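
      A small Python sketch of normalization with the standard
      unicodedata.normalize function. The sample strings are chosen for
      illustration only: a letter plus combining accent, the fi ligature,
      and the single-character Roman numeral six (U+2175).

```python
import unicodedata

# NFC composes a base letter and a combining accent into one code point.
decomposed = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))       # 2 1

# NFKC replaces compatibility characters with their plain equivalents.
ligature = unicodedata.normalize("NFKC", "\ufb01")   # LATIN SMALL LIGATURE FI
numeral = unicodedata.normalize("NFKC", "\u2175")    # SMALL ROMAN NUMERAL SIX
print(ligature, numeral)                    # fi vi
```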

  Unicode currently defines several alternative encodings.

    UTF-16 (UCCS Transformation Format)
       defined as either a single 2 byte unit or a pair of 2 byte units
       (a surrogate pair, 4 bytes), covering code points up to 10FFFF (21 bit).

      Supersedes UCS-2.

      Native format for Windows (2000 and newer) systems.
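
      A sketch of how UTF-16 splits a code point above FFFF into two
      16-bit units (a surrogate pair). utf16_units is a hypothetical helper
      written for illustration; in practice Python's own
      str.encode("utf-16-be") does this work.

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit code unit(s) that encode one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                     # BMP: one unit, value unchanged
    v = cp - 0x10000                    # 20 bits remain
    high = 0xD800 | (v >> 10)           # top 10 bits
    low = 0xDC00 | (v & 0x3FF)          # bottom 10 bits
    return [high, low]

print([f"{u:04X}" for u in utf16_units(0x65E5)])   # ['65E5']
print([f"{u:04X}" for u in utf16_units(0x1F600)])  # ['D83D', 'DE00']
```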

    UTF-8
      Originally designed to encode any code point (31 bit max.) in 1-6 octets.

      RFC 3629 (2003) restricts it to 1-4 octets, enough for every code point
        up to 10FFFF.

      Preferred format for Internet (transmission).

      Also current choice for most OSes.
 
      1 octet values reserved for US-ASCII
        High bit set to zero.

      For other code points (more than 1 octet), 1st octet indicates
        number of octets.

        Highest bit and additional bits set to 1 followed by zero bit to 
          indicate size, rest of bits in 1st octet used for character bits.
 
          All subsequent octets start with 10 binary followed by 6 bits of
            the code point, filled most significant bits first (big-endian).

          110xxxxx 10xxxxxx

          Given a Han character code point  65E5
            65E5 = 0110 0101 1110 0101 binary
            This requires a 3 octet encoding.
 
                 0110    01 0111    10 0101
            1110 0110  1001 0111  1010 0101
               E    6     9    7     A    5
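
          The worked example above can be reproduced with a short Python
          sketch. utf8_encode is a hypothetical helper written for
          illustration, limited to the 1-4 octet forms; Python's built-in
          str.encode("utf-8") is what one would normally use.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point into UTF-8 by hand (1-4 octets)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx + three 10xxxxxx octets
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = utf8_encode(0x65E5)
print(encoded.hex(" ").upper())         # E6 97 A5
```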

    UCS-4 uses a 4 byte fixed length to access all planes.
      But all code values are 4 bytes wide.
      + single fixed length value allows usable arrays and tables in coding.
      - requires a 32-bit word for all values.
        So, a document with 64k characters takes 256k of storage.

    UTF-32 4 byte encoding.
      Essentially UCS-4 with Unicode semantics (glyphs and rules)
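
      The storage cost can be checked with a quick Python sketch. It
      assumes Python's "utf-32-be" codec as a stand-in for UCS-4 (same
      4 byte units, no BOM written) and uses a 64k run of the letter A
      purely as sample data.

```python
text = "A" * 65536                       # 64k characters, all US-ASCII

size_utf8 = len(text.encode("utf-8"))        # 1 octet per character
size_utf16 = len(text.encode("utf-16-be"))   # 2 bytes per character
size_utf32 = len(text.encode("utf-32-be"))   # 4 bytes per character

print(size_utf8, size_utf16, size_utf32)     # 65536 131072 262144
```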