Private Use Areas
In Unicode, a Private Use Area (PUA) is a range of code points that, by definition, will not be assigned characters by the Unicode Consortium.[1] Currently, three private use areas are defined: one in the Basic Multilingual Plane (U+E000–U+F8FF), and one each in, and nearly covering, planes 15 and 16 (U+F0000–U+FFFFD, U+100000–U+10FFFD). The code points in these areas cannot be considered as standardized characters in Unicode itself. They are intentionally left undefined so that third parties may define their own characters without conflicting with Unicode Consortium assignments. Under the Unicode Stability Policy,[2] the Private Use Areas will remain allocated for that purpose in all future Unicode versions.
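The three ranges translate directly into a membership test. The following is a minimal sketch in Python; the function name `private_use_area` and the constant `PRIVATE_USE_AREAS` are illustrative choices, not part of any standard API.

```python
from typing import Optional

# Inclusive code point ranges of the three Private Use Areas,
# labelled with their Unicode block names.
PRIVATE_USE_AREAS = (
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xF0000, 0xFFFFD, "Supplemental Private Use Area-A"),
    (0x100000, 0x10FFFD, "Supplemental Private Use Area-B"),
)


def private_use_area(code_point: int) -> Optional[str]:
    """Return the block name of the PUA containing code_point, or None."""
    for first, last, block_name in PRIVATE_USE_AREAS:
        if first <= code_point <= last:
            return block_name
    return None


print(private_use_area(0xE000))    # Private Use Area
print(private_use_area(0xFFFFE))   # None: just past the end of the plane 15 area
print(private_use_area(ord("A")))  # None
```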
Assignments to Private Use Area characters need not be "private" in the sense of strictly internal to an organisation; a number of assignment schemes have been published by several organisations. Such publication may include a font that supports the definition (showing the glyphs), and software making use of the private-use characters (e.g. a graphics character for a "print document" function). By definition, multiple private parties may assign different characters to the same code point, with the consequence that a user may see one private character from an installed font where a different one was intended.
Definition
Under the Unicode definition, code points in the Private Use Areas are assigned characters; they are not noncharacters, reserved, or unassigned. Their category is "Other, private use (Co)", and no character names are specified. No representative glyphs are provided, and character semantics are left to private agreement.
Private-use characters are assigned Unicode code points whose interpretation is not specified by this standard and whose use may be determined by private agreement among cooperating users. These characters are designated for private use and do not have defined, interpretable semantics except by private agreement.…
No charts are provided for private-use characters, as any such characters are, by their very nature, defined only outside the context of this standard.[3]
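These properties can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration: the helper name `describe` is an assumption, and the results reflect the Unicode database bundled with the interpreter in use.

```python
import unicodedata
from typing import Optional, Tuple


def describe(code_point: int) -> Tuple[str, Optional[str]]:
    """Return (general category, character name or None) for a code point."""
    ch = chr(code_point)
    # The second argument to name() is returned when no name is defined.
    return unicodedata.category(ch), unicodedata.name(ch, None)


for cp in (0xE000, 0xF8FF, 0xF0000, 0x10FFFD):
    category, name = describe(cp)
    print(f"U+{cp:04X} category={category} name={name}")

# A noncharacter at the end of plane 15 is not private use.
print(describe(0xFFFFE)[0])
```

Each private-use code point reports category `Co` and no name, while the noncharacter U+FFFFE reports `Cn`.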
Assignment
In the Basic Multilingual Plane (plane 0), the block titled Private Use Area has 6400 code points. Planes 15 and 16 are almost[note 1] entirely assigned to two further Private Use Areas, Supplemental Private Use Area-A and Supplemental Private Use Area-B respectively.
In order to encode characters from planes 15 and 16 in UTF-16, a further block of the BMP is assigned to High Private Use Surrogates (U+DB80..U+DBFF, 128 code points).
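The link between planes 15 and 16 and the High Private Use Surrogates block follows from the UTF-16 surrogate arithmetic. A minimal sketch, with `utf16_surrogates` as a hypothetical helper name:

```python
from typing import Tuple


def utf16_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


for cp in (0xF0000, 0xFFFFD, 0x100000, 0x10FFFD):
    high, low = utf16_surrogates(cp)
    print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```

The first and last private-use code points of planes 15 and 16 give high surrogates DB80 and DBFF, which are the bounds of the 128-code-point block.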
Unicode: Private Use Areas (defined by character property)
Range | Plane | Block name | Number of code points | Note |
---|---|---|---|---|
U+E000..U+F8FF | BMP (0) | Private Use Area | 6400 | |
U+F0000..U+FFFFD | PUP (15) | Supplemental Private Use Area-A | 65534 | Based on block High Private Use Surrogates (U+DB80..U+DBFF) in BMP, using UTF-16. |
U+100000..U+10FFFD | PUP (16) | Supplemental Private Use Area-B | 65534 | |
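The code point counts follow from the range bounds, since both ends of each range are inclusive. A short check in Python (the variable names are illustrative):

```python
PLANE_SIZE = 0x10000  # every Unicode plane holds 65,536 code points

# Inclusive ranges: count = last - first + 1.
bmp_pua = 0xF8FF - 0xE000 + 1
supplementary_a = 0xFFFFD - 0xF0000 + 1
supplementary_b = 0x10FFFD - 0x100000 + 1

# Planes 15 and 16 each lose their last two code points to noncharacters.
assert supplementary_a == supplementary_b == PLANE_SIZE - 2

total = bmp_pua + supplementary_a + supplementary_b
print(bmp_pua, supplementary_a, supplementary_b, total)
```

This prints 6400, 65534, 65534 and a total of 137468 private-use code points.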
Usage
Standardization initiative uses
Many people and institutions have created character collections for the PUA. Some of these private use agreements are published, so other PUA implementers can aim for unused or less used code points to prevent overlaps. Several characters and scripts previously encoded in private use agreements have actually been fully encoded in Unicode, necessitating mappings from the PUA to other Unicode code points.
One of the better-known and more broadly implemented PUA agreements is maintained by the ConScript Unicode Registry (CSUR). The CSUR, which is not officially endorsed by or associated with the Unicode Consortium, provides a mapping for constructed scripts, such as Klingon pIqaD and Ferengi script (Star Trek), Tengwar and Cirth (J.R.R. Tolkien's cursive and runic scripts), Alexander Melville Bell's Visible Speech, and Dr. Seuss' alphabet from On Beyond Zebra. The CSUR previously encoded the undeciphered Phaistos characters, as well as the Shavian and Deseret alphabets, which have all been accepted for official encoding in Unicode.
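When a script moves from a private agreement into Unicode proper, existing text has to be remapped. The sketch below shows one way to do this in Python with `str.translate`. The standard Shavian block (U+10450–U+1047F) is real, but the legacy PUA start `LEGACY_PUA_START` is a hypothetical value chosen for illustration and has not been checked against any published agreement.

```python
# Assumption: a legacy agreement placed the 48 Shavian letters in one
# contiguous PUA run starting here. This value is illustrative only.
LEGACY_PUA_START = 0xE700

# The Shavian block in Unicode: U+10450..U+1047F (48 code points).
SHAVIAN_START = 0x10450
SHAVIAN_LENGTH = 0x1047F - 0x10450 + 1

# str.translate accepts a dict mapping source ordinals to target ordinals.
PUA_TO_STANDARD = {
    LEGACY_PUA_START + i: SHAVIAN_START + i for i in range(SHAVIAN_LENGTH)
}


def migrate(text: str) -> str:
    """Replace legacy private-use code points with their standard ones."""
    return text.translate(PUA_TO_STANDARD)


legacy = "\ue700\ue701 stays ASCII"
migrated = migrate(legacy)
print([f"U+{ord(ch):04X}" for ch in migrated[:2]])
```

Characters outside the mapping, including private-use code points belonging to other agreements, pass through unchanged.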
Another common PUA agreement is maintained by the Medieval Unicode Font Initiative (MUFI). This project is attempting to support all of the scribal abbreviations, ligatures, precomposed characters, symbols, and alternate letterforms found in medieval texts written in the Latin alphabet. The express purpose of MUFI is to experimentally determine which characters are necessary to represent these texts, and to have those characters officially encoded in Unicode. As of Unicode version 5.1, 152 MUFI characters have been incorporated into the official Unicode encoding.
Some agreed-upon PUA character collections exist in part or whole because the Unicode Consortium is in no hurry to encode them. Some, such as unrepresented languages, are likely to end up encoded in the future. Some unusual cases such as fictional languages are outside the usual scope of Unicode but not explicitly ruled out by the principles of Unicode, and may show up eventually (such as the Star Trek and Tolkien writing systems). In other cases, the proposed encoding violates one or more Unicode principles and hence is unlikely to ever be officially recognized by Unicode. This is mostly the case where users want to directly encode alternate forms, ligatures, or base-character-plus-diacritic combinations (such as the TUNE scheme).
Publishing organisation | Topic | PUA area used | Font |
---|---|---|---|
CSUR | Artificial scripts | PUA (BMP) and Plane 15 | Code2000 |
MUFI | Medieval scripts | PUA (BMP) | several |
SIL | Phonetics and languages | PUA (BMP) | Charis SIL |
TITUS | Ancient and medieval scripts | PUA (BMP) | TITUS Cyberbit Basic |
- Emoji is an encoding for picture characters or emoticons used in Japanese wireless messages and webpages. With Unicode 6.0 and later, many of these have been encoded in the block Miscellaneous Symbols and Pictographs and elsewhere in the SMP.
- GB/T 20542-2006 ("Tibetan Coded Character Set Extension A") and GB/T 22238-2008 ("Tibetan Coded Character Set Extension B") are Chinese national standards that use the PUA to encode precomposed Tibetan ligatures.
- The Institute of the Estonian Language uses the PUA to encode Latin and Cyrillic precomposed characters[4] that have no Unicode encoding.
- The Free Tengwar Font Project uses a different mapping from the ConScript Unicode Registry that largely follows Michael Everson’s 2001-03-07 Tengwar discussion paper, but diverges in some details.
- The MARC 21 standard uses the PUA to encode East Asian characters present in MARC-8[5] that have no Unicode encoding.
- The SIL Corporate PUA uses the PUA to encode characters used in minority languages that have not yet been accepted into Unicode.
- The STIX Fonts project uses the PUA to provide a comprehensive font set of mathematical symbols and alphabets, many of which are also available in the SMP now, e.g. in the Mathematical Alphanumeric Symbols block.
- The Tamil Unicode New Encoding (TUNE)[6] is a proposed scheme for encoding Tamil that overcomes perceived deficiencies in the current Unicode encoding.
Vendor use
Informally, the range U+F000 through U+F8FF is known as the Corporate Use Area.
- The Adobe Glyph List used to use the PUA for some of its glyphs.
- In its developer documentation,[7] Apple lists a range of 1,280 code points (U+F400–U+F8FF) within the PUA for Apple’s use. Of those, only 311 are used in the range U+F700–U+F8FF.[8]
- WGL4 uses the PUA (U+F001 and U+F002) to encode duplicates of the ligatures fi (U+FB01) and fl (U+FB02).[9]
- In old versions of its RichEdit component, Microsoft mapped U+F020–U+F0FF within the PUA to symbol fonts. For any character in this range, RichEdit would show a character from a symbol font instead of the end-user-defined character (EUDC).[10][11]
- AutoCAD uses U+F8FC–U+F8FE for ⌀ (diameter sign), ± (plus-minus sign) and ° (degree sign) respectively.
- Some fonts place the Windows logo key at U+F000.
- On Ubuntu, U+E0FF is displayed as the "Circle Of Friends" logo and U+F200 is "ubuntu" in the Ubuntu typeface with a superscripted "Circle Of Friends" (this itself is U+F0FF).
- The 3270 font includes the Debian logo at U+F100.
- Especially with the Linux Libertine font, U+E000 displays Tux, the mascot of Linux.
- The Font Awesome icon font utilizes the PUA to display various glyphs.
- Powerline, a status line plugin for vim, uses U+E0A0–U+E0A2 and U+E0B0–U+E0B3 for extra box-drawing characters.[12][13]
- In the Fira Sans typeface used in Firefox OS, U+E003 is displayed as the Mozilla logo (the dinosaur head).
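Because the rendering of these code points depends entirely on the installed font, it can be useful to locate private-use characters in a piece of text before sharing it. The following Python sketch does so by checking the general category; `find_private_use` is a hypothetical helper, and the sample string uses two of the Powerline code points listed above.

```python
import unicodedata
from typing import List, Tuple


def find_private_use(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # Other, private use
    ]


sample = "branch \ue0a0 main \ue0b0"
print(find_private_use(sample))  # [(7, 'U+E0A0'), (14, 'U+E0B0')]
```

A report like this says only that the text relies on some private agreement; it cannot say which one.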
U+F8FF
The Unicode code point U+F8FF is the last code point in the Private Use Area of the BMP. Its meaning and appearance vary depending on the font in use, but its usage in several fonts makes it the most notable code point in the Private Use Area.
U+F8FF usage examples
- The dingbats font "DavysDingbats" uses it to display a face, presumably that of the font's creator.
- In most Apple-supplied fonts, it represents the Apple logo, or an early version of the command key.
- Some early Tengwar fonts map Elvish characters to it.
- The Imitari font draws it as a capital eth (Ð).
- The font Luxi draws it as the euro sign.
- The font "Standard Symbols L" uses it as one of the box drawing characters.
- The official PRC standard on precomposed Tibetan uses the code point for the Tibetan syllable hwo.
- The ConScript Unicode Registry suggests it be used for the Klingon glyph "KLINGON MUMMIFICATION GLYPH".
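The list above amounts to a lookup keyed by agreement rather than by code point alone. A minimal sketch in Python, using a few of the entries listed above; the names `U_F8FF_BY_AGREEMENT` and `interpret_u_f8ff` are illustrative only.

```python
from typing import Dict

# Interpretations of U+F8FF, keyed by the font or agreement that
# defines them. Entries are taken from the usage examples above.
U_F8FF_BY_AGREEMENT: Dict[str, str] = {
    "Apple-supplied fonts": "Apple logo",
    "Imitari": "capital eth",
    "Luxi": "euro sign",
    "ConScript Unicode Registry": "KLINGON MUMMIFICATION GLYPH",
}


def interpret_u_f8ff(agreement: str) -> str:
    """Look up what U+F8FF means under a given private agreement."""
    return U_F8FF_BY_AGREEMENT.get(
        agreement, "undefined without a private agreement"
    )


print(interpret_u_f8ff("Luxi"))
print(interpret_u_f8ff("some other font"))
```

The code point by itself carries no meaning; the lookup succeeds only when sender and receiver share the same agreement.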
Private-use characters in other character sets
The concept of reserving specific code points for Private Use is based on similar earlier usage in other character sets. In particular, many otherwise obsolete characters in East Asian scripts continue to be used in specific names or other situations, and so some character sets for those scripts made allowance for private-use characters (such as the user-defined planes of CNS 11643, or gaiji in certain Japanese encodings). The Unicode standard references these uses under the name "End User Character Definition" (EUCD).[3]
Additionally, the C1 control block contains two codes that ECMA-48 intends for private-use "control functions": 0x91 private use one (PU1) and 0x92 private use two (PU2).[14][15] Unicode includes these at U+0091 <control-0091> and U+0092 <control-0092> but defines them as control characters (category Cc), not private-use characters (category Co).[16][17]
Notes
- [note 1] The last two code points of every plane are defined to be noncharacters. The remaining 65,534 code points of each of planes 15 and 16 are assigned as private-use characters.
References
- [1] Unicode Consortium. Glossary of Unicode Terms: "Private Use Area (PUA)".
- [2] "Unicode Character Encoding Stability Policy". 2012-05-29. Retrieved 2012-08-15.
- [3] Unicode Standard, chapter 16.5, Private Use characters.
- [4] "Letter Database". Eki.ee. Retrieved 2013-04-11.
- [5] "Character Sets: East Asian Characters: Alternative Unicode Mappings for MARC 21 Characters Assigned to the Private Use Area (PUA): MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)". Loc.gov. 2004-09-02. Retrieved 2013-04-11.
- [6] "tunerfc.tn.nic.in". tunerfc.tn.nic.in. Retrieved 2013-04-11.
- [7] "NSCharacterSet Class Reference". Developer.apple.com. 2008-10-15. Retrieved 2013-04-11.
- [8] Registry (external version) of Apple use of Unicode corporate-zone characters.
- [9] See WGL4 Unicode Range U+2013 through U+FB02.
- [10] Microsoft Knowledge Base, The range of characters between U+F020 and U+F0FF in the Private Use Area of Unicode is mapped to symbol fonts in Richedit 4.1.
- [11] SIL International, Handling of PUA Characters in Microsoft Software.
- [12] Powerline status line plugin question on StackOverflow mentioning private use area characters.
- [13] Pictures showing private use area characters in Powerline patched fonts.
- [14] Standard ECMA-48, Fifth Edition - June 1991 §8.2.14 Miscellaneous control functions, §8.3.100, §8.3.101.
- [15] ISO C1 Control Character Set of ISO 6429 (1983).
- [16] UnicodeData.txt.
- [17] Unicode 6.1.0, Chapter 4, Table 4-9.