Script (Unicode)

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.^[1] Some scripts support one and only one writing system and language, for example Armenian. Other scripts support many different writing systems; the Latin script, for example, supports English, French, German, Italian, Vietnamese, Latin itself, and many other languages. Some languages make use of multiple alternate writing systems and thus also use several scripts. Turkish, for example, was written in the Arabic script before the 20th century but switched to the Latin script in the early part of that century. For a list of languages supported by each script, see the list of languages by writing system. More or less complementary to scripts are symbols and Unicode control characters.

The unified diacritical characters and unified punctuation characters frequently have the "common" or "inherited" script property. However, the individual scripts often have their own punctuation and diacritics. As a result, many scripts include not only letters but also diacritics and other marks, punctuation, numerals, and even their own idiosyncratic symbols and space characters.

Unicode 8.0 includes over 80 modern scripts plus over 40 ancient (out of use a thousand years or more) and historic (out of use several hundred years) scripts.^[2]^[3] More scripts are in the process of being encoded or have been tentatively allocated for encoding in roadmaps.^[4]

Definition and classification

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a "Swedish O"), while English has no such character, nor does English use the combining ring above (U+030A) with any letter. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences between the Swedish and English writing systems, both are said to use the same Latin script. The Unicode abstraction of scripts is thus a basic organizing technique. The differences between alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.

Script versus writing system

"Writing system" is sometimes treated as a synonym for script. However, it can also refer to the specific concrete writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script; for example, the Japanese writing system makes use of the Han, Hiragana and Katakana scripts.

Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural. However, features of several of these categories may be found in a given writing system in varying proportions, which often makes it difficult to assign a system to a single category. The term complex system is sometimes used to describe systems whose admixture makes classification problematic.

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.
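
The Japanese example above, where one writing system draws on several scripts, can be illustrated with a rough sketch. Python's standard `unicodedata` module does not expose the Script property, so this sketch guesses a script from the first word of each character name. The `NAME_PREFIX_TO_SCRIPT` table and the `guess_scripts` function are hypothetical helpers for illustration only; a real implementation would read the Script property from the Unicode Character Database.

```python
import unicodedata

# Hypothetical mapping from the first word of a character name to a script
# label. This is a heuristic stand-in for the real Script property.
NAME_PREFIX_TO_SCRIPT = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "LATIN": "Latin",
}


def guess_scripts(text):
    """Return the set of script labels guessed from character names."""
    scripts = set()
    for ch in text:
        if ch.isspace():
            continue  # whitespace is skipped in this sketch
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        scripts.add(NAME_PREFIX_TO_SCRIPT.get(prefix, "Other"))
    return scripts


# Japanese text mixes three scripts; Vietnamese text uses only Latin.
print(sorted(guess_scripts("日本語のテキスト")))
print(sorted(guess_scripts("Ti\u1ebfng Vi\u1ec7t")))  # precomposed "Tiếng Việt"
```

The name-prefix heuristic breaks down for many characters (for instance punctuation and combining marks), which is one reason the Script property exists as a separate piece of data.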

Special script property values

In addition to explicit or specific script properties, Unicode uses three special values:^[5]

Common: Unicode can assign a character in the UCS to a single script only. However, many characters (those that are not part of a formal natural-language writing system or that are unified across many writing systems) may be used in more than one script, for example currency signs, symbols, numerals and punctuation marks. In these cases Unicode defines them as belonging to the "common" script (ISO 15924 code Zyyy).
Inherited: Many diacritics and non-spacing combining characters may be applied to characters from more than one script. In these cases Unicode assigns them to the "inherited" script (ISO 15924 code Zinh), which means that they have the same script class as the base character with which they combine, and so in different contexts they may be treated as belonging to different scripts. For example, U+0308 ̈ COMBINING DIAERESIS may combine with either U+0065 e LATIN SMALL LETTER E to create a Latin "ë", or with U+0435 е CYRILLIC SMALL LETTER IE for the Cyrillic "ё". In the former case it inherits the Latin script of the base character whereas in the latter case it inherits the Cyrillic script of the base character.
Unknown: The value of "unknown" script (ISO 15924 code Zzzz) is given to unassigned, private use, noncharacter, and surrogate code points.
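
The way an "inherited" character takes on the script of its base character can be sketched as follows. This is a minimal illustration, not the algorithm from the Unicode Standard: `TOY_SCRIPTS` is a hand-written table covering only the characters in the example above, and `resolve_scripts` is a hypothetical helper. Real script data would come from the Unicode Character Database.

```python
import unicodedata

# Toy script assignments for illustration only (an assumption of this sketch).
TOY_SCRIPTS = {
    "e": "Latin",           # U+0065 LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # U+0435 CYRILLIC SMALL LETTER IE
    "\u0308": "Inherited",  # U+0308 COMBINING DIAERESIS
}


def resolve_scripts(text):
    """Replace each 'Inherited' value with the script of the preceding base."""
    resolved = []
    last_base = "Unknown"
    for ch in text:
        # Characters missing from the toy table are reported as "Unknown",
        # which is a simplification of this sketch.
        script = TOY_SCRIPTS.get(ch, "Unknown")
        if script == "Inherited":
            script = last_base
        else:
            last_base = script
        resolved.append(script)
    return resolved


print(resolve_scripts("e\u0308"))       # diaeresis follows a Latin base
print(resolve_scripts("\u0435\u0308"))  # diaeresis follows a Cyrillic base

# The same mark composes into precomposed letters of different scripts.
print(unicodedata.name(unicodedata.normalize("NFC", "e\u0308")))
print(unicodedata.name(unicodedata.normalize("NFC", "\u0435\u0308")))
```

The last two lines use the standard library's NFC normalization to show that the combining diaeresis yields a Latin letter in one case and a Cyrillic letter in the other.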

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, and Unicode therefore discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as "other letter" or "modifier letter". Ideographs such as Unihan ideographs are also categorized as "other letter". A few scripts do differentiate between uppercase and lowercase, however, including Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts there are some letters that are neither uppercase nor lowercase.

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
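
The general categories described in this section can be inspected with Python's standard `unicodedata` module, which exposes the General_Category property through `unicodedata.category`. The sample characters and the `describe` helper below are chosen for illustration; the two-letter codes are the standard abbreviations (for example `Lu` for uppercase letter and `Lt` for titlecase letter).

```python
import unicodedata

SAMPLES = [
    "A",       # uppercase letter
    "a",       # lowercase letter
    "\u01f2",  # titlecase ligature mentioned above
    "\u02b0",  # a modifier letter
    "\u4e2d",  # a CJK ideograph ("other letter")
    "\u0308",  # combining diaeresis (a mark)
    "5",       # decimal digit
    " ",       # space separator
]


def describe(ch):
    """Format the code point, general category and name of one character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"


for ch in SAMPLES:
    print(describe(ch))
```

Note that the category is independent of the script: both a Latin and a Cyrillic capital letter report `Lu`.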

Table of scripts in Unicode

Unicode defines over a hundred script names (called "Alias" or "Property value alias"), based on the ISO 15924 list. Unicode uses the "Common" script name for ISO 15924's Zyyy (code for undetermined script), "Inherited" for ISO 15924's Zinh (code for inherited script), and "Unknown" for ISO 15924's Zzzz (code for uncoded script). Some ISO 15924 script codes are not used, among them Zsym (Symbols) and Zmth (Mathematical notation); these are not considered scripts in the Unicode sense.

**ISO 15924 script codes^[a]^[b] and Unicode^[c]^[d]**
ISO 15924			Script in Unicode^[e]
Code	No.	Name	Alias^[f]	Direction	Version	Characters	Remark
Adlm	166	Adlam		R-to-L			Approved for inclusion in a future version of the Unicode Standard^[6]^[7]
Afak	439	Afaka		L-to-R			Not in Unicode, proposal under review by the Unicode Technical Committee^[6]
Aghb	239	Caucasian Albanian	Caucasian Albanian	L-to-R	7.0	53	Ancient/historic
Ahom	338	Ahom, Tai Ahom	Ahom	L-to-R	8.0	57	Ancient/historic
Arab	160	Arabic	Arabic	R-to-L	1.0	1,257
Aran	161	Arabic (Nastaliq variant)		R-to-L			Typographic variant of Arabic
Armi	124	Imperial Aramaic	Imperial Aramaic	R-to-L	5.2	31	Ancient/historic
Armn	230	Armenian	Armenian	L-to-R	1.0	93
Avst	134	Avestan	Avestan	R-to-L	5.2	61	Ancient/historic
Bali	360	Balinese	Balinese	L-to-R	5.0	121
Bamu	435	Bamum	Bamum	L-to-R	5.2	657
Bass	259	Bassa Vah	Bassa Vah	L-to-R	7.0	36	Ancient/historic
Batk	365	Batak	Batak	L-to-R	6.0	56
Beng	325	Bengali	Bengali	L-to-R	1.0	93
Bhks	334	Bhaiksuki		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[6]
Blis	550	Blissymbols		L-to-R			Not in Unicode, proposal in initial/exploratory stage^[6]
Bopo	285	Bopomofo	Bopomofo	L-to-R	1.0	70
Brah	300	Brahmi	Brahmi	L-to-R	6.0	109	Ancient/historic
Brai	570	Braille	Braille	L-to-R	3.0	256
Bugi	367	Buginese	Buginese	L-to-R	4.1	30
Buhd	372	Buhid	Buhid	L-to-R	3.2	20
Cakm	349	Chakma	Chakma	L-to-R	6.1	67
Cans	440	Unified Canadian Aboriginal Syllabics	Canadian Aboriginal	L-to-R	3.0	710
Cari	201	Carian	Carian	L-to-R	5.1	49	Ancient/historic
Cham	358	Cham	Cham	L-to-R	5.1	83
Cher	445	Cherokee	Cherokee	L-to-R	3.0	172
Cirt	291	Cirth		L-to-R			Not in Unicode
Copt	204	Coptic	Coptic	L-to-R	1.0	137	Ancient/historic, Disunified from Greek in 4.1
Cprt	403	Cypriot	Cypriot	R-to-L	4.0	55	Ancient/historic
Cyrl	220	Cyrillic	Cyrillic	L-to-R	1.0	434
Cyrs	221	Cyrillic (Old Church Slavonic variant)		L-to-R			Not in Unicode
Deva	315	Devanagari (Nagari)	Devanagari	L-to-R	1.0	154
Dsrt	250	Deseret (Mormon)	Deseret	L-to-R	3.1	80
Dupl	755	Duployan shorthand, Duployan stenography	Duployan	L-to-R	7.0	143
Egyd	070	Egyptian demotic		R-to-L			Not in Unicode
Egyh	060	Egyptian hieratic		R-to-L			Not in Unicode
Egyp	050	Egyptian hieroglyphs	Egyptian Hieroglyphs	L-to-R	5.2	1,071	Ancient/historic
Elba	226	Elbasan	Elbasan	L-to-R	7.0	40	Ancient/historic
Ethi	430	Ethiopic (Geʻez)	Ethiopic	L-to-R	3.0	495
Geok	241	Khutsuri (Asomtavruli and Nuskhuri)	Georgian	L-to-R			Unicode groups Geok and Geor together as "Georgian"
Geor	240	Georgian (Mkhedruli)	Georgian	L-to-R	1.0	127	For Unicode, see also Geok
Glag	225	Glagolitic	Glagolitic	L-to-R	4.1	94	Ancient/historic
Goth	206	Gothic	Gothic	L-to-R	3.1	27	Ancient/historic
Gran	343	Grantha	Grantha	L-to-R	7.0	85	Ancient/historic
Grek	200	Greek	Greek	L-to-R	1.0	516
Gujr	320	Gujarati	Gujarati	L-to-R	1.0	85
Guru	310	Gurmukhi	Gurmukhi	L-to-R	1.0	79
Hanb	503	Han with Bopomofo (alias for Han + Bopomofo)		L-to-R			See Hani, Bopo
Hang	286	Hangul (Hangŭl, Hangeul)	Hangul	L-to-R	1.0	11,739	Hangul syllables relocated in 2.0
Hani	500	Han (Hanzi, Kanji, Hanja)	Han	L-to-R	1.0	81,734
Hano	371	Hanunoo (Hanunóo)	Hanunoo	L-to-R	3.2	21
Hans	501	Han (Simplified variant)		L-to-R			Subset Hani
Hant	502	Han (Traditional variant)		L-to-R			Subset Hani
Hatr	127	Hatran	Hatran	R-to-L	8.0	26	Ancient/historic
Hebr	125	Hebrew	Hebrew	R-to-L	1.0	133
Hira	410	Hiragana	Hiragana	L-to-R	1.0	91
Hluw	080	Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)	Anatolian Hieroglyphs	L-to-R	8.0	583	Ancient/historic
Hmng	450	Pahawh Hmong	Pahawh Hmong	L-to-R	7.0	127
Hrkt	412	Japanese syllabaries (alias for Hiragana + Katakana)	Katakana or Hiragana	L-to-R			See Hira, Kana
Hung	176	Old Hungarian (Hungarian Runic)	Old Hungarian	R-to-L	8.0	108	Ancient/historic
Inds	610	Indus (Harappan)		R-to-L			Not in Unicode, proposal in initial/exploratory stage^[6]
Ital	210	Old Italic (Etruscan, Oscan, etc.)	Old Italic	L-to-R	3.1	36	Ancient/historic
Jamo	284	Jamo (alias for Jamo subset of Hangul)		L-to-R			Subset Hang
Java	361	Javanese	Javanese	L-to-R	5.2	90
Jpan	413	Japanese (alias for Han + Hiragana + Katakana)		L-to-R			See Hani, Hira and Kana
Jurc	510	Jurchen		L-to-R			Not in Unicode
Kali	357	Kayah Li	Kayah Li	L-to-R	5.1	47
Kana	411	Katakana	Katakana	L-to-R	1.0	300
Khar	305	Kharoshthi	Kharoshthi	R-to-L	4.1	65	Ancient/historic
Khmr	355	Khmer	Khmer	L-to-R	3.0	146
Khoj	322	Khojki	Khojki	L-to-R	7.0	61	Ancient/historic
Kitl	505	Khitan large script		L-to-R			Not in Unicode
Kits	288	Khitan small script		T-to-B			Not in Unicode
Knda	345	Kannada	Kannada	L-to-R	1.0	87
Kore	287	Korean (alias for Hangul + Han)		L-to-R			See Hani and Hang
Kpel	436	Kpelle		L-to-R			Not in Unicode, proposal in initial/exploratory stage^[6]
Kthi	317	Kaithi	Kaithi	L-to-R	5.2	66	Ancient/historic
Lana	351	Tai Tham (Lanna)	Tai Tham	L-to-R	5.2	127
Laoo	356	Lao	Lao	L-to-R	1.0	67
Latf	217	Latin (Fraktur variant)		L-to-R			Typographic variant of Latin
Latg	216	Latin (Gaelic variant)		L-to-R			Typographic variant of Latin
Latn	215	Latin	Latin	L-to-R	1.0	1,349	See Latin script in Unicode
Leke	364	Leke		L-to-R			Not in Unicode
Lepc	335	Lepcha (Róng)	Lepcha	L-to-R	5.1	74
Limb	336	Limbu	Limbu	L-to-R	4.0	68
Lina	400	Linear A	Linear A	L-to-R	7.0	341	Ancient/historic
Linb	401	Linear B	Linear B	L-to-R	4.0	211	Ancient/historic
Lisu	399	Lisu (Fraser)	Lisu	L-to-R	5.2	48
Loma	437	Loma		L-to-R			Not in Unicode, proposal in initial/exploratory stage^[6]
Lyci	202	Lycian	Lycian	L-to-R	5.1	29	Ancient/historic
Lydi	116	Lydian	Lydian	R-to-L	5.1	27	Ancient/historic
Mahj	314	Mahajani	Mahajani	L-to-R	7.0	39	Ancient/historic
Mand	140	Mandaic, Mandaean	Mandaic	R-to-L	6.0	29
Mani	139	Manichaean	Manichaean	R-to-L	7.0	51	Ancient/historic
Marc	332	Marchen		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[6]^[7]
Maya	090	Mayan hieroglyphs					Not in Unicode
Mend	438	Mende Kikakui	Mende Kikakui	R-to-L	7.0	213
Merc	101	Meroitic Cursive	Meroitic Cursive	R-to-L	6.1	90	Ancient/historic
Mero	100	Meroitic Hieroglyphs	Meroitic Hieroglyphs	R-to-L	6.1	32	Ancient/historic
Mlym	347	Malayalam	Malayalam	L-to-R	1.0	100
Modi	324	Modi, Moḍī	Modi	L-to-R	7.0	79	Ancient/historic
Mong	145	Mongolian	Mongolian	T-to-B	3.0	153	Includes Clear, Manchu scripts
Moon	218	Moon (Moon code, Moon script, Moon type)					Not in Unicode, proposal in initial/exploratory stage^[6]
Mroo	199	Mro, Mru	Mro	L-to-R	7.0	43
Mtei	337	Meitei Mayek (Meithei, Meetei)	Meetei Mayek	L-to-R	5.2	79
Mult	323	Multani	Multani	L-to-R	8.0	38	Ancient/historic
Mymr	350	Myanmar (Burmese)	Myanmar	L-to-R	3.0	223
Narb	106	Old North Arabian (Ancient North Arabian)	Old North Arabian	R-to-L	7.0	32	Ancient/historic
Nbat	159	Nabataean	Nabataean	R-to-L	7.0	40	Ancient/historic
Newa	333	Newa, Newar, Newari, Nepāla lipi		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[6]^[7]
Nkgb	420	Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba)		L-to-R			Not in Unicode, proposal in initial/exploratory stage^[6]
Nkoo	165	N’Ko	NKo	R-to-L	5.0	59
Nshu	499	Nüshu		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[8]^[7]
Ogam	212	Ogham	Ogham		3.0	29	Ancient/historic
Olck	261	Ol Chiki (Ol Cemet’, Ol, Santali)	Ol Chiki	L-to-R	5.1	48
Orkh	175	Old Turkic, Orkhon Runic	Old Turkic	R-to-L	5.2	73	Ancient/historic
Orya	327	Oriya	Oriya	L-to-R	1.0	90
Osge	219	Osage		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[6]^[7]
Osma	260	Osmanya	Osmanya	L-to-R	4.0	40
Palm	126	Palmyrene	Palmyrene	R-to-L	7.0	32	Ancient/historic
Pauc	263	Pau Cin Hau	Pau Cin Hau	L-to-R	7.0	57
Perm	227	Old Permic	Old Permic	L-to-R	7.0	43	Ancient/historic
Phag	331	Phags-pa	Phags-pa	T-to-B	5.0	56	Ancient/historic
Phli	131	Inscriptional Pahlavi	Inscriptional Pahlavi	R-to-L	5.2	27	Ancient/historic
Phlp	132	Psalter Pahlavi	Psalter Pahlavi	R-to-L	7.0	29	Ancient/historic
Phlv	133	Book Pahlavi		R-to-L			Not in Unicode
Phnx	115	Phoenician	Phoenician	R-to-L	5.0	29	Ancient/historic
Piqd	293	Klingon (KLI pIqaD)		L-to-R			Rejected for inclusion in the Unicode Standard^[9]^[10]
Plrd	282	Miao (Pollard)	Miao	L-to-R	6.1	133
Prti	130	Inscriptional Parthian	Inscriptional Parthian	R-to-L	5.2	30	Ancient/historic
Qaaa	900	Reserved for private use (start)					Not in Unicode
Qaai	908	(Private use)					Not in Unicode (Before version 5.2, this was used instead of Zinh)
Qabx	949	Reserved for private use (end)					Not in Unicode
Rjng	363	Rejang (Redjang, Kaganga)	Rejang	L-to-R	5.1	37
Roro	620	Rongorongo					Not in Unicode, proposal in initial/exploratory stage^[6]
Runr	211	Runic	Runic	L-to-R	3.0	86	Ancient/historic
Samr	123	Samaritan	Samaritan	R-to-L	5.2	61
Sara	292	Sarati					Not in Unicode
Sarb	105	Old South Arabian	Old South Arabian	R-to-L	5.2	32	Ancient/historic
Saur	344	Saurashtra	Saurashtra	L-to-R	5.1	81
Sgnw	095	SignWriting	SignWriting	T-to-B	8.0	672
Shaw	281	Shavian (Shaw)	Shavian	L-to-R	4.0	48
Shrd	319	Sharada, Śāradā	Sharada	L-to-R	6.1	94
Sidd	302	Siddham, Siddhaṃ, Siddhamātṛkā	Siddham	L-to-R	7.0	92	Ancient/historic
Sind	318	Khudawadi, Sindhi	Khudawadi	L-to-R	7.0	69
Sinh	348	Sinhala	Sinhala	L-to-R	3.0	110
Sora	398	Sora Sompeng	Sora Sompeng	L-to-R	6.1	35
Sund	362	Sundanese	Sundanese	L-to-R	5.1	72
Sylo	316	Syloti Nagri	Syloti Nagri	L-to-R	4.1	44
Syrc	135	Syriac	Syriac	R-to-L	3.0	77
Syre	138	Syriac (Estrangelo variant)		R-to-L			Typographic variant of Syriac
Syrj	137	Syriac (Western variant)		R-to-L			Typographic variant of Syriac
Syrn	136	Syriac (Eastern variant)		R-to-L			Typographic variant of Syriac
Tagb	373	Tagbanwa	Tagbanwa	L-to-R	3.2	18
Takr	321	Takri, Ṭākrī, Ṭāṅkrī	Takri	L-to-R	6.1	66
Tale	353	Tai Le	Tai Le	L-to-R	4.0	35
Talu	354	New Tai Lue	New Tai Lue	L-to-R	4.1	83
Taml	346	Tamil	Tamil	L-to-R	1.0	72
Tang	520	Tangut		L-to-R			Approved for inclusion in a future version of the Unicode Standard^[6]^[7]
Tavt	359	Tai Viet	Tai Viet	L-to-R	5.2	72
Telu	340	Telugu	Telugu	L-to-R	1.0	96
Teng	290	Tengwar		L-to-R			Not in Unicode
Tfng	120	Tifinagh (Berber)	Tifinagh	L-to-R	4.1	59
Tglg	370	Tagalog (Baybayin, Alibata)	Tagalog	L-to-R	3.2	20
Thaa	170	Thaana	Thaana	R-to-L	3.0	50
Thai	352	Thai	Thai	L-to-R	1.0	86
Tibt	330	Tibetan	Tibetan	L-to-R	2.0	207	Added in 1.0, removed in 1.1 and reintroduced in 2.0
Tirh	326	Tirhuta	Tirhuta	L-to-R	7.0	82
Ugar	040	Ugaritic	Ugaritic	L-to-R	4.0	31	Ancient/historic
Vaii	470	Vai	Vai	L-to-R	5.1	300
Visp	280	Visible Speech		L-to-R			Not in Unicode
Wara	262	Warang Citi (Varang Kshiti)	Warang Citi	L-to-R	7.0	84
Wole	480	Woleai		R-to-L			Not in Unicode, proposal in initial/exploratory stage^[6]
Xpeo	030	Old Persian	Old Persian	L-to-R	4.1	50	Ancient/historic
Xsux	020	Cuneiform, Sumero-Akkadian	Cuneiform	L-to-R	5.0	1,234	Ancient/historic
Yiii	460	Yi	Yi	L-to-R	3.0	1,220
Zinh	994	Code for inherited script	Inherited	Inherited		563
Zmth	995	Mathematical notation		L-to-R			Not a 'script' in Unicode
Zsym	996	Symbols					Not a 'script' in Unicode
Zsye	993	Symbols (emoji variant)					Not a 'script' in Unicode
Zxxx	997	Code for unwritten documents					Not a 'script' in Unicode
Zyyy	998	Code for undetermined script	Common			7,179
Zzzz	999	Code for uncoded script	Unknown			993,309	All other code points
Notes
a. ISO 15924 publications, as of 17 June 2014
b. ISO 15924 Normative text file, as of 15 November 2014
c. ISO 15924 Changes (including Aliases for Unicode; as of 2014-11-15)
d. Unicode version 8.0
e. Unicode charts
f. Unicode uses the "Property Value Alias" (Alias) as the script name. These Alias names are part of Unicode and are published informatively next to ISO 15924.
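
Rows of the table above can be read programmatically. The following is a minimal sketch that assumes the rows are tab-separated in the column order shown in the header (Code, No., Name, Alias, Direction, Version, Characters, Remark) and that trailing empty cells may be missing. The `ScriptRow` class and `parse_row` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

COLUMNS = 8  # Code, No., Name, Alias, Direction, Version, Characters, Remark


@dataclass
class ScriptRow:
    code: str
    number: str                # kept as text to preserve leading zeros ("020")
    name: str
    alias: Optional[str]       # None when Unicode defines no alias
    direction: Optional[str]
    version: Optional[str]
    characters: Optional[int]  # None when no count is listed
    remark: Optional[str]


def parse_row(line):
    """Parse one tab-separated table row into a ScriptRow."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (COLUMNS - len(fields))  # pad missing trailing cells
    code, number, name, alias, direction, version, characters, remark = fields[:COLUMNS]
    return ScriptRow(
        code=code,
        number=number,
        name=name,
        alias=alias or None,
        direction=direction or None,
        version=version or None,
        # Counts use a comma as thousands separator, e.g. "1,257".
        characters=int(characters.replace(",", "")) if characters else None,
        remark=remark or None,
    )


row = parse_row("Arab\t160\tArabic\tArabic\tR-to-L\t1.0\t1,257")
print(row.alias, row.characters)
```

Keeping the ISO 15924 number as a string is a deliberate choice in this sketch, since codes such as 020 would otherwise lose their leading zero.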

References

↑ Glossary of Unicode Terms
↑ Unicode Character Database: Scripts
↑ "Chapter 14: Additional Ancient and Historic Scripts". The Unicode Standard, Version 6.2 (PDF). Mountain View, CA: Unicode, Inc. September 2012. p. 473. ISBN 978-1-936213-07-8.
↑ "Roadmaps to Unicode". http://www.unicode.org/roadmaps/
↑ Unicode Standard Annex #24: Unicode Script Property
↑ "Proposed New Scripts". Unicode Consortium. 2015-06-12. Retrieved 2015-07-16.
↑ "Roadmap to the SMP". Unicode Consortium. 2015-03-26. Retrieved 2015-05-22.
↑ "Proposed New Characters: Pipeline Table". Unicode Consortium. 2015-07-08. Retrieved 2015-07-16.
↑ Michael Everson (1997-09-18). "Proposal to encode Klingon in Plane 1 of ISO/IEC 10646-2".
↑ The Unicode Consortium (2001-08-14). "Approved Minutes of the UTC 87 / L2 184 Joint Meeting".
