Bi-directional text

Bi-directional text is text that contains both text directionalities: right-to-left (RTL or dextrosinistral) and left-to-right (LTR or sinistrodextral). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, in which the text direction changes with each row.

Some writing systems of the world, including the Arabic and Hebrew scripts or derived systems such as the Persian, Urdu, and Yiddish scripts, are written in a form known as right-to-left (RTL), in which writing begins at the right-hand side of a page and concludes at the left-hand side. This is different from the left-to-right (LTR) direction used by most languages in the world. When LTR text is mixed with RTL in the same paragraph, each type of text is written in its own direction, which is known as bi-directional text. This can get rather complex when multiple levels of quotation are used.

Many computer programs fail to display bi-directional text correctly. For example, the Hebrew name Sarah (שרה) is spelled sin (ש), which should appear rightmost, then resh (ר), and finally heh (ה), which should appear leftmost. A program that mishandles direction shows the letters in the opposite order (הרש).
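The difference between stored order and display order in the Sarah example can be sketched in Python. This is only an illustration for a string that is entirely right-to-left; the helper name left_to_right_slots is an assumption made for this sketch, not part of any library.

```python
import unicodedata

# The name Sarah as stored in memory (logical order): sin/shin, resh, heh.
SARAH = "\u05e9\u05e8\u05d4"


def left_to_right_slots(rtl_text):
    """List the characters of a purely RTL string by screen position,
    from the leftmost slot to the rightmost slot (hypothetical helper)."""
    return list(reversed(rtl_text))


# Logical order: the first stored letter is the first one read.
for ch in SARAH:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.bidirectional(ch)}")

# Visual order: the first stored letter ends up in the rightmost slot.
print([unicodedata.name(ch) for ch in left_to_right_slots(SARAH)])
```

A renderer that simply paints stored characters from left to right would place the first letter leftmost, which is the reversed form shown above.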

Note: Some web browsers may display the Hebrew text in this article in the opposite direction.

Bidirectional script support

Bidirectional script support is the capability of a computer system to correctly display bi-directional text. The term is often shortened to "BiDi" or "bidi".

Early computer installations were designed to support only a single writing system, typically a left-to-right script based on the Latin alphabet. Adding new character sets and character encodings enabled a number of other left-to-right scripts to be supported, but did not easily support right-to-left scripts such as Arabic or Hebrew, and mixing the two was not practical. Right-to-left scripts were introduced through encodings like ISO/IEC 8859-6 and ISO/IEC 8859-8, storing the letters (usually) in writing and reading order. It is possible to simply flip the left-to-right display order to a right-to-left display order, but doing this sacrifices the ability to correctly display left-to-right scripts. With bidirectional script support, it is possible to mix characters from different scripts on the same page, regardless of writing direction.

In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.

Unicode bidi support

^[1] The Unicode standard calls for characters to be ordered 'logically', i.e. in the sequence in which they are intended to be interpreted, as opposed to 'visually', the sequence in which they appear. This distinction is relevant for bidi support because at any bidi transition, the visual presentation ceases to be the 'logical' one. Thus, in order to offer bidi support, Unicode prescribes an algorithm for converting the logical sequence of characters into the correct visual presentation. For this purpose, the Unicode encoding standard assigns each of its characters to one of four types: 'strong', 'weak', 'neutral', and 'explicit formatting'.
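As a rough illustration of the four types, the sketch below looks up each character's Bidi_Class with Python's standard unicodedata module and maps it to a type. The grouping of classes into types follows the table of BiDi types in this article; the names STRENGTH and bidi_type are assumptions made for the example.

```python
import unicodedata

# Bidi_Class value -> type, following the grouping in this article's table.
STRENGTH = {
    "L": "strong", "R": "strong", "AL": "strong",
    "EN": "weak", "ES": "weak", "ET": "weak", "AN": "weak",
    "CS": "weak", "NSM": "weak", "BN": "weak",
    "B": "neutral", "S": "neutral", "WS": "neutral", "ON": "neutral",
    "LRE": "explicit", "LRO": "explicit", "RLE": "explicit",
    "RLO": "explicit", "PDF": "explicit", "LRI": "explicit",
    "RLI": "explicit", "FSI": "explicit", "PDI": "explicit",
}


def bidi_type(ch):
    """Return 'strong', 'weak', 'neutral' or 'explicit' for one character.

    unicodedata.bidirectional() returns an empty string for unassigned
    code points, which is reported here as 'unassigned'.
    """
    return STRENGTH.get(unicodedata.bidirectional(ch), "unassigned")


for sample in ["a", "\u05e9", "7", ",", " ", "\u202a"]:
    print(f"U+{ord(sample):04X}", unicodedata.bidirectional(sample), bidi_type(sample))
```

The result depends on the Unicode version bundled with the Python interpreter, so newly added characters may be reported as unassigned on older installations.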

strong characters

Strong characters are those with definite directionality. Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs, non-European or non-Arabic digits, and punctuation characters that are specific to only those scripts.

weak characters

Weak characters are those with vague directionality. Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols. Punctuation symbols that are common to many scripts, such as the colon, comma, full stop, and no-break space, also fall within this category.

numbers

^[1] Unless a directional override is present, numbers are always encoded (and entered) big-endian, and the numerals are rendered LTR. The weak directionality applies only to the placement of the number in its entirety.
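The big-endian storage of numbers can be checked with a small Python sketch: the first stored digit is the most significant one, whatever script the digits come from. The sample number is arbitrary, and the Bidi_Class values printed are those reported by the standard unicodedata module.

```python
import unicodedata

# "123" written with Arabic-Indic digits, stored most significant digit first.
arabic_indic = "\u0661\u0662\u0663"

# Python's int() accepts any Unicode decimal digits, reading them in stored order.
print(int(arabic_indic))

# The three digit families mentioned in this article and their Bidi_Class.
for digit in ["1", "\u0661", "\u06f1"]:
    print(f"U+{ord(digit):04X}", unicodedata.name(digit), unicodedata.bidirectional(digit))
```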

neutral characters

Neutral characters have directionality indeterminable without context. Examples include paragraph separators, tabs, and most other whitespace characters.

explicit formatting

Explicit formatting characters, also referred to as "directional formatting characters", are special Unicode characters that direct the bidirectional algorithm to modify its default behavior. These characters are subdivided into "marks", "embeddings", "isolates", and "overrides". Their effects continue until the occurrence of either a paragraph separator or a "pop" character.

marks

If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters. Such Unicode control characters are called marks. A mark, either U+200E LEFT-TO-RIGHT MARK (HTML &#8206; · &lrm; · LRM) or U+200F RIGHT-TO-LEFT MARK (HTML &#8207; · &rlm; · RLM), is inserted at a location to make an enclosed weak character inherit its writing direction.

For example, to correctly display the U+2122 ™ TRADE MARK SIGN for an English brand name (LTR) in an Arabic (RTL) passage, an LRM is inserted after the trademark symbol if the symbol is not followed by LTR text. Without the LRM, the ™ symbol, which has no strong direction of its own, is neighbored by a strong LTR character on one side and a strong RTL character on the other. Hence, in an RTL context, it is treated as RTL and displayed in the wrong position.
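The trademark case might be handled in code roughly as follows. This is a minimal sketch: the brand name "Acme", the Arabic sample words, and the helper ltr_fragment_for_rtl_context are all made up for illustration, and the code only builds the string; how it is drawn still depends on the renderer.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def ltr_fragment_for_rtl_context(fragment):
    """Append an LRM so a trailing symbol stays attached to the LTR text
    (hypothetical helper for this example)."""
    return fragment + LRM


brand = "Acme\u2122"                       # hypothetical brand ending in a trademark sign
arabic_before = "\u0645\u0646\u062a\u062c"  # arbitrary Arabic sample word
arabic_after = "\u062c\u062f\u064a\u062f"   # arbitrary Arabic sample word

passage = f"{arabic_before} {ltr_fragment_for_rtl_context(brand)} {arabic_after}"

# The mark is invisible, so print code points to see where it was placed.
print([f"U+{ord(ch):04X}" for ch in passage])
```

The LRM is itself a strong left-to-right character, so the trademark sign ends up between two strong LTR characters instead of between an LTR and an RTL one.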

embeddings

The "embedding" directional formatting characters are the classical Unicode method of explicit formatting and, as of Unicode 6.3, are discouraged in favor of "isolates". An "embedding" signals that a piece of text is to be treated as directionally distinct. The text within the scope of the embedding formatting characters is not independent of the surrounding text, and characters within an embedding can affect the ordering of characters outside it. Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use.

isolates

The "isolate" directional formatting characters signal that a piece of text is to be treated as directionally isolated from its surroundings. As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents – once target platforms are known to support them. These formatting characters were introduced after it became apparent that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use. Unlike the legacy 'embedding' directional formatting characters, 'isolate' characters have no effect on the ordering of the text outside their scope. Isolates can be nested, and may be placed within embeddings and overrides.
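A common use of isolates is wrapping text of unknown direction (for example a user name) before inserting it into a sentence. The sketch below shows one way this could look; the function name isolate and its direction parameter are assumptions for this example, while the code points are the isolate characters listed in this article's table.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in an isolate so it cannot reorder its surroundings.

    direction is "ltr", "rtl", or None; None uses FSI, which takes the
    direction from the first strong character of the wrapped text.
    """
    opener = {None: FSI, "ltr": LRI, "rtl": RLI}[direction]
    return f"{opener}{text}{PDI}"


user_name = "\u05e9\u05e8\u05d4"  # sample RTL name
message = "User " + isolate(user_name) + " posted 3 comments"
print([f"U+{ord(ch):04X}" for ch in message])
```

Because isolates can be nested, an already isolated string can safely be passed through isolate() again.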

overrides

The "override" directional formatting characters allow for special cases, such as part numbers (e.g. to force a part number made of mixed English, digits and Hebrew letters to be written from right to left), and should be avoided wherever possible. As is true of the other directional formatting characters, "overrides" can be nested one inside another, and inside embeddings and isolates.

pops

The "pop" directional formatting characters terminate the scope of the most recent "embedding", "override", or "isolate".
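Because every embedding, override, and isolate should be closed by a pop, a simple stack can report formatting characters that are still open at the end of a paragraph. The sketch below is a simplification and not the full Unicode algorithm: it assumes the input is a single paragraph, ignores nesting-depth limits, and the function name unclosed_bidi_controls is invented for this example.

```python
import unicodedata

EMBEDDINGS_AND_OVERRIDES = {"LRE", "RLE", "LRO", "RLO"}
ISOLATES = {"LRI", "RLI", "FSI"}


def unclosed_bidi_controls(paragraph):
    """Return the Bidi_Class names of formatting characters left open."""
    stack = []
    for ch in paragraph:
        cls = unicodedata.bidirectional(ch)
        if cls in EMBEDDINGS_AND_OVERRIDES or cls in ISOLATES:
            stack.append(cls)
        elif cls == "PDF":
            # PDF closes the most recent embedding or override, never an isolate.
            if stack and stack[-1] in EMBEDDINGS_AND_OVERRIDES:
                stack.pop()
        elif cls == "PDI":
            # PDI closes the most recent isolate and anything opened inside it.
            if any(entry in ISOLATES for entry in stack):
                while stack.pop() not in ISOLATES:
                    pass
    return stack


print(unclosed_bidi_controls("\u202babc\u202c"))  # balanced embedding
print(unclosed_bidi_controls("\u2067abc"))        # isolate never popped
```

A check like this can be useful when validating user input, since unterminated formatting characters keep affecting text up to the end of the paragraph.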

runs

In the algorithm, each sequence of concatenated strong characters is called a "run". A "weak" character that is located between two "strong" characters with the same orientation will inherit their orientation. A "weak" character that is located between two "strong" characters with different writing directions will inherit the main context's writing direction (in an LTR document the character will become LTR; in an RTL document it will become RTL).
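The inheritance rule described above can be sketched as follows. This is a deliberately simplified illustration of that rule only, not an implementation of the Unicode bidirectional algorithm: every character that is not strong (including digits) is treated the same way, explicit formatting is ignored, and the start and end of the text are treated as having the base direction. The function names are assumptions made for the example.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch):
    """Return "L" or "R" for strong characters, None for everything else."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "L"
    if cls in ("R", "AL"):
        return "R"
    return None


def resolve_directions(text, base="L"):
    """Give every character a direction using the simplified rule."""
    strong = [strong_direction(ch) for ch in text]
    resolved = []
    for i, direction in enumerate(strong):
        if direction is None:
            # Nearest strong neighbour on each side, or the base direction.
            before = next((d for d in reversed(strong[:i]) if d), base)
            after = next((d for d in strong[i + 1:] if d), base)
            direction = before if before == after else base
        resolved.append(direction)
    return resolved


def runs(text, base="L"):
    """Group consecutive characters that resolved to the same direction."""
    pairs = zip(text, resolve_directions(text, base))
    return [(d, "".join(ch for ch, _ in group))
            for d, group in groupby(pairs, key=lambda pair: pair[1])]


sample = "abc \u05d0\u05d1 def"
print(runs(sample, base="L"))
print(runs(sample, base="R"))
```

With an LTR base the spaces around the Hebrew letters join the Latin runs, while with an RTL base they join the Hebrew run, which mirrors the "main context" behaviour described in the paragraph.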

Table of possible BiDi-types

Bidirectional character type (Unicode character property Bidi_Class)^[1]

Type^[2]	Description	Strength	Directionality	General scope	Bidi_Control character^[3]
L	Left-to-Right	Strong	L-to-R	Most alphabetic and syllabic characters, Han ideographs, non-European or non-Arabic digits, LRM character, ...	U+200E LEFT-TO-RIGHT MARK (LRM)
R	Right-to-Left	Strong	R-to-L	Hebrew, Mandaic, Mende Kikakui, N'Ko, Samaritan, ancient scripts like Kharoshthi and Nabataean, RLM character, ...	U+200F RIGHT-TO-LEFT MARK (RLM)
AL	Arabic Letter	Strong	R-to-L	Arabic, Syriac, and Thaana alphabets, and most punctuation specific to those scripts, ALM character, ...	U+061C ARABIC LETTER MARK (ALM)
EN	European Number	Weak		European digits, Eastern Arabic-Indic digits, Coptic epact numbers, ...
ES	European Separator	Weak		plus sign, minus sign, ...
ET	European Number Terminator	Weak		degree sign, currency symbols, ...
AN	Arabic Number	Weak		Arabic-Indic digits, Arabic decimal and thousands separators, Rumi digits, ...
CS	Common Number Separator	Weak		colon, comma, full stop, no-break space, ...
NSM	Nonspacing Mark	Weak		Characters in General Categories Mark, nonspacing, and Mark, enclosing (Mn, Me)
BN	Boundary Neutral	Weak		Default ignorables, non-characters, control characters other than those explicitly given other types
B	Paragraph Separator	Neutral		paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination
S	Segment Separator	Neutral		Tabs
WS	Whitespace	Neutral		space, figure space, line separator, form feed, General Punctuation block spaces (smaller set than the Unicode whitespace list)
ON	Other Neutrals	Neutral		All other characters, including object replacement character
LRE	Left-to-Right Embedding	Explicit	L-to-R	LRE character only	U+202A LEFT-TO-RIGHT EMBEDDING (LRE)
LRO	Left-to-Right Override	Explicit	L-to-R	LRO character only	U+202D LEFT-TO-RIGHT OVERRIDE (LRO)
RLE	Right-to-Left Embedding	Explicit	R-to-L	RLE character only	U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
RLO	Right-to-Left Override	Explicit	R-to-L	RLO character only	U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
PDF	Pop Directional Format	Explicit		PDF character only	U+202C POP DIRECTIONAL FORMATTING (PDF)
LRI	Left-to-Right Isolate	Explicit	L-to-R	LRI character only	U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
RLI	Right-to-Left Isolate	Explicit	R-to-L	RLI character only	U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
FSI	First Strong Isolate	Explicit		FSI character only	U+2068 FIRST STRONG ISOLATE (FSI)
PDI	Pop Directional Isolate	Explicit		PDI character only	U+2069 POP DIRECTIONAL ISOLATE (PDI)
Notes
1.^ Unicode Bidirectional Algorithm (UAX#9), as of Unicode version 8.0.
2.^ Possible bidirectional character types for the character property Bidi_Class, or 'type'.
3.^ Bidi_Control characters: twelve Bidi_Control formatting characters are defined. They are invisible, and have no effect apart from directionality. Nine of them have a unique, overruling BiDi-type that is used by the algorithm. Their type is also their acronym (e.g. character 'LRE' has BiDi type 'LRE').
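The twelve Bidi_Control characters in the table can be cross-checked against the Unicode data shipped with Python. The sketch below is only a verification aid; the dictionary name BIDI_CONTROLS is an assumption, and the output reflects whichever Unicode version the interpreter bundles (the isolates and ALM require data from Unicode 6.3 or later).

```python
import unicodedata

# Acronym -> code point, copied from the table above.
BIDI_CONTROLS = {
    "LRM": 0x200E, "RLM": 0x200F, "ALM": 0x061C,
    "LRE": 0x202A, "LRO": 0x202D, "RLE": 0x202B, "RLO": 0x202E,
    "PDF": 0x202C,
    "LRI": 0x2066, "RLI": 0x2067, "FSI": 0x2068, "PDI": 0x2069,
}

for acronym, code_point in BIDI_CONTROLS.items():
    ch = chr(code_point)
    print(f"{acronym:4} U+{code_point:04X} {code_point:5d} "
          f"{unicodedata.bidirectional(ch):4} {unicodedata.name(ch)}")

# Controls whose Bidi_Class is the same as their acronym.
own_type = [a for a, cp in BIDI_CONTROLS.items()
            if unicodedata.bidirectional(chr(cp)) == a]
print(len(own_type), own_type)
```

The three marks are the exceptions: they carry the ordinary strong classes L, R, and AL rather than a class of their own.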

Scripts using bi-directional text

Egyptian hieroglyphs

Egyptian hieroglyphs could be written in either direction; the signs had a distinct "head" that faced the beginning of a line and a "tail" that faced the end.

Chinese characters

Chinese characters can be written in either direction as well as vertically (top to bottom then right to left), especially in signs (such as plaques), but the orientation of the individual characters is never changed. This can often be seen on tour buses in China, where the company name customarily runs from the front of the vehicle to its rear — that is, from right to left on the right side of the bus, and from left to right on the left side of the bus. English texts on the right side of the vehicle are also quite commonly written in reverse order. (See pictures of tour bus and post vehicle below.)

The right side (text runs from right to left)
The left side (text runs from left to right)
On the right side of this Hainan Airlines aircraft, the text runs from right to left ( 空航南海 ).
The left side, however, shows the text running from left to right ( 海南航空 ).
A photo that shows text on both sides of a China Post vehicle

Boustrophedon

Boustrophedon is a writing style found in ancient Greek inscriptions and in Hungarian runes. This method of writing alternates direction, and usually reverses the individual characters, on each successive line.
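The alternating line direction can be imitated on plain strings with a short Python sketch. It only reverses the order of characters on every second line; the mirroring of individual glyphs mentioned above cannot be expressed this way, and the naive reversal would also misplace combining marks. The function name boustrophedon is chosen for this example.

```python
def boustrophedon(lines):
    """Keep even-numbered lines as they are and reverse odd-numbered ones."""
    return [line if index % 2 == 0 else line[::-1]
            for index, line in enumerate(lines)]


for row in boustrophedon(["abc", "def", "ghi"]):
    print(row)
```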

References

[1] Unicode Standard Annex #9, Unicode Bidirectional Algorithm, section "Resolving Weak Types". http://www.unicode.org/reports/tr9/#Resolving_Weak_Types

External links

Unicode Standards Annex #9 The Bidirectional Algorithm
W3C guidelines on authoring techniques for bi-directional text - includes examples and good explanations
GNU FriBidi A free implementation of the Unicode bidirectional algorithm
ICU International Components for Unicode contains an implementation of the bidirectional algorithm — along with other internationalization services
UCData: "Pretty Good Bidi Algorithm Library" A small and fast bidirectional reordering algorithm that works reasonably well but is not necessarily compliant with the Unicode algorithm
Bidirectional Scripts in Desktop Software Working group for supporting BiDi in Free Software. Contains several links to readings and implementation regarding BiDi in computer systems.
Another Wiki about BiDi
Bidirectional text - Examples and practical advice
.Net BiDi Implementation
A freely available rather final version of Israeli standard 5194 - bidirectional text editing
Consequences of BiDi in Quran pages
Work in progress on new version of Bidi editing standard + reference implementation
Series of articles about pitfalls of BiDi programming
BidiRenderer — An application that illustrates the shaping and layout of complex text in bidirectional paragraphs using FriBidi, FreeType, and HarfBuzz
