Unicode bidirectional algorithm

Algorithm for displaying text containing both left-to-right and right-to-left scripts

From Wikipedia, the free encyclopedia

The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.

Unicode Bidirectional Algorithm
Status	Active
Year started	1999
Latest version	Unicode 17.0.0 (Revision 51, 13 August 2025)
Organization	Unicode Consortium
Editors	Manish Goregaokar, Robin Leroy
Website	www.unicode.org/reports/tr9/

Background

Most writing systems display text from left to right, but several scripts—including Arabic, Hebrew, Thaana, and Syriac—are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, ambiguities arise in determining the correct display order of characters.

The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.

Directional Formatting Characters

The UBA defines several categories of special control characters used to influence text direction:

Implicit Directional Marks

Invisible, zero-width characters that behave like strong characters of the corresponding direction (ALM behaves like an Arabic letter) and have no other effect on the text:

Abbreviation	Code Point	Name
LRM	U+200E	LEFT-TO-RIGHT MARK
RLM	U+200F	RIGHT-TO-LEFT MARK
ALM	U+061C	ARABIC LETTER MARK
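
As a rough illustration, Python's standard `unicodedata` module exposes each character's bidirectional type, which shows that the three marks are treated as strong characters. The names `MARKS` and `mark_types` are chosen for this sketch only.

```python
import unicodedata

# Implicit directional marks from the table above.
MARKS = {
    "LRM": "\u200e",  # LEFT-TO-RIGHT MARK
    "RLM": "\u200f",  # RIGHT-TO-LEFT MARK
    "ALM": "\u061c",  # ARABIC LETTER MARK
}

def mark_types():
    """Map each mark's abbreviation to its bidirectional character type."""
    return {abbr: unicodedata.bidirectional(ch) for abbr, ch in MARKS.items()}

for abbr, bidi_type in mark_types().items():
    print(abbr, bidi_type)
# LRM L
# RLM R
# ALM AL
```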

Explicit Directional Embeddings

Signal that a piece of text is to be treated as embedded in a given direction:

Abbreviation	Code Point	Name
LRE	U+202A	LEFT-TO-RIGHT EMBEDDING
RLE	U+202B	RIGHT-TO-LEFT EMBEDDING

Explicit Directional Overrides

Force characters to be treated as strongly directional, overriding their implicit types:

Abbreviation	Code Point	Name
LRO	U+202D	LEFT-TO-RIGHT OVERRIDE
RLO	U+202E	RIGHT-TO-LEFT OVERRIDE

Explicit Directional Isolates

Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:

Abbreviation	Code Point	Name
LRI	U+2066	LEFT-TO-RIGHT ISOLATE
RLI	U+2067	RIGHT-TO-LEFT ISOLATE
FSI	U+2068	FIRST STRONG ISOLATE
PDI	U+2069	POP DIRECTIONAL ISOLATE
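
One common use of isolates is to wrap text of unknown direction (for example a user name) before inserting it into a larger string. The following is a minimal sketch of that idea; the helper names `isolate` and `greeting` are hypothetical and not part of any standard API.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(fragment: str) -> str:
    """Wrap a fragment in FSI ... PDI so that its direction is taken from
    its own first strong character and it does not affect its surroundings."""
    return f"{FSI}{fragment}{PDI}"

def greeting(user_name: str) -> str:
    """Insert an isolated user name into a left-to-right sentence."""
    return f"Hello, {isolate(user_name)}! You have 3 new messages."

# The name may be Latin, Hebrew, Arabic, or mixed; the sentence is built the same way.
print(greeting("\u05d3\u05e0\u05d9\u05d0\u05dc").encode("unicode_escape").decode("ascii"))
```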

Terminating Characters

Abbreviation	Code Point	Name	Terminates
PDF	U+202C	POP DIRECTIONAL FORMATTING	LRE, RLE, LRO, RLO
PDI	U+2069	POP DIRECTIONAL ISOLATE	LRI, RLI, FSI
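
The pairing in the table can be sketched as a small stack-based check that reports which initiators are still open at the end of a paragraph. This is a simplified illustration: it assumes that PDF closes only an embedding or override opened inside the current isolate and that PDI also closes anything opened inside the isolate it matches, and it ignores depth limits. The function name `unterminated_counts` is hypothetical.

```python
import unicodedata

EMBEDDING_OR_OVERRIDE = {"LRE", "RLE", "LRO", "RLO"}
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}

def unterminated_counts(paragraph: str) -> tuple[int, int]:
    """Return (open embeddings/overrides, open isolates) at the end of the paragraph."""
    stack = []  # "E" for an embedding or override, "I" for an isolate
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in EMBEDDING_OR_OVERRIDE:
            stack.append("E")
        elif bidi_type in ISOLATE_INITIATORS:
            stack.append("I")
        elif bidi_type == "PDF":
            # PDF pops only an embedding/override that is on top of the stack.
            if stack and stack[-1] == "E":
                stack.pop()
        elif bidi_type == "PDI":
            # PDI pops back to, and including, the nearest open isolate.
            if "I" in stack:
                while stack.pop() != "I":
                    pass
    return stack.count("E"), stack.count("I")
```

For example, `unterminated_counts("\u202eabc")` returns `(1, 0)` because the RLO is never closed by a PDF.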

The Algorithm

The UBA processes text in four main phases:

1. Paragraph Separation

Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.
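
A minimal sketch of this step, using Python's `unicodedata` module to find type B characters. Keeping each separator with the paragraph it ends, and treating CR followed by LF as a single separator, are simplifying assumptions of this example; the function name `split_paragraphs` is hypothetical.

```python
import unicodedata

def split_paragraphs(text: str) -> list[str]:
    """Split text after each paragraph separator (bidirectional type B)."""
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            # Assumption: CR LF counts as one separator, not two.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                current.append("\n")
                i += 1
            paragraphs.append("".join(current))
            current = []
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs

print(split_paragraphs("abc\ndef\u2029ghi"))
# ['abc\n', 'def\u2029', 'ghi']
```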

2. Initialization

Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
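
The type lookup can be illustrated with Python's `unicodedata.bidirectional`, which reads the same property from the Unicode Character Database bundled with the interpreter. Starting every character at the paragraph level in `initial_levels` is an assumption made for this sketch; both function names are hypothetical.

```python
import unicodedata

def bidi_types(paragraph: str) -> list[str]:
    """Return the bidirectional character type of each character."""
    return [unicodedata.bidirectional(ch) for ch in paragraph]

def initial_levels(paragraph: str, paragraph_level: int = 0) -> list[int]:
    """Create one embedding-level slot per character (assumed starting value)."""
    return [paragraph_level] * len(paragraph)

# Latin letter, European digit, space, Hebrew alef, Arabic alef, Arabic-Indic digit
sample = "a1 \u05d0\u0627\u0661"
print(bidi_types(sample))
# ['L', 'EN', 'WS', 'R', 'AL', 'AN']
```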

3. Resolving Embedding Levels

A series of rules resolves the embedding level of each character:

P2–P3: Determine the paragraph embedding level (0 for LTR, 1 for RTL); rule P1 is the paragraph split described in phase 1.
X1–X10: Assign explicit embedding levels based on directional formatting characters.
W1–W7: Resolve weak types (e.g., European numbers, separators).
N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
I1–I2: Resolve implicit embedding levels.
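
As an illustration of the first rule group, the paragraph embedding level can be sketched as a search for the first strong character, skipping anything inside an isolate. This reflects one reading of rules P2 and P3 and assumes the input is a single paragraph; the function name `paragraph_level` is hypothetical.

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}

def paragraph_level(paragraph: str) -> int:
    """Return 1 if the first strong character outside isolates is R or AL, else 0."""
    depth = 0  # number of isolates currently open
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_type == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0 and bidi_type in ("L", "R", "AL"):
            return 1 if bidi_type in ("R", "AL") else 0
    return 0  # no strong character found: default to left-to-right

print(paragraph_level("hello"))                           # 0
print(paragraph_level("\u05e9\u05dc\u05d5\u05dd hello"))  # 1
print(paragraph_level("\u2067\u05e9\u2069 hello"))        # 0 (the Hebrew letter is isolated)
```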

The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.^[1]
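
The depth limit matters when a formatting character opens a new level: right-to-left initiators move to the next odd level and left-to-right initiators to the next even level, and the initiator has no effect if that level would exceed the maximum. The sketch below shows only this arithmetic, ignoring the overflow counters the full rules use; `push_level` and its helpers are hypothetical names.

```python
MAX_DEPTH = 125

def next_odd_level(level: int) -> int:
    """Least odd level greater than the given level."""
    return (level + 1) | 1

def next_even_level(level: int) -> int:
    """Least even level greater than the given level."""
    return (level + 2) & ~1

def push_level(current: int, rtl: bool):
    """Return the new embedding level, or None if it would exceed MAX_DEPTH."""
    candidate = next_odd_level(current) if rtl else next_even_level(current)
    return candidate if candidate <= MAX_DEPTH else None

print(push_level(0, rtl=True))     # 1
print(push_level(1, rtl=False))    # 2
print(push_level(124, rtl=False))  # None (126 would exceed the limit)
```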

4. Reordering

Rules L1–L4 reorder characters on each line for display:

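L1: Resets segment separators, paragraph separators, and any whitespace that precedes them or ends the line to the paragraph embedding level.
L2: Working from the highest embedding level on the line down to the lowest odd level, reverses every contiguous sequence of characters that are at that level or higher.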
L3: Reorders combining marks relative to their base characters.
L4: Applies glyph mirroring to characters with the Bidi_Mirrored property when their resolved direction is right-to-left (e.g., "(" becomes ")").
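
Rule L2 is the core of the reordering step and can be sketched in a few lines, given the resolved level of each character on a line. The example follows the convention, common in bidi examples, of writing right-to-left letters as uppercase Latin; the function name `reorder_line` is hypothetical, and rules L1, L3, and L4 are not implemented here.

```python
def reorder_line(text: str, levels: list[int]) -> str:
    """Apply the reversal step of rule L2 to one line of text."""
    if len(text) != len(levels):
        raise ValueError("text and levels must have the same length")
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return text  # purely left-to-right line: nothing to reverse
    order = list(range(len(text)))  # order[k] = logical index shown at position k
    for level in range(max(levels), min(odd_levels) - 1, -1):
        start = 0
        while start < len(order):
            if levels[order[start]] >= level:
                end = start
                while end < len(order) and levels[order[end]] >= level:
                    end += 1
                # Reverse the whole run of characters at this level or higher.
                order[start:end] = order[start:end][::-1]
                start = end
            else:
                start += 1
    return "".join(text[i] for i in order)

print(reorder_line("car means CAR.", [0] * 10 + [1] * 3 + [0]))
# car means RAC.
print(reorder_line("car MEANS CAR.", [2] * 3 + [1] * 11))
# .RAC SNAEM car
```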

Bidirectional Character Types

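Characters are classified into the following categories. Explicit formatting characters (such as LRE or RLI) have bidirectional types of their own, which are not listed here: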

Category	Type	Description
Strong	L	Left-to-Right (e.g., Latin, Han)
	R	Right-to-Left (e.g., Hebrew)
	AL	Right-to-Left Arabic (e.g., Arabic, Syriac)
Weak	EN	European Number
	ES	European Number Separator
	ET	European Number Terminator
	AN	Arabic Number
	CS	Common Number Separator
	NSM	Nonspacing Mark
Neutral	B	Paragraph Separator
	S	Segment Separator
	WS	Whitespace
	ON	Other Neutrals
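
The table can be turned into a simple lookup. In this sketch any type that the table does not list (for example the types of the explicit formatting characters) is reported as "Other", which is a label invented for the example; `CATEGORY_BY_TYPE` and `bidi_category` are likewise hypothetical names.

```python
import unicodedata

# Category for each type listed in the table above.
CATEGORY_BY_TYPE = {
    "L": "Strong", "R": "Strong", "AL": "Strong",
    "EN": "Weak", "ES": "Weak", "ET": "Weak",
    "AN": "Weak", "CS": "Weak", "NSM": "Weak",
    "B": "Neutral", "S": "Neutral", "WS": "Neutral", "ON": "Neutral",
}

def bidi_category(ch: str) -> str:
    """Return the table category for a single character."""
    return CATEGORY_BY_TYPE.get(unicodedata.bidirectional(ch), "Other")

for ch in ["a", "1", "+", " ", "!", "\u202e"]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), bidi_category(ch))
```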

Conformance

A conforming implementation must:

Display all visible characters in the order described by the UBA (UAX9-C1).
Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).

Higher-Level Protocols

The UBA permits six higher-level protocol overrides (HL1–HL6), including:

HL1: Override the paragraph embedding level.
HL3: Emulate explicit directional formatting characters via markup (e.g., HTML dir attribute).
HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
HL6: Apply additional glyph mirroring beyond the standard Bidi_Mirrored property.

HTML and CSS Equivalents

On web pages, Unicode directional formatting characters can be replaced by HTML5 and CSS3 markup:

Unicode	HTML	CSS
RLI...PDI	`dir="rtl"`	`direction:rtl; unicode-bidi:isolate`
LRI...PDI	`dir="ltr"`	`direction:ltr; unicode-bidi:isolate`
FSI...PDI	`<bdi>`, `dir="auto"`	`unicode-bidi:plaintext`
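
When generating HTML from a program, the markup in the table can take the place of inserting the control characters directly. The following sketch builds such markup with Python's standard `html` module; the helper names `bdi` and `span_with_direction` are hypothetical.

```python
import html

def bdi(text: str) -> str:
    """Wrap untrusted text in <bdi> so that it is isolated from its surroundings."""
    return f"<bdi>{html.escape(text)}</bdi>"

def span_with_direction(text: str, direction: str) -> str:
    """Wrap text in a span with an explicit dir attribute."""
    if direction not in {"ltr", "rtl", "auto"}:
        raise ValueError("direction must be 'ltr', 'rtl', or 'auto'")
    return f'<span dir="{direction}">{html.escape(text)}</span>'

print(bdi("a<b"))
# <bdi>a&lt;b</bdi>
print(span_with_direction("x", "rtl"))
# <span dir="rtl">x</span>
```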

Security Considerations

The misuse of bidirectional formatting characters poses significant security risks, as they can be used to make malicious code or text appear benign. This is documented in Unicode Technical Report #36 (UTR36). Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
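
A simple defensive measure is to scan text such as source code or file names for explicit directional formatting characters before displaying or accepting it. The sketch below reports every such character it finds; it is an illustration rather than a complete defense, and the function name `find_bidi_controls` is hypothetical.

```python
import unicodedata

EXPLICIT_FORMATTING = {
    "LRE", "RLE", "LRO", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}

def find_bidi_controls(text: str) -> list[tuple[int, str, str]]:
    """Return (index, type, code point) for each explicit formatting character."""
    hits = []
    for index, ch in enumerate(text):
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in EXPLICIT_FORMATTING:
            hits.append((index, bidi_type, f"U+{ord(ch):04X}"))
    return hits

print(find_bidi_controls("a\u202eb\u202c"))
# [(1, 'RLO', 'U+202E'), (3, 'PDF', 'U+202C')]
```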

History

Unicode 1.0 (1991): Basic bidirectional support introduced.
Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
Unicode 17.0 (2025): Current version (Revision 51).
