Unicode bidirectional algorithm

Algorithm for displaying text containing both left-to-right and right-to-left scripts

The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.

Unicode Bidirectional Algorithm
Status: Active
Year started: 1999
Latest version: Unicode 17.0.0 (Revision 51, 13 August 2025)
Organization: Unicode Consortium
Editors: Manish Goregaokar, Robin Leroy
Website: www.unicode.org/reports/tr9/

Background

Most writing systems display text from left to right, but several scripts—including Arabic, Hebrew, Thaana, and Syriac—are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, ambiguities arise in determining the correct display order of characters.

The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.

Directional Formatting Characters

The UBA defines several categories of special control characters used to influence text direction:

Implicit Directional Marks

Invisible, zero-width characters that behave like strong characters of the corresponding direction. They influence how neighbouring characters are ordered without being displayed themselves:

Abbreviation  Code Point  Name
LRM           U+200E      LEFT-TO-RIGHT MARK
RLM           U+200F      RIGHT-TO-LEFT MARK
ALM           U+061C      ARABIC LETTER MARK
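
As a quick check of the table above, the marks can be looked up in Python's standard `unicodedata` module. This is only an illustrative sketch: the names `MARKS` and `describe_marks` are made up for this example, and the values returned depend on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

# The three implicit directional marks from the table above.
MARKS = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "ALM": "\u061c",
}

def describe_marks():
    """Return (abbreviation, code point, name, bidi class) for each mark."""
    rows = []
    for abbr, ch in MARKS.items():
        rows.append((
            abbr,
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.bidirectional(ch),  # strong type: L, R or AL
        ))
    return rows

for row in describe_marks():
    print(row)
```

Each mark reports a strong bidirectional type (L, R, or AL), which is why it can act as a directional anchor for adjacent neutral characters.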

Explicit Directional Embeddings

Signal that a piece of text is to be treated as embedded in a given direction:

Abbreviation  Code Point  Name
LRE           U+202A      LEFT-TO-RIGHT EMBEDDING
RLE           U+202B      RIGHT-TO-LEFT EMBEDDING

Explicit Directional Overrides

Force characters to be treated as strongly directional, overriding their implicit types:

Abbreviation  Code Point  Name
LRO           U+202D      LEFT-TO-RIGHT OVERRIDE
RLO           U+202E      RIGHT-TO-LEFT OVERRIDE

Explicit Directional Isolates

Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:

Abbreviation  Code Point  Name
LRI           U+2066      LEFT-TO-RIGHT ISOLATE
RLI           U+2067      RIGHT-TO-LEFT ISOLATE
FSI           U+2068      FIRST STRONG ISOLATE
PDI           U+2069      POP DIRECTIONAL ISOLATE

Terminating Characters

Abbreviation  Code Point  Name                        Terminates
PDF           U+202C      POP DIRECTIONAL FORMATTING  LRE, RLE, LRO, RLO
PDI           U+2069      POP DIRECTIONAL ISOLATE     LRI, RLI, FSI
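
The pairing of initiators and terminators can be illustrated with a small stack-based check. The sketch below is a simplification, not the full explicit-level rules: the function name `unterminated_openers` is hypothetical, it ignores the depth limit, and it assumes the input is a single paragraph. It follows the behaviour in UAX #9 that a PDF never closes an isolate, while a PDI also closes any embeddings or overrides opened inside the isolate it terminates.

```python
# Initiators grouped by the terminator that closes them (see the table above).
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"
PDI = "\u2069"

def unterminated_openers(paragraph):
    """Return the kinds of initiators still open at the end of the paragraph."""
    stack = []
    for ch in paragraph:
        if ch in EMBEDDING_OPENERS:
            stack.append("embedding")  # covers embeddings and overrides
        elif ch in ISOLATE_OPENERS:
            stack.append("isolate")
        elif ch == PDF:
            # PDF only closes an embedding/override opened in the same scope.
            if stack and stack[-1] == "embedding":
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest isolate plus anything opened inside it.
            if "isolate" in stack:
                while stack.pop() != "isolate":
                    pass
    return stack

print(unterminated_openers("\u2067abc\u2069"))  # []
print(unterminated_openers("\u202eabc"))        # ['embedding']
```

An empty result means every initiator was explicitly terminated. In the algorithm itself, anything left open is closed implicitly at the end of the paragraph.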

The Algorithm

The UBA processes text in four main phases:

1. Paragraph Separation

Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.

2. Initialization

Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
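
A minimal sketch of this initialization step, using Python's `unicodedata` module as a stand-in for the Unicode Character Database. The function name `initialize` and its `paragraph_level` parameter are assumptions made for illustration. Note that `unicodedata.bidirectional` returns an empty string when no value is defined, so a real implementation would need its own defaults for such code points.

```python
import unicodedata

def initialize(paragraph, paragraph_level=0):
    """Look up the bidi type of every character and start a level list.

    Every level starts at the paragraph level; the later rule groups
    are what actually resolve the final values.
    """
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    levels = [paragraph_level] * len(paragraph)
    return types, levels

# Latin letter, digit, space, Hebrew alef, Arabic alef
types, levels = initialize("a1 \u05d0\u0627")
print(types)   # ['L', 'EN', 'WS', 'R', 'AL']
print(levels)  # [0, 0, 0, 0, 0]
```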

3. Resolving Embedding Levels

A series of rules resolves the embedding level of each character:

  • P1–P3: Determine the paragraph embedding level (0 for LTR, 1 for RTL).
  • X1–X10: Assign explicit embedding levels based on directional formatting characters.
  • W1–W7: Resolve weak types (e.g., European numbers, separators).
  • N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
  • I1–I2: Resolve implicit embedding levels.

The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.[1]
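
The first rule group can be sketched briefly. Assuming no higher-level protocol sets the direction, rules P2 and P3 pick the paragraph level from the first strong character, skipping anything between an isolate initiator and its matching PDI. The function name `paragraph_level` is hypothetical, and the sketch covers only this rule group, not the X, W, N, or I rules.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"

def paragraph_level(paragraph):
    """Return 0 (LTR) or 1 (RTL) following rules P2 and P3."""
    depth = 0  # number of isolates currently open
    for ch in paragraph:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:  # an unmatched PDI is ignored
                depth -= 1
        elif depth == 0:
            bidi_type = unicodedata.bidirectional(ch)
            if bidi_type == "L":
                return 0
            if bidi_type in ("R", "AL"):
                return 1
    return 0  # no strong character found: default to left-to-right

print(paragraph_level("hello \u05d0"))  # 0
print(paragraph_level("\u05d0 hello"))  # 1
```

Digits and punctuation are not strong characters, so a paragraph that opens with a number still takes its direction from the first letter that follows.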

4. Reordering

Rules L1–L4 reorder characters on each line for display:

  • L1: Resets trailing whitespace and separators to the paragraph embedding level.
  • L2: Starting from the highest embedding level on the line and working down to the lowest odd level, reverses every contiguous sequence of characters that are at that level or higher.
  • L3: Reorders combining marks relative to their base characters.
  • L4: Applies glyph mirroring to characters with the Bidi_Mirrored property when their resolved direction is right-to-left (e.g., "(" becomes ")").
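
Rule L2 is the core of the reordering and can be sketched on its own. The function below assumes the embedding levels have already been resolved and are passed in explicitly; the name `reorder_line` is hypothetical, and rules L1, L3, and L4 are not covered. The sample inputs follow the convention used in UAX #9 examples, where uppercase letters stand in for right-to-left characters.

```python
def reorder_line(text, levels):
    """Apply rule L2 to one line whose levels are already resolved."""
    if len(text) != len(levels):
        raise ValueError("text and levels must have the same length")
    order = list(range(len(text)))  # logical indices in visual order
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if odd_levels:
        # From the highest level down to the lowest odd level...
        for level in range(max(levels), min(odd_levels) - 1, -1):
            start = 0
            while start < len(order):
                if levels[order[start]] >= level:
                    # ...reverse each run of characters at that level or higher.
                    end = start
                    while end < len(order) and levels[order[end]] >= level:
                        end += 1
                    order[start:end] = order[start:end][::-1]
                    start = end
                else:
                    start += 1
    return "".join(text[i] for i in order)

print(reorder_line("car means CAR.", [0] * 10 + [1, 1, 1, 0]))
# car means RAC.
print(reorder_line("<car MEANS CAR.=", [0, 2, 2, 2] + [1] * 11 + [0]))
# <.RAC SNAEM car=
```

In the second call the level-2 run "car" is reversed first and then reversed again as part of the surrounding level-1 run, which is how left-to-right text nested inside right-to-left text ends up reading correctly.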

Bidirectional Character Types

Characters are classified into the following categories:

Category  Type  Description
Strong    L     Left-to-Right (e.g., Latin, Han)
Strong    R     Right-to-Left (e.g., Hebrew)
Strong    AL    Right-to-Left Arabic (e.g., Arabic, Syriac)
Weak      EN    European Number
Weak      ES    European Number Separator
Weak      ET    European Number Terminator
Weak      AN    Arabic Number
Weak      CS    Common Number Separator
Weak      NSM   Nonspacing Mark
Neutral   B     Paragraph Separator
Neutral   S     Segment Separator
Neutral   WS    Whitespace
Neutral   ON    Other Neutrals
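
To see how these categories show up in practice, the following sketch tallies the characters of a string by category. The mapping simply restates the table above; the names `CATEGORY` and `category_counts` are made up for this example, and any type not listed in the table (such as the explicit formatting characters) is counted as "Other".

```python
import unicodedata
from collections import Counter

# Bidi type -> category, as listed in the table above.
CATEGORY = {
    "L": "Strong", "R": "Strong", "AL": "Strong",
    "EN": "Weak", "ES": "Weak", "ET": "Weak",
    "AN": "Weak", "CS": "Weak", "NSM": "Weak",
    "B": "Neutral", "S": "Neutral", "WS": "Neutral", "ON": "Neutral",
}

def category_counts(text):
    """Count the characters of text per bidi category."""
    return Counter(
        CATEGORY.get(unicodedata.bidirectional(ch), "Other") for ch in text
    )

# Latin letter, digit, space, comma, Hebrew alef
print(category_counts("a1 ,\u05d0"))
```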

Conformance

A conforming implementation must:

  • Display all visible characters in the order described by the UBA (UAX9-C1).
  • Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).

Higher-Level Protocols

The UBA permits six higher-level protocol overrides (HL1–HL6), including:

  • HL1: Override the paragraph embedding level.
  • HL3: Emulate explicit directional formatting characters via markup (e.g., HTML dir attribute).
  • HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
  • HL6: Apply additional glyph mirroring beyond the standard Bidi_Mirrored property.

HTML and CSS Equivalents

On web pages, Unicode directional formatting characters can be replaced by HTML5 markup or CSS3 properties:

Unicode    HTML               CSS
RLI...PDI  dir="rtl"          direction:rtl; unicode-bidi:isolate
LRI...PDI  dir="ltr"          direction:ltr; unicode-bidi:isolate
FSI...PDI  <bdi>, dir="auto"  unicode-bidi:plaintext
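
The correspondence in the table can be shown with two small helpers that isolate the same piece of text either with Unicode control characters or with HTML markup. Both function names are hypothetical, and the choice of a `span` element to carry the `dir` attribute is an assumption of this sketch; any suitable element could be used.

```python
import html

# Isolate initiators keyed by the matching HTML dir value.
_OPENERS = {"ltr": "\u2066", "rtl": "\u2067", "auto": "\u2068"}  # LRI, RLI, FSI
_PDI = "\u2069"

def isolate_unicode(text, direction="auto"):
    """Wrap text in LRI/RLI/FSI ... PDI for use in plain text."""
    if direction not in _OPENERS:
        raise ValueError(f"unknown direction: {direction!r}")
    return _OPENERS[direction] + text + _PDI

def isolate_html(text, direction="auto"):
    """Wrap text in the HTML markup listed in the table above."""
    if direction not in _OPENERS:
        raise ValueError(f"unknown direction: {direction!r}")
    escaped = html.escape(text)
    if direction == "auto":
        return f"<bdi>{escaped}</bdi>"
    return f'<span dir="{direction}">{escaped}</span>'

print(isolate_html("a<b", "rtl"))  # <span dir="rtl">a&lt;b</span>
print(isolate_html("user name"))   # <bdi>user name</bdi>
```

Markup is generally the better fit for HTML content, since it survives copy and paste and cannot be left unterminated the way a stray control character can.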

Security Considerations

The misuse of bidirectional formatting characters poses significant security risks, as they can be used to make malicious code or text appear benign. This is documented in Unicode Technical Report #36 (UTR36). Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
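
One common mitigation is to scan untrusted text or source code for explicit directional formatting characters before displaying or accepting it. The sketch below is a simple illustration of that idea and is not a procedure taken from UTR36; the names `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical.

```python
import unicodedata

# Embeddings, overrides, isolates and their terminators (U+202A..U+202E, U+2066..U+2069).
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def find_bidi_controls(text):
    """Return (index, code point, name) for every explicit bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

print(find_bidi_controls("ab\u202ecd"))
# [(2, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```

A reviewer or linter could flag any non-empty result for manual inspection. Whether to reject such input outright is a policy decision that depends on whether legitimate bidirectional text is expected.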

History

  • Unicode 1.0 (1991): Basic bidirectional support introduced.
  • Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
  • Unicode 17.0 (2025): Current version (Revision 51).
