Brzozowski derivative

In theoretical computer science, particularly in formal language theory, the Brzozowski derivative $u^{-1}S$ of a set $S$ of strings and a string $u$ is the set of all strings obtainable from a string in $S$ by cutting off the prefix $u$. Formally:

u^{-1}S=\{v\in \Sigma ^{*}\mid uv\in S\}.

For example,

{\text{c}}^{-1}\{{\text{cat}},{\text{cow}},{\text{dog}}\}=\{{\text{at}},{\text{ow}}\}.

The Brzozowski derivative has been introduced under various names since the late 1950s.^[1]^[2]^[3] Today it is named after the computer scientist Janusz Brzozowski, who investigated its properties and gave an algorithm to compute the derivative of a generalized regular expression.^[4]

Definition

Even though originally studied for regular expressions, the definition applies to arbitrary formal languages. Given any formal language $S$ over an alphabet $\Sigma$ and any string $u\in \Sigma ^{*}$, the derivative of $S$ with respect to $u$ is defined as:^[4]

u^{-1}S=\{v\in \Sigma ^{*}\mid uv\in S\}

The Brzozowski derivative is a special case of the left quotient by a singleton set containing only $u$: $u^{-1}S=\{u\}\;\backslash \;S$.

Equivalently, for all $u,v\in \Sigma ^{*}$ :

v\in u^{-1}S\;\Leftrightarrow \;uv\in S.

From the definition, for all $u,v\in \Sigma ^{*}$ :

(uv)^{-1}S=v^{-1}(u^{-1}S)

since for all $w\in \Sigma ^{*}$, we have $w\in (uv)^{-1}S\Leftrightarrow uvw\in S\Leftrightarrow vw\in u^{-1}S\Leftrightarrow w\in v^{-1}(u^{-1}S)$.
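For a finite language the definition can be evaluated directly. The following is a minimal Python sketch, assuming the language is given as a finite set of strings; the function name `derivative` is chosen here for illustration and does not come from the cited sources.

```python
from __future__ import annotations


def derivative(u: str, language: set[str]) -> set[str]:
    """Return u^{-1}S for a finite language S: strip the prefix u
    from every string of S that starts with u."""
    return {w[len(u):] for w in language if w.startswith(u)}


S = {"cat", "cow", "dog"}

# The example from the introduction: c^{-1}{cat, cow, dog} = {at, ow}.
print(sorted(derivative("c", S)))  # ['at', 'ow']

# The composition rule (uv)^{-1}S = v^{-1}(u^{-1}S), checked for u = "c", v = "o".
assert derivative("co", S) == derivative("o", derivative("c", S))
```

Deriving by the empty string leaves the language unchanged, and deriving by a string that is not a prefix of any member yields the empty set.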

The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all $a\in \Sigma ,u\in \Sigma ^{*}$: ${\begin{aligned}(ua)^{-1}S&=a^{-1}(u^{-1}S)\\\varepsilon ^{-1}S&=S\end{aligned}}$

A language $S\subseteq \Sigma ^{*}$ is called nullable if and only if it contains the empty string $\varepsilon$. Each language $S$ is uniquely determined by the nullability of its derivatives:

w\in S\ \Leftrightarrow \ \varepsilon \in w^{-1}S

A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string $w\in \Sigma ^{*}$ denotes a node in the tree, with label true when $w\in S$ and false otherwise. In this interpretation, the derivative with respect to a symbol $a$ corresponds to the subtree obtained by following the edge $a$ from the root. Decomposing a tree into the root and the subtrees $a^{-1}S$ corresponds to the following equality, which holds for every language $S\subseteq \Sigma ^{*}$ :

S=(\{\varepsilon \}\cap S)\cup \bigcup _{a\in \Sigma }a(a^{-1}S).
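The decomposition can be checked mechanically on a small finite language. The sketch below is illustrative only: the helper names `derivative` and `recompose` and the sample language are assumptions made for this example.

```python
def derivative(a, language):
    """Return a^{-1}S for a finite language S given as a set of strings."""
    return {w[len(a):] for w in language if w.startswith(a)}


def recompose(language, alphabet):
    """Rebuild S from its root label and its one-symbol subtrees:
    ({eps} & S) | union over a of a(a^{-1}S)."""
    root = {""} & language
    subtrees = {a + v for a in alphabet for v in derivative(a, language)}
    return root | subtrees


S = {"", "a", "ab", "ba"}
assert recompose(S, {"a", "b"}) == S
```

Here the root contributes the empty string, the subtree under `a` contributes `a` and `ab`, and the subtree under `b` contributes `ba`.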

Derivatives of generalized regular expressions

When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the language of that regular expression.

Given a finite alphabet A of symbols,^[5] a generalized regular expression R denotes a possibly infinite set of finite-length strings over the alphabet A, called the language of R, denoted L(R).

A generalized regular expression can be one of the following (where a is a symbol of the alphabet A, and R and S are generalized regular expressions):

"∅" denotes the empty set: L(∅) = {},
"ε" denotes the singleton set containing the empty string: L(ε) = {ε},
"a" denotes the singleton set containing the single-symbol string a: L(a) = {a},
"R∨S" denotes the union of R and S: L(R∨S) = L(R) ∪ L(S),
"R∧S" denotes the intersection of R and S: L(R∧S) = L(R) ∩ L(S),
"¬R" denotes the complement of R (with respect to A*, the set of all strings over A): L(¬R) = A* \ L(R),
"RS" denotes the concatenation of R and S: L(RS) = L(R) · L(S),
"R*" denotes the Kleene closure of R: L(R*) = L(R)*.

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

Computation

For any given generalized regular expression R and any string u, the derivative u⁻¹R is again a generalized regular expression (denoting the language u⁻¹L(R)).^[6] It may be computed recursively as follows.^[7]

(ua)⁻¹R	= a⁻¹(u⁻¹R)	for a symbol a and a string u
ε⁻¹R	= R

Using the previous two rules, the derivative with respect to an arbitrary string reduces to successive derivatives with respect to single-symbol strings a. The latter can be computed as follows:^[8]

a⁻¹a	= ε
a⁻¹b	= ∅	for each symbol b≠a
a⁻¹ε	= ∅
a⁻¹∅	= ∅
a⁻¹(R*)	= (a⁻¹R)R*
a⁻¹(RS)	= (a⁻¹R)S ∨ ν(R)a⁻¹S
a⁻¹(R∧S)	= (a⁻¹R) ∧ (a⁻¹S)
a⁻¹(R∨S)	= (a⁻¹R) ∨ (a⁻¹S)
a⁻¹(¬R)	= ¬(a⁻¹R)

Here, ν(R) is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R's language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:^[9]

ν(a)	= ∅	for any symbol a
ν(ε)	= ε
ν(∅)	= ∅
ν(R*)	= ε
ν(RS)	= ν(R) ∧ ν(S)
ν(R ∧ S)	= ν(R) ∧ ν(S)
ν(R ∨ S)	= ν(R) ∨ ν(S)
ν(¬R)	= ε	if ν(R) = ∅
ν(¬R)	= ∅	if ν(R) = ε
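The rules for a⁻¹R and ν(R) translate almost line by line into a recursive program. The following Python sketch is one possible encoding, not Brzozowski's own formulation: the class names, the Boolean return type of `nullable` (True standing for ε, False for ∅), and the helper `matches` are assumptions made for illustration. No algebraic simplification is applied, so the derived expressions grow quickly.

```python
from dataclasses import dataclass


class Regex:
    """Base class for generalized regular expressions."""

@dataclass(frozen=True)
class Empty(Regex):          # ∅
    pass

@dataclass(frozen=True)
class Eps(Regex):            # ε
    pass

@dataclass(frozen=True)
class Sym(Regex):            # a single symbol
    char: str

@dataclass(frozen=True)
class Or(Regex):             # R ∨ S
    left: Regex
    right: Regex

@dataclass(frozen=True)
class And(Regex):            # R ∧ S
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Not(Regex):            # ¬R
    inner: Regex

@dataclass(frozen=True)
class Cat(Regex):            # RS
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Star(Regex):           # R*
    inner: Regex


def nullable(r: Regex) -> bool:
    """ν(R), encoded as True for ε and False for ∅."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Sym)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.left) and nullable(r.right)
    if isinstance(r, Or):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Not):
        return not nullable(r.inner)
    raise TypeError(f"not a regular expression: {r!r}")


def derive(a: str, r: Regex) -> Regex:
    """a⁻¹R for a single symbol a, following the rule table literally."""
    if isinstance(r, Sym):
        return Eps() if r.char == a else Empty()
    if isinstance(r, (Eps, Empty)):
        return Empty()
    if isinstance(r, Star):
        return Cat(derive(a, r.inner), r)
    if isinstance(r, Cat):
        nu = Eps() if nullable(r.left) else Empty()
        return Or(Cat(derive(a, r.left), r.right), Cat(nu, derive(a, r.right)))
    if isinstance(r, And):
        return And(derive(a, r.left), derive(a, r.right))
    if isinstance(r, Or):
        return Or(derive(a, r.left), derive(a, r.right))
    if isinstance(r, Not):
        return Not(derive(a, r.inner))
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r: Regex, word: str) -> bool:
    """Derive symbol by symbol, then test whether ε is in the result."""
    for a in word:
        r = derive(a, r)
    return nullable(r)


R = Star(Cat(Sym("a"), Sym("b")))    # (ab)*
print(matches(R, "abab"))            # True
print(matches(R, "aba"))             # False
print(matches(Not(R), "aba"))        # True
```

A practical implementation would normally simplify intermediate results (for example, replacing ∅S by ∅ and εS by S) to keep the expressions small.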

Properties

A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u⁻¹R.^[10]

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by d_R, all these languages can be obtained as derivatives of R with respect to strings of length less than d_R.^[11] Furthermore, there is a complete deterministic finite automaton with d_R states that recognises the regular language given by R, as stated by the Myhill–Nerode theorem.
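As a small worked illustration, take $R=(ab)^{*}$ over the alphabet $\{a,b\}$ (an example chosen here, not taken from the cited sources). The results are simplified using the language equalities $\varepsilon S=S$, $\emptyset S=\emptyset$ and $S\lor \emptyset =S$, which are not part of the rule table above:

$${\begin{aligned}\varepsilon ^{-1}R&=(ab)^{*}\\a^{-1}R&=(a^{-1}(ab))(ab)^{*}=b(ab)^{*}\\b^{-1}R&=(b^{-1}(ab))(ab)^{*}=\emptyset \\a^{-1}(b(ab)^{*})&=\emptyset \\b^{-1}(b(ab)^{*})&=(ab)^{*}\end{aligned}}$$

Every derivative of $\emptyset$ is again $\emptyset$, so no further languages arise and $d_{R}=3$. The three languages are already reached by the strings $\varepsilon$, $a$ and $b$, all of length less than 3. Taking them as states, with $(ab)^{*}$ as the start state and the only accepting (nullable) one, gives a complete deterministic finite automaton with three states.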

Derivatives of context-free languages

Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight was used to derive parsing algorithms for context-free languages.^[12] Implementations of such algorithms have been shown to have cubic time complexity,^[13] corresponding to the complexity of the Earley parser on general context-free grammars.
