Brzozowski derivative
|
In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u−1S of a set S of strings and a string u is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = { v ∈ Σ*: uv ∈ S }, cf. picture.[1] It is named after the computer scientist Janusz Brzozowski who investigated their properties and gave an algorithm to compute the derivative of a generalized regular expression.
Derivative of a regular expression
Given a finite alphabet A of symbols,[2] a generalized regular expression denotes a possibly infinite set of finite-length strings of symbols from A. It may be built of:
- ∅ (denoting the empty set of strings),
- ε (denoting the singleton set containing just the empty string),
- a symbol a from A (denoting the singleton set containing the single-symbol string a),
- R∨S (where R and S are, in turn, generalized regular expressions; denoting their set's union),
- R∧S (denoting the intersection of R 's and S 's set),
- ¬R (denoting the complement of R 's set with respect to the set of all strings of symbols from A),
- RS (denoting the set of all possible concatenations of strings from R 's and S 's set),
- R* (denoting the set of n-fold repetitions of strings from R 's set, for any n≥0, including the empty string).
In an ordinary regular expression, neither ∧ nor ¬ is allowed. The string set denoted by a generalized regular expression R is called its language, denoted as L(R).
As an auxiliary function, δ(R) yields the empty string ε if the language corresponding to R contains ε; otherwise, δ(R) yields ∅. This function can be computed by the following rules:[3]
δ(ε) | = ε | |
δ(∅) | = ∅ | |
δ(R*) | = ε | |
δ(RS) | = δ(R) ∧ δ(S) | |
δ(R ∧ S) | = δ(R) ∧ δ(S) | |
δ(R ∨ S) | = δ(R) ∨ δ(S) | |
δ(¬R) | = ε | if δ(R) = ∅ |
δ(¬R) | = ∅ | if δ(R) = ε |
Based on that, the derivative of a generalized regular expression with respect to a single-symbol string a can be computed as follows:[4]
a−1a | = ε | |
a−1b | = ∅ | for each symbol b≠a |
a−1ε | = ∅ | |
a−1∅ | = ∅ | |
a−1(R*) | = a−1RR* | |
a−1(RS) | = (a−1R)S ∨ δ(R)a−1S | |
a−1(R∧S) | = (a−1R) ∧ (a−1S) | |
a−1(R∨S) | = (a−1R) ∨ (a−1S) | |
a−1(¬R) | = ¬(a−1R) |
Finally, for a symbol a, an arbitrary string u, and a generalized regular expression R, the derivative (ua)−1R can be computed recursively as a−1(u−1R); and ε−1R simply equals R.[5] This way, for any given generalized regular expression R and any string u, the derivative u−1R may be computed as another generalized regular expression.[6]
Properties
A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u−1R.[7]
Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by dR, all these languages can be obtained as derivatives of R with respect to string of length below dR.[8] Furthermore, there is a complete deterministic finite automaton with dR states which recognises the regular language given by R, as laid out by the Myhill–Nerode theorem.
References
- ↑ Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". JACM 11: 481–494. doi:10.1145/321239.321249.
- ↑ Brzozowski (1964), p.481, required A to consist of the 2n combinations of n bits, for some n.
- ↑ Brzozowski (1964), p.482, Definition 3.2
- ↑ Brzozowski (1964), p.483, Theorem 3.1
- ↑ Brzozowski (1964), p.483, Theorem 3.2
- ↑ Brzozowski (1964), p.483, Theorem 4.1
- ↑ Brzozowski (1964), p.483, Theorem 4.2
- ↑ Brzozowski (1964), p.484, Theorem 4.3