Regex complexity

Regular expression complexity of homogeneous languages

We could measure complexity by the length of the polish notation for the regex. However, this will always be an odd number $2m+1$. Therefore we prefer to use $m$, the number of operators, as the complexity. We have an upper bound of $c(n-1)+c-1 = cn-1$ for the complexity for a family of cardinality $c$ consisting of words of length $n$. This is because we can do: $++..x_1x_2x_3..y_1y_2y_3$ (in the case $n=3,c=2$). This is not sharp: for instance, when $c=2^n$ we also have the upper bound $2n-1$ due to the pattern $..+01+01+01$ (in the case $n=3$). Of interest: which value of $c$ maximizes the complexity given $n$?
Data below from February 20-23, 2021, pertain to a search through the space of words over {0,1,+,.}. A better, more semantic approach is to use that the complexity $R$ of a language $L=L_1\cup L_2$ or $L=L_1L_2$ (a concatenation) is at most $R(L_1)+R(L_2)+1$, and in fact $R(L)$ is realized this way for some $L_1,L_2$. With the new code it takes only 5 minutes to find all complexity for $n\ge 3$.
Length $n=0$ (max 0)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{\varepsilon\}$0$\varepsilon$
Length $n=1$ (max 1)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{0\}$0$0$
$\{1\}$0$1$
$\{0,1\}$1$+01$
Uniqueness is of interest. For $n=1$, +10 = +01 but we may always rule out the phrase +1 so it is essentially unique. More genuine non-uniqueness occurs for $\{00,01,11\}$:
+.00.+011, which is by commutativity of + equal to +.+011.00
+.0+01.11, which is by commutativity of + equal to +.11.0+01
We associate to the left so that instead of .0.0 we use ..00, to eliminate duplicates. For $\{000, 001, 011, 111\}$ we have:

+.+01.11..00+01
+..00+01.+01.11
+..00+01..+0111
+..+0111..00+01
A more complex example is $\{000,001,011,110\}$ which has:
+.0+.00.+011..110
+.0+.0+01.11..110
+.0+.11.0+01..110
+.0+.+011.00..110
+.+.00.110..0+011
+.+.11.000..0+011
+..0+011.+.00.110
+..0+011.+.11.000
+..110.0+.00.+011
+..110.0+.0+01.11
+..110.0+.11.0+01
+..110.0+.+011.00
Length $n=2$ (max 4)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{00\}$1$.00$
$\{01\}$1$.01$
$\{10\}$1$.10$
$\{11\}$1$.11$
$\{00,01\}$2.0+01
$\{00,10\}$2.+010
$\{00,11\}$3+.00.11 and +.11.00
$\{01,10\}$3+.01.10 and +.10.01
$\{01,11\}$2.+011
$\{10,11\}$2.1+01
$\{00,01,10\}$4+.01.+010
$\{00,01,11\}$4+.00.+011
$\{00,10,11\}$4+.00.1+01
$\{01,10,11\}$4+.01.1+01
$\{00,01,10,11\}$3.+01+01
Length $n=3$, cardinality $\le 2$ (max 5)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{000\}$2$..000$
$\{001\}$2$..001$
$\{010\}$2$..010$
$\{011\}$2$..011$
$\{100\}$2$..100$
$\{101\}$2$..101$
$\{110\}$2$..110$
$\{111\}$2$..111$
$\{000,001\}$3.0.0+01
$\{000,010\}$3.0.+010
$\{000,011\}$4.0+.00.11
$\{000,100\}$3.+01.00
$\{000,101\}$5+.0.00.1.01
$\{000,110\}$4.+.00.110
$\{000,111\}$5+.0.00.1.11
$\{001,010\}$4.0+.01.10
$\{001,011\}$3.0.+011
$\{001,100\}$5+.0.01.1.00
$\{001,101\}$3.+01.01
$\{001,110\}$5+.0.01.1.10
$\{001,111\}$4.+.00.111
$\{010,011\}$3.0.1+01
$\{010,100\}$4.+.01.100
$\{010,101\}$5+.0.10.1.01
$\{010,110\}$3.+01.10
$\{010,111\}$5+.0.10.1.11
$\{011,100\}$5+.0.11.1.00
$\{011,101\}$4.+.01.101
$\{011,110\}$5+.0.11.1.10
$\{011,111\}$3.+01.11
$\{100,101\}$3.1.0+01
$\{100,110\}$3.1.+010
$\{100,111\}$4.1+.00.11
$\{101,110\}$4.1+.01.10
$\{101,111\}$3.1.+011
$\{110,111\}$3.1.1+01
Length 3, cardinality 3 (max=7)
SetComplexityRegex
$\{000,001,010\}$5.0+.01.+010
$\{000,001,011\}$5.0+.00.+011
$\{000,001,100\}$6+.0.01.+01.00
$\{000,001,101\}$6+.0.00.+01.01
$\{000,001,110\}$6+.0.0+01.1.10
$\{000,010,011\}$5.0+.00.1+01
$\{000,010,100\}$5.+.01.+0100
$\{000,010,101\}$6+.0.+010.1.01
$\{000,010,110\}$5.+.00.+0110
$\{000,010,111\}$6+.0.+010.1.11
$\{000,011,100\}$6+.0.11.+01.00
$\{000,011,101\}$7+.0+.00.11.1.01
The example $\{000,011,101\}$ illustrates that 7 is the maximum complexity for a family of 3 binary words of length 3: there will always be two words that start with the same letter.
Length 3, cardinality 4
SetComplexityRegex
$\{000,001,010,011\}$4.0.+01+01
$\{000,001,010,100\}$8+. +01 .00 .0+ .01 .10
$\{000,001,010,101\}$7+. +01 .01 .0. +01 0
$\{000,001,010,110\}$7+. .00 +01 . +01 .10
$\{000,001,010,111\}$8+ .+ .00 .11 1 .0. +01 0
$\{000,001,011,100\}$7+.0.+011.+01.00
$\{000,001,011,101\}$8+ ..101 .0 + .11 .0 +01
$\{000,001,011,110\}$8+ ..110 .0 + .11 .0 +01
$\{000,001,011,111\}$7+.0.0+01.+01.11
etc.
With the current code it takes quite long to show that a complexity is $\ge 9$, and already 10 minutes or so to show it is $\ge 8$.
Phrases that are needed, and the minimum length at which they occur
Phrase$n$$c$
011
111
+12
.21
0021
0112
0+22
0.22
1021
1121
1+22
1.22
+012
+1not needed
++not found yet
+.22
.021
.121
.+22
..31
000yes $\{000\}$
001yes $\{001\}$
00+
00.
010yes $\{010\}$
011yes $\{011\}$
01+
01.
0+0
0+1no need +1
0++
0+.
0.0
0.1
0.+
0..
100yes $\{100\}$
101yes $\{101\}$
10+
10.
110yes $\{110\}$
111yes $\{111\}$
11+
11.
1+0
1+1no need +1
1++
1+.
1.0
1.1
1.+
1..
+00
+01
+0+
+0.
+10no need +1
+11no need +1
+1+no need +1
+1.no need +1
++0
++1no need +1
+++
++.
+.0
+.1
+.+
+..
.00
.01
.0+
.0.
.10
.11
.1+
.1.
.+0
.+1no need +1
.++
.+.
..0
..1
..+

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>