# Regex complexity

### Regular expression complexity of homogeneous languages

We could measure complexity by the length of the polish notation for the regex. However, this will always be an odd number $2m+1$. Therefore we prefer to use $m$, the number of operators, as the complexity. We have an upper bound of $c(n-1)+c-1 = cn-1$ for the complexity for a family of cardinality $c$ consisting of words of length $n$. This is because we can do: $++..x_1x_2x_3..y_1y_2y_3$ (in the case $n=3,c=2$). This is not sharp: for instance, when $c=2^n$ we also have the upper bound $2n-1$ due to the pattern $..+01+01+01$ (in the case $n=3$). Of interest: which value of $c$ maximizes the complexity given $n$?
Data below from February 20-23, 2021, pertain to a search through the space of words over {0,1,+,.}. A better, more semantic approach is to use that the complexity $R$ of a language $L=L_1\cup L_2$ or $L=L_1L_2$ (a concatenation) is at most $R(L_1)+R(L_2)+1$, and in fact $R(L)$ is realized this way for some $L_1,L_2$. With the new code it takes only 5 minutes to find all complexity for $n\ge 3$.
Length $n=0$ (max 0)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{\varepsilon\}$0$\varepsilon$
Length $n=1$ (max 1)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{0\}$0$0$
$\{1\}$0$1$
$\{0,1\}$1$+01$
Uniqueness is of interest. For $n=1$, +10 = +01 but we may always rule out the phrase +1 so it is essentially unique. More genuine non-uniqueness occurs for $\{00,01,11\}$:
+.00.+011, which is by commutativity of + equal to +.+011.00
+.0+01.11, which is by commutativity of + equal to +.11.0+01
We associate to the left so that instead of .0.0 we use ..00, to eliminate duplicates. For $\{000, 001, 011, 111\}$ we have:

+.+01.11..00+01
+..00+01.+01.11
+..00+01..+0111
+..+0111..00+01
A more complex example is $\{000,001,011,110\}$ which has:
+.0+.00.+011..110
+.0+.0+01.11..110
+.0+.11.0+01..110
+.0+.+011.00..110
+.+.00.110..0+011
+.+.11.000..0+011
+..0+011.+.00.110
+..0+011.+.11.000
+..110.0+.00.+011
+..110.0+.0+01.11
+..110.0+.11.0+01
+..110.0+.+011.00
Length $n=2$ (max 4)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{00\}$1$.00$
$\{01\}$1$.01$
$\{10\}$1$.10$
$\{11\}$1$.11$
$\{00,01\}$2.0+01
$\{00,10\}$2.+010
$\{00,11\}$3+.00.11 and +.11.00
$\{01,10\}$3+.01.10 and +.10.01
$\{01,11\}$2.+011
$\{10,11\}$2.1+01
$\{00,01,10\}$4+.01.+010
$\{00,01,11\}$4+.00.+011
$\{00,10,11\}$4+.00.1+01
$\{01,10,11\}$4+.01.1+01
$\{00,01,10,11\}$3.+01+01
Length $n=3$, cardinality $\le 2$ (max 5)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{000\}$2$..000$
$\{001\}$2$..001$
$\{010\}$2$..010$
$\{011\}$2$..011$
$\{100\}$2$..100$
$\{101\}$2$..101$
$\{110\}$2$..110$
$\{111\}$2$..111$
$\{000,001\}$3.0.0+01
$\{000,010\}$3.0.+010
$\{000,011\}$4.0+.00.11
$\{000,100\}$3.+01.00
$\{000,101\}$5+.0.00.1.01
$\{000,110\}$4.+.00.110
$\{000,111\}$5+.0.00.1.11
$\{001,010\}$4.0+.01.10
$\{001,011\}$3.0.+011
$\{001,100\}$5+.0.01.1.00
$\{001,101\}$3.+01.01
$\{001,110\}$5+.0.01.1.10
$\{001,111\}$4.+.00.111
$\{010,011\}$3.0.1+01
$\{010,100\}$4.+.01.100
$\{010,101\}$5+.0.10.1.01
$\{010,110\}$3.+01.10
$\{010,111\}$5+.0.10.1.11
$\{011,100\}$5+.0.11.1.00
$\{011,101\}$4.+.01.101
$\{011,110\}$5+.0.11.1.10
$\{011,111\}$3.+01.11
$\{100,101\}$3.1.0+01
$\{100,110\}$3.1.+010
$\{100,111\}$4.1+.00.11
$\{101,110\}$4.1+.01.10
$\{101,111\}$3.1.+011
$\{110,111\}$3.1.1+01
Length 3, cardinality 3 (max=7)
SetComplexityRegex
$\{000,001,010\}$5.0+.01.+010
$\{000,001,011\}$5.0+.00.+011
$\{000,001,100\}$6+.0.01.+01.00
$\{000,001,101\}$6+.0.00.+01.01
$\{000,001,110\}$6+.0.0+01.1.10
$\{000,010,011\}$5.0+.00.1+01
$\{000,010,100\}$5.+.01.+0100
$\{000,010,101\}$6+.0.+010.1.01
$\{000,010,110\}$5.+.00.+0110
$\{000,010,111\}$6+.0.+010.1.11
$\{000,011,100\}$6+.0.11.+01.00
$\{000,011,101\}$7+.0+.00.11.1.01
The example $\{000,011,101\}$ illustrates that 7 is the maximum complexity for a family of 3 binary words of length 3: there will always be two words that start with the same letter.
Length 3, cardinality 4
SetComplexityRegex
$\{000,001,010,011\}$4.0.+01+01
$\{000,001,010,100\}$8+. +01 .00 .0+ .01 .10
$\{000,001,010,101\}$7+. +01 .01 .0. +01 0
$\{000,001,010,110\}$7+. .00 +01 . +01 .10
$\{000,001,010,111\}$8+ .+ .00 .11 1 .0. +01 0
$\{000,001,011,100\}$7+.0.+011.+01.00
$\{000,001,011,101\}$8+ ..101 .0 + .11 .0 +01
$\{000,001,011,110\}$8+ ..110 .0 + .11 .0 +01
$\{000,001,011,111\}$7+.0.0+01.+01.11
etc.
With the current code it takes quite long to show that a complexity is $\ge 9$, and already 10 minutes or so to show it is $\ge 8$.
Phrases that are needed, and the minimum length at which they occur
 Phrase $n$ $c$ 0 1 1 1 1 1 + 1 2 . 2 1 00 2 1 01 1 2 0+ 2 2 0. 2 2 10 2 1 11 2 1 1+ 2 2 1. 2 2 +0 1 2 +1 not needed ++ not found yet +. 2 2 .0 2 1 .1 2 1 .+ 2 2 .. 3 1 000 yes $\{000\}$ 001 yes $\{001\}$ 00+ 00. 010 yes $\{010\}$ 011 yes $\{011\}$ 01+ 01. 0+0 0+1 no need +1 0++ 0+. 0.0 0.1 0.+ 0.. 100 yes $\{100\}$ 101 yes $\{101\}$ 10+ 10. 110 yes $\{110\}$ 111 yes $\{111\}$ 11+ 11. 1+0 1+1 no need +1 1++ 1+. 1.0 1.1 1.+ 1.. +00 +01 +0+ +0. +10 no need +1 +11 no need +1 +1+ no need +1 +1. no need +1 ++0 ++1 no need +1 +++ ++. +.0 +.1 +.+ +.. .00 .01 .0+ .0. .10 .11 .1+ .1. .+0 .+1 no need +1 .++ .+. ..0 ..1 ..+ …