# Regex complexity

### Regular expression complexity of homogeneous languages

We could measure complexity by the length of the polish notation for the regex. However, this will always be an odd number $2m+1$. Therefore we prefer to use $m$, the number of operators, as the complexity. We have an upper bound of $c(n-1)+c-1 = cn-1$ for the complexity for a family of cardinality $c$ consisting of words of length $n$. This is because we can do: $++..x_1x_2x_3..y_1y_2y_3$ (in the case $n=3,c=2$). This is not sharp: for instance, when $c=2^n$ we also have the upper bound $2n-1$ due to the pattern $..+01+01+01$ (in the case $n=3$). Of interest: which value of $c$ maximizes the complexity given $n$?
Data below from February 20-23, 2021, pertain to a search through the space of words over {0,1,+,.}. A better, more semantic approach is to use that the complexity $R$ of a language $L=L_1\cup L_2$ or $L=L_1L_2$ (a concatenation) is at most $R(L_1)+R(L_2)+1$, and in fact $R(L)$ is realized this way for some $L_1,L_2$. With the new code it takes only 5 minutes to find all complexity for $n\ge 3$.
Length $n=0$ (max 0)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{\varepsilon\}$0$\varepsilon$
Length $n=1$ (max 1)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{0\}$0$0$
$\{1\}$0$1$
$\{0,1\}$1$+01$
Uniqueness is of interest. For $n=1$, +10 = +01 but we may always rule out the phrase +1 so it is essentially unique. More genuine non-uniqueness occurs for $\{00,01,11\}$:
+.00.+011, which is by commutativity of + equal to +.+011.00
+.0+01.11, which is by commutativity of + equal to +.11.0+01
We associate to the left so that instead of .0.0 we use ..00, to eliminate duplicates. For $\{000, 001, 011, 111\}$ we have:

+.+01.11..00+01
+..00+01.+01.11
+..00+01..+0111
+..+0111..00+01
A more complex example is $\{000,001,011,110\}$ which has:
+.0+.00.+011..110
+.0+.0+01.11..110
+.0+.11.0+01..110
+.0+.+011.00..110
+.+.00.110..0+011
+.+.11.000..0+011
+..0+011.+.00.110
+..0+011.+.11.000
+..110.0+.00.+011
+..110.0+.0+01.11
+..110.0+.11.0+01
+..110.0+.+011.00
Length $n=2$ (max 4)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{00\}$1$.00$
$\{01\}$1$.01$
$\{10\}$1$.10$
$\{11\}$1$.11$
$\{00,01\}$2.0+01
$\{00,10\}$2.+010
$\{00,11\}$3+.00.11 and +.11.00
$\{01,10\}$3+.01.10 and +.10.01
$\{01,11\}$2.+011
$\{10,11\}$2.1+01
$\{00,01,10\}$4+.01.+010
$\{00,01,11\}$4+.00.+011
$\{00,10,11\}$4+.00.1+01
$\{01,10,11\}$4+.01.1+01
$\{00,01,10,11\}$3.+01+01
Length $n=3$, cardinality $\le 2$ (max 5)
SetComplexityRegex
$\emptyset$0$\emptyset$
$\{000\}$2$..000$
$\{001\}$2$..001$
$\{010\}$2$..010$
$\{011\}$2$..011$
$\{100\}$2$..100$
$\{101\}$2$..101$
$\{110\}$2$..110$
$\{111\}$2$..111$
$\{000,001\}$3.0.0+01
$\{000,010\}$3.0.+010
$\{000,011\}$4.0+.00.11
$\{000,100\}$3.+01.00
$\{000,101\}$5+.0.00.1.01
$\{000,110\}$4.+.00.110
$\{000,111\}$5+.0.00.1.11
$\{001,010\}$4.0+.01.10
$\{001,011\}$3.0.+011
$\{001,100\}$5+.0.01.1.00
$\{001,101\}$3.+01.01
$\{001,110\}$5+.0.01.1.10
$\{001,111\}$4.+.00.111
$\{010,011\}$3.0.1+01
$\{010,100\}$4.+.01.100
$\{010,101\}$5+.0.10.1.01
$\{010,110\}$3.+01.10
$\{010,111\}$5+.0.10.1.11
$\{011,100\}$5+.0.11.1.00
$\{011,101\}$4.+.01.101
$\{011,110\}$5+.0.11.1.10
$\{011,111\}$3.+01.11
$\{100,101\}$3.1.0+01
$\{100,110\}$3.1.+010
$\{100,111\}$4.1+.00.11
$\{101,110\}$4.1+.01.10
$\{101,111\}$3.1.+011
$\{110,111\}$3.1.1+01
Length 3, cardinality 3 (max=7)
SetComplexityRegex
$\{000,001,010\}$5.0+.01.+010
$\{000,001,011\}$5.0+.00.+011
$\{000,001,100\}$6+.0.01.+01.00
$\{000,001,101\}$6+.0.00.+01.01
$\{000,001,110\}$6+.0.0+01.1.10
$\{000,010,011\}$5.0+.00.1+01
$\{000,010,100\}$5.+.01.+0100
$\{000,010,101\}$6+.0.+010.1.01
$\{000,010,110\}$5.+.00.+0110
$\{000,010,111\}$6+.0.+010.1.11
$\{000,011,100\}$6+.0.11.+01.00
$\{000,011,101\}$7+.0+.00.11.1.01
The example $\{000,011,101\}$ illustrates that 7 is the maximum complexity for a family of 3 binary words of length 3: there will always be two words that start with the same letter.
Length 3, cardinality 4
SetComplexityRegex
$\{000,001,010,011\}$4.0.+01+01
$\{000,001,010,100\}$8+. +01 .00 .0+ .01 .10
$\{000,001,010,101\}$7+. +01 .01 .0. +01 0
$\{000,001,010,110\}$7+. .00 +01 . +01 .10
$\{000,001,010,111\}$8+ .+ .00 .11 1 .0. +01 0
$\{000,001,011,100\}$7+.0.+011.+01.00
$\{000,001,011,101\}$8+ ..101 .0 + .11 .0 +01
$\{000,001,011,110\}$8+ ..110 .0 + .11 .0 +01
$\{000,001,011,111\}$7+.0.0+01.+01.11
etc.
With the current code it takes quite long to show that a complexity is $\ge 9$, and already 10 minutes or so to show it is $\ge 8$.
Phrases that are needed, and the minimum length at which they occur
 Phrase $n$ $c$ 0 1 1 1 1 1 + 1 2 . 2 1 00 2 1 01 1 2 0+ 2 2 0. 2 2 10 2 1 11 2 1 1+ 2 2 1. 2 2 +0 1 2 +1 not needed ++ not found yet +. 2 2 .0 2 1 .1 2 1 .+ 2 2 .. 3 1 000 yes $\{000\}$ 001 yes $\{001\}$ 00+ 00. 010 yes $\{010\}$ 011 yes $\{011\}$ 01+ 01. 0+0 0+1 no need +1 0++ 0+. 0.0 0.1 0.+ 0.. 100 yes $\{100\}$ 101 yes $\{101\}$ 10+ 10. 110 yes $\{110\}$ 111 yes $\{111\}$ 11+ 11. 1+0 1+1 no need +1 1++ 1+. 1.0 1.1 1.+ 1.. +00 +01 +0+ +0. +10 no need +1 +11 no need +1 +1+ no need +1 +1. no need +1 ++0 ++1 no need +1 +++ ++. +.0 +.1 +.+ +.. .00 .01 .0+ .0. .10 .11 .1+ .1. .+0 .+1 no need +1 .++ .+. ..0 ..1 ..+ …

# MATH 454 Fall 2021

Watch this space for updates on the undergraduate course MATH 454 (Axiomatic Set Theory).
Textbook: Schimmerling’s “A course on set theory”

A similar course at the graduate level was MATH 654 Fall 2020, which had the following readings:

• Foundations of Mathematics, Kenneth Kunen, parts of chapters I-IV.
• Modal logic for open minds, Johan van Benthem, some of chapters 1-11 and 16.
• Logic and Proof, Lean tutorial, and The Natural Number Game

# Principles-based classification

I was the discussant for the following paper at Hawaii Accounting Research Conference 2020 on the UH Hilo campus:

The Contract Disclosure Mandate and Earnings Management under External Scrutiny

by Carlos Corona and Tae-Wook Ryan Kim

Discussant slides

I learned that research in the Theory Track of the accounting discipline primarily is about mathematical modeling of the effects of government policies and business decisions. It borrows methods from economics for such modeling. In the case of the Corona-Kim paper: quadratic programming without constraints, and exponential utility functions. Usually these are not empirical papers, i.e., they don’t test the model explicitly against data. Indeed this would be hard to do with notions like “intensity of scrutiny”.

I am a discussant for “A theory of principles-based classification” by Konvalinka, Penno, and Stecher, at HARC 2021.

# Jaccard distance and Normalized Information Distance

https://youtu.be/t1rFAyPp7s0