MATHEMATICS OF SQUARE CONSTRUCTION
by Chris Long
Word Ways, 1993
A Theoretical Model
In the May 1992 issue of Word Ways, Ross Eckler raised the question of the minimum number of words needed before you'd expect to be able to form a word square, and he termed this number the support. A simple estimate may be obtained by assuming that the frequencies for each letter in each word are probabilistically independent (e.g., the fraction of words that have an A in the first position times the fraction that have B in the second position equals the fraction that have both an A in the first position and a B in the second position). For English words this is clearly not true (e.g., a Q is almost always followed by a U), but this estimate should still be reasonably accurate.
Under this assumption, we may apply Bayes' theorem from probability and compute that the expected number of fills for a form F which is a double n-square is approximately
E(F) = W2npn*2 where p = f(a)2 + ... + f(z)2 and n*2 is the square of n
W is the number of different n-letter words in our word list, and F(x) is the frequency of the letter x in our word list. Computations for various word lists reveal that p is about 1/15.8 with very little variation. We may now obtain an estimate of the support by defining the support of F, denoted by S(F), to equal that value of W which yields an E(F) of one. Setting E(F) = 1 in the above formula, we solve for W, finding that
W = p-n/2 or S(F) = 15.8n/2
when F is a double n-square.
To compute S(F) for a regular n-square we must be careful not to include any redundancies. It turns out that
E(F) = Wnpn(n-1)/2 and therefore S(F) = p-(n-1)/2
where p = 1/15.8 as before. This is very interesting, since, assuming that difficulty of square construction is directly proportional to the size of the word lists required, this implies that it is about as difficult to construct a regular square of order n as it is to construct a double square of order n-1. This was the opinion of Palmer Peterson of the National Puzzlers' League, one of the greatest formists (constructors of geometric forms including squares) of all time. Letting S(1) be an n-square and S(2) be a double n-square, quick work with a calculator gives
n S(1) S(2) 4 63 250 5 250 992 6 992 3,944 7 3,944 15,678 8 15,678 62,320 9 62,320 247,718 10 247,718 984,648 11 984,648 3,913,938 12 3,913,938 15,557,957Comparing these figures to the number of words of the appropriate length in the typical unabridged dictionary goes a long way toward explaining why Johnny hasn't constructed a 10-square or a double 9-square using only dictionary words.
More accurate comparisons result if we drop the condition that letter frequencies must be independent of position. A simple computer program may be written to implement this; it states that a more accurate estimate for a 9-square is 75642 and a 10-square is 256945, so the above figures tend to be on the low side.
Experimental Results
Several sample runs were done to compute the number of single and double squares for the 4-square and 5-square cases. Random subsets of a master word list were generated by choosing each word with a selection probability q, and then two computer runs were done using each list, once for the regular squares, and once for the double squares. In the tables below, the selection probability is given first, followed by the size of the word list which resulted. Then the number of squares produced from this list is given for both forms (the number of double squares found was actually twice the given number, since each square may be reflected around the diagonal). The support estimate E was generated by using the heuristic formulas derived in the first section, i.e., E(F) = (W/S)n for the single square and E(F) = (W/S)2n for the double square, by setting E(F) equal to the number of fills actually found and then solving for S. For this estimate the number of double squares found needed to be doubled, as the heuristic formula was derived under the assumption that reflective pairs are different fills.
Note that at the q = .30 level the number of double 4-squares found exceeded that of the single 4-squares, and that at the q = .40 level the number of double 5-squares found exceeded that of the single 5-squares, which verifies a somewhat unexpected prediction of the model. The relative stability of the support estimates implies that the heuristic formula appears to be a good model of reality. However, the support size for actual word lists appears to be approximately 1.5 times as large as the theoretical support size based on my probabilistic model of word generation. The number of n-squares appears to go as the nth power of the number of words and the number of double n-squares appears to go as the 2nth power of the number of words.
How many 4-squares and 5-squares are out there? The total number of 4-letter words in my list is 7364, so extrapolating based on the above data yields the surprisingly large estimates of 45 million 4-squares and 11 billion double 4-squares. Furthermore, doing the same computations for the 5-letter case, there are 16536 words in my list, yielding the estimates of 240 million 5-squares and 37 billion double 5-squares.
4-Square Results
q Words Singles E Doubles E .10 726 3,587 93.81 95 376.78 .15 1,041 16,671 91.61 1,780 374.56 .20 1,445 59,308 92.60 20,736 382.52 .25 1,883 196,875 87.02 192,866 367.18 .30 2,260 398,897 89.93 1,014,462 367.87 .35 2,587 741,522 88.16 2,981,655 368.01 .40 2,973 1,236,104 89.16 8,460,854 371.23 .45 3,310 1,777,351 90.65 17,910,777 376.32 .50 3,718 2,937,478 89.81 47,781,879 373.91 5-Square Results
q Words Singles E Doubles E .10 1654 2,651 341.86 0 .15 2,493 19,788 344.70 199 1370.04 .20 3,281 67,255 355.19 2,525 1398.55 .25 4,250 285,411 344.58 52,080 1338.50 .40 6,515 2,277,345 348.68 3,347,557 1353.13