Quicktrader wrote: Long time no news around here... had to fight with a broken laptop, and my last file is sadly completely gone. But now there is some news.
Some of us may have thought about the chances that three double symbols occur in a 340 cipher. Me too. And it is not enough to simply look at how often a double letter would actually occur in normal text, because this statistical observation fades away with the use of homophones. Therefore we have to review this issue in light of the number of homophones Z might have used. Compared to the 408, by using 63 homophones, Z has on average increased the number of homophones per alphabetical letter in the 340. We do assume, however, that he used a similar (or identical) number of homophones per letter as before, but increased the homophones for low-frequency letters. I do so because otherwise the statistical significance (to be focused on later) would grow even higher.
Three main criteria are essential in the following consideration:
The amount of letters expected (overall frequency)
The amount of homophones
The statistical expectation regarding double letters.
The latter I took from this list (
http://www.wordfrequency.info/free.asp?s=y), which appears to be more precise (450m words). Indeed, it covers the 5,000 most frequent words, many of which might have been used by Z.
Let's jump quickly into a preliminary example:
1. The letter Q, for example, is frequent neither overall (overall frequency) nor as a double letter (double letter frequency). Our double symbol which shows up three times is the '++'. On its own, '+' occurs 24 times, so Q can be ruled out for two (later even three) reasons:
a.) Q is not frequent enough, as it should show up only approximately 0.37 times
b.) Q as a double letter (QQ) doesn't occur in 5,000 words even once.
It therefore is most unlikely that '++' is representing 'QQ'.
2. The letter 'E', by contrast, is very frequent (overall frequency) and somewhat frequent as a double letter (33 times in 5,000 words). At first glance it therefore could be represented by the '+' symbol.
However, I will now demonstrate why this is not possible at all. Please allow me a short introduction.
The main reason is that the statistical expectation of three double letters showing up in a homophonic cipher is influenced from the outset by the number of homophones. Assume that 'EE' showed up, let's say, 200 times in 5,000 words. With an (assumed) average length of 5 letters per word, we can think of the 340 cipher as consisting of approximately 68 words.
This leads us to
200 / 5000 x 68
and therefore an expectation that 'EE' is present 2.72 times in the 340 (this figure is not the real one; in fact it is much lower). Nevertheless, it could in fact be represented by the three (and this is crucial!) '++' symbols.
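The arithmetic above can be sketched in a few lines of Python. The function name and parameter names are made up for illustration, and the 5 letters per word are only the assumption stated in the text, not a measured value.

```python
def expected_doubles(count_in_list, list_size=5000, cipher_len=340, avg_word_len=5):
    """Expected number of times a double letter appears in the cipher text."""
    # Assumption from the text: 340 letters / 5 letters per word = 68 words.
    words_in_cipher = cipher_len / avg_word_len
    return count_in_list / list_size * words_in_cipher


# Illustrative 'EE' count of 200 per 5,000 words, as in the text.
print(round(expected_doubles(200), 2))
# 'EE' count of 33 per 5,000 words, as quoted for the word list.
print(round(expected_doubles(33), 4))
```

Changing `avg_word_len` shows how strongly the expectation depends on the assumed word length.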
BUT:
The letter 'E' is not represented by only one symbol. It is rather estimated that 'E' is represented by at least 5 and up to 8 different symbols! Even if 'EE' showed up three times in the 340, it is most unlikely that those three double letters would, by accident, always (3x2 times) be represented by the same one of those 5 to 8 different homophones! Quite the reverse: chances are much greater that other homophones come into play to represent one or more symbols in those three expected double letters. Therefore perhaps only one or two double symbols would show up in the cipher. This difference is significant, as we are not merely discussing whether double symbols are present twice or three times in the cipher, no... instead we understand that, assuming 'E' is represented by e.g. 8 symbols, the chance for 'EE' to be represented three times by the same symbol is actually very low! It is low even for just the first double symbol!
In short... the MORE homophones a letter is represented by, the LOWER the chance that a double letter (such as 'EE') shows up as a double symbol ('++') in the 340, too. It is even less likely that the following double symbols consist of the same symbols.
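A minimal sketch of this effect, under the simplifying assumption (not stated in the post) that every position picks one of the letter's homophones independently and with equal probability. The function name is hypothetical.

```python
from fractions import Fraction


def prob_same_symbol_every_time(homophones, doubles=3):
    """Chance that all `doubles` double letters are written with one and the
    same symbol twice (e.g. '++' every time).

    Assumption: each of the 2 * doubles positions picks a homophone
    independently and uniformly at random.
    """
    k = Fraction(homophones)
    per_symbol = (1 / k**2) ** doubles  # every pair is, say, 'AA'
    return homophones * per_symbol      # ... or every pair 'BB', 'CC', ...


for k in (1, 2, 3, 5, 8):
    print(k, float(prob_same_symbol_every_time(k)))
```

With one homophone the result is 1; with 8 homophones it drops to 1/32768.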
So what to do? In fact, we are already talking about combinatorics. Every one of us knows the chance of rolling a 6 with a single throw of a die, which is 1/6. This is easy. But what we've got here is way more complicated:
Scenario 1: Imagine 16 balls in a bowl... in 8 different colors (two of each, representing the double letter, with the 8 colors representing 8 different homophones). We are now allowed to draw 3 times (representing e.g. the expected 2.72 double letters ('EE')). Now guess how low the chance is that you draw, three times in a row, ALWAYS the SAME COLOR for BOTH BALLS. In this scenario, not even one ball would be allowed to have a different color. This most likely won't work, even if you may want to try it at home with your kids.
Scenario 2: Imagine only 2 balls in a bowl, both of the same color. This would be the case if only one homophone existed for the letter, with '+' therefore representing a letter such as Z. Then we definitely get double symbols that are all the same, without any problems. But what is the issue now? First, the letter Z usually doesn't occur 24 times in a text of 340 letters. Second, occurring only 9 times in 5,000 words, it is a bad candidate for showing up three times as ZZ in a 340. Although 'everything is possible', this scenario is even less realistic than our first one.
Scenario 3: The favorite... what if the number of homophones (colors of the balls in a bowl) is quite limited, e.g. to three different homophones, AND the expectation of the letter to occur as a double letter is quite high, e.g. 7 times? And in addition to that, let this letter be a medium to frequent one, too?
All those scenarios above can be calculated. We are therefore able to say which letter (based on a certain number of assumed homophones and certain assumed double letter frequencies) is the most likely to be represented by three double '++'. In the beginning I thought this would be easy to calculate, but I soon realized that it actually is not, mainly because we have two 'balls' to be drawn each time, the first symbol of the double letter and the second one (thus drawing two balls at the same time with the goal of them having the same color). This is cumbersome, although we can write all those combinations down manually on a piece of paper in the form of a probability tree. To make that easier, however, there is the great Bernoulli method to evaluate the chances considered above. Those, however, still depend on the number of homophones.
An example:
Let's assume two homophones for one letter, the symbols A and B. A double letter can therefore be represented by AA, AB, BA or BB. The chance of getting an AA is 0.25 or 25%, the same as for BB and the others. The chance of getting 'AA' three times in a row is therefore 0.25^3 or 1.5625%. This might be doubled, as we would be completely satisfied with three BBs, too (however, we don't expect BAs and ABs hanging around... maybe additionally, but that again goes against our double letter frequency expectation).
Probabilities for getting an 'AA' situation are:
1 homophone: 100% - AA
2 homophones: 25% - AA, AB, BA, BB
3 homophones: 11.11% (1/9) - AA, AB, AC, BA, BB, BC, CA, CB, CC
4 homophones: 6.25% (1/16) - AA, ..., DD
..
7 homophones: 2.04% (1/49) - AA, ..., ..., GG
(every homophone may be combined with every other homophone, therefore e.g. 7x7 in the latter example)
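The list above can be reproduced by simply enumerating all symbol pairs. This is only a sketch; `pair_table` is a made-up helper name, and the homophones are labelled A, B, C, ... as in the example.

```python
from itertools import product
from string import ascii_uppercase


def pair_table(homophones):
    """Return all ordered symbol pairs and the share of them that is 'AA'."""
    symbols = ascii_uppercase[:homophones]  # 'A', 'B', 'C', ...
    pairs = ["".join(p) for p in product(symbols, repeat=2)]
    return pairs, pairs.count("AA") / len(pairs)


for k in (1, 2, 3, 4, 7):
    pairs, p_aa = pair_table(k)
    print(f"{k} homophones: {p_aa:.2%} (1/{len(pairs)})")

# Three 'AA' in a row with two homophones, as in the example above.
print(f"{0.25 ** 3:.4%}")
```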
If we now have a look at the 'E' with, let's now say, 7 homophones, things look different: (1/7)^3 = 0.14285^3 = 0.00291545 or approximately 0.3%, where 1/7 is the chance that both symbols of a pair are identical (AA, BB, ..., GG). This is simplified, as we only deal with three draws to get those identical double symbols in a row.
But what if out of 7 draws, no matter when, 3 identical pairs of homophones show up? What are the odds then?
The great Bernoulli developed a formula that shortens this tree of probabilities. It can be tested with both low and high numbers, and goes like this (German link:
http://www.mathe-online.at/lernpfade/Ko ... ?kapitel=4):
QT Chart One.jpg
With
n=number of 'tries' (expected double letters of a specific alphabetical letter in a 340 cipher text)
i=number of 'successes' (the required three double symbols present in the 340 cipher text)
p=probability of success (the odds of e.g. getting an 'AA' instead of an 'AB' or 'AC',..)
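For reference, the standard Bernoulli (binomial) formula written with exactly these three symbols is given below. That the attached chart shows this same expression is an assumption, since the image is not reproduced in the text.

```latex
P(X = i) = \binom{n}{i} \, p^{i} \, (1 - p)^{\,n - i},
\qquad
\binom{n}{i} = \frac{n!}{i! \, (n - i)!}
```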
Putting this into the formula, we simply get the probability of getting three double symbols at all. We can do this for each alphabetical letter separately. Beware: we now assume a certain number of homophones for each letter, too, so the chance ('p') differs for each individual letter of the alphabet, depending entirely on how many homophones the letter is expected to be represented by.
I made a table consisting of 'overall letter frequency' & 'amount', 'double letter frequency' & 'amount', and worked out the chance of getting an 'AA' situation depending on 2, 3, 4 etc. homophones. In my Excel sheet the formula for each alphabetical letter looked roughly like this:
=FACULTY(G3)/FACULTY(3)*IF(H3=3;P$3^3;IF(H3=4;R$3^3;IF(H3=2;N$3^3;IF(H3=1;L$3;IF(H3=7;T$3;fehler)))))*((1-IF(H3=3;P$3;IF(H3=4;R$3;IF(H3=2;N$3;IF(H3=1;L$3;IF(H3=7;T$3;fehler))))))^(G3-3))
which should be no more and no less than the Bernoulli formula (the 'if' refers to the different amount of homophones expected).
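As printed, the sheet expression divides the factorial of G3 by 3! only, whereas the standard binomial coefficient also divides by (G3-3)!. The sketch below computes both variants side by side so they can be compared. The function names are hypothetical, and n = 7 with p = 1/9 (three homophones) is just a sample input, rounded to a whole number of tries.

```python
from math import comb, factorial


def spreadsheet_style(n, p, i=3):
    """Mirrors the sheet expression as printed: n!/i! * p^i * (1-p)^(n-i)."""
    return factorial(n) / factorial(i) * p**i * (1 - p) ** (n - i)


def binomial_pmf(n, p, i=3):
    """Standard Bernoulli formula: C(n, i) * p^i * (1-p)^(n-i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


n, p = 7, 1 / 9  # sample input: 7 tries, three homophones
print(f"sheet expression: {spreadsheet_style(n, p):.4f}")
print(f"binomial formula: {binomial_pmf(n, p):.4f}")
```

For this input the first function gives roughly 0.72 and the second roughly 0.03; the two differ by the factor (n-3)! = 4! = 24. For n = 3 both agree, since 0! = 1.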
I have to admit that I was surprised:
1. Three letters showed up with an 'error' as I had not yet defined the precondition of '0' homophones, e.g. Z, Q, J. I didn't pay much attention to those.
2. All letters with only one expected homophone showed up with a different error (because the exponent in Bernoulli's formula becomes zero or negative); those were the letters PCMGBZUJKQVWXY. All those letters have an overall expected frequency of 0.09% to 3.06%, so most of them would not even be expected to show up more than 10 times (as double and single letters). Please recall: the '+' is present 24 times. Therefore those letters don't even reach 46% of what '+' comes up with. Also, none of those letters is expected to show up even twice (let alone three times) as a double letter.
3. The letters LSHDFEROTAIN, however, went into a closer Bernoulli consideration: those letters gave values for the chance that they would show up in the cipher with three identical double symbols.
And guess what:
QT Chart Two.jpg
This is the result. It depends both on a certain number of homophones and on certain double letter expectations (the latter based on 450m words). It is the statistical expectation for a letter to appear three times as a double symbol in a 340 cipher.
EVEN IF we assume e.g. the letter 'O' to show up with four different types of its homophones (therefore satisfied not only with AA but also BB, CC, DD), the probability doesn't even increase above the benchmark of 0.1 percent (0.024% x 4 = 0.096%).
Why is that? Why is this so low?
Well, that's explained more quickly than what I have written above:
'O' is expected to show up only 3.11 times as a double letter ('OO'). And, due to its overall letter frequency, it is expected to have 4 homophones. To draw two identical balls at the same time, three times, out of a bowl with 8 balls (putting them back each time), with all of them having the same color, is simply extremely hard.
With 71%, it is obviously far easier to draw 7 times, with only three colors in the bowl, and get a pair of two identically colored balls (or symbols) three times. It is, imo, not possible to follow that thought 'immediately', as such a drawing of balls could look like this (or completely different... many ways, seven levels):
AC, BB, BC, CA, BA, CB, BB, AA, BC, AC, BB
As you can see, in this example it took us 11 draws to complete the three double homophones.
And this is exactly what the table above had told us:
It is not enough to draw 7.1 times and expect three double 'LL' symbols. Instead, as many as 10-11 draws are expected to be needed to get those three identical double-symbol patterns in the ciphertext.
By chance, in the 340, 7.1 draws (occurrences) were enough. Or had Z used the double letter 'LL' not only 7.1 times but 10 or more times in his cipher?
What is a fact, however, is that due to a higher number of homophones (caused by a higher overall frequency of the letter) and due to lower expected numbers of double letters (e.g. 'DD', which is rather rare compared to 'LL'), the other letters do not (significantly!) come into consideration for being represented by the three double '++' symbols.
The first solved symbol, imo based on valid, significant statistical data, is therefore the '+' representing the consonant 'L' (and only that one, as long as no error or absolutely extreme outliers exist).
It should be mentioned, however, that assuming different numbers of homophones (higher ones) would reduce this probability even further (lower ones would increase it, but the 340 has more, not fewer, homophones). 'LL' occurring three times in the 340 is not even expected outright. That would actually require a value of 100%. Instead it's 71%. With Z using 'LL' a bit more often, or the symbols accidentally falling into a good pattern, the triple 'LL' (there are definitely more in the cleartext) represented by the triple '++' is, as far as I can say, the only correct solution.
QT