Unigram distance curiosity

Re: Unigram distance curiosity

Postby doranchak » Thu Nov 16, 2017 6:21 am

Here is another result to add to your pondering.

Earlier I posted the outliers in my unigram distance sum per symbol tests:

Image

Z408 has 13 outliers and Z340 has 25 outliers.
Compared to 1000 shuffles, are the outlier counts unusual?
Result:

1) Z408's outlier count is 1.72 sigma below the mean outlier count observed during shuffles.
2) Z340's outlier count is 0.68 sigma above the mean outlier count observed during shuffles.

So, I suppose we can conclude that Z408 shows fewer unigram distance outliers than expected, and Z340 shows a little more than expected, at least compared to randomizations.
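
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of scoring a cipher against its own shuffles. The names `sigma_vs_shuffles` and `unigram_distance_sum` are made up for illustration, and the statistic is only a stand-in (the sum of gaps between consecutive repeats of each symbol), not necessarily the outlier count used above.

```python
import random
import statistics

def unigram_distance_sum(cipher):
    """Stand-in statistic (an assumption): sum of the gaps between
    consecutive occurrences of each symbol."""
    last_seen = {}
    total = 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            total += pos - last_seen[sym]
        last_seen[sym] = pos
    return total

def sigma_vs_shuffles(cipher, statistic, n_shuffles=1000, seed=0):
    """Return (observed - mean) / stdev, where mean and stdev come from
    the statistic measured on shuffled copies of the cipher."""
    rng = random.Random(seed)
    observed = statistic(cipher)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(cipher)
        rng.shuffle(shuffled)
        scores.append(statistic(shuffled))
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        raise ValueError("statistic does not vary across shuffles")
    return (observed - mean) / stdev
```

A negative result means the cipher scores below the shuffle mean, a positive one above it. Any other statistic that takes a sequence of symbols can be passed in place of the stand-in.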

Re: Unigram distance curiosity

Postby Jarlve » Thu Nov 16, 2017 5:35 pm

Using regular unigram distance, and testing against a hypothesis of sequential homophonic substitution with added cycle randomization, the 408 (first 340 characters) comes out at -0.75 sigma and the 340 at 2.32 sigma. Logging only distances over 170, the sigma for the 340 goes up to 2.85, yet versus randomizations it goes down from 4.23 to 1.37.

Re: Unigram distance curiosity

Postby smokie treats » Fri Nov 17, 2017 7:25 am

unigram distance 2.png


The four symbol-positions unique to the middle 8 rows interest me as well.

I P 20 transposed and homophonically encoded most of Brave New World with a key that was a little less efficient than one that exactly matches plaintext frequencies, with one polyphone to simulate the +, closely matching symbol and polyphone count on average. That gave 768 messages total (it only took a few minutes with my new spreadsheet suite; it would have taken me hours before). EDIT: 25% random.

Left column is count of symbols unique to both top 6 rows and bottom 6 rows, right column is number of messages with that count. The 340 has 39 symbol-positions. None of my messages had 39, and the most was one message that had 34. The average was only 12.

EDIT: After typo in formula fixed, same test, but different messages because of random symbol selection at 25%. Average is still 12. There were two of 768 that had 35 in this batch.

1 0
2 23
3 18
4 20
5 36
6 35
7 36
8 46
9 53
10 53
11 46
12 53
13 33
14 42
15 30
16 34
17 39
18 22
19 34
20 18
21 18
22 15
23 11
24 10
25 4
26 4
27 3
28 4
29 3
30 2
31 1
32 1
33 0
34 0
35 2
36 0
37 0
38 0
39 0
40 0
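
A table like this is easy to summarize in a few lines. The following is a rough sketch, assuming the table has been typed in as (count, messages) pairs; the function names are hypothetical.

```python
def histogram_mean(pairs):
    """Weighted mean of the counts; pairs is an iterable of (count, messages)."""
    pairs = list(pairs)
    total_messages = sum(m for _, m in pairs)
    if total_messages == 0:
        raise ValueError("histogram is empty")
    return sum(c * m for c, m in pairs) / total_messages

def tail_fraction(pairs, threshold):
    """Fraction of messages whose count is at least `threshold`."""
    pairs = list(pairs)
    total_messages = sum(m for _, m in pairs)
    if total_messages == 0:
        raise ValueError("histogram is empty")
    return sum(m for c, m in pairs if c >= threshold) / total_messages
```

Calling `tail_fraction(pairs, 39)` on the full table gives the share of test messages that reach the 340's count of 39. Note that messages with a count of 0 are not listed in the table, so they would have to be added as a `(0, n)` pair to be included in the mean.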

Left column is count of symbol-positions unique to the middle 8 rows, right column is number of messages with that count. The 340 has only 4 symbol-positions, and only three of 768 messages have 4 or fewer symbol-positions.

EDIT: After typo in formula fixed. The average was 4, making the 340 very typical in the middle.

1 84
2 92
3 104
4 106
5 102
6 87
7 54
8 35
9 20
10 10
11 9
12 0
13 2
14 1
15 0
16 0
17 1
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0

Well, this does not fit the transposition + nulls, skips & filler + homophonic +/- 63 + polyphone +/- 24 model.

We have seen that regional Caesar shifting of the letters in a key like this one, while keeping the homophones in place, makes it easy to create a lot of symbol-positions unique to the top 6 and bottom 6 rows, but makes it difficult to produce repeats and only four symbol-positions unique to the middle 8 rows. Shifting the homophonic symbols and keeping the letters in their place would be the same thing.

E 1 2 3 4 5 6 7 68
T 8 9 10 11 12 68
O 13 14 15 16 68
A 17 18 19 20 68
I 21 22 23 24 68
N 25 26 27 28
H 29 30 31
S 32 33 34
R 35 36 37
D 38 39 40
L 41 42 43
U 44 45
C 46 47
W 48 49
M 50 51
G 52 53
Y 54 55
F 56 57
P 58 59
B 60 61
V 62
K 63
X 64
Z 65
Q 66
J 67
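
As a rough illustration of how a key like this gets used, here is a sketch of sequential homophonic encoding with a random-selection rate. The function name, the `random_rate` parameter and the choice to leave the polyphone 68 inside each letter's cycle are all assumptions for the example, not a description of the spreadsheet that produced the test messages.

```python
import random

# A few rows of the key above; 68 is the polyphone shared by several letters.
EXAMPLE_KEY = {
    "E": [1, 2, 3, 4, 5, 6, 7, 68],
    "T": [8, 9, 10, 11, 12, 68],
    "H": [29, 30, 31],
}

def encode_homophonic(plaintext, key, random_rate=0.25, seed=0):
    """Encode letters with their homophones, normally cycling through them
    in order, but picking a random homophone with probability random_rate."""
    rng = random.Random(seed)
    next_index = {letter: 0 for letter in key}
    out = []
    for ch in plaintext.upper():
        if ch not in key:
            continue  # skip spaces, punctuation and letters missing from the key
        symbols = key[ch]
        if rng.random() < random_rate:
            out.append(rng.choice(symbols))  # random pick, cycle position unchanged
        else:
            i = next_index[ch]
            out.append(symbols[i])
            next_index[ch] = (i + 1) % len(symbols)
    return out
```

With `random_rate=0.0` the cycles are perfect. Whether a random pick should also advance the cycle is a design choice; this sketch leaves the cycle position where it was.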

When I say 39 symbol-positions, I mean some number of different symbols occupying 39 positions total. In the case of the 340, there are 9 symbols occupying 39 positions. But only some of these 9 symbols are responsible for the phenomenon because most test messages have at least some unique symbol positions.
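
To pin the definition down, here is a small sketch of the count as I read it: every position occupied by a symbol that never shows up in the other region. The function name is made up, and it assumes a symbol counts toward the top / bottom if it appears anywhere in those 12 rows, not necessarily in both the top and the bottom.

```python
from collections import Counter

def region_unique_positions(grid, top=6, bottom=6):
    """grid is a list of rows, each row a list of symbols.
    Returns (outer_only, middle_only): the number of positions occupied by
    symbols seen only in the top/bottom rows, and only in the middle rows."""
    n = len(grid)
    outer_rows = grid[:top] + grid[n - bottom:]
    middle_rows = grid[top:n - bottom]
    outer = Counter(s for row in outer_rows for s in row)
    middle = Counter(s for row in middle_rows for s in row)
    outer_only = sum(c for s, c in outer.items() if s not in middle)
    middle_only = sum(c for s, c in middle.items() if s not in outer)
    return outer_only, middle_only
```

For a 20-row message the defaults give the 6-8-6 split, so the first number corresponds to the top / bottom count and the second to the middle count.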

So of those 9 symbols, there is a small group that maybe he introduced to the key for the top / bottom, but did not use in the key in the middle. Is there a slow, subtle rotation of symbols through the key?

Or, did he use some type of creative cycling that causes this?

Or something else?

Could be 4-12-4, 5-10-5, 6-8-6, 7-6-7, or by position. Don't know.
Last edited by smokie treats on Sun Nov 19, 2017 8:23 am, edited 1 time in total.

Re: Unigram distance curiosity

Postby Jarlve » Fri Nov 17, 2017 5:09 pm

smokie treats wrote:Left column is count of symbols unique to both top 6 rows and bottom 6 rows, right column is number of messages with that count. The 340 has 39 symbol-positions. None of my messages had 39, and the most was one message that had 34. The average was only 12.

Thank you for this excellent test. I will try to follow up on this.

smokie treats wrote:So of those 9 symbols, there is a small group that maybe he introduced to the key for the top / bottom, but did not use in the key in the middle. Is there a slow, subtle rotation of symbols through the key?

Or, did he use some type of creative cycling that causes this?

Or something else?

Could be 4-12-4, 5-10-5, 6-8-6, 7-6-7, or by position. Don't know.

It is insane. I do not know either.

Image

The 4 symbols unique to the middle may be trivial. If you look at the outlier table that doranchak put up for the 340 and 408, you can see that a symbol in the 408 has a -2.43 sigma (because it maps to the letter "Y", which only appears in the third part of the cipher). It is just one of many distributions that all happen to be equally unlikely (for only a handful of symbols).

The symbols that do not appear in the middle 8 rows are where the money is at.

Moonrock, out of all his cycle types, considered regional cycling for the 340.

Re: Unigram distance curiosity

Postby smokie treats » Fri Nov 17, 2017 8:18 pm

Jarlve wrote:The 4 symbols unique to the middle may be trivial. If you look at the outlier table that doranchak put up for the 340 and 408 it can be seen that a symbol in the 408 has a -2.43 sigma (because it maps to the letter "Y" that only appears in the third part of the cipher). It is just one of many distributions that all happen to be equally unlikely (for only a handful of symbols).

The symbols that do not appear in the middle 8 rows are where the money is at.

Moonrock, out of all his cycle types, considered regional cycling for the 340.


Well, I don't know.

Thanks to marie, by the way, for helping us focus on this issue.

Since the last test was of transposed plaintext, I figured I would try untransposed plaintext, since maybe certain words could appear at the bottom and top but not the middle. I thought that this might produce a few more results similar to the 340.

Left column is count of symbols unique to both top 6 rows and bottom 6 rows, right column is number of messages with that count. The 340 has 39 symbol-positions. Only one of the 768 test messages had 39. The next closest one had 32. This is all with no transposition, and the average is 11, very similar to the result with transposition.

EDIT: Similar data for top and bottom after typo in formula fixed, just new batch and average is now 12.

1 0
2 20
3 13
4 33
5 44
6 44
7 39
8 61
9 51
10 52
11 51
12 46
13 40
14 37
15 48
16 30
17 20
18 25
19 21
20 16
21 10
22 13
23 11
24 8
25 8
26 7
27 4
28 3
29 2
30 1
31 2
32 0
33 0
34 1
35 1
36 0
37 0
38 0
39 0
40 0

Left column is count of symbol-positions unique to the middle 8 rows, right column is number of messages with that count. The 340 has only 4 symbol-positions, and this time, with no transposition, none of the 768 messages have 4 or fewer unique symbol-positions. There was like one that had 5 and three that had 6.

EDIT: New data after typo in formula fixed. This is very different because 4 is the average for the middle of a 6-8-6.

1 92
2 102
3 130
4 99
5 86
6 63
7 42
8 39
9 22
10 16
11 10
12 5
13 0
14 1
15 1
16 0
17 2
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0

So both with and without transposition, the results are similar.

If you have to fix a car that isn't running, then start with the simplest, cheapest solution and the cheapest parts. Since it was a paper and pencil cipher, and changing the key throughout seems like overkill after alleged transposition, maybe it was just some creative cycling. There have to be other options besides what moonrock said. Finding a cycle model that fits may not lead to a solution, but it would be interesting to see if we can. I would say it should be a cycle model that also results in just a few unique symbol-positions in the middle portion of the message.

EDIT: Or a different key model. Like maybe mapping a whole lot of symbols to a few high-frequency plaintext letters, together with some 1:1 mappings. Would this fit other stats?
Last edited by smokie treats on Sun Nov 19, 2017 8:34 am, edited 1 time in total.

Re: Unigram distance curiosity

Postby Jarlve » Sat Nov 18, 2017 5:23 am

Okay smokie,

I am now using the same measurement as you, 39 unigrams that appear in the top and bottom 6 rows but not in the middle 8 rows. This count of 39 is the score.

- Versus randomizations of the 340 it is a 2.37 sigma observation (mean 20).

- Versus randomized plaintexts + sequential homophonic substitution with 26% cycle randomization hypothesis it is a 5.12 sigma observation (mean 10).

- Versus randomized plaintexts + 6-8-6 Caesar shift + sequential homophonic substitution with 26% cycle randomization hypothesis it is a -2.23 sigma observation (mean 68).

That seems to line up with your tests, does it not? The observation correlates better with randomizations than with a regular sequential homophonic substitution hypothesis... Also, with the Caesar shifts, it seems you only need to shift the middle 8 rows to get the effect, since my results were nearly identical between the two. That could be a time saver.
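
For reference, shifting only the middle rows of the plaintext before encoding could look something like the sketch below. The function name, the shift amount and the row width of 17 are assumptions for the example; my actual test setup may differ in the details.

```python
import string

def shift_middle_rows(plaintext, shift, width=17, top=6, middle=8):
    """Caesar shift only the letters that fall in the middle rows when the
    plaintext is written out in rows of `width` letters."""
    letters = [c for c in plaintext.upper() if c in string.ascii_uppercase]
    start = top * width             # first position of the middle region
    end = (top + middle) * width    # first position of the bottom region
    out = []
    for i, c in enumerate(letters):
        if start <= i < end:
            c = chr((ord(c) - ord("A") + shift) % 26 + ord("A"))
        out.append(c)
    return "".join(out)
```

The shifted plaintext then goes through the same homophonic key as before, so letters in the middle rows draw on a different set of homophones than the same letters in the top and bottom rows.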

smokie treats wrote:Finding a cycle model that fits may not lead to a solution, but it would be interesting to see if we can.

Okay. Preferably in a new dedicated thread.

Re: Unigram distance curiosity

Postby Jarlve » Sat Nov 18, 2017 5:51 am

moonrock wrote:10. The regional cycle, which restricts substitutions to or from specific regions, or areas, of the ciphertext; this restriction typically manifests as either a restriction to specific rows or to specific columns, and, if used exclusively, is the equivalent of a series of simple substitutions.

Where did he come up with this stuff?

This moonrock personage is really intriguing to me. He really knows what he is talking about. This is not some guy who invented some cycle systems on the fly. What the hell?

Re: Unigram distance curiosity

Postby smokie treats » Sat Nov 18, 2017 5:59 am

Jarlve wrote:
moonrock wrote:10. The regional cycle, which restricts substitutions to or from specific regions, or areas, of the ciphertext; this restriction typically manifests as either a restriction to specific rows or to specific columns, and, if used exclusively, is the equivalent of a series of simple substitutions.

Where did he come up with this stuff?

This moonrock personage is really intriguing to me. He really knows what he is talking about. This is not some guy who invented some cycle systems on the fly. What the hell?


I know. He apparently has put a great deal of thought into the cycles. He showed up out of the blue, made a really good post that still lingers with us, and then just disappeared. Weird!

Re: Unigram distance curiosity

Postby smokie treats » Sat Nov 18, 2017 7:22 am

Here is a progression by rows, 340 on the left and an average message from my first experiment, with transposition, on the right. If a cell is colored, that means that it is unique to the region(s), whether top / bottom or middle.

13 1 35 3 26 32 58 26 14 21 3 50 56 33 60 7 5
30 41 17 44 27 18 38 68 8 46 19 22 34 38 48 16 45
36 30 28 6 7 40 52 68 49 68 9 1 20 42 47 48 2
24 36 61 3 68 68 32 25 26 31 59 27 59 29 59 43 14
5 17 41 8 38 60 17 57 61 18 39 6 20 7 60 44 68
59 42 45 33 55 11 30 34 44 12 1 2 62 51 62 32 28
20 9 35 3 4 40 8 9 68 17 18 35 31 43 53 20 5
6 29 7 46 55 63 40 39 17 14 15 34 34 68 34 21 10
16 52 55 37 45 40 63 33 30 56 11 25 59 12 34 68 38
68 8 32 57 68 68 34 9 17 26 45 18 13 30 39 29 30
10 14 8 35 34 47 55 1 45 58 12 46 15 61 63 36 2
37 3 4 19 27 31 60 68 20 16 68 35 68 4 44 49 61
22 48 32 23 59 8 36 24 6 7 29 37 45 35 3 63 36
37 33 44 56 68 18 28 45 19 1 33 40 41 2 3 47 25
20 4 31 9 68 35 5 36 62 53 32 60 38 31 6 13 33
39 37 4 35 28 17 19 68 34 50 40 19 68 29 41 68 51
58 49 18 43 22 34 33 68 38 24 40 17 18 34 52 27 68
6 10 2 44 30 62 68 3 9 30 28 61 25 12 36 41 68
4 53 40 8 29 26 9 24 19 32 38 42 3 59 6 39 30
20 1 14 23 62 31 29 23 10 55 33 52 11 12 55 30 45

You can see a big difference in symbols only appearing in the middle for 3-14-3. Almost twice as many. 34 for the 340 and 18 for the test message.

unigram distance 3.png


And then the difference starts to get more dramatic stepping into 4-12-4, 5-10-5, and 6-8-6.

unigram distance 4.png


They get more similar with 7-6-7, 8-4-8 and 9-2-9, except that with 8-4-8 there is a big difference between the top / bottom counts.
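
The progression by rows can be sketched as a loop over every symmetric split of the grid. This is only an illustration with a made-up function name; it counts positions whose symbol never appears in the other region, for k rows at the top, k rows at the bottom and the rest in the middle.

```python
def split_sweep(grid):
    """For each k-(n-2k)-k split of an n-row grid, return a dict mapping the
    split label to (outer_only, middle_only) position counts."""
    n = len(grid)
    results = {}
    for k in range(1, (n + 1) // 2):  # keeps at least one middle row
        outer_cells = [s for row in grid[:k] + grid[n - k:] for s in row]
        middle_cells = [s for row in grid[k:n - k] for s in row]
        outer_set, middle_set = set(outer_cells), set(middle_cells)
        outer_only = sum(1 for s in outer_cells if s not in middle_set)
        middle_only = sum(1 for s in middle_cells if s not in outer_set)
        results[f"{k}-{n - 2 * k}-{k}"] = (outer_only, middle_only)
    return results
```

For a 20-row message this produces the splits from 1-18-1 through 9-2-9, which makes it easy to compare the 340 against a test message split by split.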

Re: Unigram distance curiosity

Postby smokie treats » Sat Nov 18, 2017 7:28 am

9-2-9 the same again.

unigram distance 6.png
