Unigram distance curiosity

Re: Unigram distance curiosity

Postby Jarlve » Mon Nov 20, 2017 11:47 am

doranchak wrote:Z340 has 10 symbols that occur only in the first or last six lines

The 340 has 10 unique symbols occupying 39 positions that occur only in the first and last six lines.

doranchak wrote:The 10 exclusive symbols occupy 39 positions of Z340. This is a 1.66 observation compared to randomizations. About 1 in 16 shuffles had at least that many positions occupied by symbols exclusive to those lines.

I posted this table earlier (see codebox), it shows a sigma of 2.37 for the 39 positions while you have it at 1.66. 100000 randomizations: mean=20.27, variance 62.45, standard deviation=7.90, sigma of 39: 2.37.

Code: Select all
Row division: top-bottom unique unigram count, sigma
--------------------------
1: 0, -0.20
2: 0, -0.45
3: 4, 1.02
4: 15, 2.93 <---
5: 20, 1.89
6: 39, 2.37 <---
7: 43, 0.30
8: 94, 1.24
9: 147, -0.22
User avatar
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

Re: Unigram distance curiosity

Postby Jarlve » Mon Nov 20, 2017 12:17 pm

moonrock wrote:
Jarlve wrote:Moonrock, out of all his cycle types considered regional cycling for the 340.

My hypothesis was that the 340 cipher used a combination of regional cycles and semi-regional cycles...

Hey moonrock,

Thanks for your explanation. I have some questions for you if you do not mind.

1. What statistics have you used to compare your ciphers versus the 340?

2. You once made a nifty list of different cycle types. Did you source these from somewhere? If not, what prompted you to come up with it?

3. Your definition of regional cycling seems very interesting in relation to some of the observations in this thread. Namely, a fair amount of symbols in the 340 do not appear in the middle 8 rows. Do you think this could be caused by regional cycling? Could you give some examples of a regional cycle.

4. Could you give another example of a concurrent cycle. I suppose the order between the sub cycles does not really mater as long as each individual sub cycle cycles? In your example (see quote below) I can see the 1 incrementing its position and the 5 is alternating between the 4th and 5th position.

5. I think some of us would like to work on different cycle types. Do you mind if I create a thread for this which lists your original cycle types and some others?

moonrock wrote:1. The perfect cycle, which has substitutions arranged in an unchanging pattern throughout the entire cipher: 12345 - 12345 - 12345 - 12345.
2. The increasingly random cycle, which has substitutions start off in an organized cycle and gradually become random: 12345 - 12345 - 12435 - 24153.
3. The decreasingly random cycle, which is the opposite of the increasingly random cycle: 24153 - 12435 - 12345 - 12345.
4. The random cycle, which has the substitutions arranged in random order: 32415 - 12543 - 41352 - 53124.
5. The concurrent cycle, which has two separate cycles existing at the same time for a single substitution; 1 and 5 cycling, and 2, 3, and 4 cycling in the following example: 12345 - 21354 - 23154 - 23415.
6. The palindromic cycle, which has the substitutions arranged in an order that reads the same forward and backward: 11211 - 11211 - 11211, 12321 - 12321 - 12321, and 123454321.
7. The inverted cycle, which has a uniform cycle inverted to be the opposite of what it was beyond a certain point: 12345 - 12345 - 54321 - 54321, and 11211 - 11211 - 22122 - 22122. The former example is an example of a perfect cycle being inverted, which creates a palindrome, and the latter example is of a palindromic cycle being inverted.
8. The shortened cycle, which has a cycle decrease in length as the ciphertext progresses: 12345 - 12345 - 1234 - 1234 - 123 - 123.
9. The lengthened cycle, which is the opposite of the shortened cycle: 123 - 123 - 1234 - 1234 - 12345 - 12345.
10. The regional cycle, which restricts substitutions to or from specific regions, or areas, of the ciphertext; this restriction typically manifests as either a restriction to specific rows or to specific columns, and, if used exclusively, is the equivalent of a series of simple substitutions.
11. The semi-regional cycle, which has the frequency of substitutions fluctuate between different regions, or areas, of the ciphertext in a similar way to regional cycling. When regional and semi-regional cycles are combined, it increases their level of security. Both regional and semi-regional cycles are capable of being accompanied by non-regional assignment of substitutions.
12. The sequential cycle, which is when one type of cycle is followed by another type of cycle: 12345 - 12345 - 123454321 - 1234321 - 12321; in this example, a perfect cycle changes to a palindromic cycle, which is then shortened.
User avatar
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

Re: Unigram distance curiosity

Postby moonrock » Mon Nov 20, 2017 11:46 pm

Jarlve wrote:Hey moonrock,

Thanks for your explanation. I have some questions for you if you do not mind.

1. What statistics have you used to compare your ciphers versus the 340?

2. You once made a nifty list of different cycle types. Did you source these from somewhere? If not, what prompted you to come up with it?

3. Your definition of regional cycling seems very interesting in relation to some of the observations in this thread. Namely, a fair amount of symbols in the 340 do not appear in the middle 8 rows. Do you think this could be caused by regional cycling? Could you give some examples of a regional cycle.

4. Could you give another example of a concurrent cycle. I suppose the order between the sub cycles does not really mater as long as each individual sub cycle cycles? In your example (see quote below) I can see the 1 incrementing its position and the 5 is alternating between the 4th and 5th position.

5. I think some of us would like to work on different cycle types. Do you mind if I create a thread for this which lists your original cycle types and some others?


1. I am referring to this table:

https://docs.google.com/spreadsheets/d/ ... itMiA/edit

One of my test ciphers, called "moonrock1", is in this table and is near the top. moonrock1 was created using the methods that I thought the 340 cipher might be using. moonrock1, however, contains numerous encryption errors because I made it hastily. For example, the number of times each symbol occurs does not match the number of times each symbol in the 340 cipher occurs. I made "improvements" in an additional cipher, but this cipher was not added to the foregoing document, so I do not know how it compares to moonrock1. If you are unable to see moonrock1 in that document, then here it is:

I79,L8#;T<WJ2.4QG
1/ZNAD{-0W$U4OCYJ
VXK\9;TTS:;27L_"R
3,4)2<6AD;Z?SF/8{
=X&0_[F-G&N74_*,0
"+.NJO44D:]WTTKN7
R0<V\_T}&UOG,4<F#
TTO0'P4D4270QH"+V
G=X_8MZ44}7TC;Z']
/63K0#1%AJ#GSH2:D
IB;*^4CI27JZ2DUVR
,%8H(/]08WI2"B;);
9,T}U.>)6-4EYR}:.
D45T=0*P1F#Y4]G&#
Y[.1EJ*K&:6;/Z8=K
T]()/6#DC[B_&;+}"
A+2]G)B6^Y*$E4G#-
?L[)ZE4#:A>9PDYL}
934B":(1Z#(TNG)V6
^79)B;-=ESJ5-CH;R

And here is moonrock2, the "improved" version of moonrock1:

079,R8#;JEWT2N%QG
1T6A.YL-0$WI4OCD/
VGK\X&J/3:;.Y>_B4
S,4BA<Z.7;Z3SFT(4
K[FUV*F-XF2D4_9,0
"UA#:?447T]$TT=NY
{0E_LVJ<F0?*,4EF2
TJ)0QP4Y4AD+QH?IV
9KG_(MZ44}YJ];Z8C
T63K+N14#JA9SH#:7
I?&9^4]02YJZAD+_\
,4'H8J]U(W02?O;"&
[,:}02>O6-{<74ET#
D45:=09P1FNY4CGFN
7GN1</G=F/6;:Z'=K
:]Q"TZ2YCG)VF&0EO
A02]GO)6PDX$}4[.-
S4G"Z<R./A44^Y74E
X3%B)/Q162'TAG)VZ
PD[?B;-=<UA952(^4

2. I didn't source the list from anywhere. It's something that I came up with on my own.

3. So when I consider regional and semi-regional cycles to be used in the 340 cipher, the distinct regions in the 340 cipher roughly correspond to the Olson line boundaries, i.e. region 1 is the first few lines, region two is lines ~4 to ~10, and region three is lines ~11 to ~13. It's unclear where the boundaries between regions are after line 13, so there may be 5 or 6 regions in total or some ciphertext letters or plaintext letters may have their own set of regions while others have their own regions. Between each of these regions, numerous symbols either exclusively appear in one or two of the regions, or they experience very noticeable changes in frequency from region to region. The "plus sign" ciphertext symbol is a noticeable example of fluctuating frequency between the regions. The symbols that have been highlighted as only appearing at the beginning and end are among the ones that may be assigned regionally, yes. My two test ciphers above also contain many such substitutions.

4. 1 and 5 behaving like that in my example was incidental since it was probably a convenient way to make the cycle work. Here are three concurrent cycles, each with two (1 and 2; 3 and 4, and 5 and 6): 123456 - 315624 - 534612 - 531462 - 152634. These example have been constructed in a way that requires each substitution to be used in each regular interval, but this isn't necessary, e.g. 123456 - 3124 - 5346 - 531462 - 152634. This is the same as the previous example but with some deletions; the concurrent cycles remain. Also, to clarify the concurrent cycle, these are not separate cycles existing for different plaintext letters; they are separate cycles being used for a single plaintext letter that, when combined, may appear to be unrelated to each other.

5. No, I don't mind. If you feel like doing so would be of benefit to our efforts, then feel free to do so.
moonrock
 
Posts: 8
Joined: Sat Sep 03, 2016 6:04 pm

Re: Unigram distance curiosity

Postby smokie treats » Tue Nov 21, 2017 7:55 am

It looks to me like the W, C and Theta are the main contributors to the regional bias, because they cluster together more, and the other symbols are more spread out.

That accounts for 14 of the 39 positions. But even with 25 that is high compared to my 768 test messages. There were only 24 of the 768 messages that had as many as 25 positions for a 6-8-6.

Thanks for sharing more of your ideas moonrock. I don't think that anybody is going to solve this message alone, so thanks a lot.

Seems like the easiest way to pencil and paper encode a message is to encode all of the letter A, all of the letter B, all of the letter C and so on. That would make it easier to use different cycle types for the different letters. Encoding by position and using different cycle types for the letters would make it difficult to keep track of cycle type.

Maybe if you had A = 1 2 3 4 5 you could encode 1 1 1 2 3 4 2 3 4 5 5 5, which would create the regional biases.

But then next encode B = 6 7 as 6 7 6 7 6 7, which would boost the L=2 cycle scores but make it impossible to differentiate from false L=2 cycles.

Maybe he just encoded high frequency plaintext, like A, in a regional way to hide the cycles EDIT: homophone groups. That to me seems like the simplest explanation for a pencil and paper cipher.
User avatar
smokie treats
 
Posts: 1619
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Unigram distance curiosity

Postby Jarlve » Tue Nov 21, 2017 10:37 am

moonrock wrote:3. So when I consider regional and semi-regional cycles to be used in the 340 cipher, the distinct regions in the 340 cipher roughly correspond to the Olson line boundaries, i.e. region 1 is the first few lines, region two is lines ~4 to ~10, and region three is lines ~11 to ~13. It's unclear where the boundaries between regions are after line 13, so there may be 5 or 6 regions in total or some ciphertext letters or plaintext letters may have their own set of regions while others have their own regions...

Thank moonrock.

What happens to a regional cycle when it leaves its region? Does it become a 1:1 substitute?
User avatar
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

Re: Unigram distance curiosity

Postby Jarlve » Tue Nov 21, 2017 11:50 am

If anyone feels like doing some work on the cycles and cycle types, I have created a thread for it, feel free to use it: viewtopic.php?f=81&t=3616
User avatar
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

Re: Unigram distance curiosity

Postby smokie treats » Tue Nov 21, 2017 2:13 pm

I scanned through the 340 in chunks of 8 rows and chunks of 10 rows, sliding down through the message one row at a time. And calculated the cycle scores.

Left column is top row scanned, middle column is lowest row scanned, and right column is score. Only scanning in chunks of 8 rows I didn't find much of a difference between chunks, except that the highest scores were rows 5 through 12 at 10566 and rows 9 through 16 at 10524.

1 8 10062
2 9 9856
3 10 9746
4 11 9864
5 12 10566
6 13 10490
7 14 10046
8 15 9800
9 16 10524
10 17 10018
11 18 8938
12 19 8716
13 20 9068

Here is a scan of 10 rows chunks. The highest scoring chunk was rows 7-16, which is exactly where the symbols unique to the top and bottom 6 rows do not exist. There is a strong second place though at rows 3-12.

1 10 14888
2 11 15638
3 12 16012
4 13 15118
5 14 15218
6 15 15120
7 16 16052
8 17 14972
9 18 14120
10 19 13908
11 20 13324
User avatar
smokie treats
 
Posts: 1619
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Unigram distance curiosity

Postby doranchak » Thu Nov 23, 2017 7:13 am

Jarlve wrote:
doranchak wrote:Z340 has 10 symbols that occur only in the first or last six lines

The 340 has 10 unique symbols occupying 39 positions that occur only in the first and last six lines.

doranchak wrote:The 10 exclusive symbols occupy 39 positions of Z340. This is a 1.66 observation compared to randomizations. About 1 in 16 shuffles had at least that many positions occupied by symbols exclusive to those lines.

I posted this table earlier (see codebox), it shows a sigma of 2.37 for the 39 positions while you have it at 1.66. 100000 randomizations: mean=20.27, variance 62.45, standard deviation=7.90, sigma of 39: 2.37.


Thanks for pointing that out, Jarlve. My test did indeed lack the requirement that the symbol had to appear in both regions. I did another test that generalized to this search: Find symbols exclusive to pairs of equal-sized rectangular regions, where the exclusive symbols appear at least once in each region. This ended up being a rougher version of an even more general search: Find all symbols exclusive to pairs of small-ish regions, where the exclusive symbols appear at least once in each region. This search notices some of the "tighter" clusters such as these:

Image

Image

Image

Image

Are such clusters true features of the encipherment method? I think to answer this will require careful comparisons of the overall distribution of these observations, instead of focusing on individual observations, because there are so many observation samples taking place.
User avatar
doranchak
 
Posts: 2358
Joined: Thu Mar 28, 2013 5:26 am

Re: Unigram distance curiosity

Postby Largo » Thu Nov 23, 2017 7:31 am

I made a similar observation some time ago (not distance curiosity at all, but region-based symbols). Don't know if it is worth a deeper look:

Reversed_B.png


The reversed B only occurs three times. Two times in the pivot on the right and one time indirectly connected to the pivot on the left if you assume they form a square. I made some tests by replacing the symbols from the pivot on the right with the symbols from the pivot on the left. No obvious observations. I also replaced all symbols of the pivots by one single symbol. Also no interesting results. Maybe I'll give my "merge the pivots" approach another try.
You do not have the required permissions to view the files attached to this post.
Largo
 
Posts: 454
Joined: Tue Jun 14, 2016 4:38 am
Location: Frankfurt, Germany

Re: Unigram distance curiosity

Postby Jarlve » Thu Nov 23, 2017 1:55 pm

doranchak wrote:Are such clusters true features of the encipherment method? I think to answer this will require careful comparisons of the overall distribution of these observations, instead of focusing on individual observations, because there are so many observation samples taking place.

Some meta reasoning,

From looking at a perfect cycle many of the measureable properties versus randomizations of sequential homophonic substitution can be inferred/upscaled. One of these measurable properties is that the symbols tend to spread out over the cipher with roughly similar distances. To put it in other words, look at any symbol in the cycle (see codebox) and notice that the distance between them is always 5. This property tends to happen in sequential homophonic substitution ciphers. Ofcourse the distance similarity will be more roughly but certainly measurable versus randomizations and for this reason the observation in the 340 is quite significant. It does not make a whole lot of sense that so many of these symbols would not appear in the middle 8 rows as they are more or less expected to appear at regular intervals.

The observation correlates better with randomizations than with sequential homophonic substitution.

Code: Select all
ABCDEABCDEABCDEABCDEABCDE
User avatar
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

PreviousNext

Return to Zodiac Cipher Mailings & Discussion

Who is online

Users browsing this forum: Shawn, versaceversace and 30 guests

cron