Unigram distance curiosity

Re: Unigram distance curiosity

Postby Largo » Wed Nov 08, 2017 10:59 am

I am currently reading up on the subject in more depth, so apologies if I repeat something that has already been mentioned. I can't contribute much yet, but I took the opportunity to put my statistics tool and cipher generator to practical use. I used Project Gutenberg's "Dracula" as the source and created a large number of encryptions, then kept only the results with a raw ioc of at least 2200 and a distance score of at least 15034. Here are the results (don't rely on them ... my tool is still in beta):

Code:
Ciphers analyzed: 1937 (each from period 1 to 170)

0% Randomization (Perfect Cycles): 27 ciphers matched or exceeded target score
10% Randomization: 11 ciphers matched or exceeded target score
20% Randomization: 2 ciphers matched or exceeded target score

Used key:
dDiZ94bekBHax3y+0PfgIEJF2phcC5mj:uU;Aq7RX=YOnGVto1zKMLw8sNrvSQl-TW
AAAABBCCDDDEEEEEEEEEFGGHHHIIIIJKLLLMNNNNOOPQRRRRSSSSSTTTTTUUUVWXYZ
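
For readers who want to reproduce this kind of test, here is a minimal Python sketch of a sequential (cyclic) homophonic encoder built from the key above. It is not Largo's tool: the function names are made up for illustration, and the handling of the randomization percentage (pick a random homophone with that probability, then continue the cycle from the picked one) is an assumption about what "randomization" means here.

```python
import random
from collections import defaultdict

# Key from the post: each symbol in SYMBOLS stands for the letter below it.
SYMBOLS = "dDiZ94bekBHax3y+0PfgIEJF2phcC5mj:uU;Aq7RX=YOnGVto1zKMLw8sNrvSQl-TW"
LETTERS = "AAAABBCCDDDEEEEEEEEEFGGHHHIIIIJKLLLMNNNNOOPQRRRRSSSSSTTTTTUUUVWXYZ"


def build_homophones(symbols, letters):
    """Map each plaintext letter to its ordered list of cipher symbols."""
    table = defaultdict(list)
    for sym, let in zip(symbols, letters):
        table[let].append(sym)
    return dict(table)


def encode(plaintext, table, randomization=0.0, rng=None):
    """Cycle through each letter's homophones in order.

    With probability `randomization` a random homophone is used instead
    (assumed meaning of "cycle randomization"); the cycle then continues
    from the homophone that was picked.
    """
    rng = rng or random.Random()
    next_index = {let: 0 for let in table}
    out = []
    for ch in plaintext.upper():
        if ch not in table:
            continue  # skip spaces, punctuation and digits
        homophones = table[ch]
        if rng.random() < randomization:
            i = rng.randrange(len(homophones))
        else:
            i = next_index[ch]
        out.append(homophones[i])
        next_index[ch] = (i + 1) % len(homophones)
    return "".join(out)


table = build_homophones(SYMBOLS, LETTERS)
print(encode("attack at dawn", table))                      # perfect cycles
print(encode("attack at dawn", table, randomization=0.2))   # loosened cycles
```

With `randomization=0.0` every letter walks through its homophones in strict order (perfect cycles); raising the value loosens the cycles.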


I couldn't find any encryption where a high distance score and a high raw ioc occurred together at a period higher than 1. That is not surprising. If the high distance score in z340 is not a phantom, this could be further evidence that z340 was not transposed >after< homophonic substitution (which is almost ruled out anyway).
The less cyclic the encryption is, the less frequently high distance scores occur. This is also no surprise. But here z340 stands out... again. It's not perfectly cyclic, yet it has a high distance score.

Another thing: It is curious that AZDecrypt takes longer than usual for many of the found encryptions. I don't have enough analysis material to rule out coincidence. I just noticed it.

Here are some examples (if you would like more, I can generate them):

Raw ioc: 2289, Distance: 15224
Code:
Fax;YLTQ3hAol2cbp
YCqyIXnF5;m=27ld1
wXEhQ+pcz9:=XkDKF
0CM82P;=GfTXrRJiA
BosV=qENpZ7;gFatx
dnL2vGwX=j;TpDRHi
AklVSqJ58FZtBh71c
u3Rey4rsAXlT=vdn+
20GPTXSDVf;=tgEX=
HNpiqrzXUk=nTXv7J
lF=LXC:;Sb25Rwpal
=GuBXI8F=rE2sXvVA
xtQ3KZnyq=NMXedU;
D7H=SG9:X=kRXLo=4
VhJpw8FiATXrt1Zns
2vGNSVq+BL=pc;d7H
zDCk5ITXr=RuTjA0l
FXlEUiB:Thl=vuHkc
PIXt2fnT=SlXrUBvq
HgGKwZ7kpaM8=YYxB


Raw ioc: 2261, Distance: 15944
Code:
lhLFdoXnw=Ib2Xjac
ApC1Q=5exEX=k9XTz
DhBQiqF3:Kc7JCR82
yA=sMXIZG=IITXrl5
uU4+pdYYTNFDLT=v2
iQ0HXqPZ:uI=VpftT
XSU=QgbX;a7=ldRk9
xoh:3AwTXr1FDuUjc
zK2yn=qe+40IXGPC8
5MB=7f9vsNpgRTXS;
roLE=iAHTXv;S1w:a
ZQxd8;TzhJqKDT7=l
XVks=;iBZ;3TXrjR=
lFXlcNCMl5L2pytwF
+n0;vo84PA=12Xbjd
qTj7=lufHEgXIsphz
l=SUk9aXRxe=;3lyD
:ul+ANrYLXUvbTKG=
X;iVwFSt4TBcn0e8C
=qGP;Z57fHXrsMhkg


Raw ioc: 2663, Distance: 15184
Code:
;aXAx:=XjIn=;TXrl
huUlcY3dlDTi:uLFC
o1=zZk2XvGI=V5lXS
UBH=lpdwD;iqbZ78X
KdQyT=rItX;M=nGXl
mvosNFhRjI=Vl2T1p
XS:kcECQ+;Tz0uIK=
;reFXIUD9=vtiABMX
;Sb2=IoXnG=l5pZQP
eX;fFgVaIt=;;TXlq
:d7HL=kXl2DwhbiR=
IJX=BZ8spxIcn1NLX
Yu3dzy;TIGC+AHm=F
qD7kw20R8XpPUYiKl
fgsT=rAE:ZBTlFX;N
=X5ed;aL=uXQxI=V2
3thD;iMpZ;yHwXodT
1=;vbF4S8czDTCs5q
jh7kR+KMcJiQ0l2ZN
TXrEdQPLpf9U=XB=I
Largo
 
Posts: 455
Joined: Tue Jun 14, 2016 4:38 am
Location: Frankfurt, Germany

Re: Unigram distance curiosity

Postby Jarlve » Wed Nov 08, 2017 1:16 pm

doranchak wrote:
Jarlve wrote:In the 340 there are very few unigram repeats over short distances, especially when taking into consideration its higher ioc per cipher length than the 408. This is why we see 9 rows which have no repeats. It is not easily connected to transposition after/during encoding and typical encoding randomization.

I did another test comparing unigrams in rows and columns. The idea is this:
Take a symbol, then count how many columns it is in. For example, there are three P symbols but they appear in only two columns.
Make these counts for every symbol, then the final measurement is the sum of the counts.

Cool measurement, doranchak. I have verified your numbers, added normalization and given it a name, if you do not mind.

Code:
340:
Unigram row coverage:
- Raw: 322
- Normalized: 0.9583333333333334
Unigram column coverage:
- Raw: 284
- Normalized: 0.8528528528528528

Code:
408:
Unigram row coverage:
- Raw: 380
- Normalized: 0.9313725490196079
Unigram column coverage:
- Raw: 337
- Normalized: 0.8259803921568627
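
The coverage count described by doranchak can be sketched in a few lines of Python. This is only an illustration on a toy 3x3 grid, not the code either poster used; the function name `unigram_coverage` and the flat-string layout with a fixed row width are assumptions.

```python
from collections import defaultdict


def unigram_coverage(cipher, width):
    """Return (row_coverage, column_coverage) for a cipher written in rows of `width`.

    For every symbol, count the distinct rows (or columns) it appears in,
    then sum those counts over all symbols.
    """
    rows = defaultdict(set)
    cols = defaultdict(set)
    for pos, sym in enumerate(cipher):
        rows[sym].add(pos // width)
        cols[sym].add(pos % width)
    row_cov = sum(len(r) for r in rows.values())
    col_cov = sum(len(c) for c in cols.values())
    return row_cov, col_cov


# Toy grid:  P A B
#            C P D
#            P E F
# P occurs three times, in three rows but only two columns.
print(unigram_coverage("PABCPDPEF", 3))  # (9, 8)
```

A symbol that never shares a row with itself contributes its full count to row coverage, so few repeats within rows push the row value up.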

The following test encoded each message of my plaintext library 100 times with cyclic homophonic substitution (26% cycle randomization) while targeting the ioc of the 340. It covers 10,000 samples. The 340 appears to line up very well except for unigram distance and rows with no unigram repeats.

Test average versus 340:

Average raw ioc: 2236.09 versus 2236
Average 2-symbol cycles: 2139.03 versus 2137
Average 3-symbol cycles: 5999.64 versus 5922
Average perfect 2-symbol cycles: 1480.27 versus 1576
Average perfect 3-symbol cycles: 930.58 versus 1060
Average unigram repeats: 21.17 versus 18
Average unigram distance: 13756.88 versus 15034
Average unigram row coverage: 318.76 versus 322
Average unigram column coverage: 284.15 versus 284
Average rows with no unigram repeats: 6.55 versus 9
Average columns with no unigram repeats: 0.27 versus 0
Jarlve
 
Posts: 2544
Joined: Sun Sep 07, 2014 9:51 am
Location: Belgium

Re: Unigram distance curiosity

Postby Jarlve » Wed Nov 08, 2017 1:40 pm

Largo wrote:The less cyclic the encryption is, the less frequently high distance scores occur. This is also no surprise. But here z340 stands out... again. It's not perfectly cyclic, yet it has a high distance score.

I am glad you came to the same conclusion. If it is not an outlier, here are some ideas of mine (from the 2nd page of this thread).

Jarlve wrote:Here are my 2 most likely hypotheses for the high unigram distance in the 340:
1. The high unigram distance of the 340 is related to the group of symbols that do not appear in the middle 7 rows (as marie said).
2. A long key (as Largo said) is used in the homophonic substitution process, and some of the most frequently occurring symbols are for whatever reason not part of it. These symbols are most likely 1:1 substitutes, plaintext nulls or wildcards (as smokie said). I would like to go as wide as possible with the interpretation of wildcards.

In your test, my cipher jarlve2, which follows hypothesis 2, very closely matches the 340. Some things to look into may be exotic cycling types (again), such as palindromic cycling, which also increases unigram distance, and avoiding symbol repeats within a certain view window as opposed to actively cycling homophones.

I have tested the no-repeat view window hypothesis and it does not produce a high unigram distance, so that can be ignored. Palindromic cycling has also been tested by doranchak but still has my interest.

Crude example of hypothesis 2:

Code:
1-gram frequencies > 0:
--------------------------------------------------------
+: 24
B: 12
p: 11
|: 10
O: 10
c: 10
F: 10
<--- Symbols above this line much more likely to be 1:1 substitutes, not part of the key. Symbols under this line are part of a long (efficient) key.
2: 9
z: 9
R: 8
l: 7
(: 7
K: 7
M: 7
5: 7
^: 6
V: 6
L: 6
G: 6
W: 6
.: 6
<: 6
*: 6
4: 6
k: 5
T: 5
d: 5
N: 5
#: 5
): 5
y: 5
U: 5
-: 5
C: 5
H: 4
>: 4
D: 4
Y: 4
f: 4
Z: 4
J: 4
S: 4
8: 4
9: 4
t: 4
E: 3
P: 3
1: 3
7: 3
_: 3
/: 3
;: 3
b: 3
6: 3
%: 2
:: 2
3: 2
j: 2
&: 2
q: 2
X: 2
A: 2
@: 1
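
The table above is enough to check a few of the numbers quoted earlier in the thread. The Python sketch below assumes that "raw ioc" is the unnormalized sum of n*(n-1) over all symbol counts, and that the coverage normalization divides the raw value by the largest value the counts allow (a symbol with n occurrences can touch at most min(n, 20) rows or min(n, 17) columns in the 20x17 grid). Neither definition is stated in the thread; both are inferences that happen to reproduce the posted figures.

```python
# Symbol counts from the frequency table above, grouped by count.
BY_COUNT = {
    24: "+", 12: "B", 11: "p", 10: "|OcF", 9: "2z", 8: "R",
    7: "l(KM5", 6: "^VLGW.<*4", 5: "kTdN#)yU-C", 4: "H>DYfZJS89t",
    3: "EP17_/;b6", 2: "%:3j&qXA", 1: "@",
}
freqs = {sym: n for n, syms in BY_COUNT.items() for sym in syms}

ROWS, COLS = 20, 17  # layout of the 340

length = sum(freqs.values())
raw_ioc = sum(n * (n - 1) for n in freqs.values())
max_row_cov = sum(min(n, ROWS) for n in freqs.values())
max_col_cov = sum(min(n, COLS) for n in freqs.values())

print(len(freqs), length)         # 63 340
print(raw_ioc)                    # 2236
print(max_row_cov, max_col_cov)   # 336 333
print(322 / max_row_cov)          # about 0.9583, the posted normalized row coverage
print(284 / max_col_cov)          # about 0.8529, the posted normalized column coverage
```

Only the "+" symbol (24 occurrences) exceeds the row and column limits, which is why the two maxima fall just short of 340.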

Re: Unigram distance curiosity

Postby Jarlve » Thu Nov 09, 2017 12:50 pm

I have added the standard deviation to my analysis tool.

Unigram-based measurement results for the 340 versus 1,000,000 randomizations:

Unigram repeats by rows, sigma of 18: 5.39
Sliding unigram repeats (slide window=17), sigma of 364: 5.66
Unigram row coverage (doranchak's new test), sigma of 322: 5.39
Unigram column coverage (doranchak's new test), sigma of 284: 0.46
Rows that have no unigram repeats, sigma of 9: 7.01
Columns that have no unigram repeats, sigma of 0: 0.60
Unique sequence length 17 repeats, sigma of 26: 7.89
Unigram distance, sigma of 15034: 4.23

Unigram-based measurement results for the 340 versus plaintexts 1 to 100 encoded with sequential homophonic substitution (26% cycle randomization, raw ioc target 2236), totalling 100,000 samples:

Unigram repeats by rows, sigma of 18: 0.73
Sliding unigram repeats (slide window=17), sigma of 364: 0.39
Unigram row coverage (doranchak's new test), sigma of 322: 0.73
Unigram column coverage (doranchak's new test), sigma of 284: 0.01
Rows that have no unigram repeats, sigma of 9: 1.18
Columns that have no unigram repeats, sigma of 0: 0.53
Unique sequence length 17 repeats, sigma of 26: 2.51
Unigram distance, sigma of 15034: 2.32
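
As a rough illustration of how such sigma values can be obtained, the Python sketch below scores a cipher with one of the measurements (rows with no unigram repeats) and compares it to random shuffles of the same symbols. It is not Jarlve's tool: the function names, the default of 1000 trials, the use of the population standard deviation and the signed result are all assumptions (the figures above may well be absolute values).

```python
import random
import statistics


def rows_without_repeats(cipher, width):
    """Count the rows in which every symbol is distinct."""
    rows = [cipher[i:i + width] for i in range(0, len(cipher), width)]
    return sum(1 for row in rows if len(set(row)) == len(row))


def sigma_vs_shuffles(cipher, measure, trials=1000, seed=0):
    """Signed distance of measure(cipher) from the shuffle mean, in standard deviations."""
    rng = random.Random(seed)
    observed = measure(cipher)
    symbols = list(cipher)
    scores = []
    for _ in range(trials):
        rng.shuffle(symbols)
        scores.append(measure("".join(symbols)))
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    # If every shuffle gives the same score there is no spread to measure against.
    return (observed - mean) / sd if sd else 0.0


# Toy cipher of width 5, built so that no row repeats a symbol.
toy = "abcde" "fghab" "cdefg" "habcd"
print(rows_without_repeats(toy, 5))  # 4
print(sigma_vs_shuffles(toy, lambda c: rows_without_repeats(c, 5)))
```

Comparing against a hypothesis such as sequential homophonic substitution works the same way, except that the scores come from freshly encoded sample ciphers instead of shuffles.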

Re: Unigram distance curiosity

Postby doranchak » Mon Nov 13, 2017 11:50 pm

Thanks for that nice summary, Jarlve. The unigram observations have been very interesting.

I tested your unigram distance measurement on a per-symbol basis, by tracking the distance sums for individual symbols and comparing to shuffles. Here are the results, limited to outlier symbols that had sigma greater than 1 or less than -1:

[Image: per-symbol sigma results]

Observations:
1) Z340 has almost twice as many outliers under this criterion (abs(sigma) >= 1.0)
2) The most anomalous symbol behavior happens in Z408 for the empty square symbol, with sigma of -2.43. The reason is that it stands for plaintext "Y", which is clustered in the 2nd half of the cipher (due to repetitions of the words MY and YOU).
[Images]
3) In Z340, filled circle (sigma -1.66) and left filled circle (sigma -1.51) have low sigma due to apparent clustering of symbols. Z408 shows some clustering for empty square and backwards L.
[Images]
4) In Z340, higher sigma symbols such as H, >, P, and backwards K result from being unusually spread out (as has been already discussed).
[Image]
doranchak
 
Posts: 2360
Joined: Thu Mar 28, 2013 5:26 am

Re: Unigram distance curiosity

Postby Jarlve » Tue Nov 14, 2017 3:57 am

Thank you doranchak,

These visuals will help people to understand the unigram distance measurement on an intuitive level.

doranchak wrote:1) Z340 has almost twice as many outliers under this criterion (abs(sigma) >= 1.0)

I am trying to find out if the high unigram distance of the 340 is related solely to the symbols that have large gaps between them or if it is a more widespread phenomenon. What do you think?

Re: Unigram distance curiosity

Postby smokie treats » Tue Nov 14, 2017 7:08 am

Here is another similar visualization, but with the new font thanks to Largo and doranchak. It just looks so awesome I can't get over it. Plus I got a big new high definition monitor plugged into my little laptop so I can see it really big.

unigram distance 1.png
smokie treats
 
Posts: 1620
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Unigram distance curiosity

Postby doranchak » Tue Nov 14, 2017 8:23 am

Jarlve wrote:I am trying to find out if the high unigram distance of the 340 is related solely to the symbols that have large gaps between them or if it is a more widespread phenomenon. What do you think?


My hunch is it is more widespread. Here's an idea for a test:

1) Remove the symbols that have large gaps between them
2) Measure unigram distance for the now shorter cipher text
3) Compare to shuffles of the shorter cipher text
4) See what effect the symbol removal has on sigma.

Another approach could be this:
1) Select some integer "k"
2) Determine the set of k symbols that has greatest effect on sigma, by testing removals of all combinations of k symbols.
I.e., the goal of this test would be to identify a specific set of symbols that has the greatest effect on unigram distance.
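
Both tests can be sketched in Python as follows. This is a hedged illustration, not code from any poster: it assumes that unigram distance is the sum of the gaps between consecutive occurrences of each symbol, and the function names, trial counts and signed sigma are invented for the example. The second test tries every combination of k symbols, so it is only practical for small k.

```python
import random
import statistics
from itertools import combinations


def unigram_distance(cipher):
    """Sum of the gaps between consecutive occurrences of each symbol (assumed definition)."""
    last_seen = {}
    total = 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            total += pos - last_seen[sym]
        last_seen[sym] = pos
    return total


def sigma_vs_shuffles(cipher, trials=1000, seed=0):
    """Signed sigma of the cipher's unigram distance versus random shuffles."""
    rng = random.Random(seed)
    symbols = list(cipher)
    scores = []
    for _ in range(trials):
        rng.shuffle(symbols)
        scores.append(unigram_distance(symbols))
    sd = statistics.pstdev(scores)
    return (unigram_distance(cipher) - statistics.mean(scores)) / sd if sd else 0.0


def remove_symbols(cipher, symbols):
    """Return the shorter cipher text with the given symbols deleted."""
    return "".join(s for s in cipher if s not in symbols)


def removal_effect(cipher, symbols, trials=1000):
    """First test: sigma before and after removing the chosen symbols."""
    return (sigma_vs_shuffles(cipher, trials),
            sigma_vs_shuffles(remove_symbols(cipher, symbols), trials))


def most_influential(cipher, k, trials=200):
    """Second test: the k-symbol set whose removal changes sigma the most."""
    base = sigma_vs_shuffles(cipher, trials)
    return max(
        combinations(sorted(set(cipher)), k),
        key=lambda combo: abs(base - sigma_vs_shuffles(remove_symbols(cipher, combo), trials)),
    )


# Toy cipher: "x" has one very large gap, the other symbols cycle regularly.
toy = "xabcdabcdabcdabcdabcdx"
print(removal_effect(toy, {"x"}, trials=200))
print(most_influential(toy, 1))
```

For the 340 with its 63 symbols, k = 2 already means 1953 combinations, each needing its own set of shuffles.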

Re: Unigram distance curiosity

Postby doranchak » Tue Nov 14, 2017 10:58 am

smokie treats wrote:Here is another similar visualization, but with the new font thanks to Largo and doranchak. It just looks so awesome I can't get over it. Plus I got a big new high definition monitor plugged into my little laptop so I can see it really big.

Yes! That does look very good. And more impactful than a pile of numbers. :)

Re: Unigram distance curiosity

Postby Jarlve » Thu Nov 16, 2017 5:22 am

doranchak wrote:My hunch is it is more widespread. Here's an idea for a test:

1) Remove the symbols that have large gaps between them
2) Measure unigram distance for the now shorter cipher text
3) Compare to shuffles of the shorter cipher text
4) See what effect the symbol removal has on sigma.

Good idea. I changed my unigram distance measurement to only log distances under 171. The 340 scores 12514 then.

Versus randomizations it is a 2.07 sigma. Not unexpected since unigram distance is a property of sequential homophonic substitution.

Versus a sequential homophonic substitution with 26% cycle randomization hypothesis it is a -0.19 sigma. This tells us that, versus such a hypothesis, the observation is not widespread but local to the symbols that have large gaps between them (which is what I wanted to find out). The distribution seems important; I need to ponder this.
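
The capped measurement can be illustrated with a small Python sketch. As before, this assumes that unigram distance sums the gaps between consecutive occurrences of each symbol; the `max_gap` parameter is a made-up name for the "only log distances under 171" rule and the actual implementation may differ.

```python
def unigram_distance(cipher, max_gap=None):
    """Sum the gaps between consecutive occurrences of each symbol.

    If `max_gap` is given, only gaps strictly below it are counted
    (assumed reading of "only log distances under 171").
    """
    last_seen = {}
    total = 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            gap = pos - last_seen[sym]
            if max_gap is None or gap < max_gap:
                total += gap
        last_seen[sym] = pos
    return total


toy = "abbbbba"  # "a" has one gap of 6, "b" has four gaps of 1
print(unigram_distance(toy))             # 10
print(unigram_distance(toy, max_gap=6))  # 4
```

Setting the cap to half the cipher length, as with 171 for the 340, drops exactly the long gaps that only widely separated symbols can produce.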
