Homophonic substitution

Re: Homophonic substitution

Postby smokie treats » Mon Aug 10, 2015 5:54 am

I think that Zodiac used a relatively flat key to encode the 340, meaning that he used close to the same number of symbols for each letter. Maybe there was some variation, but generally high frequency letters resulted in high count symbols, and low frequency letters resulted in low count symbols. That's why there are quite a few low count symbols. Then he realized that because he was using cycles, he was creating a lot of bigram repeats. For example, if you use three symbols for T, H and E respectively in a cyclic message, there are going to be a lot of repeating bigrams. Or even if you use four symbols for E and T and three symbols for H (does doing it that way cause more bigram repeats in a cyclic message?). Daikon, that's the hidden order.
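
A minimal sketch of the effect described above, assuming a toy key with three symbols each for T, H and E and a strictly cyclic choice of symbols. The key, the symbol numbers, and the repeat-counting convention (a bigram seen n times counts as n - 1 repeats) are illustrative assumptions, not anything taken from the 340.

```python
from collections import Counter
from itertools import cycle

def encode_cyclic(plaintext, key):
    """Encode with a homophonic key, using each letter's symbols in strict rotation."""
    cyclers = {letter: cycle(symbols) for letter, symbols in key.items()}
    return [next(cyclers[ch]) for ch in plaintext]

def count_bigram_repeats(cipher):
    """Count repeats: a bigram seen n times contributes n - 1 (assumed convention)."""
    bigrams = Counter(zip(cipher, cipher[1:]))
    return sum(n - 1 for n in bigrams.values() if n > 1)

# Hypothetical toy key: three symbols each for T, H and E.
key = {"T": [1, 2, 3], "H": [4, 5, 6], "E": [7, 8, 9]}
cipher = encode_cyclic("THE" * 6, key)
print(cipher)
print(count_bigram_repeats(cipher))
```

With three symbols each, the T-H and H-E pairings line up again on every third occurrence, which is where the repeats come from. Changing the key to four symbols for T and E and three for H should make the pairings line up again only on every twelfth occurrence in this toy setup; whether that carries over to a real message is not something the sketch settles.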

He needed a way to change the message somehow so that it could not be solved so easily. So he used some combination of randomization and wildcards to mask all of the bigram repeats that he created with his unsophisticated cycle system. He was just an amateur who improvised one extra step that he didn't use with the 408. I don't know how to solve it, but that's what makes more sense to me. I scrap the randomly placed wildcard hypothesis. Yesterday I burned a lot of time messing around with different messages, keys, and lists of bigram repeats. I wonder if anyone has just tried to make a fairly flat key (not flat distribution of symbols throughout the message, flat distribution of symbols across the letters on the key) for the first 340 of the 408 and then tried to mask 24 bigrams with the same symbol. EDIT: Will masking the most common digraphs, like TH, rather than some of the less common digraphs, make a message more difficult for ZKD to solve? Can you get the same effect with fewer more strategically masked bigrams as compared to more less strategically masked bigrams?
Last edited by smokie treats on Mon Aug 10, 2015 7:52 am, edited 1 time in total.
smokie treats
 
Posts: 1620
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Homophonic substitution

Postby masootz » Mon Aug 10, 2015 7:17 am

smokie treats wrote:He was just an amateur who improvised one extra step that he didn't use with the 408.


yes yes yes yes. there's no reason to assume he was an expert cryptographer. it doesn't require advanced knowledge to make a one time encryption that no one has to be able to decode. the 408 was broken because he did something obvious (using "i" and "kill") so with the 340 he wanted to add a step to make it harder. he might even have wanted to make it close to impossible but i don't think he wanted to make it completely impossible, because the part of his brain that was playing a game with the public and le wanted to feel superior. that feeling of superiority comes from giving someone something complicated that they're too "stupid" to figure out, not giving them something impossible. just my 2 cents.
masootz
 
Posts: 414
Joined: Mon Sep 08, 2014 11:19 am

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 11:53 am

smokie treats wrote:Anyway, I've done my second masking exercise. It was very educational. [...]
17 25 1 15 27 62 9 18 26 35 16 10 13 31 28 37 58


Cracked it! It's a quote from J.R.R. Tolkien. Feeding this cipher to the solver as-is didn't produce any results. So I was ready to give up right there and report that I couldn't solve it, but then I thought, well, let's see if we can find a way to crack it after all. The method I came up with takes more time, but I got it to produce an approximate text that I was able to manually clean up further. I had the solver running overnight to attempt to crack Z340 using the same method, but it didn't produce anything meaningful. I'll keep it running for a while longer, to see if something pans out. Should I disclose the method, or does someone else want to give it a try to solve this cipher?
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 12:00 pm

smokie treats wrote:Another question for anyone out there. Has anyone ever tried to do this before? Has anyone ever tried to figure out a way that Zodiac could have masked bigrams with wildcards in a fairly cyclic message? I mean, if someone has already tried to do this, I would prefer to just read about it instead of doing it. Just wondering if I am going down a rabbit hole that someone else has already gone down.


I could be mistaken, but I think you are trailblazing here. I'm not aware of any other attempts to explore the wildcards idea. I think I've seen it mentioned before, and someone might've done it privately, but I don't think I've come across any serious research in that direction.

By the way, what are the prime candidates for wildcards in Z340 in your opinion? The most frequent symbols: '+', 'B' and 'p'?
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 12:30 pm

smokie treats wrote:Will masking the most common digraphs, like TH, rather than some of the less common digraphs, make a message more difficult for ZKD to solve? Can you get the same effect with fewer more strategically masked bigrams as compared to more less strategically masked bigrams?


I would say, yes, masking the most common digraphs would make it harder to crack, but probably not hugely. I.e. it won't turn a solvable cipher into an unsolvable one. Auto-solvers use N-gram stats from common English texts, so if you take away the most frequent stats by masking TH, the solver is more likely to get confused, but there will still be enough information in the cipher to get to the solution. For example, if you mask TH in "ITWAS*HEDAY", there are still N-grams before and after '*' that can be used to arrive at the correct solution.

Hmm, this makes me think it is best to place wildcards very evenly throughout the cipher, so it masks/destroys as many higher-order N-grams as possible, so that ZKD/AZD have nothing reliable to use to score the solution. Every 4th or 5th symbol would be ideal. I highly doubt Z would realize that though. I think he would be much more likely preoccupied with masking bigrams, to hide "KILL", etc., so that it won't be solved as easily as Z408.
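
A small sketch of why evenly spaced wildcards hurt more than a single masked spot, assuming the solver scores overlapping 4-grams. It simply counts how many 4-gram windows contain no wildcard; the function names are made up for illustration.

```python
def clean_windows(text, n=4, wildcard="*"):
    """Count the overlapping n-gram windows that contain no wildcard."""
    return sum(wildcard not in text[i:i + n] for i in range(len(text) - n + 1))

def mask_every(text, step, wildcard="*"):
    """Replace every step-th character (the 4th, 8th, ... for step=4) with a wildcard."""
    return "".join(wildcard if (i + 1) % step == 0 else ch
                   for i, ch in enumerate(text))

one_mask = "ITWAS*HEDAY"                  # the example from the post
even_mask = mask_every("ITWASTHEDAY", 4)  # a wildcard at every 4th position

print(one_mask, clean_windows(one_mask))
print(even_mask, clean_windows(even_mask))
```

In the single-mask example, four of the eight windows are still fully readable. With a wildcard at every 4th position, no 4-gram window is left intact, so a 4-gram scorer has nothing clean to work with.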

A while back I actually thought of another way of masking repeating bigrams, that doesn't involve wildcards. What if Z used some of the cipher symbols to represent more than one letter? A bigram perhaps. '+' would *not* be one of them though, as it occurs too often in Z340. Even the most frequent bigram, 'TH', isn't that frequent. Also '+' doubles itself 3 times, and 'THTH' just doesn't happen. But let's focus on wildcards for now. I just wanted to mention this to perhaps research later.
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm

Re: Homophonic substitution

Postby smokie treats » Mon Aug 10, 2015 2:26 pm

Thanks for working on this with me. My updated list of wildcard suspects is:

19 +
26 W
20 B
51 F
5 q

In that order. When I total all of the cycle scores for each symbol when compared to every other symbol (e.g. 1 and 2, 1 and 3 ... 1 and 63), these symbols have the lowest overall total scores in their respective categories, sorted by count from high to low.

Symbol 26 could be a 1:1 substitute or a wildcard because it has very low overall cycle scores and does not cycle at all with any other symbol. After doing the masking exercise, I decided to add it to the list because I cannot assume that a wildcard must be high count. To try to mask even 30 bigrams when trying to get a final bigram repeat count of 47 was a bit challenging. So with 19 having a count of 24, I see that 26 could easily be a wildcard if Zodiac did it this way. In fact, it could be the only other wildcard besides 19.

Symbol 20 does not cycle with another symbol in any remarkable way.

Symbol 51 does cycle with symbol 23 O a bit: 23 23 51 51 23 51 23 51 23 51 23 51 51 51 23 51 23 51 23 23, which is not particularly remarkable given all of the random cycles generated. On the other hand, Zodiac randomized the cycles as well. 51 is close to the bottom of the list.

Symbol 5 only cycles with symbol 29 < about 1/3rd of the way into the message: 5 5 29 5 5 5 5 29 5 29 5 29 5 29 5 29 5, which includes ten consecutive alternations. That can happen about 1-2 times throughout the entire 340 if I scramble the 340. A lot of the cycles get more randomized the further you go down the message. This one is the opposite. I suspect that Zodiac could have done the reverse with only this one cycle. Or it is random. If this isn't an intentional cycle, then the 5 q may be a wildcard. I find it difficult to believe that 5 could be a wildcard, yet just coincidentally also cycles with symbol 29. The double q q could just be randomization. It is at the bottom of the list.
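
One rough way to put a number on sequences like the two above is to count the longest run of consecutive switches between the two symbols. This is only an illustrative measure with made-up function names; it is not the cycle score used for the list, which is not defined in this thread.

```python
def pair_sequence(cipher, a, b):
    """Keep only the occurrences of symbols a and b, in message order."""
    return [s for s in cipher if s in (a, b)]

def longest_alternation(seq):
    """Largest number of consecutive switches between neighbouring items."""
    best = run = 0
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur != prev else 0
        best = max(best, run)
    return best

# The two sequences quoted in the post.
seq_5_29 = [5, 5, 29, 5, 5, 5, 5, 29, 5, 29, 5, 29, 5, 29, 5, 29, 5]
seq_23_51 = [23, 23, 51, 51, 23, 51, 23, 51, 23, 51,
             23, 51, 51, 51, 23, 51, 23, 51, 23, 23]

print(longest_alternation(seq_5_29))
print(longest_alternation(seq_23_51))
```

On the full message, `pair_sequence(cipher, 5, 29)` would produce the first list from a numeric transcription. Running the same measure on shuffled copies of the message would give a baseline for how often a run that long shows up by chance.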

Try +, W and B. See what happens. I am thinking that because of bigram repeat stats, it is likely either + and W or + and B. Likely not all three.

The Tolkien message was more cyclic than the 340, and doesn't have the same symbol count stats. It's not particularly easy to mimic the 340 with this cipher method, but I am going to keep working on it. If I can make a message, distribute the symbols across the key that will closely mimic the 340 symbol count stats, then cycle the message and randomize it to mimic the cycle stats, then mask enough bigrams to mimic the bigram repeat stats, and it cannot be solved, then we may be in business.

I have tried to randomly place all of the 19, 20, 51 and 5's on a perfectly cyclic message and the message won't solve. But I didn't check the bigram repeat stats. I should probably go back and do that to see what they were.
smokie treats
 
Posts: 1620
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 5:30 pm

smokie treats wrote:Try +, W and B. See what happens. I am thinking that because of bigram repeat stats, it is likely either + and W or + and B. Likely not all three.


Ok, I'm running my solver with the following sets of possible wildcards for Z340: {'+','B','W'}, {'+','W'}, {'+','F'} and {'+','B','F'}. It'll take a few hours to get enough restarts to see if it's going anywhere, since I have to test each set separately and it's a slow method. I've already tested {'+','B'} and {'+','B','p'} overnight, since they are the most frequent symbols in Z340. 'q' only occurs twice, so it shouldn't be in the way of solving the cipher even if it's a wildcard. 'W' is borderline, as it occurs 6 times, but mostly at the end of the cipher, and twice at the beginning, so it should leave a big section in the middle perfectly solvable. But I thought I'd test it anyway. I'll let you know if there are any results!
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm

Re: Homophonic substitution

Postby smokie treats » Mon Aug 10, 2015 6:04 pm

daikon wrote:'q' only occurs twice


Symbol 5 for me is the symbol at Column 1, Row 1. I guess it looks more like a backwards P. There is a double "qq" on Row 4, which is part of what made me think that it may be a wildcard, because Zodiac treated it differently.

Sorry about that, but I doubt that q is a wildcard anyway. Let's see what happens with your solver with what we have now and I will be interested in what you tried. Thanks!
smokie treats
 
Posts: 1620
Joined: Thu Feb 19, 2015 1:34 pm
Location: Lawrence, Kansas

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 7:23 pm

smokie treats wrote:
daikon wrote:'q' only occurs twice

Symbol 5 for me is the symbol at Column 1, Row 1. I guess it looks more like a backwards P. There is a double "qq" on Row 4, which is part of what made me think that it may be a wildcard, because Zodiac treated it differently.


Oh, I see, that's 'p' in WebToy transcription. It's the 3rd most frequent symbol in Z340, so I think it should be included in the tests based on that. I already included it in one set I tested: {'+','B','p'}, and I'm planning to test {'+','p'} as well later.

smokie treats wrote:Let's see what happens with your solver with what we have now and I will be interested in what you tried. Thanks!


I think there is no harm in describing it. If you don't want to learn how I cracked the most recent cipher in this thread (starts with "17 25 1 ...") just yet then STOP READING NOW. :) The first idea I had, and it's a bit technical, but I'll try to describe it in the simplest terms. Basically, a "wildcard" means "anything goes here". Solvers generally can't handle that because they use specific 4- or 5-gram stats from a collection of existing books, news articles, usenet postings, wikipedia articles, etc. (usually called a corpus). Let's use 4-grams just for the sake of simplicity. These tables look like this:
...
NTER 2617
ENTS 2616
ECON 2602
NING 2595
COMP 2558
...
For each 4-gram you have the number of times it was found in the corpus. If you divide that number by the total count of all 4-grams in the corpus, you'll get the frequency, or probability, of each 4-gram in the English language. You can use that data to score how closely each proposed solution resembles coherent English text (as opposed to gibberish). The idea is that if we split the solution into overlapping 4-grams, and then add up the frequencies/probabilities for each 4-gram, we'll get an overall score which will be higher for texts that look more like English (i.e. contain frequent 4-grams found in English texts), and lower for gibberish. There is actually a proper mathematical basis for all this, based on statistics and probabilities, but the simplification also makes sense from a purely common-sense point of view.
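
A minimal sketch of this kind of scoring, using only the five counts quoted above. It adds log-probabilities rather than raw frequencies, which is a common way to do it and is an assumption here, not necessarily what the solver does. The corpus total and the penalty for unseen 4-grams are placeholders.

```python
import math

# Counts copied from the excerpt above; a real table has many thousands of rows.
COUNTS = {"NTER": 2617, "ENTS": 2616, "ECON": 2602, "NING": 2595, "COMP": 2558}
TOTAL = sum(COUNTS.values())       # placeholder for the true corpus total
FLOOR = math.log10(0.01 / TOTAL)   # assumed penalty for 4-grams not in the table

def score(text, n=4):
    """Sum log10 probabilities over every overlapping n-gram in text."""
    total = 0.0
    for i in range(len(text) - n + 1):
        count = COUNTS.get(text[i:i + n])
        total += math.log10(count / TOTAL) if count else FLOOR
    return total

print(score("ENTER"))   # contains NTER, so it scores higher
print(score("XQZJK"))   # nothing in the table, so every window gets the floor
```

A solver would call something like `score` on each candidate plaintext and keep the key changes that raise the total.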

The problem with wildcards is that you can not use the 4-gram tables to score the solution, as they contain only actual letters. At first I thought about building new tables, with wildcards included. But then I realized that I can already use the existing tables. You just need to iterate through all possible letter substitutions for each wildcard, and look for the maximum score out of all of them. So, for example, if you need to score "AB*D", where "*" is a wildcard, you substitute it with each letter in the alphabet, and then consult the 4-grams table. Let's say you have these entries (purely fictional) that match "AB*D", and their respective scores:
ABAD 37
ABOD 21
ABID 12
You take the maximum score, which is 37, and as an added bonus, you get the best candidate for the wildcard ('A'). You add the score to the overall score, and that's it! I was super excited about this, but turns out it doesn't work. Why? The 4-grams overlap in the solution. For example, if you have: "ABC*FGH", you need to add up scores for the following 4-grams: "ABC*", "BC*F", "C*FG" and "*FGH". It is very likely that you'll end up with maximum scores that have different letters in the place of the wildcard, and therefore with several best candidates for the exact same wildcard. Which one do you use? I was thinking about figuring out the combined maximum for each letter, but it was already getting quite complicated, and slow... And then it hit me, there is an easier way!
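
The conflict between overlapping windows can be sketched as follows. The first three table rows are the fictional entries from the post; BODE and BADE are extra made-up rows added only so that the two windows disagree. The function names are hypothetical, and `joint_fill` assumes a single wildcard in the text.

```python
import string

# ABAD/ABOD/ABID are the fictional entries from the post; BODE and BADE are
# additional invented rows used only to show the conflict.
TABLE = {"ABAD": 37, "ABOD": 21, "ABID": 12, "BODE": 40, "BADE": 5}

def best_fill(window, wildcard="*"):
    """Try every letter in the wildcard slot; return (best score, best letter)."""
    return max((TABLE.get(window.replace(wildcard, c), 0), c)
               for c in string.ascii_uppercase)

def joint_fill(text, n=4, wildcard="*"):
    """Pick the one letter that maximises the summed score over all windows."""
    def total(c):
        filled = text.replace(wildcard, c)
        return sum(TABLE.get(filled[i:i + n], 0)
                   for i in range(len(filled) - n + 1))
    return max((total(c), c) for c in string.ascii_uppercase)

print(best_fill("AB*D"))    # first window of "AB*DE"
print(best_fill("B*DE"))    # second window of "AB*DE"
print(joint_fill("AB*DE"))  # combined maximum over both windows
```

Scored separately, the first window prefers 'A' and the second prefers 'O', which is exactly the ambiguity described above. The combined maximum resolves it, but it has to be recomputed for every wildcard on every scoring pass, which is where the extra complexity and slowdown come from.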

What I ended up doing was the old idea of replacing each '+' symbol with a new unique symbol. It essentially turns it into a wildcard (i.e. each occurrence of '+' will be its own letter) without the solver knowing anything about wildcards. You just don't stop there and replace the second wildcard with a new set of unique symbols. And the third one, if you have to. Yes, you end up with a very high multiplicity cipher (i.e. many unique symbols). But I thought if you do enough restarts, the solver might converge on the correct solution. Or somewhat correct, as the case may be. And I was right! For the Tolkien cipher, in about an hour, I got 2 top solves that looked remarkably similar. When I compared the parts that were the same in both, I started to recognize English sentences, with a few letters off. For example "HOBBIT", being a fairly rare word, was consistently solved to "HOBBIS". :) You can even try it on your own with ZKD or AZD.
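
The expansion step can be sketched like this, assuming the cipher is a list of integer symbol numbers as in the transcriptions posted in this thread. The function name and the sample cipher are made up for illustration.

```python
def expand_wildcards(cipher, wildcards):
    """Give every occurrence of a suspected wildcard its own fresh symbol."""
    next_symbol = max(cipher) + 1   # first unused symbol number
    expanded = []
    for s in cipher:
        if s in wildcards:
            expanded.append(next_symbol)
            next_symbol += 1
        else:
            expanded.append(s)
    return expanded

# Hypothetical fragment in which 19 is the suspected wildcard.
cipher = [17, 25, 19, 15, 19, 20, 19]
expanded = expand_wildcards(cipher, {19})
print(expanded)
print(len(set(cipher)), "->", len(set(expanded)), "distinct symbols")
```

The expanded list can be fed to an ordinary homophonic solver unchanged. The cost is the higher symbol count, which is why many restarts are needed.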

Let's hope this works for Z340. :)
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm

Re: Homophonic substitution

Postby daikon » Mon Aug 10, 2015 7:29 pm

Forgot to mention that this method also worked for the earlier "purple haze" cipher, when I "expanded" the top 3 out of 4 wildcards (37, 49 and 51). Took about 3 hours to get a solution. So I think if Z340 was indeed encrypted with several wildcards, and we can correctly guess the top 3, we should be able to solve it. Here's hoping, right? :)
daikon
 
Posts: 179
Joined: Thu Jul 02, 2015 7:04 pm
