smokie treats wrote:daikon wrote:'q' only occurs twice
Symbol 5 for me is the Symbol at Column 1, Row 1. I guess it looks more like a backwards P. There is a double "qq" on Row 4, which is part of what made me think that it may be a wildcard. Because Zodiac treated it differently.
Oh, I see, that's 'p' in WebToy transcription. It's the 3rd most frequent symbol in Z340, so I think it should be included in the tests based on that. I already included it in one set I tested: {'+','B','p'}, and I'm planning to test {'+','p'} as well later.
smokie treats wrote:Let's see what happens with your solver with what we have now and I will be interested in what you tried. Thanks!
I think there is no harm in describing it. If you don't want to learn how I cracked the most recent cipher in this thread (starts with "17 25 1 ...") just yet then STOP READING NOW. :) The first idea I had is a bit technical, but I'll try to describe it in the simplest terms. Basically, a "wildcard" means "anything goes here". Solvers generally can't handle that because they use specific 4- or 5-gram stats from a collection of existing books, news articles, Usenet postings, Wikipedia articles, etc. (usually called a corpus). Let's use 4-grams just for the sake of simplicity. These tables look like this:
...
NTER 2617
ENTS 2616
ECON 2602
NING 2595
COMP 2558
...
For each 4-gram you have the number of times it was found in the corpus. If you divide that number by the total count of all 4-grams in the corpus, you'll get the frequency, or probability, of each 4-gram in the English language. You can use that data to score how closely each proposed solution resembles coherent English text (as opposed to gibberish). The idea is that if we split the solution into overlapping 4-grams and then add up the frequencies/probabilities of each one, we'll get an overall score that is higher for texts that look more like English (i.e. contain 4-grams that are frequent in English texts) and lower for gibberish. There is actually a proper mathematical basis for all this, based on statistics and probabilities, but the simplification also makes sense from a purely common-sense point of view.
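To make the scoring idea concrete, here is a minimal Python sketch of the simplified "add up the frequencies" version described above. It is not the code of any actual solver. The five counts are the ones from the table excerpt; using their sum as the total is an assumption that stands in for the real corpus-wide total, and the names four_grams and score are made up for this example.

```python
# Toy 4-gram table: the five counts come from the excerpt above.
# A real table would have many thousands of entries.
COUNTS = {"NTER": 2617, "ENTS": 2616, "ECON": 2602, "NING": 2595, "COMP": 2558}

# Assumption: the sum of the toy table stands in for the corpus-wide total.
TOTAL = sum(COUNTS.values())


def four_grams(text):
    """Return the overlapping 4-grams of text, left to right."""
    return [text[i:i + 4] for i in range(len(text) - 3)]


def score(text):
    """Add up the relative frequency of every overlapping 4-gram in text."""
    return sum(COUNTS.get(gram, 0) / TOTAL for gram in four_grams(text))


print(four_grams("ENTER"))              # ['ENTE', 'NTER']
print(score("ENTER") > score("XQZJK"))  # True
```

With a full table, real solvers typically work with logarithms of the probabilities rather than the raw values, but the ranking idea is the same: English-looking text scores higher than gibberish.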
The problem with wildcards is that you cannot use the 4-gram tables to score the solution, as they contain only actual letters. At first I thought about building new tables, with wildcards included. But then I realized that I can already use the existing tables. You just need to iterate through all possible letter substitutions for each wildcard, and look for the maximum score out of all of them. So, for example, if you need to score "AB*D", where "*" is a wildcard, you substitute it with each letter in the alphabet, and then consult the 4-gram table. Let's say you have these entries (purely fictional) that match "AB*D", and their respective scores:
ABAD 37
ABOD 21
ABID 12
You take the maximum score, which is 37, and as an added bonus, you get the best candidate for the wildcard ('A'). You add the score to the overall score, and that's it! I was super excited about this, but it turns out it doesn't work. Why? The 4-grams overlap in the solution. For example, if you have "ABC*FGH", you need to add up scores for the following 4-grams: "ABC*", "BC*F", "C*FG" and "*FGH". It is very likely that the maximum scores will come from different letters in the place of the wildcard, leaving you with several best candidates for the exact same wildcard. Which one do you use? I was thinking about figuring out the combined maximum for each letter, but it was already getting quite complicated, and slow... And then it hit me, there is an easier way!
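Here is a small Python sketch of that first attempt, and of why it falls apart. The table is purely fictional: the first three entries are the ones from the "AB*D" example, and the other four are invented here just to trigger the conflict. The name best_fill is hypothetical.

```python
import string

# Purely fictional scores. The first three are from the "AB*D" example;
# the rest are invented to show the overlap problem.
TABLE = {
    "ABAD": 37, "ABOD": 21, "ABID": 12,
    "ABCE": 30, "ABCD": 10,
    "BCDF": 25, "BCEF": 5,
}


def best_fill(gram):
    """Try every letter for '*' in a 4-gram with one wildcard.

    Returns (best score, letter that produced it).
    """
    return max(
        (TABLE.get(gram.replace("*", letter), 0), letter)
        for letter in string.ascii_uppercase
    )


print(best_fill("AB*D"))  # (37, 'A')

# Two overlapping 4-grams from "ABC*FGH" share the same wildcard,
# yet each one prefers a different letter for it:
print(best_fill("ABC*"))  # (30, 'E')
print(best_fill("BC*F"))  # (25, 'D')
```

The last two lines are the whole problem in miniature: the same '*' comes out as 'E' in one 4-gram and 'D' in the next, so the per-4-gram maximums don't describe any single consistent plaintext.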
What I ended up doing was the old idea of replacing each '+' symbol with a new unique symbol. It essentially turns it into a wildcard (i.e. each occurrence of '+' will be its own letter) without the solver knowing anything about wildcards. You just don't stop there: you replace the second wildcard with its own new set of unique symbols as well. And the third one, if you have to. Yes, you end up with a very high multiplicity cipher (i.e. many unique symbols). But I thought if you do enough restarts, the solver might converge on the correct solution. Or somewhat correct, as the case may be. And I was right! For the Tolkien cipher, in about an hour, I got 2 top solves that looked remarkably similar. When I compared the parts that were the same in both, I started to recognize English sentences, with a few letters off. For example "HOBBIT", being a fairly rare word, was consistently solved to "HOBBIS". :) You can even try it on your own with ZKD or AZD.
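If you want to prepare a cipher this way yourself, the replacement step could look roughly like the Python sketch below. It assumes the cipher is a list of symbol numbers; the function name split_symbol and the toy sequence are made up for illustration (it is not the cipher from this thread), and the output still has to be fed to a solver separately.

```python
def split_symbol(cipher, symbol):
    """Give every occurrence of symbol its own fresh, unused number."""
    next_id = max(cipher) + 1
    result = []
    for s in cipher:
        if s == symbol:
            result.append(next_id)  # this occurrence becomes a unique symbol
            next_id += 1
        else:
            result.append(s)
    return result


toy = [5, 9, 1, 5, 3, 5, 9]    # made-up sequence, not a real cipher
step1 = split_symbol(toy, 5)   # first suspected wildcard
step2 = split_symbol(step1, 9) # second suspected wildcard

print(step1)  # [10, 9, 1, 11, 3, 12, 9]
print(step2)  # [10, 13, 1, 11, 3, 12, 14]
print(len(set(toy)), len(set(step2)))  # 4 7
```

The last line shows the price you pay: the number of unique symbols goes up with every wildcard you split, which is why many restarts are needed.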
Let's hope this works for Z340. :)