IS THE VM ENCODED? (and how to solve it in that case)

Jan B. Hurych


So far, in trying to solve the mystery of the VM, we have not paid much attention to the intentions of the author. Was he really planning to write the VM in such a way that nobody at all could read it except himself? He could have been, but I doubt it: too much effort was put into it, and vanity alone could not justify such a luxury. For the same reason, as a private diary or workbook it would have required excessive effort and expense, considering that after the author's death it would serve no purpose whatsoever, since nobody could read it anyway. On the other hand, we can reverse the question and ask: was he planning for it to be read eventually - meaning decoded - and how? This is more likely, even if we assume that the superficial secrecy was intentional.



We know the VM is written in an unknown script, which was most likely invented and engineered in such a way that its symbols are, in terms of writing effort, without question some of the simplest alphabetical symbols there are. The symbols are built from several sections (segments) that are connected in a very elegant and simple way; it almost looks as if the author studied the script beforehand and fully developed it before writing the VM. Simple evidence for this is that the text contains no apparent writing mistakes and the script shows no apparent further development over the course of the book. That simplicity comes at a price, of course: the individual letters must be written separately, to avoid the ambiguity that would arise if letters with similar sections were joined together. All that effort also suggests the author meant the VM to be read without any errors or problems.

Any additional problems, that is. Since the script is unknown, it works basically like a monoalphabetic cipher: find the meaning of the symbols and you can read the text. For that purpose the EVA transcription (and later others) was made. Of course, the letters substituted for the symbols were chosen arbitrarily, so the "encipherment" is still present. Several years ago, I compiled letter frequency statistics for the EVA transcription and got a letter distribution that was very close to that of medieval Latin. Nothing seemed easier now than to "translate" the transcription into plaintext (provided it was indeed written in Latin). But . . .

But the resulting conversion did not give me a sensible plaintext: just a few recognizable short Latin words or fragments of words. It was disappointing but not surprising, since:

a) it might not have been in Latin after all
b) the letter frequency order was difficult to establish for letters that have very close frequencies, and switching just two neighbors in that order was enough to corrupt the words completely
c) some additional processing of the plaintext (cipher, grille, anagramming, etc.) was possibly applied.
Of course, the above possibilities may combine.
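
The letter frequency approach and its weakness in point b) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names, the toy text and the three-letter "target orders" are invented here and are not real EVA or Latin statistics.

```python
from collections import Counter

def rank_letters(text):
    """Return the letters of `text`, most frequent first."""
    counts = Counter(ch for ch in text if ch.isalpha())
    # Ties are broken alphabetically so that the order is reproducible.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, _ in ranked]

def substitute_by_rank(transcript, target_order):
    """Map the i-th most frequent transcript letter to target_order[i]."""
    mapping = dict(zip(rank_letters(transcript), target_order))
    # Spaces, and letters without a partner in target_order, stay unchanged.
    return "".join(mapping.get(ch, ch) for ch in transcript)

toy = "xxxyyz zyx"  # invented text: x occurs 4 times, y 3 times, z twice

# Hypothetical target order "e, i, t" (an assumption, not measured data).
print(substitute_by_rank(toy, "eit"))  # eeeiit tie

# Same text with the first two neighbors of the order switched.
print(substitute_by_rank(toy, "iet"))  # iiieet tei
```

Switching just two neighbors in the assumed order changes most letters of the output, which is why letters with nearly equal frequencies make the method so fragile.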



Recently, two more possibilities occurred to me: the VM might have been written in code or in an artificial language. When the simple substitution cipher was improved in the Middle Ages into the nomenclator (where some important words are replaced by codewords instead), the list of codewords kept growing until it became a whole codebook. The unknown language of the VM already acts as a code in itself - the codebook is the vocabulary of that language (be it a natural or an artificial language).
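
To make the idea of a nomenclator concrete, here is a minimal sketch, assuming a tiny invented codebook and a trivial shifted alphabet standing in for the substitution cipher. None of the names or values come from the VM or from any historical nomenclator.

```python
import string

def make_nomenclator(codebook, cipher_alphabet):
    """Return an encoder: codebook words become codewords,
    every other word is enciphered letter by letter."""
    letter_map = str.maketrans(string.ascii_lowercase, cipher_alphabet)

    def encode(plaintext):
        out = []
        for word in plaintext.lower().split():
            if word in codebook:
                out.append(codebook[word])            # whole-word code
            else:
                out.append(word.translate(letter_map))  # simple substitution
        return " ".join(out)

    return encode

# Hypothetical codebook and a shift-by-one alphabet (assumptions for the demo).
codebook = {"rex": "17", "roma": "42"}
shifted = "bcdefghijklmnopqrstuvwxyza"

encode = make_nomenclator(codebook, shifted)
print(encode("rex venit roma"))  # 17 wfoju 42
```

The more words move from the letter cipher into the codebook, the closer the system comes to a pure code.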

If the VM does indeed use codewords, that also suggests a possible method of solution: instead of letter frequency, we use word frequency in our cracking method. By frequency I do not, of course, mean the frequency of words of the same length; that would be a pointless exercise. Instead, we count the words that are exactly the same, that is, words that have the same letters in the same order. The "top" words (those with the highest frequency) could then be replaced by the "top" words of the language in question.
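
Counting identical words and pairing the "top" words of both sides could look roughly like this. It is a sketch only: the sample line is made up in EVA-like spelling and is not a real page, and the two Latin words are an assumed, not a measured, frequency list.

```python
from collections import Counter

def top_words(text, n):
    """Return the n most frequent words; words count as equal
    only if they have the same letters in the same order."""
    counts = Counter(text.lower().split())
    # Ties are broken alphabetically so that the order is reproducible.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def build_word_map(transcript, plain_top_words):
    """Pair the i-th most frequent transcript word with plain_top_words[i]."""
    cipher_top = top_words(transcript, len(plain_top_words))
    return dict(zip(cipher_top, plain_top_words))

# Made-up sample, not taken from any transcription file.
sample = "daiin chol daiin shol chol daiin"

# Hypothetical "top" words of the assumed plaintext language.
word_map = build_word_map(sample, ["et", "in"])
print(word_map)  # {'daiin': 'et', 'chol': 'in'}
```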

Such a "cradle" could then be guess-filled, similar to the way cryptanalysts guess the missing letters in a partly deciphered cryptogram. The (code)words in the VM and the plaintext words need not have the same length; it is the meaning that counts. Of course, the word frequency method is even less accurate than the letter frequency method (which is itself accurate only for the "top" letters anyway), but if the VM is encoded, there is no way around it. Using codewords is of course similar to using an artificial language, except that it might be easier if the codewords stand for a natural language, since we then have something to compare with. For an artificial language or nonsensical codes, we still have one possibility: texts with similar content would use codes with similar frequency and/or similar meaning.
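
A partly filled "cradle" might be produced as follows. This is a sketch under assumptions: the word map and the sample line are invented for illustration, and unknown codewords are simply kept in brackets as gaps to be guessed later.

```python
def apply_cradle(transcript, word_map):
    """Replace known codewords by their guessed meaning and
    keep unknown ones in brackets for later guessing."""
    out = []
    for word in transcript.split():
        if word in word_map:
            out.append(word_map[word])
        else:
            # The codeword is kept as-is, since its length says
            # nothing about the length of the plaintext word.
            out.append("[" + word + "]")
    return " ".join(out)

# Hypothetical mapping and sample line (assumptions, not real data).
word_map = {"daiin": "et", "chol": "in"}
print(apply_cradle("daiin shol chol", word_map))  # et [shol] in
```

Each new guess is added to the map and the text is run through again, so the gaps shrink step by step.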

One question remains: if the author used codewords, how big must the codebook have been? If he indeed intended the VM to be "readable", then again he most probably used only a nomenclator: that is, only some codewords, with the rest simply enciphered by a monoalphabetic cipher . . .

11th February 2010.