THE VM RESEARCH: The new philosophy and new methods.

Jan. B. Hurych


From the very beginning, VM research split into several branches: historical, linguistic, pictorial, herbal, astrological, text solving and several others. Some of these were determined by the simple fact that the VM contains several sections. Pretty soon another split happened, this time among those attempting to solve the text: some believed it was written entirely in a natural language, while others went further and suspected some kind of encoding. Both camps of course needed a VM transcript, and over the years several transcripts were produced, bringing a new problem: which transcript should be used?



1) The first group looks for a natural language that fits the unknown text, which is of course complicated since the script - the alphabet - is unknown as well. In fact, an unknown script is a kind of encoding in itself, something like a mono-alphabetic substitution cipher. To solve it, we need to assign to the transcribed text (say in the Latin alphabet) the corresponding plaintext characters. This is usually done "manually", which is a rather lengthy and cumbersome process. In basic cryptology we use letter-frequency tables to great advantage - provided, of course, that we know what language was used, which for the VM we do not. Since the frequency tables differ from language to language, we would have to try different tables in order to discover any recognizable words in that particular language.
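
Below is a minimal sketch (in Python) of what such a frequency comparison might look like. The language tables, file names and numbers are purely illustrative placeholders, not real linguistic data; and since a mono-alphabetic substitution preserves only the shape of the frequency distribution, not the letters themselves, the sketch compares sorted frequency profiles only.

# A minimal, illustrative sketch: compare the letter-frequency profile of a
# transcribed text against reference tables for several candidate languages.
# The reference numbers below are placeholders, not real linguistic data.
from collections import Counter

REFERENCE = {
    "latin":   {"e": 0.115, "i": 0.105, "a": 0.085, "u": 0.080, "t": 0.075},
    "italian": {"e": 0.118, "a": 0.117, "i": 0.101, "o": 0.098, "n": 0.069},
    "german":  {"e": 0.164, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}

def frequency_profile(text):
    """Relative frequency of each transcription character."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = max(len(letters), 1)
    return {c: n / total for c, n in counts.items()}

def distance(profile, reference):
    """Compare the sorted frequency shapes (substitution hides the letters,
    but not the distribution), via a sum of squared differences."""
    observed = sorted(profile.values(), reverse=True)[:len(reference)]
    expected = sorted(reference.values(), reverse=True)
    return sum((o - e) ** 2 for o, e in zip(observed, expected))

def rank_languages(transcript):
    """Order the candidate languages from best to worst frequency match."""
    profile = frequency_profile(transcript)
    return sorted((distance(profile, ref), lang) for lang, ref in REFERENCE.items())

# Usage: rank_languages(open("folio_transcript.txt").read())
# The lowest distance only suggests which table to try first; it proves
# nothing about the underlying language.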

Of course we have to choose a language first and see what we get. Even so, some character frequencies are so close together that we have to use the "guess and try" method, judging from the sense of the partially decrypted sentences in the given language in order to find the correct letters. The other critical point is the size of the sample. Military experts believe that a sample longer than 500 words is usually large enough, provided the decryptor is skilled in the given language. The VM is long enough that the sample size should not be a problem - again, provided we know the language. So researchers pick a language first and resort to finding individual words, following some hints, questionable statistics or similar leads. So far no language has been found and we do not yet have any reliable conversion. If an artificial language was used, there is actually a very slim chance the VM will ever be translated.

2) Those who believe the VM is further encoded (that is, on top of the unknown script) have it difficult as well. Even if they arrive at the solution, they might not know it until they cross-check it with several languages and hit the right one. No wonder some, in desperation, proclaimed that the VM text could be just encoded gibberish. That is of course a dead end: their decoded gibberish may still be the wrong gibberish and we have no means of finding out whether it is :-).

There are of course many ways to encode a plaintext, and while many methods have been tried, there may be still more to come. If we leave steganography aside for simplicity, the general idea of encryption is to replace the plaintext with some other text, either by encoding it:

a) via a codebook, where the code-words have no mutual relationship (by algorithm, formula or procedure) with the original text, since they are assigned only arbitrarily in the codebook. Just as with an artificial language where we have no vocabulary, here we have no codebook, and the decoding is pure guesswork with no practical use.
b) via a cipher, which is usually given by a formula, algorithm, grille or some other rule of conversion. Again, not knowing the language of the plaintext (and its vocabulary), the result of applying any rule cannot be properly verified. We can of course try to cross-check the resulting plaintext against several languages, so the process is rather the opposite of the method used in (1); a minimal sketch follows below. The resulting plaintext should then be really plain and clear, with not too many ambiguities - otherwise the solution could not be valid.
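
Just to make the cross-check in (b) concrete, here is a minimal sketch; the candidate key, the wordlists and the sample words are all hypothetical placeholders, not an actual proposed solution.

# A minimal, hypothetical sketch: apply one candidate conversion rule (here a
# simple substitution key stands in for whatever rule is being tested) and
# cross-check the output against small wordlists of several languages.

CANDIDATE_KEY = {"o": "a", "k": "t", "e": "e", "d": "s", "y": "m"}  # made up

WORDLISTS = {
    "latin":   {"et", "est", "aqua", "stella", "herba"},
    "italian": {"e", "acqua", "stella", "erba", "sole"},
}

def apply_rule(word, key=CANDIDATE_KEY):
    """Convert one transcribed word; unknown characters stay visible as '?'."""
    return "".join(key.get(c, "?") for c in word)

def crosscheck(transcribed_words):
    """Count, per language, how many converted words appear in its wordlist."""
    converted = [apply_rule(w) for w in transcribed_words]
    return {lang: sum(w in words for w in converted)
            for lang, words in WORDLISTS.items()}

# Usage: crosscheck("okedy okeey daiin".split())
# A handful of accidental hits proves nothing; only a consistently clear
# plaintext with few ambiguities would count as evidence.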

In both cases above (a and b), several famous cryptographers, notably military experts, tried their art on the VM and failed. The reason is obvious: while the unknown script can simply be replaced by a transcript, the unknown language remains the main obstacle. In other words, the unknown script works like a cipher and the unknown language like a codebook - unfortunately both at the same time. With military ciphers, the language and the alphabet are usually known; there are only a few possibilities in wartime, if we put aside some exceptions (say the Navajo code talkers of WW II, truly a great idea). Also, in war only a handful of cipher methods are in use at any time, unless a completely new cipher is invented. Even a codebook can be partly assembled from intercepted messages, since they mostly use military terms, and codebooks can even be captured with the help of various agents. No such help is available for the VM.

Needless to say, we expected great help from historical research. The main candidate, Latin, was however found not to be the plaintext language - certainly not behind something as simple as a mono-alphabetic substitution cipher. Transposition ciphers were never seriously considered or tested, due to the sheer number of possible combinations. The help provided by the pictures of plants, or by those in the astronomical section, has not yet been sufficiently exploited. As for the author, we have nothing but rumor.



The main problem apparently lies in the VM research philosophy. While we clearly have a case of three unknowns (script, language and possible coding), we usually have to freeze two of the variables and do the research in one domain only. Such research is of course isolated, limited and also impractical. Studying the behavior of the language is not the easy part, and statistics may be helpful, but so far they have created more questions than answers. Of course, we can guess some vowels, a few characters and maybe some suffixes, but not the entire alphabet or grammar. Surprisingly, when we apply the results from one folio to another, we immediately hit snags. When that happens, we usually do not check whether our results might satisfy some other hypothesis. Instead, we try only a few languages and stop there, instead of considering that there may be some additional encoding after all.

With the researcher-decryptors, similar failures happen. While group (1) works mainly with vocabulary and grammar, here we first choose the encoding method and try to apply it. If we fail, we blame it on the fact that we do not have the proper language. Some even go for a more complicated encoding, a grille or even double encoding - all that without going back and finishing the work on the lower level.

All this said, I do not intend to raise any criticism here: we are facing the most complicated and enigmatic case ever attempted. We simply do not know any better than to use the old, proven methods. What we should blame is of course our research philosophy - it simply does not fit our problem. And what about our methods - they do not even follow the existing philosophy. And yet another question pops up: what have we really learned from other sciences, from their progress, from their new methods?

The most powerful tool is actually feedback. We all use it - so to speak - but not in a continual mode. Our advance comes in leaps: ahead, back and ahead again, rather inconsistently and mostly only when we hit a snag. Our error criteria are usually not very strict and even rather vague. And yet analog and later digital computers made their progress mainly because of continuous feedback, iteration and loop operations - the well known secrets behind any simulation. They enable us to solve problems faster and better. Simply said, we should not only mechanize the solving process but also optimize it.
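
To make the idea of a continuous feedback loop concrete, here is a minimal sketch: an iterative search that keeps proposing small changes to a candidate substitution key and accepts a change only when a scoring function says the result improved. The scoring function below is a deliberately crude placeholder; in real use it would measure how language-like the output is for some chosen language.

# A minimal sketch of a feedback loop (simple hill climbing), not a VM solver:
# propose a small change to the candidate key, keep it only if the score improves.
import random
import string

def score(text):
    """Placeholder fitness function; a real one would use n-gram statistics
    of the assumed plaintext language."""
    common = set("etaoinshrdlu")
    return sum(c in common for c in text) / max(len(text), 1)

def decode(ciphertext, key):
    """Apply a substitution key (dict: cipher letter -> plain letter)."""
    return "".join(key.get(c, c) for c in ciphertext)

def hill_climb(ciphertext, iterations=10000):
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.shuffle(shuffled)
    key = dict(zip(letters, shuffled))
    best = score(decode(ciphertext, key))
    for _ in range(iterations):
        a, b = random.sample(letters, 2)      # propose: swap two mappings
        key[a], key[b] = key[b], key[a]
        trial = score(decode(ciphertext, key))
        if trial >= best:                     # feedback: keep improvements
            best = trial
        else:                                 # otherwise undo the change
            key[a], key[b] = key[b], key[a]
    return key, best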

That brings us to the inevitable stage: the use of computers. So far they have been utilized mostly for statistical evaluation of the VM text - various measurements and counts, comparisons, sorting and searching. We can also try to computerize the decoding, say by trying different cipher-solvers, but that apparently did not work. Can we, for instance, use reverse engineering to reconstruct the plaintext? Well, that is what we are doing when we search for suffixes and whatnot, but apparently the existing methods are not up to our task. Is there something else we missed?

Yes, there is: we missed the learning process. True, we are learning along the way, but that is not a purposeful, true learning process. Is our every next step based exclusively on what we learned before? Can we iterate our investigations so that they converge to something better, more enriched, more positive? We can, but again, "manually" it takes time, is not fast enough and so is not very efficient. We can mechanize our processes, but without learning we cannot optimize them. Is there anything that can provide the learning stage we are missing right now?



Yes, there is: artificial neural networks (NN). They are now commonly used as advanced tools for solving processes that have no known mathematical, logical or other formulas or procedures - simply because we do not know them well enough, do not understand them, or because they are functions of many variables. Now, isn't our VM a typical case of such a problem? All we can get is the organized output data, the transcript - and that is essentially what neural networks need! They adapt to the problem and correct themselves accordingly, all via the process of learning and testing, in iterative cycles. They can map a complicated input-output function so well that they eventually simulate the whole searched-for process. Such a "trained" network can then be used for extrapolations, forecasts, estimates, simulations, planning or other tasks.

Can we use neural networks for cracking the VM? Well, they do a good job in linguistics (translation, syntax, grammatical rules, speech recognition, etc.) as well as in cryptography (deciphering, code breaking, new encoding methods, etc.). Of course, they are also used extensively elsewhere: statistics, stock prediction, weather forecasting, process control, medical diagnosis and whatnot.

Neural networks are generally based on independent units - neurons - with many inputs and usually one output, performing various mathematical or statistical functions on the combination of input levels. They have basically two modes of learning: supervised (both inputs and outputs are provided) and unsupervised (only the output data are provided, as is our case with the VM). They lend themselves to parallel processing and they can use various combinations of internal functions to do the job.
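
As an illustration of the basic building block only (not of any VM-specific architecture), here is a minimal sketch of one such neuron with several inputs and one output, trained in supervised mode on a made-up toy task.

# A minimal sketch of a single neuron: many inputs, one output, weights
# adjusted by supervised learning on a toy example (all data are made up).
import math
import random

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def train(samples, epochs=1000, rate=0.5):
    """Supervised learning: both inputs and desired outputs are provided."""
    n = len(samples[0][0])
    weights = [random.uniform(-1, 1) for _ in range(n)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = neuron(inputs, weights, bias)
            error = target - out                   # feedback signal
            grad = error * out * (1.0 - out)       # sigmoid derivative
            weights = [w + rate * grad * x for w, x in zip(weights, inputs)]
            bias += rate * grad
    return weights, bias

# Toy task: the desired output simply copies the first input.
samples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]
w, b = train(samples)
print([round(neuron(x, w, b), 2) for x, _ in samples])  # should approach 1, 0, 1, 0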

Do they have disadvantages? Yes: they may not always be accurate, or the learning may take a long time. They cannot solve certain problems and they are usually not transparent - that is, they give us the solution but we cannot really know how it was reached. The solution is locked in the trained memory, and if the NN is retrained for something else, the solution is lost. These problems can of course be avoided by a proper statement of the problem as well as by choosing the proper kind of neural network.



Now the crucial question: can they help with our VM problems? Especially when we have the output (the transcript) but no input (the plaintext)? Yes, but hardly in the way they are used now in other fields of application. How could we go about it? Say we take the transcript of one folio and, as the input, provide some estimated data and let the NN try to learn the relationship. Then we apply the learned function to another folio and see what we get. If it does not give us anything similar - I use that term loosely - to the first input, our guess was probably wrong. Not much of a result, right, but at least it will eliminate the wishful thinking that so often leads the research completely astray. Yes, we have to play the "what-if" game. Is it possible that we may train NNs so they can simulate the unknown language and grammar for the group (1) research, or the unknown encryption process for the group (2) research? I believe it is. We might even be able to find out whether the input is only gibberish (group 3 :-). But I believe what we will find is that the input has some strange - but still valid - grammatical or cryptographical rules. All that can now be done much faster than "manually", and it may provide us with results that can be further evaluated or used, instead of depending only on our - sometimes faulty - judgment.
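
Purely as a thought experiment, here is a rough sketch of that cross-folio "what-if" check. Everything in it is a placeholder: the "learned relationship" is reduced to a character-bigram frequency map standing in for a trained network, and the similarity measure and threshold are arbitrary; a real attempt would need an actual trainable model and the estimated input data discussed above.

# A rough, hypothetical sketch of the cross-folio "what-if" check:
# learn something from folio A, apply it to folio B, and see whether
# the result still looks "similar" (used loosely, as in the text above).
from collections import Counter

def learn_model(folio_text):
    """Stand-in for the training step: character-bigram frequencies of folio A."""
    bigrams = Counter(folio_text[i:i + 2] for i in range(len(folio_text) - 1))
    total = max(sum(bigrams.values()), 1)
    return {bg: n / total for bg, n in bigrams.items()}

def similarity(folio_text, model):
    """Stand-in for the test step: the share of folio B's bigrams
    that the learned model has seen at all."""
    bigrams = [folio_text[i:i + 2] for i in range(len(folio_text) - 1)]
    return sum(bg in model for bg in bigrams) / max(len(bigrams), 1)

def what_if(folio_a, folio_b, threshold=0.8):
    """If folio B scores low against what was learned from folio A,
    the guess behind the model was probably wrong."""
    model = learn_model(folio_a)
    score = similarity(folio_b, model)
    return score, score >= threshold

# Usage (hypothetical transcript files):
# score, plausible = what_if(open("folio_1r.txt").read(), open("folio_2r.txt").read())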

As you can see, I do not have any definite method for how to do it. It may or may not work. We will first have to study the ways experts use NNs to solve their own problems, be they linguistic, translational, cryptographic or statistical. Of course, our problems are not quite like theirs, but maybe we will overcome some of our difficulties, and the results may even inspire us further and enable the discovery of the true solution . . .

4th February, 2008.