Linear A is a script discovered around 1900 by Arthur Evans, and thought to have been used in ancient Crete over three and a half thousand years ago, before the invasion of the Mycenaeans. Although a similar form of writing (Linear B) was deciphered in 1952, what makes Linear A so interesting is that nobody has yet been able to work out exactly which language it records (although various people have made suggestions).
I first came across Linear A through my classics teacher at school, but having re-read Simon Singh's "The Code Book" (a great read for a wide range of people), I eventually got around to having a closer look at it.
Now, I'm no expert linguist (I only just passed my Classical Greek GCSE), but I do have some experience with various techniques from computational linguistics, so I thought I would see what I could make of it. Free time being in short supply, I haven't had a really close look yet, but I thought I would still post this as I find it remarkably interesting (which is saying something, considering I dropped every language I had studied at school as soon as I started my A-levels).
I soon came across John Younger's wonderful online source of Linear A texts, and copied them into a format that I could easily run through my own software for analysis. I soon found several problems, though:
- There are very few repeated (complete) words
- There are over 63 common letters (each letter corresponds to a syllable), which means you need far more text to get meaningful statistics than you would in English, which has only 26
- Most of the surviving texts are accounting documents, consisting of a person's name, a symbol representing a type of good, and a number - so there is very little ordinary language use.
So far I have focused on trying to extend the known texts with a couple of letters either side (where the text is missing, or cannot be read with certainty). Initially I worked out the most common letters (unigrams). Here is the list of the number of times I have counted each letter in the text I have analysed (I only looked at texts that had more than one letter together):
| Count | Letter (using the notation of J. Younger) |
| --- | --- |
| 166 | A |
| 147 | JA |
| 141 | NA |
| 137 | KU |
| 132 | I |
| 123 | TA |
| 121 | SA |
| 114 | SI |
| 112 | DA |
| 104 | MA |
| 102 | KA |
| 101 | RE |
| 100 | RA |
| 96 | TI |
| 96 | KI |
| 91 | DI |
| 88 | TE |
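A minimal sketch of how counts like these could be produced, assuming each word is stored as a hyphen-separated string of signs (for example "KU-RO"). The function name `count_signs` and the sample words are made up for illustration; they are not the software or the data behind the table above.

```python
from collections import Counter

def count_signs(words, min_len=2):
    """Count syllabic signs in hyphen-separated words such as "KU-RO".

    Words with fewer than min_len signs are skipped, mirroring the rule
    of only looking at texts with more than one letter together.
    """
    counts = Counter()
    for word in words:
        signs = [s for s in word.split("-") if s]
        if len(signs) >= min_len:
            counts.update(signs)
    return counts

# Toy input for illustration only, not real corpus data.
sample_words = ["KU-RO", "KU-RA", "A", "DA-KU-NA"]
sign_counts = count_signs(sample_words)

# Print in the same "count, letter" order as the table.
for sign, n in sign_counts.most_common():
    print(n, sign)
```

The lone "A" is ignored here because it is a single-sign entry; set `min_len=1` to count those as well.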
A standard way to continue, when you find a letter in the original text that cannot be read with certainty, might be to work down the list and pick the most commonly used letter that still seems possible. However, this is a very basic way to approach the task.
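That basic approach can be sketched in a few lines. The counts below are copied from the top of the table; the function name `rank_candidates` and the example set of plausible readings are assumptions made purely for illustration.

```python
# Counts copied from the first rows of the table above.
SIGN_COUNTS = {"A": 166, "JA": 147, "NA": 141, "KU": 137, "I": 132, "TA": 123}

def rank_candidates(candidates, counts=SIGN_COUNTS):
    """Order the plausible readings of a damaged sign, most frequent first.

    Signs missing from the counts are treated as having a count of zero.
    """
    return sorted(candidates, key=lambda s: counts.get(s, 0), reverse=True)

# Hypothetical case: the damaged sign could be read as TA, NA or I.
ranked = rank_candidates({"TA", "NA", "I"})
print(ranked)  # ['NA', 'I', 'TA']
```

This ignores the neighbouring letters entirely, which is exactly why it is only a starting point.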
It would then seem to make sense to look at bigrams, and to see (given the previous or next letter) which letter occurs most often. However, you can instantly see a potential problem with this. With only 166 occurrences of the most common letter and around 63 letters that could follow it, each pair would only be expected to occur two or three times (166 / 63 ≈ 2.6) if every letter were equally likely to follow the common letter. Another problem is shown by looking at the most common pair of letters. I found that the most common pair was "KU-RO", occurring 39 times, but it turns out that KU-RO is believed to be a word that translates as "Total". Its count therefore reflects one frequent accounting word rather than a general pattern in the language, so this turns out to be an ineffective way of looking at the text.
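To make the sparsity argument concrete, here is a rough sketch that counts adjacent pairs of signs and works out the expected count per pair under the "every letter equally likely" assumption. The hyphen-separated input format, the function name `count_bigrams`, and the sample words are illustrative assumptions, not real corpus data.

```python
from collections import Counter

def count_bigrams(words):
    """Count adjacent sign pairs within each hyphen-separated word."""
    counts = Counter()
    for word in words:
        signs = word.split("-")
        # zip pairs each sign with the one that follows it in the same word.
        counts.update(zip(signs, signs[1:]))
    return counts

# Toy input for illustration only.
sample_words = ["KU-RO", "KU-RO", "DA-KU-RO", "KU-RA"]
bigram_counts = count_bigrams(sample_words)
print(bigram_counts.most_common(1))  # [(('KU', 'RO'), 3)]

# Expected count per pair if each of 63 signs were equally likely
# to follow the most common sign (166 occurrences).
expected_per_pair = 166 / 63
print(round(expected_per_pair, 1))  # 2.6
```

With expected counts this small, a single very common word is enough to dominate the bigram table.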
Having said all of this, I have found that looking at Linear A is a highly addictive hobby, and I expect I will spend many more days looking at it over the coming years - and since new texts are still being found, the stats we can get from looking at the text will keep getting better.