Wednesday, 16 April 2014

Unsupervised cycling in Nepal

Last week I attended a conference in Nepal called CICLing. This was my second time at the conference, the previous one being 2012 in Delhi, so I knew roughly what to expect, both of the conference and of the country; Nepal, after all, is not so different from India (though it is! But more on that later). The conference itself is reasonably sized, such that you can meet most of the active participants; there's only one track, and most people will usually follow it or stay in the vicinity. There are lots of social events, and the focus is at least somewhat on linguistics and applications too, not just engineering. It might sound a bit suspicious to spend money on going to faraway places to socialise, but the amount of support for local native languages and the contacts are good, so it's well within reason.

OK, the first impression of the conference is definitely "unsupervised". I think at least 33 % of the presentations and posters contained that word. As such, there's nothing wrong with unsupervised methods: if they do what they promise, it's all fine, and if there's no supervised way to do the same, good. However, that's not always the case. One of the main promises of unsupervised methods is that they are cheap and easy and require no experts. But if you've spent 3 years coming up with an unsupervised method, is that not work, and very expensive expert work at that? But maybe it's language-independent and we get the rest of the world's languages for free? I've rarely seen that to be the case, though; there's always something that a native-speaking informant needs to explain: why are these words being split here, why is that disappearing. For the tasks I'm familiar with, the time an informant spends trying to tell the engineer what's wrong is longer than what it would take for one computational linguist and a native informant to create the same system in a supervised manner, with better quality. For morphology, which is what was presented here, it's especially obvious: unsupervised learning of morphology has not to date reached the quality of rule-based systems. And even so, the output of the unsupervised systems will anyway need to be annotated by a native informant and a linguist, because the morphs themselves don't tell us much; for 99 % of real applications, we need to know whether they are words, present tense 3rd person singular affixes or question clitic particles. Building an unsupervised system for morphology in a few years with a recall of 80 % vs. building a rule-based one in a GSoC period of 3 months with 95 % recall; come on now, really, which one seems more sane?
But unsupervised learning is an interesting question, to paraphrase one answer, and it is indeed; but interesting like solving a hard sudoku puzzle without lookahead, or interesting like trying to throw darts blindfolded with hands tied and still reaching 501 sharp in finite time, not interesting as in useful to the scientific community. That being said, I fully support the goal when it's formulated as a consistency check for the rule-based system. That will be useful indeed, since humans are prone to errors, likely to make wrong generalisations, miss some generalisations, all that; a statistically oriented system for morphology might, as I understand it, reveal these kinds of errors.
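
To make that consistency check a bit more concrete, here's a minimal sketch in Python of what I have in mind. Everything in it is an assumption made up for illustration: the two dictionaries merely stand in for the output of a rule-based analyser and a statistical segmenter, the `+` boundary notation and the function names are hypothetical, and the Finnish toy words are written by hand, not produced by any real system.

```python
def boundaries(segmentation):
    """Return the set of character offsets where a morph boundary falls.

    The '+' boundary notation is an assumption of this sketch.
    """
    offsets = set()
    position = 0
    # Every morph except the last one is followed by a boundary.
    for morph in segmentation.split("+")[:-1]:
        position += len(morph)
        offsets.add(position)
    return offsets


def disagreements(rule_based, statistical):
    """List words where the two systems place morph boundaries differently.

    Both arguments map a surface word to its segmentation string.
    Only words analysed by both systems are compared.
    """
    flagged = []
    for word in sorted(rule_based.keys() & statistical.keys()):
        if boundaries(rule_based[word]) != boundaries(statistical[word]):
            flagged.append((word, rule_based[word], statistical[word]))
    return flagged


# Hand-written toy data (hypothetical, not real system output).
rule_based = {"talossa": "talo+ssa", "taloissa": "talo+i+ssa", "kissa": "kissa"}
statistical = {"talossa": "talo+ssa", "taloissa": "talo+issa", "kissa": "ki+ssa"}

for word, rb, st in disagreements(rule_based, statistical):
    print(f"{word}: rule-based {rb} vs. statistical {st}")
```

Each flagged word is only a prompt for a human to take a look: either system may be the one that's wrong, and it still takes a linguist or a native informant to decide which.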

Not to just bash the unsupervised crowd: another keynote speaker presented slides about a common sense knowledge system. They reminded me of what I'd learnt while studying language technology at the University of Joensuu, some 10 years ago, on a course called History of Language Technology (and if it wasn't in a book called "failed projects", it should've been). This project has all the interesting things there are in language technology. Computers understanding common sense. Logical formulas. Inferences, deductions, networks; it all seems very charming and the way it should be. Then you realise that each of the formulas and networks covers just one very small part of the world, and these are all hand-built formulas, so there must be millions of them, and they are nothing more than English words written in capitals, with the subjects and objects of sentences rearranged to look like mathematical functions, with some quantifiers sprinkled around. And you can collect those for a lifetime to cover 1 % of common sense. And that's quite possibly the main reason engineers want to get rid of linguists, so that nobody would ever start building those systems. Though I don't know if we can learn those unsupervised either.

Among the smaller presentations, there were more than enough good things for me to find the trip worth it: a few uses of morphological segmentation in machine translation, Kazakh spell-checking and a Bishnupriya Manipuri FST system, for example. If I ever have time to follow them up. And some more: chunking and building bilingual dictionaries.

Outside the scientific festivities, Nepal was quite a positive surprise. I expected something like India, but it's actually a nicer version of India, more tourist-friendly and less crowded. There are a bunch of nice bars and shops in the backpacker ghetto, and hawkers and scammers won't bother you after one or two sharp no's. And there was an intercollege music contest which seemed to feature mainly metal, with three headliners. The audience was cool, nice pitting and all. The music wasn't all that bad. We never solved the structure of Vomiting Snakes. And the conference people I met were an unusually nice group for such social events after the official events.

My slides are at the usual place in my GitHub repo, with a questions doge. That wasn't all that successful, but I've already used up the whole spell-checking crud, so let's get to see what the world has to offer.