Friday, December 12, 2014

Making mistakes as an interesting research question in computational linguistics

There's probably no other field of science where making mistakes and doing stupid things can be as profitable as in computational linguistics. Bad tagging in your gold corpora? "It is an interesting research topic how to make use of this data"; don't fix it, just use it. Disambiguation always throws away forms you need? Research question! Pre-processing butchers your data beyond repair? You know what to do: call it future research and publish the rubbish results, never mind that you did the pre-processing yourself and could easily have made it more sensible.

Let's start with what machine translation is about right now. Before a text can be translated it has to be mangled through at least two atrocious processors, just because, and we call it state of the art too: truecasing and tokenising. You know, instead of having access to the text when translating, you'll deal with text where letters have been lowercased and uppercased in seemingly random places, and no, you don't have access to knowledge of which ones, that has already been lost. But wait, there's more: we have also added spaces to quite random spots in the text without telling you. That's for the statistical machine translation; then, after the remains of the string have been translated, moved around, removed and so on, you get to guess where the uppercase letters should be, whether the punctuation is still in the right position, and which spaces to remove. Or you could tokenise things with rules, maybe ignore a whole lot of spaces treating them as unimportant, who knows.
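
To make this concrete, here is a minimal sketch of what a typical lowercase-and-split preprocessing step does to a sentence (my own toy version, not any particular toolkit's actual scripts); once it has run, the original casing and spacing cannot be recovered from the string alone:

import re

def mangle(sentence):
    # Typical SMT-style preprocessing: lowercase everything and put spaces
    # around punctuation. Both steps are irreversible on a plain string.
    lowered = sentence.lower()
    split = re.sub(r"([.,!?;:()\"-])", r" \1 ", lowered)
    return re.sub(r"\s+", " ", split).strip()

print(mangle("This string with Name and e.g., full-stops in it."))
# prints: this string with name and e . g . , full - stops in it .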

This, like most of these problems, is caused by the fact that most systems want to transport data in text files or pipes, so we either have to use an ad hoc mess of ASCII symbols to mark up all these casing operations and splits, or discard the information. What one would really want, though, is to have the data in a sensible data structure that retains the information: a sensible tokenisation and casing of "This string with Name and e.g., full-stops in it." is not "this string with name and e . g . full - stops in it .", it's a Python structure like [(This:this), ' ', string, ' ', with, ' ', Name, ' ', and, ' ', 'e.g.', ',', ' ', full-stops, ' ', in, ' ', it, .], with spaces retained where they are supposed to be between tokens and not where they aren't, and the original casing retained where it was mangled. Should be simple but isn't.
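
A rough sketch of the kind of structure I mean, in Python (the names and the hard-coded abbreviation handling are mine, purely for illustration): each token keeps its original surface form, a normalised form and whether a space followed it, so detokenising back to the exact original string is trivial:

import re

# Word-ish tokens, punctuation, and a hard-coded "e.g." so the example works;
# a real tokeniser would use an abbreviation lexicon here.
TOKEN_RE = re.compile(r"e\.g\.|[\w-]+|[^\w\s]")

def tokenise(text):
    tokens = []
    prev_end = 0
    for m in TOKEN_RE.finditer(text):
        if tokens and m.start() > prev_end:
            tokens[-1]["space_after"] = True   # there was whitespace in between
        tokens.append({"surface": m.group(),
                       "lower": m.group().lower(),
                       "space_after": False})
        prev_end = m.end()
    return tokens

def detokenise(tokens):
    return "".join(t["surface"] + (" " if t["space_after"] else "") for t in tokens)

sentence = "This string with Name and e.g., full-stops in it."
assert detokenise(tokenise(sentence)) == sentence

The MT engine can then work on the lower field while the surface form and the spacing survive untouched for detokenisation.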

Disambiguation is another pet peeve of mine: if time flies like arrows and butterflies like flowers, you cannot have time flies liking arrows anymore, since you decided earlier that flies is a verb and like is a preposition, and there's no changing decisions in this game.
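
The alternative is cheap to sketch: keep all the readings with weights and let later stages re-rank them, instead of committing early and deleting the rest. A toy illustration in Python (the tags and weights are invented for the example, not from any real tagger):

from itertools import product

# Keep every reading with a weight instead of deleting all but one.
# The tags and weights below are invented for illustration only.
ANALYSES = {
    "time":  [("NOUN", 0.7), ("VERB", 0.3)],
    "flies": [("VERB", 0.6), ("NOUN", 0.4)],
    "like":  [("ADP", 0.8), ("VERB", 0.2)],
}

def readings(sentence):
    """Yield every combination of readings with its combined weight, so a
    later stage (parser, translator) can still pick a rarer one if it fits."""
    for combo in product(*(ANALYSES[w] for w in sentence.split())):
        weight = 1.0
        for _, w in combo:
            weight *= w
        yield [tag for tag, _ in combo], weight

for tags, weight in sorted(readings("time flies like"), key=lambda r: -r[1]):
    print(tags, round(weight, 3))
# The NOUN-VERB-ADP reading comes out on top, but the NOUN-NOUN-VERB one
# ("time flies that like something") is still there for anyone who needs it.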

Throwing away information because you cannot come up with an encoding scheme for it is not a good idea. If you want to throw away information because it's too slow to make informed decisions, at least measure the slowness to go along with it. None of the scenarios presented above requires discarding good data on today's computers. Yet we are riddled with it. Oh well.

Friday, October 31, 2014

The folly of reproducing bugs, recreating errors and coming up with something from nothing

The more I dive into the task of high-quality machine translation, the more I get annoyed about the standards by which machine translation is measured. For those who don't know, the quality of machine translation is measured solely by how well the system can recreate a translation made by human translators. Fair enough: if the machine comes up with the same translation as a professional translator, we can say it has done a marvelous job indeed. Except maybe if the translator made a mistake. But that's not all. The job of a translator, at least a professional one who makes high-quality translations, is to translate content for the target audience so they can read it. This can often mean adding new information to fill the audience in on facts that the audience of the source-language text perhaps knows better; a machine coming up with that would be a smart machine indeed, but I don't see that happening. Human translators can also drop a lot of words when the source is too wordy, e.g. certain ways of saying things in English will almost sound like you're explaining things to a child in Finnish if you translate too literally, but again, a machine smart enough to realise that is not what I foresee in the near future. And then there's a lot of rewording: humans will kind of know when you want to translate verbs into nouns and reword the whole sentence because the way of expressing things is kind of weird for this language. Well, a machine that realises this may be plausible; in fact, if we just throw enough data at a statistical system it will realise that the sentence is odd, and may even have seen the rewording.

The reason why I'm writing this is that I finally, for the first time, took a serious look at the data that is used to measure the quality of machine translation, that is, the europarl stuff. The vast majority of it is horrifying, to the extent that I as a human cannot even begin to explain how, given this English sentence, you could come up with anything distantly resembling that Finnish sentence, or even matching, say, half of the words in it. If I cannot explain the translation, it is at least obvious that we cannot build a rule-based system to map between the two. But even with statistics, consider what it would take to learn that, say, if you are talking about TV channels shown in the hotels of members of the European parliament, and specifically a TV channel named FOO, and the English text doesn't give any details about the channel, you are still expected to translate FOO as "FOO, a Dutch TV channel broadcasting mainly news". What kind of statistics would really give you good evidence for doing that is rather a mystery, but not getting it right will probably reduce your score for that translation to 0! While that specific example is hopefully not in europarl, there is a session that is about TV channels and there are a lot of cases like that all over, and indeed machine translation is tasked with finding algorithms that would faithfully reproduce that kind of rewordings and additions. It's a whole lot of nonsense how the systems are evaluated, really.

So, machine translation IMO is never particularly suited for much of the paraphrasing, rewording, adding of information or that sort of task; we should really concentrate on making systems that a) faithfully carry the information across to the reader as it is and only then b) make it sound grammatically correct and colloquial in the target language. Trying to optimise systems to make all these high-quality rephrasings is really a foolish goal, rather than just making sure that the systems are good at not losing any information nor inverting any meanings, which as I see it is the biggest problem with current systems, mainly caused by the fact that they are trying to solve all of this at once. Like, who cares if English is more likely to say "don't forget to frobble" where Finnish speakers would go "remember to frobble"? But with the current scoring system we do get penalised hugely for not getting that right, and perhaps, for the common statistical machine mistranslation "don't remember to frobble", oh, it gives us more points of course! So that's what we're optimising our systems for. Sweet.
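
To make the scoring complaint concrete, here is a toy version of the word-overlap idea behind metrics like BLEU (plain unigram precision, not the real formula): the meaning-inverting output shares more words with the reference than the correct rewording does, so it scores higher:

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference.
    A toy stand-in for overlap-based MT metrics, not the real BLEU."""
    cand = candidate.split()
    ref_counts = {}
    for w in reference.split():
        ref_counts[w] = ref_counts.get(w, 0) + 1
    matches = 0
    for w in cand:
        if ref_counts.get(w, 0) > 0:
            matches += 1
            ref_counts[w] -= 1
    return matches / len(cand)

reference = "don't forget to frobble"
print(unigram_precision("remember to frobble", reference))        # 2/3, about 0.67
print(unigram_precision("don't remember to frobble", reference))  # 3/4 = 0.75
# The output that inverts the meaning overlaps more with the reference words.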

By the way, did I mention that if we scored professional translators by the same measures we use for machine translation, they would usually get scores that we deem so low that they're not worth publishing? Yeah, ain't that a good measure.

Yeah yeah, so this is not specific to machine translation but to all of computational linguistics in its glory; this madness follows us anywhere we go, it does. It is a good thing that we want to systematically measure how well we're doing, instead of just throwing random things at random ad hoc implementations and writing 5-page essays for conferences describing why things got better, but it is indeed an exercise in futility when it turns into trying to recreate bugs or mistakes just because. This is the case with most things, like so-called morphological analysis: there are no good standards or metrics for it, so somebody writes the first "gold standard", which in morphology either means just a dump of some system's output, systematic errors and all, or else having human annotators build the standard, which may be slightly better. Unfortunately the big mistake here is that people doing linguistics don't really understand how things work, e.g., how to actually prove that an analysis is correct with some measurable evidence behind it; the way human annotators work is just by intuition. And so often enough we end up with the enjoyable task of reproducing either the bugs of another system or linguistic intuition.

In conclusion, of all the millions that are spent on computational linguistics, most goes into engineering aimed at faithfully reproducing bugs and mistakes. Your money at work, isn't it.

Monday, October 13, 2014

Tense and time-travelling

Today's episode of The Big Bang Theory reminded me of something I wrote years ago about time-travelling and tenses in human languages (I wonder if it's still available somewhere; I had the beginnings of a list of a thousand and one useful verb forms for time-travellers), which in turn also relates to how misguided even most school books about human languages are. One popular form of this is people stating whether there's a future tense in your language or not; as readers of Language Log are aware, even some respectable publications fell for that fake story about how having or lacking a future tense makes a language's speakers economically worse off. The scene in TBBT is about explaining the time-travel paradox of Back to the Future with reference to alternate timelines, and it is clearly obvious that English has the necessary tense structures for expressing combinations like past future perfects (or whatever it's called, I'll look it up once the script or subs of the episode are online) as understandably as the present, future, past or pluperfect. So what's the point of school grammar teaching that English has a future tense and Finnish hasn't? There isn't: Finnish has plenty of auxiliary verbs as capable as English will of being explicit about the future; not calling it a future tense is just good for silly false anecdotes about languages. And, by the way, the idea of discussing tenses useful for time travelling is ripped from The Hitchhiker's Guide to the Galaxy.

Wednesday, September 17, 2014

Translation machines


In machine translation there's usually some sort of competition or disagreement going on between statistical and knowledge-based approaches. Statistics is based on the logic that if you have a lot of good-quality material and you push it through the machine, it creates even better-quality material from the best bits and makes it more usable. Like a meat grinder. However, it rarely happens that the source material is actually as good quality as anyone would assume. The primary source for machine translation is europarl, a corpus of European parliament sessions transcribed and translated, and a system trained on it can merely recall what it's seen there; statistical machine translation, you see, cannot make very good creative decisions, it can in fact only remake the same decisions it has seen in the translations fed to it. So when you give such a machine translation system text that is not a European parliament session, it usually fails to make good connections. Much like McDonald's meat substitute.

In knowledge-based, linguistic or rule-based machine translation there's the opposite problem. The people working on knowledge are mainly interested in looking at a very small amount of data: the gold, the crown jewels of the language, are what is interesting, neglecting the uninteresting words and stuff that make up 95 % of good machine translation. A further problem is that their jewels aren't truly precious but fakes, that is, the interestingness is based on old misclassifications that they hold on to, to make problems more interesting, such as fake ambiguity created by wrong classifications. So they end up working like these guys from South Park.



Monday, June 23, 2014

To end the science festival season: EAMT 2014

With absolutely overwhelmingly too many conferences and too little to blog about, EAMT was my first MT conference. We were in Dubrovnik, a nice touristy city, too much seafood, quite a bit of rain. Anyways, I would've thought something like EAMT would be relatively large, given that things like EACL are, but it was a single-track event with a manageable number of people. Our PR team already wrote about the official side of things in the CNGL blog post on EAMT 2014 (how weird is it to have someone working on that kind of stuff in academia, eh?). The thing I said there that I think is indeed one of the main goals for my research: the end-user applications we know are quite doable, and have seemed like science fiction since the 50s, but we still aren't there and the reason isn't quite clear. We have our machine translation chains, with or without analysers, OCR, speech recognition, TTS; there's no reason we shouldn't have mobiles that can read your foreign food menus and translate them, listen to the server and interpret the whole interaction between you–we have all this, and in reasonable quality as well, it's not sci-fi anymore. I know I travel a lot, and while I make an effort to learn the basics of languages* as I go, I would much enjoy such an app. So we'll try to have one. Now the problem with this is that all machine translation systems are built to understand and translate only documents of the European parliament, so they fail when they see non-prose texts like food menus. Even though translating menus and ingredients would be easier. And interaction with waiters too. No sentence structures, just words and some phrases you know. Should think about building for that, hmm.

Wednesday, June 11, 2014

Some LREC 2014 ideas

In this continuation of the endless stream of conferences I seem to be having this year, LREC is second to last before the autumn. LREC is one of the biggest in our field, so even though most of the main conference stuff is pretty basic, it's the only place to distribute language resources and things that are basic engineering and data harvesting; yeah, a lot of collaboration and social networking goes on. I don't have many interesting insights about the conference content; we presented some basic infra work and that simple lexc optimisation hack I thought of years back. We saw a few resources: Hungarian data is still not available for the most part, Lakota has a lexicon and twol rules covering all of 50 words or so. Trmorph has been developed further.

Sunday, May 11, 2014

Bad and worse code in scientific programming

I was just reading a couple of articles about software engineering: The Low Quality of Scientific Code and Why Bad Scientific Code Beats Code Following Best Practices. Some of you may remember me from such presentations as FSCONS 2013, where I talked about the very same thing. I very much agree with the first text of course: most code I have to look at is rather dreadful, and it is very much a surprise it ever works. And one of the things I've learnt after moving to new projects from HFST (and apertium) is that, while there's a lot of bad code in there, it's actually still among the better ones, with all the floss software engineering conventions that actually got implemented and that at least someone every once in a while cares to follow. Take, for example, my trying to learn statistical machine translation: the most prominent project in the field is called moses. No tests, cannot be installed, consists mostly of kludgy scripts that only work occasionally, often by a side effect of some other script run before in the same directory. When you want to use moses from another project, you make a note of where you unpacked its source and hope that everyone who uses it will have the same random scripts in the same places, in one of the billion script directories there are. The thing is, it's not much harder to actually do things properly; you don't need to hire a software engineer to understand that there should be a test that runs your program, gives it some input, checks it produces the expected output and doesn't crash, and all simple things like that. It's not an intellectually difficult thing to grasp, and the implementation doesn't take more than a few minutes, which is then saved in every update and debugging cycle. So yeah, I don't have much to rant about now, the blogs already said the things.
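
For what it's worth, the kind of test I mean fits in a dozen lines; a minimal sketch, where mytagger stands in for whatever command-line tool you happen to be building:

# A minimal smoke test: run the tool on known input and check the output.
# "mytagger" is a stand-in name, not a real program; adjust to your own tool.
import subprocess

def test_known_input_gives_expected_output():
    result = subprocess.run(
        ["mytagger", "--lang", "fin"],
        input="kissa istuu matolla\n",
        capture_output=True, text=True, timeout=60)
    assert result.returncode == 0, result.stderr
    assert "kissa" in result.stdout  # the tool should at least echo the tokens back

if __name__ == "__main__":
    test_known_input_gives_expected_output()
    print("ok")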

Saturday, May 3, 2014

Some EACL ideas

So I went to EACL in Gothenburg without having much of anything to present there. One of the things that I usually do when I have to fly around the world with long layovers is update my compling projects. Maybe it’s inspired by Norvig’s spell-checker, who knows. This time I of course spent the time on making apertium-fin-swe. The array of apertium-based Finnish translators is starting to shape up nicely, it is. This invokes the traditional problem that there’s no reuse of code in NLP, and there are no standards for what the analyses are.

Wednesday, April 16, 2014

Unsupervised cycling in Nepal

Last week I attended a conference in Nepal called CICLing. This is my second time at the conference, the last being 2012 in Delhi, so I knew somewhat what to expect, both in terms of the conference and the country; Nepal, after all, is not so different from India (though it is! But more of that later). The conference itself is reasonably sized, such that you can meet most of the active participants; there's only one track and most people will usually follow it or stay in the vicinity. Lots of social events, and the focus is at least somewhat on linguistics and applications too, not just engineering. It might sound a bit suspicious to spend money going to faraway places to socialise, but the amount of support for local native languages and contacts is good, so it's well within reason.

Ok, the first impression of the conference is definitely unsupervised. I think at least 33 % of the presentations and posters must have contained that word. As such, there's nothing wrong with unsupervised methods: if they do what they promise, it's all fine, and if there's no supervised way to do the same, good. However, that's not always the case. One of the main promises of unsupervised methods is that they are cheap and easy and require no experts. But if you've spent 3 years coming up with unsupervised methods, is that not work, and very expensive expert work at that? But maybe it's language-independent and we get the rest of the world's languages for free? I've rarely seen that to be the case, though; there's always something that a native-speaking informant needs to explain: why are these words being split here, why is that disappearing. For tasks that I'm familiar with, the time spent by an informant trying to tell the engineer what's wrong is longer than what it would take for one computational linguist and a native informant to create the same system in a supervised manner with better quality. For morphology, which is what was presented here, it's especially obvious: unsupervised learning of morphology has not to date reached the quality of rule-based systems. And even so, the unsupervised systems will anyway need to be annotated by a native informant and a linguist, because the morphs themselves don't tell us much; for 99 % of the real applications, we need to know if they are words, present tense 3rd person singular affixes or question clitic particles. Building unsupervised systems for morphology over a few years with a recall of 80 % vs. building one in a GSoC period of 3 months with 95 % recall: come on now really, which one seems more sane? But unsupervised learning is an interesting question, to paraphrase one answer, and it is indeed, but interesting like solving a hard sudoku puzzle without lookahead, or interesting like trying to throw darts blindfolded with hands tied and still reaching 501 sharp, in finite time; not interesting as in useful to the scientific community. That being said, I fully support the goal when it's formulated as a consistency check for a rule-based system; that will be useful indeed, since humans are prone to errors, likely to make wrong generalisations, miss some generalisations, all that, and a statistically oriented system for morphology might, AIUI, reveal those kinds of errors.

Not to just bash the unsupervised crowd: another keynote speaker presented slides about a common sense knowledge system. They reminded me of what I'd learnt while studying language technology at the University of Joensuu–some 10 years ago–on a course called History of Language Technology (and if it wasn't in a book called failed projects, it should've been). This project has all the interesting things there are in language technology. Computers understanding common sense. Logical formulas. Inferences, deductions, networks; it all seems very charming and the way things should be. Then you realise that each of the formulas and networks covers just one very small part of the world, and these are all hand-built formulas, so there must be millions of them, and they are nothing other than English words written in capitals, with the subjects and objects of sentences rearranged to look like mathematical functions, with some quantifiers sprinkled around. And you can collect those for a lifetime to cover 1 % of common sense. And that's quite possibly the main reason engineers want to get rid of linguists, so that nobody would ever start building those systems. Though I don't know if we can learn those unsupervised either.
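
For the flavour of it, here is a toy of what such a hand-built knowledge base looks like (my own invented facts, not the keynote's actual system): every fact is basically an English sentence rearranged into a tuple, and every tiny corner of the world needs its own one, plus its exceptions:

# Hand-written "common sense" facts, English words rearranged into tuples.
FACTS = {
    ("IsA", "penguin", "bird"),
    ("CapableOf", "bird", "fly"),
    ("NotCapableOf", "penguin", "fly"),   # ...and an exception for every rule
}

def capable_of(thing, action):
    # Exceptions first, then direct facts, then inherit along IsA links.
    if ("NotCapableOf", thing, action) in FACTS:
        return False
    if ("CapableOf", thing, action) in FACTS:
        return True
    for rel, sub, sup in FACTS:
        if rel == "IsA" and sub == thing:
            return capable_of(sup, action)
    return False

print(capable_of("bird", "fly"))     # True
print(capable_of("penguin", "fly"))  # False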

Among the smaller presentations, there were enough good things for me to find the trip worth it. A few uses of morphological segmentation in machine translation, Kazakh spell-checking and a Bishnupriya Manipuri FST system, for example. If I ever have time to follow them up. And some more: chunking and bilingual dictionary building.

Outside the scientific festivals, Nepal was quite a positive surprise. I expected something like India, but it's actually a nicer version of India, more tourist-friendly and less crowded. A bunch of nice bars and shops in the backpacker ghetto, and hawkers and scammers won't bother you after one or two sharp no's. And there was an intercollege music contest which seemed to feature mainly metal, with three headliners. The audience was cool, nice pitting and all. The music wasn't all that bad. We never solved the structure of Vomiting Snakes. And an unusually nice group of conference people to meet at such social events after the official ones.

My slides are at the usual place in my github repo, with a questions doge. That wasn't all that successful, but I've already used up the whole spell-checking crud, so let's see what the world has to offer.

Monday, March 24, 2014

The first weeks in Ireland

This blog is a continuation of my hugely successful squiggly colorful underlinings blog that I had at the University of Helsinki. I've now got my PhD and finished drawing red underlines, not to mention that the University of Helsinki will more than likely remove my access now that I don't live there anymore. Ok, so this first blog post is not going to contain science, much linguistics or anything, so you can expect to be able to skip it. Just a rant about the good and the bad in moving from (a University job in) Finland to (a University job in) Ireland. I'll resume the normal order once I get back to doing science, not to worry.