Friday, 31 October 2014

The folly of reproducing bugs, recreating errors and coming up with something from nothing

The more I dive into the task of high-quality machine translation, the more annoyed I get about the standards by which machine translation is measured. For those who don't know: the quality of machine translation is measured solely by how well the system can recreate a translation made by human translators. Fair enough, if the machine comes up with the same translation as a professional translator, we can say it has done a marvellous job indeed. Except, maybe, when the translator made a mistake. But that's not all. The job of a translator, at least a professional one who produces high-quality translations, is to translate content for the target audience so they can actually read it. This often means adding new information, filling the audience in on facts that the readers of the source-language text perhaps know better; a machine that comes up with that will be a smart machine indeed, but I don't see that happening. Human translators also drop a lot of words when the source is too wordy, e.g. certain ways of saying things in English will sound almost like you're explaining things to a child in Finnish if you translate them too literally, but again, a machine smart enough to realise that is not something I foresee in the near future. And then there's a lot of rewording: humans just know when to translate verbs into nouns and reword the whole sentence because the original way of expressing things is weird for the target language. A machine that realises this may be plausible; in fact, if we throw enough data at a statistical system it may realise that the sentence is odd, and may even have seen the rewording before.

The reason I'm writing this is that I finally, for the first time, took a serious look at the data that is used to measure the quality of machine translation, that is, the Europarl stuff. The vast majority of it is horrifying, to the extent that I as a human cannot even begin to explain how, given this English sentence, you could come up with anything distantly resembling that Finnish sentence, or even matching, say, half of the words in it. If I cannot explain the translation, it is at least obvious that we cannot build a rule-based system to map between the two. But even with statistics: say you are talking about TV channels shown in the hotels of members of the European Parliament, specifically a channel named FOO, and the English text gives no details about the channel, yet you are expected to translate FOO as "FOO, a Dutch TV channel broadcasting mainly news". What kind of statistics would really give you good evidence for doing that is rather unclear, but not getting it right will probably reduce your score for that translation to zero! That specific example is hopefully not in Europarl, but there is a session about TV channels and there are cases like that all over, and machine translation research is indeed tasked with finding algorithms that would faithfully reproduce that kind of rewording and addition. It's a whole lot of nonsense how the systems are evaluated, really.

So, machine translation IMO is never going to be particularly suited for paraphrasing, rewording, adding information or that sort of task; instead we should really concentrate on making systems that a) faithfully carry the information across to the reader as it is, and only then b) make it sound grammatically correct and colloquial in the target language. Trying to optimise systems to produce all these high-quality rephrasings is a foolish goal, compared to just making sure that the systems are good at not losing any information and not inverting any meanings, which is the biggest problem I see with current systems, mainly caused by the fact that they try to solve everything at once. Like, who cares whether English is more likely to say "don't forget to frobble" where a Finnish speaker would go "remember to frobble"? But with the current scoring system we get penalised hugely for not getting that right, while the common statistical mistranslation "don't remember to frobble", which inverts the meaning, oh, it gives us more points of course! So that's what we're optimising our systems for. Sweet.
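To make the frobble point concrete, here is a minimal sketch of a BLEU-style score: clipped n-gram precision with a brevity penalty, cut down to bigrams so four-word toy sentences don't all collapse to zero. This is my own toy code and my own tokenisation, not any real evaluation toolkit, but the shape of the arithmetic is the same, and the sentences are the ones from the paragraph above.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped)."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def toy_bleu(hyp, ref, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    brevity = 1.0 if len(hyp) >= len(ref) else exp(1.0 - len(ref) / len(hyp))
    return brevity * exp(sum(log(p) for p in precisions) / max_n)

reference = "don't forget to frobble".split()
reworded  = "remember to frobble".split()        # correct meaning, different wording
inverted  = "don't remember to frobble".split()  # inverted meaning, closer wording

print("reworded:", round(toy_bleu(reworded, reference), 3))  # ~0.414
print("inverted:", round(toy_bleu(inverted, reference), 3))  # ~0.5
```

The meaning-inverting output wins purely on surface overlap (and on escaping the brevity penalty), which is exactly the behaviour the current scoring rewards.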

By the way, did I mention that if we scored professional translators by the same measures we use for machine translation, they would usually get scores we deem so low as to be not worth publishing? Yeah, ain't that a good measure.
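That claim is easy to sanity-check with a toy example. The two sentences below are hypothetical English renderings of the same instruction, invented for illustration rather than taken from any corpus; neither is wrong, yet they share only about a third of their vocabulary, and that lexical gap is all a reference-overlap metric sees.

```python
# Two hypothetical, equally valid renderings of the same instruction.
# A word-overlap metric treats one as the gold standard and grades
# the other against it, so a perfectly good translation scores badly.
a = "remember to switch off the lights when you leave".split()
b = "when leaving please turn the lights off".split()

shared = set(a) & set(b)
jaccard = len(shared) / len(set(a) | set(b))

print("shared words:", sorted(shared))        # ['lights', 'off', 'the', 'when']
print("vocabulary overlap:", round(jaccard, 2))  # 0.33
```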

Yeah yeah, so this is not specific to machine translation but to all of computational linguistics in its glory; this madness follows us wherever we go. It is a good thing that we want to systematically measure how well we're doing, instead of just throwing random things at random ad hoc implementations and writing five-page essays for conferences describing why things got better, but it becomes an exercise in futility when the goal drifts towards recreating bugs or mistakes just because they are in the reference. This is the case with most things, like so-called morphological analysis: there are no good standards or metrics for it, so whoever writes the first "gold standard" sets it. In morphology that will either mean just a dump of some system's output, systematic errors and all, or having human annotators build the standard, which may be slightly better. Unfortunately, the big mistake here is that people doing linguistics don't really understand how to actually prove that an analysis is correct and back it with measurable evidence; human annotators work largely by intuition. And so we end up, often enough, with the enjoyable task of reproducing either another system's bugs or someone's linguistic intuition.

In conclusion: of all the millions spent on computational linguistics, most goes into engineering aimed at faithfully reproducing bugs and mistakes. Your money at work, isn't it.
