Friday, December 12, 2014

Making mistakes as an interesting research question in computational linguistics

There's probably no other field of science where making mistakes and doing stupid things can be as profitable as in computational linguistics. Bad tagging in your gold corpora? "It is an interesting research topic how to make use of this data"; don't fix it, just use it. Disambiguation always throws away forms you need? Research question! Pre-processing butchers your data beyond repair? You know what to do: call it future research and publish rubbish results, never mind that you did the pre-processing yourself and could easily have made it more sensible.

Let's start with what machine translation is about right now. Before a text can be translated, it has to be mangled through at least two atrocious processors, just because, and we call it state of the art too: truecasing and tokenising. You know, instead of having access to the text when translating, you'll deal with text where random letters have been lowercased and uppercased, and no, you don't have access to the knowledge of which ones; that information has already been lost. But wait, there's more: we have also added spaces to quite random spots in the text without telling you. That's for the statistical machine translation; then, after the remains of the string have been translated, moved around, removed and so on, you get to guess where the uppercase letters should be, whether the punctuation is still in the right position, and which spaces to remove. Or you could tokenise with rules, maybe ignoring a whole lot of spaces and treating them as unimportant, who knows.
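To make that concrete, here is a minimal sketch (my own illustration, not any particular MT toolkit) of the kind of lowercase-and-split preprocessing described above; once it has run, the original casing and spacing simply cannot be recovered from the output alone:

```python
import re

def lossy_preprocess(text):
    text = text.lower()                                 # which letters were upper-case is lost
    text = re.sub(r"([.,!?;:()\"'-])", r" \1 ", text)   # which spaces were really there is lost
    return text.split()

print(lossy_preprocess("This string with Name and e.g., full-stops in it."))
# ['this', 'string', 'with', 'name', 'and', 'e', '.', 'g', '.', ',',
#  'full', '-', 'stops', 'in', 'it', '.']
```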

This, like all these problems, is caused by the fact that most systems want to transport data in text files or pipes, so we either have to use some ad hoc ASCII-symbol mess to mark up all these casing operations and splits, or discard the information. What one would really want, though, is to have the data in a sensible data structure that retains the information: a sensible tokenisation and casing of "This string with Name and e.g., full-stops in it." is not "this string with name and e . g . full - stops in it .", it's a Python structure along the lines of [(This:this), ' ', string, ' ', with, ' ', Name, ' ', and, 'e.g.', ',', ' ', full-stops, ' ', in, ' ', it, '.'], with the spaces retained exactly where they were between tokens and nowhere else, and the original casing kept alongside any normalised form. Should be simple, but isn't.
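For what it's worth, here is a rough sketch in plain Python of what such a structure could look like. The Token type, the regular expression and the function names are my own illustration, not an existing library; the point is only that detokenisation gives the original string back exactly:

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    surface: str   # exactly as it appeared in the input, casing intact
    norm: str      # the normalised form the downstream system wants

def tokenise(text):
    """Split into tokens and whitespace runs without losing anything."""
    parts = []
    # abbreviations like "e.g.", hyphenated words, plain words,
    # whitespace runs, or single punctuation marks
    pattern = r"\w+(?:\.\w+)+\.?|\w+(?:-\w+)*|\s+|[^\w\s]"
    for m in re.finditer(pattern, text):
        s = m.group(0)
        parts.append(s if s.isspace() else Token(s, s.lower()))
    return parts

def detokenise(parts):
    """Exact inverse: original spacing and casing come back untouched."""
    return "".join(p if isinstance(p, str) else p.surface for p in parts)

s = "This string with Name and e.g., full-stops in it."
assert detokenise(tokenise(s)) == s
```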

Disambiguation is just another pet peeve of mine: if time flies like arrows and butterflies like flowers, you cannot have time flies liking arrows anymore, since you decided earlier that flies is a verb and like is a preposition, and there's no changing decisions in this game.
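As a toy illustration (the analyses below are hypothetical, not the output of any real tagger), keeping every reading around rather than committing early is as simple as storing a list per token; later components can rerank, but nothing has been deleted:

```python
# hypothetical morphological analyses; in a real system these would come
# from an analyser, not be typed in by hand
analyses = {
    "time":   [("time", "NOUN"), ("time", "VERB")],
    "flies":  [("fly", "VERB"), ("fly", "NOUN")],
    "like":   [("like", "ADP"), ("like", "VERB")],
    "arrows": [("arrow", "NOUN")],
}

sentence = ["time", "flies", "like", "arrows"]

# a destructive disambiguator keeps only the top analysis per word:
destructive = {w: analyses[w][:1] for w in sentence}

# a non-destructive one carries the whole set forward, so the
# "time flies (insects) like (enjoy) arrows" reading is still reachable:
non_destructive = {w: analyses[w] for w in sentence}

print(destructive["flies"])      # [('fly', 'VERB')]  -- the noun reading is gone
print(non_destructive["flies"])  # [('fly', 'VERB'), ('fly', 'NOUN')]
```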

Throwing away information because you cannot come up with an encoding scheme for it is not a good idea. If you want to throw away information because making informed decisions is too slow, at least measure that slowness and report it alongside the results. None of the scenarios presented above requires discarding good data on today's computers. Yet we are riddled with it. Oh well.