
Friday, 12 December 2014

Making mistakes as an interesting research question in computational linguistics

There's probably no other field of science where making mistakes and doing stupid things can be as profitable as in computational linguistics. Bad tagging in your gold corpora? "It is an interesting research topic how to make use of this data", so don't fix it, just use it. Disambiguation always throws away forms you need? Research question! Pre-processing butchers your data beyond repair? You know what to do: call it future research and publish rubbish results, no matter that you did the pre-processing yourself and could easily make it more sensible.

Let's start with what machine translation is about right now. Before a text can be translated it has to be mangled through at least two atrocious processors, just because, and we call it state of the art too: truecasing and tokenising. You know, instead of having access to the original text when translating, you deal with text where seemingly random letters have been lowercased and uppercased, and no, you don't have access to the knowledge of which ones; that has already been lost. But wait, there's more: we have also added spaces at quite random spots in the text without telling you. That's for statistical machine translation; then, after the remains of the string have been translated, moved around, removed and so on, you get to guess where the uppercase letters should be, whether the punctuation is still in the right position, and which spaces to remove. Or you could tokenise with rules, maybe ignoring a whole lot of spaces and treating them as unimportant, who knows.
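To make the complaint concrete, here is a minimal Python sketch of the kind of lowercase-and-split preprocessing described above. The function and its punctuation rules are illustrative stand-ins, not any particular toolkit, but they show how the round trip back to the original string becomes impossible once casing and spacing have been thrown away.

```python
# A minimal sketch of why the usual lowercase-and-tokenise preprocessing is lossy.
# The function name and the naive rules are illustrative, not any real toolkit.

def naive_preprocess(text):
    """Lowercase everything and put spaces around punctuation, SMT-style."""
    out = []
    for ch in text.lower():
        if ch in ".,!?\"'-":
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out).split()

original = "This string with Name and e.g., full-stops in it."
tokens = naive_preprocess(original)
print(tokens)
# ['this', 'string', 'with', 'name', 'and', 'e', '.', 'g', '.', ',',
#  'full', '-', 'stops', 'in', 'it', '.']

# Joining the tokens back cannot restore the casing of "This" and "Name",
# nor tell "e.g." apart from sentence-final full stops: that information
# is already gone by the time translation and detokenisation happen.
print(" ".join(tokens) == original)   # False
```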

This, like all such problems, is caused by the fact that most systems want to transport data in text files or pipes, so we either have to use an ad hoc ASCII symbol mess to mark up all these casing operations and splits, or discard the information. What one would really want is to have the data in a sensible data structure that retains the information: a sensible tokenisation and casing of "This string with Name and e.g., full-stops in it." is not "this string with name and e . g . full - stops in it .", it is a Python structure along the lines of [(This:this), ' ', string, ' ', with, ' ', Name, ' ', and, 'e.g.', ',', ' ', full-stops, ' ', in, ' ', it, '.'], with spaces retained where they belong between tokens and not where they don't, and the original casing retained wherever it gets mangled. It should be simple, but it isn't.
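As a sketch of what such a structure could look like (the Token class, its field names and the regular expression are my own illustration, not an existing library), something like this keeps the surface form, the normalised form and the following whitespace together, so the exact original string can always be rebuilt:

```python
# A sketch of a lossless token structure: every token carries its original
# casing and the whitespace that followed it, so detokenisation is exact.
from dataclasses import dataclass
import re

@dataclass
class Token:
    surface: str      # text exactly as it appeared, e.g. "This"
    normalized: str   # what the model gets to see, e.g. "this"
    space_after: str  # whitespace that followed the token, possibly ""

# Word (optionally with internal dots or hyphens, e.g. "e.g.", "full-stops"),
# or a single punctuation mark, or a run of whitespace.
TOKEN_RE = re.compile(r"\w+(?:[.\-]\w+)+\.?|\w+|[^\w\s]|\s+")

def tokenise(text):
    # Assumes the text does not start with whitespace; a sketch, not a product.
    tokens, pending = [], None
    for piece in TOKEN_RE.findall(text):
        if piece.isspace():
            if pending is not None:
                pending.space_after = piece
        else:
            pending = Token(surface=piece, normalized=piece.lower(), space_after="")
            tokens.append(pending)
    return tokens

def detokenise(tokens):
    return "".join(t.surface + t.space_after for t in tokens)

text = "This string with Name and e.g., full-stops in it."
toks = tokenise(text)
assert detokenise(toks) == text        # nothing has been thrown away
print([(t.normalized, t.surface) for t in toks])
```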

Disambiguation is just another pet peeve of mine: if time flies like arrows and butterflies like flowers, you cannot have time flies liking arrows any more once you have decided earlier that flies is a verb and like is a preposition, and there's no changing decisions in this game.
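A sketch of the alternative: keep every analysis of every token and let a later component choose, instead of committing early. The toy analyses and tags below are made up for illustration, not output from any real tagger.

```python
# Deferring disambiguation: every reading is kept, nothing is discarded yet.
toy_analyses = {
    "time":   [("time", "NOUN"), ("time", "VERB")],
    "flies":  [("fly", "NOUN+PL"), ("fly", "VERB+3SG")],
    "like":   [("like", "PREP"), ("like", "VERB")],
    "arrows": [("arrow", "NOUN+PL")],
}

def analyse(sentence):
    """Return every reading of every token; disambiguation happens later."""
    return [(word, toy_analyses.get(word, [(word, "UNK")]))
            for word in sentence.split()]

for word, readings in analyse("time flies like arrows"):
    print(word, readings)

# A later syntactic or semantic component can still choose between
# "time/NOUN flies/VERB like/PREP arrows" and "time flies/NOUN like/VERB arrows";
# if the tagger had already forced flies=VERB and like=PREP, the second
# reading would be unrecoverable.
```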

Throwing away information because you cannot come up with an encoding scheme for it is not a good idea. If you want to throw away information because it is too slow to make informed decisions, at least measure the slowness to go along with it. None of the scenarios presented above requires discarding good data on today's computers. Yet we are riddled with it. Oh well.

Wednesday, 11 June 2014

Some LREC 2014 ideas

In this continuation of the endless stream of conferences I seem to be having this year, LREC is the second to last before the autumn. LREC is one of the biggest conferences in our field, so even though most of the main conference content is pretty basic, it is the only place to distribute language resources and work that is basic engineering and data harvesting, and a lot of collaboration and social networking goes on. I don't have many interesting insights into the conference content; we presented some basic infrastructure work and that simple lexc optimisation hack I thought of years back. We saw a few resources: the Hungarian data is still not available for the most part, Lakota has a lexicon and twol rules covering all of 50 words or so, and Trmorph has been developed further.