Wednesday, June 11, 2014

Some LREC 2014 ideas

In this continuation of the endless stream of conferences I seem to be having this year, LREC is the second to last before autumn. LREC is one of the biggest conferences in our field, so even though most of the main conference content is pretty basic, it's the only place to distribute language resources and work that is basic engineering and data harvesting, and a lot of collaboration and social networking goes on. I don't have many interesting insights into the conference content: we presented some basic infrastructure work and that simple lexc optimisation hack I thought of years back. We saw a few resources; the Hungarian data is still not available for the most part, Lakota has a lexicon and twol rules and all of 50 words or so, and Trmorph has been developed further.



The main course of the conference was probably just us sitting and writing future proposals, starting new workshops and such; you'll hear about them some time soon. But the most important part was a few days of discussions about the rather under-researched issue of tokenisation (and preprocessing in general). You see, tokenisation is the one part of nearly all processes in computational linguistics. In fact, it's the one unreproducible part that Ted Pedersen (I think) has mentioned in one of his essays about being able to reproduce results. People in statistical language engineering have solved this problem neatly by using their unsupervised trained models for everything. The randomly reproducible standard is whatever the moses scripts do: tokenisers and truecasing, and probably some sentence boundary stuff too. They are the "state of the art", though they don't get used outside statistical machine translation. And it's horrible guesswork for things that are easy. We can tell the full stop of an abbreviation apart from a sentence-ending marker, just as we can tell sentence-initial upper case apart from proper-noun upper case; we have the data to know it. We also know which spaces we can include inside some words as if they weren't there; that's yet another list of things. But mashing all of that together in an orderly fashion seems too hard, so we keep on guessing.
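To make the guesswork concrete, here is a toy sketch of the full-stop problem described above: a naive tokeniser that peels a final full stop off every token unless the token is on a known abbreviation list. The list and function names are my own illustration, not anyone's actual preprocessing script, and the last example shows where the guessing breaks down (an abbreviation that also ends the sentence).

```python
# Toy abbreviation list; a real one would be language-specific and much longer.
KNOWN_ABBREVIATIONS = {"e.g.", "etc.", "Dr.", "prof."}

def naive_tokenise(text):
    """Split on whitespace, then peel a final full stop off every token
    that is not a known abbreviation."""
    tokens = []
    for tok in text.split():
        if tok.endswith(".") and tok not in KNOWN_ABBREVIATIONS:
            tokens.append(tok[:-1])  # the word itself
            tokens.append(".")       # the sentence-ending marker
        else:
            tokens.append(tok)       # abbreviation keeps its full stop
    return tokens

print(naive_tokenise("Ask Dr. Smith etc. Then stop."))
# "al." is not on the list, so its full stop gets peeled off even though
# it is an abbreviation -- and here it ends the sentence as well:
print(naive_tokenise("He cited Smith et al."))
```

The irreversible decision happens inside the loop: once the full stop is either split off or kept, no later component can recover the other reading.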

Most of the problems, I suppose, come from the fact that we always get stuck wanting to handle the rare odd exceptions neatly. So we assume that the words "look up" will form one unit in English, meaning e.g. checking a word in a dictionary, but in a sentence like "look up and not down" it's actually two words, and we'd really want that to work too. But since all our processing is based on a lossy pipeline of components that make irreversible decisions, it's hopeless. Actually the only sane solution lies in not making lossy decisions: just saying that an interpretation is there but unlikely (sounds like scary probabilistics, yikes). Somehow this is bad for many computational linguists, as if it were "avoiding making hard decisions". No it's not; it's just encoding what we really know a bit more precisely, since most of our knowledge is not of the form "this is an impossible interpretation, no exceptions", but rather "this is unlikely, probably so unlikely that some speakers of the language will not understand it, maybe not even with explaining". That's the kind of information I'd rather encode in the intelligent systems we build. So there.
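The non-lossy alternative argued for above can be sketched very minimally: instead of committing to one segmentation, emit every reading with a weight (here a penalty; lower means more likely) and let a downstream component decide. The function, the hard-coded multiword case, and the penalty value are all my own assumptions for illustration, not a real system's behaviour.

```python
def tokenise_with_alternatives(text):
    """Return a list of (tokens, penalty) readings, keeping the unlikely
    reading around instead of discarding it irreversibly."""
    # Reading 1: plain whitespace segmentation, no penalty.
    readings = [(text.split(), 0.0)]
    if "look up" in text:
        # Reading 2: treat "look up" as one multiword unit, with a small
        # penalty because in running text it is usually two words.
        merged = text.replace("look up", "look_up").split()
        readings.append((merged, 1.5))
    return sorted(readings, key=lambda r: r[1])

for tokens, penalty in tokenise_with_alternatives("look up and not down"):
    print(penalty, tokens)
```

Both readings survive the tokenisation step, so later processing (parsing, translation, whatever) can still pick the "two words" reading when the context demands it, which the lossy pipeline cannot do.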

Didn't get to eat whale or puffin this time, but the beers were good and the locals were fun. I saw touristy things, like water in various forms moving in various directions; that's pretty exciting. I heard that there's a festival in Iceland: http://www.eistnaflug.is/. The name must mean something funny.
