Saturday, 3 May 2014

Some EACL ideas

So I went to EACL in Gothenburg without having much of anything to present there. One of the things I usually do when I have to fly around the world with long layovers is update my compling projects. Maybe it’s inspired by Norvig’s spell-checker, who knows. This time I of course spent the time working on apertium-fin-swe. The array of Apertium-based Finnish translators is starting to shape up nicely. This brings up the traditional problem that there’s no reuse of code in NLP, and that there are no standards for what the analyses should look like.

When translating Finnish to Swedish you see the same problems as with Finnish to English. A word-form like talossa (in house, formed as talo+ssa) needs its suffix translated as a preposition before the noun; in Swedish, i hus. That’s not a problem. But when you get isossa vihreässä talossa (in big green house, from iso+ssa vihreä+ssä talo+ssa) and use the same logic, oh, you get "in big in green in house" instead. Here’s what you need to do then: clump together the words that carry the same suffix, such as ssa/ssä in Finnish, and translate the suffix once for the whole clump. Linguists call this syntactic parsing or phrase detection or whatnot; we usually say chunking when we don't have a good theory underneath. Chunks are also not the same in every application and every translation. In this case my colleague implemented them for Finnish to Swedish, and I'd need to copy-paste them into my Finnish to English. But pasting is evil! There must be a better way.
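To make the clumping idea concrete, here is a minimal sketch in Python. It is not how Apertium does chunking; the tiny lexicon, the case label "inessive" and all function names are made up for illustration, and it assumes the words have already been analysed into (lemma, case) pairs.

```python
from itertools import groupby

# Toy data, hypothetical: a three-word lexicon and one case-to-preposition rule.
LEXICON = {"iso": "big", "vihreä": "green", "talo": "house"}
CASE_PREPOSITION = {"inessive": "in"}


def translate_word_by_word(analyses):
    """Naive approach: emit the preposition for every single word."""
    out = []
    for lemma, case in analyses:
        preposition = CASE_PREPOSITION.get(case)
        if preposition:
            out.append(preposition)
        out.append(LEXICON[lemma])
    return " ".join(out)


def chunk_by_case(analyses):
    """Clump consecutive (lemma, case) pairs that share the same case."""
    return [
        (case, [lemma for lemma, _ in group])
        for case, group in groupby(analyses, key=lambda pair: pair[1])
    ]


def translate_chunked(analyses):
    """Chunked approach: emit the preposition once per clump."""
    out = []
    for case, lemmas in chunk_by_case(analyses):
        preposition = CASE_PREPOSITION.get(case)
        if preposition:
            out.append(preposition)
        out.extend(LEXICON[lemma] for lemma in lemmas)
    return " ".join(out)


# isossa vihreässä talossa = iso+ssa vihreä+ssä talo+ssa
phrase = [("iso", "inessive"), ("vihreä", "inessive"), ("talo", "inessive")]
print(translate_word_by_word(phrase))  # in big in green in house
print(translate_chunked(phrase))       # in big green house
```

Grouping purely on "same case, next to each other" is of course too crude for real text, since two separate phrases in the same case can stand side by side, which is part of why chunks end up different for every application.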

Another thing in all this language stuff that was a popular topic at EACL is the analyses that language systems produce, the things we are somehow supposed to use for some purpose even though no one has thought about the usage. The things that say that house is a noun (N or +N or [N] or /NNZ or ...). This is one of my favorite gripes in linguistics: both the fact that these "analyses" are encoded in ad hoc strings, and that they are not proper analyses, in that they don’t form proper orthogonal, separate classes of words and word-forms, and they don’t describe atomic features of the words or parts of words they refer to. Even basic things like parts of speech describe some "generalisations" like "this word may be used before other words to describe them, or it may be comparable, or it may be something different, but it also may not be any of these" (adjective); that's not a useful piece of information. A useful piece of information is whether it is a comparable word, or whether it can be used before another word to describe it. Nevertheless, the prominent project for standardising stuff here is https://code.google.com/p/universal-pos-tags/; it may not be perfect, but maybe following it will help anyway. Another level of analysis is at https://code.google.com/p/uni-dep-tb/.
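As a rough illustration of what following a shared tag set could look like, here is a small Python sketch that folds a few of those ad hoc noun markers into one universal tag. The mapping table and the function name are my own assumptions, not anything shipped by the universal-pos-tags project; only the target names NOUN, VERB, ADJ, ADV and the catch-all X come from that tag set.

```python
import re

# Hypothetical mapping from bare, system-specific tags to universal POS tags.
BARE_TO_UNIVERSAL = {"N": "NOUN", "V": "VERB", "A": "ADJ", "ADV": "ADV"}


def normalise_tag(raw):
    """Strip one layer of ad hoc decoration (+N, [N], <N>, /N) and look the tag up.

    Anything not in the table falls back to the universal catch-all tag "X".
    """
    bare = re.sub(r"^[+/\[<]|[\]>]$", "", raw.strip())
    return BARE_TO_UNIVERSAL.get(bare.upper(), "X")


for raw in ["N", "+N", "[N]", "/NNZ"]:
    print(raw, "->", normalise_tag(raw))
```

This only solves the string encoding half of the gripe; it does nothing about whether NOUN or ADJ is a sensible class in the first place.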

Thirdly, and partially unrelatedly, I’ve been thinking about what we’re supposed to do at this point of university life, and I’d like to do something that’s actually useful and interesting and that I want to have my name attached to. So I’ve been gathering interest and feedback on forming a workshop on Finnish (or Uralic) Computational Linguistics, and on different ways to structure and maintain uniformity and avoid overlapping work in language technology resources for these languages. I’ve also thought of forming a SIG in ACL; that is something I would expect any interest group in the scientific community to have anyway. If you’re interested in a Finnish or Uralic Computational Linguistics workshop, where we could publish all the resources and papers that didn't fit into e.g. LREC, have some tutorials and discussions on combining the resources for better use and less duplication, and work out how to get better gold standards for a lot of things within this community, since we have none, you should mail me or so. I will probably go ahead soon-ish anyway.
