Wednesday, September 17, 2014

Translation machines


In machine translation there's usually some sort of ongoing competition or disagreement between statistical and knowledge-based approaches. The statistical logic is that if you have a lot of good-quality material and push it through the machine, it will pick out the best bits and produce even better, more usable material. Like a meat grinder. In practice, however, the source material is rarely as good as everyone assumes. The primary training source for machine translation is Europarl, a corpus of European Parliament sessions transcribed and translated, and a system trained on it can merely recall what it has seen there. Statistical machine translation, you see, cannot make very good creative decisions; it can only remake the same decisions it has seen in the translations fed to it. So if you give such a machine translation system text that is not a European Parliament session, it usually fails to make good connections. Much like this McDonald's meat substitute here:
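To make the "recall" point concrete, here is a minimal sketch of phrase-table lookup, the core idea behind phrase-based statistical MT. The phrase table and sentences are made up for illustration; a real decoder such as Moses learns millions of scored phrase pairs and weighs them probabilistically, but the failure mode is the same: what was never seen in the corpus cannot be translated.

```python
# Toy illustration of the "recall" limitation of phrase-based SMT.
# The phrase table below is hypothetical; a real system learns its
# phrase pairs from a parallel corpus like Europarl.

phrase_table = {
    "the committee": "valiokunta",
    "adopted the report": "hyväksyi mietinnön",
    "honourable members": "arvoisat jäsenet",
}

def translate(sentence: str) -> str:
    """Greedy longest-match phrase lookup; unknown phrases pass through."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i = j
                break
        else:
            out.append(words[i])  # out-of-vocabulary: copied verbatim
            i += 1
    return " ".join(out)

# In-domain input translates fine...
print(translate("the committee adopted the report"))
# ...but out-of-domain input mostly falls through untranslated.
print(translate("my cat knocked the lamp off the table"))
```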

Knowledge-based, linguistic, or rule-based machine translation has the opposite problem. The people working on the knowledge are mainly interested in looking at a very small amount of data: the gold, the crown jewels of the language, while neglecting the boring words and structures that make up 95 % of good machine translation. A further problem is that their jewels aren't truly precious but fakes; that is, the interestingness comes from old misclassifications that they hold on to because they make the problems more interesting, such as fake ambiguity created by wrong classifications. So they end up working like these guys from South Park.
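The coverage problem can be caricatured in code. The toy lexicon below is hypothetical and not taken from any real rule-based system; the Finnish word "kuusi" is a genuinely classic ambiguity example ("six" / "spruce" / "your moon", kuu + -si). The point is that the jewel gets three lovingly polished analyses to argue about while the everyday vocabulary that dominates real text gets nothing.

```python
# Toy caricature of rule-based coverage: one hand-crafted entry for a
# linguistically "interesting" word, nothing for the common ones.
# The lexicon format is made up for this sketch.

lexicon = {
    "kuusi": [
        {"lemma": "kuusi", "pos": "NUM", "gloss": "six"},
        {"lemma": "kuusi", "pos": "N", "gloss": "spruce"},
        {"lemma": "kuu", "pos": "N", "gloss": "moon", "clitic": "2SG possessive"},
    ],
}

def analyze(word: str):
    """Look the word up; anything not covered by a rule falls through as unknown."""
    return lexicon.get(word.lower(), [{"lemma": word, "pos": "UNKNOWN"}])

# The jewel gets three competing analyses to disambiguate...
print(analyze("kuusi"))
# ...while a plain everyday word that real text is full of gets nothing.
print(analyze("auto"))  # Finnish for "car": not interesting, so not covered
```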