| Abstract: | Conditional probabilistic models for word
alignment are popular due to the elegant
way of handling them in the training stage.
However, they suffer from weaknesses such
as garbage collection, and they scale poorly
beyond single-word-based models (DeNero
et al., 2006): not all parameters should actually
be used.
To alleviate the problem, in this paper we
explore regularity terms that penalize the
used parameters. They share the advantages
of the standard training in that iterative
schemes decompose over the sentence
pairs. We explore the models IBM-1 and
HMM, then generalize to models we term
Bi-word models, where each target word
can be aligned to up to two source words.
We give two optimization strategies for the
arising tasks, using EM and projected gradient
descent. While both are well-known,
to our knowledge they have never been
compared experimentally for the task of
word alignment. As a side effect, we show
that, contrary to common belief, for parametric
HMMs the M-step is not solved by renormalizing
expectations.
We demonstrate that the regularity terms
improve on the F-measures of the standard
HMMs and that they improve translation
quality. |
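
To illustrate the projected gradient descent mentioned in the abstract, the following is a minimal sketch. It assumes that the parameters being optimized form a probability distribution, so each gradient step is followed by a Euclidean projection onto the probability simplex. The function names, the step size, and the toy quadratic objective are hypothetical choices for illustration and are not taken from the paper.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1}.

    Standard sort-based method: find the shift theta such that
    clipping (v - theta) at zero yields a vector summing to one.
    """
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(sorted_desc, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of the sorted entries;
        # the last candidate that satisfies it is the correct shift.
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def minimize_on_simplex(gradient, start, step_size=0.5, iterations=100):
    """Projected gradient descent: gradient step, then projection."""
    p = project_to_simplex(start)
    for _ in range(iterations):
        g = gradient(p)
        moved = [p_i - step_size * g_i for p_i, g_i in zip(p, g)]
        p = project_to_simplex(moved)
    return p


# Toy objective (an assumption, not the alignment model's objective):
# f(p) = 0.5 * sum_i (p_i - c_i)^2, whose gradient is p - c.
target = [0.9, 0.4, -0.2]


def toy_gradient(p):
    return [p_i - c_i for p_i, c_i in zip(p, target)]


result = minimize_on_simplex(toy_gradient, [1.0 / 3] * 3)
```

For this toy objective the constrained minimizer is simply the projection of `target` onto the simplex, which makes the sketch easy to check. An actual alignment objective would replace `toy_gradient` with the gradient of the regularized likelihood, which is not reproduced here.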