Grupo Extremeño de Enseñanza de Idiomas Asistido por Ordenador


Despite the amount of money invested in research and development, machine translation is still far from being a real option for businesses, even with a human reviewer behind the process (e.g., King et al., 2003). In recent years, a group of researchers in New York (the company "Meaningful Machines") has opened a new line of research on machine translation, and its preliminary results look very promising. The aim of the present research is to develop this idea into the design of an automated translation system of real value, that is, one whose results are usable at the corporate level [see reports Alpaca, 1996].

The originality of this method lies in its focus on the context of words in both the source and the target texts, so that words are never translated as isolated units but always within a context, co-text, or previously decoded n-gram. Thus, for example, correct morpho-syntactic formations such as gender agreement (e.g., "la casa roja") are produced not because the system implements a rules-based approach but because it relies on prior n-gram training driven by statistical significance. In this case, the n-gram "the red house" is segmented and transformed via the dictionary into "rojo/a" and "casa", and is then matched as "la casa roja" in a massive corpus of Spanish (statistically, this order is 99.8% more significant than "roja casa").
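The selection of "la casa roja" over "roja casa" can be illustrated with a minimal sketch. It is not the researchers' implementation: the toy corpus, the dictionary entries, and the function names (`ngram_counts`, `best_rendering`) are all assumptions made for illustration, and a raw frequency count stands in for whatever significance measure the real system uses.

```python
from collections import Counter
from itertools import permutations, product

# Toy stand-in for the massive Spanish corpus (hypothetical data).
CORPUS = [
    "la casa roja está en la colina",
    "compraron la casa roja",
    "vive en la casa roja del pueblo",
]

# Hypothetical bilingual dictionary: each English word maps to candidate forms.
DICTIONARY = {"the": ["el", "la"], "red": ["rojo", "roja"], "house": ["casa"]}


def ngram_counts(sentences, n):
    """Count every contiguous n-gram of length n in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def best_rendering(source, dictionary, counts):
    """Return the word choice and word order best attested in the corpus."""
    options = [dictionary[word] for word in source.split()]
    candidates = set()
    for choice in product(*options):          # every combination of word forms
        candidates.update(permutations(choice))  # in every possible order
    # Sorting first makes ties deterministic; unseen n-grams count as 0.
    best = max(sorted(candidates), key=lambda cand: counts[cand])
    return " ".join(best)


TRIGRAMS = ngram_counts(CORPUS, 3)
print(best_rendering("the red house", DICTIONARY, TRIGRAMS))
```

No agreement rule is encoded anywhere in the sketch: the feminine article and adjective, and the noun-adjective order, emerge only because that trigram is the one attested in the corpus.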

This method seeks to overcome obstacles in machine translation that are often associated with discourse-level syntax and style. Thus, translation would not be a mere transfer of words and structures from one language to another, but of meanings and usages that are consistent and cohesive with the target language. Another example is the passive voice in English (e.g., "the house was built"), which many machine translation systems would translate, correctly, as "la casa fue construida". The CBMT (context-based machine translation) application, however, would rely on corpus statistics and identify a more widespread usage of "casa" and "construir", producing the better option "se construyó la casa" as an automatic result of processing the massive corpus.
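As a rough illustration of how corpus statistics could favour "se construyó la casa" over the literal passive, the sketch below scores each candidate by the average log-count of its bigrams. The scoring formula, the toy corpus, and the names `bigram_counts`, `usage_score`, and `pick_most_idiomatic` are assumptions for this example, not details taken from the CBMT system.

```python
import math
from collections import Counter

# Toy stand-in for the target-language corpus (hypothetical data).
CORPUS = [
    "se construyó la casa en 1920",
    "se construyó el puente el año pasado",
    "se construyó la iglesia en el siglo xv",
    "la casa fue vendida",
    "la casa es grande",
]

# Two grammatical renderings of "the house was built".
CANDIDATES = ["la casa fue construida", "se construyó la casa"]


def bigram_counts(sentences):
    """Count every pair of adjacent words in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def usage_score(candidate, counts):
    """Average log(1 + count) over the candidate's bigrams (assumed metric)."""
    tokens = candidate.split()
    bigrams = list(zip(tokens, tokens[1:]))
    return sum(math.log1p(counts[b]) for b in bigrams) / len(bigrams)


def pick_most_idiomatic(candidates, counts):
    """Return the candidate whose bigrams are best attested in the corpus."""
    return max(candidates, key=lambda cand: usage_score(cand, counts))


BIGRAMS = bigram_counts(CORPUS)
print(pick_most_idiomatic(CANDIDATES, BIGRAMS))
```

Scoring overlapping bigrams rather than whole sentences lets evidence from related sentences (for instance "se construyó el puente") support a candidate even when the exact phrase is rare.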

The system would even anticipate that an expression cannot be rendered correctly, by checking that no statistically sound option is available in the corpus. In that case, during the statistical comparison, the system would return the signatures or contexts of the given expression and launch a different search so that synonyms may be found. For example, if "put off the meeting" cannot be transferred as "posponer la reunión", either because the dictionary does not list "put off" with this meaning or because the corpus does not show that option, the system would look up other contexts for "the meeting" or for the other words that precede the phrasal verb. It would then look in the corpus for alternative renderings of the expression, such as "aplazar" or "llevar a".
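The fallback search described above might look roughly like the following sketch. It assumes a deliberately incomplete dictionary and a toy corpus, and the helper names (`count_phrase`, `fillers_before`, `translate_with_fallback`) are hypothetical; the real system's notion of a "signature" is certainly richer than the single preceding word used here.

```python
from collections import Counter

# Toy stand-in for the Spanish corpus (hypothetical data).
CORPUS = [
    "decidieron aplazar la reunión hasta el lunes",
    "tuvieron que aplazar la reunión",
    "van a posponer la reunión",
    "el director canceló la reunión",
]

# Hypothetical dictionary that lacks the "postpone" sense of "put off".
DICTIONARY = {"put off": ["quitar", "desanimar"]}


def count_phrase(phrase, sentences):
    """Count occurrences of a phrase as a contiguous token sequence."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in sentences:
        tokens = sentence.split()
        total += sum(tokens[i:i + n] == target
                     for i in range(len(tokens) - n + 1))
    return total


def fillers_before(context, sentences):
    """Count the words that immediately precede the context in the corpus."""
    ctx = context.split()
    n = len(ctx)
    fillers = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens) - n + 1):
            if tokens[i:i + n] == ctx:
                fillers[tokens[i - 1]] += 1
    return fillers


def translate_with_fallback(verb, context, dictionary, sentences):
    """Prefer dictionary options attested with the context; else search the context."""
    attested = {}
    for option in dictionary.get(verb, []):
        hits = count_phrase(f"{option} {context}", sentences)
        if hits > 0:
            attested[option] = hits
    if attested:
        return sorted(attested, key=lambda opt: (-attested[opt], opt))
    # No statistically supported option: fall back to what the corpus places
    # in front of the context, most frequent first.
    return [word for word, _ in fillers_before(context, sentences).most_common()]


ALTERNATIVES = translate_with_fallback("put off", "la reunión", DICTIONARY, CORPUS)
print(ALTERNATIVES)
```

The raw fallback list also contains words that merely share the context, such as "canceló", so a further filtering step against the meaning of the source expression would still be needed before any of them is accepted as a synonym.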