Keywords: automatic evaluation, quality of translation, machine translation, BLEU, F-measure, TER.
The idea of machine translation (MT) of natural languages first appeared in the seventeenth century, but became a reality only at the end of the twentieth century. Today, computer programs are widely used to automate the translation process. Although great progress has been made in the field of machine translation, fully automated translations are far from perfect. Nevertheless, countries continue to spend millions of dollars on various automatic translation programs. In the early 1990s, the U.S. government sponsored a competition among MT systems. Perhaps one of the most valuable outcomes of that enterprise was a corpus of manually produced numerical evaluations of MT quality with respect to a set of reference translations. The development of MT systems has given impetus to a large number of investigations, encouraging many researchers to seek reliable methods for automatic MT quality evaluation.
Machine translation evaluation serves two purposes: a relative estimate allows one to determine whether one MT system is better than another, and an absolute estimate (a value ranging from 0 to 1) gives an absolute measure of efficiency (for example, a value of unity means a perfect translation).
Traditionally, the bases for evaluating MT quality are adequacy (the translation conveys the same meaning as the original text) and fluency (the translation is grammatically correct). Most modern methods of MT quality assessment rely on reference translations. Earlier approaches to scoring a ‘candidate’ text with respect to a reference text were based on the idea of similarity between the candidate text (the text translated by an MT system) and the reference text (the text translated by a professional translator), i.e., the similarity score was to be proportional to the number of matching words. At about the same time, a different idea was put forward. It was based on the fact that matching words appearing in the right order in the candidate and reference sentences should score higher than matching words out of order.
Perhaps the simplest version of the same idea is that a candidate text should be rewarded for containing longer contiguous subsequences of matching words. Papineni et al. reported that a particular version of this idea, which they call ‘BLEU,’ correlates very highly with human judgments. Doddington proposed another version of this idea, now commonly known as the ‘NIST’ score. Although the BLEU and NIST measures may be useful for comparing the relative quality of different MT outputs, it is difficult to gain insight from such measures.
In this paper we consider different methods of MT quality assessment and analyze the translations of candidate and reference texts. In the following sections, we describe several automatic MT evaluation methods: some of them are based on string matching, while others, such as n-gram models, draw on techniques from information retrieval. We then assess the quality of translation by using an automatic program.
2. Methods of automatic MT quality evaluation
To date, the main approach to the quality assessment of language models for MT systems relies on statistical methods. In this case, the model is, in fact, a probability distribution over the set of all sentences of a language. Naturally, such a distribution cannot be stored or used explicitly; therefore, more compact algorithms are employed. Let us briefly consider which models are currently used in commercial and experimental systems of MT quality assessment with unlimited dictionaries.
2.1 Method of approximate string matching
In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is a technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding an approximate substring inside a given string and finding dictionary strings that match the pattern approximately.
The word error rate (WER) is a metric based on this approach. The WER is calculated as the sum of insertions, deletions, and substitutions, normalized by the length of the reference sentence. If the WER is equal to zero, the translation is identical to the reference text. The main problem is that the resulting estimate is not always in the range from 0 to 1: when the candidate contains many more words than the reference, the WER can exceed 1.
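The WER computation above can be sketched in a few lines of Python; the function names are illustrative, not from any particular toolkit:

```python
# A minimal WER sketch: word-level edit distance between candidate and
# reference, normalised by the reference length.

def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def wer(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return edit_distance(cand, ref) / len(ref)
```

Note how a candidate much longer than the reference drives the score above unity: `wer("a b c d e", "a")` requires four deletions against a one-word reference, giving a WER of 4.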
Another version of the WER is the WERg metric, in which the sum of insertions, deletions, and substitutions is normalized by the length of the Levenshtein alignment between the two strings rather than by the reference length. In information theory and computational linguistics, the Levenshtein distance (editorial distance, or edit distance) between two strings is defined as the minimum number of edits needed to transform one string into the other, the allowable edit operations being insertion, deletion, or substitution of a single character. The advantage of this metric is that the translation quality value always lies in the range from 0 to 1 (even in the worst case, when nothing matches or the translation is absent, the value will not exceed unity).
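The Levenshtein distance defined above can be sketched at the character level with the standard dynamic-programming recurrence, kept to two rows of the table:

```python
# Character-level Levenshtein distance: dynamic programming over a
# (len(a)+1) x (len(b)+1) table, storing only the previous row.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

For example, turning "kitten" into "sitting" takes three edits: two substitutions (k→s, e→i) and one insertion (g).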
Experiments performed by Blattsom et al. have shown that the WERg metric is not reliable and does not agree with the estimates obtained when the machine translation is analyzed by humans.
The position-independent error rate (PER) neglects the order of the words in the string matching operation. In this case, the number of differing words between the candidate and reference texts, normalized by the length of the reference translation, is calculated.
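A minimal PER sketch under one common bag-of-words formulation: word order is ignored, and only the multiset overlap between candidate and reference counts.

```python
from collections import Counter

def per(candidate: str, reference: str) -> float:
    """Position-independent error rate: word order is ignored; errors are
    the words of the longer side not covered by the bag-of-words overlap,
    normalised by the reference length."""
    cand_words, ref_words = candidate.split(), reference.split()
    cand, ref = Counter(cand_words), Counter(ref_words)
    matched = sum((cand & ref).values())  # multiset intersection
    errors = max(len(cand_words), len(ref_words)) - matched
    return errors / len(ref_words)
```

A fully scrambled candidate still scores well: `per("the cat sat mat on", "the cat sat on the mat")` gives 1/6, because only one reference word is missing from the candidate's bag, even though the order is wrong.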
Another metric that is widely used in assessing the translation quality is the translation error rate (TER). This metric makes it possible to measure the number of edits required to change a system output into one of the given reference translations.
In fact, any string matching metric can be used for assessing the MT quality. One such example is the “string kernel,” which allows one to take into account different levels of natural language (e.g., morphological, lexical, etc.), or the relationship between synonyms.
2.2 N-gram models
N-gram language models make the explicit assumption that the probability of the next word in a sentence depends only on the previous n-1 words. In practice, models with n = 1, 2, 3, and 4 are used; for the English language, trigram and four-gram models are the most successful. Today, almost all systems of MT quality assessment rely on n-gram models. In this case, the probability of a whole sentence is calculated as the product of the probabilities of its constituent n-grams.
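A toy bigram model (n = 2) shows how the sentence probability is assembled as a product of conditional n-gram probabilities; the three-sentence training corpus is purely illustrative:

```python
from collections import Counter

# Toy bigram model: P(sentence) = product of P(w2 | w1) over adjacent
# word pairs, with probabilities estimated by counting on a tiny corpus.
# <s> and </s> mark sentence start and end.

corpus = ["<s> the cat sat </s>", "<s> the dog sat </s>", "<s> the cat ran </s>"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])            # history counts (all but </s>)
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs

def sentence_probability(sentence: str) -> float:
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        # A history w1 unseen in training would divide by zero here;
        # real models apply smoothing to avoid this.
        p *= bigrams[(w1, w2)] / unigrams[w1]  # P(w2 | w1)
    return p
```

For the sentence "the cat sat", the product is P(the|&lt;s&gt;) x P(cat|the) x P(sat|cat) x P(&lt;/s&gt;|sat) = 1 x 2/3 x 1/2 x 1 = 1/3; any unseen bigram drives the whole product to zero.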
The main advantages of n-gram models are their relative simplicity and the possibility of constructing a model that can be trained on a sufficiently large corpus of a language. However, such models are not devoid of drawbacks: they make it impossible to model semantic and pragmatic relationships in a language. In fact, if a dictionary contains N words, the number of possible pairs of words will be N^2. Even if only 0.1% of them actually occur in the language, the minimum volume of the language corpus necessary to obtain statistically valid estimates will amount to 125 billion words, or about 1 terabyte. For trigram models, the minimum corpus will reach hundreds of thousands of terabytes.
To overcome these drawbacks, well-developed smoothing techniques are used, which enable the estimation of model parameters when data are insufficient or absent.
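As an illustration, add-one (Laplace) smoothing, the simplest such technique, gives every unseen bigram a small non-zero probability; the two-sentence corpus is purely illustrative:

```python
from collections import Counter

# Add-one (Laplace) smoothing: every bigram count is incremented by one
# and the history count by the vocabulary size V, so unseen bigrams get
# a small non-zero probability instead of zero.

corpus = ["the cat sat", "the dog sat"]
vocab = set(w for s in corpus for w in s.split())
V = len(vocab)  # 4 words: the, cat, sat, dog

unigrams = Counter(w for s in corpus for w in s.split()[:-1])
bigrams = Counter(b for s in corpus for b in zip(s.split(), s.split()[1:]))

def smoothed_bigram_prob(w1: str, w2: str) -> float:
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
```

The seen bigram "the cat" gets (1+1)/(2+4) = 1/3, while the unseen bigram "cat dog" gets (0+1)/(1+4) = 0.2 instead of zero.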
The main metrics based on n-grams are BLEU, NIST, F-measure, and METEOR.
BLEU (Bilingual Evaluation Understudy) is an algorithm for automatic evaluation of the quality of a machine translation, which is compared to the reference translation, using n-grams. This metric of MT quality assessment was first proposed and implemented by Papineni et al.
Measuring translation quality is a challenging task, primarily due to the lack of a definition of an ‘absolutely correct’ translation. The most common technique of translation quality assessment is to compare the output of automated and human translations of the same document. But this is not as simple as it may seem: one translator's translation may differ from that of another translator. This inconsistency between different reference translations presents a serious problem, especially when different reference translations are used to assess the quality of automated translation solutions.
A document translated by specially designed automated software can have a 60% match with the translation done by one translator and a 40% match with that of another translator. Although both professional translations are technically correct (they are grammatically correct, they convey the same meaning, etc.), a 60% overlap of words is a sign of higher MT quality. Thus, although reference translations are used for comparison, they cannot be a completely objective and consistent measurement of the MT quality.
The BLEU metric scores the MT quality on a scale from 0 to 1. The closer the score to unity, the greater the overlap with the reference translation and, therefore, the better the MT system. In short, the BLEU metric measures how many words and word sequences coincide between the candidate and the reference, with matching sequences rewarded more than isolated matching words. For example, a string of four words in the translation that matches the human reference translation (in the same order) will have a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one- or two-word match.
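A minimal single-reference BLEU sketch along these lines: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. Real implementations add smoothing and multi-reference support, which this sketch omits.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Counter of all contiguous n-grams in a word list."""
    return Counter(zip(*(words[i:] for i in range(n))))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram cannot be credited more
        # times than it occurs in the reference.
        clipped = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum)
```

An exact match scores 1.0; substituting a single word lowers every n-gram precision at once, which is how longer matching sequences end up weighted more heavily than isolated word matches.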
The NIST (National Institute of Standards and Technology) precision measure is a metric used to evaluate MT variants. NIST was intended as an improved version of BLEU; in this case, the arithmetic mean of the n-gram precisions is calculated. An important difference from the BLEU metric is that NIST also relies on a frequency component. Whereas BLEU simply calculates the n-gram precision, assigning an equal weight to each exact match, NIST also calculates how informative each matching n-gram is.
For example, even if the bigram ‘on the’ coincides with the same phrase in the reference text, the translation still receives a lower score than the correct matching of the bigram ‘size distribution,’ because the latter phrase is less likely to occur.
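The information weight behind this example can be sketched as follows, with the weight of a bigram w1 w2 taken as log2(count(w1) / count(w1 w2)), so that bigrams rare relative to their history are worth more; the tiny reference corpus is purely illustrative:

```python
import math
from collections import Counter

# Bigram information weight in the NIST spirit: rarer continuations
# score higher. Counts are gathered from illustrative reference data.

references = [
    "the cat sat on the table",
    "a book lay on the table",
    "the size distribution was measured",
    "a sample of this size was used",
]

counts = Counter()
for sentence in references:
    words = sentence.split()
    for n in (1, 2):
        counts.update(zip(*(words[i:] for i in range(n))))

def info(bigram):
    """log2( count(w1) / count(w1 w2) ): zero when w2 always follows w1,
    larger when the continuation is rare relative to its history."""
    return math.log2(counts[(bigram[0],)] / counts[bigram])
```

Here `info(("on", "the"))` is 0, because "on" is always followed by "the" in the reference data, while `info(("size", "distribution"))` is 1, since "size" continues as "distribution" only half the time.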
The F-measure is a metric that calculates the harmonic mean of precision and recall. The metric is based on the search for the best match between the candidate and reference translations (the ratio of the total number of matching words to the lengths of the candidate and reference texts). It is often useful to combine precision and recall into a single averaged value.
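A minimal unigram-level sketch of precision, recall, and their harmonic mean:

```python
from collections import Counter

def f_measure(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between a
    candidate and a single reference translation."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum((cand & ref).values())  # bag-of-words overlap
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())  # matched / candidate length
    recall = matched / sum(ref.values())      # matched / reference length
    return 2 * precision * recall / (precision + recall)
```

For a candidate that reproduces half of a six-word reference exactly, precision is 1 and recall 0.5, giving F = 2/3; the harmonic mean penalises the imbalance more than an arithmetic mean would.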
The metric for evaluation of translation with explicit ordering (METEOR) is an improved version of the F-measure. This system was designed to address some of the weaknesses of the BLEU metric. METEOR scores the output by matching the automated and reference translations word for word. When more than one reference translation is available, the automated translation is compared with each of them and the best result is reported.
One can have different attitudes to the different metrics, but at this point BLEU, METEOR, and NIST are the most widely used; it is with these metrics that all other MT quality assessment systems are compared. The developers of the F-measure claim that their metric shows the best agreement with human assessment. However, this is not always the case: the F-measure does not work well when the average edit distance is small. Empirical data show that more attention should be paid to the completeness (recall) of the translation. Studies suggest that recall is most often the parameter that determines the quality of translation.
3. Automatic evaluation of the quality of statistical (Google) and rule-based (Prompt) MT systems
Translation is an intellectual challenge, and, therefore, skepticism about the possibility of using a computer for automated translation is quite natural. However, the creators of MT systems have managed to endow their systems with a form of understanding, and machine translation now belongs to a class of artificial intelligence programs.
Currently, we can speak of two approaches to written translation: the first one is machine translation based on the rules of the source and target languages and the second approach involves statistical machine translation.
The earliest “translation engines” in machine-based translations were all based on the direct, so-called “transformer,” approach. Input sentences of the source language were transformed directly into output sentences of the target language, using a simple form of parsing. The parser did a rough analysis of the source sentence, dividing it into subject, object, predicate, etc. Source words were then replaced by target words selected from a dictionary, and their order rearranged so as to comply with the rules of the target language. This approach was used for a long time, only to be finally replaced by a less direct approach, which is called “linguistic knowledge.” Modern computers, which have more processing power and more memory, can do what was impossible in the 1960s. Linguistic-knowledge translators have two sets of grammar rules: one for the source language, and the other for the target language. In addition, modern computers analyze not only the grammar (morphological and syntactic structure) of the source language but also the semantic information. They also have information about the idiomatic differences between the languages, which prevents them from making silly mistakes. A representative of the rule-based approach to machine translation is the Prompt software, developed by the leading Russian developer of linguistic IT solutions.
The second approach is based on a statistical method: by analyzing a large number of parallel texts (identical texts in the source and target languages), the program selects the variants that coincide most often and uses them in the translation. It does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis. In addition, the lexical units here are word combinations rather than separate words. One of the well-known examples of this approach is “Google Translate,” which is based on statistical machine translation. However, the translated sentences are sometimes so discordant that it is impossible to understand them.
In this section, using concrete examples, we compare the quality of translations made by the MT systems Google (http://translate.google.ru/) and Prompt (www.translate.ru).
For the analysis, we selected five titles, abstracts, and keyword lists from the ‘Kvantovaya Elektronika’ journal, which is first published in Russian and then translated into English by a group of professional translators.
Evolution of the size distribution function of Au nanoparticles in a liquid under the action of laser radiation

Abstract. The process of nanoparticle fragmentation in a liquid under pulsed laser heating is studied theoretically and experimentally. The process is modelled by solving the kinetic equation for the nanoparticle size distribution function, taking into account the temperature dependence of the thermophysical parameters of the medium. It is shown that fragmentation proceeds through the detachment of smaller fragments from a molten nanoparticle. The simulation results are in good agreement with the experimental data obtained by fragmenting gold nanoparticles in water under copper vapour laser radiation at a peak radiation intensity in the medium of ~10^6 W/cm^2.

Keywords: nanoparticles, colloidal solutions, laser ablation of metals, plasmon resonance, fragmentation.
Interaction of noncollinear femtosecond laser filaments in sapphire

Abstract. The interaction of two coherent femtosecond laser pulses propagating at a small angle to each other in a sapphire crystal in the filamentation regime is studied numerically and experimentally. The distributions of the surface energy density and of the free-electron concentration in the resulting laser-plasma channels are obtained. The formation of additional filaments outside the plane of the initial pulse propagation is observed.

Keywords: filamentation, femtosecond radiation, laser plasma, filament interaction.
Effect of an electric field on near-surface processes in the laser treatment of metals

Abstract. It is shown that, when the strength of an external electric field of either polarity is varied from 0 to 10^6 V/m during the action of laser radiation with an average flux density of ~10^6 W/cm^2 on the surface of a number of metals (Cu, Al, Sn, Pb), the changes in the evolution of the plasma plume at the early stages are quantitative rather than qualitative in character. At the same time, the characteristic sizes of the target material droplets ejected from the irradiated zone decrease substantially (severalfold) with increasing amplitude of the external electric field strength, irrespective of its polarity.

Keywords: laser radiation, electric field, plasma formation, gravitational-capillary waves.
On associations of noninteracting particles (crystal-like neutron structures)

Abstract. The physical feasibility of associations of mutually noninteracting particles, which arise in accordance with the uncertainty relation under ‘corporate’ spatial confinement of the particle ensemble as a whole, is discussed. The treatment is exemplified by an ensemble of ultracold neutrons placed in a common potential well of infinite depth. Quantitative estimates are presented and the expected properties of the resulting crystal-like spatially periodic structures are indicated.

Keywords: quantum nucleonics, ultracold neutrons, laser methods of producing ultracold neutrons, neutron associations, neutrons in a potential well of infinite depth.
Elliptically polarised cnoidal waves in a medium with spatial dispersion of cubic nonlinearity

Abstract. New particular analytical solutions of a system of nonlinear Schrödinger equations are found, which correspond to elliptically polarised cnoidal waves in an isotropic gyrotropic medium with spatial dispersion of cubic nonlinearity and second-order frequency dispersion, provided that the conditions for the formation of waveguides of a common profile for each of the circularly polarised components of the light field are fulfilled.

Keywords: cubic nonlinearity, spatial dispersion, nonlinear Schrödinger equations, elliptic polarisation, cnoidal waves.
The corresponding translations were taken from http://iopscience.iop.org/1063-7818/42/2.
For an automatic analysis, we used the relevant software that is publicly available from http://www.languagestudio.com/LanguageStudioDesktop.aspx#Pro.
Language Studio™ Lite is a free tool that provides key metrics for translation quality. This tool can be used to measure not only the quality but also improvements in quality, because custom translation engines are constantly being updated via the quality improvement feedback cycle. Language Studio™ Lite currently supports such metrics as BLEU, F-measure, and TER.
From the point of view of syntax, the abstracts presented for the analysis are characterized mainly by simple sentences of the type ‘something is presented’ or ‘something is investigated.’ Besides, compound sentences with an object clause are frequently used, for example, ‘it is shown that ...’ or ‘it is found that ...’. As to the vocabulary, translators most often use one-word terms (waveguide), two-word terms (light wave, uncertainty relation), and three-word terms (target material droplets), whereas four-word terms (crystal-like spatially periodic structure) are extremely rare.
For the program to score the translations correctly, we preprocessed the reference translations and the candidate translations produced by Google and PROMPT: each sentence started a new paragraph, and the texts were converted into .txt format.
Initially, we compared the reference translation with the outputs from Google and PROMPT, using n-gram metrics. A summary of the translation evaluation results is presented below.