Sentence splitting in Arabic to Spanish translation

  1. Juan Roldán (1)
  2. Manuel Feria García (1)

  (1) Universidad de Granada, Granada, Spain. ROR: https://ror.org/04njjy449

Journal: Revista española de lingüística aplicada

ISSN: 0213-2028

Year of publication: 2023

Volume: 36

Issue: 2

Pages: 585-614

Type: Article

DOI: 10.1075/RESLA.21008.ROL

Abstract

Modern Standard Arabic makes extensive use of coordination particles, whereas punctuation marks are scarce and erratic, leading to long clauses. This is generally assumed to hinder Sentence Boundary Detection and to promote sentence splitting when translating from Arabic into English. Previous literature on translation from Arabic into Spanish is practically nonexistent. We tested this hypothesis for translation from Arabic into Spanish on a sample of 282,714 graphic words extracted from a bilingual corpus of 8,681,110 graphic words and found that each Arabic sentence yielded an average of 1.5 Spanish sentences. Furthermore, our data show the potential impact of directionality, in that sentence splitting when translating from Arabic into Spanish is 50% more frequent than from English into Arabic. We also determined statistically that five elements (wa [و], ḥaythu [حيث], kamā [كما], wa-qad [وقد], and wa-dhalika [وذلك]) are the most salient potential markers for sentence splitting in the resulting Spanish translations. Our findings should be of particular interest for Computational Linguistics and translator training.
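
To make the two figures in the abstract concrete, the following is a minimal sketch of how an average number of target sentences per source sentence and a relative difference between two splitting rates could be computed. It is not the authors' procedure: the function names `split_stats` and `relative_increase`, the input format (one integer per aligned source sentence), and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def split_stats(targets_per_source):
    """Summarise a sentence alignment.

    targets_per_source: one integer per source sentence, giving the number
    of target sentences it was aligned to (hypothetical input format).
    Returns (average target sentences per source sentence,
             share of source sentences that were split into 2 or more).
    """
    if not targets_per_source:
        raise ValueError("targets_per_source must not be empty")
    average = mean(targets_per_source)
    split_share = sum(1 for n in targets_per_source if n > 1) / len(targets_per_source)
    return average, split_share


def relative_increase(rate_a, rate_b):
    """How much larger rate_a is than rate_b, as a fraction of rate_b."""
    if rate_b == 0:
        raise ValueError("rate_b must be non-zero")
    return (rate_a - rate_b) / rate_b


# Toy alignment, invented for illustration only (not data from the study).
toy_alignment = [1, 2, 1, 3, 1, 1, 2, 1]
average, split_share = split_stats(toy_alignment)

# Toy comparison of two translation directions (again, invented numbers).
toy_increase = relative_increase(0.375, 0.25)
```

With these toy values, an average of 1.5 means that eight source sentences produced twelve target sentences, and a relative increase of 0.5 corresponds to splitting being 50% more frequent in one direction than in the other.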
