Semantic Relations Predict the Bracketing of Three-Component Multiword Terms

  1. Rojas Garcia, Juan
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2022

Issue: 69

Pages: 141-152

Type: Article


Abstract

In English multiword terms (MWTs) with three or more components (e.g., sea level rise), establishing the dependencies between the components requires linguistic analysis and specialized knowledge of the domain in which the MWTs are used. This structural disambiguation, or bracketing, involves grouping the components so as to reduce the MWT to its basic modifier+head structure, as in [sea level] [rise]. Knowing the bracketing of an MWT not only facilitates its comprehension and translation into other languages, but also improves the performance of machine translation systems and syntactic parsers. This article therefore presents a pilot study that explores whether the bracketing of a three-component MWT used as an argument in a sentence can be predicted from the semantic information encoded in that sentence. With a random forest model, the semantic relation between the MWT and another argument in the same sentence, the lexical domain of the verb, and the semantic role of the MWT are shown to predict the bracketing of the 190 ternary MWTs used as arguments in a sample of 188 sentences, semantically annotated and extracted from a coastal engineering corpus (F1 score of 100%). Moreover, the semantic relation that a ternary MWT holds with another argument in the same sentence is, on its own, highly predictive of its bracketing when used in a binary decision tree (F1 score of 94.12%).
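
As a rough illustration of the idea described in the abstract, the Python sketch below renders the two possible bracketings of a three-component term and learns a single-feature rule that maps a semantic relation label to its most frequent bracketing. The relation labels (`rel_A`, `rel_B`), the toy samples, and the function names are invented for illustration only; they are not the study's annotation scheme, data, or model, and the majority rule is a simplified stand-in for a decision tree on one categorical feature.

```python
from collections import Counter, defaultdict


def bracket(term, kind):
    """Render a three-word term as modifier+head with left or right bracketing."""
    a, b, c = term.split()
    if kind == "left":
        return f"[{a} {b}] [{c}]"
    if kind == "right":
        return f"[{a}] [{b} {c}]"
    raise ValueError(f"unknown bracketing: {kind}")


def fit_majority_rule(samples):
    """Map each relation label to the bracketing seen most often with it."""
    votes = defaultdict(Counter)
    for relation, kind in samples:
        votes[relation][kind] += 1
    return {rel: counts.most_common(1)[0][0] for rel, counts in votes.items()}


def f1_score(gold, predicted, positive="left"):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (relation, bracketing) pairs, made up for this sketch.
toy_samples = [
    ("rel_A", "left"),
    ("rel_A", "left"),
    ("rel_A", "right"),
    ("rel_B", "right"),
    ("rel_B", "right"),
]

rule = fit_majority_rule(toy_samples)
gold = [kind for _, kind in toy_samples]
predicted = [rule[relation] for relation, _ in toy_samples]
score = f1_score(gold, predicted)

print(bracket("sea level rise", "left"))  # [sea level] [rise]
print(rule, score)
```

The score here is computed on the same toy samples used to fit the rule, so it says nothing about generalization; a real evaluation would hold out data or use cross-validation.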

References

  • Agirre, E., T. Baldwin, and D. Martinez (2008). Improving parsing and PP attachment performance with sense information. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 317-325). ACL.
  • Barrière, C., and P.A. Ménard (2014). Multiword noun compound bracketing using Wikipedia. In Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014) (pp. 72-80). ACL.
  • Bergsma, S., E. Pitler, and D. Lin (2010). Creating robust supervised classifiers via web-scale n-gram data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 865-874). ACL.
  • Brants, T., and A. Franz (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium.
  • Faber, P., and R. Mairal (1999). Constructing a Lexicon of English Verbs. Mouton de Gruyter.
  • Faber, P., P. León-Araúz, and J.A. Prieto (2009). Semantic relations, dynamicity, and terminological knowledge bases. Current Issues in Language Studies, 1, 1-23.
  • Faruqui, M., and C. Dyer (2015). Non-distributional word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (pp. 464-469). ACL.
  • Fellbaum, C.A. (1998). Semantic network of English: The mother of all WordNets. Computers and the Humanities, 32, 209-220.
  • Fernández, A., S. García, M. Galar, R.C. Prati, B. Krawczyk, and F. Herrera (2018). Learning from Imbalanced Data Sets. Springer.
  • Fillmore, C.J. (1968). The case for case. In E. Bach, and R. Harms (Eds.), Universals in Linguistic Theory (pp. 1-89). Holt, Rinehart, and Winston.
  • Girju, R., D.I. Moldovan, M. Tatu, and D. Antohe (2005). On the semantics of noun compounds. Computer Speech and Language, 19(4), 479-496.
  • Green, N. (2011). Effects of noun phrase bracketing in dependency parsing and machine translation. In 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Proceedings of Student Session (pp. 69-74). ACL.
  • Hellmann, S., C. Stadler, J. Lehmann, and S. Auer (2009). DBpedia live extraction. In R. Meersman, T. Dillon, and P. Herrero (Eds.), On the Move to Meaningful Internet Systems (OTM 2009) (Vol. 5871, pp. 1209-1223). Springer. Lecture Notes in Computer Science.
  • James, G., D. Witten, T. Hastie, and R. Tibshirani (2015). An Introduction to Statistical Learning. Springer.
  • Kim, S.N., and T. Baldwin (2013). A lexical semantic approach to interpreting and bracketing English noun compounds. Natural Language Engineering, 19(3), 385-407.
  • Krippendorff, K. (2012). Content Analysis: An Introduction to its Methodology. Sage.
  • Kroeger, P.R. (2005). Analyzing Grammar: An Introduction. Cambridge University Press.
  • Kuhn, M. (2021). caret: Classification and Regression Training. R package version 6.0-90.
  • Lapata, M., and F. Keller (2004). The web as a baseline: Evaluating the performance of unsupervised web-based models for a range of NLP tasks. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (HLT-NAACL 2004) (pp. 121-128). ACL.
  • Lauer, M. (1994). Conceptual Association for Compound Noun Analysis. CoRR.
  • Lauer, M. (1995). Corpus statistics meet the noun compound: Some empirical results. In Proceedings of the 33rd Annual Meeting of the ACL (pp. 47-54). ACL.
  • Lazaridou, A., E.M. Vecchi, and M. Baroni (2013). Fish transporters and miracle homes: How compositional distributional semantics can help NP parsing. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013) (pp. 1908-1913). ACL.
  • Leech, G. (1981). Semantics: The Study of Meaning. Penguin.
  • León-Araúz, P., A. San Martín, and A. Reimerink (2018). The EcoLexicon English corpus as an open corpus in Sketch Engine. In Proceedings of the 18th EURALEX International Congress (pp. 893-901). Euralex.
  • León-Araúz, P., M. Cabezas-García, and P. Faber (2021). Multiword-term bracketing and representation in terminological knowledge bases. In Electronic Lexicography in the 21st Century. Proceedings of the eLex 2021 Conference (pp. 139-163). Lexical Computing CZ.
  • Lin, D., K.W. Church, H. Ji, S. Sekine, D. Yarowsky, S. Bergsma, K. Patil, E. Pitler, R. Lathbury, V. Rao, K. Dalwani, and S. Narsale (2010). New tools for web-scale n-grams. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (pp. 2221-2227). ELRA.
  • Marcus, M. (1980). A Theory of Syntactic Recognition for Natural Language. MIT Press.
  • Marcus, M.P., M.A. Marcinkiewicz, and B. Santorini (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
  • Ménard, P.A., and C. Barrière (2014). Linked open data and web corpus data for noun compound bracketing. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14) (pp. 702-709). ELRA.
  • Michel, J.B., Y.K. Shen, A.P. Aiden, A. Veres, M.K. Gray, T.G.B. Team, J.P. Pickett, D. Holberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M.A. Nowak, and E.L. Aiden (2010). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176-182.
  • Nakov, P., and M. Hearst (2005). Search engine statistics beyond the n-gram: Application to noun compound bracketing. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL) (pp. 17-24). ACL.
  • Pitler, E., S. Bergsma, D. Lin, and K.W. Church (2010). Using web-scale n-grams to improve base NP parsing performance. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010) (pp. 886-894). ACL.
  • Pustejovsky, J., P. Anick, and S. Bergler (1993). Lexical semantic techniques for corpus analysis. Computational Linguistics, 19(2), 331-358.
  • Resnik, P.S. (1993). Selection and Information: A Class-based Approach to Lexical Relationships. Ph.D. Thesis. University of Pennsylvania.
  • Roget, P.M. (1852). Roget’s Thesaurus of English Words and Phrases. Available in Project Gutenberg. https://www.gutenberg.org/ebooks/10681.
  • Rojas-Garcia, J. (forthcoming). Semantic representation of context for the inclusion of named rivers in a terminological knowledge base. Frontiers in Psychology.
  • Ruppenhofer, J., M. Ellsworth, M.R.L. Petruck, C.R. Johnson, and J. Scheffczyk (2010). FrameNet II: Extended Theory and Practice. International Computer Science Institute.
  • Thompson, P., S.A. Iqbal, J. McNaught, and S. Ananiadou (2009). Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics, 10, 349.
  • Vadas, D., and J.R. Curran (2007). Large-scale supervised models for noun phrase bracketing. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING-2007) (pp. 104-112). PACLING.
  • Vadas, D., and J.R. Curran (2008). Parsing noun phrase structure with CCG. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 335-343). ACL.