Extraction of Terms Semantically Related to Colponyms: Evaluation in a Small Specialized Corpus

Author: Rojas-Garcia, Juan
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 67

Pages: 139-151

Type: Article


Abstract

EcoLexicon is a terminological knowledge base on the environment whose design allows the geographic contextualization of colponyms, that is, named bays (e.g., Pensacola Bay). Count-based and prediction-based distributional semantic models (DSMs) were applied to a small specialized corpus in English to extract terms related to the named bays, along with their semantic relations. Since the evaluation of DSMs on small specialized corpora has been less explored, this paper identifies both the parameter combination and the five similarity measures suited to extracting terms that hold the relations takes_place_in, located_in, and attribute_of with the named bays. The DSMs are evaluated against three manually annotated datasets. The results indicate that: count-based models outperform predictive models; the similarity measures yield similar results, except for Euclidean distance; and the detection of a specific relation depends on the size of the context window.
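The count-based DSM pipeline the abstract describes can be sketched as follows: co-occurrence counts collected within a symmetric context window, a weighting step, and competing similarity measures over the resulting vectors. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy corpus, the window size, and the PPMI weighting scheme are all choices made here for demonstration.

```python
from collections import Counter
from math import log, sqrt

# Toy corpus (illustrative assumption, not the paper's data).
corpus = [
    "pensacola bay located in florida".split(),
    "sediment deposited in pensacola bay".split(),
    "hurricane takes place in the bay".split(),
]

WINDOW = 2  # symmetric context window; the paper finds relation detection depends on this

# 1. Collect co-occurrence counts within the window.
cooc = Counter()
word_freq = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        word_freq[w] += 1
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if i != j:
                cooc[(w, sent[j])] += 1

total = sum(cooc.values())
vocab = sorted(word_freq)

# 2. PPMI weighting (an assumed, standard choice): max(0, log P(w,c) / (P(w)P(c))).
def ppmi(w, c):
    joint = cooc[(w, c)] / total
    if joint == 0:
        return 0.0
    pw = sum(n for (a, _), n in cooc.items() if a == w) / total
    pc = sum(n for (_, b), n in cooc.items() if b == c) / total
    return max(0.0, log(joint / (pw * pc)))

def vector(w):
    """Row of the weighted word-by-context matrix for word w."""
    return [ppmi(w, c) for c in vocab]

# 3. Two of the similarity measures the paper compares.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

One design point this sketch makes visible: cosine similarity is invariant to vector length while Euclidean distance is not, which is a plausible reason a measure based on Euclidean distance would behave as the outlier among otherwise similar measures.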

References

  • Alrabia, M., N. Alhelewh, A. Al-Salman, and E. Atwell. 2014. An empirical study on the Holy Quran based on a large classical Arabic corpus. International Journal of Computational Linguistics, 5(1): 1-13.
  • Asr, F., J. Willits, and M. Jones. 2016. Comparing predictive and co-occurrence-based models of lexical semantics trained on child-directed speech. In A. Papafragou, D. Grodner, D. Mirman, and J. Trueswell (eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society (CogSci), Philadelphia (Pennsylvania), 1092-1097.
  • Baroni, M., G. Dinu, and G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 238-247.
  • Baroni, M., and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4): 673-721.
  • Beltagy, I., K. Lo, and A. Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, 3615-3620.
  • Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1): 289-300.
  • Benoit, K., K. Watanabe, H. Wang, P. Nulty, A. Obeng, S. Müller, and A. Matsuo. 2018. quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30): 774.
  • Bernier-Colborne, G., and P. Drouin. 2016. Evaluation of distributional semantic models: a holistic approach. In Proceedings of the 5th International Workshop on Computational Terminology (Computerm), Osaka (Japan), 52-61.
  • Bertels, A., and D. Speelman. 2014. Clustering for semantic purposes: Exploration of semantic similarity in a technical corpus. Terminology, 20(2): 279-303.
  • Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5: 135-146.
  • Bollegala, D., T. Maehara, Y. Yoshida, and K. Kawarabayashi. 2015. Learning word representations from relational graphs. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Palo Alto, 2146-2152.
  • Chen, Z., Z. He, X. Liu, and J. Bian. 2018. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Medical Informatics and Decision Making, 18(Suppl 2): 65.
  • Chiu, B., G. Crichton, A. Korhonen, and S. Pyysalo. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical NLP, Berlin, 166-174.
  • Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug): 2493-2537.
  • Devlin, J., M.W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v2.
  • El Bazi, I., and N. Laachfoubi. 2016. Arabic named entity recognition using word representations. International Journal of Computer Science and Information Security, 14(8): 956-965.
  • Erk, K., S. Padó, and U. Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4): 723-763.
  • Evert, S. 2008. Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook. Berlin: Walter de Gruyter, 1212-1248.
  • Evert, S., P. Uhrig, S. Bartsch, and T. Proisl. 2017. E-VIEW-alation – A large-scale evaluation study of association measures for collocation identification. In Proceedings of the eLex 2017 Conference, Leiden, 531-549.
  • Faber, P. (ed.). 2012. A Cognitive Linguistics View of Terminology and Specialized Language. Berlin/Boston: De Gruyter Mouton.
  • Faber, P., P. León-Araúz, and J.A. Prieto. 2009. Semantic relations, dynamicity, and terminological knowledge bases. Current Issues in Language Studies, 1: 1-23.
  • Ferret, O. 2015. Réordonnancer des thésaurus distributionnels en combinant différents critères. TAL, 56(2): 21-49.
  • Gries, S., and A. Stefanowitsch. 2010. Cluster analysis and the identification of collexeme classes. In S. Rice, and J. Newman (eds.), Empirical and Experimental Methods in Cognitive/Functional Research. Stanford (California): CSLI, 73-90.
  • Gwinn, N., and C. Rinaldo. 2009. The Biodiversity Heritage Library: Sharing biodiversity with the world. IFLA Journal, 35(1): 25-34.
  • Huang, A. 2008. Similarity measures for text document clustering. In Proceedings of the New Zealand Computer Science Research Student Conference 2008, Christchurch, 49-56.
  • Ide, N., and J. Pustejovsky (eds.). 2017. Handbook of Linguistic Annotation. Dordrecht: Springer.
  • Kiela, D., and S. Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), Gothenburg, 21-30.
  • Kilgarriff, A., P. Rychlý, P. Smrz, and D. Tugwell. 2004. The Sketch Engine. In Proceedings of the 11th EURALEX International Congress, Lorient, 105-115.
  • Krieger, M.G., and M.J.B. Finatto. 2004. Introdução à Terminologia: teoria & prática. São Paulo: Contexto.
  • Lapesa, G., S. Evert, and S. Schulte im Walde. 2014. Contrasting syntagmatic and paradigmatic relations: Insights from distributional semantic models. In Proceedings of the 3rd Joint Conference on Lexical and Computational Semantics, Dublin, 160-170.
  • León-Araúz, P., A. San Martín, and A. Reimerink. 2018. The EcoLexicon English corpus as an open corpus in Sketch Engine. In Proceedings of the 18th EURALEX International Congress, Ljubljana, 893-901.
  • Levy, O., Y. Goldberg, and I. Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3: 211-225.
  • Manning, C.D., P. Raghavan, and H. Schütze. 2009. Introduction to Information Retrieval. Cambridge (England): Cambridge University Press.
  • Manning, C.D., M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, 55-60.
  • Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, 52-55.
  • Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Workshop Proceedings of International Conference on Learning Representations. Scottsdale.
  • Miller, G.A., and W.G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1): 1-28.
  • Mitchell, J., and M. Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08, Columbus (Ohio), 236-244.
  • Nakov, P. 2013. On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19: 291–330.
  • Nematzadeh, A., S.C. Meylan, and T.L. Griffiths. 2017. Evaluating vector-space models of word representation, or, the unreasonable effectiveness of counting words near other words. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, London, 859-864.
  • Nguyen, N.T.H., A.J. Soto, G. Kontonatsios, R. Batista-Navarro, and S. Ananiadou. 2017. Constructing a biodiversity terminological inventory. PLoS ONE, 12(4): e0175277.
  • Nooralahzadeh, F., L. Øvrelid, and J.T. Lønning. 2018. Evaluation of domain-specific word embeddings using knowledge resources. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, 1438-1445.
  • Pearson, J. 1998. Terms in context. Amsterdam: John Benjamins.
  • Pennington, J., R. Socher, and C.D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods for Natural Language Processing (EMNLP), Doha (Qatar), 1532-1543.
  • Prokopidis, P., V. Papavassiliou, A. Toral, M.P. Riera, F. Frontini, F. Rubino, and G. Thurmair. 2012. Final report on the corpus acquisition & annotation subsystem and its components. Technical Report WP-4.5, PANACEA Project.
  • Pilehvar, M.T., and N. Collier. 2016. Improved semantic representation for domain-specific entities. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Berlin, 12-16.
  • Rohde, D., L. Gonnerman, and D. Plaut. 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8: 627-633.
  • Rojas-Garcia, J., and P. Faber. 2019a. Extraction of terms for the construction of semantic frames for named bays. Argentinian Journal of Applied Linguistics, 7(1): 27-57.
  • Rojas-Garcia, J., and P. Faber. 2019b. Extraction of terms related to named rivers. Languages, 4(3): 46.
  • Rojas-Garcia, J., and P. Faber. 2019c. Evaluation of distributional semantic models for the extraction of semantic relations for named rivers from a small specialized corpus. Procesamiento del Lenguaje Natural, 63: 51-58.
  • Room, A. 1996. An Alphabetical Guide to the Language of Name Studies. Lanham/London: The Scarecrow Press.
  • Sahlgren, M., and A. Lenci. 2016. The effects of data size and frequency range on distributional semantic models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin (Texas), 975-980.
  • Sager, J.C., D. Dungworth, and P.F. McDonald. 1980. English Special Languages. Principles and Practice in Science and Technology. Wiesbaden: Brandstetter Verlag.
  • Strehl, A., J. Ghosh, and R. Mooney. 2000. Impact of similarity measures on web-page clustering. In AAAI-2000: Workshop on Artificial Intelligence for Web Search, Austin, 58-64.