Extracción de Términos Relacionados Semánticamente con Colpónimos: Evaluación en un Corpus Especializado de Pequeño Tamaño

Author: Rojas Garcia, Juan

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 67

Pages: 139-151

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

EcoLexicon is a terminological knowledge base on environmental science whose design permits the geographic contextualization of data. To geographically contextualize named entities such as colponyms (i.e., named bays, such as Pensacola Bay) in EcoLexicon, both count-based and prediction-based distributional semantic models (DSMs) were applied to a small, specialized English corpus in order to extract the terms related to each colponym mentioned in it, along with their semantic relations. Since the evaluation of DSMs on small, specialized corpora has received little attention, this study identified both the DSM parameter combinations and the five similarity/distance measures best suited to extracting terms related to colponyms through the semantic relations takes_place_in, located_at, and attribute_of. The models were evaluated against three gold-standard datasets. The results showed that: (1) count-based models outperformed prediction-based ones; (2) the similarity/distance measures performed quite similarly, except for the Euclidean distance; and (3) the detection of a specific relation depended on the context window size.
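The count-based approach described in the abstract can be illustrated with a minimal sketch: build a co-occurrence matrix over a toy corpus with a fixed context window, then rank context terms for a target word by cosine similarity (with Euclidean distance also shown for contrast). This is not the authors' implementation; the toy sentences, the window size of 2, and the use of raw counts instead of an association weighting are simplifying assumptions.

```python
# Illustrative sketch of a count-based DSM (not the paper's actual pipeline).
from collections import Counter
import math

corpus = [
    "sediment transport takes place in pensacola bay",
    "pensacola bay is located at the gulf coast",
    "salinity is an attribute of pensacola bay water",
]

def cooccurrence(sentences, window=2):
    """Symmetric co-occurrence counts within a +/- `window` token span."""
    counts = {}
    for s in sentences:
        toks = s.split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    counts.setdefault(w, Counter())[toks[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def euclidean(u, v):
    """Euclidean distance between two sparse (dict) vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

m = cooccurrence(corpus, window=2)
# Rank candidate related terms for "bay" by cosine similarity.
target = m["bay"]
ranking = sorted(
    ((w, cosine(target, vec)) for w, vec in m.items() if w != "bay"),
    key=lambda kv: -kv[1],
)
print(ranking[:3])
```

In a realistic setting the raw counts would be reweighted (e.g., with PPMI) and the window size varied, since, as the abstract notes, the context window size affects which semantic relation is detected.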

Bibliographic references

  • Alrabia, M., N. Alhelewh, A. Al-Salman, and E. Atwell. 2014. An empirical study on the Holy Quran based on a large classical Arabic corpus. International Journal of Computational Linguistics, 5(1): 1-13.
  • Asr, F., J. Willits, and M. Jones. 2016. Comparing predictive and co-occurrence-based models of lexical semantics trained on child-directed speech. In A. Papafragou, D. Grodner, D. Mirman, and J. Trueswell (eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society (CogSci), Philadelphia (Pennsylvania), 1092-1097.
  • Baroni, M., G. Dinu, and G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 238-247.
  • Baroni, M., and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4): 673-721.
  • Beltagy, I., K. Lo, and A. Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, 3615-3620.
  • Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1): 289-300.
  • Benoit, K., K. Watanabe, H. Wang, P. Nulty, A. Obeng, S. Müller, and A. Matsuo. 2018. quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30): 774.
  • Bernier-Colborne, G., and P. Drouin. 2016. Evaluation of distributional semantic models: a holistic approach. In Proceedings of the 5th International Workshop on Computational Terminology (Computerm), Osaka (Japan), 52-61.
  • Bertels, A., and D. Speelman. 2014. Clustering for semantic purposes: Exploration of semantic similarity in a technical corpus. Terminology, 20(2): 279-303.
  • Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5: 135-146.
  • Bollegala, D., T. Maehara, Y. Yoshida, and K. Kawarabayashi. 2015. Learning word representations from relational graphs. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Palo Alto, 2146-2152.
  • Chen, Z., Z. He, X. Liu, and J. Bian. 2018. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Medical Informatics and Decision Making, 18(Suppl 2): 65.
  • Chiu, B., G. Crichton, A. Korhonen, and S. Pyysalo. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical NLP, Berlin, 166-174.
  • Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug): 2493-2537.
  • Devlin, J., M.W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v2.
  • El Bazi, I., and N. Laachfoubi. 2016. Arabic named entity recognition using word representations. International Journal of Computer Science and Information Security, 14(8): 956-965.
  • Erk, K., S. Padó, and U. Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4): 723-763.
  • Evert, S. 2008. Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook. Berlin: Walter de Gruyter, 1212-1248.
  • Evert, S., P. Uhrig, S. Bartsch, and T. Proisl. 2017. E-VIEW-alation – A large-scale evaluation study of association measures for collocation identification. In Proceedings of the eLex 2017 Conference, Leiden, 531-549.
  • Faber, P. (ed.). 2012. A Cognitive Linguistics View of Terminology and Specialized Language. Berlin/Boston: De Gruyter Mouton.
  • Faber, P., P. León-Araúz, and J.A. Prieto. 2009. Semantic relations, dynamicity, and terminological knowledge bases. Current Issues in Language Studies, 1: 1-23.
  • Ferret, O. 2015. Réordonnancer des thésaurus distributionnels en combinant différents critères. TAL, 56(2): 21-49.
  • Gries, S., and A. Stefanowitsch. 2010. Cluster analysis and the identification of collexeme classes. In S. Rice and J. Newman (eds.), Empirical and Experimental Methods in Cognitive/Functional Research. Stanford (California): CSLI, 73-90.
  • Gwinn, N., and C. Rinaldo. 2009. The Biodiversity Heritage Library: Sharing biodiversity with the world. IFLA Journal, 35(1): 25-34.
  • Huang, A. 2008. Similarity measures for text document clustering. In Proceedings of the New Zealand Computer Science Research Student Conference 2008, Christchurch, 49-56.
  • Ide, N., and J. Pustejovsky (eds.). 2017. Handbook of Linguistic Annotation. Dordrecht: Springer.
  • Kiela, D., and S. Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), Gothenburg, 21-30.
  • Kilgarriff, A., P. Rychlý, P. Smrz, and D. Tugwell. 2004. The Sketch Engine. In Proceedings of the 11th EURALEX International Congress, Lorient, 105-115.
  • Krieger, M.G., and M.J.B. Finatto. 2004. Introdução à Terminologia: teoria & prática. São Paulo: Contexto.
  • Lapesa, G., S. Evert, and S. Schulte im Walde. 2014. Contrasting syntagmatic and paradigmatic relations: Insights from distributional semantic models. In Proceedings of the 3rd Joint Conference on Lexical and Computational Semantics, Dublin, 160-170.
  • León-Araúz, P., A. San Martín, and A. Reimerink. 2018. The EcoLexicon English corpus as an open corpus in Sketch Engine. In Proceedings of the 18th EURALEX International Congress, Ljubljana, 893-901.
  • Levy, O., Y. Goldberg, and I. Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3: 211-225.
  • Manning, C.D., P. Raghavan, and H. Schütze. 2009. Introduction to Information Retrieval. Cambridge (England): Cambridge University Press.
  • Manning, C.D., M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, 55-60.
  • Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, 52-55.
  • Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Workshop Proceedings of International Conference on Learning Representations. Scottsdale.
  • Miller, G.A., and W.G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1): 1-28.
  • Mitchell, J., and M. Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08, Columbus (Ohio), 236-244.
  • Nakov, P. 2013. On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19: 291-330.
  • Nematzadeh, A., S.C. Meylan, and T.L. Griffiths. 2017. Evaluating vector-space models of word representation, or, the unreasonable effectiveness of counting words near other words. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, London, 859-864.
  • Nguyen, N.T.H., A.J. Soto, G. Kontonatsios, R. Batista-Navarro, and S. Ananiadou. 2017. Constructing a biodiversity terminological inventory. PLoS ONE, 12(4): e0175277.
  • Nooralahzadeh, F., L. Øvrelid, and J.T. Lønning. 2018. Evaluation of domain-specific word embeddings using knowledge resources. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, 1438-1445.
  • Pearson, J. 1998. Terms in context. Amsterdam: John Benjamins.
  • Pennington, J., R. Socher, and C.D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods for Natural Language Processing (EMNLP), Doha (Qatar), 1532-1543.
  • Pilehvar, M.T., and N. Collier. 2016. Improved semantic representation for domain-specific entities. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Berlin, 12-16.
  • Prokopidis, P., V. Papavassiliou, A. Toral, M.P. Riera, F. Frontini, F. Rubino, and G. Thurmair. 2012. Final report on the corpus acquisition & annotation subsystem and its components. Technical Report WP-4.5, PANACEA Project.
  • Rohde, D., L. Gonnerman, and D. Plaut. 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8: 627-633.
  • Rojas-Garcia, J., and P. Faber. 2019a. Extraction of terms for the construction of semantic frames for named bays. Argentinian Journal of Applied Linguistics, 7(1): 27-57.
  • Rojas-Garcia, J., and P. Faber. 2019b. Extraction of terms related to named rivers. Languages, 4(3): 46.
  • Rojas-Garcia, J., and P. Faber. 2019c. Evaluation of distributional semantic models for the extraction of semantic relations for named rivers from a small specialized corpus. Procesamiento del Lenguaje Natural, 63: 51-58.
  • Room, A. 1996. An Alphabetical Guide to the Language of Name Studies. Lanham/London: The Scarecrow Press.
  • Sahlgren, M., and A. Lenci. 2016. The effects of data size and frequency range on distributional semantic models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin (Texas), 975-980.
  • Sager, J.C., D. Dungworth, and P.F. McDonald. 1980. English Special Languages. Principles and Practice in Science and Technology. Wiesbaden: Brandstetter Verlag.
  • Strehl, A., J. Ghosh, and R. Mooney. 2000. Impact of similarity measures on web-page clustering. In AAAI-2000: Workshop on Artificial Intelligence for Web Search, Austin, 58-64.