Semantic Relations Predict the Bracketing of Three-Component Multiword Terms

  1. Rojas Garcia, Juan
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2022

Issue: 69

Pages: 141-152

Type: Article

Abstract

For English multiword terms (MWTs) of three or more constituents (e.g., sea level rise), a semantic analysis based on linguistic and domain knowledge is necessary to resolve the dependencies between components. This structural disambiguation, often known as bracketing, groups the dependent components so that the MWT is reduced to its basic form of modifier+head, as in [sea level] [rise]. Knowledge of these dependencies facilitates the comprehension of an MWT and its accurate translation into other languages. Moreover, resolving MWT bracketing improves the overall accuracy of machine translation systems and sentence parsers. This paper thus presents a pilot study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence. It is shown that, with a random forest model, three features predicted the bracketing of the 190 ternary compounds used as arguments in a sample of 188 semantically annotated sentences from a Coastal Engineering corpus (100% F1-score): the semantic relation of the MWT to another argument in the same sentence, the lexical domain of the predicate, and the semantic role of the MWT. Furthermore, the semantic relation of an MWT to another argument in the same sentence was, on its own, a strong predictor of ternary compound bracketing with a binary decision-tree model (94.12% F1-score).
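
As an illustration of what bracketing means for a ternary compound, the following minimal Python sketch reduces a three-component term to its modifier+head form. The function names and the labels "left" and "right" are assumptions made for this example and are not taken from the article; the sketch shows only the two possible groupings, not the prediction models used in the study.

```python
from typing import Sequence, Tuple


def bracket(tokens: Sequence[str], bracketing: str) -> Tuple[str, str]:
    """Reduce a three-component term to a (modifier, head) pair.

    "left"  groups the first two components: [[a b] c]
    "right" groups the last two components:  [a [b c]]
    (Both labels are hypothetical names chosen for this sketch.)
    """
    if len(tokens) != 3:
        raise ValueError("expected exactly three components")
    a, b, c = tokens
    if bracketing == "left":
        return (f"{a} {b}", c)
    if bracketing == "right":
        return (a, f"{b} {c}")
    raise ValueError(f"unknown bracketing label: {bracketing!r}")


def render(tokens: Sequence[str], bracketing: str) -> str:
    """Format the reduced term in the bracket notation used in the abstract."""
    modifier, head = bracket(tokens, bracketing)
    return f"[{modifier}] [{head}]"


if __name__ == "__main__":
    print(render(["sea", "level", "rise"], "left"))
```

With the example from the abstract, the "left" label yields [sea level] [rise]; the "right" label would instead group the last two components.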
