Discovering topics in Twitter about the COVID-19 outbreak in Spain

  1. Marvin M. Agüero Torales
  2. David Vilares Calvo
  3. Antonio G. López Herrera
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 66

Pages: 177-190

Type: Article

Abstract

In this work, we analyze what users discussed on Twitter at the start of the COVID-19 pandemic. Specifically, we examine three distinct phases of the COVID-19 crisis in Spain: the pre-crisis period, the outbreak of the disease, and the lockdown. To do so, we first collect a large number of tweets and preprocess them. We then group the tweets into topics using a Latent Dirichlet Allocation model, and define generative and discriminative strategies to extract the most representative keywords and sentences for each topic. Finally, we present an extensive qualitative analysis of these topics and of how they correspond to the different issues that arose in Spain at different points in the crisis.
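The topic-modelling step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain collapsed Gibbs sampler for Latent Dirichlet Allocation, and the two ranking functions are only one plausible reading of "generative" (rank words by p(word | topic)) and "discriminative" (rank words by how strongly they favour one topic over the others). The toy tweets, function names, and hyperparameters are all hypothetical.

```python
import random

# Hypothetical, already-preprocessed toy tweets (invented for illustration).
TOY_TWEETS = [
    ["wuhan", "virus", "china", "alert"],
    ["china", "quarantine", "virus", "wuhan"],
    ["lockdown", "home", "applause", "balcony"],
    ["home", "lockdown", "erte", "work"],
    ["virus", "china", "flights", "alert"],
    ["balcony", "applause", "home", "healthcare"],
]

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns (phi, vocab), where
    phi[k][v] estimates p(word v | topic k)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]      # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)] # tokens per (topic, word)
    topic_total = [0] * num_topics                    # tokens per topic
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(num_topics)
    ]
    return phi, vocab

def generative_keywords(phi, vocab, topic, n=3):
    """Assumed 'generative' ranking: highest p(word | topic) first."""
    order = sorted(range(len(vocab)), key=lambda v: phi[topic][v], reverse=True)
    return [vocab[v] for v in order[:n]]

def discriminative_keywords(phi, vocab, topic, n=3):
    """Assumed 'discriminative' ranking: p(word | topic) normalised by the
    word's probability summed over all topics."""
    def score(v):
        return phi[topic][v] / sum(row[v] for row in phi)
    order = sorted(range(len(vocab)), key=score, reverse=True)
    return [vocab[v] for v in order[:n]]

if __name__ == "__main__":
    phi, vocab = fit_lda(TOY_TWEETS, num_topics=2)
    for k in range(2):
        print(k, generative_keywords(phi, vocab, k),
              discriminative_keywords(phi, vocab, k))
```

The generative ranking tends to surface frequent words even when several topics share them, while the normalised score favours words concentrated in a single topic. On a real corpus, an established library implementation of LDA would be a more practical choice than this hand-written sampler.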

Funding information

MMAT has been partially funded by Barcelona Supercomputing Center (BSC) through the Spanish Plan for advancement of Language Technologies ‘Plan TL’ and the Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA). DV is supported by MINECO (TIN2017-85160-C2-1-R), by Xunta de Galicia (ED431C 2020/11), by Centro de Investigación de Galicia ‘CITIC’ (European Regional Development Fund-Galicia 2014-2020 Program, ED431G 2019/01), and by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation.
