Aplicación de técnicas de minería de texto al estudio de la violencia contra la mujer

  1. Mora Andrade, Stephanie Elizabeth
Supervised by:
  1. María del Carmen Pegalajar Jiménez Co-director
  2. María Amparo Vila Miranda Co-director

Defence university: Universidad de Granada

Defence date: 26 April 2024

Type: Thesis

Abstract

Violence Against Women (VAW) is a social problem present in many countries. It has become a far-reaching phenomenon that requires attention and extensive study in order to raise awareness of its impact and consequences. Furthermore, the growing number of cases makes evident the urgent need to recognize its importance and to uphold the principles of equality, security, freedom, integrity and dignity of all human beings. These concerns motivated the present doctoral research, which studies and analyzes the forms and patterns that surround VAW. By applying text mining and machine learning techniques to a wide variety of news items collected from different digital newspapers, we obtained valuable and relevant information that gave us a deep insight into this latent social phenomenon at a global level. This research proposes the use of Text Mining techniques such as Text Classification, Topic Modelling and Association Rules to study VAW, taking as its source articles about violence extracted from digital newspapers. First, we used Web Scraping techniques to obtain the collection of documents to study. Once the collection was obtained, the proposal is as follows: classify the texts into the different types of violence suffered by women; apply topic modelling techniques to generate and identify latent topics within the collection; and, finally, apply association rule mining to study the different attributes and patterns involved in violence against women. The proposal was developed through the following steps. To begin, we collected news published by digital newspapers. This required studying the different web page structures in order to identify in which node of the HTML structure the required text is located.
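
As an illustration of locating the text inside a specific HTML node, the following is a minimal sketch using only the Python standard library. It is not the tooling used in the thesis, which the abstract does not specify. The tag and class name (`div`, `article-body`), the helper names and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class NodeTextExtractor(HTMLParser):
    """Collect the text found inside elements with a given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0    # greater than 0 while we are inside a target node
        self.chunks = []  # text fragments found inside the target node

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Track nested elements with the same tag so we know when the node ends.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html, tag="div", css_class="article-body"):
    """Return the text of the target node; the default node name is an assumption."""
    parser = NodeTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: only the "article-body" node holds the news text.
sample_page = """
<html><body>
  <div class="menu">Home | News</div>
  <div class="article-body"><p>First paragraph.</p><div><p>Second paragraph.</p></div></div>
  <div class="footer">Contact</div>
</body></html>
"""
article_text = extract_article_text(sample_page)
```

In practice each newspaper would need its own tag and class, which is why the page structures have to be studied beforehand.
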
Identifying these nodes allowed us to define which information nodes to request and thus to obtain the specific text of each news item. Using web-scraping techniques, it was possible to collect 7000 news items (text documents) in unstructured format. Subsequently, the document collection was processed. This process was complex because a text could contain hundreds of words, with each word representing an attribute; the documents under study were therefore of high dimensionality. This type of unstructured data is more complex to study: many of the attributes present in the text add no value to the research and could even impair the machine learning algorithms. To reduce the impact of this problem, we applied text preprocessing that selected the most relevant features within each of the collected texts. To determine the types of violence used to classify the documents, we reviewed research on VAW. We identified three types of violence: Physical, Sexual and Psychological, which can be related and present in a single event or document, so we chose a multi-class classification. For the detection of latent topics, we used topic modelling techniques; in this study, we applied Latent Dirichlet Allocation (LDA). As a result, we obtained a list of topics and their 15 most representative terms, as well as certain characteristics of VAW. Then, the most relevant news items within each topic were determined and, by extracting the most frequent words, we constructed identification tags for the generated topics. Finally, in the association rule mining process we studied the different characteristics that may be involved in an act of violence. These include the type of victim, the type of aggressor, the motives, the weapon used, the type of violence, whether the victim has wounds on the body, and whether the victim died.
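
To make the idea of association rules over such attributes concrete, the sketch below computes support and confidence for simple rules of the form A -> B. It is a brute-force illustration and not the algorithm used in the thesis, which the abstract does not name. The function name, the thresholds and the four toy records are invented for illustration only and are not data from the study.

```python
from itertools import combinations


def mine_rules(records, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for rules A -> B.

    Only single-attribute antecedents and consequents are considered; this is
    a brute-force count over attribute pairs, not an optimized miner.
    """
    n = len(records)
    counts = {}
    for record in records:
        items = sorted(record)
        for item in items:
            counts[(item,)] = counts.get((item,), 0) + 1
        for pair in combinations(items, 2):
            counts[pair] = counts.get(pair, 0) + 1

    rules = []
    for key, pair_count in counts.items():
        if len(key) != 2:
            continue
        support = pair_count / n  # fraction of records containing both attributes
        if support < min_support:
            continue
        for a, b in (key, key[::-1]):
            confidence = pair_count / counts[(a,)]  # estimate of P(b | a)
            if confidence >= min_confidence:
                rules.append((a, b, support, confidence))
    return rules


# Invented toy records (attribute=value strings), one set per document.
toy_records = [
    {"aggressor=partner", "violence=physical", "weapon=knife"},
    {"aggressor=partner", "violence=physical"},
    {"aggressor=stranger", "violence=sexual"},
    {"aggressor=partner", "violence=psychological"},
]
rules = mine_rules(toy_records)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent} (support={support:.2f}, confidence={confidence:.2f})")
```

The number of candidate attribute combinations grows quickly with the number of distinct attributes, so the attribute set generally needs to be kept small before rules are mined.
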
Based on the procedure described above, we applied association rules to a collection of 7000 documents. This required a dimensionality reduction, because each document can contain a large number of words. The reason for this reduction was the computing resources consumed by applying association rule models to high-dimensional documents. In addition, using unimportant attributes to generate association rules could produce dubious attribute dependencies. The results obtained in this research were favorable, demonstrating that text mining techniques are very useful tools in the study of violence against women, as they allowed us to study real acts of violence and obtain information that was previously unknown. Finally, it was possible to demonstrate the seriousness and great scope of VAW, as well as the need to apply measures that help to eradicate this universal phenomenon, which threatens thousands of women and girls worldwide.