Interpretable and Effortless Techniques for Social Network Analysis

Aparicio, Manuel Francisco

Supervised by:

Juan Luis Castro Peña (Thesis supervisor)

University of defense: Universidad de Granada

Date of defense: 21 December 2022

Examination committee:

José Jesús Castro Sánchez (Chair)
Encarnación Hidalgo Tenorio (Secretary)
Cristophe Marsala (Member)

Type: Doctoral thesis

Teseo: 765659

Abstract

Social Networking Sites (SNS) are the most important means of communication nowadays. They have changed how we interact with friends and family, and even how companies target their clients, conduct market analysis and make business decisions. The amount of data generated every day is virtually unlimited, and it can be used to conduct social media analyses and/or to train Machine Learning (ML) models. However, several obstacles must be addressed first. SNS data is typically unstructured and written in natural language, and it contains misspelled words, contractions, emojis and new semantic units that can be a heavy burden for learning algorithms. A large dataset and multiple preprocessing steps are essential for almost any ML application in SNS. Unfortunately, gathering and building labelled databases carries an inherent cost in human effort, which is a major drawback for low- to mid-budget ventures. Additionally, many applications may have social consequences, so they need to be audited. Both objectives are of interest to a multidisciplinary project called Nutcracker, which aims to detect, track, monitor and analyse radical discourse online. This dissertation is part of that project, and in it we propose effortless and interpretable mechanisms to tackle the aforementioned disadvantages, using social network mechanics as leverage. First, we present a reasoning mechanism based on similarity between users that allows us to deduce properties of unknown users, thereby reducing the effort required to build databases. Then, we present a new kind of feature extraction and selection method whose purpose is to reduce model complexity, thus enhancing model comprehensibility and transparency. Finally, we study the peculiarities of aggregated analysis and, in particular, how well class prevalence can be estimated when working with SNS data.
Our results show that we are able to build large databases on Twitter with a fraction of the effort; that we can train interpretable models as accurate as the baselines but one order of magnitude less complex; and that quantification is a novel approach with much to offer to social network analysis, since it is able to adjust for classification bias. We developed a proof-of-concept tool for effortless labelling and continuous user tracking, and we tested the platform by producing four high-quality weakly labelled datasets. The proposed techniques, methodologies and tools have proven useful for disciplines such as computational linguistics, political science and cybersecurity. They are being used by members of our team and have attracted the attention of the Spanish Civil Guard. Applications include building (and working with) supervised databases (e.g., social network analysis, market analysis, customer service, user profiling...); reaching full transparency in automatic decision-making algorithms (e.g., preemptive account closing, illegal activity tracking, hiring policies...); measuring overall user opinion or sentiment (e.g., during an event like a political debate); studying mental illnesses, detecting epidemic outbreaks, targeting customers, profiling brand ambassadors, or determining the impact of organised communities, among many others.
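
The abstract does not say which quantification method the dissertation uses. As a purely illustrative sketch, the snippet below shows one standard technique from the quantification literature, Adjusted Classify and Count, which corrects a naive prevalence estimate using the classifier's true and false positive rates. The function names and the example rates (0.8 and 0.1) are assumptions made for illustration and are not taken from the thesis.

```python
def classify_and_count(predictions):
    """Naive prevalence estimate: the fraction of items predicted positive."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the naive estimate for known classifier bias.

    tpr and fpr are the classifier's true and false positive rates,
    assumed here to have been estimated on held-out labelled data.
    """
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The correction is undefined for an uninformative classifier.
        return cc
    adjusted = (cc - fpr) / (tpr - fpr)
    # Clip to the valid range, since sampling noise can push it outside [0, 1].
    return min(1.0, max(0.0, adjusted))


# Hypothetical scenario: 1000 users, 300 of them truly positive.
# A classifier with tpr=0.8 and fpr=0.1 is expected to flag
# 0.8 * 300 + 0.1 * 700 = 310 users as positive.
predictions = [1] * 310 + [0] * 690

naive = classify_and_count(predictions)
corrected = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.1)
print(f"naive estimate: {naive:.3f}, adjusted estimate: {corrected:.3f}")
```

In this hypothetical scenario the naive count overestimates the prevalence (0.31 against a true value of 0.30), while the adjusted estimate recovers it. This shows in miniature how an aggregate estimate can compensate for classification bias.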