Happi Happi Bill Gates. (2024). Linking complementary datasets through the augmentation of knowledge graphs and multimodal representations.
Montpellier (FRA); Marseille: Université de Montpellier; IRD, 211 p., mimeographed. Thesis, Computer Science. Information, Structures, Systèmes, Université de Montpellier. 2024/12/09.
At the end of the 20th century, the rise of the Internet enabled the creation of the web, a network of interconnected machines that exchange data in the form of documents. These documents allow humans to communicate and preserve information across generations. In the early 2000s, the concept of the Semantic Web emerged to enable machines to better understand and process these data. Models such as RDF (Resource Description Framework) were developed to represent information in the form of triples: subject, predicate, and object. With the explosion of data published on the web, several challenges have arisen, particularly in managing descriptions of the same entity that come from different sources. The World Wide Web Consortium (W3C) formalized knowledge graphs, networks of annotated nodes and links, to structure and interlink these data. This thesis aims to improve the linking of RDF graphs in order to identify similar entities or instances that refer to the same real-world object, relying on the owl:sameAs predicate. Entity or instance alignment, a rapidly growing field in the scientific community, seeks to address challenges related to data diversity, including linguistic and semantic variations. The goal is to integrate differently structured data and make them interoperable. Although several tools exist, they remain limited by challenges at several levels, such as efficiently reducing the number of entity pairs to compare and analyzing literal values. Linguistic and contextual differences add a further layer of complexity and require techniques capable of handling these variations. The field still offers room for more sophisticated solutions that incorporate machine learning and semantic analysis techniques. In this thesis, we propose several contributions, ranging from specialized methods for simple datasets to the design of a general entity alignment model.
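To make the triple model and the role of owl:sameAs concrete, here is a minimal sketch in Python. It assumes two tiny, made-up sources under the hypothetical `http://example.org/` namespace and a deliberately naive matching rule (case-insensitive label equality). It is only an illustration of what an alignment produces, not one of the methods proposed in the thesis.

```python
# Standard IRIs for the two predicates used in this sketch.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical example data: each triple is (subject, predicate, object).
source_a = [
    ("http://example.org/a/Montpellier", RDFS_LABEL, "Montpellier"),
]
source_b = [
    ("http://example.org/b/city42", RDFS_LABEL, "montpellier"),
    ("http://example.org/b/city43", RDFS_LABEL, "Marseille"),
]


def link_by_label(graph_a, graph_b):
    """Emit owl:sameAs triples for subjects whose labels match, ignoring case.

    Naive rule chosen for illustration only (an assumption of this sketch).
    """
    # Index the subjects of graph_b by their normalized label.
    subjects_by_label = {}
    for subject, predicate, obj in graph_b:
        if predicate == RDFS_LABEL:
            subjects_by_label.setdefault(obj.strip().lower(), []).append(subject)

    # Look up each label of graph_a in that index and emit one link per hit.
    links = []
    for subject, predicate, obj in graph_a:
        if predicate == RDFS_LABEL:
            for target in subjects_by_label.get(obj.strip().lower(), []):
                links.append((subject, OWL_SAME_AS, target))
    return links


links = link_by_label(source_a, source_b)
for triple in links:
    print(triple)
```

The output of an alignment is itself a set of triples, so the links can be added back to either graph. Exact label equality breaks down as soon as spellings, languages, or synonyms differ, which is the difficulty the abstract describes.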
Aware of the limitations of knowledge graphs, we propose an augmentation approach that starts with named entity recognition in literals. We developed GRU-SCANET, a new architecture that improves accuracy and reduces word vector pre-training time. GRU-SCANET outperforms the state of the art on eight biological datasets.
We then evaluated the approach using spaCy, chosen for its broad entity detection capabilities. Additionally, we designed DLinker, which reduces the space of candidate entity pairs to compare. During the OAEI 2022 and PFIA/AFIA 2023 competitions, DLinker demonstrated its efficiency by reducing processing time to 1.6 seconds. For more detailed graphs, we developed GLinker, which applies graph embedding techniques to improve performance. Combined with a new similarity measure called HPP, GLinker outperformed the Jaro-Winkler method. Despite these improvements, limitations remain in handling multilingual data and synonyms. To address these issues, we conducted a comparative study of how different embedding techniques affect classifiers, in order to propose a more general alignment model, LLM4EA. This model leverages language models such as GPT-2 and BERT to improve entity alignment and to overcome linguistic and contextual challenges. In conclusion, this thesis presents several solutions for entity alignment and paves the way for future research.
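Two ideas mentioned above, reducing the candidate pair space and scoring literals with Jaro-Winkler, can be sketched together in Python. The blocking key (first three lowercased characters), the example labels, and the function names are assumptions made for this illustration. The sketch does not reproduce DLinker, GLinker, or the HPP measure, whose details are not given here.

```python
from collections import defaultdict
from itertools import product


def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro-Winkler: Jaro boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def candidate_pairs(labels_a, labels_b):
    """Yield only pairs that share a blocking key (hypothetical key choice)."""
    blocks = defaultdict(lambda: ([], []))
    for label in labels_a:
        blocks[label.strip().lower()[:3]][0].append(label)
    for label in labels_b:
        blocks[label.strip().lower()[:3]][1].append(label)
    for left, right in blocks.values():
        yield from product(left, right)


# Made-up labels: 2 x 3 = 6 possible pairs without blocking.
labels_a = ["Montpellier", "Marseille"]
labels_b = ["Montpelier", "Marseilles", "Lyon"]

pairs = list(candidate_pairs(labels_a, labels_b))
scored = [(a, b, jaro_winkler(a.lower(), b.lower())) for a, b in pairs]
for a, b, score in scored:
    print(f"{a} ~ {b}: {score:.3f}")
```

In this toy case blocking leaves two pairs to score instead of six. The trade-off is that a key this crude would miss true matches whose labels start differently, for example across languages, which is one reason purely string-based comparison is not sufficient.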