Nainia A., Vignes-Lebbe R., Chenin Eric, Sahraoui M., Mousannif H., Zahir J. (2024). A transformer-based NLP pipeline for enhanced extraction of botanical information using CamemBERT on French literature. NLP and Information Retrieval, 14 (6), 59-78. International Conference on NLP and Information Retrieval, 5., Sydney (AUS), 2024/03/23-24. ISSN 2079-9292.
Document title
A transformer-based NLP pipeline for enhanced extraction of botanical information using CamemBERT on French literature
Year of publication
2024
Authors
Nainia A., Vignes-Lebbe R., Chenin Eric, Sahraoui M., Mousannif H., Zahir J.
Source
NLP and Information Retrieval, 2024, 14 (6), 59-78. ISSN 2079-9292
Conference
International Conference on NLP and Information Retrieval, 5., Sydney (AUS), 2024/03/23-24
This research investigates the untapped wealth of centuries-old French botanical literature, particularly floras, which are comprehensive guides describing the plant species of a specific region. Despite its significance, this literature remains largely unexplored by AI methods. Our objective is to bridge this gap by constructing a specialized French botanical dataset sourced from the flora of New Caledonia. We propose a transformer-based Named Entity Recognition (NER) pipeline, leveraging distant supervision and CamemBERT, for the automated extraction and structuring of botanical information. The results show strong performance: for species name extraction, the NER model achieves a precision of 0.94, a recall of 0.98, and an F1-score of 0.96, while for fine-grained extraction of botanical morphological terms, the CamemBERT-based NER model attains a precision of 0.93, a recall of 0.96, and an F1-score of 0.94. This work contributes to the exploration of valuable botanical literature by underscoring the capability of AI models to automate information extraction from complex and diverse texts.
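As a quick consistency check on the figures in the abstract, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is only illustrative: the function name `f1_score` and the `reported` dictionary are assumptions introduced here, not part of the authors' pipeline, and the inputs are the rounded values quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-score)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
# The dictionary keys are illustrative labels, not names used by the authors.
reported = {
    "species names": (0.94, 0.98, 0.96),
    "morphological terms": (0.93, 0.96, 0.94),
}

for task, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{task}: computed F1 = {computed:.2f}, reported F1 = {f1:.2f}")
```

For both tasks the recomputed value, rounded to two decimals, agrees with the F1-score stated in the abstract.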
Classification scheme
Data analysis [020STAT02]; Botany [076BOTA]; Miscellaneous applications [122APPLIC]
Location
IRD collection [F B010094502]
IRD identifier
fdi:010094502