Publications by IRD scientists

Orozco-Arias S., Candamil-Cortes M. S., Jaimes P. A., Valencia-Castrillon E., Tabares-Soto R., Isaza G., Guyot Romain. (2022). Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning. Journal of Integrative Bioinformatics, 19 (3), 20210036 [15 p.].

Document title
Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning
Year of publication
2022
Document type
Article indexed in the Web of Science WOS:000824970100001
Authors
Orozco-Arias S., Candamil-Cortes M. S., Jaimes P. A., Valencia-Castrillon E., Tabares-Soto R., Isaza G., Guyot Romain
Source
Journal of Integrative Bioinformatics, 2022, 19 (3), 20210036 [15 p.]
Transposable elements (TEs) are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli and giving the organism the ability to adapt to its environment. Annotating transposable elements in genomic data is currently considered a crucial task for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created and then curated to eliminate TE fragments and false positives, after which the elements are annotated in the genome using the homology method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics software tools. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
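The size of the acceleration reported in the abstract can be checked with simple arithmetic. The short Python sketch below uses only the two timings quoted above (22.61 s and approximately 6 h). The function and constant names are illustrative assumptions and do not come from the article, and because the 6 h figure is itself approximate, the result is only an order-of-magnitude estimate.

```python
# Timings quoted in the abstract for Oryza granulata.
ML_SECONDS = 22.61    # ML-based curation
BIOINFO_HOURS = 6.0   # "approximately 6 h" for the bioinformatics methods


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new method is than the baseline.

    Hypothetical helper for illustration; not part of the published tool.
    """
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


baseline_seconds = BIOINFO_HOURS * 3600  # convert hours to seconds
ratio = speedup(baseline_seconds, ML_SECONDS)
print(f"Approximate speed-up: {ratio:.0f}x")
```

Under these assumptions the ML-based curation is faster by a factor on the order of several hundred to a thousand, which is consistent with the abstract's claim of a substantial acceleration.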
Classification scheme
Basic sciences / Analysis and research techniques [020]; Computer science [122]
Location
IRD collection [F B010085408]
IRD identifier
fdi:010085408