Package: MantaID 1.0.4

Zhengpeng Zeng

MantaID: A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs

The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning based approach that automates the identification of IDs on a large scale. The 'MantaID' model's prediction accuracy was shown to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from large numbers of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.

Authors: Zhengpeng Zeng [aut, cre, ctb], Longfei Mao [aut, cph], Feng Yu [aut], Jiamin Hu [ctb], Xiting Wang [ctb]

MantaID_1.0.4.tar.gz
MantaID_1.0.4.zip (r-4.5), MantaID_1.0.4.zip (r-4.4)
MantaID_1.0.4.tgz (r-4.5-any), MantaID_1.0.4.tgz (r-4.4-any)
MantaID_1.0.4.tar.gz (r-4.5-noble), MantaID_1.0.4.tar.gz (r-4.4-noble)
MantaID_1.0.4.tgz (r-4.4-emscripten)
MantaID.pdf | MantaID.html
MantaID/json (API)

# Install 'MantaID' in R:

install.packages('MantaID', repos = c('https://molaison.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/molaison/mantaid/issues

Pkgdown site: https://molaison.github.io

Datasets:

Example - ID example dataset.
mi_data_attributes - ID-related datasets in biomart.
mi_data_procID - Processed ID data.
mi_data_rawID - ID dataset for testing.

On CRAN:

3.78 score, 2 scripts, 142 downloads, 24 exports, 157 dependencies

Last updated 6 months ago from: 1b1adb3c7a. Checks: 1 OK, 6 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 17 2025
R-4.5-win	NOTE	Mar 17 2025
R-4.5-mac	NOTE	Mar 17 2025
R-4.5-linux	NOTE	Mar 17 2025
R-4.4-win	NOTE	Mar 17 2025
R-4.4-mac	NOTE	Mar 17 2025
R-4.4-linux	NOTE	Mar 17 2025

Exports: mi mi_balance_data mi_clean_data mi_filter_feat mi_get_confusion mi_get_ID mi_get_ID_attr mi_get_miss mi_get_padlen mi_plot_cor mi_plot_heatmap mi_predict_new mi_run_bmr mi_split_col mi_split_str mi_to_numer mi_train_BP mi_train_rg mi_train_rp mi_train_xgb mi_tune_rg mi_tune_rp mi_tune_xgb mi_unify_mod

Dependencies: AnnotationDbi askpass backports base64enc bbotk Biobase BiocFileCache BiocGenerics biomaRt Biostrings bit bit64 blob cachem caret checkmate class cli clock codetools colorspace config cpp11 crayon curl data.table DBI dbplyr dbscan diagram digest dplyr e1071 evaluate fansi farver fastmap filelock FNN foreach future future.apply generics GenomeInfoDb GenomeInfoDbData ggcorrplot ggplot2 globals glue gower gtable hardhat here hms httr httr2 igraph ipred IRanges isoband iterators jsonlite KEGGREST keras KernSmooth labeling lattice lava lgr lifecycle listenv lubridate magrittr MASS Matrix mclust memoise mgcv mime mlbench mlr3 mlr3hyperband mlr3learners mlr3measures mlr3misc mlr3tuning ModelMetrics munsell nlme nnet numDeriv openssl palmerpenguins paradox parallelly pillar pkgconfig plogr plyr png prettyunits pROC processx prodlim progress progressr proxy PRROC ps purrr R6 ranger rappdirs RColorBrewer Rcpp RcppEigen RcppTOML recipes reshape2 reticulate rlang rpart rprojroot RSQLite rstudioapi S4Vectors scales scutr shape smotefamily sparsevctrs SQUAREM stringi stringr survival sys tensorflow tfautograph tfruns tibble tidyr tidyselect timechange timeDate tzdb UCSC.utils utf8 uuid vctrs viridisLite whisker withr xgboost xml2 XVector yaml zeallot

Help page	Topics
ID example dataset.	Example
A wrapper function that executes MantaID workflow.	mi
Data balance. Most classes are randomly undersampled, while a few classes are oversampled with the SMOTE method to obtain relatively balanced data.	mi_balance_data
Reshape data and delete meaningless rows.	mi_clean_data
ID-related datasets in biomart.	mi_data_attributes
Processed ID data.	mi_data_procID
ID dataset for testing.	mi_data_rawID
Perform feature selection automatically based on correlation and feature importance.	mi_filter_feat
Compute the confusion matrix for the predicted result.	mi_get_confusion
Get ID data from the 'Biomart' database using 'attributes'.	mi_get_ID
Get ID attributes from the 'Biomart' database.	mi_get_ID_attr
Observe the distribution of the misclassified responses in the test set.	mi_get_miss
Get max length of ID data.	mi_get_padlen
Plot correlation heatmap.	mi_plot_cor
Plot heatmap for result confusion matrix.	mi_plot_heatmap
Predict new data with a trained learner.	mi_predict_new
Compare classification models with small samples.	mi_run_bmr
Cut the string of ID column character by character and divide it into multiple columns.	mi_split_col
Split the string into individual characters and complete the character vector to the maximum length.	mi_split_str
Convert data to numeric, and for the ID column convert with fixed levels.	mi_to_numer
Train a three-layer neural network model.	mi_train_BP
Random Forest model training.	mi_train_rg
Classification tree model training.	mi_train_rp
Xgboost model training.	mi_train_xgb
Tune the Random Forest model by hyperband.	mi_tune_rg
Tune the Decision Tree model by hyperband.	mi_tune_rp
Tune the Xgboost model by hyperband.	mi_tune_xgb
Predict with four models and unify results by the sub-model's specificity score to the four possible classes.	mi_unify_mod
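
The help topics above describe a preprocessing chain: find the maximum ID length, split each ID into single characters padded to that length, and convert the characters to numbers with fixed levels. The base-R sketch below illustrates that idea only. It does not call MantaID and is not the package's implementation; the helper `split_and_pad`, the filler token `"*"`, the sample IDs, and the level set are all assumptions chosen for illustration.

```r
# Illustrative sketch in base R; names and values here are hypothetical,
# not taken from the MantaID source.
ids <- c("ENSG00000139618", "P38398", "rs699")

# The longest ID sets the padded width (the idea behind a "pad length").
pad_len <- max(nchar(ids))

# Split one ID into single characters and right-pad with a filler token.
split_and_pad <- function(id, pad_len, filler = "*") {
  chars <- strsplit(id, "", fixed = TRUE)[[1]]
  c(chars, rep(filler, pad_len - length(chars)))
}

# One row per ID, one column per character position.
char_matrix <- t(vapply(ids, split_and_pad, character(pad_len), pad_len = pad_len))
colnames(char_matrix) <- paste0("pos", seq_len(pad_len))

# Encode every column with the same fixed levels so that a given character
# maps to the same integer regardless of its position (assumed level set).
levels_fixed <- c("*", 0:9, LETTERS, letters)
numeric_matrix <- apply(
  char_matrix, 2,
  function(col) as.integer(factor(col, levels = levels_fixed))
)
```

Using one shared level set, rather than letting each column infer its own levels, keeps the encoding stable between training data and new IDs; characters outside the assumed set would become `NA` and would need separate handling.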
