Machine learning and deep learning in biotechnology: applications, impacts, and challenges
Issue | Vol. 2 No. 2 (2019): Ciencia, Ambiente y Clima |
Published | Dec 14, 2019 |
Abstract
Bioinformatics has changed the way experiments and research in the biological sciences are designed and carried out. Biotechnology is no exception: bioinformatics has directly affected areas such as drug discovery and development, crop improvement, bioremediation, studies of environmental diversity, and molecular pathology, among others. This is due in large part to the development of high-throughput sequencing technologies, or next-generation sequencing (NGS), which have generated large volumes of data that must be processed and analyzed to produce new knowledge and discoveries. As a result, two areas of bioinformatics and computer science, machine learning and deep learning, have been used to analyze these data. Machine learning applies techniques that allow computers to learn, whereas deep learning builds artificial neural network models that attempt to imitate the functioning of the human brain, allowing them to learn from data and improve with experience. Both areas are essential for identifying, analyzing, interpreting, and extracting knowledge from the large amount of biological data (big biological data). In this work we review these two areas, machine learning and deep learning, with a focus on their impact and applications in biotechnology.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Copyright
© Ciencia, Ambiente y Clima, 2019
Affiliations
Edian F. Franco
Federal University of Pará, Belém, Pará, Brazil.
Rommel J. Ramos
Federal University of Pará, Belém, Pará, Brazil.