Computational Methodology for Identifying Cancer Driver Genes and Mutations
Eugene Chao
Introduction. Cancer is a serious disease, responsible for one out of eight deaths worldwide [5], and it can be caused by somatic mutations in the genome [6]. A pan-cancer analysis of more than 9,000 tumors spanning 33 cancer types has catalogued driver genes, which are required to drive the neoplastic process, and driver mutations [1]. Several large-scale cancer genomics studies, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have been instrumental in discovering novel mutational drivers of cancer [7]. Computational algorithms help identify these driver genes [8].
Methods. Five computational approaches were investigated. CHASMplus trained random forest models to predict driver missense mutations [2]. DriverML combined a weighted score test with a supervised machine learning approach [3]. HotCommics integrated protein dynamics with 3D protein structures to identify mutational hotspots [4]. ccpwModel used binary variables as features in hierarchical clustering, while xGeneModel combined functional similarities between putative cancer driver genes with their confidence scores and mutation events to calculate the genetic distance between tumors [5]. Finally, the PanCancer and PanSoftware analysis used 26 diverse bioinformatics tools to identify cancer driver genes [1].
Results. When applied to 8,657 tumors across 32 cancer types in the TCGA data set, CHASMplus identified over 4,000 unique driver missense mutations in 240 genes [2]. DriverML, applied to 31 independent data sets from TCGA, identified several novel driver genes, such as HDAC5, HSPA5, EPHA2, and DNMT1 [3]. HotCommics, applied to the TCGA data set, predicted one or more mutation hotspots within the resolved structures of proteins encoded by 434 different genes [4]. ccpwModel and xGeneModel were applied to the TCGA data set to discover existing cancer subtypes and to further group them based on race. The PanCancer analysis, applied to the TCGA data set, identified 299 driver genes with respect to anatomical sites and cancer/cell types, as well as more than 3,400 putative missense driver mutations.
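As an illustration of the clustering idea summarized in Methods for ccpwModel (binary driver-gene variables as features in hierarchical clustering), the following is a minimal sketch, not the published implementation. The Jaccard distance, the average-linkage rule, the function names, and the toy tumor data are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two binary mutation vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two tumors with no mutations at all are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(c1, c2, features):
    """Mean pairwise distance between the members of two clusters."""
    dists = [jaccard_distance(features[i], features[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def cluster_tumors(features, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in sorted(features)]
    while len(clusters) > n_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], features),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical toy data. Columns stand for four driver genes
# (TP53, KRAS, EGFR, BRAF); 1 means the tumor carries a mutation in that gene.
TUMORS = {
    "tumor_A": [1, 1, 0, 0],
    "tumor_B": [1, 1, 0, 0],
    "tumor_C": [0, 0, 1, 1],
    "tumor_D": [0, 0, 1, 0],
}

groups = cluster_tumors(TUMORS, n_clusters=2)
print(groups)
```

In this toy example, tumors that share mutated driver genes end up in the same group. A real analysis would use many more genes and tumors and would likely rely on an established clustering library.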
Conclusions. Each study discovered novel cancer driver genes or mutations in its respective data set. Computational techniques such as random forests, weighted score tests, 3D protein dynamics, clustering, and combinations of these tools can support personalized medicine and point to a promising future for computational research and discovery.
1. Bailey M, Tokheim C, Porta-Pardo E, Mills G, Karchin R, Ding L. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173(2):371-385.e18. https://www.ncbi.nlm.nih.gov/pubmed/29625053. doi: 10.1016/j.cell.2018.02.060.
2. Tokheim C, Karchin R. CHASMplus reveals the scope of somatic missense mutations driving human cancers. Cell Systems. 2019;9(1):9-23.e8. https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30154-1. doi: 10.1016/j.cels.2019.05.005.
3. Han Y, Yang J, Qian X, Cheng WC, Liu SH, Hua X, Zhou L, Yang Y, Wu Q, Liu P, Lu Y. DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic Acids Research. 2019;47(8):e45. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6486576/. doi: 10.1093/nar/gkz096.
4. Kumar S, Clarke D, Gerstein MB. Leveraging protein dynamics to identify cancer mutational hotspots using 3D structures. Proc Natl Acad Sci USA. 2019;116(38):18962-18970. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6754584/. doi: 10.1073/pnas.1901156116.
5. Zhang W, Flemington E, Zhang K. Driver gene mutations based clustering of tumors: methods and applications. Bioinformatics. 2018;34(13):i404-i411. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022677/. doi: 10.1093/bioinformatics/bty232.
6. Dimitrakopoulos C, Beerenwinkel N. Computational approaches for the identification of cancer genes and pathways. WIREs Systems Biology and Medicine. 2017;9:e1364. https://www.ncbi.nlm.nih.gov/pubmed/27863091. doi: 10.1002/wsbm.1364.
7. Hudson A, Wirth C, Stephenson N, Fawdar S, Brognard J, Miller C. Using large-scale genomics data to identify driver mutations in lung cancer: methods and challenges. Pharmacogenomics. 2015;16(10):1149-60. https://www.ncbi.nlm.nih.gov/pubmed/26230733. doi: 10.2217/pgs.15.60.
8. Zhang Jm, Liu J, Sun J, Chen C, Foltz G, Lin B. Identifying driver mutations from sequencing data of heterogeneous tumors in the era of personalized genome sequencing. Briefings in Bioinformatics. 2013;15(2):244-55. https://www.ncbi.nlm.nih.gov/pubmed/23818492. doi: 10.1093/bib/bbt042.