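Graduate Institute of Biomedical Electronics and Bioinformatics Faculty Research Highlight, November 2022: Professor Eric Y. Chuang (莊曜宇)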

Research topic: Applying deep learning to DNA sequencing and pathological slides

Written by: Dr. 潘日南 and student 陳翰儒

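Studying how to apply artificial intelligence to high-throughput data, such as next-generation sequencing and medical imaging, can deepen our understanding of precision medicine. Professor Eric Y. Chuang's team has used DNA sequencing data and pathological slide images to develop three high-performance deep learning tools that can be applied to classifying and identifying specific targets.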

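The first study, published in Briefings in Bioinformatics, is “High-performance deep learning pipeline predicts individuals in mixtures of DNA using sequencing data”. It uses sequencing data to develop a novel deep-learning-based prediction method that detects and classifies the different individuals in a mixture. To show that the technique also works on other kinds of data, the model was developed with data from different sequencing platforms: (1) targeted sequencing and (2) whole exome sequencing (WES). For the first dataset, we prepared DNA samples mixing different individuals using 27 targeted short tandem repeats and 94 single nucleotide polymorphisms, and the deep learning model distinguished each individual with 95-97% accuracy. The second dataset used WES data from breast cancer patients, and the model predicted every patient's disease subtype correctly (100%). In addition, to overcome the differences in length between sequences, we used a new sliding window method that substantially improved model performance. In summary, this study proposes an approach to sequence data processing and deep learning that is applicable across different next-generation sequencing platforms.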

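The second study, published in Frontiers in Oncology, is “Predicting Breast Cancer Gene Expression Signature by Applying Deep Convolutional Neural Networks from Unannotated Pathological Images”. Given the ready availability of pathological slide images and a recurrence risk computed from 70 genes of breast cancer patients, the study proposes a deep learning model that predicts breast cancer recurrence from pathological slide images. This offers a fast, low-cost, and robust prediction tool that helps physicians evaluate treatment plans. The study used six pre-trained models for transfer learning. On the validation data, the patch-wise approach reached an accuracy of 0.87, and the patient-wise approach reached accuracies of 0.90 and 1.00 for the high-risk and low-risk classes, respectively. In summary, this study shows that a high-performance artificial intelligence model for predicting cancer recurrence can be built from pathological slide images even when no specific regions are annotated.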

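The third study, published in Frontiers in Oncology, is “Prediction of Breast Cancer Recurrence Using a Deep Convolutional Neural Network Without Region-of-Interest Labeling”. It uses deep learning to predict breast cancer subtypes and offers a convenient diagnostic strategy, further reducing the cost of mRNA expression analysis and immunohistochemical staining. We aimed to use the model weights trained in the previous study for two-step transfer learning applied to pathological slide images. We used weights from four pre-trained models together with the TCGA-BRCA dataset to build models that predict four breast cancer subtypes. In addition, a ResNet101 with ImageNet weights was used for comparison with these models. The two-step transfer learning performed very well in classification, with ResNet101 reaching a slide-wise prediction accuracy of 0.913. The deep learning model was also compared with Genefu, another commonly used breast cancer classification tool. It performed comparably to Genefu and predicted certain breast cancer subtypes better.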

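Deep learning has been applied in many studies and integrated into today's health care systems to improve disease diagnosis and prognosis. The U.S. Food and Drug Administration has also established machine learning standards for governing the use of deep learning and artificial intelligence tools, and these have become the gold standard for model development, dataset construction, and deployment in hospitals.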

 

Applying machine learning (ML) methods to high-throughput biomedical data, such as imaging data and next-generation sequencing (NGS) data, has deepened our understanding of precision medicine and of public health issues. Deep learning (DL) is a more recent sub-branch of ML, introduced with the aspiration of bringing ML closer to artificial intelligence (AI). At Professor Eric Y. Chuang's lab, we develop high-performance DL pipelines that are broadly applicable to classification and identification tasks. In this work, three pipelines were developed to process DNA sequencing data and pathological images.

In the first study, published in Briefings in Bioinformatics (2021) and entitled “High-performance deep learning pipeline predicts individuals in mixtures of DNA using sequencing data”, a novel DL pipeline was proposed that uses DNA sequencing data to detect and classify different individuals. To demonstrate its general applicability, the pipeline was applied to datasets generated with different sequencing technologies: (i) targeted sequencing and (ii) whole exome sequencing (WES). In the first application, individuals were identified with 95-97% accuracy from mixtures of DNA samples prepared using 27 targeted short tandem repeats and 94 single nucleotide polymorphisms. In the second application, WES data from breast cancer patients were used, and the pipeline correctly classified all patients (100%) into subtypes. A new sliding window approach was proposed to overcome the variation in sequence length in sequencing data, and it dramatically improved model performance. Overall, the study proposes a complete pipeline, covering both sequencing data processing and DL steps, that is applicable across different NGS platforms.
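The sliding window idea mentioned above can be illustrated with a minimal sketch. It is not the published pipeline: the function name sliding_windows, the window and stride values, and the padding symbol "N" are all assumptions chosen for illustration. The point is only that sequences of different lengths are turned into windows of one fixed length, which is the shape a DL model expects as input.

```python
from typing import List


def sliding_windows(seq: str, window: int, stride: int, pad: str = "N") -> List[str]:
    """Split a sequence into fixed-length windows, padding the last one if needed.

    Hypothetical helper for illustration; window size, stride, and the pad
    symbol are assumptions, not values taken from the study.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        chunk = seq[start:start + window]
        # Right-pad a short final chunk so every window has the same length.
        windows.append(chunk.ljust(window, pad))
        if start + window >= len(seq):
            break
        start += stride
    return windows


# Reads of different lengths all become windows of length 4.
short_read = sliding_windows("ACG", window=4, stride=2)
long_read = sliding_windows("ACGTACGTAC", window=4, stride=3)
```

Overlapping windows (a stride smaller than the window) keep local context that would be lost if a sequence were simply truncated or cut into disjoint pieces.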

The second study, published in Frontiers in Oncology (2021), is entitled “Predicting Breast Cancer Gene Expression Signature by Applying Deep Convolutional Neural Networks from Unannotated Pathological Images”. To leverage the availability of whole slide image data and the recurrence risk score provided by a 70-gene signature from breast cancer patients, a DL model was proposed to predict breast cancer recurrence status using only pathological images. This provides a rapid, cost-effective, and robust predictive tool that could assist physicians in making treatment recommendations. Six pre-trained models (VGG16, ResNet50, ResNet101, Inception_ResNet, EfficientB5, and Xception) were used for transfer learning, and their performance was evaluated by accuracy, precision, recall, F1 score, confusion matrix, and AUC. Xception showed the highest validation performance, with an overall accuracy of 0.87 for the patch-wise approach and, for the patient-wise approach, accuracies of 0.90 and 1.00 for the high-risk and low-risk groups, respectively. Taken together, this study demonstrated the feasibility and high performance of artificial intelligence models trained without region-of-interest labeling for predicting cancer recurrence.
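The difference between patch-wise and patient-wise results can be sketched as follows. The text does not say how patch predictions were combined into a patient-level call, so the majority vote used here is an assumption, as are the function names and the made-up patient IDs. The sketch only shows how per-group accuracies such as those quoted for the high-risk and low-risk groups could be computed from patch-level output.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def patient_wise_labels(patch_preds: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Majority vote over patch predictions per patient (assumed aggregation rule)."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for patient_id, label in patch_preds:
        votes[patient_id][label] += 1
    # Highest vote count wins; ties are broken alphabetically for determinism.
    return {
        pid: min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for pid, counts in votes.items()
    }


def per_class_accuracy(truth: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, float]:
    """Fraction of patients in each true class whose predicted label matches."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for pid, true_label in truth.items():
        total[true_label] += 1
        if predicted.get(pid) == true_label:
            correct[true_label] += 1
    return {label: correct[label] / total[label] for label in total}


# Made-up patch predictions as (patient ID, predicted risk group) pairs.
patches = [
    ("P1", "high"), ("P1", "high"), ("P1", "low"),
    ("P2", "low"), ("P2", "low"),
    ("P3", "high"), ("P3", "low"),
]
patient_labels = patient_wise_labels(patches)
accuracy_by_group = per_class_accuracy(
    {"P1": "high", "P2": "low", "P3": "low"}, patient_labels
)
```

Other aggregation rules, such as averaging patch probabilities, would fit the same interface; majority voting is used here only because it is the simplest to read.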

Finally, Prof. Chuang published another study in Frontiers in Oncology (2021), entitled “Prediction of Breast Cancer Recurrence Using a Deep Convolutional Neural Network Without Region-of-Interest Labeling”. Deciphering breast cancer molecular subtypes with DL approaches could provide a convenient method for diagnosing breast cancer patients. It could reduce the costs associated with transcriptional profiling and the subtyping discrepancies between immunohistochemistry (IHC) assays and mRNA expression. In this final study, we therefore aimed to develop a highly versatile two-step transfer learning pipeline for pathological images, using weights obtained from the models trained on the 70-gene signature images. Weights from four pre-trained models, namely VGG16, ResNet50, ResNet101, and Xception, were used to train models on the TCGA-BRCA dataset to predict four intrinsic breast cancer subtypes. In addition, a ResNet101 model initialized with ImageNet weights was trained for comparison with these models. The two-step DL models showed promising classification results, with an overall slide-wise prediction accuracy of 0.913 for the ResNet101 model. The DL model was additionally benchmarked against Genefu, a common tool for breast cancer classification. The results showed that the performance of the DL model is comparable to that of Genefu, and even superior for certain breast cancer subtypes.
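The weight hand-off behind two-step transfer learning can be sketched structurally, without any DL framework. This is a toy under stated assumptions, not the study's implementation: TinyClassifier, transfer, and the layer sizes are hypothetical, random numbers stand in for real weights, and the fine-tuning steps are only marked by comments. What it shows is that the feature-extractor weights are carried forward at each step while the classification head is replaced to match the new label set.

```python
import random
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, standing in for a freshly initialised layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


@dataclass
class TinyClassifier:
    backbone: Matrix  # feature-extractor weights, shape (features, inputs)
    head: Matrix      # classification-head weights, shape (classes, features)


def transfer(source: TinyClassifier, n_classes: int, rng: random.Random) -> TinyClassifier:
    """Copy the source backbone and attach a new head for a different label set."""
    backbone = [row[:] for row in source.backbone]
    head = random_matrix(n_classes, len(backbone), rng)
    return TinyClassifier(backbone, head)


rng = random.Random(0)
# Step 0: a generically pre-trained model (placeholder for ImageNet weights).
pretrained = TinyClassifier(random_matrix(8, 16, rng), random_matrix(1000, 8, rng))
# Step 1: reuse the backbone for the two-class recurrence-risk task.
risk_model = transfer(pretrained, n_classes=2, rng=rng)
# ... fine-tuning on the 70-gene signature images would happen here ...
# Step 2: reuse the step-1 backbone for the four-subtype task.
subtype_model = transfer(risk_model, n_classes=4, rng=rng)
# ... fine-tuning on TCGA-BRCA images would happen here ...
```

In a real pipeline the backbone would be a convolutional network such as ResNet101, and each step would fine-tune the copied weights on its own image set before passing them on.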

DL technology is applied routinely in the laboratory and is being integrated into the current health care system to facilitate diagnosis and the determination of prognosis. A good machine learning protocol has also been released by the U.S. FDA for managing the application of DL and artificial intelligence tools, and it has become the gold standard for model development, dataset preparation, and deployment in hospitals. Eventually, artificial intelligence tools could make the health care system less vulnerable to emergent situations that are not handled well under current healthcare protocols.

 

References

  1. Phan, N.N., Chattopadhyay, A., Lee, T.T., Yin, H.I., Lu, T.P., Lai, L.C., Hwa, H.L., Tsai, M.H. and Chuang, E.Y., 2021. High-performance deep learning pipeline predicts individuals in mixtures of DNA using sequencing data. Briefings in Bioinformatics, 22(6), p.bbab283.
  2. Phan, N.N., Hsu, C.Y., Huang, C.C., Tseng, L.M. and Chuang, E.Y., 2021. Prediction of Breast Cancer Recurrence Using a Deep Convolutional Neural Network Without Region-of-Interest Labeling. Frontiers in Oncology, 11, p.734015.
  3. Phan, N.N., Huang, C.C., Tseng, L.M. and Chuang, E.Y., 2021. Predicting Breast Cancer Gene Expression Signature by Applying Deep Convolutional Neural Networks From Unannotated Pathological Images. Frontiers in Oncology, 11, p.769447.