machine learning detection thesis

Malware analysis and detection using machine learning algorithms.

1. Introduction
2. Literature Review
3. Research Problem
4. Methodology
4.1. Dataset
4.2. Pre-processing
4.3. Features Extraction
4.4. Features Selection
5. Results and Discussion
Logistic Regression
6. Conclusions
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations

CNN	Convolutional Neural Network
FPR	False Positive Rate
RBM	Restricted Boltzmann Machine
DT	Decision Tree
SVM	Support Vector Machine
VM	Virtual Machine

File Type		No. of Files
Malware	Backdoor	3654
	Rootkit	2834
	Virus	921
	Trojan	2563
	Exploit	652
	Worm	921
	Others	3138
Cleanware		2711
Total		17,394

Methods	Accuracy (%)	TPR (%)	FPR (%)
KNN	95.02	96.17	3.42
CNN	98.76	99.22	3.97
Naïve Bayes	89.71	90	13
Random Forest	92.01	95.9	6.5
SVM	96.41	98	4.63
DT	99	99.07	2.01
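The three columns in the table above follow from a binary confusion matrix. As a minimal sketch (the counts below are hypothetical and are not taken from the paper), accuracy, true positive rate (TPR), and false positive rate (FPR) can be computed as follows, treating malware as the positive class:

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # malware correctly flagged as malware
    fp: int  # cleanware wrongly flagged as malware
    tn: int  # cleanware correctly passed as cleanware
    fn: int  # malware missed (classified as cleanware)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all files that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def true_positive_rate(c: ConfusionCounts) -> float:
    """Share of malware files that were detected."""
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of cleanware files that were wrongly flagged."""
    return c.fp / (c.fp + c.tn)


# Hypothetical counts for illustration only; not the paper's data.
counts = ConfusionCounts(tp=90, fp=5, tn=95, fn=10)

print(f"Accuracy: {accuracy(counts):.2%}")
print(f"TPR:      {true_positive_rate(counts):.2%}")
print(f"FPR:      {false_positive_rate(counts):.2%}")
```

A high accuracy together with a high FPR (as in the Naïve Bayes row) indicates that a classifier catches most malware but also flags a noticeable share of clean files, which is why the three metrics are usually read together.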


Akhtar, M.S.; Feng, T. Malware Analysis and Detection Using Machine Learning Algorithms. Symmetry 2022 , 14 , 2304. https://doi.org/10.3390/sym14112304


Enhancing IoT Device Security: A Comparative Analysis of Machine Learning Algorithms for Attack Detection

Conference paper
First Online: 26 June 2024

Abdulaziz Alzahrani & Abdulaziz Alshammari

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 1035)

Included in the following conference series:

International Conference on Forthcoming Networks and Sustainability in the AIoT Era

This study compared the effectiveness, efficiency, and scalability of three supervised learning algorithms (logistic regression, decision tree, and random forest) for attack detection in IoT networks, and evaluated how well these algorithms adapt to evolving attack techniques. The study used data from a telecom company: a dataset of 10,000 records and 8 attributes. The dataset comprised both normal and malicious traffic, with 3,000 records classified as attacks and 6,000 records classified as normal traffic. To ensure reliable and predictive models, a statistical sampling technique, the Synthetic Minority Over-Sampling Technique (SMOTE), was employed. In the experiments, logistic regression proved the most accurate, followed by random forest and then the decision tree. In the context of IoT device security, the research contributes to an understanding of data preprocessing techniques, feature engineering, and model evaluation. The correlation analysis and heatmap visualization give insight into the relationships between variables and highlight potential patterns and trends in the data. Overall, the study adds to knowledge on improving the security of IoT devices with machine learning algorithms.
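The abstract names SMOTE but gives no implementation detail. The following is a simplified, standard-library sketch of the idea behind SMOTE, not the authors' code: each synthetic minority sample is an interpolation between a minority sample and one of its k nearest minority neighbours. The function name `smote`, its parameters, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority neighbours (simplified SMOTE sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment between them.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature attack records (the minority class).
attacks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_attacks = smote(attacks, n_synthetic=6)
print(len(new_attacks), "synthetic attack records generated")
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region the original minority data already covers. In practice, oversampling is normally applied to the training split only, so that evaluation data contains no synthetic records.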

Author information

Authors and affiliations.

Imam Mohammad Ibn Saud Islamic University, Riyadh, Kingdom of Saudi Arabia

Abdulaziz Alzahrani & Abdulaziz Alshammari

Corresponding author

Correspondence to Abdulaziz Alzahrani.

Editor information

Editors and affiliations.

Department of Computer Engineering, Istanbul Sabahattin Zaim University, Istanbul, Türkiye

Jawad Rasheed

Council for Scientific and Industrial Research (CSIR), Pretoria, South Africa

Adnan M. Abu-Mahfouz

School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK

Muhammad Fahim

About this paper

Cite this paper.

Alzahrani, A., Alshammari, A. (2024). Enhancing IoT Device Security: A Comparative Analysis of Machine Learning Algorithms for Attack Detection. In: Rasheed, J., Abu-Mahfouz, A.M., Fahim, M. (eds) Forthcoming Networks and Sustainability in the AIoT Era. FoNeS-AIoT 2024. Lecture Notes in Networks and Systems, vol 1035. Springer, Cham. https://doi.org/10.1007/978-3-031-62871-9_7

DOI: https://doi.org/10.1007/978-3-031-62871-9_7

Published: 26 June 2024

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-62870-2

Online ISBN: 978-3-031-62871-9

eBook Packages: Intelligent Technologies and Robotics (R0)


Machine Learning for Intrusion Detection of Cyber-Attacks in Unmanned Aerial Vehicles

The introduction of Unmanned Aerial Vehicles (UAVs) has revolutionized civilian and military aviation operations. Their broad and advantageous applications create a high value proposition. The global UAV market is projected to reach $102 billion in revenue by 2030, with a compound annual growth rate of 19.6%. According to the 2023 Presidential Budget, the Department of Defense planned to spend $2.6 billion on unmanned systems (McNabb, 2023). Despite these tangible benefits, UAVs have significant security weaknesses that could affect human safety and national security. There is a direct relationship between the demand for UAV systems and the incentive for threat actors to conduct malicious cyber activity, so it is pivotal to design and develop effective countermeasures to prevent unauthorized UAV intrusions. This research is concerned with UAV security vulnerabilities that disrupt GPS signals and proposes a machine learning approach to detecting cyber-attack intrusions in UAVs. More specifically, it uses supervised machine learning to detect cyber-attack intrusions in the UAV Attack dataset (Whelan et al., 2020) via binary and multi-class classification, while aiming to identify a classifier that outperforms prior approaches on standard classification metrics. The research methodology follows the data mining process: data collection and understanding, data preparation, modeling, validation, and evaluation. Within this construct, 11 popular classification algorithms were modeled against the UAV Attack dataset (Whelan et al., 2020) to address the research questions and hypotheses. The research concludes that ML approaches are effective for classifying cyber-attack intrusions in UAVs, with 80% precision and accuracy.
This research additionally proposes the UAV Attack dataset as a useful dataset for analyzing UAV network environments. Furthermore, it validates that tree-based ML algorithms are the most effective for classification when compared against the other classifiers used in this research. Lastly, it provides context on some of the possible factors that contributed to rejecting or accepting each research hypothesis.
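The abstract reports precision and accuracy for multi-class classification without showing how they are computed. As a hedged illustration only, the sketch below derives accuracy and macro-averaged precision from true and predicted labels. The class names and label lists are hypothetical, and macro averaging is an assumption: the abstract does not say which averaging the dissertation uses.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_precision(y_true, y_pred):
    """Precision per class: correct predictions of a class divided by all
    predictions of that class (0.0 if the class was never predicted)."""
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    labels = set(y_true) | set(y_pred)
    return {
        label: (correct[label] / predicted[label]) if predicted[label] else 0.0
        for label in labels
    }


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precision values."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for illustration; not drawn from the UAV Attack dataset.
y_true = ["benign", "benign", "benign", "spoofing",
          "spoofing", "jamming", "jamming", "jamming"]
y_pred = ["benign", "benign", "spoofing", "spoofing",
          "spoofing", "jamming", "benign", "jamming"]

print(f"Accuracy:        {accuracy(y_true, y_pred):.3f}")
print(f"Macro precision: {macro_precision(y_true, y_pred):.3f}")
```

Macro averaging weights every class equally, so a rare attack class influences the score as much as the majority class. A binary benign-versus-attack evaluation is the same computation with two labels.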

Molina, Andrea Angelina
Dissertation
In Copyright
Cybersecurity in Computer Science
Fossaceca, John
Sarkani, Shahryar
Islam, Muhammad
https://scholarspace.library.gwu.edu/etd/5t34sk51n

Date uploaded: 2023-11-14 (Open Access)

Machine Learning - CMU

PhD Dissertations

Robust Machine Learning: Detection, Evaluation and Adaptation Under Distribution Shift Saurabh Garg, 2024

Understanding, Formally Characterizing, and Robustly Handling Real-World Distribution Shift Elan Rosenfeld, 2024

Representing Time: Towards Pragmatic Multivariate Time Series Modeling Cristian Ignacio Challu, 2024

Foundations of Multisensory Artificial Intelligence Paul Pu Liang, 2024

Advancing Model-Based Reinforcement Learning with Applications in Nuclear Fusion Ian Char, 2024

Learning Models that Match Jacob Tyo, 2024

Improving Human Integration across the Machine Learning Pipeline Charvi Rastogi, 2024

Reliable and Practical Machine Learning for Dynamic Healthcare Settings Helen Zhou, 2023

Automatic customization of large-scale spiking network models to neuronal population activity (unavailable) Shenghao Wu, 2023

Estimation of BVk functions from scattered data (unavailable) Addison J. Hu, 2023

Rethinking object categorization in computer vision (unavailable) Jayanth Koushik, 2023

Advances in Statistical Gene Networks Jinjin Tian, 2023

Post-hoc calibration without distributional assumptions Chirag Gupta, 2023

The Role of Noise, Proxies, and Dynamics in Algorithmic Fairness Nil-Jana Akpinar, 2023

Collaborative learning by leveraging siloed data Sebastian Caldas, 2023

Modeling Epidemiological Time Series Aaron Rumack, 2023

Human-Centered Machine Learning: A Statistical and Algorithmic Perspective Leqi Liu, 2023

Uncertainty Quantification under Distribution Shifts Aleksandr Podkopaev, 2023

Probabilistic Reinforcement Learning: Using Data to Define Desired Outcomes, and Inferring How to Get There Benjamin Eysenbach, 2023

Comparing Forecasters and Abstaining Classifiers Yo Joong Choe, 2023

Using Task Driven Methods to Uncover Representations of Human Vision and Semantics Aria Yuan Wang, 2023

Data-driven Decisions - An Anomaly Detection Perspective Shubhranshu Shekhar, 2023

Applied Mathematics of the Future Kin G. Olivares, 2023

Methods and Applications of Explainable Machine Learning Joon Sik Kim, 2023

Neural Reasoning for Question Answering Haitian Sun, 2023

Principled Machine Learning for Societally Consequential Decision Making Amanda Coston, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Maxwell B. Wang, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Darby M. Losey, 2023

Calibrated Conditional Density Models and Predictive Inference via Local Diagnostics David Zhao, 2023

Towards an Application-based Pipeline for Explainability Gregory Plumb, 2022

Objective Criteria for Explainable Machine Learning Chih-Kuan Yeh, 2022

Making Scientific Peer Review Scientific Ivan Stelmakh, 2022

Facets of regularization in high-dimensional learning: Cross-validation, risk monotonization, and model complexity Pratik Patil, 2022

Active Robot Perception using Programmable Light Curtains Siddharth Ancha, 2022

Strategies for Black-Box and Multi-Objective Optimization Biswajit Paria, 2022

Unifying State and Policy-Level Explanations for Reinforcement Learning Nicholay Topin, 2022

Sensor Fusion Frameworks for Nowcasting Maria Jahja, 2022

Equilibrium Approaches to Modern Deep Learning Shaojie Bai, 2022

Towards General Natural Language Understanding with Probabilistic Worldbuilding Abulhair Saparov, 2022

Applications of Point Process Modeling to Spiking Neurons (Unavailable) Yu Chen, 2021

Neural variability: structure, sources, control, and data augmentation Akash Umakantha, 2021

Structure and time course of neural population activity during learning Jay Hennig, 2021

Cross-view Learning with Limited Supervision Yao-Hung Hubert Tsai, 2021

Meta Reinforcement Learning through Memory Emilio Parisotto, 2021

Learning Embodied Agents with Scalably-Supervised Reinforcement Learning Lisa Lee, 2021

Learning to Predict and Make Decisions under Distribution Shift Yifan Wu, 2021

Statistical Game Theory Arun Sai Suggala, 2021

Towards Knowledge-capable AI: Agents that See, Speak, Act and Know Kenneth Marino, 2021

Learning and Reasoning with Fast Semidefinite Programming and Mixing Methods Po-Wei Wang, 2021

Bridging Language in Machines with Language in the Brain Mariya Toneva, 2021

Curriculum Learning Otilia Stretcu, 2021

Principles of Learning in Multitask Settings: A Probabilistic Perspective Maruan Al-Shedivat, 2021

Towards Robust and Resilient Machine Learning Adarsh Prasad, 2021

Towards Training AI Agents with All Types of Experiences: A Unified ML Formalism Zhiting Hu, 2021

Building Intelligent Autonomous Navigation Agents Devendra Chaplot, 2021

Learning to See by Moving: Self-supervising 3D Scene Representations for Perception, Control, and Visual Reasoning Hsiao-Yu Fish Tung, 2021

Statistical Astrophysics: From Extrasolar Planets to the Large-scale Structure of the Universe Collin Politsch, 2020

Causal Inference with Complex Data Structures and Non-Standard Effects Kwhangho Kim, 2020

Networks, Point Processes, and Networks of Point Processes Neil Spencer, 2020

Dissecting neural variability using population recordings, network models, and neurofeedback (Unavailable) Ryan Williamson, 2020

Predicting Health and Safety: Essays in Machine Learning for Decision Support in the Public Sector Dylan Fitzpatrick, 2020

Towards a Unified Framework for Learning and Reasoning Han Zhao, 2020

Learning DAGs with Continuous Optimization Xun Zheng, 2020

Machine Learning and Multiagent Preferences Ritesh Noothigattu, 2020

Learning and Decision Making from Diverse Forms of Information Yichong Xu, 2020

Towards Data-Efficient Machine Learning Qizhe Xie, 2020

Change modeling for understanding our world and the counterfactual one(s) William Herlands, 2020

Machine Learning in High-Stakes Settings: Risks and Opportunities Maria De-Arteaga, 2020

Data Decomposition for Constrained Visual Learning Calvin Murdock, 2020

Structured Sparse Regression Methods for Learning from High-Dimensional Genomic Data Micol Marchetti-Bowick, 2020

Towards Efficient Automated Machine Learning Liam Li, 2020

Learning Collections of Functions Emmanouil Antonios Platanios, 2020

Provable, structured, and efficient methods for robustness of deep networks to adversarial examples Eric Wong, 2020

Reconstructing and Mining Signals: Algorithms and Applications Hyun Ah Song, 2020

Probabilistic Single Cell Lineage Tracing Chieh Lin, 2020

Graphical network modeling of phase coupling in brain activity (unavailable) Josue Orellana, 2019

Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees Christoph Dann, 2019

Learning Generative Models using Transformations Chun-Liang Li, 2019

Estimating Probability Distributions and their Properties Shashank Singh, 2019

Post-Inference Methods for Scalable Probabilistic Modeling and Sequential Decision Making Willie Neiswanger, 2019

Accelerating Text-as-Data Research in Computational Social Science Dallas Card, 2019

Multi-view Relationships for Analytics and Inference Eric Lei, 2019

Information flow in networks based on nonstationary multivariate neural recordings Natalie Klein, 2019

Competitive Analysis for Machine Learning & Data Science Michael Spece, 2019

The When, Where and Why of Human Memory Retrieval Qiong Zhang, 2019

Towards Effective and Efficient Learning at Scale Adams Wei Yu, 2019

Towards Literate Artificial Intelligence Mrinmaya Sachan, 2019

Learning Gene Networks Underlying Clinical Phenotypes Under SNP Perturbations From Genome-Wide Data Calvin McCarter, 2019

Unified Models for Dynamical Systems Carlton Downey, 2019

Anytime Prediction and Learning for the Balance between Computation and Accuracy Hanzhang Hu, 2019

Statistical and Computational Properties of Some "User-Friendly" Methods for High-Dimensional Estimation Alnur Ali, 2019

Nonparametric Methods with Total Variation Type Regularization Veeranjaneyulu Sadhanala, 2019

New Advances in Sparse Learning, Deep Networks, and Adversarial Learning: Theory and Applications Hongyang Zhang, 2019

Gradient Descent for Non-convex Problems in Modern Machine Learning Simon Shaolei Du, 2019

Selective Data Acquisition in Learning and Decision Making Problems Yining Wang, 2019

Anomaly Detection in Graphs and Time Series: Algorithms and Applications Bryan Hooi, 2019

Neural dynamics and interactions in the human ventral visual pathway Yuanning Li, 2018

Tuning Hyperparameters without Grad Students: Scaling up Bandit Optimisation Kirthevasan Kandasamy, 2018

Teaching Machines to Classify from Natural Language Interactions Shashank Srivastava, 2018

Statistical Inference for Geometric Data Jisu Kim, 2018

Representation Learning @ Scale Manzil Zaheer, 2018

Diversity-promoting and Large-scale Machine Learning for Healthcare Pengtao Xie, 2018

Distribution and Histogram (DIsH) Learning Junier Oliva, 2018

Stress Detection for Keystroke Dynamics Shing-Hon Lau, 2018

Sublinear-Time Learning and Inference for High-Dimensional Models Enxu Yan, 2018

Neural population activity in the visual cortex: Statistical methods and application Benjamin Cowley, 2018

Efficient Methods for Prediction and Control in Partially Observable Environments Ahmed Hefny, 2018

Learning with Staleness Wei Dai, 2018

Statistical Approach for Functionally Validating Transcription Factor Bindings Using Population SNP and Gene Expression Data Jing Xiang, 2017

New Paradigms and Optimality Guarantees in Statistical Learning and Estimation Yu-Xiang Wang, 2017

Dynamic Question Ordering: Obtaining Useful Information While Reducing User Burden Kirstin Early, 2017

New Optimization Methods for Modern Machine Learning Sashank J. Reddi, 2017

Active Search with Complex Actions and Rewards Yifei Ma, 2017

Why Machine Learning Works George D. Montañez, 2017

Source-Space Analyses in MEG/EEG and Applications to Explore Spatio-temporal Neural Dynamics in Human Vision Ying Yang, 2017

Computational Tools for Identification and Analysis of Neuronal Population Activity Pengcheng Zhou, 2016

Expressive Collaborative Music Performance via Machine Learning Gus (Guangyu) Xia, 2016

Supervision Beyond Manual Annotations for Learning Visual Representations Carl Doersch, 2016

Exploring Weakly Labeled Data Across the Noise-Bias Spectrum Robert W. H. Fisher, 2016

Optimizing Optimization: Scalable Convex Programming with Proximal Operators Matt Wytock, 2016

Combining Neural Population Recordings: Theory and Application William Bishop, 2015

Discovering Compact and Informative Structures through Data Partitioning Madalina Fiterau-Brostean, 2015

Machine Learning in Space and Time Seth R. Flaxman, 2015

The Time and Location of Natural Reading Processes in the Brain Leila Wehbe, 2015

Shape-Constrained Estimation in High Dimensions Min Xu, 2015

Spectral Probabilistic Modeling and Applications to Natural Language Processing Ankur Parikh, 2015

Computational and Statistical Advances in Testing and Learning Aaditya Kumar Ramdas, 2015

Corpora and Cognition: The Semantic Composition of Adjectives and Nouns in the Human Brain Alona Fyshe, 2015

Learning Statistical Features of Scene Images Wooyoung Lee, 2014

Towards Scalable Analysis of Images and Videos Bin Zhao, 2014

Statistical Text Analysis for Social Science Brendan T. O'Connor, 2014

Modeling Large Social Networks in Context Qirong Ho, 2014

Semi-Cooperative Learning in Smart Grid Agents Prashant P. Reddy, 2013

On Learning from Collective Data Liang Xiong, 2013

Exploiting Non-sequence Data in Dynamic Model Learning Tzu-Kuo Huang, 2013

Mathematical Theories of Interaction with Oracles Liu Yang, 2013

Short-Sighted Probabilistic Planning Felipe W. Trevizan, 2013

Statistical Models and Algorithms for Studying Hand and Finger Kinematics and their Neural Mechanisms Lucia Castellanos, 2013

Approximation Algorithms and New Models for Clustering and Learning Pranjal Awasthi, 2013

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems Mladen Kolar, 2013

Learning with Sparsity: Structures, Optimization and Applications Xi Chen, 2013

GraphLab: A Distributed Abstraction for Large Scale Machine Learning Yucheng Low, 2013

Graph Structured Normal Means Inference James Sharpnack, 2013 (Joint Statistics & ML PhD)

Probabilistic Models for Collecting, Analyzing, and Modeling Expression Data Hai-Son Phuoc Le, 2013

Learning Large-Scale Conditional Random Fields Joseph K. Bradley, 2013

New Statistical Applications for Differential Privacy Rob Hall, 2013 (Joint Statistics & ML PhD)

Parallel and Distributed Systems for Probabilistic Reasoning Joseph Gonzalez, 2012

Spectral Approaches to Learning Predictive Representations Byron Boots, 2012

Attribute Learning using Joint Human and Machine Computation Edith L. M. Law, 2012

Statistical Methods for Studying Genetic Variation in Populations Suyash Shringarpure, 2012

Data Mining Meets HCI: Making Sense of Large Graphs Duen Horng (Polo) Chau, 2012

Learning with Limited Supervision by Input and Output Coding Yi Zhang, 2012

Target Sequence Clustering Benjamin Shih, 2011

Nonparametric Learning in High Dimensions Han Liu, 2010 (Joint Statistics & ML PhD)

Structural Analysis of Large Networks: Observations and Applications Mary McGlohon, 2010

Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy Brian D. Ziebart, 2010

Tractable Algorithms for Proximity Search on Large Graphs Purnamrita Sarkar, 2010

Rare Category Analysis Jingrui He, 2010

Coupled Semi-Supervised Learning Andrew Carlson, 2010

Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong, 2009

Efficient Matrix Models for Relational Learning Ajit Paul Singh, 2009

Exploiting Domain and Task Regularities for Robust Named Entity Recognition Andrew O. Arnold, 2009

Theoretical Foundations of Active Learning Steve Hanneke, 2009

Generalized Learning Factors Analysis: Improving Cognitive Models with Machine Learning Hao Cen, 2009

Detecting Patterns of Anomalies Kaustav Das, 2009

Dynamics of Large Networks Jurij Leskovec, 2008

Computational Methods for Analyzing and Modeling Gene Regulation Dynamics Jason Ernst, 2008

Stacked Graphical Learning Zhenzhen Kou, 2007

Actively Learning Specific Function Properties with Applications to Statistical Inference Brent Bryan, 2007

Approximate Inference, Structure Learning and Feature Estimation in Markov Random Fields Pradeep Ravikumar, 2007

Scalable Graphical Models for Social Networks Anna Goldenberg, 2007

Measure Concentration of Strongly Mixing Processes with Applications Leonid Kontorovich, 2007

Tools for Graph Mining Deepayan Chakrabarti, 2005

Automatic Discovery of Latent Variable Models Ricardo Silva, 2005

Open access
Published: 28 June 2024

A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography

Chiu-Fan Chen 1,2,3; Chun-Hsiang Hsu 1; You-Cheng Jiang 1; Wen-Ren Lin 1; Wei-Cheng Hong 1; I.-Yuan Chen 1; Min-Hsi Lin 1; Kuo-An Chu 1; Chao-Hsien Lee 3; David Lin Lee 1 & Po-Fan Chen 4,5

Scientific Reports volume 14, Article number: 14917 (2024)

Machine learning
Tuberculosis

In tuberculosis (TB), chest radiography (CXR) patterns are highly variable, mimicking pneumonia and many other diseases. This study aims to evaluate the efficacy of Google Teachable Machine, a deep neural network-based image classification tool, for developing an algorithm that predicts the TB probability of CXRs. The training dataset included 348 TB CXRs and 3806 normal CXRs for training TB detection. We also collected 1150 abnormal CXRs and 627 normal CXRs for training abnormality detection. For external validation, we collected 250 CXRs from our hospital. We also compared the accuracy of the algorithm with that of five pulmonologists and of radiology reports. In external validation, the AI algorithm showed areas under the curve (AUC) of 0.951 and 0.975 in validation datasets 1 and 2, respectively. The pulmonologists showed an AUC range of 0.936–0.995 on validation dataset 2. When abnormal CXRs other than TB were added, the AUC decreased for both the human readers (0.843–0.888) and the AI algorithm (0.828). When the human readers were combined with the AI algorithm, the AUC increased to 0.862–0.885. The TB CXR AI algorithm developed with Google Teachable Machine in this study is effective, with an accuracy close to that of experienced clinical physicians, and may be helpful for detecting tuberculosis on CXR.

Introduction

Tuberculosis (TB) is one of the most important infectious diseases worldwide and causes millions of illnesses and deaths annually 1. Chest radiography is an essential first-line diagnostic tool for TB because of its low cost and speed. However, the characteristics of TB on chest X-ray (CXR) are highly variable, mimicking pneumonia and many other diseases. Atypical patterns are particularly common in elderly patients, immunocompromised patients, and those with multiple comorbidities 2, 3. Consequently, the early diagnosis of TB using CXRs can be challenging. Moreover, CXR reports often cannot be completed in a timely manner, which further increases the difficulty of early TB detection for frontline clinicians.

The application of artificial intelligence (AI) to CXR for TB is a field with tremendous potential. Deep neural network-based image interpretation has achieved remarkable results in medical imaging. Recent research has produced numerous image recognition algorithms for CXR patterns 4, 5, 6 and for various pulmonary diseases (pneumonia, lung cancer, TB, pneumothorax, COVID-19, etc.) 7, 8, 9, 10, 11, 12, 13, 14, 15; in some of them, the accuracy matches or even exceeds that of radiologists. Some have had their accuracy confirmed by external validation 7, 8, 9, 10, 12, 14, 16, 17. In a study evaluating CXR algorithms for classifying pulmonary diseases, combining the algorithm with physicians improved accuracy over physicians alone, and the benefit was observed in both radiologists and non-radiology physicians 8. In another study evaluating a CXR algorithm for TB detection, a similar accuracy benefit was found in physicians with algorithm assistance 7. Five commercial TB AI algorithms have been carefully validated, and their specificity ranged from 61 to 74% when sensitivity was fixed at 90% 13, 14.

In 2019, Google Teachable Machine (GoogleTM) launched its second version 18. This tool allows users to train deep neural networks for image recognition through a graphical user interface in the Chrome browser, with almost no coding required. Its lightweight design, along with its use of transfer learning, significantly reduces the computational requirements and the amount of data required for training. As a result, AI training can be performed on a desktop or laptop computer. The purpose of this study is therefore to assess the feasibility and accuracy of GoogleTM for detecting TB on CXR images. To assess its utility in clinical practice, we also compared the accuracy of this simple AI tool with that of frontline physicians.

Materials and methods

This study was designed to use freely available open TB CXR datasets as training data for our AI algorithm. Subsequent accuracy analyses were performed using independent CXR datasets and actual TB cases from our hospital. All image data were de-identified to ensure privacy. This study was reviewed and approved by the institutional review board (IRB) of Kaohsiung Veterans General Hospital, which waived the requirement for informed consent (IRB no.: KSVGH23-CT4-13). This study adheres to the principles of the Declaration of Helsinki.

Training datasets

The flowchart of the study design is shown in Fig. 1. Because TB is highly prevalent and its imaging presentation varies, TB cannot be entirely excluded when a CXR shows pneumonia or another entity. Our preliminary research indicated that a model trained solely on TB vs. normal produced bimodally distributed predictive values. As a result, CXRs that were abnormal but not indicative of TB usually received predictive values that were either too high or too low, and the model failed to differentiate these abnormal cases from normal or TB cases. For common CXR abnormalities such as pneumonia and pleural effusion, the TB risk is lower but not zero. We therefore trained two models on two different training datasets, one for TB detection and another for abnormality detection, and averaged their output predictive values.

Fig. 1. Flow chart of model training and validations.

The features of the CXR datasets used for training are summarized in Table 1. The inclusion criteria were CXRs showing TB, another abnormality, or a normal pattern. Both posteroanterior and anteroposterior views were included. The exclusion criteria were poor image quality, lateral views, pediatric CXRs, and lesions too small to detect at a size of 224 × 224 pixels. All CXR images were reviewed by C.F.C. to confirm both image quality and correctness.

Training dataset 1 was used to train algorithms to detect the typical TB pattern on CXR. A total of 348 TB CXRs and 3806 normal CXRs were collected from various open datasets, including the Shenzhen dataset from Shenzhen No. 3 People’s Hospital, the Montgomery dataset 19, 20, and Kaggle's RSNA Pneumonia Detection Challenge 21, 22.

Training dataset 2 was used to train algorithms to detect CXR abnormalities. A total of 1150 abnormal CXRs and 627 normal CXRs were collected from the ChestX-ray14 dataset 23. The abnormal CXRs comprised consolidation (185), cardiomegaly (235), pulmonary edema (139), pleural effusion (230), pulmonary fibrosis (106), and mass (255).

Algorithm: Google teachable machine

In this study, we employed GoogleTM 18, a free online AI tool dedicated to image classification. GoogleTM provides a user-friendly web-based graphical interface that allows users to run deep neural network computations and train image classification models with minimal coding. By using transfer learning, GoogleTM significantly reduces the computational time and the amount of data required for deep neural network training. Within GoogleTM, the base model for transfer learning is MobileNet, a model pretrained by Google on the ImageNet dataset (14 million images) that can recognize 1,000 classes of images. Transfer learning is achieved by modifying the last 2 layers of the pre-trained MobileNet and then continuing training on the specific image recognition task 18, 24. In GoogleTM, all images are resized and cropped to 224 × 224 pixels for training. GoogleTM automatically assigns 85% of the images to the training dataset and the remaining 15% to the validation dataset, which is used to calculate accuracy.

The hardware employed in this study included a 12th-generation Intel Core i9-12900K CPU with 16 cores operating at 3.2–5.2 GHz, an NVIDIA RTX A5000 GPU equipped with 24 GB of error-correction code (ECC) graphics memory, 128 GB of random-access memory (RAM), and a 4 TB solid-state disk (SSD).
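The 85%/15% division described above can be illustrated with a short Python sketch. This is only an approximation of the idea: the source does not say how GoogleTM shuffles the images or whether it splits each class separately, so the shuffle, the seed, the rounding rule, and the file names below are all assumptions made for illustration.

```python
import random


def split_train_validation(items, train_fraction=0.85, seed=0):
    """Shuffle items and split them into training and validation subsets.

    The fixed seed and the rounding rule are illustrative assumptions,
    not documented GoogleTM behaviour.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 348 TB CXRs of training dataset 1.
tb_images = [f"tb_{i:04d}.png" for i in range(348)]
train, validation = split_train_validation(tb_images)
print(len(train), len(validation))
```

With 348 images, this rule places 296 images in the training subset and 52 in the validation subset.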

Dataset for external validation

To evaluate the accuracy of the algorithms, we collected clinical CXR data for TB, normal cases, and pneumonia or other diseases from our hospital.

Validation dataset 1 included 250 de-identified CXRs retrospectively collected from Kaohsiung Veterans General Hospital (VGHKS). The CXRs were dated between January 1, 2010 and February 27, 2023. This dataset included 83 TB cases (81 confirmed by microbiology and 2 confirmed by pathology), 84 normal cases, and 83 abnormal cases other than TB (73 pneumonia, 14 pleural effusion, 10 heart failure, and 4 fibrosis; some cases had combined features). The image size of these CXRs ranged from 1760 to 4280 pixels in width and from 1931 to 4280 pixels in height.

Validation dataset 2 is a smaller dataset derived from validation dataset 1 for comparing the performance of the algorithm and the physicians. It included 50 TB, 33 normal, and 22 abnormal other than TB CXRs (22 pneumonia, 5 pleural effusion, 1 heart failure, and 1 fibrosis). The features of the two validation datasets are provided in Table 1.

Data collected from clinical CXR cases included demographic data (such as age and sex), radiology reports, clinical diagnoses, microbiological reports, and pathology reports. All clinical TB cases included in the study had their diagnosis confirmed by microbiology or pathology, and their CXR was performed within 1 month of TB diagnosis. Normal CXRs were also reviewed by C.F.C., and radiology reports were considered. Pneumonia and other disease cases were identified by reviewing medical records and examinations, with diagnoses made by clinical physicians' judgement and with no evidence of TB detected within a three-month period.

Physician’s performance test

We used validation dataset 2 to evaluate the accuracy of TB detection by five clinical physicians (all board-certified pulmonologists; average experience 10 years, range 5–16 years). Each physician performed the test without additional clinical information and was asked to estimate the probability of TB for each CXR, decide whether sputum TB examinations were needed, and classify the CXR into one of three categories: typical TB pattern, normal pattern, or abnormal pattern (less like TB).

We also collected the radiology reports for validation dataset 2 to evaluate their sensitivity for detecting TB. Reports mentioning suspicion of TB or mycobacterial infection were classified as typical TB pattern. Reports indicating abnormal patterns such as infiltration, opacity, pneumonia, effusion, edema, mass, or tumor (but without mentioning “tuberculosis”, “TB”, or “mycobacterial infection”) were classified as abnormal pattern (less like TB). Reports describing no evident abnormalities were classified as normal pattern. Furthermore, by analyzing the pulmonologists’ decisions regarding sputum TB examinations, we estimated the sensitivity of TB detection in the pulmonologists’ actual clinical practice.

Statistical analysis

Continuous variables are represented as mean ± standard deviation (SD) or median (interquartile range [IQR]), while categorical variables are represented as number (percentage). For accuracy analysis, the receiver operating characteristic (ROC) curve was used to compute the area under the curve (AUC). Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), likelihood ratio (LR), overall accuracy, and F1 score were calculated. A confusion matrix was used to illustrate the accuracy of each AI model. Boxplots were used to evaluate the distribution of the predicted values of the AI models for each etiology subgroup.

The formulas for each accuracy calculation are as follows:

(TP is true positives, TN is true negatives, FP is false positives, FN is false negatives, P is all positives, and N is all negatives.)
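The conventional definitions of these metrics, written with the symbols defined above, are given here as a reconstruction. It assumes that the study used the standard forms, that P = TP + FN and N = TN + FP, and that the likelihood ratio was reported in its positive and/or negative form.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN} = \frac{TP}{P} \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{TN}{N} \\
\text{PPV} &= \frac{TP}{TP + FP} \\
\text{NPV} &= \frac{TN}{TN + FN} \\
\text{LR}^{+} &= \frac{\text{Sensitivity}}{1 - \text{Specificity}}, \qquad
\text{LR}^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}} \\
\text{Overall accuracy} &= \frac{TP + TN}{P + N} \\
\text{F1 score} &= \frac{2\,TP}{2\,TP + FP + FN}
\end{align*}
```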

In this study, model 1 was trained on training dataset 1 (TB vs. normal) to detect the typical TB pattern on CXR. Model 2 was trained on training dataset 2 (abnormal vs. normal) to detect CXR abnormalities. Each model was trained at least 10 times, and the run with the best overall accuracy was chosen. For model 2, twofold data augmentation was performed by zooming in. Model 3 combined model 1 and model 2 by averaging the predictive values of the two models. It was developed to detect both TB and other CXR abnormalities.

Internal validation

The internal validation results calculated during training showed excellent accuracy: model 1 showed a sensitivity of 0.96, a specificity of 0.98, and an overall accuracy of 0.97. Model 2 showed a sensitivity of 0.92, a specificity of 0.92, and an overall accuracy of 0.92. A detailed analysis of the accuracy is provided in Table 2, and the confusion matrix is provided in Table S1. The hyperparameters used in training GoogleTM, the accuracy curve, and the loss function are shown in Figures S1 and S2.

External validation

The accuracy analysis for external validation is shown in Table 3 and Fig. 2 a–d. For the analysis of TB vs. normal, model 1 showed AUC of 0.8 and 0.795 in validation dataset 1 and 2, respectively. Model 2 showed AUC of 0.902 and 0.917. Model 3 demonstrated better accuracy, with AUC of 0.951 and 0.975, respectively. For the analysis of TB vs. normal and abnormal other than TB, model 1 showed AUC of 0.72 and 0.752 in validation dataset 1 and 2, respectively. Model 2 showed AUC of 0.656 and 0.718. Model 3 showed AUC of 0.758 and 0.828.
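For readers unfamiliar with the AUC values quoted above, the following Python sketch computes an AUC as the probability that a randomly chosen positive case receives a higher predictive value than a randomly chosen negative case, counting ties as one half. The labels and scores are made-up toy values, not study data, and the source does not state which software the authors used for the ROC analysis.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = TB, 0 = not TB; scores are hypothetical predictive values.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.97, 0.64, 0.30, 0.50, 0.13, 0.03]
print(roc_auc(labels, scores))
```

In the toy example, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of cases, which is acceptable for datasets of a few hundred images such as those described here.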

Fig. 2. Receiver operating characteristic curves of AI models in external validations. ( a ) Validation dataset 1: TB vs. normal, ( b ) validation dataset 1: TB vs. normal and abnormal other than TB, ( c ) validation dataset 2: TB vs. normal, ( d ) validation dataset 2: TB vs. normal and abnormal other than TB. TB tuberculosis, AI artificial intelligence.

Both datasets revealed that model 3 outperformed models 1 and 2, with the best AUC, overall accuracy, and F1 score. The distribution of the predictive values of models 1 to 3 in each disease subgroup is provided in Figure S3.

Physicians’ performance

Five pulmonologists independently assessed validation dataset 2. The detailed results of the accuracy analysis are presented in Table 4 and Fig. 3 a, b. For the analysis of TB vs. normal, the AUC ranged from 0.936 to 0.995. For TB vs. normal and abnormal other than TB, the AUC ranged from 0.843 to 0.888. The AUC of model 3 was close to, but slightly lower than, those of the five pulmonologists. The overall accuracy and F1 score of model 3 were similar to or even better than those of the pulmonologists. Model 3 had a higher sensitivity than the pulmonologists (0.86 vs. 0.34–0.76), while its specificity was lower (0.65–1.0 vs. 0.85–1.0). When the pulmonologists were combined with model 3 by averaging predictive values, 4 of 5 pulmonologists showed an improved AUC (0.862–0.885, Table 4 and Fig. 4). The radiology reports for validation dataset 2 showed an even lower sensitivity for TB (0.3) and a good specificity (0.98–1.0).

Fig. 3. Receiver operating characteristic curves of model 3 and 5 pulmonologists evaluating validation dataset 2. ( a ) TB vs. normal, ( b ) TB vs. normal and abnormal other than TB. TB tuberculosis. V1–V5 represent the 5 pulmonologists.

Fig. 4. Receiver operating characteristic curves of the 5 pulmonologists combined with model 3, evaluating validation dataset 2 (TB vs. normal and abnormal other than TB). TB tuberculosis. V1–V5 represent the 5 pulmonologists.

Table S2 shows the pulmonologists' decisions on TB sputum exams in each subgroup. The average TB sputum exam rate was 97% for CXRs with a typical TB pattern and 62% for those with an abnormal pattern (less like TB). The average TB sputum exam coverage was 87% for TB cases, 56% for abnormal other than TB cases, and 2% for normal cases.

CXR image patterns and cutoff value evaluation

Based on the average of the five pulmonologists’ interpretations, the CXR image patterns were classified into three categories: typical TB pattern, abnormal pattern (less like TB), and normal pattern. A summary of the predictive values of the AI models and pulmonologists for each CXR image pattern is provided in Table 5. For model 3, the median predictive value was 0.97 (IQR: 0.64–0.99) in typical TB pattern, 0.5 (IQR: 0.5–0.9) in abnormal pattern (less like TB), and 0.03 (IQR: 0.005–0.13) in normal pattern. The boxplot of the distribution of predictive values of model 3 and the pulmonologists is shown in Figure S4. A cross table analyzing the CXR patterns and disease groups of validation dataset 2 is provided in Table S3, showing that only 26 of 50 TB cases (52%) had a typical TB pattern. Meanwhile, 4 of 22 abnormal other than TB cases (18%) presented with a CXR pattern mimicking TB. Figure S5 compares the predictive value of model 3 across disease groups and image pattern subgroups. Model 3 had higher predictive values in CXRs with a typical TB pattern than in those with an abnormal pattern (for both the TB group and the abnormal other than TB group).

Cutoff value evaluation for model 3 is shown in Table S4. At a cutoff value of 0.4, the sensitivity reached 0.92 and 0.94 in validation datasets 1 and 2, respectively. At a cutoff value of 0.8, the specificity was 0.81 and 0.89. When sensitivity was set at 0.90, the specificity was 0.48 and 0.60 in validation datasets 1 and 2, respectively.
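The two kinds of cutoff analysis described above (metrics at a fixed cutoff, and specificity at a fixed sensitivity) can be sketched in Python as follows. The convention that a predictive value at or above the cutoff counts as TB-positive is an assumption, since the source does not state how ties at the cutoff are handled, and the labels and scores are toy values rather than study data.

```python
def sensitivity_specificity(labels, scores, cutoff):
    """Treat score >= cutoff as TB-positive (assumed convention); return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def specificity_at_sensitivity(labels, scores, target_sensitivity):
    """Return (cutoff, specificity) for the highest cutoff whose sensitivity reaches the target."""
    # Lowering the cutoff can only raise sensitivity, so the first hit while
    # scanning from high to low is the cutoff with the best specificity.
    for cutoff in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, cutoff)
        if sens >= target_sensitivity:
            return cutoff, spec
    raise ValueError("target sensitivity not reachable")


# Toy data: 1 = TB, 0 = not TB; scores are hypothetical predictive values.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.85, 0.60, 0.35, 0.90, 0.45, 0.10, 0.02]
print(sensitivity_specificity(labels, scores, 0.4))
print(sensitivity_specificity(labels, scores, 0.8))
print(specificity_at_sensitivity(labels, scores, 0.90))
```

Both functions assume that the data contain at least one positive and one negative case; otherwise the divisions are undefined.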

Deployment of the TB CXR AI

Based on the results of this study, we deployed model 3, which had the best accuracy, as a readily accessible web application (using JavaScript and TensorFlow.js). This TB CXR AI algorithm runs in a web browser and processes data on the user's device, without sending images to the server. The AI algorithm can be accessed via the following URL: https://www.cxrai-prediction.net/ , and CXR interpretation examples are shown in Figs. S6 and S7. We also provide some examples of TB cases that were detected by the AI algorithm but missed by physicians in Fig. S8, and some examples in which the AI algorithm failed to detect TB in Fig. S9.

In this study, the TB CXR AI algorithm, trained via Google Teachable Machine with a relatively small number of images, achieved an acceptable accuracy close to that of professional pulmonologists. It also had a higher sensitivity for TB detection, showing potential to help both specialist and non-specialist physicians improve their TB screening sensitivity.

The TB cases collected in this study had a relatively high percentage (48%) of atypical CXR patterns. This may be due to the older age of our patient group (average 72.7 years in TB patients). The literature also shows that the percentage of typical TB CXR patterns (upper lung predominant) is significantly influenced by the patient’s performance status (PS) 25. For TB patients with good physical activity (PS of 0), a typical CXR pattern was observed in 71% of cases. As performance status worsened, the proportion of typical CXR patterns dropped dramatically (PS = 1: 44%, PS = 2: 19%, PS = 3: 16%, PS = 4: 0%) 25.

Among the AI models established in this study, model 1 had good specificity but lower sensitivity for TB. In particular, we found that this model was not effective at detecting TB with atypical CXR patterns. Model 2 was effective at differentiating abnormal CXRs from normal cases. Model 3 is the combination of models 1 and 2 and gives the average of their predictive values. This ensemble method can balance the detection of typical and atypical TB and compensate for the occasional false positives and false negatives of models 1 and 2. In theory, typical TB cases would have predictive values near 1 for both model 1 and model 2, averaging around 1. For abnormal cases without a typical TB pattern, model 1 might predict values close to 0, while model 2 would remain near 1, giving an average of 0.5. In normal cases, both models would predict values close to 0, resulting in an average also near 0. As evidenced by validation datasets 1 and 2, model 3 achieved the best AUC, which is close to that of clinical experts.
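The averaging behind model 3 can be written as a minimal Python sketch. The function and argument names are hypothetical, and the inputs are the idealized predictive values from the reasoning above rather than outputs of the trained models.

```python
def ensemble_average(p_tb, p_abnormal):
    """Average the predictive values of model 1 (TB vs. normal) and model 2 (abnormal vs. normal)."""
    for p in (p_tb, p_abnormal):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predictive values must lie in [0, 1]")
    return (p_tb + p_abnormal) / 2.0


# Idealized cases from the text (illustrative inputs, not study data).
print(ensemble_average(1.0, 1.0))  # typical TB: both models near 1
print(ensemble_average(0.0, 1.0))  # abnormal without typical TB pattern
print(ensemble_average(0.0, 0.0))  # normal: both models near 0
```

The three calls return 1.0, 0.5, and 0.0, matching the three theoretical cases. The study reports applying the same averaging to combine each pulmonologist's estimated TB probability with the model 3 output.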

Both model 3 and the pulmonologists demonstrate excellent accuracy when evaluating TB vs normal. However, when adding abnormal other than TB (mostly pneumonia), the accuracy decreased remarkably in both model 3 (AUC: 0.975 decrease to 0.828) and pulmonologists (AUC: 0.936–0.995 decrease to 0.843–0.888). Pneumonia and other diseases (e.g. pulmonary fibrosis) may also mimic TB. As pneumonia cases increase, the false positives also increase, and we suggest this is the limitation of CXR TB detection, both for human and AI models. However, our study showed that the integration of AI model with physicians’ clinical judgment could potentially improve the overall accuracy of TB detection.

In terms of the performance of pulmonologists and radiology reports, direct comparisons between them maybe not feasible. Because the pulmonologists are already aware that the study is evaluating TB CXRs, and during the exam, the judgment is made under heightened awareness. Therefore, the sensitivity for TB is better than in real-world clinical practice. In contrast, radiology reports are collected retrospectively, reflecting the radiologists’ daily practice at that time. Awareness of TB in these reports is likely lower. On the other hand, the accuracy difference between the radiology reports (sensitivity: 0.3, overall accuracy: 0.65) and the pulmonologists (sensitivity: 0.34–0.76, overall accuracy: 0.65–0.80) also indicates that increasing physicians' awareness of TB may enhance the accuracy of TB CXR evaluations. In this study, pulmonologists tended to perform more extensive TB sputum examinations (even without clinical information), which reflect the experts’ alertness to improve TB detection (70%-98% exam rate in TB cases). Besides, we suggest TB CXR AI may well potentially improve TB awareness for both specialist and non-specialist physicians.

Although in this study, our model showed a lower accuracy than the 5 commercial TB AI algorithms (specificity 48–60% vs. 61–74%, when sensitivity was fixed at 90%) 14 . However, the TB patients in our study are much older (median age 74 vs. 37 years), and the percentage of typical TB image pattern is lower (52%). This difference may decrease the accuracy of AI model in our study. In fact, previous literature also showed decreased accuracy performance of the 5 commercial TB AI algorithms in older age group (> 60 years, AUC range: 0.805–0.864) 14 . This result is getting close to the accuracy of our model (AUC = 0.828) and the pulmonologists (AUC range: 0.843–0.888) in validation dataset 2.

Recent literature has also discussed the problems about TB CXR AI 26 , including the heterogeneity of accuracy across different populations, determination of prediction value thresholds and their variability, and misjudgments in non-TB patients. Therefore, this study used actual clinical CXRs for external validation to confirm accuracy in clinical situation. The determination of thresholds is both a strength and limitation of AI models. Therefore, this study also conducted a cutoff value evaluation to help determine the relationship of predictive value and accuracy.

The limitations of this study were as follows. First, the image recognition of GoogleTM operates on a relatively small resolution (224 × 224 pixels). Therefore, this AI algorithm can only identify large and obvious image features, and small lung lesions may be missed. Second, the AI model used in this study could not locate lesions. Third, this is a single center retrospective study, and the size of the validation dataset is relatively small. The accuracy result may not be generalizable to different CXR machine and settings. Fourth, this AI model is not optimal for detect TB cases without a typical TB pattern. However, physicians also have the similar limitation. Fifth, we did not evaluate the accuracy of radiologists. However, the retrospectively collected radiology reports may reflect the accuracy of daily clinical practice of radiologists. Finally, we did not evaluate the accuracy of frontline medical staffs such as junior residents and nurse practitioners. However, we can expect their accuracy for TB detection would be lower than expert physicians. And AI algorithm may be more helpful for them.

In conclusion, this study developed an open and free AI algorithm that is effective in detecting typical TB features on CXR. The accuracy is acceptable and may be close to that of clinical experts. We suggest a predictive value > 0.9 for high TB probability. For a predictive value of 0.5–0.9, an abnormal pattern is favored, and TB may be considered. For a predictive value < 0.4, TB is unlikely. Further research with larger-scale validation is required to evaluate the generalizability of the algorithm and to compare its performance in different populations.
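The suggested cutoffs can be written down as a small lookup. This is only an illustrative sketch: the function name and the returned wording are hypothetical. The text gives no interpretation for values from 0.4 up to 0.5, so the sketch reports that range as unspecified rather than guessing.

```python
def interpret_predictive_value(predictive_value):
    """Map a model predictive value in [0, 1] to the interpretation
    suggested in the conclusion. The 0.4-0.5 range is not covered by
    the text, so it is reported as unspecified here."""
    if not 0.0 <= predictive_value <= 1.0:
        raise ValueError("predictive value must lie between 0 and 1")
    if predictive_value > 0.9:
        return "high TB probability"
    if predictive_value >= 0.5:
        return "abnormal pattern favored; TB may be considered"
    if predictive_value < 0.4:
        return "TB unlikely"
    return "not specified in the text"


example_readings = [0.95, 0.7, 0.45, 0.1]
example_interpretations = [interpret_predictive_value(v) for v in example_readings]
```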

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

World Health Organization. Global Tuberculosis Report 2021 . https://www.who.int/publications/i/item/9789240037021 (World Health Organization, 2021).

Perez-Guzman, C., Torres-Cruz, A., Villarreal-Velarde, H. & Vargas, M. H. Progressive age-related changes in pulmonary tuberculosis images and the effect of diabetes. Am. J. Respir. Crit. Care Med. 162 (5), 1738–1740. https://doi.org/10.1164/ajrccm.162.5.2001040 (2000).

Mathur, M., Badhan, R. K., Kumari, S., Kaur, N. & Gupta, S. Radiological manifestations of pulmonary tuberculosis—A comparative study between immunocompromised and immunocompetent patients. J. Clin. Diagn. Res. 11 (9), TC06–TC09. https://doi.org/10.7860/JCDR/2017/28183.10535 (2017).

Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 15 (11), e1002686. https://doi.org/10.1371/journal.pmed.1002686 (2018).

Cohen, J. P., Bertin, P., & Frappier, V. Chester: A Web Delivered Locally Computed Chest X-Ray Disease Prediction System . arXiv: https://arxiv.org/abs/1901.11210 (2020).

Al-Antari, M. A., Hua, C. H., Bang, J. & Lee, S. Fast deep learning computer-aided diagnosis of COVID-19 based on digital chest x-ray images. Appl. Intell. (Dordr). 51 (5), 2890–2907 (2021).

Hwang, E. J. et al. Development and validation of a deep learning-based automatic detection algorithm for active pulmonary tuberculosis on chest radiographs. Clin. Infect. Dis. 69 (5), 739–747. https://doi.org/10.1093/cid/ciy967 (2019).

Wang, C. et al. Development and validation of an abnormality-derived deep-learning diagnostic system for major respiratory diseases. NPJ Digit. Med. 5 (1), 124 (2022).

Hwang, E. J. et al. Development and validation of a deep learning-based automated detection algorithm for major thoracic diseases on chest radiographs. JAMA Netw. Open 2 (3), e191095 (2019).

Nam, J. G. et al. Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology 290 (1), 218–228 (2019).

Sze-To, A., Riasatian, A. & Tizhoosh, H. R. Searching for pneumothorax in X-ray images using autoencoded deep features. Sci. Rep. 11 (1), 9817 (2021).

Murphy, K. et al. Computer aided detection of tuberculosis on chest radiographs: An evaluation of the CAD4TB v6 system. Sci. Rep. 10 (1), 5492 (2020).

Singh, M. et al. Evolution of machine learning in tuberculosis diagnosis: A review of deep learning-based medical applications. Electronics 11 (17), 2634 (2022).

Qin, Z. Z. et al. Tuberculosis detection from chest X-rays for triaging in a high tuberculosis-burden setting: An evaluation of five artificial intelligence algorithms. Lancet Digit. Health 3 (9), e543–e554 (2021).

Akhter, Y., Singh, R. & Vatsa, M. AI-based radiodiagnosis using chest X-rays: A review. Front. Big Data 6 , 1120989 (2023).

Miyazaki, A. et al. Computer-aided diagnosis of chest X-ray for COVID-19 diagnosis in external validation study by radiologists with and without deep learning system. Sci. Rep. 13 (1), 17533. https://doi.org/10.1038/s41598-023-44818-9 (2023).

Abad, M., Casas-Roma, J. & Prados, F. Generalizable disease detection using model ensemble on chest X-ray images. Sci. Rep. 14 (1), 5890. https://doi.org/10.1038/s41598-024-56171-6 (2024).

Teachable Machine: Train a Computer to Recognize Your Own Images, Sounds, & Poses . https://teachablemachine.withgoogle.com/

Jaeger, S. et al. Automatic tuberculosis screening using chest radiographs. IEEE Trans. Med. Imaging. 33 (2), 233–245. https://doi.org/10.1109/TMI.2013.2284099 (2014).

Candemir, S. et al. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans. Med. Imaging 33 (2), 577–590. https://doi.org/10.1109/TMI.2013.2290491 (2014).

Kaggle. RSNA Pneumonia Detection Challenge [ Online ]. https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data . Accessed 14 June 2021 (2021).

Rahman, T. et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8 , 191586–191601. https://doi.org/10.1109/ACCESS.2020.3031384 (2020).

Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M. & Summers, R.M. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition ( CVPR ). 3462–3471 (IEEE, 2017).

Carney, M. et al . Teachable machine: Approachable web-based tool for exploring machine learning classification. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems ( CHI EA '20 ). https://doi.org/10.1145/3334480.3382839 (Association for Computing Machinery, 2020).

Goto, A. et al. Factors associated with atypical radiological findings of pulmonary tuberculosis. PLoS One. 14 (7), e0220346. https://doi.org/10.1371/journal.pone.0220346 (2019).

Geric, C. et al. The rise of artificial intelligence reading of chest X-rays for enhanced TB diagnosis and elimination. Int. J. Tuberc. Lung Dis. 27 (5), 367–372. https://doi.org/10.5588/ijtld.22.0687 (2023).

Acknowledgements

We thank Mrs. Yu-Jung Chang for assisting literature search.

Author information

Authors and affiliations.

Division of Chest Medicine, Department of Internal Medicine, Kaohsiung Veterans General Hospital, Kaohsiung, Taiwan, R.O.C.

Chiu-Fan Chen, Chun-Hsiang Hsu, You-Cheng Jiang, Wen-Ren Lin, Wei-Cheng Hong, I.-Yuan Chen, Min-Hsi Lin, Kuo-An Chu & David Lin Lee

Shu-Zen Junior College of Medicine and Management, Kaohsiung, Taiwan, R.O.C.

Chiu-Fan Chen

Department of Nursing, Mei-Ho University, Pingtung, Taiwan, R.O.C.

Chiu-Fan Chen & Chao-Hsien Lee

Department of Obstetrics and Gynecology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Po-Fan Chen

Quality Center, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Contributions

C.F.C.: Conceptualization, Methodology, Investigation, Formal analysis, Data Curation, Writing—original draft, Writing—review & editing; P.F.C: Conceptualization, Supervision, Software, Investigation, Resources, Writing—review & editing. C.H.H., Y.C.J., W.R.L., W.C.H., I.Y.C.: Validation, Investigation. C.H.L.: Methodology, Investigation, Formal analysis. M.H.L., K.A.C., D.L.L.: Investigation, Resources. All of the authors contributed to and approved the final version of the manuscript.

Corresponding author

Correspondence to Po-Fan Chen .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Chen, CF., Hsu, CH., Jiang, YC. et al. A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography. Sci Rep 14 , 14917 (2024). https://doi.org/10.1038/s41598-024-65703-z

Received : 12 November 2023

Accepted : 24 June 2024

Published : 28 June 2024

DOI : https://doi.org/10.1038/s41598-024-65703-z

Artificial intelligence
Chest X-ray
Deep learning
Neural network

Master's Theses

A machine learning approach to network intrusion detection system using k nearest neighbor and random forest.

Ilemona S. Atawodi, University of Southern Mississippi

Date of Award

Spring 2019

Degree Type

Masters Thesis

Degree Name

Master of Science (MS)

Committee Chair

Zhaoxian Zhou

Committee Chair School

Computing Sciences and Computer Engineering

Committee Member 2

Chaoyang Zhang

Committee Member 2 School

Committee Member 3

Kuo Lane Chen

Committee Member 3 School

The evolving area of cybersecurity presents a dynamic battlefield for cyber criminals and security experts. Intrusions have now become a major concern in cyberspace. Different methods are employed in tackling these threats, but there is now more need than ever to update the traditional methods, moving beyond rudimentary approaches such as manually updated blacklists and whitelists. Another method involves manually creating rules; this remains one of the most common methods to date.

Much similar research has recently involved incorporating machine learning and artificial intelligence into both host- and network-based intrusion systems. Doing this originally presented problems of low accuracy, but the growth in the area of machine learning over the last decade has led to vast improvements in machine learning algorithms and their requirements.

This research applies k nearest neighbours with 10-fold cross-validation and random forest machine learning algorithms to a network-based intrusion detection system in order to improve its accuracy. This project focused on specific feature selection to increase the detection accuracy, using K-fold cross-validation with the random forest algorithm on approximately 126,000 samples of the NSL-KDD dataset.
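To make the evaluation procedure concrete, here is a minimal standard-library sketch of k nearest neighbours scored with 10-fold cross-validation. It is not the thesis code: the two-feature synthetic "normal" and "attack" records stand in for NSL-KDD features, the choice of k = 3 is an assumption, and the random forest step is not reproduced.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest
    training points (Euclidean distance)."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def k_fold_accuracy(X, y, n_folds=10, k=3, seed=0):
    """Mean accuracy over n_folds folds; each fold is held out once
    while the remaining folds serve as the training set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds


# Synthetic, well-separated records (hypothetical stand-ins for real traffic).
rng = random.Random(42)
normal = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
attack = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
X = normal + attack
y = ["normal"] * 50 + ["attack"] * 50

mean_accuracy = k_fold_accuracy(X, y)
```

Averaging over ten held-out folds gives an accuracy estimate that does not depend on a single train/test split, which is the usual reason for pairing a simple classifier such as kNN with cross-validation.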

2019, Ilemona S. Atawodi

Recommended Citation

Atawodi, Ilemona S., "A Machine Learning Approach to Network Intrusion Detection System Using K Nearest Neighbor and Random Forest" (2019). Master's Theses . 651. https://aquila.usm.edu/masters_theses/651

Since May 16, 2019

Included in

Computational Engineering Commons , Computer Engineering Commons


Title: Towards Reducing Data Acquisition and Labeling for Defect Detection Using Simulated Data

Abstract: In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Malware detection using machine learning

Masters Thesis
Donepudi, Naveen
Lee, Wonjun
Wiegley, Jeffrey
McIlhenny, Robert
Wang, Taehyung
Computer Science
California State University, Northridge
medium sized dataset
recurrent neural network (RNN)
one-sided perceptron
convolutional neural network
Machine learning
Malware detection
Dissertations, Academic -- CSUN -- Computer Science.
random forest
decision tree
http://hdl.handle.net/10211.3/224558
by Naveen Donepudi

A review of machine learning and deep learning algorithms for Parkinson's disease detection using handwriting and voice datasets

Affiliations.

1 Department of Robotics and Mechatronics Engineering, University of Dhaka, Nilkhet Rd, Dhaka, 1000, Bangladesh.
2 Institute of Electronics, Bangladesh Atomic Energy Commission, Dhaka, 1207, Bangladesh.
3 Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka, 1000, Bangladesh.
4 Moulvibazar Polytechnic Institute, Bangladesh.
PMID: 38356538
PMCID: PMC10865258
DOI: 10.1016/j.heliyon.2024.e25469

Parkinson's Disease (PD) is a prevalent neurodegenerative disorder with significant clinical implications. Early and accurate diagnosis of PD is crucial for timely intervention and personalized treatment. In recent years, Machine Learning (ML) and Deep Learning (DL) techniques have emerged as promising tools for improving PD diagnosis. This review paper presents a detailed analysis of the current state of ML and DL-based PD diagnosis, focusing on voice, handwriting, and wave spiral datasets. The study also evaluates the effectiveness of various ML and DL algorithms, including classifiers, on these datasets and highlights their potential in enhancing diagnostic accuracy and aiding clinical decision-making. Additionally, the paper explores the identification of biomarkers using these techniques, offering insights into improving the diagnostic process. The discussion encompasses different data formats and commonly employed ML and DL methods in PD diagnosis, providing a comprehensive overview of the field. This review serves as a roadmap for future research, guiding the development of ML and DL-based tools for PD detection. It is expected to benefit both the scientific community and medical practitioners by advancing our understanding of PD diagnosis and ultimately improving patient outcomes.

Keywords: Deep learning (DL); Diagnosis; Disease prediction; Machine learning (ML); Parkinson's disease (PD).

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Parkinson's disease (PD) manifests itself in significant ways, both motor and non-motor.

Techniques of machine learning and deep learning applied toward the diagnosis of Parkinson's…

The process of review.

PRISMA flow diagram of the review work.

Parkinson's Disease (PD) diagnostic modality categorization.

Datasets that are used in various researches.

Overall performance conducted in various researches.

Number of methods (in %) used in UCI dataset.

Number of methods (in %) used in handwriting dataset.

Designing practical implications of a machine learning-based Parkinson's disease prediction system.

Future research directions of the Parkinson's disease detection methods based on machine learning…

Early Detection of Emotional Issues in High School Students: A Machine Learning Approach

Early detection and diagnosis of cancer with interpretable machine learning to uncover cancer-specific DNA methylation patterns

Izzy Newsham, Marcin Sendera, Sri Ganesh Jammula, Shamith A Samarajiwa, Early detection and diagnosis of cancer with interpretable machine learning to uncover cancer-specific DNA methylation patterns, Biology Methods and Protocols , Volume 9, Issue 1, 2024, bpae028, https://doi.org/10.1093/biomethods/bpae028

Cancer, a collection of more than two hundred different diseases, remains a leading cause of morbidity and mortality worldwide. Usually detected at the advanced stages of disease, metastatic cancer accounts for 90% of cancer-associated deaths. Therefore, the early detection of cancer, combined with current therapies, would have a significant impact on survival and treatment of various cancer types. Epigenetic changes such as DNA methylation are some of the early events underlying carcinogenesis. Here, we report on an interpretable machine learning model that can classify 13 cancer types as well as non-cancer tissue samples using only DNA methylome data, with 98.2% accuracy. We utilize the features identified by this model to develop EMethylNET, a robust model consisting of an XGBoost model that provides information to a deep neural network that can generalize to independent data sets. We also demonstrate that the methylation-associated genomic loci detected by the classifier are associated with genes, pathways and networks involved in cancer, providing insights into the epigenomic regulation of carcinogenesis.

Cancer remains one of the most challenging human diseases, with over 19 million cases and 10 million deaths reported annually [ 1 ]. The increase of an ageing population worldwide, together with exposure to environmental carcinogens, and lifestyle choices such as poor diets, smoking and lack of physical activity contribute to the worldwide increase in cancer incidences. The evolutionary nature of cancer, complex interactions with the tissue micro-environment and host immune system, engender heterogeneity and make the pursuit and development of interventions difficult. Therefore, early detection and diagnosis of cancer, leading to better interventions and increased survival, remain one of the more effective avenues in combating cancer.

Each of our somatic cells contains a single identical genome, incorporating the information necessary to specify and maintain our characteristics. In contrast, each cell will exhibit multiple epigenomes that change during different cellular states and over the passage of time. These epigenomes consist of a collection of reversible chromatin structures, interactions and modifications that do not change the DNA sequence and may be heritable across progeny cells. Histone variations, post translational modifications of the amino terminal tails of histone proteins, and covalent modification of DNA are some of the factors that contribute to epigenomic change. Notably, covalent methylation of DNA is one such reversible chemical modification with many functional consequences, and evidence for its role in embryonic development, cell differentiation, genomic imprinting, X chromosome inactivation, repression of regulatory elements, genome maintenance and the regulation of gene expression has accumulated in the last few decades [ 2 ].

Aberrant DNA methylation is observed in many cancers. CpG island promoter hypermethylation of tumour suppressor genes is an early neoplastic event in many tumours [ 3–6 ]. In addition, global DNA hypomethylation can lead to chromosomal instability, activation of oncogenes and latent retrotransposons that promote carcinogenesis [ 7 ]. Hypomethylation is seen in many cancer types, including cervical, prostate, hepatocellular, breast, brain, and leukaemia [ 8–11 ]. These hyper- and hypo-methylation patterns can serve as cancer-associated signals and prognostic biomarkers. They are of particular use for early detection of cancer, as epigenetic modifications are some of the earliest neoplastic events associated with carcinogenesis [ 12 , 13 ]. Computational methods that detect these complex neoplastic methylation patterns can thus assist in cancer early detection, diagnosis, and screening. Here, we developed both binary and multiclass machine learning models to identify multiple cancer types from non-cancerous tissue samples. An expanding corpus of literature supports the use of classification methods trained on DNA methylation changes to identify carcinogenic signatures. Some of the more relevant works are reviewed in Table 1 . Moreover, we previously demonstrated that machine learning models, leveraging DNA methylation data from 1228 tissue samples can accurately classify pathological subtypes of renal tumours [ 14 ]. In this study, we introduce a multiclass deep neural network, EMethylNET: Explainable Methylome Neural network for Evaluation of Tumours . EMethylNET is robust, generalizable, and interpretable and demonstrates high predictive accuracy.

Table 1. Summary of related studies, including EMethylNET, detailing the model type, number of train/test data sources (and total sample number), number of external validation data sources (and total sample number), number of CpGs input into the model and number of CpGs used by the model.

Work	Model type	Train/test data sources (n)	External validation data sources (n)	CpGs input to the model	CpGs used by the model
Hao 2017 [18]	LASSO	1 (2676)	1 (718)
Tang 2017 [19]	Random forest	1 (5379)	7 (504)	9-998
Capper 2018 [20]	Random forest	1 (3905)	5 (401)	10000
Peng 2018 [21]	LASSO	1 (1478)	3 (267)		128
Ding 2019 [22]	Logistic regression	1 (7605)	6 (742)	12	12
Zheng 2020 [23]	DNN	1 (7339)	12 (972)	10360	10360
Koelsche 2021 [24]	Random forest	1 (1077)	4 (428)	10000
Liu 2021 [25]	XGBoost	1 (7224)	0	294
Modhurkur 2021 [26]	Random forest	9 (9303)	0	2978
Ibrahim 2022 [27]	PLSDA	1 (6502)	10 (1595)	20	20
Kuschel 2022 [28]	Random forest	3 (369)	0	50000
Zhang 2023 [29]	Linear support vector classifier	1 (781)	1 (4702)	1588
EMethylNET	XGBoost and DNN	1 (6224)	9 (940)	276016	3388

Microarray-based methylation analysis

Methylome microarray data were obtained from The Cancer Genome Atlas (TCGA) GDC data portal ( https://www.cancer.gov/ccg/research/genome-sequencing/tcga , RRID: SCR_003193). The data sets utilized were from the Illumina Infinium Human DNA Methylation 450 platform, and 13 human cancer types with at least 15 normal samples were analysed. Metadata were also obtained from the TCGA data portal. Supplementary Table S12 shows the number of cancer and normal samples for each cancer type.

In addition to the TCGA data, a number of data sets from independent studies were also used in model evaluation. Eight of the independent data sets were from the Illumina Infinium Human DNA Methylation 450 platform, and one (ESCA 2) was from the Illumina Infinium Methylation EPIC array platform. The number of cancer and normal samples for each independent data set is shown in Supplementary Table S13 . The sources of each data set are: breast cancer (BRCA): GSE52865, colon adenocarcinoma (COAD): GSE77955 (only samples from sites colon, left colon, right colon, and sigmoid are taken), esophageal carcinoma (ESCA): GSE72874, ESCA 2: EGAD00010001822 and EGAD00010001834, head and neck squamous cell carcinoma (HNSC): GSE38266 (note that half of these samples are HPV+), kidney renal clear cell carcinoma (KIRC): GSE61441, liver hepatocellular carcinoma (LIHC): GSE75041, prostate adenocarcinoma (PRAD): project PRAD-CA from ICGC, thyroid carcinoma (THCA): GSE97466. For the COAD and THCA independent data sets, details regarding the adenoma samples were obtained from their metadata.

Data pre-processing

Classification models and metrics.

Throughout this study, we used both binary and multiclass models. Each binary model compared one tissue type, distinguishing cancer from normal, and the multiclass models utilized all 13 tissue types and normal samples. Note that in the binary models, the normal class was only normal samples for that tissue, whereas in the multiclass models, the normal class was normal samples from all tissue types pooled together. For each model, the input data were split into training and test sets, with 25% of samples in the test sets.
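The labelling and splitting scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, the toy sample counts are invented, and the split is stratified by class here although the text only states that 25% of samples went to the test set.

```python
import random
from collections import defaultdict


def multiclass_label(tissue, is_cancer):
    """Cancer samples keep their tissue type as the class; normal samples
    from every tissue are pooled into one shared 'normal' class."""
    return tissue if is_cancer else "normal"


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with roughly test_fraction of
    each class placed in the test set. Stratification is an assumption."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Toy cohort: (tissue, is_cancer) pairs with invented counts.
samples = (
    [("BRCA", True)] * 8 + [("BRCA", False)] * 4
    + [("COAD", True)] * 8 + [("COAD", False)] * 4
)
labels = [multiclass_label(tissue, is_cancer) for tissue, is_cancer in samples]
train_idx, test_idx = stratified_split(labels)
```

For a binary model the same split would be applied to one tissue at a time, with the normal class restricted to that tissue's normal samples.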

To begin with, we tested two simple classification models: logistic regression and an SVM. Both models were created and tuned using the package sklearn [ 20 ] in Python (version 3.7.5). Hyperparameter tuning on the training set using 5-fold cross-validation selected the default values in most cases, except for the binary logistic regression using the Newton solver and the multiclass SVM using gamma = ‘auto’.

An XGBoost model based on gradient boosted decision trees was created, using the XGBoost ( https://xgboost.ai/ , RRID: SCR_021361) Python package [ 21 ]. Hyperparameter tuning for the binary models resulted in 450 estimators with a maximum depth of 10 and a learning rate of 0.189. The multiclass model had 800 estimators with a maximum depth of 3 and the same learning rate. In this model, 50% of features were randomly sampled when constructing each tree and 50% of samples were taken in each iteration, which helped to prevent over-fitting.
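The reported hyperparameters can be collected as keyword arguments for XGBoost's scikit-learn style classifier. The numeric values come from the text; the mapping onto the parameter names `colsample_bytree` and `subsample`, and the reading that the 50% sampling applies to the multiclass model only, are assumptions. The block only builds the dictionaries, so it runs without XGBoost installed.

```python
# Hyperparameters reported for the binary models.
binary_params = {
    "n_estimators": 450,
    "max_depth": 10,
    "learning_rate": 0.189,
}

# Hyperparameters reported for the multiclass model. The two sampling
# settings are mapped to XGBoost parameter names by assumption:
#   colsample_bytree: fraction of features sampled for each tree
#   subsample: fraction of samples drawn in each boosting iteration
multiclass_params = {
    "n_estimators": 800,
    "max_depth": 3,
    "learning_rate": binary_params["learning_rate"],
    "colsample_bytree": 0.5,
    "subsample": 0.5,
}

# With XGBoost installed, the dictionaries could be passed on like this:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**multiclass_params)
```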

We also report the area under the curve (AUC) for both receiver operating characteristic (ROC) curves and precision–recall curves. For both metrics, one is a perfect score and 0.5 is the score from a random model (assuming classes are balanced). For the multiclass AUCs, an AUC was generated for each class using the one-vs-all strategy, which reflects the model’s ability to distinguish each class from the rest of the classes.
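As a sketch of the one-vs-all strategy, the ROC AUC for each class can be computed by treating that class as positive and all other classes as negative. The implementation below uses the rank interpretation of ROC AUC (the probability that a positive sample scores above a negative one, counting ties as one half). The function names, class list, and probabilities are invented for illustration; precision-recall AUC is not covered.

```python
def roc_auc(scores, is_positive):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_all_auc(probabilities, labels, classes):
    """One AUC per class: column j of the probability matrix is scored
    against 'is this sample of class j?'."""
    return {
        cls: roc_auc([row[j] for row in probabilities],
                     [label == cls for label in labels])
        for j, cls in enumerate(classes)
    }


# Invented predicted probabilities for six samples and three classes.
classes = ["BRCA", "COAD", "normal"]
labels = ["BRCA", "BRCA", "COAD", "COAD", "normal", "normal"]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.4, 0.3, 0.3],
]
auc_per_class = one_vs_all_auc(probabilities, labels, classes)
```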

Biological feature interpretability

Multiclass PCC importance analysis.

Importances of the multiclass probes contributing to classification (PCCs) were obtained from the trained XGBoost model, which used the gain measure as the feature importance.

SHAP values

The shap package in Python was used for analysing SHAP values of the multiclass DNN [ 25 ]. A stratified sample of 10% of the training set was used as the background set and a stratified sample of 10% of the whole data set (training and test) was used to calculate the SHAP values.

Probe annotation and mapping

Probes with an XGBoost importance score > 0 were mapped to genes that were overlapping or that had overlapping promoter regions (taken as the 1500 base pair window upstream of the transcription start site). Each probe was mapped to all genes that fulfilled this property. Then, we went through the multiclass probe list manually and refined probes that mapped to multiple genes, removing mapped genes where it was obvious that the gene was not being affected by the probe. This process removed 161 genes from the multiclass gene list. The gene annotation data were obtained from Ensembl (version 101) using the R package biomaRt ( https://bioconductor.org/packages/biomaRt/ , RRID: SCR 019214) [ 26 , 27 ], and the mapping functionality was implemented using the R package, ChIPpeakAnno ( http://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html , RRID: SCR 012828) [ 28 ].
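
The mapping rule can be sketched in a few lines. The study used ChIPpeakAnno in R; the Python version below only illustrates the overlap logic, and the gene names and coordinates are hypothetical.

```python
from typing import NamedTuple

PROMOTER_BP = 1500  # promoter window upstream of the transcription start site


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def genes_for_probe(chrom, pos, genes):
    """Return all genes whose body or upstream promoter window contains the probe."""
    hits = []
    for gene in genes:
        if gene.chrom != chrom:
            continue
        if gene.strand == "+":
            # The TSS is at the start, so the promoter extends to lower coordinates.
            lo, hi = gene.start - PROMOTER_BP, gene.end
        else:
            # The TSS is at the end, so the promoter extends to higher coordinates.
            lo, hi = gene.start, gene.end + PROMOTER_BP
        if lo <= pos <= hi:
            hits.append(gene.name)
    return hits


# Hypothetical annotation used only for this example.
genes = [
    Gene("GENE_A", "chr1", 10_000, 20_000, "+"),
    Gene("GENE_B", "chr1", 21_000, 25_000, "-"),
]
```

For example, `genes_for_probe("chr1", 9000, genes)` falls in the promoter window of the hypothetical GENE_A, whereas position 20,500 lies downstream of GENE_A and outside GENE_B, so it maps to no gene.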

Differential methylation analysis

Differential methylation analysis was performed using the R package TCGAbiolinks [ 15 ], and the input data were M-values of the probes after filtering (see Data pre-processing). Differentially methylated probes were identified using the Wilcoxon test with Benjamini-Hochberg false discovery rate adjustment. Probes with an adjusted P-value < .01 and an absolute mean difference above 2 were selected.
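
The selection step can be sketched as follows, assuming the raw Wilcoxon P-values and mean M-value differences have already been computed. The analysis itself was run with TCGAbiolinks in R; the probe identifiers and numbers below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_probes(probe_ids, p_values, mean_diffs, alpha=0.01, min_abs_diff=2.0):
    """Keep probes with adjusted P < alpha and |mean difference| > min_abs_diff."""
    adjusted = benjamini_hochberg(p_values)
    return [
        pid
        for pid, q, diff in zip(probe_ids, adjusted, mean_diffs)
        if q < alpha and abs(diff) > min_abs_diff
    ]


# Hypothetical inputs for four probes.
probe_ids = ["probe_1", "probe_2", "probe_3", "probe_4"]
p_values = [0.0001, 0.004, 0.03, 0.5]
mean_diffs = [2.5, -1.0, 3.0, 0.1]
adjusted = benjamini_hochberg(p_values)
selected = select_probes(probe_ids, p_values, mean_diffs)
```

Here only the first probe passes both thresholds: the second is significant but has a small effect, and the third has a large effect but is not significant after adjustment.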

Enrichment analysis

Gene ontology over-representation analysis.

Functional enrichment analysis was carried out using the R package gprofiler2 ( https://biit.cs.ut.ee/gprofiler/page/r , RRID: SCR 018190) [ 29 ] with the Bonferroni correction method. The background set was the XGBoost input probes (i.e. the microarray probe list after filtering) mapped to genes. The result was then visualized with REVIGO ( http://revigo.irb.hr/ , RRID: SCR 005825) [ 30 ] using the settings: small, Homo sapiens GO terms, SimRel similarity. The scatter plot in Fig. 5a was based on the R script provided by REVIGO, and the visible labels are the twenty most significant terms with four or more parents in the GO Biological Process hierarchy (to avoid very general terms).

Gene set over-representation analysis

Fisher’s exact tests were performed to assess the overlap of the multiclass gene list with two cancer gene sets, the COSMIC Cancer Gene Census [ 31 ] ( https://cancer.sanger.ac.uk/census , RRID: SCR 002260) and the OncoKB ( https://www.oncokb.org/ , RRID: SCR 014782) Cancer Gene List [ 32 ], and with the transcriptional regulators in the TF Checkpoint 2.0 resource ( https://www.tfcheckpoint.org , RRID: SCR 023880). In these analyses, we only included genes present in our background gene set, i.e. the microarray probe list (after filtering) mapped to genes.
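
The over-representation test can be sketched through the hypergeometric distribution. The sketch assumes a one-sided (enrichment) alternative, which the text does not specify, and the counts in the example are hypothetical rather than taken from the study.

```python
from math import comb


def fisher_enrichment_p(overlap, list_size, set_size, background_size):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric null.

    list_size genes are drawn from background_size genes, of which set_size
    belong to the reference gene set.
    """
    max_overlap = min(list_size, set_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, a gene list of 5,
# a reference set of 4, and 3 genes shared between them.
p_value = fisher_enrichment_p(overlap=3, list_size=5, set_size=4, background_size=20)
```

Restricting both lists to the background set before counting, as described above, is what makes the hypergeometric null appropriate.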

Text mining

The Pangaea package [ 33 ] was used for text mining of over four million cancer-related PubMed ( https://pubmed.ncbi.nlm.nih.gov/ , RRID: SCR 004846) abstracts (downloaded in 2020). We analysed the abstracts that referred to at least one of our multiclass genes, a total of 183,909 abstracts. The output of Pangaea is available as an Excel spreadsheet in the supplementary data ( Supplementary File 1 ).

Pathway enrichment analysis and visualization

KEGGprofile [ 34 ] was used for gene set enrichment of KEGG pathways for the multiclass gene list. Transformation of the Ensembl IDs to Entrez gene IDs was required (losing some unmappable genes in the process), and the background set was the microarray probe list (after filtering) mapped to genes. For visualization, KEGG pathways were retrieved using the R package KEGGgraph ( https://bioconductor.org/packages/KEGGgraph/ , RRID: SCR 023788) [ 35 ] and KEGG IDs were converted to Ensembl IDs using the R package biomaRt [ 26 , 27 ]. Pathways were visualized with the NetworkX Python package ( https://networkx.org/ , RRID: SCR 016864) [ 36 ], and only multiclass genes were shown. For each multiclass gene, the difference in average methylation between cancer and normal is displayed as the node colour. More specifically, for each cancer type and each PCC, the mean M-value of the cancer samples minus the mean M-value of the normal samples for that cancer type was taken. Where multiple PCCs mapped to the same gene, the PCC with the maximum absolute difference was taken.
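
The node colour calculation for a single cancer type can be sketched as follows. The function name, probe identifiers, gene names, and M-values are hypothetical and only illustrate the rule described above.

```python
from statistics import mean


def gene_node_values(probe_gene_pairs, cancer_m, normal_m):
    """For each gene, keep the PCC difference with the largest absolute value.

    The difference is the mean cancer M-value minus the mean normal M-value.
    """
    best = {}
    for probe, gene in probe_gene_pairs:
        diff = mean(cancer_m[probe]) - mean(normal_m[probe])
        if gene not in best or abs(diff) > abs(best[gene]):
            best[gene] = diff
    return best


# Hypothetical M-values for one cancer type; two PCCs map to GENE_A.
probe_gene_pairs = [("pcc_1", "GENE_A"), ("pcc_2", "GENE_A"), ("pcc_3", "GENE_B")]
cancer_m = {"pcc_1": [1.0, 2.0], "pcc_2": [-3.0, -1.0], "pcc_3": [0.0, 1.0]}
normal_m = {"pcc_1": [0.5, 0.5], "pcc_2": [0.0, 1.0], "pcc_3": [0.0, 0.0]}
node_values = gene_node_values(probe_gene_pairs, cancer_m, normal_m)
```

The sign of the retained difference is preserved, so hypomethylated and hypermethylated genes can be given different colours.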

For the visualization of the pathway network, sixty cancer-related KEGG pathways were collected. Only pathways with more than three multiclass genes were kept (resulting in fifty-six pathways). Interaction data were collected for all multiclass genes from STRING ( http://string.embl.de/ , RRID: SCR 005223) [ 37 ] (using all interactions at the default confidence threshold of 0.4), GeneMania ( http://genemania.org/ , RRID: SCR 005709) [ 38 ] (with all data sources selected) and GeneWalk ( https://github.com/churchmanlab/genewalk , RRID: SCR 023787) [ 39 ]. These pathways and interaction data were visualized as a network with Cytoscape software ( http://cytoscape.org , RRID: SCR 003032). Each node represented a pathway, and the multiclass genes in that pathway were visualized as smaller shapes around the nodes. The interaction data were summarized into pathway interactions: if a gene in one pathway interacted with a gene in a different pathway, an edge was drawn between those two pathways. In addition, data from the COSMIC Cancer Gene Census, version 93 [ 31 ], were integrated.
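
The pathway filter and the summary of gene interactions into pathway edges can be sketched as follows. The network in the study was built in Cytoscape; the pathway names, gene names, and interactions here are hypothetical.

```python
from itertools import product


def pathway_edges(pathway_genes, gene_interactions, min_genes=4):
    """Summarize gene-gene interactions as edges between distinct pathways.

    Pathways with fewer than min_genes multiclass genes are dropped first,
    i.e. only pathways with more than three genes are kept by default.
    """
    kept = {p: genes for p, genes in pathway_genes.items() if len(genes) >= min_genes}

    gene_to_pathways = {}
    for pathway, genes in kept.items():
        for gene in genes:
            gene_to_pathways.setdefault(gene, set()).add(pathway)

    edges = set()
    for gene_a, gene_b in gene_interactions:
        pathways_a = gene_to_pathways.get(gene_a, ())
        pathways_b = gene_to_pathways.get(gene_b, ())
        for pa, pb in product(pathways_a, pathways_b):
            if pa != pb:
                edges.add(tuple(sorted((pa, pb))))
    return edges


# Hypothetical pathways and interactions.
pathway_genes = {
    "Wnt": {"A", "B", "C", "D"},
    "Hippo": {"E", "F", "G", "H"},
    "Small": {"X", "Y"},
}
gene_interactions = [("A", "E"), ("B", "C"), ("X", "A")]
edges = pathway_edges(pathway_genes, gene_interactions)
```

Only the interaction between genes A and E produces an edge: B and C share a pathway, and X belongs to a pathway that was filtered out.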

Pan-cancer methylome network model

A model of the pan-cancer methylome network incorporating the Molecular Mechanisms of Cancer pathway from the Ingenuity Pathway Analysis (IPA) resource ( http://www.ingenuity.com/ , RRID: SCR 008653) [ 40 ] and the Pathways in Cancer (Human) pathway from the KEGG pathway database ( https://www.kegg.jp/kegg/pathway.html , RRID: SCR 012773) [ 41 , 42 ] was produced using PathVisio software ( https://pathvisio.org/ , RRID: SCR 023789) [ 43 ]. Multiclass methylation features mapped to genes were displayed as blue nodes (non-coding genes highlighted in yellow), or purple if they were also known cancer genes from the COSMIC Cancer Gene Census or OncoKB. Interactions between nodes were derived from the literature, pathway databases (including IPA and KEGG) and protein–protein interaction data sets (STRING). The model was produced as a gpml object, adhering to the Systems Biology Graphical Notation (SBGN) standard. Direct interactions are shown as complete black lines and indirect interactions as broken black lines. Catalytic interactions are shown as red edges, inhibitory interactions as blue edges, and protein–protein interactions as orange edges between nodes.

Long non-coding RNA analysis

The gene type annotation data were obtained from Ensembl (version 101). Literature evidence was obtained using the Pangaea tool [ 33 ], where cancer hallmark keywords were extracted from the abstracts that mentioned at least one of the multiclass lncRNAs. Additionally, we used two cancer lncRNA databases, Lnc2Cancer 3.0 ( http://bio-bigdata.hrbmu.edu.cn/lnc2cancer/ , RRID: SCR 023781) [ 44 ] and CRlncRNA [ 45 ]. For Lnc2Cancer, we searched for cancer hallmark keywords in the description column, and for CRlncRNA, these cancer hallmark keywords were included explicitly. LncRNAs found in one or more of these two sources were plotted in a heatmap showing the average methylation (beta value) for all BRCA samples. The differential expression (log 2 fold change) was obtained using the DESeq2 package ( https://bioconductor.org/packages/release/bioc/html/DESeq2.html , RRID: SCR 015687) [ 46 ].

Comparison with cancer lncRNAs

We compared our lncRNAs to the Cancer LncRNA Census (CLC) [ 47 ], using a Fisher’s exact test. We then carried out a pared-down version of their CLC features analysis, following a method as similar as possible. In each of these tests, a Fisher’s exact test was used when not otherwise specified. Gene location and length data were obtained from Ensembl (version 101) using the R package biomaRt [ 26 , 27 ].

Close to cancer-associated and non-cancer-associated germline SNPs. Data were obtained from the GWAS Catalog (NHGRI-EBI’s catalog of published genome-wide association studies; http://www.ebi.ac.uk/gwas , RRID: SCR 012745) [ 48 ]. Cancer SNPs were found using the keywords ‘cancer’, ‘tumor’ and ‘tumour’, and all other SNPs were treated as non-cancer SNPs. We tested whether the closest cancer/non-cancer SNPs to the lncRNAs were within a distance threshold (1 kb, 10 kb, and 100 kb were tested).

Within 1 kb of the COSMIC cancer gene census genes. For each background and multiclass lncRNA, the distance to the closest COSMIC cancer gene [ 31 ] was computed, and we tested whether that distance was under 1 kb more (or less) frequently for the multiclass lncRNAs.

Epigenetically silenced in tumours. The multiclass lncRNAs were tested against a list of cancer-associated epigenetically silenced lncRNA genes (CAESLGs) [ 49 ].

Differentially expressed. The multiclass lncRNAs were tested against a list of dysregulated lncRNAs in a range of cancer types (BRCA, COAD, HNSC, KIRC, lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and PRAD) [ 49 ].

Gene and exon lengths. To test the difference in lengths, a Wilcoxon rank sum test was performed over the logged lengths (to ensure equal variance). For the exon length, the longest transcript for each gene was taken.

Higher expression levels. TCGA expression data were used, normalized by the TMM method [ 50 ] using the edgeR package ( http://bioconductor.org/packages/edgeR/ , RRID: SCR 012802) [ 51 ], and averaged across samples. A Wilcoxon rank sum test over the logged expression values was used to test expression differences.

Conservation. Phast 100-way conservation scores were downloaded [ 52 , 53 ], and all conservation scores that overlapped background or multiclass lncRNA gene bodies were taken. The mean conservation score per gene was computed, and then the difference in conservation between multiclass and background lncRNAs was tested using a Wilcoxon rank sum test.

Survival analysis

To determine whether the gene sets could differentiate survival, the following analysis was carried out for each gene list from the binary XGBoost models. TCGA expression data were normalized (using the variance-stabilizing transformation in the DESeq2 package [ 46 ]), and matched TCGA survival data were obtained. A Cox proportional hazards regression model (using the R package survival ( https://CRAN.R-project.org/package=survival , RRID: SCR 021137) [ 54 ], version 3.1.8) was fitted on each gene separately, with age, stage, and gender as covariates (excluding gender for BRCA, PRAD and UCEC). The genes that had a significant effect on survival (using the Wald statistic P-value with a .05 cutoff) were selected, and a Cox proportional hazards regression model was fitted over all these selected genes. Kaplan–Meier curves were computed (using the survival package) after splitting the samples into high-hazard and low-hazard groups at the median hazard.

A similar analysis, with the samples split into a train and a test set, was then run to show whether the gene lists could predict survival. This was repeated thirty times to obtain a distribution over the test set performance. After normalization of the gene expression data, samples were split into stratified train (75%) and test (25%) sets. Using only the train set, a Cox proportional hazards regression model was fitted on each gene separately and genes were selected as before. Then three Cox proportional hazards regression models were fitted to the train set: one using just the selected genes, one using just the covariates, and one using the selected genes and covariates. We then used these models to predict the hazard on the test set, and plotted time-dependent ROC curves (using the R package timeROC [ 55 ], version 0.4) on these predicted hazards.

We utilized machine learning approaches to identify cancer-specific changes from normal tissue-specific methylation. Illumina Infinium array-based DNA methylation data from 13 cancer types and corresponding normal tissues were used, and the data were extracted, cleaned, and processed as described in the Methods. At each CpG location, a pair of methylated and unmethylated probes is measured, and the ratio of the methylated probe intensity to the overall intensity is known as the beta value.
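
The beta value, and the M-value used elsewhere in the analysis, can be illustrated with a short sketch. It assumes the standard logit relationship between the two quantities; the `offset` argument, which some pipelines add to stabilize low intensities, is our assumption and defaults to zero, and the intensities are hypothetical.

```python
from math import log2


def beta_value(methylated, unmethylated, offset=0.0):
    """Ratio of the methylated intensity to the overall intensity."""
    return methylated / (methylated + unmethylated + offset)


def m_value(beta):
    """Log2 ratio of methylated to unmethylated signal, expressed via beta."""
    return log2(beta / (1.0 - beta))


# Hypothetical probe intensities.
beta = beta_value(methylated=3000.0, unmethylated=1000.0)
m = m_value(beta)
```

A beta value of 0.5 corresponds to an M-value of zero, and M-values grow without bound as beta approaches zero or one.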

In this study, we trained and evaluated four different model types: logistic regression, support vector machines (SVM), gradient boosted decision trees (XGBoost), and a deep neural network (DNN). See Fig. 1 for a visual overview. For the first three model types, both binary and multiclass classification models were created.

Overview of method. DNA methylation microarray data from 13 cancer types and corresponding normal tissues were collected from TCGA and preprocessed. For binary and multiclass classification tasks, three types of models were trained: Simple models (logistic regression and support vector machines), XGBoost, and EMethylNET, a model consisting of XGBoost combined with a Deep Neural Network. Then the models were evaluated on independent data sets and an analysis of their features used in classification was performed

Cancer types can be accurately classified using both binary and multiclass methods

Here, we present the results from the XGBoost and DNN models. We also trained and tested SVM and logistic regression models. In addition to performance measures such as ROC AUC and F1 score, we also provide the MCC, which is used to portray the performance of the binary classification models and is especially useful when class imbalances are present. The SVM models did not outperform the XGBoost or DNN models: the SVM binary models reached an average MCC of 0.894 and the SVM multiclass model reached an MCC of 0.956. The binary logistic regression models had an average MCC of 0.960, outperforming the binary XGBoost models on average (average MCC of 0.919); however, their performance varied across cancer types: logistic regression models performed better for five cancer types, XGBoost models performed better for four cancer types, and both achieved the same MCC for four cancer types. The multiclass logistic regression model (MCC of 0.973) did not outperform the multiclass XGBoost or DNN (MCC of 0.980 and 0.976, respectively). Since the binary logistic regression models did not substantially outperform the binary XGBoost models and the multiclass logistic regression achieved a lower MCC than the multiclass XGBoost and DNN, we focus our analysis on the XGBoost and DNN. For detailed performance metrics of the SVM and logistic regression models, see Supplementary Tables S1 and S2 .

Detection of the cancer states through binary classification of DNA methylation from individual tumour and normal tissues

XGBoost, a type of gradient boosted tree model, is an iterative ensemble machine learning approach [ 21 ]. We trained 13 binary XGBoost models, one for each cancer type. DNA methylome data from TCGA were used in training and testing the models, with a total of 6224 samples. Each model learns to distinguish cancer from normal samples (adjacent matched normal tissue) for its tissue type. Overall, performance on the test set was good, with five out of 13 models achieving perfect test set performance (COAD, KIRC, LUAD, LUSC, and uterine corpus endometrial carcinoma [UCEC]). Across all models, the average accuracy was 0.987 and the average MCC (a performance measure that is robust to severe class imbalance) was 0.919, demonstrating that the models can accurately classify cancer and normal samples. Figure 2a–e shows the confusion matrices for the best and worst performing models, AUC of ROC curves and precision–recall curves, and the MCC scores for all models. Performance metrics for all binary models can be found in Supplementary Table S3.1 . A key issue with these binary models is the major class imbalance. The average fraction of normal samples is 0.135 (see Supplementary Table S12 for the numbers of normal and cancer samples for each tissue type), which explains why the average MCC is considerably lower than the average accuracy. In addition, the lowest performing model, ESCA, with an accuracy of 0.961 and MCC of 0.693, is the tissue type with the lowest number of normal samples, of which there were only 16. This sparsity of data likely contributed to its worse performance.
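
The gap between accuracy and MCC under class imbalance can be reproduced with a small sketch. The confusion-matrix counts are hypothetical and are not results from the study.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 96 cancer samples and 4 normal samples.
# Cancer is the positive class; half of the normal samples are misclassified.
counts = dict(tp=95, fn=1, tn=2, fp=2)
acc = accuracy(**counts)
score = mcc(**counts)
```

Accuracy remains at 0.97 because the majority class dominates, whereas the MCC falls below 0.6 because errors on the few normal samples weigh heavily.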

Performance of the binary and multiclass XGBoost models on the TCGA test set. a and b Confusion matrices of the best (KIRC) and worst (ESCA) performing binary XGBoost models. c AUC of the ROC curves for all binary XGBoost models. d AUC of the Precision Recall (PR) curves for both cancer and normal classes of all binary XGBoost models. Note that the scales of c and d start from 0.7. e MCC scores for all binary XGBoost models. f shows the confusion matrix, g shows the AUC of the ROC curves for each class, and h shows the AUC of the Precision Recall (PR) curves for each class of the multiclass XGBoost model. Note that the scales of g and h start from 0.9

The multiclass classification of 13 cancer and normal tissues is more robust

Here, we trained a single multiclass XGBoost model on the whole of the training data. There were classes for each of the 13 cancer types and a single normal class, which contained normal samples from every tissue type. The model was now required to learn the differences between 13 tissue types in addition to the differences between cancer and normal tissue samples, making it a more challenging task than the previous binary classification. However, there was no longer a large class imbalance because the normal samples were pooled together. As shown in Fig. 2f–h , performance on the test set was very good for all classes. The model can discriminate each of the 13 cancer types and normal samples with a high degree of accuracy. The overall accuracy was 0.982 and the overall MCC was 0.980; see Supplementary Table S3.2 for the detailed metrics.

Models achieve high accuracy on independent heterogeneous data sets

To determine the robustness of our models, we evaluated our XGBoost models on several independent data sets representing different cancer types, amounting to a total of 940 samples. These were more heterogeneous than the TCGA data used for training. Two data sets included some adenoma samples (COAD and THCA), one data set consisted of samples from early-stage tumours, some of which were later shown to recur (LIHC), and one data set included some Human papillomavirus (HPV) positive samples (HNSC). The data sets also came from seven different countries, viz., Iceland (BRCA), USA (COAD), Australia (ESCA), UK (HNSC and ESCA), China (LIHC and KIRC), Canada (PRAD), and Brazil (THCA).

Binary models show good performance on independent data sets

When these independent data sets were tested, most of the binary XGBoost models (trained on TCGA data) performed well, as illustrated in Fig. 3 . Confusion matrices of the best and worst performing binary XGBoost models are shown in Fig. 3a and b . In terms of ROC AUC ( Fig. 3d ), the highest performing model was the BRCA model, with a perfect ROC AUC of 1.0, and the lowest performing was the COAD model, with a ROC AUC of 0.758. The precision–recall AUC results show similar trends ( Fig. 3e ). In terms of MCC ( Fig. 3f ), the lowest performing model was the ESCA model, which is expected given the major class imbalance in the ESCA training data set.

Performance of the binary XGBoost models on independent data sets. a and b Confusion matrices of the best (BRCA) and worst (COAD) performing binary XGBoost models (according to the ROC AUC scores) on the independent data sets. c Detailed confusion matrix for COAD showing the predictions of Normal (N), Adenoma (A), and Cancer (C) samples. d AUC of the ROC curves for binary XGBoost models where the independent data set included normal samples. e AUC of the Precision Recall (PR) curves for both cancer and normal (where available) classes of binary XGBoost models on the independent data sets. f MCC scores for binary XGBoost models where the independent data set included normal samples. For d , e and f , ESCA is the average of the two ESCA independent data sets

Regarding the COAD model, its confusion matrix in Fig. 3b shows that it predicted 12 normal samples as cancer. Nine of these 12 samples are in fact adenomas, benign tumours of glandular origin (see Supplementary Table S4 for the number of normal, adenoma and cancer samples in the COAD independent data set). A confusion matrix that also shows whether the samples are Normal (N), Adenomas (A), or Carcinomas (C) is shown in Fig. 3c , illustrating that all adenomas are classified as cancer. This was unexpected, as there were no adenomas in the training data set: instead of classifying them randomly, the model found some cancer-associated signal in the adenoma samples in the independent data set.

A similar trend was identified in the other independent data set with adenomas. In the THCA model, 11 out of 17 adenomas were predicted to be cancer (see Supplementary Table S4 for the number of normal, adenoma and cancer samples in the THCA independent data set). In detail, all occurrences of ‘follicular adenoma’ and ‘follicular adenoma/Hürthle cell’ were classified as cancer (n = 8), all ‘lymphocytic thyroiditis’ samples were classified as normal (n = 3), and ‘nodular goitre’ was split evenly between the two classes (n = 6).

EMethylNET, a model consisting of a DNN model trained on features learnt from multiclass XGBoost, improves performance

The results for the multiclass XGBoost model on the independent data, which had an accuracy of 0.68 and MCC of 0.661, can be found in Supplementary Table S6 . With the aim of creating a more robust model and improving these results, we designed EMethylNET, a feed-forward neural network based on our XGBoost model, as shown in Fig. 4a . The input features of EMethylNET were the features that the multiclass XGBoost model learnt to utilize for classification, referred to as the probes contributing to classification (see below). See Supplementary Table S5 for EMethylNET’s results on the TCGA test set. The results on the independent data sets, which had an accuracy of 0.867 and MCC of 0.844, are shown as a confusion matrix ( Fig. 4b ), AUC of ROC ( Fig. 4c ) and AUC of PR ( Fig. 4d ) and in Supplementary Table S7 . The only data set that did not reach an F1 score of at least 0.8 (excluding COAD, as it contains adenomas, see above) was HNSC.

Architecture of the feed forward neural network ( a ) and its performance on all independent data sets. b shows the confusion matrix, c shows the AUC of the ROC curves for each class, and d shows the AUC of the Precision Recall (PR) curves for each class. The two ESCA data sets are combined into one ESCA class. The colour orange denotes normal and purple denotes cancer. Note that we do not have independent data sets for every cancer type (the independent data sets used lacked BLCA, KIRP, LUAD, LUSC and UCEC samples). Nevertheless, for the confusion matrix in b all 14 classes are retained in the rows to maintain a square configuration, enhancing readability

Comparison of EMethylNET with related cancer classification studies

The detection and classification of cancer using methylation-based approaches is a large and growing body of literature. A diverse range of approaches and objectives have been investigated, from binary classification of cancer using tissue data [ 56 ], to multiclass classification using data from liquid biopsies [ 57 ]. Here, we conduct a comparative analysis of EMethylNET with other related works that utilize machine learning for pan-cancer multiclass classification of DNA methylation data from tissue samples. These related works [ 58–69 ] are listed in Table 1 . Various machine learning approaches have been used, from logistic regression to DNNs, with tree-based methods (random forest and XGBoost) being a popular approach (6/12 works).

First, we provide a performance comparison of EMethylNET to these related works. We only compare works that provide test set scores on TCGA (we do not attempt to run their models). This comparison is not exact, and it is important to note the following shortcomings: the test sets contain different samples and have different sizes, the related works have slightly different classification tasks (for example, some only consider cancer samples, some define a normal class for each tissue type and some have separate classes for metastatic samples) and some classes in related works are not comparable with our classes (for example, some works combine cancer types found in the same tissue type). In addition, we can only compare with the metrics reported in the publication, and so different metrics are compared for different works.

We begin with Hao 2017 [ 58 ]. They classify four cancer types and four normal tissues, and so we can only compare with the four cancer types (as our normal samples are pooled). Supplementary Table S8 shows the precision and recall metrics, indicating that EMethylNET achieves comparable performance for these cancer types (higher precision for COAD, LIHC and LUAD, and higher recall for LUAD). Next, we compare with Ibrahim 2022 [ 67 ]. They perform a slightly different task, as they do not include a normal class, and they combine colon and rectal tumour data sets, so we cannot compare performance on COAD. Supplementary Table S9 shows the ROC AUC scores: EMethylNET achieves the same ROC AUC or higher in all classes (when rounding to three decimal places). Ibrahim 2022 also externally validate their model on the independent BRCA (GSE52865) and THCA (GSE97466) data sets. Again, this is not a direct comparison because the rest of their independent external validation data set differs from ours (which affects the one-vs-all approach to calculating AUC scores). For the BRCA data set, they report a ROC AUC of 0.928, and we achieve a ROC AUC of 0.99997. For the THCA data set, they report a ROC AUC of 0.990, and we achieve 0.99463. Next, we compare with Zheng 2020 [ 63 ]. They do not include normal samples, and they classify the cancer origin site, which again is a slightly different task from ours. We cannot compare the KIRC, kidney renal papillary cell carcinoma (KIRP), LUAD and LUSC classes as they combine them into generic kidney and lung classes. The ROC AUC, precision and recall metrics are shown in Supplementary Table S10 , indicating that we achieve comparable performance. EMethylNET’s ROC AUCs are the same or higher in all classes (when rounding to two decimal places), precision is higher in 4/8 classes and recall is the same or higher in 5/8 classes. Lastly, we compare with Modhukur 2021 [ 66 ]. As they have distinct classes for each metastatic cancer and each normal tissue type, they address a more challenging task. Supplementary Table S11 shows the precision, recall and F1 metrics, which show that we achieve comparable performance. EMethylNET’s precision is the same or higher in 6/13 classes, recall is the same or higher in 10/13 classes, and F1 is higher in 8/13 classes. In summary, we have shown that EMethylNET achieves competitive test set performance amongst comparable works.

A key advantage of using an interpretable method such as XGBoost is that the features utilized for classification can be identified. In our case, these were the CpG probes with a feature importance of above zero, which we refer to as PCCs. Surprisingly, most PCCs from the binary models were found not to be differentially methylated in each respective cancer type. Only 65/221, 56/318 and 29/179 PCCs from the BRCA, PRAD, and THCA binary models, respectively, were found to be differentially methylated, as shown in Supplementary Fig. S1 .

We explored the PCCs from the multiclass XGBoost model (exactly the input features to EMethylNET). The importance scores of the most important PCCs are shown in Supplementary Fig. S2a , which shows that most of the importance is captured by the top one hundred PCCs. The most important PCC, cg16508600, is at position chr1:204562255 and does not map to any gene. The closest gene is RNA5SP74, which is ∼200 bp away, and the location of this PCC coincides with a C > T SNV (rs567580996). The second most important, cg14789818, is ∼200 bp upstream of RNA5SP77 on chromosome 1. The third most important, cg03988778, is near the promoter of SVIP and AC006299.1. Supplementary Fig. S2b shows the distribution of methylation of the top ten important probes, indicating that they commonly differentiate one class from the rest. For example, the most important PCC differentiates BRCA from all other classes, and the third most important differentiates a sizable proportion of the HNSC class.

An interpretation of the multiclass DNN model can be achieved by analysing its SHAP (SHapley Additive exPlanation) values [ 25 ]. The feature with the highest average impact on model output (the highest average absolute SHAP value) is cg15267232, which is within GATA3, and the feature with the second-highest average impact is cg22455450, which is within ZNF808. The feature with the third-highest average impact is cg22541735, which is within HOXD9 and HOXD-AS2. Interestingly, the feature with the 10th highest average impact (cg14789818) is also the second most important feature for the multiclass XGBoost model. Supplementary Fig. S3 visualizes the features with the highest average absolute SHAP values.

The proximal genes of the multiclass model’s features are enriched in genes contributing to hallmarks of cancer, carcinogenesis, and transcriptional regulation

The PCCs can be mapped to their proximal genes, i.e. genes whose gene body or promoter region (taken as the 1500 base pair window upstream of the transcription start site) overlaps the PCCs. We will refer to the genes obtained by mapping the multiclass PCCs to proximal genes as ‘multiclass genes’.

We performed functional enrichment analysis on the multiclass genes. A visualization of the significant Gene Ontology terms, restricted to the Biological Process ontology, is shown in Fig. 5a . This shows that our multiclass gene list is enriched in development, regulation of signaling, processes involved in gene expression changes, and the regulation of a wide variety of metabolic processes. Over-representation analysis revealed that there is significant overlap between the multiclass genes and the COSMIC Cancer Gene Census [ 31 ], with an overlap of 140 genes (19.0% of COSMIC genes) (Fisher’s exact test, p = 8.7e−17). We also found significant overlap between the multiclass genes and the OncoKB Cancer Gene List [ 32 ], namely 217 genes (19.7% of OncoKB cancer genes) (Fisher’s exact test, p = 4.5e−27). Furthermore, analysis of multiclass features using the TF Checkpoint 2.0 database indicated that 17.2% (546 genes) (Fisher’s exact test, p = 2.4e−39) are also transcriptional regulators.

Cancer processes, genes, and pathways in the multiclass gene list. a A REVIGO visualization showing the significant Gene Ontology terms, restricted to the biological process domain. Only a small selection of terms is labelled. b The 20 multiclass genes found most often in abstracts about cancer. Colour indicates the number of abstracts also specifying a tissue. c A visualization of the significant KEGG pathways, where the size of the node (pathway) is the amount of overlap between the multiclass gene list and the pathway, and the width of the edge indicates the amount of overlap between the two pathways. d The Pathways in cancer KEGG pathway, showing only multiclass genes. Each multiclass gene is coloured by the difference in average methylation between cancer and normal for two cancer types: BLCA and PRAD

We also looked at the overlap with established DNA methylation biomarkers in cancer, by comparing with the genes used by commercially available DNA methylation-based biomarker assays [ 70 ]. Out of the 13 genes measured by these assays, four overlap with our multiclass genes (RASSF1, SEPTIN9, SHOX2, MGMT). When normally expressed, RASSF1A represses the cell cycle proteins cyclin A2 and cyclin D1, leading to cell cycle arrest; it also plays a significant role in microtubule stability and modulates apoptosis. Furthermore, RASSF1 inactivation is one of the most common epigenetic changes in cancer [ 71 ]. Similarly, SEPTIN9 participates in cytokinesis during the cell cycle [ 72 ], while SHOX2 is a transcription factor involved in proliferation, migration and colony formation [ 73 ] and MGMT inhibits tumour formation [ 74 ]. All these genes are well-known prognostic biomarkers in cancer. There are also several multiclass genes in the same family as these 13 genes (such as NDRG3, BMP8A, OTX2, ONECUT1).

Text mining a corpus of 183,909 PubMed cancer-related abstracts that mention at least one multiclass gene revealed that the cancer literature provides evidence for the multiclass genes: 65.6% (2083) of our multiclass genes are found in at least one cancer-related article abstract from PubMed. See Fig. 5b for the genes most supported by the literature. These include well-studied cancer genes such as STAT3, BRCA1, AR, MYC, CXCR4, NOTCH1, SMAD4, TERT, ZEB1 and JUN, amongst others. This analysis also demonstrated that just under 40% of these abstracts are additionally associated with at least one of the 13 tissue types included in the multiclass model. BRCA, PRAD and COAD are most commonly found, due to their high prevalence. Supplementary File 1 details the evidence for the multiclass genes in these PubMed cancer-related abstracts.

The multiclass genes are enriched in cancer-related pathways and networks

Pathway enrichment analysis using the KEGG pathway database revealed enrichment of pathways related to general cancer hallmarks, such as Pathways in cancer (adjusted P -value = 4.3e − 4) and Metabolic pathways (adjusted P -value = 0.0214), and of signal transduction pathways such as the Wnt signalling pathway (adjusted P -value = 8.2e − 3), TGF beta signalling, Hippo signalling, and the Axonal guidance pathway involved in invasion and metastasis, as well as many metabolic pathways. See Fig. 5c for a visualization of these enriched pathways.

The multiclass genes in these pathways displayed different methylation patterns for different cancer types. A visualization of the Pathways in cancer network from KEGG is shown in Fig. 5d for both bladder urothelial carcinoma (BLCA) and PRAD, and in Supplementary Fig. S4 for all other cancer types. This shows that BLCA, KIRC, KIRP, LIHC, THCA and UCEC are mostly hypomethylated whilst BRCA, COAD, LUAD, LUSC and PRAD are mostly hypermethylated. Supplementary Fig. S5a shows a heatmap of the PCCs that are at least 2-fold differentially methylated, with their recurrent mutation status (from the COSMIC Cancer Gene Census and the TCGA significantly mutated list) indicated. Similarly, the differential methylation of all PCCs is shown in Supplementary Fig. S5b . In addition to mutations and copy number aberrations, the PCC features identified by our analysis contribute to carcinogenesis in multiple cancers via methylation changes of regulatory elements.

Furthermore, multiclass genes were found to be present in a broad range of cancer-related pathways, as shown in Fig. 6 . These pathways covered a wide range of categories: Individual cancer types, Cell Death and Survival, Tissue Microenvironment, Signalling, Metabolism, and Immune System. This pathway model also shows that many multiclass genes in these pathways are present in the COSMIC Cancer Gene Census [ 31 ]. To visualize the multiclass genes in one unified cancer network, we curated a general cancer network, based on two general cancer pathways: KEGG’s Pathways in cancer and Ingenuity Pathway Analysis (IPA) Molecular Mechanisms of Cancer (both of which our multiclass genes are enriched in, Fisher’s exact test respective adjusted P -values = 4.206e − 10 and 1.065e − 05). This network model is shown in Supplementary Fig. S6 and demonstrates that the multiclass genes span all areas of cellular networks underlying carcinogenesis.

A network of cancer pathways and the multiclass genes. Each circle of nodes is a cancer pathway, and each node represents a multiclass gene. The node colour represents the number of times each multiclass gene is displayed (as they can be in multiple pathways), the edge thickness represents the number of interactions between pathways, and a black outline indicates that the multiclass gene is found in the Cancer Gene Census. The colour of the pathway name represents the pathway category

Multiclass long non-coding RNAs are associated with oncogenic properties

Additionally, we investigated the proportion of protein-coding versus non-coding genes in our gene lists. This is visualized in Fig. 7a for all individual cancer gene lists and the multiclass genes. It shows that, as expected, most genes are protein-coding (around 65% to 85%). However, the proportion of long non-coding RNA (lncRNA) is surprisingly high for all gene lists (around 14% to 26%), which motivated further analysis. We validated some of the multiclass lncRNAs with literature evidence using Pangaea [ 33 ] and two cancer lncRNA databases, Lnc2Cancer 3.0 [ 44 ] and CRlncRNA [ 45 ]. We found evidence for 142 multiclass lncRNAs (out of a total of 596 multiclass lncRNAs). See the heatmap in Fig. 7b (and Supplementary Fig. S7a for a larger version), which shows that there is a wide range of methylation values for these lncRNAs, and a range of cancer hallmarks associated with them. The most common hallmarks are proliferation, invasion, and migration. The lncRNAs with the most evidence include HOTAIR, NEAT1, and HOTTIP, as seen in Fig. 7c .

Analysis of the lncRNAs found in the gene lists. a The fractions of different gene types in all cancer gene lists, including the multiclass gene list. b A heatmap of BRCA data showing the average beta value of the multiclass lncRNAs with literature evidence, and the cancer hallmarks they are associated with. The row annotation indicates the log fold change from differential expression analysis, where non-significant fold change (adjusted P -value > 0.05) is in grey. c The top 10 multiclass lncRNAs that had the most literature evidence. d The significance levels resulting from testing the multiclass lncRNAs for previously observed cancer lncRNA features [ 41 ]. The dashed red line indicates the P -value = .05 level of significance. e Boxplot of the log e gene length of non-multiclass lncRNAs and multiclass lncRNAs. ‘***’ indicates P -value < .001

We also compared the multiclass lncRNAs to a set of validated cancer lncRNAs and found that they share some of the same properties. Carlevaro-Fita et al. introduced [ 47 ] and Vancura et al. updated the Cancer LncRNA Census 2 (CLC2), which is a list of 492 lncRNAs that have been causally associated with cancer [ 75 ]. Our multiclass lncRNAs have a significant overlap with the CLC2 (Fisher’s exact test, P -value = 3.0e − 18), although this amounts to only 74 overlapping lncRNAs. Carlevaro-Fita et al. also uncovered the properties of genes in the CLC, such as smaller distances to cancer SNPs, higher conservation, and longer gene lengths. By carrying out the same tests on our multiclass lncRNAs, we found that the multiclass lncRNAs share some of these CLC properties. We tested the distances to cancer-associated and non-cancer SNPs, distances to cancer-associated genes, epigenetic silencing in tumours, differential expression, gene and transcript lengths, gene expression levels, and conservation. The P -values for each of these tests are shown in Fig. 7d . We found that our multiclass lncRNAs did not share any of the same proximity properties (distances to SNPs and cancer genes) but did share all five remaining properties. See Fig. 7e for a boxplot showing that the multiclass lncRNAs have longer gene lengths, and Supplementary Fig. S7b–g for boxplots of the other relevant tests.

Models for some of the cancer types can predict 5-year survival

We used the gene lists from the binary XGBoost models to determine whether they could first differentiate, and then predict, survival. Survival was computed for each cancer type, using just the expression of the genes from the binary model as input. For every cancer type, the Cox proportional hazards model significantly differentiated survival. Figure 8a shows the Kaplan–Meier curves for the most significantly differentiated cancer types, HNSC ( P -value = 3.15e − 16) and KIRC ( P -value = 3.06e − 15).
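As background for the Kaplan–Meier curves mentioned above, the estimator for a single group of patients can be sketched as follows. This is a generic textbook implementation, not the study's pipeline (the reference list points to the R survival package); the function name `kaplan_meier` and the follow-up data are hypothetical.

```python
def kaplan_meier(times: list[float], events: list[bool]) -> list[tuple[float, float]]:
    """Return (time, survival probability) pairs at each distinct event time.

    `events[i]` is True if the event (death) was observed at `times[i]`
    and False if the patient was censored at that time.
    """
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up times (in years) and event indicators.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(f"t = {t}: S(t) = {s:.2f}")
```

Comparing two such curves, for example for patients split into high-risk and low-risk groups by a fitted model, is what a figure such as Fig. 8a displays.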

Survival analysis using the gene lists from the binary models. a The two most significant Kaplan–Meier curves that differentiate survival: HNSC ( P -value = 3.15e − 16) and KIRC ( P -value = 3.06e − 15). b The distribution of ROC AUCs when predicting 5-year survival for cancer types with sufficient survival data. Colour represents the three different variations of input variables to the survival models. c The best ROC curves for predicting 5-year survival of the cancer types with the highest average ROC AUC: KIRC and COAD

Next, we explored whether the gene lists could predict survival on a held-out test set. The performance differed between cancer types, as shown in Fig. 8b (here we use age, stage, and gender as covariates). This shows that there is a broad distribution for some cancer types (such as KIRP), which could be due to low sample numbers (KIRP has the second-lowest number of samples). Three cancer types did not have enough data for the models to converge: PRAD and THCA both had fewer than 15 positive samples (events), and ESCA had the fewest samples. However, models for some cancer types could predict 5-year survival consistently well using only genes as input, such as KIRC, COAD, BLCA, HNSC, and UCEC. Figure 8c shows the best ROC curves from the two cancer types with the highest average ROC AUC, KIRC and COAD. This shows their best ROC AUCs are 0.817 and 0.895, respectively.
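To make the ROC AUC values above concrete, the following sketch computes the AUC as the probability that a randomly chosen positive sample receives a higher risk score than a randomly chosen negative one. It deliberately ignores censoring, which a time-dependent ROC method would handle, so it should be read as a simplification rather than the evaluation used in the study; the function name `roc_auc`, the labels and the scores are hypothetical.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = event within 5 years) and predicted risk scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
print(f"ROC AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form also shows why very few events make the estimate unstable: with only a handful of positives, a single misranked sample shifts the AUC substantially.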

The early detection of cancer is vital for enabling treatment options that lead to better prognosis. A fundamental requirement for this is to distinguish cancerous from non-cancerous tissue samples accurately. Here, we have utilized epigenetic changes in the DNA methylome and present binary and multiclass machine learning models to classify 13 cancer types and corresponding normal tissues. Our approach achieved good test set performance for all XGBoost models, namely an average accuracy of 0.987 and 0.982 for the binary and multiclass models, respectively. We were then able to show that the PCCs selected by XGBoost can robustly classify cancer when fed into a multiclass deep neural network, namely EMethylNET (accuracy 0.976).

The performance on most independent (non-TCGA) data sets was above an F 1 score of 0.8, and half of the independent data sets achieved an F 1 score of over 0.9. These independent data sets were more heterogeneous and reflected more realistic situations. Lastly, we demonstrated that multiclass PCCs do have biologically meaningful significance in cancer. Over-representation analysis revealed that the multiclass genes were enriched in processes which are linked to cancer hallmarks, and other cancer and methylation studies report similar Gene Ontology enrichment results [ 76 , 77 ]. Furthermore, a comprehensive text mining analysis of the literature demonstrates that cancer-associated methylation changes in 892 of our multiclass genes are supported by 7831 publications. We also showed that the multiclass gene set includes 229 known tumour suppressors and oncogenes and 546 transcriptional regulators, and that these genes are involved in a wide range of cancer-related pathways and processes. Additionally, we showed that our gene lists contain many non-coding RNA genes, primarily consisting of lncRNAs. This is consistent with a growing body of research showing that lncRNAs and other non-coding RNAs play a key role in carcinogenesis [ 78–80 ].

There were two exceptions to the performance of our models, one of them being the independent COAD data set. As indicated in the Results, this low performance can be explained by all adenomas, labelled as normal, being predicted as cancer. Adenomas are dysplastic polyps which can progress via the adenoma–carcinoma sequence to invasive cancer. Colon adenomas are therefore commonly removed when they are found, to stop their possible progression into carcinomas [ 81 , 82 ], and so this behaviour was inadvertently useful. However, a larger sample size of adenomas would be needed to validate this. The other exception is the HNSC independent data set, which has the lowest performance. HNSC is very heterogeneous, in that it can arise from multiple different tissue sites, and the TCGA HNSC data reflects this. However, the independent HNSC data set only stems from one tissue of origin, the oropharynx, and only 1.55% of the TCGA data stems from the oropharynx. In addition, half of the independent HNSC data set is Human papillomavirus positive (HPV+), which is known to display different methylation patterns [ 83 , 84 ]. Thus, we were testing on HNSC cancer types with very little TCGA training data, which could explain the poor performance. In addition, the independent HNSC data were often misclassified as LUAD. The uniform manifold approximation and projection visualization in Supplementary Fig. S8 illustrates that out of all TCGA classes, the independent HNSC data were the closest to LUAD. This could have a biological cause, for example if the independent HNSC samples are in fact metastases that originated in the lung, or it could be due to specific data generation or processing artefacts.

We compared EMethylNET with related cancer classification studies and demonstrated similar or better performance against test set data. We also compare these related works with respect to the features selected by the models. The related works all utilized feature selection methods, such as the moderated t-statistic or differential methylation analysis, with multiple works using redundancy filters, for example the Maximum Relevance–Maximum Distance technique [ 59 ], and many utilizing multiple feature selection methods in parallel [ 59 , 63 , 67 , 69 , 85 ]. Thus, most of these approaches start from a highly filtered probe list, and some only use tens of probes in the final classification model (as detailed in Table 1 ). Consequently, the models could potentially be biased by the feature selection methods used. In our approach, we did not perform a prior feature selection, but instead let the XGBoost classification model perform the feature selection itself, from an input set of around 277,000 features. For the multiclass case, this resulted in a large set of PCCs, of size 3388, that provided us with an interpretable model and an explainable list of genomic loci for further analysis. Only a handful of the related works have performed feature analysis of the CpGs selected by the model. Ding et al. [ 62 ] performed functional analysis of their 7 CpGs and Liu et al. [ 85 ] found cancer-related genes near three out of their 12 CpGs. We provide an extensive analysis of our PCCs, encompassing over-representation analyses, extensive literature mining, and pathway enrichment visualizations. Exploring the pan-cancer methylome as a network ( Fig. 6 and Supplementary Fig. S6 ) enabled the identification of genes associated with several well-studied cancer-associated pathways, including well-known tumour suppressors and oncogenes present in the collection of our PCCs.
These include genes involved in cancer-associated pathways such as TP53, WNT, Notch, TGF beta/BMP, RAS, MAPK, PI3K-AKT and Hedgehog signalling, as well as pathways impacting proliferation, survival and cell death, including cell cycle regulators, mitotic checkpoint genes, mitochondrial metabolism, DNA damage responses and apoptosis. In addition, genes involved in invasion and metastasis-associated processes were present, such as epithelial mesenchymal transition (EMT)-related genes, the axonal guidance pathway, and those involved in adherens junctions and extracellular matrix interactions, Integrin signalling and angiogenesis. Furthermore, immune response regulators such as cytokine (IFN, interleukin, chemokine), TLR signalling, and interferon stimulated genes were also present. Finally, genes and pathways affecting global gene expression such as developmental regulators, chromatin remodellers, epigenetic regulators and transcription factors were detected. Investigating these genes in a cancer network context enabled their interactions and relationships to be identified. The pan-cancer methylome also demonstrates that in addition to mutations and genetic aberrations, epigenetic changes have wide-ranging impacts on carcinogenesis. To summarize, in comparison with related studies, we are the first to provide an in-depth feature analysis where the CpGs were selected freely by the model, with no prior feature selection adding potential bias to the feature analysis results.

In conclusion, we demonstrated that XGBoost models are suitable for classifying a multitude of cancer types using only DNA methylation data as input. We additionally designed EMethylNET, a robust deep neural network that was able to generalize to most independent data sets. In addition, we find that mapping the PCCs to genes identifies genes that are enriched in functional properties and pathways linked to carcinogenesis. Depending on the availability of training data, this method can be extended to detect hundreds of cancer types. Future applications include extending this approach to DNA methylation data of cell-free DNA, with the eventual aim being early detection of multiple types of cancer from liquid biopsy approaches. Furthermore, a clear clinical application of this method is screening for specific cancer types or cancers of unknown origin, although the current models are not optimized for this purpose.

S.A.S conceived the study. I.N developed the machine learning models and carried out the data processing and analysis. M.S contributed to the initial machine learning models and analysis during a summer studentship. S.J contributed an external data set and expertise. I.N and S.A.S wrote the manuscript with input from the other authors. We acknowledge the contribution of Dr Charles Massie (In Memoriam) of the University of Cambridge who was also involved in the conception of the study and whose advice and expertise on cancer early detection and cancer-related DNA methylome analysis was invaluable to this study. We are thankful to Prof. Rebecca Fitzgerald (University of Cambridge), who contributed an oesophagus cancer data set to this study. We also thank members of the S.A.S laboratory that read and commented on the manuscript.

Izzy Newsham (Data curation [lead], Formal analysis [lead], Investigation [lead], Methodology [lead], Resources [equal], Software [lead], Validation [lead], Visualization [equal], Writing—original draft [equal], Writing—review & editing [equal]), Marcin Sendera (Data curation [supporting], Formal analysis [supporting], Investigation [supporting], Methodology [supporting], Software [supporting], Writing—review & editing [supporting]), Sri Ganesh Jammula (Data curation [supporting], Investigation [supporting], Resources [supporting], Writing—review & editing [supporting]), and Shamith Samarajiwa (Conceptualization [lead], Formal analysis [supporting], Funding acquisition [lead], Investigation [equal], Methodology [supporting], Project administration [lead], Resources [equal], Software [supporting], Supervision [lead], Visualization [supporting], Writing—original draft [lead], Writing—review & editing [equal])

The authors declare no competing interests.

This work was supported by the Medical Research Council (UK MRC) (MC UU 12022/10) funding to S.A.S. I.N is also supported by a UK MRC doctoral studentship.

The results shown here are in whole or part based upon methylome data generated by the TCGA Research Network: https://www.cancer.gov/tcga . Non-TCGA evaluation data sets with the following accession IDs were downloaded from NCBI GEO and ICGC.

BRCA:	GSE52865
COAD:	GSE77955
ESCA:	GSE72874
ESCA2:	EGAD00010001822 and EGAD00010001834
HNSC:	GSE38266
KIRC:	GSE61441
LIHC:	GSE75041
PRAD:	PRAD-CA from ICGC
THCA:	GSE97466

The code for this project was produced in a reproducible manner with Python Jupyter and R notebooks, available at: https://github.com/ss-lab-cancerunit/EMethylNET_code . The code and data used to generate most of the figures are also available as a compute capsule https://doi.org/10.24433/CO.1745934.v1 on the CodeOcean computational reproducibility platform enabling the ease of sharing, running and discovery of the code used in this study.

IARC . "Globocan: All Cancers Fact Sheet." https://gco.iarc.who.int/media/globocan/factsheets/cancers/39-all-cancers-fact-sheet.pdf (accessed 24 August 2023 ).

Baylin SB , Jones PA. A decade of exploring the cancer epigenome—biological and translational implications . Nat Rev Cancer 2011 ; 11 : 726 – 34 . https://doi.org/10.1038/nrc3130 .

Gonzalez-Zulueta M , Bender CM , Yang AS et al. Methylation of the 5' CpG island of the p16/CDKN2 tumor suppressor gene in normal and transformed human tissues correlates with gene silencing . Cancer Res 1995 ; 55 : 4531 – 5 . [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/7553622 .

Greger V , Debus N , Lohmann D et al. Frequency and parental origin of hypermethylated RB1 alleles in retinoblastoma . Hum Genet 1994 ; 94 : 491 – 6 . https://doi.org/10.1007/BF00211013 .

Herman JG , Latif F , Weng Y et al. Silencing of the VHL tumor-suppressor gene by DNA methylation in renal carcinoma . Proc Natl Acad Sci U S A 1994 ; 91 : 9700 – 4 . https://doi.org/10.1073/pnas.91.21.9700 .

Hiltunen MO , Alhonen L , Koistinaho J et al. Hypermethylation of the APC (adenomatous polyposis coli) gene promoter region in human colorectal carcinoma . Int J Cancer 1997 ; 70 : 644 – 8 . https://doi.org/10.1002/(sici)1097-0215(19970317)70:6<644::aid-ijc3>3.0.co;2-v .

Sheaffer KL , Elliott EN , Kaestner KH. DNA hypomethylation contributes to genomic instability and intestinal cancer initiation . Cancer Prev Res (Phila) 2016 ; 9 : 534 – 46 . https://doi.org/10.1158/1940-6207.CAPR-15-0349 .

Bedford MT , van Helden PD. Hypomethylation of DNA in pathological conditions of the human prostate . Cancer Res 1987 ; 47 : 5274 – 6 . [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/2443238 .

Kim Y-I , Giuliano A , Hatch KD et al. Global DNA hypomethylation increases progressively in cervical dysplasia and carcinoma . Cancer 1994 ; 74 : 893 – 9 . https://doi.org/10.1002/1097-0142(19940801)74:3<893::aid-cncr2820740316>3.0.co;2-b .

Lin CH et al. Genome-wide hypomethylation in hepatocellular carcinogenesis . Cancer Res 2001 ; 61 : 4238 – 43 . [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/11358850 .

Wahlfors J , Hiltunen H , Heinonen K et al. Genomic hypomethylation in human chronic lymphocytic leukemia . Blood 1992 ; 80 : 2074 – 80 . [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/1382719 .

Irizarry RA , Ladd-Acosta C , Wen B et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores . Nat Genet 2009 ; 41 : 178 – 86 . https://doi.org/10.1038/ng.298 .

Paziewska A , Dabrowska M , Goryca K et al. DNA methylation status is more reliable than gene expression at detecting cancer in prostate biopsy . Br J Cancer 2014 ; 111 : 781 – 9 . https://doi.org/10.1038/bjc.2014.337 .

Rossi SH , Newsham I , Pita S et al. Accurate detection of benign and malignant renal tumor subtypes with MethylBoostER: an epigenetic marker-driven learning framework . Sci Adv 2022 ; 8 : eabn9828. https://doi.org/10.1126/sciadv.abn9828 .

Peng D , Ge G , Xu Z et al. Diagnostic and prognostic biomarkers of common urological cancers based on aberrant DNA methylation . Epigenomics 2018 ; 10 : 1189 – 99 . https://doi.org/10.2217/epi-2018-0017 .

impute: Imputation for microarray data. ( 2023 ).

Du P , Zhang X , Huang C-C et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis . BMC Bioinformatics 2010 ; 11 : 587. https://doi.org/10.1186/1471-2105-11-587 .

Pedregosa F et al. Scikit-learn: machine learning in Python . The Journal of Machine Learning Research 2011 ; 12 : 2825 – 30 .

Talos . ( 2019 ). [Online]. Available: http://github.com/autonomio/talos

Kingma DP , Ba J. "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014 .

Chicco D , Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation . BMC Genomics 2020 ; 21 : 6 . https://doi.org/10.1186/s12864-019-6413-7 .

Durinck S , Moreau Y , Kasprzyk A et al. BioMart and bioconductor: a powerful link between biological databases and microarray data analysis . Bioinformatics 2005 ; 21 : 3439 – 40 .

Durinck S , Spellman PT , Birney E , Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt . Nat Protoc 2009 ; 4 : 1184 – 91 . https://doi.org/10.1038/nprot.2009.97 .

Zhu LJ , Gazin C , Lawson ND et al. ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data . BMC Bioinformatics 2010 ; 11 : 237 . https://doi.org/10.1186/1471-2105-11-237 .

gprofiler2: Interface to the ’g:Profiler’ Toolset. ( 2019 ).

Supek F , Bošnjak M , Škunca N , Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms . PLoS One 2011 ; 6 : e21800 . https://doi.org/10.1371/journal.pone.0021800 .

KEGGprofile: An annotation and visualization package for multi-types and multi-groups expression data in KEGG pathway . ( 2019 ).

Zhang JD , Wiemann S. KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor . Bioinformatics 2009 ; 25 : 1470 – 1 . https://doi.org/10.1093/bioinformatics/btp167 .

Hagberg A , Swart P , Chult DS. "Exploring network structure, dynamics, and function using NetworkX," Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008 .

Liu H , Qiu C , Wang B et al. Evaluating DNA methylation, gene expression, somatic mutation, and their combinations in inferring tumor tissue-of-origin . Front Cell Dev Biol 2021 ; 9 : 619330 .

Tate JG , Bamford S , Jubb HC et al. COSMIC: the catalogue of somatic mutations in cancer . Nucleic Acids Res 2019 ; 47 : D941 – 47 . https://doi.org/10.1093/nar/gky1015 .

Chakravarty D , Gao J , Phillips SM et al. OncoKB: a precision oncology knowledge base . JCO Precis Oncol 2017 ; 1 : 1 – 16 . https://doi.org/10.1200/PO.17.00011 .

Szklarczyk D , Gable AL , Lyon D et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets . Nucleic Acids Res 2019 ; 47 : D607 – D613 . https://doi.org/10.1093/nar/gky1131 .

Warde-Farley D , Donaldson SL , Comes O et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function . Nucleic Acids Res 2010 ; 38 : W214 – 20 . https://doi.org/10.1093/nar/gkq537 .

Ietswaart R , Gyori BM , Bachman JA et al. GeneWalk identifies relevant gene functions for a biological context using network representation learning . Genome Biol 2021 ; 22 : 55. https://doi.org/10.1186/s13059-021-02264-8 .

Kramer A , Green J , Pollard J Jr. , Tugendreich S. Causal analysis approaches in ingenuity pathway analysis . Bioinformatics 2014 ; 30 : 523 – 30 . https://doi.org/10.1093/bioinformatics/btt703 .

Kanehisa M. Toward understanding the origin and evolution of cellular organisms . Protein Sci 2019 ; 28 : 1947 – 51 . https://doi.org/10.1002/pro.3715 .

Koch A , Joosten SC , Feng Z et al. Analysis of DNA methylation in cancer: location revisited . Nat Rev Clin Oncol 2018 ; 15 : 459 – 66 . https://doi.org/10.1038/s41571-018-0004-4 .

Singh AN , Sharma N. Identification of key pathways and genes with aberrant methylation in prostate cancer using bioinformatics analysis . Onco Targets Ther 2017 ; 10 : 4925 – 33 . https://doi.org/10.2147/OTT.S144725 .

Balas MM , Johnson AM. Exploring the mechanisms behind long noncoding RNAs and cancer . Noncoding RNA Res 2018 ; 3 : 108 – 17 . https://doi.org/10.1016/j.ncrna.2018.03.001 .

Li Q , Wang P , Sun C et al. Integrative analysis of methylation and transcriptome identified epigenetically regulated lncRNAs with prognostic relevance for thyroid cancer . Front Bioeng Biotechnol 2019 ; 7 : 439 . https://doi.org/10.3389/fbioe.2019.00439 .

Kanehisa M , Goto S. KEGG: kyoto encyclopedia of genes and genomes . Nucleic Acids Res 2000 ; 28 : 27 – 30 . https://doi.org/10.1093/nar/28.1.27 .

van Iersel MP , Kelder T , Pico AR et al. Presenting and exploring biological pathways with PathVisio . BMC Bioinformatics 2008 ; 9 : 399. https://doi.org/10.1186/1471-2105-9-399 .

Love MI , Huber W , Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 . Genome Biol 2014 ; 15 : 550. https://doi.org/10.1186/s13059-014-0550-8 .

Buniello A , MacArthur JAL , Cerezo M et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019 . Nucleic Acids Res 2019 ; 47 : D1005 – D1012 . https://doi.org/10.1093/nar/gky1120 .

Yan X , Hu Z , Feng Y et al. Comprehensive genomic characterization of long non-coding rnas across human cancers . Cancer Cell 2015 ; 28 : 529 – 40 . https://doi.org/10.1016/j.ccell.2015.09.006 .

Robinson MD , Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data . Genome Biol 2010 ; 11 : R25. https://doi.org/10.1186/gb-2010-11-3-r25 .

Robinson MD , McCarthy DJ , Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data . Bioinformatics 2010 ; 26 : 139 – 40 . https://doi.org/10.1093/bioinformatics/btp616 .

Siepel A , Bejerano G , Pedersen JS et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes . Genome Res 2005 ; 15 : 1034 – 50 . https://doi.org/10.1101/gr.3715005 .

Pollard KS , Hubisz MJ , Rosenbloom KR , Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies . Genome Res 2010 ; 20 : 110 – 21 . https://doi.org/10.1101/gr.097857.109 .

Survival: A Package for Survival Analysis in R . ( 2019 ).

Blanche P , Dartigues JF , Jacqmin-Gadda H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks . Stat Med 2013 ; 32 : 5381 – 97 . https://doi.org/10.1002/sim.5958 .

Chen T , Guestrin C. XGBoost: a scalable tree boosting system . In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . New York, NY : ACM , 2016 , pp. 785 – 94 .

Zhang X , Wan S , Yu Y et al. Identifying potential DNA methylation markers in early-stage colorectal cancer . Genomics 2020 ; 112 : 3365 – 73 . https://doi.org/10.1016/j.ygeno.2020.06.007 .

Liu MC , Oxnard GR , Klein EA , CCGA Consortium et al. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA . Ann Oncol 2020 ; 31 : 745 – 59 . https://doi.org/10.1016/j.annonc.2020.02.011 .

Hao X , Luo H , Krawczyk M et al. DNA methylation markers for diagnosis and prognosis of common cancers . Proc Natl Acad Sci U S A 2017 ; 114 : 7414 – 9 . https://doi.org/10.1073/pnas.1703577114 .

Tang W , Wan S , Yang Z et al. Tumor origin detection with tissue-specific miRNA and DNA methylation markers . Bioinformatics 2018 ; 34 : 398 – 406 . https://doi.org/10.1093/bioinformatics/btx622 .

Capper D , Jones DTW , Sill M et al. DNA methylation-based classification of central nervous system tumours . Nature 2018 ; 555 : 469 – 74 . no. Mar 22 https://doi.org/10.1038/nature26000 .

Ding W , Chen G , Shi T. Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis . Epigenetics 2019 ; 14 : 67 – 80 . https://doi.org/10.1080/15592294.2019.1568178 .

Zheng C , Xu R. Predicting cancer origins with a DNA methylation-based deep neural network model . PLoS One 2020 ; 15 : e0226461. https://doi.org/10.1371/journal.pone.0226461 .

Koelsche C , Schrimpf D , Stichel D et al. Sarcoma classification by DNA methylation profiling . Nat Commun 2021 ; 12 : 498. https://doi.org/10.1038/s41467-020-20603-4 .

Modhukur V , Sharma S , Mondal M et al. Machine learning approaches to classify primary and metastatic cancers using tissue of origin-based DNA methylation profiles . Cancers (Basel) 2021 ; 13 : 3768 . https://doi.org/10.3390/cancers13153768 .

Ibrahim J , Op de Beeck K , Fransen E et al. Genome-wide DNA methylation profiling and identification of potential pan-cancer and tumor-specific biomarkers . Mol Oncol 2022 ; 16 : 2432 – 47 . https://doi.org/10.1002/1878-0261.13176 .

Kuschel LP , Hench J , Frank S et al. Robust methylation-based classification of brain tumours using nanopore sequencing . Neuropathol Appl Neurobiol 2023 ; 49 : e12856 . https://doi.org/10.1111/nan.12856 .

Zhang S , He S , Zhu X et al. DNA methylation profiling to determine the primary sites of metastatic cancers using formalin-fixed paraffin-embedded tissues . Nat Commun 2023 ; 14 : 5686 . https://doi.org/10.1038/s41467-023-41015-0 .

Lundberg SM , Lee S-I. A unified approach to interpreting model predictions . Adv. Neural Inf. Process Syst 2017 ; 30 : 1 – 10 .

Hesson LB , Cooper WN , Latif F. The role of RASSF1A methylation in cancer . Dis Markers 2007 ; 23 : 73 – 87 . https://doi.org/10.1155/2007/291538 .

Sun J , Zheng MY , Li YW , et al. Structure and function of Septin 9 and its role in human malignant tumors . World J Gastrointest Oncol 2020 ; 12 : 619 – 31 . https://doi.org/10.4251/wjgo.v12.i6.619 .

Wu X , Chen H , You C et al. A potential immunotherapeutic and prognostic biomarker for multiple tumors including glioma: SHOX2 . Hereditas 2023 ; 160 : 21. https://doi.org/10.1186/s41065-023-00279-8 .

Bai P , Fan T , Sun G et al. The dual role of DNA repair protein MGMT in cancer prevention and treatment . DNA Repair (Amst) 2023 ; 123 : 103449. https://doi.org/10.1016/j.dnarep.2023.103449 .

Pirvan L , Samarajiwa SA. "Pangaea: A modular and extensible collection of tools for mining context dependent gene relationships from the biomedical literature," bioRxiv, p. 2020.04. 02.022517 , 2020 .

Gao Y , Shang S , Guo S et al. Lnc2Cancer 3.0: an updated resource for experimentally supported lncRNA/circRNA cancer associations and web tools based on RNA-seq and scRNA-seq data . Nucleic Acids Res 2021 ; 49 : D1251 – 58 . https://doi.org/10.1093/nar/gkaa1006 .

Wang J , Zhang X , Chen W et al. CRlncRNA: a manually curated database of cancer-related long non-coding RNAs with experimental proof of functions on clinicopathological and molecular features . BMC Med Genomics 2018 ; 11 : 114. https://doi.org/10.1186/s12920-018-0430-2 .

Carlevaro-Fita J , Lanzós A , Feuerbach L , PCAWG Consortium et al. Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis . Commun Biol 2020 ; 3 : 56. https://doi.org/10.1038/s42003-019-0741-7 .

Vancura A , Lanzós A , Bosch-Guiteras N et al. Cancer LncRNA Census 2 (CLC2): an enhanced resource reveals clinical features of cancer lncRNAs . NAR Cancer 2021 ; 3 : zcab013. https://doi.org/10.1093/narcan/zcab013 .

Ohara K , Arai E , Takahashi Y et al. Genes involved in development and differentiation are commonly methylated in cancers derived from multiple organs: a single-institutional methylome analysis using 1007 tissue specimens . Carcinogenesis 2017 ; 38 : 241 – 51 . https://doi.org/10.1093/carcin/bgw209 .

Huarte M. The emerging role of lncRNAs in cancer . Nat Med 2015 ; 21 : 1253 – 61 . https://doi.org/10.1038/nm.3981 .

England PH. "Bowel cancer screening: guidelines for colonoscopy." https://www.gov.uk/government/publications/bowel-cancer-screening-colonoscopy-quality-assurance/bowel-cancer-screening-guidelines-for-colonoscopy (accessed.

NICE . "Colorectal cancer prevention: colonoscopic surveillance in adults with ulcerative colitis, Crohn's disease or adenomas." https://www.nice.org.uk/guidance/cg118 (accessed.

Canning M , Guo G , Yu M et al. Heterogeneity of the head and neck squamous cell carcinoma immune landscape and its impact on immunotherapy . Front Cell Dev Biol 2019 ; 7 : 52 . https://doi.org/10.3389/fcell.2019.00052 .

Misawa K , Mochizuki D , Imai A et al. Analysis of site-specific methylation of tumor-related genes in head and neck cancer: potential utility as biomarkers for prognosis . Cancers 2018 ; 10 : 27 .

Liu B , Liu Y , Pan X et al. DNA methylation markers for pan-cancer prediction by deep learning . Genes (Basel) 2019 ; 10 : 778 . https://doi.org/10.3390/genes10100778 .

Colaprico A , Silva TC , Olsen C et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data . Nucleic Acids Res 2016 ; 44 : e71 . https://doi.org/10.1093/nar/gkv1507 .

R. R Core Team , "R: A language and environment for statistical computing," 2013 .

Naeem H , Wong NC , Chatterton Z et al. Reducing the risk of false discovery enabling identification of biologically significant genome-wide methylation status using the HumanMethylation450 array . BMC Genomics 2014 ; 15 : 51 . https://doi.org/10.1186/1471-2164-15-51 .

Online ISSN 2396-8923

COMMENTS

PDF Machine learning for detection of cyberattacks on industrial control
This thesis serves as a guide for managers of industrial and IoT systems looking to assess and upgrade the cyber risk posture of their organization. It focuses on one portion of industrial cyber security: early detection of anomalies. In this area, machine learning (ML) based anomaly
Malware Analysis and Detection Using Machine Learning Algorithms
We used supervised machine learning algorithms or classifiers (KNN, CNN, NB, RF, SVM, and DT) to examine malware and characterise it. Through statistical analysis of Table 2's results, we deduced that results of classifiers' accuracy (KNN = 95.02%, CNN = 98.76%, Naïve Bayes = 89.71%, Random Forest = 92.01%, SVM = 96.41%, and DT = 99% ...
Enhancing IoT Device Security: A Comparative Analysis of Machine
This thesis focuses on the types of IoT attacks. The goal was to deploy machine learning to make communication protocols in IoT devices more secure. ... (XSS), and Denial of Service (DOS). To address this issue, machine learning-based attack detection, using supervised learning algorithms, such as logistic regression, decision tree, and random ...
PDF Machine learning techniques for advanced cyber attack detection
on investigating suitable machine learning (ML) techniques to construct advanced cyber attack detection systems in this thesis. Specifically, we focus on exploring promising ML techniques to design effective intrusion detection systems (IDS) and efficient cyber threat intelligence (CTI) analysis models to realize proactive defense to cyber attacks.
PDF MACHINE LEARNING METHODS FOR MALWARE DETECTION AND
This paper discusses the main points and concerns of machine learning-based malware detection, as well as looks for the best feature representation and classification methods. The goal of this project is to develop the proof of concept for the machine learning based malware classification based on Cuckoo Sandbox.
California State University, NORTHRIDGE Malware detection using machine
Malware detection using machine learning A thesis submitted in partial fulfilment of the requirements For the degree of Master of Science, In Computer Science By Naveen Donepudi December 2022 . ii The thesis of Naveen Donepudi is approved X Dr. Robert McIlhenny X Date X Dr. Jeff Wiegley X
PDF ADVERSARIALLY ROBUST MACHINE LEARNING WITH ...
Machine learning (ML) systems are remarkably successful on a variety of benchmarks across several domains. In these benchmarks, the test data points, though not identical, are very similar to ... This thesis focuses on an extreme version of this brittleness, adversarial examples, where even imperceptible (but carefully constructed) changes ...
PDF Machine learning and anomaly detection for insider threat detection
PhD Thesis Machine learning and anomaly detection for insider threat detection Author: Filip Wieslaw Bartoszewski Supervisors: Dr. Mike Just Dr Michael A. Lones April 2022 "The copyright in this thesis is owned by the author. Any quotation from the thesis or use of any of the information contained in it must acknowledge this thesis as the source
Title: Detecting Fake News using Machine Learning: A Systematic
on is big challenge. They have used the machine learning for detecting fake news. Researchers of (Zhu et al., 2019) found that the fake news are increasing with the passage of time. That is why there is a need to detect fake news. The algorithms of machine learning are trained to fulfill this purpose. Machine l.
Machine learning for detection of fake news
As such, the goal of this project was to create a tool for detecting the language patterns that characterize fake and real news through the use of machine learning and natural language processing techniques. The results of this project demonstrate the ability for machine learning to be useful in this task. We have built a model that catches ...
Credit Card Fraud Detection Using Machine Learning
Phase 4: Modeling. Four machine learning models were created in the modeling phase, KNN, SVM, Logistic Regression and Naïve Bayes. A comparison of the results will be presented later in the paper to know which technique is most suited in the credit card fraudulent transactions detection.
Electronic Thesis/Dissertation
More specifically, it leverages supervised machine learning to effectively detect intrusion of cyber-attacks on the UAV Attack dataset (Whelan, et. al., 2020) via binary and multi-class classification, while simultaneously aiming to identify a classifier that outperforms prior approaches using standard classification metrics.
Fraud Detection in Financial Services using Machine Learning
Most detection tools rely on a few key models which analyze the data and raise any suspicious behavior. The level of threshold in terms of anomalies can be adjusted to differentiate legitimate transactions from fraudulent ones. The rising use of machine learning has prompted it to be utilized in many areas.
PDF Machine Learning for Automated Anomaly Detection in Semiconductor
This is why machine learning offers a great potential for anomaly detection in semiconductor manufacturing. If anomalies in the manufacturing process could be detected, or even predicted, earlier, then a manufacturing facility could halt the process and correct the affected machine. This would increase process yield and ...
Static Malware Detection using Deep Neural Networks on Portable Executables
13] as static-dynamic approach to use machine learning for detecting unknown malware. They proposed analyzing operational codes obtained from disassembly of executables and analyzing their execution trace to determine malicious intent. Similarly, a dynamic malware detection framework for Android called DroidDolphin managed to achieve 86.1% ...
Financial Fraud Detection using Machine Learning Techniques
Financial Fraud Detection using Machine Learning Techniques Matar Al Marri, Ahmad AlAli ... Ahmad, "Financial Fraud Detection using Machine Learning Techniques" (2020). Thesis. Rochester Institute of Technology. Accessed from This Master's Project is brought to you for free and open access by the RIT Libraries. For more ...
PDF Intrusion Detection Using Machine Learning Algorithms
iques based on machine learning, deep learning, and blockchain technology from 2009 to 2018. The survey identifies applications, drawbacks, and challenges of these three intrusion detection methodologies that identify threats in computer network environments. The second half of this thesis proposes a new machine learning model f.
A Machine Learning Algorithm for Intrusion Detection System in Edge
This thesis presents the design, implementation, and evaluation of an Intrusion Detection System (IDS) specifically tailored for edge computing networks. ... The IDS incorporates machine learning algorithms, with a focus on decision tree classifiers, to enable effective intrusion detection. The NSL-KDD dataset is utilized for experimentation ...
Utilizing Process Mining and Deep Learning to Detect IoT / IIoT
This dissertation explores a critical issue in computational cybersecurity methods, emphasizing the limitations of Machine Learning (ML) and Deep Learning (DL) models that rely heavily on extensive datasets of normal and synthesized attack data points. Given the scarcity of real attack data and the impracticality of using synthesized data for training in real-world applications, the research ...
PDF Increasing the Predictive Potential of Machine Learning Models for
firewalls and intrusion detection systems (IDSs), etc. These techniques protect networks from ... My deepest thanks to my thesis advisor Dr. Kendall E. Nygard, for his continuous support, direction, and guidance from the beginning to end of this ... Machine learning techniques in cybersecurity; 2.7.1. Supervised learning; 2.7 ...
PhD Dissertations
The Machine Learning Department at Carnegie Mellon University is ranked as #1 in the world for AI and Machine Learning, we offer Undergraduate, Masters and PhD programs. ... (DIsH) Learning Junier Oliva, 2018. Stress Detection for Keystroke Dynamics Shing-Hon Lau, 2018. Sublinear-Time Learning and Inference for High-Dimensional Models Enxu Yan ...
A deep learning-based algorithm for pulmonary tuberculosis detection in
Singh, M. et al. Evolution of machine learning in tuberculosis diagnosis: A review of deep learning-based medical applications. Electronics 11 (17), 2634 (2022).
"A Machine Learning Approach to Network Intrusion Detection System Usin
Atawodi, Ilemona S., "A Machine Learning Approach to Network Intrusion Detection System Using K Nearest Neighbor and Random Forest" (2019). Master's Theses. 651. The evolving area of cybersecurity presents a dynamic battlefield for cyber criminals and security experts. Intrusions have now become a major concern in the cyberspace.
Towards Reducing Data Acquisition and Labeling for Defect Detection
In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for ...
Malware detection using machine learning
Masters Thesis Malware detection using machine learning. It is highly important to detect whether any malware is present in a file. Due to the increase in malware, many problems are created, and companies are losing their important data and facing various problems. The next point is that malware can easily create a lot of damage to the ...
A review of machine learning and deep learning algorithms for ...
Parkinson's Disease (PD) is a prevalent neurodegenerative disorder with significant clinical implications. Early and accurate diagnosis of PD is crucial for timely intervention and personalized treatment. In recent years, Machine Learning (ML) and Deep Learning (DL) techniques have emerged as promis …
Cybersecurity Attacks Detection For MQTT-IoT Networks Using Machine
8.1 Conclusion. In this thesis, we assessed how well ensemble machine-learning approaches performed when used to detect cybersecurity attacks for MQTT-IOT networks. The primary focus was on the three popular ensemble methods Bagging, Boosting, and Stacking.
PDF Master Thesis Using Machine Learning Methods for Evaluating the ...
Based on this background, the aim of this thesis is to select and implement a machine learning process that produces an algorithm, which is able to detect whether documents have been translated by humans or computerized systems. This algorithm builds the basic structure for an approach to evaluate these documents. 1.2 Related Work
Early Detection of Emotional Issues in High School Students: A Machine
The primary objective of this project is to develop a predictive model for early emotional issue detection in high school students. By harnessing academic performance, attendance and discipline records, this research work seeks to detect early indicators of emotional issues, thereby empowering school administrators and educators to promptly identify and support students. At the heart of this ...
Early detection and diagnosis of cancer with interpretable machine
Detection of the cancer states through binary classification of DNA methylation from individual tumour and normal tissues. XGBoost, a type of gradient boosted tree model, is an iterative ensemble machine learning approach. We trained 13 binary XGBoost models, one for each cancer type.
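
The last snippet describes training one binary gradient-boosted model per cancer type. The following is a minimal sketch of that idea, not the authors' pipeline. It swaps XGBoost for a small hand-rolled booster built from depth-1 trees (stumps) so that it runs with the Python standard library alone. The three cancer type labels, the three "probes", the synthetic beta values, and the tumour-versus-normal pairing of each training set are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Least-squares depth-1 tree: returns (feature, threshold, left_value, right_value)."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: no usable split
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]


def fit_booster(X, y, rounds=20, lr=0.5):
    """Gradient boosting on log loss: each stump fits the residuals y - p."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lr * lv, lr * rv))  # store shrunken leaf values
        scores = [s + (lr * lv if row[j] <= t else lr * rv) for row, s in zip(X, scores)]
    return stumps


def predict_proba(stumps, row):
    """Probability that `row` is a tumour sample under one binary model."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))


# Hypothetical toy data: one probe per cancer type, hypermethylated in that tumour.
rng = random.Random(0)
CANCER_TYPES = ["type_A", "type_B", "type_C"]  # assumed labels; the source uses 13 types


def sample(hyper_probe=None):
    row = [rng.uniform(0.1, 0.3) for _ in CANCER_TYPES]  # low beta values everywhere
    if hyper_probe is not None:
        row[hyper_probe] = rng.uniform(0.7, 0.9)  # one probe with high methylation
    return row


# One binary model per cancer type, each trained on tumour (1) vs normal (0) samples.
models = {}
for k, name in enumerate(CANCER_TYPES):
    tumour = [sample(k) for _ in range(20)]
    normal = [sample() for _ in range(20)]
    models[name] = fit_booster(tumour + normal, [1] * 20 + [0] * 20)
```

Calling `predict_proba(models["type_A"], row)` returns that model's tumour probability for a new sample. With a real dataset, a library implementation such as XGBoost would replace `fit_booster`, while the loop that trains one model per cancer type would stay the same.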

1. Introduction

Enhancing IoT Device Security: A Comparative Analysis of Machine Learning Algorithms for Attack Detection

Machine Learning for Intrusion Detection of Cyber-Attacks in Unmanned Aerial Vehicles

Machine Learning - CMU

PhD Dissertations

A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography

Materials and methods

Training datasets

Algorithm: Google teachable machine

Dataset for external validation

Physician’s performance test

Statistical analysis

Internal validation

External validation

Physicians’ performance

CXR image patterns and cutoff value evaluation

Deployment of the TB CXR AI

Data availability

Acknowledgements

Master's Theses

Malware detection using machine learning

A review of machine learning and deep learning algorithms for Parkinson's disease detection using handwriting and voice datasets

Early Detection of Emotional Issues in High School Students: A Machine Learning Approach

Early detection and diagnosis of cancer with interpretable machine learning to uncover cancer-specific DNA methylation patterns

Microarray-based methylation analysis

Data pre-processing