

Open Access

Peer-reviewed

Research Article

Machine learning in internet financial risk management: A systematic literature review

Roles Conceptualization, Data curation, Methodology, Software, Writing – original draft, Writing – review & editing

Affiliations Science and Technology Finance Key Laboratory of Hebei Province, Hebei Finance University, Baoding, Hebei, China, Faculty of Management, Universiti Teknologi Malaysia, Johor Baru, Malaysia, Faculty of Management, Hebei Finance University, Baoding, Hebei, China

Roles Project administration, Resources, Validation, Visualization, Writing – review & editing

Affiliation BoHai College, Hebei Agricultural University, Cangzhou, Hebei, China

Roles Methodology, Supervision, Writing – review & editing

Affiliation Faculty of Management, Universiti Teknologi Malaysia, Johor Baru, Malaysia

Roles Funding acquisition, Writing – review & editing

Affiliation Faculty of Management, Hebei Finance University, Baoding, Hebei, China

Xu Tian,
ZongYi Tian,
Saleh F. A. Khatib,

Published: April 16, 2024
https://doi.org/10.1371/journal.pone.0300195

Internet finance has reached a vast number of households, bringing convenience alongside potential risks. Internet finance enterprises are increasingly adopting machine learning and other artificial intelligence methods for risk warning. What is the current status of the application of various machine learning models and algorithms across different institutions? Is there an optimal machine learning algorithm suited to the majority of internet finance platforms and application scenarios? Scholars have begun to address these questions; however, existing studies mostly compare different algorithms within specific platforms and contexts, and a comprehensive discussion and summary of the use of machine learning in this domain is lacking. Therefore, drawing on data from the Web of Science and Scopus databases, this paper conducts a systematic literature review of machine learning in internet finance risk in recent years, covering publication trends, geographical distribution, literature focus, machine learning models and algorithms, and evaluations. The review reveals that machine learning, as a nascent technology, whether through basic algorithms or intricate algorithmic combinations, has made significant strides over traditional credit scoring methods in prediction accuracy, time efficiency, and robustness in internet finance risk management. Nonetheless, there are noticeable disparities among algorithms, and factors such as model structure, sample data, and parameter settings also influence prediction accuracy, although newer algorithms generally tend to achieve higher accuracy. Consequently, there is no one-size-fits-all approach applicable to all platforms; each platform should enhance its machine learning models and algorithms based on its unique characteristics, data, and the development of AI technology, starting from key evaluation indicators, to mitigate internet finance risks.

Citation: Tian X, Tian Z, Khatib SFA, Wang Y (2024) Machine learning in internet financial risk management: A systematic literature review. PLoS ONE 19(4): e0300195. https://doi.org/10.1371/journal.pone.0300195

Editor: Muhammad Usman Tariq, Abu Dhabi University, UNITED ARAB EMIRATES

Received: November 8, 2023; Accepted: February 22, 2024; Published: April 16, 2024

Copyright: © 2024 Tian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: Hebei Social Science Fund (HB22YJ026); Open Fund Project of Science and Technology Finance Key Laboratory of Hebei Province (STFCIC202102;STFCIC202213); S&T Program of Hebei (22567630H); Baoding Science and Technology Bureau science and technology plan soft science project (2340ZZ013). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors declare no conflict of interest.

1. Introduction

With the rapid development of internet technology and the arrival of the intelligent era, traditional financial enterprises have gradually expanded their online business operations and are embracing the new format of internet financial services, along with the internet financial platform companies that have emerged since 2012 [ 1 ]. Internet finance has developed rapidly owing to its convenience, real-time nature, and lack of geographical limitations, resulting in an expansion of market size, number of participants, and services or products offered [ 2 ]. However, it also faces significant risks, as evidenced by the large number of problems with P2P platforms in 2018. Compared with traditional financial services, internet finance has relatively low barriers to entry, smaller amounts, faster speeds, and more relaxed audits, which places higher demands on credit risk control, fraud prediction, and other risk prevention measures in internet financial platforms [ 3 , 4 ]. Research on risk identification, risk warning, and risk supervision based on big data [ 5 – 8 ], blockchain [ 9 , 10 ], artificial intelligence [ 4 , 11 , 12 ] and machine learning algorithms [ 1 , 2 ] is progressively unfolding.

Internet finance refers to a business model in which traditional financial institutions or internet companies use internet technology to provide financial services such as financing, payment, investment, and information intermediation online [ 13 ]. Over a span of two years starting from 2016, more than 200 internet finance companies in China alone experienced defaults, involving issues such as borrower delinquency, platform fraud, and cyberattacks [ 2 ]. In 2018 alone, the number of P2P internet finance platforms in China plummeted from 6385 to 1595 by August, causing significant losses for investors [ 14 ]. Internet financial services offer rapid response times, which enhances user satisfaction; consequently, swift identification of potential risks is crucial [ 15 , 16 ]. Because the internet will remain a pivotal direction for the development of the financial industry for the foreseeable future, with major financial institutions such as banks offering an increasing number of services through online channels, this paper focuses on financial risk prevention in the internet domain. Research on internet financial risk warning can help stop potential risks at an early stage [ 17 ], as traditional credit scorecard models can no longer balance the needs of business development and security [ 2 ]. The aim is to explore how different machine learning methods can better identify and mitigate internet finance risks, particularly where traditional credit rating methods are poorly suited to the rapid and efficient nature of the internet. This paper adopts a systematic literature review approach to examine the machine learning models and algorithms that scholars have used to assess internet finance risks. The review aims to gain insight into how machine learning algorithms are applied in this field and what outcomes they achieve in different contexts, and thereby to compare the suitability of different algorithms in this domain.

The significance and main contributions of this paper are manifested in several aspects. Firstly, it innovatively employs a systematic literature review approach to delineate the landscape of machine learning models and algorithms in internet finance risk management. Through a systematic analysis of previous research achievements, this study comprehensively reviews and compares the approaches and outcomes of machine learning in internet financial risk warning and identification. Secondly, while traditional credit scoring methods and various machine learning algorithms are commonly used in risk management for internet finance platforms, previous literature has compared these methods in different contexts. This paper provides a clear and comprehensive classification and summary analysis of the application of these methods in internet finance platforms. Thirdly, building upon the existing landscape, we believe this paper provides a clear roadmap for future research on this topic, outlining research directions and themes to bridge knowledge gaps. Fourthly, from a practical standpoint, the various frameworks and methods for internet financial risk identification provided by this study can assist internet financial companies in identifying their weaknesses and enhancing risk prevention measures. This, in turn, can elevate their service quality, facilitating more widespread and stable financial services.

The remainder of this study is structured as follows. Section 2 presents the literature review; Section 3 introduces the methods and strategies of this paper; Section 4 shows the results; Section 5 discusses the findings; Section 6 presents the conclusions; and the final section offers suggestions for future research.

Because scholars use different data sets and scenarios in their research, and because machine learning models and algorithms are developing rapidly (including large models such as the Transformer, for which there is currently no research literature on internet finance risk), this paper cannot provide a unified conclusion. Instead, practitioners can select the models and algorithms that best suit their own circumstances and data based on the evaluative findings presented in this paper.

2. Literature review

Scholars have proposed the utilization of machine learning techniques [ 14 , 18 , 19 ] to predict credit risks by collecting and mining internet data. This approach has yielded superior predictive outcomes compared to conventional methods. Even within the same data sources, machine learning models exhibit greater accuracy [ 2 , 8 ], stability [ 8 ], predictive precision [ 19 , 20 ], and efficiency [ 20 ] in contrast to traditional credit scoring models.

Mirza et al. [ 19 ] compared various methods such as Naïve Bayes, Random Forest, and DLNN, and computed the accuracy of different models, revealing an enhancement in the precision of internet finance credit detection and prediction. However, researchers have discovered variations in efficiency and outcomes among different machine learning models and algorithms. Thus, developing superior algorithms and more efficient, reliable machine learning models for internet financial risk prediction has become an urgent challenge to address.

The research on the topic of internet financial risk has a long history [ 21 ], encompassing both quantitative empirical analyses [ 22 ] and qualitative descriptions [ 1 ], as well as comprehensive review studies [ 13 ]. There are analyses employing quantitative platform data [ 2 , 14 ] and those conducted using textual data [ 23 , 24 ]. Studies have delved into various subtopics such as risk perception [ 22 ], risk identification [ 24 ], and risk regulation [ 12 ], rendering the research on internet financial risk quite extensive.

However, the exploration of internet financial risk from the perspective of machine learning models emerged relatively late. The application of this approach to internet financial risk warning and risk management research began only in 2019 [ 15 ], gradually gaining momentum alongside technological development [ 11 , 19 , 25 ]. The primary focus of these studies lies in the selection of model methodologies [ 17 , 20 , 26 ] and the construction of risk systems [ 1 , 27 , 28 ]. To date, however, there has been no comprehensive review article or study systematically outlining the state of this emerging yet critical research field. This is precisely the contribution of the present study.

The primary objective of this study is to elucidate the application and research status of various machine learning algorithms and models in identifying and warning about internet financial risks. Using a systematic literature review approach, we conduct a comprehensive analysis of the relevant literature in this field. Currently, there are only a limited number of articles on this topic [ 11 , 25 , 27 ]. Our study addresses the following three main questions, clarifying the current state of research and the literature gaps in this field, as well as the differences between various internet financial risk identification and warning methods.

Q1. What machine learning algorithms have been studied in the literature for internet financial risk identification and warning, and have these algorithms and models all shown improvement?
Q2. What is the application status of the aforementioned algorithms and models?
Q3. Is there a best-suited machine learning algorithm for most internet financial platforms?

In this study, a systematic literature review method is employed to investigate the above questions. This method is well-suited for concentrating on a specific topic, providing a panoramic view, offering a more comprehensive understanding of the chosen domain, and highlighting gaps and future research directions [ 29 – 31 ].

3. Methodology

This study follows the standardized Systematic Literature Review (SLR) procedure [ 32 , 33 ]. We first selected the Scopus and Web of Science (WOS) databases as sources and searched for all publications related to "internet financial risk" across all years. Scopus, being the world’s largest abstract and citation database, provides an extensive repository of abstracts and citations. Web of Science, on the other hand, is a comprehensive, multidisciplinary, core journal citation indexing database. Both databases are globally authoritative and specialized platforms for data retrieval and offer advanced search functionality, which allows us to obtain relevant search results quickly, efficiently, and comprehensively.

3.1 Sample identification

In this study, we employed a keyword-based literature retrieval strategy [ 29 , 34 ]. To gather all relevant literature and research, we formulated multiple search strings related to "internet financial risk". Because English offers several expressions for the same concept, and internet finance may also be referred to as "online finance," "network finance," or "Fintech," we compiled all potentially relevant keywords listed in Table 1 and combined them through permutations using the Boolean operator "OR". Furthermore, to account for variations between terms such as "finance" and "financial," we used the asterisk "*" as a wildcard for the varying part, aiming to cover every form of the phrase "internet financial risk". In the Scopus database, we searched by "Title-Abstract-Keywords". In the Web of Science database, we used "Topic" as the search mode and narrowed the search scope to three major citation databases, the Science Citation Index (SCI), Social Sciences Citation Index (SSCI), and Arts & Humanities Citation Index (A&HCI), to ensure the quality of the source journals. The final search strings are presented in Table 1 . The search date for all the data is August 9, 2023, and all literature cited in this study is up to that date.

Table 1. https://doi.org/10.1371/journal.pone.0300195.t001
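
The keyword strategy described above (synonyms joined with "OR", plus a "*" wildcard for the finance/financial variation) can be sketched in a few lines of Python. This is only an illustration: the term lists below are hypothetical stand-ins built from the synonyms named in the text, not the authors' exact strings from Table 1, and the field tags `TITLE-ABS-KEY` (Scopus) and `TS` (Web of Science topic search) should be checked against each database's current search help before use.

```python
from itertools import product

# Hypothetical term lists for illustration only; the real strings are in Table 1.
PREFIXES = ["internet", "online", "network"]
STEMS = ["financ*"]        # "*" is a wildcard covering "finance", "financial", ...
EXTRA_TERMS = ["fintech"]  # single-word synonym mentioned in the text
RISK = "risk"


def build_phrases():
    """Combine every prefix with every stem, then append the single-word synonyms."""
    phrases = [f'"{p} {s} {RISK}"' for p, s in product(PREFIXES, STEMS)]
    phrases += [f'"{t} {RISK}"' for t in EXTRA_TERMS]
    return phrases


def scopus_query(phrases):
    """Title-Abstract-Keywords search: phrases joined with Boolean OR."""
    return f"TITLE-ABS-KEY({' OR '.join(phrases)})"


def wos_query(phrases):
    """Topic search: the same phrases joined with Boolean OR."""
    return f"TS=({' OR '.join(phrases)})"


phrases = build_phrases()
print(scopus_query(phrases))
print(wos_query(phrases))
```

Generating the strings from lists keeps the two databases' queries in step, so a synonym added later reaches both searches.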

3.2 Inclusion and exclusion criteria

Following the search using the aforementioned keyword strings, the initial results in the Scopus and Web of Science databases were 116 and 48 publications, respectively. Following the approach of Khatib et al. [ 31 ] and Khatib et al. [ 32 ], we refined the results by limiting the language to "English", which reduced the counts to 113 and 48. Further restricting the document type to "journal articles" resulted in 70 and 48 publications. Subsequently, in the Scopus database, we narrowed down the "Subject area" to categories including "Computer Science", "Economics, Econometrics and Finance", "Business, Management and Accounting", "Engineering", "Mathematics", "Social Sciences", "Decision Sciences" and "Multidisciplinary". In the WOS database, we limited the "research area" to "Business Economics", "Computer Science", "Mathematics", "Telecommunications", "Engineering", "Operations Research Management Science", "Environmental Sciences Ecology" and "Science Technology Other Topics", yielding 68 and 48 publications respectively.

We then merged the results from the two databases and removed duplicates, resulting in 70 articles. Subsequently, we conducted title screening and excluded 7 articles. The remaining 63 publications were subjected to abstract reading and screening, yielding 47 relevant articles. Finally, we read these remaining publications in full, retaining those that incorporated concepts related to machine learning and eliminating those unrelated to the subject. We also excluded one paper that had been retracted. This led to the final selection of 17 articles focusing on the application of machine learning for internet financial risk identification and warning.
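
The merge-and-deduplicate step can be illustrated with a minimal sketch. The matching rule here (DOI when available, otherwise a normalized title) is an assumption for illustration, since the paper does not state how duplicates were detected, and the records are made-up placeholders rather than the study's data.

```python
def record_key(record):
    """Identify a record by its DOI if present, else by a normalized title.

    This matching rule is an assumption; the paper does not specify one.
    """
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return ("title", title)


def merge_deduplicate(*sources):
    """Concatenate result lists, keeping the first occurrence of each record."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = record_key(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Placeholder records, not the study's data.
scopus = [
    {"title": "Paper A", "doi": "10.1000/a"},
    {"title": "Paper B", "doi": "10.1000/b"},
]
wos = [
    {"title": "Paper B", "doi": "10.1000/B"},  # same DOI, different letter case
    {"title": "Paper C", "doi": None},         # no DOI: falls back to the title
]

merged = merge_deduplicate(scopus, wos)
print([r["title"] for r in merged])
```

A record that carries a DOI in one database but not in the other would slip through this rule, which is one reason a manual title screening pass remains useful after automated deduplication.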

Fig 1 illustrates the process conducted in this study, encompassing database searches, refinement, merging, deduplication, screening, and eligibility selection. Unlike existing articles that solely focus on "internet finance risk" [ 22 ], "financial technology" [ 35 ], or "credit risk" [ 13 ], this study employs a systematic review approach to concentrate on the application and exploration of various machine learning methodologies in the realm of "internet financial risk." Despite the limited number of publications, this review comprehensively assesses and evaluates literature in this field. It not only analyzes numerous models and algorithms applied in the domain of internet financial risk but also systematically examines aspects like annual publication trends, regional publication trends, relevant research methods, evaluation metrics, and more.

Fig 1. https://doi.org/10.1371/journal.pone.0300195.g001

For the aforementioned literature, this paper will focus on examining the machine learning models employed by scholars in the field of internet finance risk management, as well as how these models perform across different platforms and scenarios. Therefore, we will compare the applicability of different models and algorithms in this field based on the development history, application domains, and advantages of machine learning. The results will be presented in Section 4.5.

When evaluating and assessing model performance, the goal is to ensure that the model correctly classifies samples, meaning that the actual situation of the sample data matches the model’s predictions as closely as possible. Therefore, for binary classification problems, there are four different scenarios:

The model predicts positive, and the actual situation is also positive, indicating that the model prediction is true, known as the True Positive (TP) scenario.
The model predicts negative, but the actual situation is positive, indicating that the model prediction is false, known as the False Negatives (FN) scenario.
The model predicts positive, but the actual situation is negative, indicating that the model prediction is false, known as the False Positives (FP) scenario.
The model predicts negative, and the actual situation is also negative, indicating that the model prediction is true, known as the True Negatives (TN) scenario.

TP, FN, FP, and TN respectively represent the sample counts for the four scenarios described above. Therefore, machine learning model assessment is based on these four scenarios, and a series of metrics have been developed to judge the model’s performance. This paper will present the results and explanations based on the main metrics applied in the literature in the "Results" section.
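
To make the four scenarios concrete, the sketch below counts TP, FN, FP, and TN from a pair of label lists and derives four widely used metrics from them (accuracy, precision, recall, and F1). The choice of these four metrics and the toy labels are assumptions for illustration; the metrics actually applied in the reviewed literature are those reported in the results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four outcomes of a binary classifier."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # predicted positive, actually positive
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fn, fp, tn


def basic_metrics(tp, fn, fp, tn):
    """Derive common metrics, guarding against division by zero."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (1 = default, 0 = repaid); not data from any reviewed study.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(basic_metrics(*counts))
```

In risk settings the two error types rarely cost the same: a missed default (FN) and a rejected good borrower (FP) have different consequences, which is why accuracy alone is usually read alongside precision and recall.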

Despite including data from WOS and Scopus, there is still a possibility of not encompassing all relevant literature. However, considering the authority of the literature research, this paper still relies on the two aforementioned databases, which are of higher quality and more authoritative in content.

4. Results

4.1 Publication trends

The popularization of Internet financial services occurred around 2010, while research focusing on Internet financial risks began in 2012 [ 21 ]. Thanks to a series of algorithmic innovations in computer science, machine learning, deep learning, and other methods have gradually been applied to Internet financial risk analysis, leading to growing interest in the subject. In our sample, the earliest document on this topic is the 2019 study by Noor et al. [ 15 ], the only publication that year. The number of publications then increased gradually, reaching 6 by 2022. As of August 2023, three more had been published, so the overall volume remains limited. This suggests that research applying these specific methods to this particular field is still insufficient. The yearly publication volume is shown in Fig 2 .

Fig 2. https://doi.org/10.1371/journal.pone.0300195.g002

4.2 Geographical distribution

As shown in Table 2 , this section presents the annual regional distribution of all the references in this paper. Of the 17 documents, 11 are based on Chinese Internet financial data [ 14 , 27 ]. Chinese scholars, and researchers applying machine learning algorithms to Chinese data, stand out as the driving force behind Internet financial risk research on this topic. The data fall primarily into three categories: national-level data [ 26 ], data from Internet financial platforms or related enterprises [ 17 , 20 ], and individual lending data from platforms [ 2 , 14 ], all of which are also detailed in Table 2 . There are also studies focused on Europe and the United States, as well as studies using global Internet financial platform data. Additionally, three articles do not precisely specify the regional focus of their research [ 7 , 25 , 36 ].

Table 2. https://doi.org/10.1371/journal.pone.0300195.t002

This phenomenon might be attributed to the fact that in China, after a period of rapid and unchecked growth of Internet financial platforms [ 17 ], serious risk issues emerged, involving numerous defaults, platform operators absconding with funds, and other problems [ 23 ], and the number of platforms fell to about one quarter of its peak [ 14 ]. Although Internet finance is an emerging financial service model, it has not altered the fundamental nature of financial services. Risk prevention remains a crucial and central aspect [ 27 , 28 ]. Consequently, Chinese scholars and professionals in the financial industry have shown a great deal of concern about Internet financial risk. They aim to use various methods to mitigate these risks and to promote the healthy development of the industry and of Internet financial services, which has generated heightened demand for such research [ 8 ].

4.3 Literature focus

A review of all the literature shows that these documents broadly focus on two distinct core aspects. One category of literature primarily revolves around comparison. These papers compare the differences in final risk identification, risk prediction, and risk supervision across various algorithms or models [ 11 , 19 , 20 ]. The objective is to identify the most suitable approach for the sample data, thereby better assisting platform companies or other entities in mitigating Internet financial risks. A total of 14 documents fall into this category. The other category of literature centers on designing or innovating Internet financial risk systems, applying relevant data to construct appropriate risk identification or risk prediction systems [ 27 , 28 , 36 ]. Although these two categories emphasize slightly different core points, their ultimate goals are risk reduction and enhanced operational stability. Both categories use machine learning-related models or algorithms, leading to a convergence of approaches. This underscores the diverse perspectives and research angles in understanding the practical applications of computer technology in the realm of Internet financial risk. The two categories are shown in Table 3 .

Table 3. https://doi.org/10.1371/journal.pone.0300195.t003

Currently, there are numerous sources of risk in internet finance, and the application of machine learning in internet finance risk management covers a wide range of areas and directions. From the literature reviewed, machine learning is primarily applied in the following five different types of risk management:

Internet financial platform risk: This category focuses on analyzing and warning of the various risks that may occur during the operation and management of internet finance platforms using different machine learning algorithms [ 7 , 23 , 28 ]. For instance, Feng and Qu [ 18 ] designed an RBF neural network model optimized by genetic algorithms and established an evaluation index system for internet finance risk. Han et al. [ 8 ] decomposed platform risk into four major components: credit risk, liquidity risk, interest rate risk, and technology risk.
Credit risk assessment and early warning: This area primarily studies the early identification and prediction of borrower credit risk using various machine learning algorithms. Suitable machine learning algorithms are believed to improve the identification of credit risks in lending, leading to higher predictive accuracy [ 2 , 11 , 14 , 25 , 36 ].
Internet financial market risk: This category focuses on identifying and analyzing risks in the internet finance market to enhance the level of internet finance risk management [ 18 , 27 ].
Fraud detection: This involves analyzing the efficacy of machine learning models in fraud detection, aiming to identify danger signals in economic datasets to detect future fraudulent activities [ 19 ].
Cyber threat: This area explores how machine learning models and algorithms can identify advanced network attack patterns and conduct automated network threat attribution analysis and prediction [ 15 ].

The distribution of different risk types in the literature is shown in Table 3 .

4.4 Fields of sciences

Fig 3 provides a detailed overview of the science subject areas in which the articles from the Scopus database are classified. According to the categorization method of the Scopus database, all the literature has been divided into a total of eight different subject areas. The highest number of papers falls under "Computer Science," followed by "Mathematics" and "Engineering," with no more than two papers in any other category. This indicates that although the theme of "Internet financial risk" leans more toward the field of economics and management, the literature predominantly focuses on the methodological aspects of risk identification and prediction. This aligns with the content discussed in the previous section.

Fig 3. https://doi.org/10.1371/journal.pone.0300195.g003

4.5 Machine learning methods

Based on the machine learning methods employed in the literature covered in this paper, they can be broadly categorized into five types: Traditional Machine Learning Algorithms, Deep Learning and Neural Networks, Optimization Algorithms, Data Preprocessing and Enhancement, and Other Methods. In the following sections, we will categorically discuss the methods utilized in the literature. Table 4 presents the annual distribution of all methods used in the sample literature. It should be noted that the same method can be classified into different categories based on various classification approaches. The above classification is solely aimed at facilitating the organization and expression of the literature content.

Table 4. https://doi.org/10.1371/journal.pone.0300195.t004

4.5.1 Traditional machine learning algorithms.

Across all the literature, traditional machine learning methods are mentioned and used a total of 30 times, making this the most frequently used of the categories mentioned above. This suggests that methods introduced or adopted earlier are used more often in the context of internet financial risk, implying that their applicability is relatively mature. The most commonly used method in the literature is the Logistic Model, appearing in 6 articles, followed by the Random Forest, Gaussian Naïve Bayes Model, and Decision Tree methods, each used in 4 articles. It is worth noting that, among the traditional machine learning methods, the more frequently used ones are classification algorithms.

Owing to its strong interpretability, the Logistic Model is the most frequently used model in credit scoring [ 2 , 36 ]. Its predictions give the probability that a record belongs to a certain category [ 25 ], and studies built around the logistic model show that machine learning models have an advantage in identifying the key factors that affect credit customers' default performance. Bussmann et al. [ 11 ] and Wu et al. [ 1 ] have also compared the logistic model with other machine learning models. Scholars used data from European Credit Assessment Institutions (ECAIs) on commercial loans to small and medium-sized enterprises (SMEs) obtained from P2P platforms to construct a logistic regression scoring model. This model incorporated financial data on assets and liabilities, as well as network centrality indicators derived from similarity networks, to estimate the default probability of each company. A comparison with models employing the XGBoost tree algorithm found that, for internet financial risk, newer machine learning methods generally exhibit higher predictive accuracy [ 11 ].
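
The sketch below illustrates why a logistic scoring model is considered interpretable: the predicted default probability is a fixed transformation of a weighted sum of the inputs, and each weight can be read as an odds multiplier. The two features, their weights, and the intercept are hypothetical values chosen for illustration; they are not estimates from any study reviewed here.

```python
import math


def default_probability(features, weights, intercept):
    """Logistic model: P(default) = 1 / (1 + exp(-(b0 + sum(w_i * x_i))))."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients for two illustrative inputs,
# e.g. a leverage ratio and a network centrality indicator.
WEIGHTS = [1.2, -0.8]
INTERCEPT = -1.0

borrower = [0.5, 0.25]
p = default_probability(borrower, WEIGHTS, INTERCEPT)
print(f"estimated default probability: {p:.3f}")

# Interpretability: a one-unit increase in feature i multiplies the odds
# of default by exp(w_i), holding the other features fixed.
odds_multipliers = [math.exp(w) for w in WEIGHTS]
print(odds_multipliers)
```

Tree ensembles such as XGBoost do not offer this direct reading of coefficients, which is part of the accuracy-versus-interpretability trade-off behind such comparisons.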

The Random Forest method builds on decision trees, in which each branch represents a potential decision, event, or response [ 15 ]. Decision trees can achieve very low bias, but they are also unstable and sensitive to the training data [ 25 ]. The Random Forest method therefore constructs many decision trees on random samples of the data drawn with replacement and combines their outputs, which mitigates the inconsistency caused by differences in sample selection and tree shape [ 19 ]. When the method is used for risk stratification on internet financial platforms, its objective is to reduce variance. It applies to both classification and regression tasks and, compared with other machine learning methods, yields more robust and accurate results [ 1 , 19 , 25 ]. For internet financial platforms, accuracy and robustness are crucial for identifying risks precisely and reliably, and the Random Forest method is therefore widely applied across various platforms.
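
The variance-reduction idea can be shown with a deliberately simplified sketch: each "tree" is a one-split stump fitted on a bootstrap sample (rows drawn with replacement) using one randomly chosen feature, and the ensemble predicts by majority vote. A real Random Forest grows deeper trees and samples candidate features at every split; the stump simplification and the toy data here are assumptions made to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows at random WITH replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_stump(X, y, rng):
    """Fit a one-split tree on a single randomly chosen feature."""
    feature = rng.randrange(len(X[0]))
    best = None  # (errors, threshold, left_label, right_label)
    for threshold in sorted(set(row[feature] for row in X)):
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(X, y, rng), rng) for _ in range(n_trees)]


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups (0 = low risk, 1 = high risk).
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25], [0.25, 0.15],
     [0.80, 0.90], [0.90, 0.70], [0.70, 0.80], [0.85, 0.75], [0.75, 0.85]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.95, 0.95]))
```

Because every stump sees a slightly different sample, individual thresholds vary from tree to tree, and the vote averages out that instability.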

Naive Bayes is a probabilistic classifier based on the principle of conditional probability in Bayes' theorem [ 15 ]. Naive Bayes models are relatively simple to implement, require less training data, and can handle both continuous and discrete data. They also predict quickly, which makes them particularly suitable for real-time forecasting [ 15 ]. Furthermore, they can be used for sentiment analysis and scoring of online user information, effectively evaluating user eligibility [ 1 ]. Bayes models also exhibit relatively high accuracy [ 19 , 25 ].

K-Nearest Neighbor (KNN) is a supervised machine learning algorithm that does not require prior knowledge and classifies a sample by the majority vote of its neighbors. It is described as particularly well-suited to large-scale financial service platforms [ 15 , 17 ]. The KNN method can be employed in conjunction with other techniques to obtain the fitness value. However, in the study by Mirza et al. [ 19 ], KNN had the lowest accuracy among the five methods employed.
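
The majority-vote rule behind KNN fits in a few lines. The sketch below uses Euclidean distance and k = 3, both of which are assumptions for illustration (the reviewed studies may use other distances or values of k), and the labeled points are toy data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, already-scaled features for six past loans; not data from any study.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
           [0.8, 0.9], [0.9, 0.8], [0.7, 0.7]]
train_y = ["repaid", "repaid", "repaid", "default", "default", "default"]

print(knn_predict(train_X, train_y, [0.25, 0.20]))
print(knn_predict(train_X, train_y, [0.85, 0.85]))
```

There is no training step: all the work happens at prediction time, when the query is compared with every stored sample, so features usually need to be on comparable scales for the distances to be meaningful.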

Other methods, such as the traditional RBF-NN [ 18 , 27 ] and the complementary neural network (CMTNN) [ 27 ], also appear in the existing literature as innovative models for internet financial risk prediction. Overall, these methods have the potential to improve on the accuracy and speed of traditional predictions. Models such as the Logistic model, the Bayes model, and Random Forest have become relatively mature in the field of machine recognition [ 1 , 7 ] and are widely applied. However, newer deep learning and reinforcement learning methods have shown superior performance on specific datasets compared to traditional machine learning algorithms [ 2 , 19 ], and they may find broader application in internet financial risk identification and early warning.

4.5.2 Deep learning and neural networks.

In terms of overall quantity, deep learning and neural network methods appear a total of 25 times in the context of internet financial risk, equal to the count for traditional machine learning methods. From a temporal perspective, these models appeared only once in 2019, and the combined occurrences in 2020 and 2021 were merely 8. Since 2022, however, the frequency has risen to 17 occurrences, nearly double the total of the preceding three years. Specifically, there were 8 occurrences in 2022 alone and already 9 by August 2023, signifying a steady increase in the integration and use of deep learning and neural network models in internet financial risk management. Turning to individual methods, the most used is the BP neural network, employed in 7 literature sources, followed by the Deep Learning Neural Network (DLNN) with 4. The XGBoost model, Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) are mentioned in 3, 2, and 2 sources, respectively.

In the analysis of internet financial risk, the Backpropagation (BP) neural network is the most frequently applied method across all literature sources. A BP neural network typically comprises at least three layers: an input layer, a hidden layer, and an output layer [ 8 ]. The approach does not require a predefined mathematical expression relating inputs to outputs [ 25 ]. It is a multi-layer feedforward network trained with the error backpropagation algorithm, which adjusts thresholds and weights according to the error of the output [ 20 , 26 ]. The structure of the BP neural network is comparatively simple, while its predictive accuracy and nonlinear processing capabilities are strong [ 18 , 20 ], and scholars have widely adopted it for analyzing risk management on internet financial platforms [ 1 , 18 , 27 ]. One study used data from 65 publicly listed Chinese companies to train optimized neural networks and tested them on big data from internet finance enterprises spanning 2015 to 2018, comparing the results against the actual development of the internet finance sector. It observed that, although the BP neural network yields higher predictive accuracy than other models, it requires the longest training time [ 18 ]. Therefore, as a foundational neural network method, it can produce improved outcomes when combined with other algorithms in subsequent steps [ 18 ].
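To make the weight-and-threshold adjustment concrete, the following is a minimal sketch of a three-layer network trained by error backpropagation in plain Python. It is illustrative only: the toy dataset (two binary warning flags with an OR-style label), the hidden-layer size, the learning rate, and all names are assumptions, and none of the optimizations used in the reviewed studies are included.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Input -> hidden -> output pass; returns hidden activations and output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def mse(data, W1, b1, W2, b2):
    return sum((forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, hidden=3, lr=0.5, epochs=1000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    history = [mse(data, W1, b1, W2, b2)]
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)
            # Output error signal for squared error with a sigmoid output.
            delta_out = (y - t) * y * (1 - y)
            # Propagate the error back to each hidden unit (uses W2 before its update).
            delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Adjust weights and thresholds (biases) against the gradient.
            for j in range(hidden):
                W2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hid[j] * x[i]
                b1[j] -= lr * delta_hid[j]
            b2 -= lr * delta_out
        history.append(mse(data, W1, b1, W2, b2))
    return (W1, b1, W2, b2), history

# Invented toy data: (overdue flag, high-leverage flag) -> risky if either is set.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
params, history = train(data)
```

The `history` list records the mean squared error after each epoch. The repeated passes over the data are also where the long training time noted above comes from.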

The literature on deep learning for internet financial risk describes deep learning as operating in a manner akin to the neural networks of the human brain [ 15 ] and as capable of learning from unlabeled or unstructured data. It fundamentally follows a supervised learning approach, enabling a better understanding of the mapping relationship between x and y [ 26 , 37 ]. Thanks to significant advancements in algorithms and hardware, deep learning can use more layers and more neurons for modeling, which makes it feasible for internet financial risk management [ 1 ]. Mirza et al. [ 19 ] constructed a database spanning 10 years and comprising 95 companies, using the KBW and Nasdaq Financial Technology Rankings as well as the Nasdaq Insurance (IXIS) Index. They used these data to compare five algorithms (Naive Bayes, KNN, Decision Tree, Random Forest, and DLNN) and found that DLNN achieved the highest accuracy of the five. Scholars have continued to combine foundational deep learning models with other algorithms in search of more suitable deep learning approaches.

The well-known XGBoost model is also rooted in the decision tree algorithm; it uses the gradient boosting ensemble technique to combine multiple decision tree models [ 2 ]. It applies gradient descent to minimize errors [ 11 ] and prunes inappropriate trees, resulting in a high-accuracy gradient tree boosting model [ 19 ]. This design gives XGBoost a distinctive advantage in handling sparse data. Fan et al. [ 2 ] selected a P2P online lending platform in China as the research subject and used data from 30,225 short-term loans issued by the platform from August to December 2018. They compared logistic regression, GMDH, SVM, and XGBoost for internet finance risk assessment and found that the XGBoost model achieved the highest overall accuracy, with a testing set accuracy of 90.1%. Bussmann et al. [ 11 ] drew similar conclusions.
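The core of gradient tree boosting, fitting each new tree to the errors left by the previous ones, can be sketched as follows. This is plain gradient boosting with one-split regression stumps on invented one-dimensional data; it is not XGBoost itself, which adds regularization, second-order gradient information, and sparsity-aware split finding. All names and values are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares split of 1-D inputs into two constant regions."""
    best = None
    for s in sorted(set(xs))[:-1]:  # candidate split points (largest x excluded)
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, rounds=20, lr=0.3):
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    model = lambda x: base + lr * sum(st(x) for st in stumps)
    return model, losses

# Invented toy data: a score x and an observed default rate y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.1, 0.2, 0.1, 0.3, 0.8, 0.9, 0.7, 0.9]
model, losses = boost(xs, ys)
```

The training loss in `losses` does not increase from round to round, because each stump is a least-squares fit to the current residuals and the learning rate `lr` only shrinks the step.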

Convolutional Neural Networks (CNN), built upon the foundation of deep learning (DLNN), incorporate convolutional layers designed for data feature extraction [ 23 ]. These extracted features are then passed to different network nodes, allowing for layered representation and data learning, ensuring efficient learning processes. CNNs are characterized by sparse connections and weight sharing [ 23 ], and have been attempted for prediction tasks, demonstrating performance on par with human experts [ 19 ].

Scholars have also employed Long Short-Term Memory (LSTM) networks to study internet financial risk [ 19 ]. LSTM, a specialized recurrent neural network, comprises three control units (an input gate, an output gate, and a forget gate), which enable it to address the challenge of long-sequence dependencies in neural networks [ 23 ]. This in turn improves predictive accuracy for high-risk groups in internet finance. Using a text corpus of 42,590 Q&A pairs, Xia et al. [ 23 ] improved classification outcomes by incorporating an attention mechanism and raised accuracy further by introducing Bi-directional Long Short-Term Memory (BiLSTM), which adds reverse-sequence information. BiLSTM consists of a forward and a backward LSTM, so the influence of both preceding and following context on the current output is taken into account. This facilitates learning more accurate semantic representations of text [ 38 ] and led to even higher recognition accuracy.
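For reference, the three gates mentioned above are usually written as follows. This is the standard textbook formulation of an LSTM cell, given here as an aid to the reader and not taken from the reviewed studies; the symbols (weights W, biases b, input x_t, hidden state h_t, cell state c_t) are notation assumed for this sketch.

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)\\
h_t^{\mathrm{Bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] && \text{(BiLSTM: forward and backward states concatenated)}
\end{aligned}
```

The additive update of c_t is what lets information persist across long sequences, and the last line shows how a BiLSTM exposes context from both directions at each position.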

Adaptive Boosting (AdaBoost) combines the outputs of multiple weak classifiers to enhance classification performance, aiming to reduce the overall classifier error after each iteration. As a result, this method has achieved high accuracy in internet financial risk models [ 19 ]. Methods such as the Probabilistic Neural Network (PNN) [ 27 ], the general regression neural network (GR-NN) [ 27 ], and Restricted Boltzmann Machines (RBMs) can also accelerate the learning process and improve optimization efficiency [ 1 ]. Having applied various algorithms based on deep learning and neural networks to internet financial risk management, scholars generally find that improvements in machine learning algorithms lead to enhanced accuracy, greater robustness on validation sets, and even reduced response times. With the aid of more applicable machine learning algorithms, the capability of internet financial risk management is thus continuously improving, and this improvement remains ongoing.

4.5.3 Optimization algorithms.

The literature also covers several optimization algorithms, the most prominent being the Genetic Algorithm (GA) and its enhanced variants [ 20 ]. GA is a global optimization algorithm based on probabilistic search [ 20 ], known for its strong global search capabilities and wide adaptability [ 18 , 23 ]. The RBF algorithm optimized by ACO possesses high spatial mapping and generalization capabilities [ 18 ]. The GABP neural network, a BP network optimized by GA, adopts a distributed storage structure and demonstrates fast iteration, accurate results, good redundancy, and robustness in financial risk identification [ 20 ].
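The selection, crossover, and mutation loop of a genetic algorithm can be sketched briefly. The fitness function below is a hypothetical stand-in (negative squared distance to an invented target vector); in a GABP-style setup the fitness would presumably be derived from the error of a BP network whose weights are encoded in the genes, but that wiring is an assumption and is not shown.

```python
import random

def genetic_search(fitness, n_genes=5, pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_history.append(fitness(pop[0]))
        elite = pop[: pop_size // 4]           # selection: keep the fittest quarter
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]          # new list, parents stay unchanged
            j = rng.randrange(n_genes)         # mutation: perturb one gene
            child[j] += rng.gauss(0, 0.1)
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness), best_history

# Hypothetical fitness: closeness to an invented target weight vector.
target = [0.5, -0.2, 0.1, 0.9, -0.7]
fitness = lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target))
best, history = genetic_search(fitness)
```

Because the elite individuals are carried over unchanged, the best fitness never gets worse from one generation to the next, which is one simple way to obtain the stable global search behavior attributed to GA above.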

Combining the Genetic Algorithm with the Simulated Annealing (SA) algorithm yields the SA-GABP method, a GABP algorithm optimized by simulated annealing, which was found to have higher accuracy and predictive speed than both BP neural networks and GABP networks. Guang et al. [ 20 ] selected 36 internet finance companies as samples, grouped them by financial condition, and optimized networks using the GA, GABP, and SA-GABP algorithms. They found that leveraging the global optimization capabilities of these genetic algorithms and applying the optimized networks to predict internet finance risks resulted in favorable prediction outcomes. This provides a scientific basis for credit decision-making and risk prevention in internet finance and banking.

4.5.4 Data preprocessing and enhancement.

This category concerns data preprocessing and selection and encompasses three main methods: the Synthetic Minority Over-sampling Technique (SMOTE), the Group Method of Data Handling (GMDH), and Weight of Evidence (WOE), with only 5 applications in total. SMOTE generates new synthetic minority-class samples to form a rebalanced dataset, which improves the ability to discriminate between classes [ 2 ]. GMDH is a technique for extracting significant information from vast and complex data, thereby improving analytical efficiency [ 14 ]. This is crucial for handling the large and intricate data that arise from the open nature of internet financial platforms. In the research by Fan et al. [ 2 ], GMDH achieved accuracy second only to XGBoost. WOE is primarily employed to assess the relationship between features and the target, for example when examining defaults on internet financial platforms [ 7 , 36 ]. Appropriate data preprocessing and feature selection, together with effective algorithm integration, are pivotal steps in ensuring accurate risk assessment for internet financial platforms.
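Two of the preprocessing ideas above lend themselves to a short sketch: SMOTE-style interpolation between minority samples, and the WOE of a single feature bin. Both functions are simplified illustrations with invented data and names. The WOE sign convention (log of the good share over the bad share) is one common choice assumed here; some sources use the reverse.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(pi + gap * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

def weight_of_evidence(good, bad, total_good, total_bad):
    """WOE of one bin: ln(share of all goods in the bin / share of all bads in the bin)."""
    return math.log((good / total_good) / (bad / total_bad))

# Invented minority-class samples (e.g. defaulted loans) in two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stay inside the region the minority class already occupies. A WOE of zero means the bin holds goods and bads in the same proportion as the whole sample.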

4.5.5 Other methods.

Other methods mentioned in the literature include Named Entity Recognition (NER), a natural language processing technique that identifies names, specific locations, and other contextually significant content within text [ 23 ]. There is also the Fuzzy Analytic Hierarchy Process (FAHP), an analytical method for multi-criteria decision-making problems [ 28 ], as well as methods related to big data and the Internet of Things (IoT) [ 27 ]. While not the main focus here, these methods, particularly NER in conjunction with sentiment analysis, can effectively broaden the applicability of machine learning in internet financial risk identification.

4.6 Literature findings

Table 5 presents the titles and research findings of the selected literature, with the aim of summarizing the overall conclusions, current status, and trends of this issue and of indicating potential directions for future studies. Based on the preceding analysis and the findings listed in Table 5, three conclusions can be drawn. (1) Internet financial risk is a widely recognized and crucial latent issue. Machine learning, as a novel computational technology, whether through foundational algorithms or complex algorithm combinations, offers significant advancements in risk prevention compared to traditional credit scoring methods. (2) Different algorithms exhibit varying effectiveness in internet financial risk prediction. Overall, prediction accuracy, time efficiency, and robustness improve with algorithm optimization. (3) Technological advancements also bring about technological risks [ 28 ], emphasizing the need for continuous improvement in risk anticipation and prevention.

https://doi.org/10.1371/journal.pone.0300195.t005

Therefore, future research should continue to explore and expand various machine learning algorithms, particularly the application of deep learning algorithms in the field of internet financial risk. A comprehensive and sustainable risk management strategy is imperative for internet financial platform companies, investors, borrowers, regulatory authorities, and even traditional institutions like banks engaged in internet financial operations.

4.7 Evaluation criteria

Table 6 lists the evaluation metrics, with their formulas, used in the literature for assessing internet financial risks. These metrics gauge the strengths and weaknesses of machine learning and other methods. TP denotes the number of true positive predictions, FN the number of false negatives, FP the number of false positives, and TN the number of true negatives. The Receiver Operating Characteristic (ROC) curve is commonly used to evaluate binary classifiers; its vertical axis is the True Positive Rate (TPR) and its horizontal axis is the False Positive Rate (FPR). The dashed line represents the baseline, indicating the lowest standard. The closer the ROC curve is to the upper left corner, the higher the predictive accuracy of the model. Compared to other metrics, the ROC curve displays the strengths and weaknesses of different models more visually. The Area Under the Curve (AUC) is the area enclosed by the ROC curve and the x-axis; its maximum value is 1, and a larger AUC indicates that the model identifies targets more effectively [ 2 ].
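The construction of the ROC curve and the AUC described above can be sketched directly from labels and scores. This is a minimal illustration with invented data; it assumes binary labels (1 for positive), at least one sample of each class, and distinct scores, and it does not handle tied scores the way a production library would.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area between the ROC curve and the x-axis."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented example: three defaults (1) and three non-defaults (0) with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
example_auc = auc(roc_points(labels, scores))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9, or about 0.89. A model that ranks every positive above every negative reaches the maximum of 1.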

https://doi.org/10.1371/journal.pone.0300195.t006

Among the evaluation metrics, the most frequently used is accuracy, which is mentioned in 16 articles. Accuracy is the proportion of correctly classified samples, i.e., the number of instances where the predicted value matches the actual value divided by the total number of samples. It is the fundamental indicator of model performance. For imbalanced datasets, however, accuracy may not be reliable. Hence, although accuracy is widely used in the literature, it is not considered the sole measure of performance.

Next is the true positive rate (recall), which is used in 6 papers. Recall is the proportion of actual positive samples that the model correctly classifies as positive; it focuses on how many positive instances were missed, so the higher the recall, the stronger the model’s ability to identify positive samples. Precision is used in 5 papers. Precision is the proportion of predicted positive samples that are actually positive, indicating how much confidence can be placed in a positive prediction. A higher precision means that fewer negative samples are mistaken for positives. In practice, precision and recall tend to trade off against each other, and each serves its own purpose.

The false positive rate (FPR), used in 4 papers, measures the percentage of actual negative samples that the model incorrectly classifies as positive. A more comprehensive and objective evaluation is provided by the ROC curve, which plots the true positive rate against the false positive rate, and by the AUC, the area under that curve. They are often used to assess a model overall and can reduce interference from different test sets, providing a more objective measure of performance than the individual metrics above. The ROC and AUC metrics were used in 4 and 3 articles, respectively.
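The count-based metrics discussed so far (accuracy, recall, precision, and FPR) all follow from the four confusion-matrix counts. The sketch below computes them and applies them to an invented imbalanced case to show why accuracy alone can mislead; the function name and the numbers are assumptions for illustration.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts; None when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else None
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "recall": ratio(tp, tp + fn),       # true positive rate
        "precision": ratio(tp, tp + fp),
        "fpr": ratio(fp, fp + tn),          # false positive rate
    }

# Invented imbalanced case: 1,000 loans, 50 defaults,
# and a model that predicts "no default" for every loan.
always_negative = confusion_metrics(tp=0, fn=50, fp=0, tn=950)
```

Here accuracy is 0.95 even though recall is 0, because the model never flags a single default. Precision is undefined (returned as None) since no positive predictions were made.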

The F1 score is used in 3 articles. It is the harmonic mean of precision and recall and thus balances the two without favoring either. Because of the harmonic mean, the F1 score is small whenever either precision or recall is very small, regardless of how high the other value is. Using the F1 score as an evaluation metric therefore guards against such extreme cases and reflects the effectiveness of the model more comprehensively. Noor et al. [ 15 ] use Accuracy, Precision, Recall, F1-measure, False Positive Rate, and other metrics to assess and compare the Naïve Bayes, KNN, Decision Tree, Random Forest, and DLNN methods, enabling a more comprehensive analysis and evaluation.
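Written out, with P for precision and R for recall, the harmonic mean takes the form below. The numeric values in the worked example are invented purely to illustrate the behavior at extremes.

```latex
F_1 = \frac{2\,P\,R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN},
\qquad
\text{e.g. } P = 0.9,\; R = 0.1 \;\Rightarrow\; F_1 = \frac{2 \times 0.9 \times 0.1}{0.9 + 0.1} = 0.18 .
```

The arithmetic mean of the same two values would be 0.5, which shows how the harmonic mean penalizes a model that does well on one of the two measures only.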

These results indicate that accuracy and recall are the most intuitive indicators for assessing different methods and are highly regarded by researchers; they are the core metrics for comparing algorithms. The extensive and diverse set of metrics also provides analytical frameworks for assessing the applicability of different methods in the future. Consequently, however far machine learning algorithms evolve, these metrics and frameworks will continue to support an effective judgment system.

5. Findings and discussion

The development of internet financial platforms has gone through initial rapid expansion followed by a period of gradual regulation, eventually transitioning into a stable operating phase guided by long-term goals. Given the rapid advancement of the internet and the significant role of finance in societal development, recognizing, anticipating, supervising, and managing internet financial risks have become critical topics. Utilizing techniques like machine learning to address the challenges of open internet environments and the abundance of data in financial risk prevention is both timely and necessary. In this study, we employed a systematic approach to review the internet financial risk research conducted using machine learning methods up to the present. This paper listed the machine learning models and algorithms currently used in internet finance risk management, addressing the first question posed. Future research can continue to explore areas such as research methods, data analysis, evaluation metrics, and research scope.

First, our analysis shows that traditional machine learning algorithms, deep learning, neural networks, and other methods all have the potential to improve prediction accuracy beyond traditional credit indicator calculation methods. This addresses the first question raised in this paper and also bears on the second, the effectiveness of machine learning methods applied to internet finance risk. However, the accuracy of neural network models in predicting internet financial risks is contingent on factors such as model structure, sample data, and parameter settings [ 18 ]. The datasets used suffer from data imbalance [ 23 ], and most algorithms exhibit certain biases in their final accuracy [ 27 ]. Given the specific requirements of the financial industry, ongoing optimization is therefore necessary at both the algorithmic and data levels. This could involve new or updated algorithms more tailored to financial risks, especially algorithms suited to extreme value research in risk identification. Although studies have developed models well suited to fuzzy, heterogeneous, and incomplete data [ 17 ], analysis of extreme cases is currently lacking, even though financial risks demand attention to extreme situations [ 23 ]. On the data side, the nature of financial platforms makes it challenging to obtain timely, reliable, stable, and diverse data. Such data are nonetheless crucial for enhancing the effectiveness of algorithms and models, and research in this area remains limited.

Comparing different machine learning models and algorithms, the current state of affairs generally reflects that intelligent algorithms, represented by various deep learning algorithms, exhibit higher predictive accuracy compared to traditional machine learning models. They can address issues such as uncertainty, poor fault tolerance, and lack of self-learning capabilities in traditional warning models [ 8 ]. However, overall, scholars employ diverse platforms and datasets, and no study has comprehensively compared all mainstream machine learning models and algorithms. Consequently, there is no universally optimal model applicable to all platforms, addressing the third question posed in this paper.

Currently, there are multiple sources of risk in internet finance, including financial risk, legal risk, credit risk, market risk, and technological risk. Scholars primarily focus on credit risk [ 25 ] and technological risk [ 19 ]. Although some researchers have found that technological risk, ethical risk, and legal risk are the predominant factors affecting fintech risk [ 28 ], and even attempted to establish an internet finance risk control system based on deep learning algorithms [ 1 ], a considerable portion of literature still assesses machine learning algorithms from the perspective of credit risk. They evaluate whether single or multiple models can reduce expected losses [ 36 ], increase platform revenue [ 7 ], and obtain more reliable risk predictions [ 2 , 14 ].

Given the characteristics of the internet finance sector, which involve short timeframes and large quantities of data [ 2 ], adopting artificial intelligence risk warning and management models based on machine learning algorithms is inevitable. However, the literature reviewed predominantly focuses on data from within internet financial platforms or companies, without considering the external environment or third-party data [ 26 ], which limits the generalizability of prediction results. Few studies have concentrated on machine learning analysis of textual data, even though text generated in the operation of internet financial platforms, such as communication among users, could be put to better use. Developing more timely and effective sentiment analysis algorithms for textual data could improve risk identification strategies. From this perspective, the existing internet financial risk assessment metric system should be further refined to incorporate external environmental data, existing credit scoring factors, third-party data, and the evaluation metrics presented in this study [ 18 ]. Establishing a more comprehensive and rational internet financial risk assessment metric system is a potential direction for future research.

Through the analysis of evaluation metrics used in all the literature reviewed, it is evident that most studies choose accuracy, recall, and precision as metrics for evaluation and comparison of results, while fewer studies apply metrics such as ROC, AUC, F-score, and even more comprehensive and complex indicators. None of the literature covered the use of newer models and algorithms like Transformer. These observations indicate that although machine learning has been extensively applied in many fields, research in the domain of internet finance risk management remains limited. Therefore, we outline potential research directions in the "Future Research" section.

Finally, the majority of current applications and research on machine learning in the field of internet financial risk are conducted by Chinese scholars, using Chinese data and considering Chinese scenarios (11 articles). The scope and focus of research are therefore still quite limited. With the increasing adoption of financial technology, digital currencies, big data, the Internet of Things, artificial intelligence, cloud computing, and other technologies across various countries [ 28 ], a more extensive and diverse range of research scenarios should become mainstream in future research. This would contribute to a safer internet financial environment for individuals, businesses, platforms, local governments, and regulatory authorities.

6. Conclusion

With the gradual penetration of internet financial services in society and the maturation of machine learning algorithms, this study systematically introduces the research of machine learning models and algorithms in the field of internet financial risk. The focus is on exploring various algorithms and their characteristics used in previous studies. While, in general, machine learning enhances the accuracy of internet financial risk identification, scholars’ conclusions vary due to different approaches, and research is overly concentrated in China. Using permutations and combinations of different expressions related to "internet," "finance," and "risk" as keywords, comprehensive searches were conducted in both the Scopus and Web of Science databases, yielding 116 and 48 articles respectively. After filtering by language, document type, topic, merging, deduplication, and focusing on reading and screening content related to "machine learning," the final sample was narrowed down to 17 articles. This paper provides a comprehensive analysis of the sample literature from aspects such as annual trends, regional distribution, literature focus, fields of sciences, used models and algorithms, research findings, and evaluation metrics. Subsequently, the findings of this paper are discussed. Ultimately, it identifies research gaps and proposes future research directions in this field.

The findings of this paper reveal that, although the overall quantity is limited, research on this topic has tripled in the past three years, with two-thirds of the studies focusing on China. Scholars have employed a range of traditional algorithms, deep learning algorithms, and novel algorithms such as neural networks. The findings consistently show that, compared to traditional credit evaluation methods, machine learning models and algorithms can significantly enhance the accuracy of internet financial risk identification. However, there are noticeable differences among algorithms, and although conclusions differ across datasets, more recent algorithms generally yield higher accuracy. Scholars also evaluate the effectiveness of algorithms in terms of learning efficiency, recall (true positive rate), and other measures. Our study provides a comprehensive review of the current state of research on the application of machine learning to internet financial risk. We have identified limitations in the existing literature, such as restricted research methods, limited application of various algorithms, incomplete data analysis, exclusion of external environmental data, insufficiently optimized evaluation metrics, and over-concentration on China.

7. Future research

The uniqueness of this study lies in its exploration of this emerging research field, offering a comprehensive review of the application of machine learning algorithms in internet financial risk management. Overall, research on machine learning in the field of internet finance risk management is not extensive, and the findings are inconsistent. Thus, it provides innovative analytical outcomes and future research suggestions for this area. Firstly, due to scholars using different platforms, data, models, and algorithms, there is no universally accepted best model. Hence, industry practitioners can categorize discussions on different machine learning algorithms in internet finance risk management based on our research, exploring the most suitable machine learning algorithms for their own specific scenarios. Secondly, a more detailed analysis of the application considerations of deep learning models and algorithms in internet finance risk management practice is needed, starting with data acquisition to improve model efficiency. Thirdly, as mentioned earlier, the literature used in this study comes from two databases, WOS and Scopus. Expanding the literature sources while ensuring quality could be beneficial. Fourthly, future research could gradually expand its scope by merging traditional statistical analysis with machine learning methods for studying internet financial risks. Lastly, some listed companies have claimed that models based on the Transformer architecture have been applied in vertical fields such as financial risk and public security, utilizing encoders and decoders for multi-step prediction. This is also an important research direction for future identification of internet finance risks. Additionally, attention could be directed towards the impact of emerging technologies or business models like digital currencies, metaverse, and blockchain on internet financial risks.

Supporting information

S1 File. PRISMA checklist.

https://doi.org/10.1371/journal.pone.0300195.s001

S2 File. Data search result of Scopus.

https://doi.org/10.1371/journal.pone.0300195.s002

S3 File. Data search result of WOS.

https://doi.org/10.1371/journal.pone.0300195.s003

Acknowledgments

The authors would like to thank Science and Technology Finance Key Laboratory of Hebei Province for their funding support.



Financial applications of machine learning: A literature review

Goa Business School, Goa University, Goa 403206, India


Expert Systems with Applications: An International Journal

This systematic literature review analyses the recent advances of machine learning and deep learning in finance. The study considers six financial domains: stock markets, portfolio management, cryptocurrency, forex markets, financial crisis, bankruptcy and insolvency. We provide an overview of previously proposed techniques in these areas by examining 126 selected articles across 44 reputed journals. The main contributions of this review include an extensive examination of data characteristics and features used for model training, evaluation of validation approaches, and model performance addressing each financial problem. A systematic literature review methodology, PRISMA, is used to carry out this comprehensive review. The study also analyses bibliometric information to understand the current status of research focused on machine learning in finance. The study finally points out possible research directions which might lead to new inquiries in machine learning and finance.


Published in

Elsevier Ltd

In-Cooperation

Pergamon Press, Inc.

United States

Publication History

Published: 1 June 2023


https://dlnext.acm.org/doi/abs/10.1016/j.eswa.2023.119640


Machine Learning: Algorithms, Real-World Applications and Research Directions

Review Article
Published: 22 March 2021
Volume 2, article number 160 (2021)


Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)


In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms exist, such as supervised, unsupervised, semi-supervised, and reinforcement learning. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.


Introduction

We live in the age of data, where everything around us is connected to a data source, and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world has a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. The data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”, and their volume is increasing day by day. Insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, to build a data-driven automated and intelligent cybersecurity system, the relevant cybersecurity data can be used [ 105 ]; to build personalized context-aware smart mobile applications, the relevant mobile data can be used [ 103 ], and so on. Thus, data management tools and techniques capable of extracting insights or useful knowledge from the data in a timely and intelligent way are urgently needed, as real-world applications are built on them.

Fig. 1 The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically without being specifically programmed, and it is generally regarded as one of the most popular recent technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types, namely supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”. The popularity of these approaches to learning is increasing day by day, as shown in Fig. 1 , based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score within the range of \(0 \; (minimum)\) to \(100 \; (maximum)\) is shown on the y-axis. According to Fig. 1 , the popularity values for these learning types were low in 2015 and have been increasing since. These statistics motivate us to study machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and the efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. Besides, deep learning, which originated from the artificial neural network and is known as part of a wider family of machine learning approaches, can be used to intelligently analyze data [ 96 ]. Thus, selecting a proper learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms in a similar category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “Applications of Machine Learning”.

Based on the importance and potential of machine learning for analyzing the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques, and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for people in academia and industry who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. The subsequent section briefly discusses and explains different machine learning algorithms, after which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered as the key to construct a machine learning model or data-driven real-world systems [ 103 , 105 ]. Data can be of various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. Besides, the “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

Structured: Structured data have a well-defined structure and conform to a data model following a standard order; they are highly organized and easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemes such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.

Unstructured: On the other hand, there is no pre-defined format or organization for unstructured data, making it much more difficult to capture, process, and analyze, mostly containing text and multimedia material. For example, sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered as unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.

Metadata: It is not the normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or even document something relative to an organization’s data properties, whereas metadata describe the relevant data information, giving it more significance for data users. A basic example of a document’s metadata might be the author, file size, date the document was generated, keywords that describe the document, etc.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], Bot-IoT [ 59 ], etc., smartphone datasets such as phone call logs [ 84 , 101 ], SMS logs [ 29 ], mobile application usage logs [ 117 , 137 ], mobile phone notification logs [ 73 ], etc., IoT data [ 16 , 57 , 62 ], agriculture and e-commerce data [ 120 , 138 ], health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], COVID-19 [ 43 , 74 ], etc., and many more in various application domains. The data can be of the different types discussed above, and may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2 Various types of machine learning techniques

Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data can be rare in several contexts while unlabeled data are plentiful, and this is where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than that produced by a model using the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.

Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving tasks, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier, and the target outcome. In Table 1 , we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3 , where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Fig. 3 A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning and refers to a predictive modeling problem in which a class label is predicted for a given example [ 41 ]. Mathematically, it learns a mapping function ( f ) from input variables ( X ) to output variables ( Y ), which serve as targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with the classes “spam” and “not spam”, in email service providers can be a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. Unlike binary classification tasks, multiclass classification does not have the principle of normal and abnormal outcomes. Instead, each example is classified as belonging to one of a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD [ 119 ] dataset can be a multiclass classification task, where the attack categories are classified into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Remote to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, a Google News article can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification involves advanced machine learning algorithms that support predicting multiple mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters, and it does so quickly [ 82 ]. However, its performance may suffer because of its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
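As a minimal sketch of the idea, the following pure-Python code implements a multinomial-style naive Bayes text classifier with add-one (Laplace) smoothing. The function names `train_nb` and `predict_nb`, the smoothing choice, and the four toy documents are illustrative assumptions, not something taken from the reviewed literature.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the class with the highest log posterior for a list of words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log prior P(class)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # log P(word | class) with add-one smoothing; each word is
            # treated as independent of the others given the class
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, for illustration only.
docs = [
    ["win", "money", "now"],
    ["cheap", "money", "offer"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["money", "offer"]))
print(predict_nb(model, ["meeting", "notes"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is the usual design choice for this classifier.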

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class conditional densities to data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that lowers the complexity of the model or reduces the resulting model’s computational costs. The standard LDA model usually fits a Gaussian density to each class, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is Logistic Regression (LR) [ 64 ]. Logistic regression typically uses a logistic function to estimate the probabilities, which is also referred to as the mathematically defined sigmoid function in Eq. 1 :

\(g(z) = \frac{1}{1 + \exp(-z)}\)  (1)

It works well when the dataset can be separated linearly, but it can overfit high-dimensional datasets. The regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of Logistic Regression. Although it models a continuous probability, it is used mainly for classification.
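To make the role of the logistic function concrete, here is a minimal sketch assuming the standard form g(z) = 1 / (1 + exp(-z)). The weights, bias, and 0.5 decision threshold are hypothetical values chosen for illustration, not fitted parameters.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for negative z, exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Estimated probability of the positive class for one feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict(weights, bias, x, threshold=0.5):
    """Turn the probability into a class label using a decision threshold."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical parameters: z = 0.5 + 2.0*1.0 - 1.0*3.0 = -0.5
weights, bias = [2.0, -1.0], 0.5
print(predict_proba(weights, bias, [1.0, 3.0]))
print(predict(weights, bias, [1.0, 3.0]))
```

The linear score z is where the linearity assumption enters: the model can only separate classes with a hyperplane in the input features.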

K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning method, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n-dimensional space. KNN classifies new data points by comparing them with the stored instances using a similarity measure (e.g., the Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for both classification and regression.
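The majority-vote procedure described above can be sketched in a few lines of pure Python. The helper names and the six labeled points are illustrative assumptions; a practical implementation would typically use a spatial index rather than sorting all training points for every query.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a simple majority vote of its k nearest neighbors."""
    # "Lazy" learning: no model is built, the stored instances are scanned at query time.
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two well-separated groups.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))
print(knn_predict(train_X, train_y, (6.4, 6.4), k=3))
```

Using an odd k with two classes avoids tied votes, which is one simple way to approach the choice of k mentioned above.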

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane that has the greatest distance from the nearest training data points of any class achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently depending on the mathematical function, known as the kernel, that it uses. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in the SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.

Decision tree (DT): Decision tree (DT) [ 88 ] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [ 82 ]. ID3 [ 87 ], C4.5 [ 88 ], and CART [ 20 ] are well-known DT algorithms. Moreover, the recently proposed BehavDT [ 100 ] and IntrudTree [ 97 ] by Sarker et al. are effective in the relevant application domains, namely user behavior analytics and cybersecurity analytics, respectively. DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4 . An instance is classified by checking the attribute defined by a node, starting at the root node of the tree, and then moving down the tree branch corresponding to the attribute value. For splitting, the most popular criteria are “entropy” for the information gain and “gini” for the Gini impurity, which can be expressed mathematically as [ 82 ]:

\(H(x) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)\)  (2)

\(Gini(E) = 1 - \sum_{i=1}^{c} p_i^2\)  (3)
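The two splitting criteria can be computed directly from class proportions. The sketch below assumes the standard definitions, Gini impurity 1 - sum(p_i^2) and entropy -sum(p_i log2 p_i); the function names and the tiny label lists are illustrative only.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of examples belonging to each class present in `labels`."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
print(gini(parent), entropy(parent))
print(information_gain(parent, perfect_split))
```

A tree learner evaluates candidate splits with one of these measures and keeps the split that yields the purest child nodes.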

Fig. 4 An example of a decision tree structure

Fig. 5 An example of a random forest structure considering multiple decision trees

Random forest (RF): A random forest classifier [ 19 ] is well known as an ensemble classification technique that is used in the field of machine learning and data science in various application areas. This method uses “parallel ensembling”, which fits several decision tree classifiers in parallel, as shown in Fig. 5 , on different sub-samples of the data set and uses majority voting or averaging for the final result. It thus reduces the over-fitting problem and improves prediction accuracy [ 82 ]. Therefore, an RF learning model with multiple decision trees is typically more accurate than a model based on a single decision tree [ 106 ]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [ 18 ] and random feature selection [ 11 ]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
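The combination of bagging, random feature selection, and majority voting can be illustrated with a deliberately simplified sketch. As a stated assumption, each "tree" here is only a one-level decision stump on a single randomly chosen feature, so this is not a full random forest; all names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature):
    """One-level tree: pick the midpoint threshold with the fewest training errors."""
    values = sorted({row[feature] for row in X})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # only one distinct value: always predict the majority label
        m = majority(y)
        return (feature, 0.0, m, m)
    return (feature,) + best[1:]

def stump_predict(stump, row):
    feature, t, l_lab, r_lab = stump
    return l_lab if row[feature] <= t else r_lab

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feature = rng.randrange(d)                  # random feature selection
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def forest_predict(forest, row):
    """Majority vote over the individual stump predictions."""
    return majority([stump_predict(s, row) for s in forest])

# Hypothetical toy data: two well-separated groups.
X = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
     (8.0, 9.0), (9.0, 8.0), (8.0, 8.0), (9.0, 9.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(forest_predict(forest, (1.5, 1.5)), forest_predict(forest, (8.5, 8.5)))
```

Because every stump sees a different bootstrap sample and feature, their individual errors tend to differ, and the vote averages them out.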

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve weak classifiers by learning from their errors. It was developed by Yoav Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful, highly accurate classifier by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier because it significantly improves the efficiency of the classifier, but in some instances it can overfit. AdaBoost is best used to boost the performance of decision trees, its usual base estimator [ 82 ], on binary classification problems; however, it is sensitive to noisy data and outliers.
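How the algorithm "learns from errors" can be seen in a single reweighting round. The sketch below assumes the common discrete AdaBoost formulation with labels in {-1, +1}, where the classifier weight is alpha = 0.5 * ln((1 - err) / err); the function name and the five-example data are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: return the classifier weight and updated example weights."""
    # Weighted error of the current weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified examples (t * p == -1) gain weight, correct ones lose weight.
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Five examples with uniform starting weights; the weak classifier gets the last one wrong.
weights = [0.2] * 5
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right. This is also why noisy labels and outliers can dominate later rounds.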

Extreme gradient boosting (XGBoost): Gradient Boosting, like Random Forests [ 19 ] above, is an ensemble learning algorithm that generates a final model based on a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It uses second-order gradients of the loss function to minimize the loss and applies advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting and improves model generalization and performance. XGBoost is fast and can handle large-sized datasets well.

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with appropriate smoothness properties, where the word ‘stochastic’ refers to randomness. Rather than computing the gradient over the entire dataset, SGD estimates it from a randomly selected example (or a small subset). This reduces the computational burden, particularly in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function: it measures how much the output changes in response to changes in the inputs. Mathematically, the gradient is the vector of partial derivatives of the objective function with respect to its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i^\mathrm{th}\) training example; then Eq. ( 4 ) represents the stochastic gradient descent weight update at the \(j^\mathrm{th}\) iteration:

\(w_j := w_j - \alpha \frac{\partial J_i}{\partial w_j}\)  (4)

In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
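The per-example update can be sketched for a linear model. As assumptions for illustration, the per-example cost is taken to be the squared error J_i = 0.5 * (prediction_i - y_i)^2, so its partial derivative with respect to each weight is (prediction_i - y_i) * x_ij; the learning rate, epoch count, and toy data are hypothetical.

```python
import random

def sgd_linear_regression(X, y, alpha=0.05, epochs=500, seed=0):
    """Fit weights w and bias b with one gradient step per training example."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            # Residual of example i under the current parameters.
            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            # Update rule: w := w - alpha * dJ_i/dw, with dJ_i/dw_j = err * x_ij.
            w = [wj - alpha * err * xj for wj, xj in zip(w, X[i])]
            b -= alpha * err
    return w, b

# Noise-free toy data generated from y = 2x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w, b = sgd_linear_regression(X, y)
print(w, b)
```

Each step uses a single example, so it is cheap but noisy. With unscaled features of very different magnitude the same learning rate can be too large for one weight and too small for another, which is the sensitivity to feature scaling noted above.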

Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret; the ability to handle high-dimensional data; simplicity and speed; good accuracy; and the capability to produce rules that are clear and understandable to humans [ 127 , 128 ]. The decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system including the entities and their relationships.

Fig. 6 Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Regression Analysis

Regression analysis includes several machine learning methods that predict a continuous result variable ( y ) based on the value of one or more predictor variables ( x ) [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression predicts a continuous quantity. Figure 6 shows an example of how classification differs from regression. Some overlaps are often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression establishes a relationship between the dependent variable ( Y ) and one or more independent variables ( X ) using the best-fit straight line, also known as the regression line [ 41 ]. It is defined by the following equations:

\[ y = a + bx + e \tag{5} \]

\[ y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e \tag{6} \]

where a is the intercept, b is the slope of the line, and e is the error term. These equations can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [ 41 ] defined in Eq. 6 , whereas simple linear regression has only one independent variable, defined in Eq. 5 .
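As an illustration of the simple linear model y = a + bx + e, the following is a minimal sketch of an ordinary least-squares fit in plain Python. The function name `fit_simple_linear` and the toy data are assumptions made for this example only; they do not come from the reviewed literature.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_simple_linear(xs, ys)
prediction = a + b * 5.0  # predicted y for a new x = 5
```

With real, noisy data the recovered a and b would only approximate the underlying relationship, and the residuals would play the role of the error term e.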

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear, but is modeled as an \(n^\mathrm{th}\) degree polynomial in x [ 82 ]. The equation for polynomial regression is derived from the linear regression equation (which is polynomial regression of degree 1), and is defined as below:

\[ y = b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n + e \]

Here, y is the predicted/target output, \(b_0, b_1,... b_n\) are the regression coefficients, x is an independent/input variable, and e is the error term as in linear regression. In simple words, if the data do not follow a straight line but instead follow an \(n^\mathrm{th}\) degree polynomial, then polynomial regression is used to obtain the desired output.

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques that are typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L 1 regularization technique [ 82 ], a form of shrinkage that penalizes the “absolute value of magnitude of coefficients” ( L 1 penalty). As a result, LASSO can shrink some coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L 2 regularization [ 82 ], which penalizes the “squared magnitude of coefficients” ( L 2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient to zero, and therefore yields a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, and ridge regression is useful when a data set has “multicollinearity”, i.e., predictors that are correlated with other predictors.
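The different effect of the two penalties can be seen in a simplified special case. The sketch below assumes orthonormal predictors and the objectives (1/2)·RSS + λ·Σ|b| for LASSO and (1/2)·RSS + (λ/2)·Σb² for ridge; under these assumptions each penalized coefficient is a simple function of the corresponding least-squares coefficient. The function names are hypothetical, and general (correlated) predictors need an iterative solver instead.

```python
def ridge_shrink(b_ols, lam):
    """Ridge under the stated assumptions: proportional shrinkage, never exactly zero."""
    return b_ols / (1.0 + lam)


def lasso_shrink(b_ols, lam):
    """LASSO under the stated assumptions: soft-thresholding, small coefficients become zero."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    return magnitude if b_ols >= 0 else -magnitude


# Hypothetical least-squares coefficients for three features.
ols_coefficients = [2.0, -0.3, 0.1]
lam = 0.5
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
# Features whose LASSO coefficient is non-zero are the ones "selected".
selected = [i for i, b in enumerate(lasso) if b != 0.0]
```

In this toy case only the first feature survives the LASSO penalty, while ridge keeps all three features with smaller weights.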

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same group, called a cluster, are in some sense more similar to each other than to objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. The data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target applications. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-Medoids [ 80 ], CLARA [ 55 ], etc.

Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density, isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) Agglomerative, a “bottom-up” approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) Divisive, a “top-down” approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig 7 . Our earlier proposed BOTS technique, Sarker et al. [ 102 ], is an example of a hierarchical, in particular bottom-up, clustering algorithm.

Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

A graphical interpretation of the widely-used hierarchical clustering (Bottom-up and top-down) technique

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when the clusters in the data are well separated from each other. In this algorithm, the data points are allocated to clusters in such a way that the sum of the squared distances between the data points and their cluster centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest centroid, while keeping the within-cluster distances as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent between runs. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
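The assign-then-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the optional `init` argument (explicit starting centroids, used here so the run is reproducible), and the toy points are all assumptions of this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, clusters)."""
    # Random initial centroids unless explicit ones are supplied.
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

Calling `kmeans(points, k=2)` without `init` uses a seeded random start instead, which is where the run-to-run inconsistency mentioned in the text comes from when the seed changes.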

Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of application domains. Mean Shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering which is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique that separates high-density clusters from low-density regions when building a model. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. DBSCAN, unlike k-means, does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers.
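The idea that a point belongs to a cluster if it is close to many points of that cluster can be sketched as follows. This is a simplified plain-Python version with a quadratic-time neighbor search; the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size, counting the point itself) follow common usage but are assumptions of this example, as is the toy data.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i):
        # All points (including i itself) within distance eps of point i.
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Note that the number of clusters is never passed in; it emerges from `eps` and `min_pts`, and the isolated point is reported as noise rather than forced into a cluster.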

GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction which is an unsupervised learning technique, is important because it leads to better human interpretations, lower computational costs, and avoids overfitting and redundancy by simplifying models. Both the process of feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the selection and extraction of features is that the “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand new ones [ 98 ]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of the original features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating the irrelevant or less important features and allows for faster training of machine learning algorithms. A well-chosen subset of features in a problem domain can reduce overfitting by simplifying and generalizing the model, and can also increase the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning, as it greatly affects the effectiveness and efficiency of the target machine learning model. The chi-squared test, the analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new ones from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space by creating brand-new components from the existing features in a dataset [ 98 ].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. It eliminates all zero-variance characteristics by default, i.e., characteristics that have the same value in all samples. This feature selection algorithm looks only at the ( X ) features, not the ( y ) outputs needed, and can, therefore, be used for unsupervised learning.
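A minimal sketch of the variance-threshold filter is shown below. The function name, the use of the population variance, and the small example matrix are assumptions made for illustration; note that only the feature matrix is inspected, never the target.

```python
def variance_threshold(rows, threshold=0.0):
    """Return the indices of the columns whose variance exceeds the threshold."""
    n = len(rows)
    kept = []
    for j, column in enumerate(zip(*rows)):
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n  # population variance
        if variance > threshold:
            kept.append(j)
    return kept


# Hypothetical data: column 0 is constant, column 2 barely varies.
rows = [
    [1.0, 0.0, 5.0],
    [1.0, 1.0, 5.1],
    [1.0, 0.0, 4.9],
    [1.0, 1.0, 5.0],
]
kept_default = variance_threshold(rows)        # drops only the constant column
kept_strict = variance_threshold(rows, 0.01)   # also drops the near-constant column
```

Because variance depends on the scale of each feature, a single threshold is only meaningful when the features are measured on comparable scales.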

Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between the features in a dataset. The resulting value lies in \([-1, 1]\) , where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If X and Y represent two random variables, with sample means \(\bar{X}\) and \(\bar{Y}\) over n observations, then the correlation coefficient between X and Y is defined as [ 41 ]

\[ r(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]
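The sample Pearson correlation coefficient, i.e., the covariance of the two variables divided by the product of their standard deviations, can be computed directly, as in the following sketch. The function name `pearson` and the example values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


feature = [1.0, 2.0, 3.0, 4.0]
r_up = pearson(feature, [2.0, 4.0, 6.0, 8.0])    # perfectly increasing relation
r_down = pearson(feature, [8.0, 6.0, 4.0, 2.0])  # perfectly decreasing relation
```

For feature selection, features whose absolute correlation with the target is close to zero are candidates for removal, keeping in mind that the coefficient only captures linear relationships and is undefined for a constant feature.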

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and that the variables are normally distributed. To statistically test the equality of means, the ANOVA method utilizes F tests. For feature selection, the resulting ‘ANOVA F value’ [ 82 ] of this test can be used to omit features that are independent of the target variable.

Chi square: The chi-square \({\chi }^2\) [ 82 ] statistic measures the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) is commonly used for testing relationships between categorical variables. If \(O_i\) represents the observed value and \(E_i\) represents the expected value, then

\[ {\chi }^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \]
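As a small worked illustration, the chi-square statistic sums (O − E)² / E over all cells, where O is an observed count and E the count expected under independence. The sketch below does this for a contingency table of two categorical variables, deriving the expected counts from the row and column totals. The function name and the example tables are hypothetical.

```python
def chi_square_independence(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: feature value (yes / no); columns: class label (positive / negative).
related = chi_square_independence([[30, 10], [10, 30]])
unrelated = chi_square_independence([[20, 20], [20, 20]])
```

A larger statistic indicates a stronger departure from independence, so features with a high value relative to the target are the ones retained in chi-square feature selection.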

Recursive feature elimination (RFE): Recursive Feature Elimination (RFE) is a brute force approach to feature selection. RFE [ 82 ] repeatedly fits the model and removes the weakest feature until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. RFE aims to remove dependencies and collinearity in the model by recursively removing a small number of features per iteration.

Model-based selection: To reduce the dimensionality of the data, linear models penalized with the L 1 regularization can be used. Least absolute shrinkage and selection operator (Lasso) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]. The corresponding features can therefore be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. Extra Trees Classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importance, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [ 48 , 81 ]. Figure 8 shows an example of the effect of PCA on spaces of various dimensions, where Fig. 8 a shows the original features in 3D space, and Fig. 8 b shows the data projected onto a 2D plane spanned by the principal components PC1 and PC2, and onto a 1D line along the principal component PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of the datasets and helps to build an effective machine learning model [ 98 ]. Technically, PCA identifies the eigenvectors with the highest eigenvalues of the covariance matrix and then uses those to project the data into a new subspace of equal or fewer dimensions [ 82 ].
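The eigenvector view of PCA can be illustrated with a small plain-Python sketch that finds only the first principal component. It builds the sample covariance matrix and applies power iteration, which is one simple way to approximate the eigenvector with the largest eigenvalue; it assumes the starting vector is not orthogonal to that eigenvector and that the data are not all identical. The function name and toy data are hypothetical, and numerical libraries normally use an eigendecomposition or SVD instead.

```python
import math


def top_principal_component(rows, iters=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center the data so each feature has zero mean.
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component: 2D data become 1D scores.
    scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
    return v, scores


# Hypothetical data lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = top_principal_component(rows)
```

Here the two correlated features collapse onto a single component without losing information, which is the extreme case of the dimensionality reduction described above.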

An example of a principal component analysis (PCA) and created principal components PC1 and PC2 in different dimension space

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, expressed as “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of things within or across transactions. A common way of measuring the usefulness of association rules is to use the parameters ‘support’ and ‘confidence’, which were introduced in [ 7 ].
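Under the usual definitions, the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule IF A THEN B is support(A and B) divided by support(A). The following sketch computes both for the laptop and anti-virus example; the transactions and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop"},
    {"mouse"},
]
rule_support = support(transactions, {"laptop", "antivirus"})
rule_confidence = confidence(transactions, {"laptop"}, {"antivirus"})
```

In this toy data the rule "IF laptop THEN antivirus" holds in half of all baskets and in two of the three baskets that contain a laptop.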

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it can produce unexpected results as it combines both the support and confidence. Apriori [ 8 ] is a widely applicable technique for mining association rules.
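The bottom-up candidate generation and the pruning property quoted above can be sketched as follows. This is a compact illustration of the frequent-itemset phase only (rule generation is omitted), with hypothetical names and data; it is not the optimized algorithm of the original publication.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset (frozenset): support} found level by level."""
    n = len(transactions)

    def frequency(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if frequency(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = frequency(itemset)
        k += 1
        previous = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, k - 1))
                 and frequency(c) >= min_support]
    return frequent


transactions = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop"},
    {"mouse"},
]
frequent_itemsets = apriori(transactions, min_support=0.5)
```

The prune step is what keeps the search space small: a candidate is never counted against the data if any of its subsets already failed the support threshold.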

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique is Frequent Pattern Growth, known as FP-Growth, which is based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ]. The key difference from Apriori is that, while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets, whereas the FP-growth algorithm [ 42 ] avoids candidate generation and instead builds a tree using a ‘divide and conquer’ strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [ 133 ]. Moreover, the FP-tree may not fit into memory for massive data sets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining) proposed by Das et al. [ 26 ], but it faces a related FP-tree issue [ 133 ].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper by Sarker et al. [ 104 ], for discovering interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies the redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. This algorithm first constructs an association generation tree (AGT), a top-down approach, and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment, where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using input from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning (RL) is defined as a Markov Decision Process (MDP) [ 86 ], i.e., all about sequentially making decisions. An RL problem typically includes four elements such as Agent, Environment, Rewards, and Policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero and AlphaGo [ 113 ] are examples of model-based approaches. On the other hand, a model-free approach does not use the distribution of the transition probability and the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [ 52 ]. The key difference between the two is therefore whether the agent relies on a model of the environment, i.e., the transition probabilities and reward function, which model-based RL requires and model-free RL does not. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and generating draws from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.

Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of actions, which tells an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm estimates the maximum expected reward for a given action in a given state.
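A minimal tabular sketch of the standard Q-learning update, Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, ·) − Q(s, a)], is given below. The environment is a hypothetical four-state corridor with a reward of 1 for reaching the right-most state. To keep the example deterministic, every state-action pair is visited in a fixed sweep rather than through trial-and-error exploration (e.g., epsilon-greedy), which is a simplification made for this illustration; the update itself still uses only sampled transitions, not a model.

```python
def q_learning_chain(n_states=4, gamma=0.9, alpha=0.5, sweeps=500):
    """Learn Q-values for a corridor where the last state is the rewarded goal."""
    goal = n_states - 1
    # q[state][action], with action 0 = move left and action 1 = move right.
    q = [[0.0, 0.0] for _ in range(n_states)]

    def step(state, action):
        """Hypothetical environment: returns (next_state, reward)."""
        nxt = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if nxt == goal else 0.0
        return nxt, reward

    for _ in range(sweeps):
        for state in range(goal):  # the goal is terminal, so its values stay 0
            for action in (0, 1):
                nxt, reward = step(state, action)
                # Move the estimate toward reward + discounted best next value.
                target = reward + gamma * max(q[nxt])
                q[state][action] += alpha * (target - q[state][action])
    return q


q = q_learning_chain()
# Greedy policy: pick the action with the highest learned Q-value in each state.
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(3)]
```

The learned values decay by the discount factor with each step away from the goal, so the greedy policy moves right in every non-terminal state.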

Deep Q-learning: The basic working step in Deep Q-Learning [ 52 ] is that the initial state is fed into the neural network, which returns the Q-value of all possible actions as an output. Q-learning works well when the setting is reasonably simple. However, when the number of states and actions becomes large, deep learning can be used as a function approximator.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural networks (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly learning from large datasets [ 105 , 129 ]. Figure 9 shows a general performance of deep learning over machine learning considering the increasing amount of data. However, it may vary depending on the data characteristics and experimental set up.

Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN, or ConvNet), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

A structure of an artificial neural network modeling with multiple processing layers

MLP: The base architecture of deep learning, which is also known as the feed-forward artificial neural network, is called a multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10 . Each node in one layer connects to each node in the following layer at a certain weight. MLP utilizes the “Backpropagation” technique [ 41 ], the most “fundamental building block” in a neural network, to adjust the weight values internally while building the model. MLP is sensitive to scaling features and allows a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can result in a computationally costly model.
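The layered, fully connected structure can be illustrated with a forward pass through a tiny network. In the sketch below the weights are chosen by hand so that a two-unit hidden layer with a ReLU activation computes XOR, a function that no single linear layer can represent; in practice the weights would be learned with backpropagation, which is not shown here. All names and values are assumptions of this example.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every input feeds every output node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_xor(x1, x2):
    """Forward pass: input layer -> hidden layer (2 nodes) -> output layer (1 node)."""
    hidden = dense([x1, x2],
                   weights=[[1.0, 1.0], [1.0, 1.0]],  # hand-set, not learned
                   biases=[0.0, -1.0],
                   activation=relu)
    (output,) = dense(hidden,
                      weights=[[1.0, -2.0]],
                      biases=[0.0],
                      activation=lambda z: z)  # linear output node
    return output


outputs = [mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each additional hidden layer or node adds weights that must be tuned, which is why the hyperparameters mentioned above directly affect the computational cost.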

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN, consisting of convolutional layers, pooling layers, as well as fully connected layers, as shown in Fig. 11 . As it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in several areas such as image and video recognition, image processing and classification, medical image analysis, natural language processing, etc. Although CNN has a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence CNN is considered to be more powerful than conventional ANN. A number of advanced deep learning models based on CNN can be used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], ResNet [ 45 ], etc.

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. LSTM has feedback links, unlike normal feed-forward neural networks. LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series data, which differentiates it from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time, sentence, etc., and commonly applied in the area of time-series analysis, natural language processing, speech recognition, etc.

An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist in the area for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent the high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique that is widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual data input. Transfer learning, which is typically the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, and dimensionality reduction, association rule learning, reinforcement learning, or deep learning techniques, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict an unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms, such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ], are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.
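As a minimal sketch of how a relationship between an explanatory variable and a predicted variable can be learned from previous events, the following fits an ordinary least-squares line and uses it to predict an unseen case. The data values and the names `fit_line` and `predict` are illustrative assumptions, not taken from the cited works.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new value of the explanatory variable."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: weekly promotions run (explanatory) vs. units sold (predicted).
promotions = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(promotions, units_sold)
forecast = predict(model, 6.0)  # outcome for a case not seen in the history
```

In practice the decision trees, support vector machines, or neural networks named above would replace the straight line, but the workflow of fitting on past events and predicting an unknown outcome is the same.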

Cybersecurity and threat intelligence: Cybersecurity, typically the practice of protecting networks, systems, hardware, and data from digital attacks, is one of the most essential areas of Industry 4.0 [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where bad neighborhoods are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
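The use of clustering to surface cyber-anomalies can be sketched as follows. This is only an illustration under stated assumptions: the feature pairs (requests per minute, failed logins), the starting centroids, and the "mean plus two standard deviations" distance threshold are all hypothetical choices, and plain k-means stands in for whichever clustering method a real system would use.

```python
import math
from statistics import mean, pstdev


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def flag_anomalies(points, centroids, k_sigma=2.0):
    """Flag points unusually far from their nearest cluster centre."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    threshold = mean(dists) + k_sigma * pstdev(dists)
    return [p for p, d in zip(points, dists) if d > threshold]


# Hypothetical sessions: (requests per minute, failed logins).
sessions = [(10, 1), (12, 0), (11, 2), (9, 1),    # interactive users
            (50, 1), (52, 2), (49, 0), (51, 1),   # batch clients
            (32, 40)]                             # many failed logins

centroids = kmeans(sessions, centroids=[(10, 1), (50, 1)])
anomalies = flag_anomalies(sessions, centroids)
```

Here the session with many failed logins sits far from both cluster centres and is the only one flagged; the threshold rule is a design choice that would need tuning on real traffic.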

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ], which turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart home, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some of the tasks that can be solved using machine learning techniques according to people’s current needs.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO\(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on travel history and trends of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and in recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. For the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate their mortality rate, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, to predict COVID-19 outbreaks, and for disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread, and notify those regions so that they can make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, and can support intelligent clinical decision-making in the domain of healthcare.

E-commerce and product recommendations: Product recommendation is one of the most well-known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing lies in the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to retain existing customers while attracting new ones.
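One very simple way to turn purchasing histories into suggestions is to count which items co-occur with what a customer already has. The sketch below assumes hypothetical purchase histories and a hypothetical `recommend` helper; it illustrates co-occurrence counting only and is not the method used by any particular retailer.

```python
from collections import Counter


def recommend(histories, basket, top_n=2):
    """Rank items by how often they co-occur with items already in the basket."""
    scores = Counter()
    for history in histories:
        overlap = len(basket & history)
        if overlap:
            # Credit every item the customer does not yet have.
            for item in history - basket:
                scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase histories, one set of items per past customer.
histories = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"laptop", "bag", "dock"},
    {"phone", "case"},
    {"laptop", "mouse"},
]

suggestions = recommend(histories, basket={"laptop"})
```

With these histories the mouse co-occurs with the laptop three times, the bag twice, and the dock once, so the mouse and the bag are suggested first. Production recommenders typically use richer signals, such as the browsing trends and click-through rates mentioned above.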

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text from blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.
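Polarity classification can be illustrated with a small multinomial naive Bayes classifier using add-one (Laplace) smoothing. This is a sketch under simplifying assumptions: the training sentences are invented, only two polarity classes are modeled (no "neutral"), tokenization is a plain whitespace split, and words never seen in training are ignored.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the class with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled reviews.
reviews = [
    ("great product love it", "positive"),
    ("excellent service very happy", "positive"),
    ("terrible product hate it", "negative"),
    ("poor service very sad", "negative"),
]

model = train_naive_bayes(reviews)
label_a = classify(model, "love the excellent product")
label_b = classify(model, "terrible and sad")
```

The first review is scored as positive and the second as negative because their known words are more frequent in the respective class. A real system would need far more training data and a finer-grained label set to capture the more intense emotions listed above.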

Image, speech and pattern recognition: Image recognition [ 36 ], which identifies an object in a digital image, is a well-known and widespread example of machine learning in the real world. Common examples include labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media such as Facebook. Speech recognition [ 23 ], which typically uses sound and linguistic models, is also very popular, e.g., in Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques such as classification, feature selection, clustering, or sequence labeling methods are used in the area.

Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to improve their decisions to adopt sustainable agriculture practices by utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable to building various personalized, data-driven, context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making, that intelligently assist mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking into account time-series data [ 102 ]. Classification methods can be used to predict future events in various contexts [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.
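To make the idea of a context-aware association rule concrete, the sketch below computes the standard support and confidence measures for rules of the form "context implies call behavior". The call log, the context labels, and the helper names `support` and `confidence` are hypothetical and only illustrate the measures; they are not the rule-mining method of the cited work.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical phone call log: the context of each incoming call plus the user's action.
call_log = [
    {"meeting", "weekday", "reject"},
    {"meeting", "weekday", "reject"},
    {"meeting", "weekend", "accept"},
    {"home", "weekend", "accept"},
    {"home", "weekday", "accept"},
]

# Rule 1: {meeting} -> {reject}
rule1_support = support(call_log, {"meeting", "reject"})
rule1_confidence = confidence(call_log, {"meeting"}, {"reject"})

# Rule 2: {meeting, weekday} -> {reject}, a more specific context
rule2_confidence = confidence(call_log, {"meeting", "weekday"}, {"reject"})
```

In this toy log, adding the "weekday" context raises the confidence of the reject rule from two thirds to one, which is the kind of refinement a context-aware rule learner looks for.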

In addition to these application areas, machine learning-based models can also be applied to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data, and on the performance of the learning algorithms. Collecting data in the relevant domain, such as cybersecurity, IoT, healthcare, and agriculture discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data with very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing those data are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The quality of the data and its availability for training strongly affect the machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms”, and consequently the resultant model. Thus, accurately cleaning and pre-processing diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.
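Two of the data-quality problems named above, missing values and outliers, can be illustrated with a short sketch. It assumes hypothetical sensor readings in which `None` marks a missing value, fills gaps with the median of the observed values, and flags outliers with the common 1.5 × IQR rule. Both are conventional default choices rather than recommendations from the source.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical temperature readings with two gaps and one faulty spike.
readings = [21.0, 22.5, None, 23.0, 21.5, 95.0, None, 22.0]

cleaned = impute_median(readings)
outliers = iqr_outliers(cleaned)
```

The median is used for imputation because, unlike the mean, it is barely affected by the faulty spike. Whether a flagged value should be dropped, capped, or kept depends on the application domain.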

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “Machine Learning Tasks and Algorithms”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging. The reason is that the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm would produce unexpected outcomes, which may lead to wasted effort and reduce the model’s effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “Applications of Machine Learning”. However, hybrid learning models, e.g., ensembles of methods, modifications or enhancements of existing learning techniques, or the design of new learning methods, could be potential future work in the area.
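One common way to compare candidate algorithms on a given dataset is cross-validation. The sketch below is an illustration, not a procedure prescribed by the source: it applies leave-one-out evaluation to two deliberately simple classifiers, a majority-class baseline and a 1-nearest-neighbor rule, on invented one-dimensional data.

```python
from collections import Counter


def majority_classifier(train_x, train_y, x):
    """Baseline: always predict the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


def nearest_neighbor_classifier(train_x, train_y, x):
    """Predict the label of the closest training example."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


def leave_one_out_accuracy(classifier, xs, ys):
    """Hold out each example in turn, train on the rest, and score the prediction."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += classifier(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Hypothetical feature values with two well-separated classes.
xs = [1.0, 1.2, 0.9, 1.1, 1.3, 3.0, 3.2, 2.9]
ys = ["low"] * 5 + ["high"] * 3

baseline_acc = leave_one_out_accuracy(majority_classifier, xs, ys)
knn_acc = leave_one_out_accuracy(nearest_neighbor_classifier, xs, ys)
```

On this data the nearest-neighbor rule classifies every held-out point correctly while the baseline only gets the majority class right. On data with different characteristics the ranking could change, which is why the comparison has to be repeated for each target application.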

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are unsuitable for learning, for example because they are non-representative, of poor quality, contain irrelevant features, or are insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have presented a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. In line with our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. The sophisticated learning algorithms then need to be trained on the collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we have summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can be used as a reference guide for potential research and applications by academia, industry professionals, and decision-makers, from a technical point of view.

References

Canadian Institute of Cybersecurity, University of New Brunswick, ISCX dataset, http://www.unb.ca/cic/datasets/index.html/ (Accessed on 20 October 2019).

CIC-DDoS2019 [Online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (Accessed on 28 March 2020).

World health organization: WHO. http://www.who.int/ .

Google trends. In https://trends.google.com/trends/ , 2019.

Adnan N, Nordin Shahrina Md, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140:

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49.

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018; pages 4774–4778. IEEE.

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8.

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24.

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599–619.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on, IEEE.1995:25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059.

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA.

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015; pages 1–6. IEEE.

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR, Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Alan C, Jun H, Khan AI, Abushark YB, Khaled S. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019; 1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, MdK Uddin, Tawfeeq A. Mobile data science and intelligent apps: concepts, ai-based modeling and research directions. Mob Netw Appl, pages 1–19, 2020.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020; page 102762

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009;2009:1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.

Download references

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of interest.

The author declares no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN COMPUT. SCI. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x

Received: 27 January 2021

Accepted: 12 March 2021

Published: 22 March 2021

DOI: https://doi.org/10.1007/s42979-021-00592-x

Machine learning
Deep learning
Artificial intelligence
Data science
Data-driven decision-making
Predictive analytics
Intelligent applications

COMMENTS

Financial applications of machine learning: A literature review
Review of financial applications of machine learning. This section presents a comprehensive review of existing literature across the six financial areas: stock markets, portfolio management, cryptocurrency, foreign exchange markets, financial crisis, and bankruptcy and insolvency. The performed review of the 126 selected articles includes an ...
Financial applications of machine learning: A literature review
This systematic literature review analyses the recent advances of machine learning and deep learning in finance. The study considers six financial domains: stock markets, portfolio management, cryptocurrency, forex markets, financial crisis, bankruptcy and insolvency.
Financial applications of machine learning: A literature review
2024. TLDR. This study innovatively uses the machine learning method and explores the differences in the predictive effects of multi-dimensional features on the digital transformation of enterprises based on the Technology-Organization-Environment (TOE) theory, identifying the main drivers affecting digital transformation and the fitting models ...
Full article: Machine learning In the financial industry: A
It is observed that many of the publications (Table 3) appear in non-technology-based journals, suggesting that the articles are primarily on the application of machine learning tools to the financial environment, as the search was limited to specific relevant keywords (machine learning and finance) as well as journals limited to Economics ...
Machine learning for financial forecasting, planning and analysis
While the naive application of machine learning usually fails in this context, the recently developed double machine learning framework can address causal questions of interest. We review the current literature on machine learning in FP&A and illustrate in a simulation study how machine learning can be used for both forecasting and planning.
Financial applications of machine learning: A literature review
Journal of International Money and Finance, 2008. Neural network applications in business: A review and analysis of the literature (1988-1995) Decision Support Systems, 1997. Cited by 33 articles. Scilit is a comprehensive content aggregator platform for scholarly publications. It is developed and maintained by the open ...
Machine Learning in Finance: A Metadata-Based Systematic Review of the
Machine learning in finance has been on the rise in the past decade. The applications of machine learning have become a promising methodological advancement. The paper's central goal is to use a metadata-based systematic literature review to map the current state of neural networks and machine learning in the finance field. After collecting a large dataset comprised of 5053 documents, we ...
Artificial intelligence in Finance: a comprehensive review through
From the review of the literature represented by this stream, it emerges that neural networks and machine learning algorithms are used to build intelligent automated trading systems. To give some examples, Creamer and Freund (2010) create a machine learning-based model that analyses stock price series and then selects the best-performing ...
Transforming the Financial Industry Through Machine and Deep Learning
The domain of finance has been one of the most extensively delved into application areas for Machine Learning (ML). The application of ML and DL in finance has garnered substantial consideration from both academic researchers and financial industry practitioners over the past few decades. A plethora of studies have been conducted, yielding ...
Financial applications of machine learning: A literature review
Abstract. This systematic literature review analyses the recent advances of machine learning and deep learning in finance. The study considers six financial domains: stock markets, portfolio ...
Artificial Intelligence & Machine Learning in Finance: A literature review
A literature review. Abstract: In the 2020s, Artificial Intelligence (AI) has been increasingly becoming a dominant technology, and thanks to new computer technologies, Machine Learning (ML ...
(PDF) Financial Modeling With Machine Learning
Process automation is one of the most common applications of machine learning in finance. This technology can automate repetitive tasks and replace manual labor, as well as increase productivity. Machine ...
Machine Learning in Finance- Emerging Trends and Challenges
machine learning and artificial intelligence-based models and applications in their day-to-day operations. The rest of the chapter is organized as follows. Section 2 presents some emerging applications of machine learning in the financial domain. Section 3 highlights emerging computing paradigms in finance. Some important modeling paradigms in
Machine learning in internet financial risk management: A systematic
Thus, based on the data from Web of Science and Scopus databases, this paper conducts a systematic literature review on all aspects of machine learning in internet finance risk in recent years, based on publications trends, geographical distribution, literature focus, machine learning models and algorithms, and evaluations.
Machine learning techniques and data for stock market forecasting: A
In this literature review, we investigate machine learning techniques that are applied for stock market prediction. A focus area in this literature review is the stock markets investigated in the literature as well as the types of variables used as input in the machine learning techniques used for predicting these markets.
Financial applications of machine learning: A literature review
The study considers six financial domains: stock markets, portfolio manageme... Financial applications of machine learning: A literature review: Expert Systems with Applications: An International Journal: Vol 219, No C
A Literature Review on Machine Learning Applications in Financial
DOI: 10.15415/jtmge.2020.111004 Corpus ID: 225387777; A Literature Review on Machine Learning Applications in Financial Forecasting @article{Muskaan2020ALR, title={A Literature Review on Machine Learning Applications in Financial Forecasting}, author={Muskaan and Pradeepta Kumar Sarangi}, journal={Journal of Technology Management for Growing Economies}, year={2020}, url={https://api ...
The Rise of AI and ML in Financial Technology: An In-depth ...
Choudhury et al. "Application of machine learning in fintech: A review" provides an overview of the use of machine learning in the financial technology (fintech) industry. The authors discuss the benefits and challenges of using machine learning in fintech, as well as some of the most common applications of the technology in the industry.
Literature review: Machine learning techniques applied to financial
The literature review is a method for investigating the approaches of a studied topic, as stated by Lage Junior and Godinho Filho (2010, p. 14). The following section briefly presents a review of the main machine learning techniques covered in the articles selected for this study.
Applications of Artificial Intelligence and Machine Learning-based
Using the PSALSAR framework, we conduct a Systematic Literature Review (SLR) of 55 articles published between 1999 and 2022. The review highlights the potential of SupTech to navigate the complexities of the financial landscape, and its versatile applications for financial supervisors and regulators, stock exchanges, intermediaries, and investors.
Machine Learning: Algorithms, Real-World Applications and ...
Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs, i.e., a task-driven ...