phishing detection using machine learning thesis

Open Access

Peer-reviewed

Research Article

Detecting phishing websites using machine learning technique

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliation Department of Computer Science and Information System, College of Applied Sciences, Almaarefa University, Riyadh, Saudi Arabia

Ashit Kumar Dutta

Published: October 11, 2021
https://doi.org/10.1371/journal.pone.0258361

In recent years, advancements in Internet and cloud technologies have led to a significant increase in electronic trading, in which consumers make online purchases and transactions. This growth also creates opportunities for unauthorized access to users’ sensitive information and damages the resources of an enterprise. Phishing is one of the familiar attacks that trick users into accessing malicious content and giving up their information. In terms of website interface and uniform resource locator (URL), most phishing webpages look identical to the actual webpages. Various strategies for detecting phishing websites, such as blacklist and heuristic approaches, have been suggested. However, due to inefficient security technologies, the number of victims is increasing exponentially. The anonymous and uncontrollable framework of the Internet makes it more vulnerable to phishing attacks. Existing research shows that the performance of phishing detection systems is limited. There is a demand for an intelligent technique to protect users from these cyber-attacks. In this study, the author proposes a URL detection technique based on machine learning approaches. A recurrent neural network method is employed to detect phishing URLs. The author evaluated the proposed method with 7900 malicious and 5800 legitimate sites. The experimental results show that the proposed method performs better than recent approaches to malicious URL detection.

Citation: Dutta AK (2021) Detecting phishing websites using machine learning technique. PLoS ONE 16(10): e0258361. https://doi.org/10.1371/journal.pone.0258361

Editor: Zhihan Lv, Qingdao University, CHINA

Received: April 26, 2021; Accepted: September 26, 2021; Published: October 11, 2021

Copyright: © 2021 Ashit Kumar Dutta. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are located within the manuscript and its Supporting information files, and at https://github.com/shreyagopal/Phishing-Website-Detection-by-Machine-Learning-Techniques.git .

Funding: No funding received for this research.

Competing interests: No conflict of interest.

1. Introduction

Phishing is a fraudulent technique that uses social and technological tricks to steal customer identification and financial credentials. Social engineering schemes use spoofed e-mails that appear to come from legitimate companies and agencies to lead users to fake websites, where they divulge financial details such as usernames and passwords [ 1 ]. Hackers also install malicious software on computers to steal credentials, often using systems that intercept the usernames and passwords of consumers’ online accounts. Phishers use multiple methods, including e-mail, Uniform Resource Locators (URLs), instant messages, forum postings, telephone calls, and text messages, to steal user information. Phishing content is structured to resemble the original content and tricks users into accessing it so that their sensitive data can be obtained. The primary objective of phishing is to gain personal information for financial gain or identity theft. Phishing attacks are causing severe economic damage around the world. Moreover, most phishing attacks target financial/payment institutions and webmail, according to the latest phishing trend studies of the Anti-Phishing Working Group (APWG) [ 1 ].

In order to obtain confidential data, criminals develop unauthorized replicas of a real website and e-mail, typically from a financial institution or other organization dealing with financial data [ 2 – 4 ]. The e-mail is rendered using a legitimate company’s logos and slogans. The design and structure of HTML allow images or an entire website to be copied [ 5 ]. This ease of copying is one of the factors behind the rapid growth of the Internet as a communication medium, but it also enables the misuse of brands, trademarks, and other company identifiers that customers rely on as authentication mechanisms [ 6 – 8 ]. To trap users, a phisher sends "spoofed" e-mails to as many people as possible. When these e-mails are opened, customers tend to be diverted from the legitimate entity to a spoofed website.

There is a significant chance of exploitation of user information. For these reasons, phishing is an urgent, challenging, and critical problem in modern society [ 9 , 10 ]. Several recent studies have addressed phishing based on the characteristics of a domain, such as website URLs, website content, a combination of URLs and content, the source code of the website, and a screenshot of the website [ 11 ]. However, there is a lack of useful anti-phishing tools that an organization can use to detect malicious URLs and protect its users. If malicious code is implanted on a website, hackers may steal user information and install malware, which poses a serious risk to cybersecurity and user privacy. Malicious URLs on the Internet can be identified by analyzing them with Machine Learning (ML) techniques [ 12 , 13 ]. The conventional URL detection approach is based on a blacklist (a set of malicious URLs) obtained from user reports or manual review. The blacklist is used to verify a URL, and the URLs in the blacklist are updated frequently. However, the number of malicious URLs that are not on the blacklist is increasing significantly. For instance, cybercriminals can use a Domain Generation Algorithm (DGA) to circumvent the blacklist by creating new malicious URLs. Thus, it is almost impossible to maintain an exhaustive blacklist of malicious URLs [ 14 , 15 ], and new malicious URLs cannot be identified with the existing approaches. To resolve the limitations of blacklist-based systems, researchers have suggested machine-learning-based methods for identifying malicious URLs [ 16 – 18 ]. Malicious URL detection is treated as a binary classification task with two classes: malicious and benign. Training the ML method consists of finding the best mapping between the d-dimensional feature vector space and the output variable [ 19 – 21 ]. Compared to the blacklist approach, this strategy has a strong generalization capacity for finding unknown malicious URLs.
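The limitation of the blacklist approach described above can be illustrated with a minimal sketch. The function names and the example hosts are hypothetical and are not taken from the study; the point is only that an exact-match lookup cannot flag a freshly generated host.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of known malicious hosts (illustrative values only).
BLACKLIST = {"login-paypa1.example", "secure-bank.example"}


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_blacklisted(url, blacklist=BLACKLIST):
    """Exact-match lookup: True only if the host is already on the list."""
    return host_of(url) in blacklist


# A known host is caught, but a newly generated one (for example from a DGA)
# passes unnoticed until somebody reports it and the list is updated.
known = is_blacklisted("http://login-paypa1.example/signin")
fresh = is_blacklisted("http://xk3q9z.example/signin")
```

A learned classifier avoids this gap because it scores properties of the URL rather than checking membership in a fixed list.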

Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) is one of the ML techniques that offers a solution for complex real-time problems [ 22 ]. LSTM allows an RNN to store inputs for a longer period [ 23 ], similar to the concept of storage in a computer. In addition, each feature is processed according to a uniform distribution [ 24 ]. The combination of RNN and LSTM makes it possible to extract a large amount of information from a minimal set of data. Therefore, it helps a phishing detection system identify a malicious site in a shorter time.

In contrast to most previous approaches, this study focuses on identifying malicious URLs within a massive set of URLs. It therefore proposes a Recurrent Neural Network (RNN) based URL detection approach. The objectives of the study are as follows:

To develop a novel approach to detect malicious URLs and alert users.
To apply ML techniques in the proposed approach in order to analyze real-time URLs and produce effective results.
To implement the concept of RNN, a familiar ML technique that is capable of handling a huge amount of data.

The rest of the paper is organized as follows: Section 1 introduces the concept of malicious URLs and the objectives of the study. The background of the study and the related literature on URL detection are discussed in section 2. Section 3 presents the methodology of the research. Results and discussion are presented in section 4. Finally, section 5 concludes the study and outlines its future direction.

2. Research background and related works

Phishing attacks are categorized according to the phisher’s mechanism for trapping targeted users. Several forms of these attacks exist, such as keyloggers and DNS poisoning [ 2 ]. The initiation channels in social engineering include online blogs, short message services (SMS), social media platforms that use Web 2.0 services, such as Facebook and Twitter, peer-to-peer file-sharing services, and Voice over IP (VoIP) systems where the attackers use caller ID spoofing [ 3 , 4 ]. Each form of phishing differs slightly in how the process is carried out to defraud the unsuspecting consumer. E-mail phishing attacks occur when an attacker sends potential victims an e-mail with a link that directs them to phishing websites.

2.1 Classification of phishing attack techniques

Phishing websites are challenging for organizations and individuals because of their similarity to legitimate websites [ 5 ]. Fig 1 presents the multiple forms of phishing attacks. Technical subterfuge refers to attacks that include keylogging, DNS poisoning, and malware. In these attacks, the attacker intends to gain access through a tool or technique: users trust the network, while the network has been compromised by the attackers. Social engineering attacks include spear phishing, whaling, SMS, vishing, and mobile applications. In these attacks, attackers focus on a group of people or an organization and trick them into using the phishing URL [ 6 , 7 ]. Apart from these, many new attacks are emerging rapidly as technology evolves.

https://doi.org/10.1371/journal.pone.0258361.g001

2.2 Phishing detection approaches

Phishing detection schemes that detect phishing on the server side are better than phishing prevention strategies and user training systems. These systems can be used either via a web browser on the client or through specific host-site software [ 8 , 9 ]. Fig 2 presents the classification of phishing detection approaches. Heuristic and ML based approaches rely on supervised and unsupervised learning techniques, which require features or labels to learn an environment and make a prediction. Proactive phishing URL detection is similar to the ML approach; however, the URLs themselves are processed to help a system predict whether a URL is legitimate or malicious [ 11 – 15 ]. Blacklist and whitelist approaches are the traditional methods of identifying phishing sites [ 16 – 21 ]. The exponential growth of web domains reduces the performance of these traditional methods [ 22 – 24 ].

https://doi.org/10.1371/journal.pone.0258361.g002

The existing methods keep reliance on novice Internet users to a minimum. Once they identify a phishing website, the site is made inaccessible, or the user is informed of the probability that the website is not genuine. This approach requires minimal user training and no modifications to existing website authentication systems. The performance of the detection systems is calculated according to the following:

Number of True Positives (TP): The total number of malicious websites correctly predicted as malicious.
Number of True Negatives (TN): The total number of legitimate websites correctly predicted as legitimate.
Number of False Positives (FP): The total number of legitimate websites incorrectly predicted as malicious.
Number of False Negatives (FN): The total number of malicious websites incorrectly predicted as legitimate.
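As a minimal sketch, the four counts above can be combined into the accuracy, precision, recall, and F1 measures using their standard definitions. The function name and the example counts are illustrative assumptions and are not values reported in the study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute standard detection metrics from confusion-matrix counts.

    tp: malicious sites flagged as malicious
    tn: legitimate sites passed as legitimate
    fp: legitimate sites flagged as malicious
    fn: malicious sites passed as legitimate
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only, not results from the study.
example = detection_metrics(tp=90, tn=80, fp=20, fn=10)
```

Guarding each division keeps the function usable on degenerate test sets, for example one that contains no malicious URLs at all.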

The accuracy of phishing detection systems is usually evaluated using a benchmark dataset. The familiar datasets used to train ML based techniques are as follows:

2.2.1 Normal dataset.

AlexaRank [ 25 ] is used as a benchmarking dataset of benign, ordinary websites. Alexa is a commercial enterprise that carries out web data analysis. It obtains the browsing habits of users from different sources and analyses them objectively to report on and rank Internet URLs. Researchers use the rankings provided by Alexa to collect a number of highly ranked websites as the normal dataset to test and classify websites. Alexa presents the dataset as a raw text file in which each line, in ascending order, gives the rank and domain name of a website.

2.2.2 Phishing dataset.

Phishtank is a familiar phishing website benchmark dataset, available at https://phishtank.org/ . It is a community framework that tracks phishing sites. Various users and third parties submit suspected phishing sites, which are ultimately verified as genuine phishing sites by a number of users. Thus, Phishtank offers a phishing website dataset in real time. Researchers use Phishtank’s website to build data collections for testing and detecting phishing websites. The Phishtank dataset is available in Comma Separated Value (CSV) format, with each line of the file describing one reported phish. The details provided include ID, URL, time of submission, checked status, online status, and target URLs.
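A CSV export of this kind can be read with Python's standard `csv` module. The sketch below is illustrative only: the column names (`url`, `verified`, `online`) are assumptions modelled on the fields listed above, the sample rows are invented, and `load_verified_phish_urls` is a hypothetical helper rather than part of the study's code.

```python
import csv
import io

# Invented sample in an assumed column layout (not real Phishtank data).
SAMPLE_CSV = (
    "phish_id,url,submission_time,verified,online,target\n"
    "1,http://a.example/login,2020-06-01T10:00:00,yes,yes,Other\n"
    "2,http://b.example/,2020-06-02T11:30:00,yes,no,Other\n"
    "3,http://c.example/pay,2020-06-03T09:15:00,no,yes,Other\n"
)


def load_verified_phish_urls(csv_text):
    """Return URLs of rows marked both verified and online."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["url"]
        for row in reader
        if row.get("verified") == "yes" and row.get("online") == "yes"
    ]


phish_urls = load_verified_phish_urls(SAMPLE_CSV)
```

Filtering on the checked and online status keeps only entries that the community has confirmed and that can still be reached, which is one reasonable way to build a labelled malicious set.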

2.3 Research questions

The author framed the Research Questions (RQ) according to the objectives of the study and its background. They are as follows:

RQ1: How do URL detectors identify phishing URLs or websites?
RQ2: How can ML methods be applied to classify malicious and legitimate websites?
RQ3: How can the performance of a URL detector be evaluated?

RQ1 and RQ2 help to develop an ML based phishing detection system for securing a network from phishing attacks, whereas RQ3 specifies the importance of evaluating the performance of a phishing detection technique. To address RQ1, the author reviewed recent literature related to URL detection using Artificial Intelligence (AI) techniques. The following part of this section presents the studies in detail, with a summary in Table 1.

The authors of the study [ 2 ] proposed a URL-based anti-phishing machine learning method. They extracted 14 features of the URL that distinguish phishing websites from legitimate websites and used them to classify a website as malicious or legitimate. More than 33,000 phishing and valid URLs were used to train the proposed system with Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The phishing detection method focused on the learning process. Their experiment reached a precision of over 90% when websites were classified with SVM.

The study [ 3 ] explored multiple ML methods to detect URLs by analyzing various URL components using machine learning and deep learning methods. Authors addressed various methods of supervised learning for the identification of phishing URLs based on lexicon, WHOIS properties, PageRank, traffic rank information and page importance properties. They studied how the volume of different training data influences the accuracy of classifiers. The research includes Support Vector Machine (SVM), K-NN, random forest classification (RFC) and Artificial Neural Network (ANN) techniques for the classification.

The study [ 4 ] carried out a comparative study of machine learning algorithms based on their output with and without feature selection. Experiments were carried out on a phishing dataset with 30 features, comprising 4898 phished and 6157 benign web pages. Several ML methods were used to yield a better outcome, and a feature selection method was subsequently employed to increase model performance. The random forest algorithm achieved the highest accuracy both before and after feature selection and dramatically increased building time. The results of the experiment showed that using feature selection with machine learning algorithms can boost the effectiveness of classification models for phishing detection without reducing their performance.

In the study [ 5 ], the authors proposed URLNet, a CNN-based deep neural network for URL detection. They argued that current methods often use Bag of Words (BoW)-like features and suffer from some essential limitations, such as the failure to detect sequential patterns in a URL string, the lack of automated feature extraction, and the failure to handle unseen features in real-time URLs. They developed character-level CNNs and word-level CNNs and configured the network accordingly. In addition, they suggested advanced techniques that were particularly effective for handling rare words, a problem that commonly exists in malicious URL detection tasks. This approach allows URLNet to learn embeddings and use sub-word information from unseen words during the testing phase.

The authors of [ 6 ] introduced a method for detecting phishing URLs with innovative lexical features and a blacklist. They collected a list of URLs from URL repositories using a crawler and extracted 18 common lexical features. They implemented advanced ML techniques consisting of under/oversampling and classification. The automated approaches outperformed other existing ML approaches. The study focused on content features and not lexical features, which was difficult to implement in real-world environments. The experimental results were better than those of the existing classification algorithms.

In the study [ 7 ], the authors investigated how well phishing URLs can be classified within a set of URLs that also contains benign URLs. They discussed randomisation, feature engineering, and feature extraction using host-based lexical analysis and statistical analysis. For the comparative study, several classifiers were applied, and the results across the different classifiers were found to be almost consistent. The authors argued that they proposed a convenient approach to extracting features from URLs using simple standard words. More features could be experimented with, which might lead to better results. The dataset used in the study includes some older URLs, so its performance may be limited.

The authors of [ 8 ] suggested a high-precision URL detector for phishing attacks. They argued that the technique could be scaled to various sizes and proactively adapted. A limited data collection of 572 cases, covering both legitimate and malicious URLs, was employed. The features were extracted and then weighted as cases for use in the prediction process. The test results were highly reliable with and without online phishing threats. A Genetic Algorithm (GA) was used to improve the accuracy. Table 1 presents the outcome of the comparative study of the literature.

https://doi.org/10.1371/journal.pone.0258361.t001

The authors of [ 9 ] developed a detection approach for classifying malicious and normal webpages. The outcome of this study indicated that the true positive rate was higher than the false positive rate. In another study [ 10 ], the authors proposed a Convolutional Neural Network (CNN) to detect phishing URLs. The researchers employed a sequential pattern to capture the URL information. It achieved accuracies of 98.58%, 95.46%, and 95.22% on benchmark datasets.

In the study [ 11 ], the authors employed a generative adversarial network for classifying URLs and bypassing blacklist-based phishing detectors. In addition, the researchers argued that the system can bypass both simple and novice ML detection techniques.

Based on the related work and its performance, the author selected two studies for comparison with the proposed URL detector: Hung Le et al. [ 5 ] and Hong J. et al. [ 6 ]. These studies were selected because they applied deep learning methods and achieved an average accuracy of 90%.

3. Research methodology

RQ2 asks how an ML method can be employed to identify a malicious or legitimate URL. As a solution, the author proposes the framework shown in Fig 3 for classifying URLs and identifying phishing URLs.

https://doi.org/10.1371/journal.pone.0258361.g003

During the training phase, the RNN stores the properties Pm and Pl to learn the environment. Each URL of the dataset from Phishtank [ 23 ] and each crawled URL is used to instruct the model. Algorithms 3.1 and 3.2 present the steps involved in data collection and pre-processing, respectively. Algorithms 3.3 and 3.4 show the training phase and testing phase, respectively. The training phase uses the labels to train the RNN to learn malicious and legitimate URLs. The testing phase of the proposed RNN model then receives each URL and predicts its type. The RNN (LSTM) is developed with Python 3.0 in a Windows 10 environment on an i7 processor.

The LSTM model is an effective predictive model. It generates an output based on an arbitrary number of steps. There are five essential components that enable the model to retain long-term and short-term data.

Cell state (CS): The cell space that accommodates both long-term and short-term memories.

Hidden state (HS): The output state information used to classify a URL, computed from the current data, the previous hidden state, and the current cell input. The hidden state is used to retrieve both short-term and long-term memory in order to make a prediction.

Input gate (IT): The amount of information that flows into the cell state.

Forget gate (FT): The amount of data that flows from the current input and the past cell state into the present cell state.

Output gate (OT): The amount of information that flows into the hidden state.

3.1. Input gate

The input gate identifies the input values used to modify the memory. The sigmoid function decides which values to let through, on a scale from 0 to 1, and the tanh function weights the values that are passed, rating their significance from -1 to 1. Eqs 5 and 6 represent the input gate and cell state, respectively. W_n is the weight, HT_{t-1} is the previous hidden state, x_t is the input, and b_n is the bias vector, which needs to be learnt during the training phase. CT is calculated using the tanh function.
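The equations themselves are not reproduced in this text. As a hedged reconstruction, the textbook LSTM form of the input gate and the candidate cell state, written with the symbols defined above, is given below; the subscripts i and c on the weights and biases are an assumed instantiation of the generic W_n and b_n and may differ from the article's own Eqs 5 and 6.

```latex
\begin{align}
IT_t &= \sigma\left(W_i \cdot [HT_{t-1}, x_t] + b_i\right) \\
\widetilde{CT}_t &= \tanh\left(W_c \cdot [HT_{t-1}, x_t] + b_c\right)
\end{align}
```

Here [HT_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input, and \sigma is the sigmoid function.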

3.2. Forget gate

The forget gate determines which block information is to be discarded from the memory, and it is described by the sigmoid function. In Eq 7, the previous hidden state (HT_{t-1}) and the current input (x_t) are examined, and a value between 0 and 1 is output for each number in the previous cell state CT_{t-1}.

3.3. Output gate

The input and the memory of the block are used to determine the output. The sigmoid function decides which values to let through, on a scale from 0 to 1. The tanh function weights the values that are passed, rating their importance from -1 to 1, and the result is multiplied by the output of the sigmoid.
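The three gates described in sections 3.1 to 3.3 can be combined into a single time step. The sketch below follows the standard LSTM formulation in plain Python; it is not the study's implementation, and the parameter names (`Wf`, `Wi`, `Wc`, `Wo` and the matching biases) are assumptions introduced for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, squashing z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W.v + b for a list-of-rows matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t  # list concatenation: [HT_{t-1}, x_t]
    forget = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]
    inp = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]
    cand = [math.tanh(a) for a in affine(p["Wc"], z, p["bc"])]
    out = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]
    # Keep part of the old memory and add the gated candidate values.
    c_t = [f * c + i * g for f, c, i, g in zip(forget, c_prev, inp, cand)]
    # The hidden state is the gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(out, c_t)]
    return h_t, c_t


def zero_params(hidden, inputs):
    """All-zero parameters, used only to make the example easy to check."""
    width = hidden + inputs
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W" + name] = [[0.0] * width for _ in range(hidden)]
        params["b" + name] = [0.0] * hidden
    return params


h_new, c_new = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0],
                         zero_params(hidden=2, inputs=3))
```

With all-zero parameters every gate evaluates to 0.5 and the candidate to 0, so the step simply halves the previous cell state, which makes the data flow through the gates easy to verify by hand.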

Fig 4 represents the processes involved in data collection. Data repositories, namely Phishtank and a crawler, are used to collect malicious and benign URLs. The crawler was developed to collect URLs from the AlexaRank website, which publishes a ranked set of URLs to support the research community. In this study, the crawler collected 7658 URLs from AlexaRank between June 2020 and November 2020, and 6042 URLs were collected from Phishtank datasets. During data collection, the extracted data are stored in W and returned as W1 together with the number of URLs, N.

https://doi.org/10.1371/journal.pone.0258361.g004

Fig 5 illustrates the steps of data pre-processing. Here, url is one of the elements of the URL dataset. In this process, the raw data is pre-processed by scanning each URL in the dataset. A set of functions was developed to remove irrelevant data. Finally, D2 is the set of features returned by the pre-processing activity.
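The article does not specify which data count as irrelevant. As a minimal sketch under that caveat, the hypothetical function below scans each raw URL and drops blank entries, entries without a host, and duplicates; these three rules are assumptions chosen for illustration, not the study's actual cleaning functions.

```python
from urllib.parse import urlsplit


def preprocess_urls(raw_urls):
    """Scan each raw URL; drop blanks, host-less entries, and duplicates."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        candidate = raw.strip()
        if not candidate:
            continue  # blank line
        if "://" not in candidate:
            candidate = "http://" + candidate  # bare domain, e.g. from a rank list
        parts = urlsplit(candidate)
        if not parts.hostname:
            continue  # nothing usable as a host
        # hostname is already lower-cased; ignore a trailing slash in the path.
        key = (parts.hostname, parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue  # duplicate of an earlier entry
        seen.add(key)
        cleaned.append(candidate)
    return cleaned


cleaned_urls = preprocess_urls([
    "  http://Example.com/ ",
    "",
    "example.com",
    "http://example.com/login",
    "   ",
])
```

Comparing on the lower-cased host keeps `Example.com` and `example.com` from being counted twice, which matters when benign and malicious lists are merged.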

https://doi.org/10.1371/journal.pone.0258361.g005

Fig 6 represents the processes of data transformation. During this process, each feature in D2 is converted into a vector using the GenerateVectors function. “Num” is the vector returned by the data transformation process, and it is passed as an input to the training phase.
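One common way to turn a URL into a numeric vector for a recurrent model is character-level encoding with padding to a fixed length. The sketch below illustrates that idea only; the article does not describe the internals of GenerateVectors, so the alphabet, the reserved ids, and the `generate_vector` helper are all assumptions.

```python
import string

# Assumed alphabet: lower-case letters, digits, and common URL punctuation.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD_ID = 0      # fills the vector up to the fixed length
UNKNOWN_ID = 1  # any character outside the alphabet
CHAR_TO_ID = {ch: index + 2 for index, ch in enumerate(ALPHABET)}


def generate_vector(url, max_len=20):
    """Encode a URL as a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in url.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


short_vector = generate_vector("ab", max_len=4)
long_vector = generate_vector("http://example.com/" + "a" * 50)
```

Truncating and padding to one length lets URLs of any size be stacked into a single batch for training.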

https://doi.org/10.1371/journal.pone.0258361.g006

Fig 7 shows the processes involved in the training phase. Each URL is processed in its vector form. LSTMLib is one of the functions of the LSTM and predicts an output from the vectors. The library is updated with the extracted features, which contain the necessary data related to malicious and normal web pages. An iterative process then scans each vector and suspicious URL and generates a final outcome. Lastly, op is the prediction returned by the proposed method during the training phase.

https://doi.org/10.1371/journal.pone.0258361.g007

Fig 8 shows the testing phase of the proposed URL detection. The proposed method compares each element from the LSTMMemory function with the vector of the URL and decides on an output. Here, f is the feedback element collected from the crawler, which indicates the page rank of a website. The page rank indicates the value of a website, and the lowest-ranking websites are declared malicious or suspicious in order to alert the users.

https://doi.org/10.1371/journal.pone.0258361.g008

Fig 9 shows a snippet of the epoch settings in the training phase. The epoch value sets the number of passes made over the training data and therefore influences the execution time of a method. The learning rate can be adjusted to improve the performance of a method.

https://doi.org/10.1371/journal.pone.0258361.g009

4. Results and discussions

The proposed method (LURL) is developed in Python 3.0 with the support of the scikit-learn and NumPy packages. The existing URL detectors were also constructed in order to evaluate the performance of LURL. Table 2 shows the parameter settings of the methods during the training and testing phases. Learning rate, maximum epoch, batch size, and decay are the parameters that instruct the methods to execute for a certain number of iterations. Threshold values and vocabulary size are the important parameters for the testing phase, which generates results using the test dataset.

https://doi.org/10.1371/journal.pone.0258361.t002

The methods are evaluated in terms of learning rate, accuracy, and precision. Table 3 presents the learning rate of the methods during the training phase. The performance of the three detectors during the training phase is similar, which shows that the learning ability of the methods is the same. The author maintained similar parameters for all detectors. However, the proposed method, LURL, produced a better outcome than Hung Le et al. [ 5 ] and Hong J. et al. [ 6 ]. LURL covered 94.3 percent of the data at a learning rate of 5.0, whereas Hung Le et al. and Hong J. et al. reached 93.8 and 92.8 percent, respectively. The learning rate of LURL is reasonable compared to the other two methods. The results indicate that ML based methods are able to scan an average of 84% of the dataset to learn the environment at a rate of 1.0.

https://doi.org/10.1371/journal.pone.0258361.t003

Table 4 shows the learning rate of the methods for the Crawler dataset. As discussed in section 3, the Crawler dataset was generated with the support of the AlexaRank dataset. It contains a larger number of normal URLs than malicious URLs. The intention in employing the Crawler dataset is to teach the methods to predict legitimate URLs. It is very difficult to classify a website without analysing its content, and a phishing site closely resembles a legitimate website. Therefore, the methods need to understand the differences between legitimate and malicious websites. Based on the outcome, the performance of all detectors is similar. As with the Phishtank dataset, all three methods consumed an average of 86% of the data at a rate of 1.0. The reason for the faster rate is that an RNN can read numeric data faster than images [ 12 ].

https://doi.org/10.1371/journal.pone.0258361.t004

There is a demand for an effective phishing detection system to secure a network or an individual’s privacy and data. RQ3 asks how to measure the efficiency of URL detectors; here, the performance of the proposed method is evaluated using the learning rate, accuracy, and F1 score, and Tables 5 and 6 present the results. Table 5 shows the accuracy of the detectors on the Phishtank and Crawler datasets. LURL produced an average accuracy of 97.4% and 96.8% for the Phishtank and Crawler datasets, respectively. Hung Le et al. and Hong J. et al. reached averages of 93.8, 94.1, 96.7, and 93.6 for the Phishtank and Crawler datasets. It is evident that the performance of LURL is better than that of the other URL detectors. Fig 10 illustrates the corresponding graph of Table 4 . It shows that LURL generated its output in less time than the other predictors.

https://doi.org/10.1371/journal.pone.0258361.g010

https://doi.org/10.1371/journal.pone.0258361.t005

https://doi.org/10.1371/journal.pone.0258361.t006

Finally, Table 6 compares the F1-scores of the URL detectors. As presented in section 2, TP and TN count the correctly classified malicious and legitimate URLs, respectively. Based on TP, TN, FP, and FN, the precision and recall values are calculated, and the F1-measure is computed from these values. It indicates the retrieval ability of a URL detector. The outcome indicates that the proposed URL detector, LURL, is superior to the other two URL detectors. The reason for the better F1-measure is the capability of the LSTM memory. Fig 11 shows the F1-score against the computation time. LURL achieved an F1-score of 96.4 in 4.62 seconds for the Phishtank dataset, whereas Hung Le et al. and Hong J. et al. achieved 95.8 and 92.7 in 3.87 and 5.23 seconds, respectively. For the Crawler dataset, the F1-score of LURL is 94.8, whereas Hung Le et al. and Hong J. et al. reached 95.6 and 95.3, respectively.

https://doi.org/10.1371/journal.pone.0258361.g011

5. Conclusion

This study treated phishing detection as a classification problem, in which websites are automatically categorized into a predetermined set of class values based on several features and the class variable. ML based phishing detection techniques depend on website functionalities to gather information that can help classify websites and detect phishing sites. The problem of phishing cannot be eradicated, but it can be reduced in two ways: by improving targeted anti-phishing procedures and techniques, and by informing the public about how fraudulent phishing websites can be detected and identified. ML anti-phishing techniques are essential for combating the ever-evolving complexity of phishing attacks and tactics. The author employed the LSTM technique to identify malicious and legitimate websites. A crawler was developed that crawled 7900 URLs from the AlexaRank portal, and the Phishtank dataset was also employed to measure the efficiency of the proposed URL detector. The outcome of this study reveals that the proposed method produces better results than the existing deep learning methods. A total of 7900 malicious URLs were detected using the proposed URL detector. It achieved better accuracy and F1-score in a limited amount of time. The future direction of this study is to develop an unsupervised deep learning method to generate insight from a URL. In addition, the study can be extended to generate an outcome for a larger network and protect the privacy of individuals.

Supporting information

https://doi.org/10.1371/journal.pone.0258361.s001

https://doi.org/10.1371/journal.pone.0258361.s002

Acknowledgments

The author would like to acknowledge the support provided by AlMaarefa University while conducting this research work.

1. Anti-Phishing Working Group (APWG), https://docs.apwg.org//reports/apwg_trends_report_q4_2019.pdf
4. Gandotra E., Gupta D, “An Efficient Approach for Phishing Detection using Machine Learning”, Algorithms for Intelligent Systems , Springer, Singapore, 2021, https://doi.org/10.1007/978-981-15-8711-5_12 .
5. Hung Le, Quang Pham, Doyen Sahoo, and Steven C.H. Hoi, “URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection”, Conference’17 , Washington, DC, USA, arXiv:1802.03162, July 2017.
6. Hong J., Kim T., Liu J., Park N., Kim SW, “Phishing URL Detection with Lexical Features and Blacklisted Domains”, Autonomous Secure Cyber Systems . Springer, https://doi.org/10.1007/978-3-030-33432-1_12 .
7. J. Kumar, A. Santhanavijayan, B. Janet, B. Rajendran and B. S. Bindhumadhava, “Phishing Website Classification and Detection Using Machine Learning,” 2020 International Conference on Computer Communication and Informatics (ICCCI) , Coimbatore, India, 2020, pp. 1–6, 10.1109/ICCCI48352.2020.9104161.
11. AlEroud A, Karabatis G. Bypassing detection of URL-based phishing attacks using generative adversarial deep neural networks. In: Proceedings of the Sixth International Workshop on Security and Privacy Analytics 2020 Mar 16 (pp. 53–60).
13. J. Anirudha and P. Tanuja, “Phishing Attack Detection using Feature Selection Techniques”, Proceedings of International Conference on Communication and Information Processing (ICCIP) , 2019, http://dx.doi.org/10.2139/ssrn.3418542
14. Wu CY, Kuo CC, Yang CS, “A phishing detection system based on machine learning”, In: 2019 International Conference on Intelligent Computing and its Emerging Applications (ICEA) , pp 28–32, 2019.
16. Srinivasa Rao R, Pais AR, “Detecting phishing websites using automation of human behavior”, In: Proceedings of the 3rd ACM workshop on cyber-physical system security , ACM, pp 33–42, 2017.
21. Gull S and SA Parah, “Color image authentication using dual watermarks”, In: 2019 fifth international conference on image information processing (ICIIP) , pp 240–245, 2019.
25. AlexaRank, https://www.alexa.com/siteinfo , Accessed: 2020–06–01

Phishing Websites Detection Using Machine Learning

7 Pages Posted: 27 May 2022

Suhani Jain

affiliation not provided to SSRN

Phishing is an online crime in which a criminal tries to persuade unsuspecting users to reveal sensitive (and valuable) personal information to the miscreant, such as usernames, passwords, financial account details, personal addresses, SSNs, and social contacts, for harmful purposes. Phishing is usually carried out by impersonating a reliable entity in Internet communication, which is accomplished through a combination of social engineering and technical trickery. Attackers regularly employ spoofing emails and deceptive websites to persuade users to provide personal information. Spoofing emails frequently pretend to be from legitimate companies and direct consumers to fake websites where they can enter important information. Phishing is one of the most common forms of online crime in today's world. Checking URLs against blacklists of known phishing websites, which are generally built based on manual verification, is a frequent countermeasure that is inefficient. As the Internet develops in size, automatic URL recognition becomes more necessary to offer end users with timely protection. This thesis explains how to use machine learning to detect dangerous phishing websites, with an emphasis on attributes retrieved just from the URL. It starts with a description of the available data and the feature engineering process, then moves on to choosing acceptable machine learning approaches. It compares algorithm performance and assesses the outcomes obtained.

Keywords: machine learning, classification, algorithm, Features Extraction


Open access
Published: 25 May 2022

An effective detection approach for phishing websites using URL and HTML features

Ali Aljofey 1 , 2 ,
Qingshan Jiang 1 ,
Abdur Rasool 1 , 2 ,
Hui Chen 1 , 2 ,
Wenyin Liu 3 ,
Qiang Qu 1 &
Yang Wang 4

Scientific Reports volume 12 , Article number: 8842 ( 2022 ) Cite this article


Computer science
Information technology
Scientific data

Today's growing phishing websites pose significant threats due to their extremely undetectable risk. They anticipate internet users to mistake them as genuine ones in order to reveal user information and privacy, such as login ids, pass-words, credit card numbers, etc. without notice. This paper proposes a new approach to solve the anti-phishing problem. The new features of this approach can be represented by URL character sequence without phishing prior knowledge, various hyperlink information, and textual content of the webpage, which are combined and fed to train the XGBoost classifier. One of the major contributions of this paper is the selection of different new features, which are capable enough to detect 0-h attacks, and these features do not depend on any third-party services. In particular, we extract character level Term Frequency-Inverse Document Frequency (TF-IDF) features from noisy parts of HTML and plaintext of the given webpage. Moreover, our proposed hyperlink features determine the relationship between the content and the URL of a webpage. Due to the absence of publicly available large phishing data sets, we needed to create our own data set with 60,252 webpages to validate the proposed solution. This data contains 32,972 benign webpages and 27,280 phishing webpages. For evaluations, the performance of each category of the proposed feature set is evaluated, and various classification algorithms are employed. From the empirical results, it was observed that the proposed individual features are valuable for phishing detection. However, the integration of all the features improves the detection of phishing sites with significant accuracy. The proposed approach achieved an accuracy of 96.76% with only 1.39% false-positive rate on our dataset, and an accuracy of 98.48% with 2.09% false-positive rate on benchmark dataset, which outperforms the existing baseline approaches.


Introduction

Phishing offenses are increasing, resulting in billions of dollars in losses 1 . In these attacks, users enter critical information (e.g., credit card details, passwords) into a forged website that appears to be legitimate. Software-as-a-Service (SaaS) and webmail sites are the most common targets of phishing 2 . The phisher builds websites that look very similar to benign websites. The phishing website link is then sent to millions of internet users via emails and other communication media. These types of cyber-attacks are usually initiated by emails, instant messages, or phone calls 3 . The aim of a phishing attack is not only to steal the victim's identity; it can also be performed to spread other types of malware such as ransomware, to exploit system weaknesses, or to obtain monetary profits 4 . According to the Anti-Phishing Working Group (APWG) report for the 3rd quarter of 2020, the number of phishing attacks has grown since March, and 28,093 unique phishing sites were detected between July and September 2 . The average amount demanded during wire transfer Business E-mail Compromise (BEC) attacks was $48,000 in the third quarter, down from $80,000 in the second quarter and $54,000 in the first.

Detecting and preventing phishing offenses is a significant challenge for researchers because phishers carry out attacks in ways that bypass existing anti-phishing techniques. Moreover, phishers can target even educated and experienced users by using new phishing scams. Thus, software-based phishing detection techniques are preferred for fighting phishing attacks. The most widely available methods for detecting phishing attacks are blacklists/whitelists 5 , natural language processing 6 , visual similarity 7 , rules 8 , machine learning techniques 9 , 10 , etc. Techniques based on blacklists/whitelists fail to detect unlisted phishing sites (i.e., 0-h attacks), and they also fail when a blacklisted URL is encountered with minor changes. In machine learning-based techniques, a classification model is trained using various heuristic features (e.g., URL, webpage content, website traffic, search engine, WHOIS record, and Page Rank) in order to improve detection efficiency. However, these heuristic features are not guaranteed to be present in all phishing websites and might also be present in benign websites, which may cause classification errors. Moreover, some of the heuristic features are hard to access and third-party dependent. Some third-party services (e.g., page rank, search engine indexing, WHOIS) may not be sufficient to identify phishing websites that are hosted on hacked servers, and these websites are inaccurately identified as benign because they appear in search results. Websites hosted on compromised servers are usually more than a day old, unlike other phishing websites, which are typically only a few hours old. Also, these services inaccurately identify a new benign website as a phishing site because of its low domain age.

Visual similarity-based heuristic techniques compare a new website with the pre-stored signature of a website. The website’s visual signature includes screenshots, font styles, images, page layouts, logos, etc. Thus, these techniques cannot identify fresh phishing websites and generate a high false-negative rate (phishing classified as benign). URL-based techniques do not consider the HTML of the webpage and may misjudge some malicious websites hosted on free or compromised servers. Many existing approaches 11 , 12 , 13 extract hand-crafted URL-based features, e.g., number of dots, presence of the special symbols “@”, “#”, and “–”, URL length, brand names in the URL, position of the top-level domain (TLD), whether the hostname is an IP address, presence of multiple TLDs, etc. However, extracting manual URL features remains a hurdle because the human effort involved requires time and extra maintenance labor costs. Because scammers carry out these offenses in ways that evade current anti-phishing methods, the use of hybrid methods rather than a single approach is highly recommended by network security managers.
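As an illustration of the hand-crafted URL features listed above, the following minimal sketch computes a few of them with the Python standard library. The function name, the feature names, and the example URL are assumptions made for illustration; the cited approaches define their own, larger feature sets.

```python
import ipaddress
from urllib.parse import urlparse


def handcrafted_url_features(url):
    """Compute a few illustrative hand-crafted features from a URL string."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "has_at": "@" in url,
        "has_hash": "#" in url,
        "has_dash": "-" in url,  # ASCII hyphen; an assumption for the dash symbol
        "host_is_ip": host_is_ip,
    }


# Hypothetical URL using a private IP address as hostname.
features = handcrafted_url_features("http://192.168.10.5/login-update/index.html")
```

Each new feature of this kind has to be designed, implemented, and maintained by hand, which is the maintenance cost the paragraph above refers to.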

This paper provides an efficient solution for phishing detection that extracts features from a website's URL and HTML source code. Specifically, we propose a hybrid feature set including URL character sequence features that require no expert knowledge, various hyperlink information, and plaintext and noisy HTML data-based features from the HTML source code. These features are then used to create the feature vector required for training the proposed approach with the XGBoost classifier. Extensive experiments show that the proposed anti-phishing approach attains competitive performance on a real dataset in terms of different evaluation statistics.

Our anti-phishing approach has been designed to meet the following requirements.

High detection efficiency: To provide high detection efficiency, incorrect classification of benign sites as phishing (false-positive) should be minimal and correct classification of phishing sites (true-positive) should be high.

Real-time detection: The prediction of the phishing detection approach must be provided before exposing the user's personal information on the phishing website.

Target independent: Because the features are extracted from both the URL and the HTML, the proposed approach can detect new phishing websites targeting any benign website (zero-day attack).

Third-party independent: The feature set defined in our work is lightweight and client-side adaptable, and it does not rely on third-party services such as blacklists/whitelists, Domain Name System (DNS) records, WHOIS records (domain age), search engine indexing, network traffic measures, etc. Though third-party services may raise the effectiveness of the detection approach, they might misclassify a benign website if it is newly registered. Furthermore, the DNS database and domain age record may be poisoned and lead to false negative results (phishing classified as benign).

Hence, a light-weight technique is needed for phishing websites detection adaptable at client side. The major contributions in this paper are itemized as follows.

We propose a phishing detection approach that extracts efficient features from the URL and HTML of the given webpage without relying on third-party services. Thus, it can be adapted to the client side and provides better privacy.

We propose eight novel features, including URL character sequence features (F1), textual content character-level features (F2), and various hyperlink features (F3, F4, F5, F6, F7, and F14), along with seven existing features adopted from the literature.

We conducted extensive experiments using various machine learning algorithms to measure the efficiency of the proposed features. Evaluation results show that the proposed approach precisely identifies legitimate websites, as it has a high true negative rate and a very low false positive rate.

We release a real phishing webpage detection dataset to be used by other researchers on this topic.

The rest of this paper is structured as follows: The "Related work" section first reviews the related works on phishing detection. Then the "Proposed approach" section presents an overview of our proposed solution and describes the proposed feature set used to train the machine learning algorithms. The "Experiments and result analysis" section introduces extensive experiments, including the experimental dataset and results evaluation. Furthermore, the "Discussion and limitation" section contains a discussion and limitations of the proposed approach. Finally, the "Conclusion" section concludes the paper and discusses future work.

Related work

This section provides an overview of the phishing detection techniques proposed in the literature. Phishing detection methods are divided into two categories: expanding user awareness to distinguish the characteristics of phishing and benign webpages 14 , and using additional software. Software-based techniques are further categorized into list-based detection and machine learning-based detection. However, the problem of phishing is so sophisticated that there is no definitive solution that efficiently counters all threats; thus, multiple techniques are often dedicated to restraining particular phishing offenses.

List-based detection

List-based phishing detection methods use either a whitelist or a blacklist-based technique. A blacklist contains a list of suspicious domains, URLs, and IP addresses, which are used to validate whether a URL is fraudulent. Conversely, a whitelist is a list of legitimate domains, URLs, and IP addresses used to validate a suspected URL. Wang et al. 15 , Jain and Gupta 5 and Han et al. 16 use whitelist-based methods for the detection of suspected URLs. Blacklist-based methods are widely used in openly available anti-phishing toolbars, such as Google Safe Browsing, which maintains a blacklist of URLs and provides warnings to users once a URL is considered phishing. Prakash et al. 17 proposed a technique to predict phishing URLs called Phishnet. In this technique, phishing URLs are identified from the existing blacklisted URLs using the directory structure, equivalent IP address, and brand name. Felegyhazi et al. 18 developed a method that compares the domain name and name server information of new suspicious URLs to the information of blacklisted URLs for the classification process. Sheng et al. 19 demonstrated that a forged domain was added to the blacklist after a considerable amount of time, and approximately 50–80% of the forged domains were appended after the attack was carried out. Since thousands of deceptive websites are launched every day, the blacklist needs to be updated periodically from its source. Thus, machine learning-based detection techniques are more efficient in dealing with phishing offenses.
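To illustrate why list-based checks miss URLs with minor changes, the following minimal sketch assumes a deliberately naive blacklist that performs exact string matching. The blacklist contents and URLs are hypothetical (they use the reserved `.example` domain), and production blacklist services may normalize or match URLs differently.

```python
# Hypothetical blacklist holding one known phishing URL.
BLACKLIST = {"http://paypa1-login.example/verify"}


def is_blacklisted(url):
    """Exact-match lookup: flags only URLs that appear verbatim in the list."""
    return url in BLACKLIST


known = is_blacklisted("http://paypa1-login.example/verify")
variant = is_blacklisted("http://paypa1-login.example/verify?id=1")
```

Here `known` is flagged, while `variant`, which differs only by an added query string, passes the check unnoticed.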

Machine learning-based detection

Data mining techniques have provided outstanding performance in many applications, e.g., data security and privacy 20 , game theory 21 , blockchain systems 22 , healthcare 23 , etc. With the recent development of phishing detection methods, various machine learning-based techniques have also been employed 6 , 9 , 10 , 13 to investigate the legitimacy of websites. The effectiveness of these methods relies on feature collection, training data, and the classification algorithm. The features are extracted from different sources, e.g., URL, webpage content, third-party services, etc. However, some of the heuristic features are hard to access and time-consuming to extract, so some machine learning approaches demand high computation to obtain them.

Jain and Gupta 24 proposed an anti-phishing approach that extracts features from the URL and source code of the webpage and does not rely on any third-party services. Although the approach attained high accuracy in detecting phishing webpages, it used a limited dataset (2141 phishing and 1918 legitimate webpages). The same authors 9 present a phishing detection method that identifies phishing attacks by analyzing the hyperlinks extracted from the HTML of the webpage. The method is a client-side and language-independent solution. However, it depends entirely on the HTML of the webpage and may incorrectly classify phishing webpages if the attacker changes all webpage resource references (i.e., JavaScript, CSS, images, etc.). Rao and Pais 25 proposed a two-level anti-phishing technique called BlackPhish. At the first level, a blacklist of signatures is created using visual similarity-based features (i.e., file names, paths, and screenshots) rather than a blacklist of URLs. At the second level, heuristic features are extracted from the URL and HTML to identify the phishing websites that pass the first-level filter. As a consequence, legitimate websites always undergo two-level filtering. In some studies 26 , the authors used a search engine-based mechanism as first-level authentication of the webpage. In the second-level authentication, various hyperlinks within the HTML of the website are processed for phishing website detection. Search-based approaches assume that a genuine website appears in the top search results. Although the use of search engine-based techniques increases the number of legitimate websites correctly identified as legitimate, it also increases the number of legitimate websites incorrectly identified as phishing when newly created authentic websites are not found in the top results of the search engine.

In a recent study, Rao et al. 27 proposed a new phishing website detection method with word embeddings extracted from the plain text and domain-specific text of the HTML source code. They implemented different word embeddings to evaluate their model using ensemble and multimodal techniques. However, the method is entirely dependent on plain text and domain-specific text, and may fail when the text is replaced with images. Some researchers have tried to identify phishing attacks by extracting different hyperlink relationships from webpages. Guo et al. 28 proposed a phishing webpage detection approach called HinPhish. The approach establishes a heterogeneous information network (HIN) based on domain nodes and loading resource nodes and establishes three relationships between the four hyperlinks: external link, empty link, internal link and relative link. Then, they applied an authority ranking algorithm to calculate the effect of different relationships and obtain a quantitative score for each node.

In the work of Sahingoz et al. 6 , a distributed representation of the words within a specific URL is adopted, and then seven different machine learning classifiers are employed to identify whether a suspicious URL is a phishing website. Rao et al. 13 proposed an anti-phishing technique called CatchPhish. They extracted hand-crafted and Term Frequency-Inverse Document Frequency (TF-IDF) features from URLs, then trained a classifier on the features using the random forest algorithm. Although the above methods have shown satisfactory performance, they suffer from the following restrictions: (1) inability to handle unobserved characters, because URLs usually contain meaningless and unknown words that are not in the training set; (2) they do not consider the content of the website. Accordingly, some URLs that are distinct from others but imitate legitimate sites may not be identified based on the URL string. Because these works are based only on URL features, they are not sufficient to detect phishing websites. In contrast, our approach utilizes three different types of features to detect phishing websites more efficiently. Specifically, we propose a hybrid feature set consisting of URL character sequence, various hyperlink information, and textual content-based features.

Deep learning methods, e.g., Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Recurrent Convolutional Neural Networks (RCNN), have been used for phishing detection because of the success these techniques have attained in Natural Language Processing (NLP). However, deep learning methods are not employed much in phishing detection because of their long training time. Aljofey et al. 3 proposed a phishing detection approach with a character-level convolutional neural network based on the URL. The approach was compared against various machine and deep learning algorithms, and against different types of features such as TF-IDF characters, count vectors, and manually crafted features. Le et al. 29 provided the URLNet method to detect phishing webpages from the URL. They extract character-level and word-level features from URL strings and employ CNN networks for training and testing. Chatterjee and Namin 30 introduced a phishing detection technique based on deep reinforcement learning to identify phishing URLs. They used their model on a balanced, labeled dataset of benign and phishing URLs, extracting 14 hand-crafted features from the given URLs to train the proposed model. In recent studies, Xiao et al. 31 proposed a phishing website detection approach named CNN–MHSA. A CNN network is applied to extract character features from URLs. Meanwhile, a multi-head self-attention (MHSA) mechanism is employed to calculate the corresponding weights for the features learned by the CNN. Zheng et al. 32 proposed a new Highway Deep Pyramid Neural Network (HDP-CNN), a deep convolutional network that integrates both character-level and word-level embedding representations to identify whether a given URL is phishing or legitimate. Although the above approaches have shown valuable performance, they might misclassify phishing websites hosted on compromised servers, since the features are extracted only from the URL of the website.

The features extracted in some previous studies are based on manual work and require additional effort, since these features need to be reset according to the dataset, which may lead to overfitting of anti-phishing solutions. Motivated by the above-mentioned studies, we propose our approach, which extracts character sequence features from the URL without manual intervention. Moreover, our approach employs the noisy data of the HTML, the plaintext, and the hyperlink information of the website, with the benefit of identifying new phishing websites. Table 1 presents a detailed comparison of existing machine learning-based phishing detection approaches.

Proposed approach

Our approach extracts and analyzes different features of suspected webpages for effective identification of large-scale phishing offenses. The main contribution of this paper is the combined use of these feature sets. To improve the detection accuracy of phishing webpages, we have proposed eight new features. Our proposed features determine the relationship between the URL of the webpage and the webpage content.

System architecture

The overall architecture of the proposed approach is divided into three phases. In the first phase, the HTML source code is crawled and all the essential features are extracted. The second phase applies feature vectorization to generate a feature vector for each webpage. The third phase identifies whether the given webpage is phishing. Figure 1 shows the system structure of the proposed approach. Details of each phase are described as follows.

General architecture of the proposed approach.

Feature generation

The features are generated in this component. Our features are based on the URL and HTML source code of the webpage. A Document Object Model (DOM) tree of the webpage is used to extract the hyperlink and textual content features automatically using a web crawler. The features of our approach are categorized into four groups, as depicted in Table 2 . In particular, features F1–F7 and F14 are new and proposed by us; features F8–F13 and F15 are taken from other approaches 9 , 11 , 12 , 24 , 33 , but we adjusted them for better results. Moreover, the observational method and strategy regarding the interpretation of these features are applied differently in our approach. A detailed explanation of the proposed features is provided in the feature extraction section of this paper.

Feature vectorization

After the features are extracted, we apply feature vectorization to generate a feature vector for each webpage and create a labeled dataset. We integrate URL character sequence features with textual content TF-IDF features and hyperlink information features to create the feature vector required for training the proposed approach. The hyperlink features combine into a 13-dimensional feature vector $F_{H} = \langle f_{3}, f_{4}, f_{5}, \ldots, f_{15} \rangle$ , and the URL character sequence features combine into a 200-dimensional feature vector $F_{U} = \langle c_{1}, c_{2}, c_{3}, \ldots, c_{200} \rangle$ , since we set a fixed URL length of 200. If the URL length is greater than 200, the additional part is ignored; otherwise, the remainder of the URL string is padded with 0. The setting of this value depends on the distribution of URL lengths within our dataset. We noticed that most URL lengths are less than 200; when a vector is too long, it may contain useless information, whereas when it is too short, it may contain insufficient features. The character-level TF-IDF features combine into a $D$ -dimensional feature vector $F_{T} = \langle t_{1}, t_{2}, t_{3}, \ldots, t_{D} \rangle$ , where $D$ is the size of the dictionary computed from the textual content corpus. In our experimental analysis, the dictionary size was $D$ = 20,332, and it increases as the corpus grows. The three feature vectors are combined to generate the final feature vector $F_{V} = F_{T} \cup F_{U} \cup F_{H} = \langle t_{1}, t_{2}, \ldots, t_{D}, c_{1}, c_{2}, \ldots, c_{200}, f_{3}, f_{4}, f_{5}, \ldots, f_{15} \rangle$ , which is fed as input to the machine learning algorithms to classify the website.
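The combination step can be sketched as a simple concatenation. This is a minimal illustration with plain Python lists, assuming the order $F_{T}$, $F_{U}$, $F_{H}$ given in the text; the function names and the toy values are hypothetical, and a real pipeline would more likely use sparse matrices for the TF-IDF part.

```python
URL_LEN = 200  # fixed URL length stated in the text


def pad_or_truncate(seq, length=URL_LEN):
    """Cut the sequence to `length` items, or fill the remainder with 0."""
    seq = list(seq[:length])
    return seq + [0] * (length - len(seq))


def build_feature_vector(f_t, f_u, f_h):
    """Concatenate TF-IDF, URL character, and hyperlink features in that order."""
    return list(f_t) + pad_or_truncate(f_u) + list(f_h)


# Toy inputs for illustration only.
f_t = [0.0, 0.12, 0.0, 0.31]  # TF-IDF vector with a tiny dictionary (D = 4)
f_u = [8, 20, 20, 16]         # token ids of a very short URL
f_h = [0.0] * 13              # hyperlink features f3..f15
f_v = build_feature_vector(f_t, f_u, f_h)
```

With a dictionary of size D, the resulting vector has D + 200 + 13 entries; in this toy case that is 217.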

Detection module

The detection phase builds a strong classifier using a boosting method, the XGBoost classifier. Boosting integrates many weak and relatively accurate classifiers to build a strong and therefore robust classifier for detecting phishing offenses. Boosting also helps to combine diverse features, resulting in improved classification performance 34 . Here, the XGBoost classifier is employed on the integrated feature sets of URL character sequence ${F}_{U}$ , various hyperlink information ${F}_{H}$ , login form features ${F}_{L}$ , and textual content-based features ${F}_{T}$ to build a strong classifier for phishing detection. In the training phase, the XGBoost classifier is trained using the feature vector $({F}_{U}\cup {F}_{H} \cup {F}_{L} \cup {F}_{T})$ collected from each record in the training dataset. In the testing phase, the classifier detects whether a particular website is malicious or not. The detailed description is shown in Fig. 2 .

Phishing detection algorithm.

Features extraction

Because of the limitations of the search engine and third-party methods discussed in the literature, our approach extracts its features on the client side. We have introduced eleven hyperlink features (F3–F13), two login form features (F14 and F15), character-level TF-IDF features (F2), and URL character sequence features (F1). All these features are discussed in the following subsections.

URL character sequence features (F1)

URL stands for Uniform Resource Locator. It provides the location of resources on the web such as images, files, hypertext, video, etc. Each URL starts with a protocol (http, https, or ftp) used to access the requested resource. In this part, we extract character sequence features from the URL. We employ the method used in 35 to process the URL at the character level, because more information is contained at the character level. Phishers imitate the URLs of legitimate websites by changing characters in ways that are hard to notice, e.g., “ www.icbc.com ” as “ www.1cbc.com ”. Character-level URL processing is a solution to the out-of-vocabulary problem. Character-level sequences capture substantial information from specific groups of characters that appear together, which could be a symptom of phishing. In general, a URL is a string of characters or words in which some words have little semantic meaning. Character sequences help find this sensitive information and improve the efficiency of phishing URL detection. During the learning task, machine learning techniques can be applied directly to the extracted character sequence features without expert intervention. The main steps of character sequence generation are: preparing the character vocabulary; creating a tokenizer object using the Keras preprocessing package ( https://Keras.io ) to process URLs at the character level, and adding a “UNK” token to the vocabulary after the maximum value of the character dictionary; transforming the text of URLs into sequences of tokens; and padding the sequences to ensure equal-length vectors. The description of URL feature extraction is shown in Algorithm 1.
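The steps above can be sketched without any deep learning library. The following is a minimal pure-Python stand-in for the Keras tokenizer mentioned in the text, not the authors' implementation: the alphabet, the lowercasing, the placement of padding zeros at the end, and all names are assumptions made for illustration.

```python
import string

MAX_LEN = 200  # fixed URL length stated in the paper

# Assumed character vocabulary; index 0 is reserved for padding.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNK_ID = max(CHAR_TO_ID.values()) + 1  # "UNK" goes after the largest char id


def url_to_sequence(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in url.lower()]
    ids = ids[:max_len]                      # ignore anything beyond max_len
    return ids + [0] * (max_len - len(ids))  # pad the remainder with 0


legit = url_to_sequence("www.icbc.com")
spoof = url_to_sequence("www.1cbc.com")
# Positions at which the two encoded URLs differ.
diff = [i for i, (a, b) in enumerate(zip(legit, spoof)) if a != b]
```

The two encoded URLs differ only at the position of the swapped character, which is the kind of signal a character-level model can pick up while a word-level vocabulary would treat both hostnames as unknown words.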

HTML features

The webpage source code is the programming behind any webpage or software. In the case of websites, this code can be viewed by anyone using various tools, even in the web browser itself. In this section, we extract the textual and hyperlink features existing in the HTML source code of the webpage.

Textual content-based features (F2)

TF-IDF stands for Term Frequency-Inverse Document Frequency. The TF-IDF weight is a statistical measure of the importance of a term in a corpus of documents 36. TF-IDF vectors can be created at various levels of input tokens (words, characters, n-grams) 37. The TF-IDF technique has been used in many approaches to detect phishing webpages by inspecting URLs 13, to obtain indirectly associated links 38, to identify the target website 11, and to check the validity of a suspected website 39. Although the TF-IDF technique extracts outstanding keywords from the text content of a webpage, it has some limitations. One limitation is that it fails when the extracted keywords are meaningless, misspelled, skipped, or replaced with images. Because our approach extracts both plaintext and noisy data (i.e., attribute values of div, h1, h2, body, and form tags) from the given webpage using the BeautifulSoup parser, we apply the character-level TF-IDF technique with a maximum of 25,000 features. To obtain valid textual information, extra portions of the webpage (i.e., JavaScript code, CSS code, punctuation symbols, and numbers) are removed using regular expressions and Natural Language Processing packages ( http://www.nltk.org/nltk_data/ ) for sentence segmentation, word tokenization, text lemmatization, and stemming, as shown in Fig. 3.

The process of generating text features.
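To make the character-level TF-IDF weighting concrete, the following is a small standard-library sketch. It is not the authors' pipeline: the function name `char_tfidf` is hypothetical, single characters are used as tokens, and the smoothed IDF formula with L2 normalisation is an assumed convention (the same one scikit-learn's `TfidfVectorizer` uses by default). Only the `max_features` limit of 25,000 comes from the text.

```python
import math
from collections import Counter


def char_tfidf(docs, max_features=25000):
    """Return (vocabulary, vectors) of L2-normalised character-level TF-IDF weights."""
    counts = [Counter(doc) for doc in docs]
    corpus_counts = Counter()
    for c in counts:
        corpus_counts.update(c)
    # Keep only the most frequent characters, up to max_features of them.
    vocab = sorted(ch for ch, _ in corpus_counts.most_common(max_features))
    n_docs = len(docs)
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {
        ch: math.log((1 + n_docs) / (1 + sum(1 for c in counts if ch in c))) + 1.0
        for ch in vocab
    }
    vectors = []
    for c in counts:
        row = [c[ch] * idf[ch] for ch in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors


vocab, vectors = char_tfidf(["aab", "abc"])
```

In this toy corpus the character “c” appears in only one document, so it receives a larger IDF than “a” or “b”, which appear in both.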

Phishers usually mimic the textual content of the target website to trick the user. Moreover, phishers may alter or override some texts (i.e., title, copyright, metadata, etc.) and tags in phishing webpages to avoid revealing the actual identity of the webpage. However, tag attributes stay the same to preserve the visual similarity between the phishing site and the targeted site, using the same style and theme as the benign webpage. Therefore, it is necessary to extract the text features (plaintext and noisy part of HTML) of the webpage. The basis of this step is to extract a vector representation of the text and the effective webpage content. A TF-IDF object is employed to vectorize the text of the webpage. The detailed process of the text vector generation algorithm is as follows.

Script, CSS, img, and anchor files (F3, F4, F5, and F6)

External JavaScript or external Cascading Style Sheets (CSS) files are separate files that can be accessed by creating a link within the head section of a webpage. JavaScript, CSS, image, and similar files may carry malicious code that runs when a webpage loads or when a specific link is clicked. Moreover, the content of a phishing website tends to look more fragile and unprofessional as the number of hyperlinks referring to a different domain name increases. We can use <img> and <script> tags that have the "src" attribute to extract the images and external JavaScript files in the website. Similarly, CSS and anchor files are found in the "href" attribute of <link> and <a> tags. In Eqs. (1–4), we calculate the rate of img and script tags that have the "src" attribute, and of link and anchor tags that have the "href" attribute, to the total hyperlinks available in a webpage. These tags usually link to the image, JavaScript, anchor, and CSS files required for a website

where ${\text{F}}_{\text{Script}\_\text{files}}$ , ${\text{F}}_{\text{CSS}\_\text{files}}$ , ${\text{F}}_{\text{Img}\_\text{files}}$ , ${\text{F}}_{\text{a}\_\text{files}}$ are the numbers of Javascript, CSS, image, anchor files existing in a webpage, and ${\text{F}}_{\text{Total}}$ is the total hyperlinks available in a webpage.
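A minimal sketch of how these four rates could be computed with Python's standard `html.parser` follows. The names `LinkCounter` and `file_ratios` are hypothetical, and two things are assumed rather than taken from the paper: that ${\text{F}}_{\text{Total}}$ is the sum of the four counts, and that F3 to F6 follow the order of the heading (script, CSS, img, anchor).

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count script/img tags that carry "src" and link/a tags that carry "href"."""

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "link": 0, "img": 0, "a": 0}

    def handle_starttag(self, tag, attrs):
        if tag not in self.counts:
            return
        wanted = "src" if tag in ("script", "img") else "href"
        if dict(attrs).get(wanted) is not None:
            self.counts[tag] += 1


def file_ratios(html):
    """Return the assumed F3..F6 rates: each count divided by the total hyperlinks."""
    parser = LinkCounter()
    parser.feed(html)
    c = parser.counts
    total = sum(c.values())  # assumption: F_Total is the sum of the four counts
    if total == 0:
        return {"F3": 0.0, "F4": 0.0, "F5": 0.0, "F6": 0.0}
    return {
        "F3": c["script"] / total,  # external JavaScript files
        "F4": c["link"] / total,    # CSS files (every <link href> is treated as CSS here)
        "F5": c["img"] / total,     # image files
        "F6": c["a"] / total,       # anchor files
    }


page = (
    '<head><script src="a.js"></script><link href="s.css" rel="stylesheet"></head>'
    '<body><img src="x.png"><a href="/home">home</a><a>no link</a></body>'
)
ratios = file_ratios(page)
```

The anchor without an "href" attribute is ignored, so the page above has four hyperlinks and each rate is 0.25.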

Empty hyperlinks (F7 and F8)

In an empty hyperlink, the "href" or "src" attribute of an anchor, link, script, or img tag does not contain any URL. An empty link returns the user to the same webpage when clicked. A benign website contains many webpages; to make a phishing website behave like a benign one, the scammer leaves hyperlinks without values so that they look active on the phishing website. For example, the HTML codes <a href="#">, <a href="#content">, and <a href="javascript:void(0);"> are used to create null hyperlinks 24. To establish the empty hyperlink features, we define the rate of empty hyperlinks to the total number of hyperlinks available in a webpage, and the rate of anchor tags without an "href" attribute to the total number of hyperlinks in a webpage. The following formulas are used to compute the empty hyperlink features

where ${\text{F}}_{\text{a}\_\text{null}}$ and ${\text{F}}_{\text{null}}$ are the numbers of anchor tags without href attribute, and null hyperlinks in a webpage.
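The two rates described in this subsection can be sketched as below. The rule in `is_null` (empty value, a value starting with "#", or a "javascript:" value) generalises the three examples given in the text and is an assumption, as is counting as total hyperlinks only those tags that carry the attribute. All names are hypothetical, and the result keys are descriptive because the text does not say which rate is F7 and which is F8.

```python
from html.parser import HTMLParser

# Attribute that holds the hyperlink for each tag considered in the text.
LINK_ATTR = {"a": "href", "link": "href", "script": "src", "img": "src"}


def is_null(value):
    """Assumed definition of a null hyperlink, generalising the examples in the text."""
    v = value.strip().lower()
    return v == "" or v.startswith("#") or v.startswith("javascript:")


class EmptyLinkCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.total = 0   # tags that carry href/src
        self.null = 0    # of those, the ones whose value is a null hyperlink
        self.a_null = 0  # anchor tags with no href attribute at all

    def handle_starttag(self, tag, attrs):
        if tag not in LINK_ATTR:
            return
        value = dict(attrs).get(LINK_ATTR[tag])
        if value is None:
            if tag == "a":
                self.a_null += 1
            return
        self.total += 1
        if is_null(value):
            self.null += 1


def empty_link_features(html):
    counter = EmptyLinkCounter()
    counter.feed(html)
    if counter.total == 0:
        return {"null_rate": 0.0, "a_null_rate": 0.0}
    return {
        "null_rate": counter.null / counter.total,
        "a_null_rate": counter.a_null / counter.total,
    }


features = empty_link_features(
    '<a href="#">x</a><a href="javascript:void(0);">y</a>'
    '<a>z</a><a href="/p">p</a><img src="i.png">'
)
```

For this snippet, two of the four hyperlinks are null and one anchor has no "href", giving rates of 0.5 and 0.25.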

Total hyperlinks feature (F9)

Phishing websites usually contain few pages compared to benign websites. Furthermore, a phishing webpage sometimes contains no hyperlinks at all, because phishers usually create only a login page. Equation (7) computes the number of hyperlinks in a webpage by extracting the hyperlinks from the anchor, link, script, and img tags in the HTML source code.

Internal and external hyperlinks (F10, F11, and F12)

In an external hyperlink, the base domain name differs from the website domain name; in an internal hyperlink, the two are the same. Phishing websites may contain many external hyperlinks that point to the target websites, because cybercriminals commonly copy the HTML code of the targeted legitimate websites to create their phishing websites. Most hyperlinks in a benign website share the same base domain name, whereas many hyperlinks in a phishing site may contain the domain of the corresponding benign website. In our approach, the internal and external hyperlinks are extracted from the "src" attribute of img, script, and frame tags, the "action" attribute of the form tag, and the "href" attribute of the anchor and link tags. We compute the rate of internal hyperlinks to the total links available in a webpage (Eq. 8) to establish the internal hyperlink feature, and the rate of external hyperlinks to the total links (Eq. 9) to set the external hyperlink feature. Moreover, to set the external/internal hyperlink feature, we compute the rate of external hyperlinks to internal hyperlinks (Eq. 10). Some previous studies 5, 9, 24 that used these features for classification applied a fixed threshold to detect suspected websites. For example, a website is flagged as phishing if the rate of external hyperlinks to total links is greater than 0.5. However, fixing a specific number as a detection parameter may cause classification errors.

where ${\text{F}}_{\text{Internal}}$ , ${\text{F}}_{\text{External}}$ , and ${\text{F}}_{\text{Total}}$ are the numbers of internal, external, and total hyperlinks in a website, respectively.
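As an illustration of the three rates, the sketch below classifies a list of already extracted hyperlinks with the standard `urllib.parse` module. The helper `base_domain` naively takes the last two labels of the host name, which is an assumption that breaks for suffixes such as ".co.uk"; a public-suffix list would be needed in practice. The function names, the result keys, and the handling of a page with no internal links are also assumptions.

```python
from urllib.parse import urljoin, urlparse


def base_domain(url):
    """Naive base domain: the last two labels of the host name (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def link_ratios(page_url, links):
    """Rates of internal and external links, and the external-to-internal ratio."""
    page_domain = base_domain(page_url)
    resolved = [urljoin(page_url, link) for link in links]  # make relative links absolute
    total = len(resolved)
    internal = sum(1 for url in resolved if base_domain(url) == page_domain)
    external = total - internal
    return {
        "internal_rate": internal / total if total else 0.0,
        "external_rate": external / total if total else 0.0,
        # Assumption: when there are no internal links, divide by 1 instead of 0.
        "external_to_internal": external / max(internal, 1),
    }


ratios = link_ratios(
    "https://www.example.com/login",
    ["/about", "https://cdn.example.com/a.js", "https://www.paypal.com/signin", "img/logo.png"],
)
```

Three of the four links resolve to the page's own base domain, so the internal rate is 0.75, the external rate is 0.25, and the external-to-internal ratio is one third.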

Error in hyperlinks (F13)

Phishers sometimes add dead or broken hyperlinks to the fake website. In the hyperlink error feature, we check whether each hyperlink in the website is a valid URL. We do not consider the 403 and 404 error response codes of hyperlinks, because fetching the response code of every link over the internet is time-consuming. Hyperlink error is defined by dividing the total number of invalid links by the total links, as represented in Eq. (11)

where ${\text{F}}_{\text{Error}}$ is the total invalid hyperlinks.

Login form features (F14 and F15)

In a fraudulent website, the common trick for acquiring the user's personal information is to include a login form. In a benign webpage, the action attribute of the login form commonly includes a hyperlink that has the same base domain as the one that appears in the browser address bar 24. However, in phishing websites, the form action attribute includes a URL that has a different base domain (external link), an empty link, or an invalid URL (Eq. 13). The suspicious form feature (Eq. 14) is defined by dividing the total number of suspicious forms S by the total forms available in a webpage (Eq. 12)

where ${\text{F}}_{\text{S}}$ and ${\text{L}}_{\text{Total}}$ are the number of suspicious forms and total forms present in a webpage.
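The suspicious form rate can be sketched as follows. This is an illustrative reading of the description, not the authors' code: a form is flagged when its action is missing, empty, "#", or resolves to a different base domain than the page. Treating a missing action as empty, the naive two-label `base_domain` helper, and all names are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def base_domain(url):
    """Naive base domain: the last two labels of the host name (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


class FormCollector(HTMLParser):
    """Collect the action attribute (or None) of every form tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action"))


def suspicious_form_rate(page_url, html):
    collector = FormCollector()
    collector.feed(html)

    def suspicious(action):
        if action is None or action.strip() in ("", "#"):
            return True  # empty link (a missing action is treated as empty here)
        return base_domain(urljoin(page_url, action)) != base_domain(page_url)

    total = len(collector.actions)
    flagged = sum(1 for action in collector.actions if suspicious(action))
    return flagged / total if total else 0.0


rate = suspicious_form_rate(
    "https://www.example.com/login",
    '<form action="/auth"></form><form action="https://evil.test/steal"></form>'
    '<form action="#"></form><form></form>',
)
```

Only the first form posts back to the page's own domain, so three of the four forms are flagged and the rate is 0.75.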

Figure 4 shows a comparison between benign and phishing hyperlink features based on the average occurrence rate per feature within each website in our dataset. From the figure, we see that the ratio of external to internal hyperlinks and the ratio of null hyperlinks are higher in phishing websites than in benign websites, whereas benign sites contain more anchor files, internal hyperlinks, and total hyperlinks.

Distribution of hyperlink-based features in our data.

Classification algorithms

To measure the effectiveness of the proposed features, we trained our approach with various machine learning classifiers: eXtreme Gradient Boosting (XGBoost), Random Forest, Logistic Regression, Naïve Bayes, and an ensemble of Random Forest and AdaBoost classifiers. The main aim of comparing different classifiers is to find the classifier that best fits our feature set. The Scikit-learn package (scikit-learn.org) is used to apply the different machine learning classifiers, and Python is employed for feature extraction. From the empirical results, we noticed that XGBoost outperformed the other classifiers. XGBoost is an ensemble classifier that combines weak learners into a robust one and is well suited to our proposed feature set; thus, it achieves high performance.

XGBoost (extreme gradient boosting) is a scalable machine learning system for tree boosting proposed by Chen and Guestrin 40. Suppose there are $N$ websites in the dataset $\left\{ {\left( {x_{i} ,y_{i} } \right)|i = 1,2,...,N} \right\}$ , where $x_{i} \in R^{d}$ is the vector of extracted features associated with the $i$-th website and $y_{i} \in \left\{ {0,1} \right\}$ is the class label, such that $y_{i} = 1$ if and only if the website is labelled as a phishing website. The final output $f_{K} \left( x \right)$ of the model is as follows 41, 46:

where $l$ is the training loss function and $\Omega \left( {G_{k}} \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{t = 1}^{T} {\omega_{t}^{2} }$ is the regularization term. Since XGBoost uses additive training and all previous $k-1$ base learners are fixed, we assume here that we are in step $k$, which optimizes the function $f_{k} \left( x \right)$. $T$ is the number of leaf nodes in the base learner $G_{k}$, $\gamma$ is the complexity of each leaf, $\lambda$ is a parameter that scales the penalty, and $\omega_{t}$ is the output value at each final leaf node. If we apply the Taylor expansion to expand the loss function at $f_{k-1} \left( x \right)$, we have 41:

where $g_{i} = \frac{{\partial l\left( {y_{i} ,f_{k - 1} \left( {x_{i} } \right)} \right)}}{{\partial f_{k - 1} \left( {x_{i} } \right)}}$ and $h_{i} = \frac{{\partial^{2} l\left( {y_{i} ,f_{k - 1} \left( {x_{i} } \right)} \right)}}{{\partial f_{k - 1} \left( {x_{i} } \right)^{2} }}$ are, respectively, the first and second derivatives of the loss function.
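For reference, the standard XGBoost formulation from Chen and Guestrin can be written in the notation used here. The three expressions below are that textbook form (additive model, regularized objective at step $k$, and its second-order Taylor approximation); it is assumed, not confirmed, that they correspond to the numbered equations of this section.

$$f_{K}(x) = \sum_{k=1}^{K} G_{k}(x)$$

$$\mathrm{Obj}^{(k)} = \sum_{i=1}^{N} l\left(y_{i},\, f_{k-1}(x_{i}) + G_{k}(x_{i})\right) + \Omega\left(G_{k}\right)$$

$$\mathrm{Obj}^{(k)} \approx \sum_{i=1}^{N} \left[ l\left(y_{i},\, f_{k-1}(x_{i})\right) + g_{i}\, G_{k}(x_{i}) + \frac{1}{2} h_{i}\, G_{k}^{2}(x_{i}) \right] + \Omega\left(G_{k}\right)$$

Because $l\left(y_{i}, f_{k-1}(x_{i})\right)$ does not depend on $G_{k}$, only the terms in $g_{i}$, $h_{i}$, and $\Omega\left(G_{k}\right)$ matter when the new base learner is fitted.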

The XGBoost classifier is an ensemble classifier that combines weak learners into a robust one and is well suited to our proposed feature set for predicting phishing websites; thus, it achieves high performance. Moreover, XGBoost provides a number of advantages, including: (i) the ability to handle missing values within the training set, (ii) the ability to handle huge datasets that do not fit into memory, and (iii) faster computing through the use of multiple CPU cores. Websites are classified into two possible categories, phishing and benign, using a binary classifier. When a user requests a new site, the trained XGBoost classifier determines the validity of the webpage from the generated feature vector.

Experiments and result analysis

In this section, we describe the training and testing dataset, performance metrics, implementation details, and outcomes of our approach. The proposed features described in the “Features extraction” section are used to build a binary classifier, which classifies phishing and benign websites accurately.

We collected the dataset from two sources for our experimental implementation. The benign webpages were collected in February 2020 from Stuff Gate 42, whereas the phishing webpages were collected from PhishTank 43 and had been validated between August 2016 and April 2020. Our dataset consists of 60,252 webpages and their HTML source codes, of which 27,280 are phishing and 32,972 are benign. Table 3 provides the distribution of the benign and phishing instances. The data are organized into two groups: D1 is our dataset, and D2 is the dataset used in existing literature 6. A database management system (i.e., pgAdmin) was used with Python to import and pre-process the data. The datasets were randomly split in an 80:20 ratio for training and testing, respectively.

Performance metrics

To measure the performance of the proposed anti-phishing approach, we used different statistical metrics such as true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), sensitivity or recall, accuracy (Acc), precision (Pre), F-Score, and AUC, which are presented in Table 4. ${N}_{B}$ and ${N}_{P}$ indicate the total number of benign and phishing websites, respectively. ${N}_{B\to B}$ is the number of benign websites correctly marked as benign, ${N}_{B\to P}$ is the number of benign websites incorrectly marked as phishing, ${N}_{P\to P}$ is the number of phishing websites correctly marked as phishing, and ${N}_{P\to B}$ is the number of phishing websites incorrectly marked as benign. The receiver operating characteristic (ROC) curve and AUC are commonly used to evaluate a binary classifier. The horizontal coordinate of the ROC curve is the FPR, which indicates the probability that a benign website is misclassified as phishing; the vertical coordinate is the TPR, which indicates the probability that a phishing website is identified as phishing.

Evaluation of features

In this section, we evaluate the performance of our proposed features (URL and HTML). We implemented different Machine Learning (ML) classifiers to evaluate the features used in our approach. For Table 5, we extracted various text features: TF-IDF word level, TF-IDF N-gram level (n-gram length between 2 and 3), TF-IDF character level, count vectors (bag-of-words), word sequence vectors, Global Vectors (GloVe) pre-trained word embedding, trained word embedding, and character sequence vectors. We then applied various classifiers: XGBoost, Random Forest, Logistic Regression, Naïve Bayes, Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks. The main intention of this experiment was to find the textual content features best suited to our data. The experimental results show that TF-IDF character-level features outperformed the other features in accuracy, precision, F-Score, recall, and AUC when using the XGBoost and DNN classifiers. Hence, we adopted the TF-IDF character-level technique to generate the text features (F2) of the webpage. Figure 5 presents the performance of textual content-based features. As shown in the figure, text features can correctly filter a large share of phishing websites and achieved an accuracy of 88.82%.

Performance of textual content features.

Table 6 shows the experimental results with hyperlink features. The empirical results show that the Random Forest classifier is superior to the other classifiers, with an accuracy of 82.27%, precision of 77.59%, F-Measure of 81.63%, recall of 86.10%, and AUC of 82.57%. The ensemble and XGBoost classifiers also attained good accuracy of 82.18% and 80.49%, respectively. Figure 6 presents the classification results of hyperlink-based features (F3–F15). As shown in the figure, hyperlink-based features can accurately classify 79.04% of benign websites and 86.10% of phishing websites.

Performance of hyperlink based features.

In Table 7, we integrated the URL and HTML (hyperlink and text) features using various classifiers to verify their complementary behavior in phishing website detection. The empirical results show that the LR classifier has sufficient accuracy, precision, F-Score, AUC, and recall on the HTML features. In contrast, the NB classifier has good accuracy, precision, F-Score, AUC, and recall when all the features are combined. The RF and ensemble classifiers achieved high accuracy, recall, F-Score, and AUC on URL-based features. The XGBoost classifier outperformed the others, with an accuracy of 96.76%, F-Score of 96.38%, AUC of 96.58%, and recall of 94.56% when all the features are combined. We observe that URL and HTML features are both valuable in phishing detection. However, one type of feature alone is not suitable for identifying all kinds of phishing webpages and does not result in high accuracy. Thus, we combined all features to obtain a more comprehensive feature set. The results of the various classifiers on the combined feature set are also shown in Fig. 7. In Fig. 8, we compare the three feature sets in terms of accuracy, TNR, FPR, FNR, and TPR.

Test results of various classifiers with respect to combined features.

Performance of different feature combinations using XGBoost on dataset D1.

The confusion matrix is used to measure results, where each row of the matrix represents the instances in a predicted class and each column represents the instances in an actual class (or vice versa). The confusion matrix of the proposed approach is given in Table 8. Combining all kinds of features together correctly identified 5212 out of 5512 phishing webpages and 6448 out of 6539 benign webpages, and attained an accuracy of 96.76%. Our approach yields a low false-positive rate (approximately 1.39% of benign webpages incorrectly classified as phishing) and a high true-positive rate (approximately 94.56% of phishing webpages accurately classified). We also tested our feature sets (URL and HTML) on the existing dataset D2. Since dataset D2 contains only legitimate and malicious URLs, we needed to extract the HTML source code features for these URLs. The results are given in Table 9 and Fig. 9. The results show that combining all kinds of features outperformed the other feature sets, with an accuracy of 98.48%, TPR of 99.04%, and FPR of 2.09%.

Performance of the proposed approach on dataset D2.
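The D1 figures quoted above can be re-derived from the reported confusion counts (5212 of 5512 phishing and 6448 of 6539 benign webpages correctly classified). The sketch below uses the usual definitions of each metric, which are assumed to match Table 4; the function name `metrics` and its argument names are hypothetical.

```python
def metrics(n_pp, n_pb, n_bb, n_bp):
    """Standard binary metrics from confusion counts.

    n_pp: phishing marked as phishing    n_pb: phishing marked as benign
    n_bb: benign marked as benign        n_bp: benign marked as phishing
    """
    tpr = n_pp / (n_pp + n_pb)        # recall / sensitivity
    fpr = n_bp / (n_bb + n_bp)
    precision = n_pp / (n_pp + n_bp)
    accuracy = (n_pp + n_bb) / (n_pp + n_pb + n_bb + n_bp)
    f_score = 2 * precision * tpr / (precision + tpr)
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "tnr": 1 - fpr,
        "fpr": fpr,
        "fnr": 1 - tpr,
        "precision": precision,
        "f_score": f_score,
    }


# Counts for dataset D1 as reported in the text.
d1 = metrics(n_pp=5212, n_pb=5512 - 5212, n_bb=6448, n_bp=6539 - 6448)
d1_percent = {name: round(value * 100, 2) for name, value in d1.items()}
```

With these counts, the rounded percentages are 96.76 for accuracy, 94.56 for TPR, 1.39 for FPR, 98.28 for precision, and 96.38 for F-Score, which agree with the values reported for the combined feature set on D1.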

Comparison with existing approaches

In this experiment, we compare our approach with existing anti-phishing approaches. Note that we applied the works of Le et al. 29 and Aljofey et al. 3 to dataset D1 to evaluate the efficiency of the proposed approach. For the comparison with the works of Sahingoz et al. 6, Rao et al. 13, and Chatterjee and Namin 30, we evaluated our approach on the benchmark dataset D2 6, 13, 30 based on the four statistical metrics used in those papers. The comparison results are shown in Table 10. The results show that our approach performs better than the other approaches discussed in the literature, which demonstrates its efficiency in detecting phishing websites.

In Table 11, we applied the methods of Le et al. 29 and Aljofey et al. 3 to our dataset D1, and our approach outperformed the others with an accuracy of 96.76%, precision of 98.28%, and F-Score of 96.38%. It should also be mentioned that the method of Aljofey et al. achieved 97.86% recall, which is 3.3 percentage points higher than our method, whereas our approach gives a TNR that is higher by 4.97 percentage points and an FPR that is lower by 4.96 percentage points. Our approach accurately identifies legitimate websites, with a high TNR and low FPR. Some phishing detection methods achieve high recall; however, misclassifying legitimate websites is more serious than misclassifying phishing sites.

Discussion and limitations

A phishing website looks similar to its benign official counterpart, and the challenge is how to distinguish between them. This paper proposed a novel anti-phishing approach, which involves different features (URL, hyperlink, and text) that have not previously been taken into consideration. The proposed approach is a completely client-side solution. We applied these features to various machine learning algorithms and found that XGBoost attained the best performance. Our major aim is to design a real-time approach that has a high true-negative rate and a low false-positive rate. The results show that our approach correctly filtered the benign webpages, with only a small number of benign webpages incorrectly classified as phishing. In the process of phishing webpage classification, we construct the dataset by extracting the relevant and useful features from benign and phishing webpages.

A desktop machine with a Core™ i7 processor, 3.4 GHz clock speed, and 16 GB RAM is used to execute the proposed anti-phishing approach. Since Python provides excellent library support and has a sensible compile time, the proposed approach is implemented in the Python programming language. The BeautifulSoup library is employed to parse the HTML of the specified URL. The detection time is the time between entering the URL and generating the output. When the URL is entered as a parameter, the approach attempts to fetch all specified features from the URL and the HTML code of the webpage, as discussed in the feature extraction section. The URL is then classified as benign or phishing based on the values of the extracted features. The total execution time of our approach in phishing webpage detection is around 2–3 s, which is quite low and acceptable in a real-time environment. Response time depends on different factors, such as input size, internet speed, and server configuration. Using our dataset D1, we also measured the time taken for training, testing, and detection by the proposed approach (all feature combinations) for webpage classification. The results are given in Table 12.

To better understand the learning capabilities, we also present the classification error and log loss against the number of iterations performed by XGBoost. Log loss, short for logarithmic loss, is a loss function for classification that indicates the price paid for inaccurate predictions in classification problems. Figure 10 shows the logarithmic loss and the classification error of the XGBoost approach for each epoch on the training and test sets of D1. From the figure, we note that the learning algorithm converges after approximately 100 iterations.

XGBoost learning curve of logarithmic loss and classification error on dataset D1.
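The two quantities plotted in the learning curve can be computed as in the following sketch, which uses the standard binary log loss and a 0.5 decision threshold. The function names, the clipping constant `eps`, and the threshold value are assumptions for illustration rather than settings taken from the paper.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def classification_error(y_true, p_pred, threshold=0.5):
    """Share of samples whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(y_true, p_pred) if (p >= threshold) != (y == 1))
    return wrong / len(y_true)


labels = [1, 0, 1, 0]                 # 1 = phishing, 0 = benign
probabilities = [0.9, 0.2, 0.4, 0.6]  # predicted probability of phishing
loss = log_loss(labels, probabilities)
error = classification_error(labels, probabilities)
```

Log loss penalises confident wrong predictions more heavily than hesitant ones, whereas the classification error only counts how many predictions fall on the wrong side of the threshold (two of four in this toy example).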

Limitations

Although our proposed approach has attained outstanding accuracy, it has some limitations. The first limitation is that the textual features of our phishing detection approach depend on the English language. This may cause errors in classification when the suspicious webpage includes a language other than English. More than half (60.5%) of websites use English as their text language 44. However, our approach also employs the URL, the noisy part of HTML, and hyperlink-based features, which are language-independent. The second limitation is that, although the proposed approach uses URL-based features, it may fail to identify phishing websites when the phishers use embedded objects (i.e., JavaScript, images, Flash, etc.) to obscure the textual content and HTML code from anti-phishing solutions. Many attackers use a single server-side script to hide the HTML source code. Based on our experiments, we noticed that legitimate pages usually contain rich textual content features and a large number of hyperlinks (at least one hyperlink in the HTML source code). At present, some phishing webpages include malware, for example, a Trojan horse that installs itself on the user’s system when the user opens the website. Hence, the next limitation of this approach is that it cannot adequately detect attached malware, because it does not read and process content from the webpage's external files, whether they are cross-domain or not. Finally, the training time of our approach is relatively long because of the high-dimensional vector generated by the textual content features. However, the trained approach is much better than the existing baseline methods in terms of accuracy.

Conclusion and future work

Phishing website attacks are a massive challenge for researchers, and they continue to show a rising trend in recent years. Blacklist/whitelist techniques are the traditional way to alleviate such threats. However, these methods fail to detect non-blacklisted phishing websites (i.e., 0-day attacks). As an improvement, machine learning techniques are being used to increase detection efficiency and reduce the misclassification ratio. However, some of them extract features from third-party services, search engines, website traffic, etc., which are complicated and difficult to access. In this paper, we propose a machine learning-based approach which can speedily and precisely detect phishing websites using URL and HTML features of the given webpage. The proposed approach is a completely client-side solution, and does not rely on any third-party services. It uses URL character sequence features without expert intervention, and hyperlink specific features that determine the relationship between the content and the URL of a webpage. Moreover, our approach extracts TF-IDF character level features from the plaintext and noisy part of the given webpage's HTML.

A new dataset is constructed to measure the performance of the phishing detection approach, and various classification algorithms are employed. Furthermore, the performance of each category of the proposed feature set is also evaluated. According to the empirical and comparison results from the implemented classification algorithms, the XGBoost classifier with all kinds of features integrated provides the best performance. It achieved a 1.39% false-positive rate and 96.76% overall detection accuracy on our dataset, and an accuracy of 98.48% with a 2.09% false-positive rate on a benchmark dataset.

In future work, we plan to include some new features to detect phishing websites that contain malware. As we said in the “Limitations” section, our approach could not detect malware attached to a phishing webpage. Nowadays, blockchain technology is increasingly popular and appears to be an attractive target for phishing attacks, such as phishing scams on the blockchain. Blockchain is an open and distributed ledger that can effectively register transactions between receiving and sending parties, demonstrably and constantly, making it common among investors 45. Thus, detecting phishing scams in the blockchain environment is a challenge for further research and development. Moreover, detecting phishing attacks on mobile devices is another important topic in this area due to the popularity of smartphones 47, which has made them a common target of phishing offenses.

Data availability

The dataset generated during the current study is available in the Google Drive repository: https://drive.google.com/file/d/18ZZHsCeMmF9HKTaL_yd41oJ_3Fgk0gWE/view?usp=sharing .

RSA. Rsa fraud report. https://go.rsa.com/l/797543/2020-07-08/3njln/797543/48525/RSA_Fraud_Report_Q1_2020.pdf (2020) (Accessed 14 January 2021).

APWG. Phishing Attack Trends Reports, 24, November 2020. https://docs.apwg.org/reports/apwg_trends_report_q3_2020.pdf (2020) (Accessed 14 January 2021).

Aljofey, A., Jiang, Q., Qu, Q., Huang, M. & Niyigena, J.-P. An effective phishing detection model based on character level convolutional neural network from URL. Electronics 9 , 1514 (2020).

Dhamija, R., Tygar, J.D., & Hearst, M. Why phishing works. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 22–27 April 2006 , 581–590 (2006).

Jain, A. K. & Gupta, B. B. A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP J. on Info. Security. 9 , 1–11. https://doi.org/10.1186/s13635-016-0034-3 (2016).

Sahingoz, O. K., Buber, E., Demir, O. & Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019 (117), 345–357 (2019).

Cook, D. L., Gurbani, V. K., & Daniluk, M. Phishwish: A stateless phishing filter using minimal rules. in Financial Cryptography and Data Security , (ed. Gene Tsudik) 324, (Berlin, Heidelberg, Springer-Verlag, 2008).

Jain, A. K. & Gupta, B. B. A machine learning based approach for phishing detection using hyperlinks information. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-018-0798-z (2018).

Li, Y., Yang, Z., Chen, X., Yuan, H. & Liu, W. A stacking model using URL and HTML features for phishing webpage detection. Futur. Gener. Comput. Syst. 94 , 27–39 (2019).

Xiang, G., Hong, J., Rose, C. P. & Cranor, L. CANTINA+: a feature rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14 (2), 1–28. https://doi.org/10.1145/2019599.2019606 (2011).

Zhang, W., Jiang, Q., Chen, L. & Li, C. Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web 20 (4), 797–813 (2017).

Rao, R. S., Vaishnavi, T. & Pais, A. R. CatchPhish: Detection of phishing websites by inspecting URLs. J. Ambient. Intell. Humanized Comput. 11 , 813–825 (2019).

Arachchilage, N. A. G., Love, S. & Beznosov, K. Phishing threat avoidance behaviour: An empirical investigation. Comput. Hum. Behav. 60 , 185–197 (2016).

Wang, Y., Agrawal, R., & Choi, B.Y. Light weight anti-phishing with user whitelisting in a web browser. in Region 5 conference, 2008 IEEE, IEEE , 1–4 (2008).

Han, W., Cao, Y., Bertino, E. & Yong, J. Using automated individual white-list to protect web digital identities. Expert Syst. Appl. 39 (15), 11861–11869 (2012).

Prakash, P., Kumar, M., Kompella, R.R., Gupta, M. Phishnet: Predictive blacklisting to detect phishing attacks. in INFOCOM, 2010 Proceedings IEEE, IEEE , 1–5. https://doi.org/10.1109/INFCOM.2010.5462216 (2010)

Felegyhazi, M., Kreibich, C. & Paxson, V. On the potential of proactive domain blacklisting. LEET 10 , 6–6 (2010).

Sheng, S., Wardman, B., Warner, G., Cranor, L.F., Hong, J., & Zhang, C. An empirical analysis of phishing blacklists. in Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09) (2010).

Qi, L. et al. Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment. IEEE Trans. Ind. Inform. 17 (6), 4159–4167. https://doi.org/10.1109/TII.2020.3012157 (2021).

Liu, Y. et al. A label noise filtering and label missing supplement framework based on game theory. Digital Commun. Netw. https://doi.org/10.1016/j.dcan.2021.12.008 (2022).

Muzammal, M., Qu, Q. & Nasrulin B. Renovating blockchain with distributed databases: An open source system. Future Gener. Comput. Syst. 90 , 105–117. https://doi.org/10.1016/j.future.2018.07.042 (2019).

Liu, Y. et al. Bidirectional GRU networks-based next POI category prediction for healthcare. Int. J. Intell. Syst. https://doi.org/10.1002/int.22710 (2021).

Jain, A. K. & Gupta, B. B. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. https://doi.org/10.1007/s11235-017-0414-0 (2017).

Rao, R. S. & Pais, A. R. Two level filtering mechanism to detect phishing sites using lightweight visual similarity approach. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-019-01637-z (2019).

Jain, A. K. & Gupta, B. B. Two-level authentication approach to protect from phishing attacks in real time. J. Ambient. Intell. Human Comput. https://doi.org/10.1007/s12652-017-0616-z (2017).

Rao, R. S., Umarekar, A. & Pais, A. R. Application of word embedding and machine learning in detecting phishing websites. Telecommun. Syst. 79 , 33–45. https://doi.org/10.1007/s11235-021-00850-6 (2022).

Guo, B. et al. HinPhish: An effective phishing detection approach based on heterogeneous information networks. Appl. Sci. 11 (20), 9733. https://doi.org/10.3390/app11209733 (2021).

Le, H., Pham, Q., Sahoo, D., & Hoi, S.C.H. Urlnet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv: 1802.03162 (2018).

Chatterjee, M., & Namin, A.S. Detecting phishing websites through deep reinforcement learning. in 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC) . 978-1-7281-2607-4/19/$31.00 ©2019 IEEE. (IEE Computer Society, 2019). https://doi.org/10.1109/COMPSAC.2019.10211 .

Xiao, X., Zhang, D., Hu, G., Jiang, Y. & Xia, S. CNN-MHSA: A convolutional neural network and multi-head self- attention combined approach for detecting phishing websites. Neural Netw. 125 , 303–312. https://doi.org/10.1016/j.neunet.2020.02.013 (2020).

Article PubMed Google Scholar

Zheng, F., Yan Q., Victor C.M. Leung, F. Richard Yu, Ming Z. HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection, computers & security. https://doi.org/10.1016/j.cose.2021.102584 (2021)

Mohammad, R. M., Thabtah, F. & McCluskey, L. Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 25 (2), 443–458 (2014).

Ramanathan, V. & Wechsler, H. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Comput. Security. 34 , 123–139 (2013).

Zhang, X., Zhao, J., & LeCun, Y. Character-level convolutional networks for text classification. in Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015 (2015).

Stecanella, B. What is TF-IDF? https://monkeylearn.com/blog/what-is-tf-idf/ . (2019) (Accessed 20 December 2020).

Bansal, S.A. Comprehensive guide to understand and implement text classification in python. https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-andimplement-text-classification-in-python/ (2018) (Accessed 1 July 2020).

Ramesh, G., Krishnamurthi, I. & Kumar, K. S. S. An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. 2014 (61), 12–22 (2014).

Zhang, Y., Hong, J.I., & Cranor, L.F. Cantina: A content- based approach to detecting phishing websites. in Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007 , 639–648 (2007).

Chen, T., & Guestrin, C.: Xgboost: A scalable tree boosting system. in Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining. ACM , 785–794 (2016)

Aljofey, A., Jiang, Q. & Qu, Q. A supervised learning model for detecting Ponzi contracts in Ethereum Blockchain. In Big Data and Security. ICBDS 2021. Communications in Computer and Information Science Vol. 1563 (eds Tian, Y. et al. ) (Springer, 2022). https://doi.org/10.1007/978-981-19-0852-1_52 .

Chapter Google Scholar

http://stuffgate.com/stuff/website/ . (Accessed February 2020).

http://www.phishtank.com . (Accessed April 2020).

Usage of content languages for websites. https://w3techs.com/technologies/overview/content_language/all . (2021) (Accessed 19 January 2021).

Iansiti, M. & Lakhani, K. R. The truth about blockchain. Harvard Bus. Rev. 95 (1), 118–127 (2017).

https://github.com/YC-Coder-Chen/Tree-Math/blob/master/XGboost.md . (Accessed September 2021).

Qu, Q., Liu, S., Yang, B. & Jensen, C. S. Efficient top-k spatial locality search for co-located spatial web objects. 2014 IEEE 15th International Conference on Mobile Data Management. 1 , 269–278 (2014).

Download references

Acknowledgements

This research work is supported by the National Key Research and Development Program of China Grant nos. 2021YFF1200104 and 2021YFF1200100.

Author information

Authors and affiliations.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

Ali Aljofey, Qingshan Jiang, Abdur Rasool, Hui Chen & Qiang Qu

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Beijing, 100049, China

Ali Aljofey, Abdur Rasool & Hui Chen

Department of Computer Science, Guangdong University of Technology, Guangzhou, China

Cloud Computing Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China


Contributions

Data curation, A.A. and Q.J.; Funding acquisition, Q.J. and Q.Q.; Investigation, Q.J. and Q.Q.; Methodology, A.A. and Q.J.; Project administration, Q.J.; Software, A.A.; Supervision, Q.J.; Validation, A.R. and H.C.; Writing—original draft, A.A.; Writing—review & editing, Q.J., W.L., Y.W., and Q.Q. All authors reviewed the manuscript.

Corresponding author

Correspondence to Qingshan Jiang.

Ethics declarations

Competing interests.

The authors declare no competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Aljofey, A., Jiang, Q., Rasool, A. et al. An effective detection approach for phishing websites using URL and HTML features. Sci Rep 12, 8842 (2022). https://doi.org/10.1038/s41598-022-10841-5

Received: 17 December 2021

Accepted: 06 April 2022

Published: 25 May 2022

DOI: https://doi.org/10.1038/s41598-022-10841-5


Email Spam Detection by Machine Learning Approaches: A Review

Conference paper
First Online: 26 June 2024

Mohammad Talib Hadi &
Salwa Shakir Baawi (ORCID: orcid.org/0000-0003-3866-5916)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 1035)

Included in the following conference series:

International Conference on Forthcoming Networks and Sustainability in the AIoT Era

Advances in technology have substantially improved communication, and email is often regarded as the most effective channel for both formal and informal exchanges. People also use email to store and distribute important data, including text, images, and documents. Because email is simple and easy to use, some people abuse it by sending large volumes of unwanted messages, usually referred to as spam. Spam emails may carry malicious content disguised as attachments or URLs, exposing the host system to security breaches and to theft of sensitive information such as credit card data. Spam detection therefore poses a serious challenge for email and IoT service providers. Many previous studies have applied machine-learning methods to detect spam in the mailbox. The primary aim of this work is to provide a comprehensive examination and comparative evaluation of the machine learning techniques used in email spam detection. It also highlights the main challenges facing spam email detection. The surveyed strategies are evaluated using metrics such as accuracy, precision, recall, and F1-score. Finally, potential areas for future research are discussed.
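The four metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes a binary spam/non-spam classifier summarized by confusion-matrix counts; the function name `classification_metrics` and the counts used below are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: spam correctly flagged, fp: legitimate mail flagged as spam,
    fn: spam that was missed, tn: legitimate mail correctly passed.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced mailbox like this hypothetical one, accuracy stays high even though a quarter of the spam is missed, which is one reason reviews of this kind tend to report precision, recall, and F1 alongside accuracy.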



Acknowledgment

Special thanks to my advisor for the valuable insight provided regarding the topic at hand.

Author information

Authors and affiliations.

Department of Computer Science, College of Computer Science and Information Technology, University of Al-Qadisiyah, Babylon, Iraq

Mohammad Talib Hadi

Department of Computer Information Systems, College of Computer Science and Information Technology, University of Al-Qadisiyah, Diwanyah, Iraq

Salwa Shakir Baawi


Corresponding author

Correspondence to Mohammad Talib Hadi.

Editor information

Editors and affiliations.

Department of Computer Engineering, Istanbul Sabahattin Zaim University, Istanbul, Türkiye

Jawad Rasheed

Council for Scientific and Industrial Research (CSIR), Pretoria, South Africa

Adnan M. Abu-Mahfouz

School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK

Muhammad Fahim


About this paper

Cite this paper.

Hadi, M.T., Baawi, S.S. (2024). Email Spam Detection by Machine Learning Approaches: A Review. In: Rasheed, J., Abu-Mahfouz, A.M., Fahim, M. (eds) Forthcoming Networks and Sustainability in the AIoT Era. FoNeS-AIoT 2024. Lecture Notes in Networks and Systems, vol 1035. Springer, Cham. https://doi.org/10.1007/978-3-031-62871-9_15

DOI: https://doi.org/10.1007/978-3-031-62871-9_15

Published: 26 June 2024

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-62870-2

Online ISBN: 978-3-031-62871-9

eBook Packages: Intelligent Technologies and Robotics (R0)


DOI: 10.7717/peerj-cs.2131
Corpus ID: 270722076

Comparative evaluation of machine learning algorithms for phishing site detection

Noura Fahad Almujahid , Mohd Anul Haq , Mohammed Alshehri
Published in PeerJ Computer Science 24 June 2024
Computer Science



Study of Void Detection Beneath Concrete Pavement Panels through Numerical Simulation

1. Introduction
2. Principles of gprMax numerical simulation
   2.1. Discretization format of Maxwell's equations
   2.2. Numerical stability and numerical dispersion
   2.3. Absorbing boundary conditions
3. Building numerical simulation model based on gprMax
   3.1. Modeling process
   3.2. Modeling parameters
        3.2.1. Pavement structural parameters
        3.2.2. Ground-penetrating radar signal parameters
        3.2.3. Spatial parameters
4. Numerical simulation model field verification
   4.1. Basic information of the test site
   4.2. Analysis of echo signal verification
5. Analysis of echo signal forms in numerical simulation model
   5.1. Analysis of echo signal forms for void and intact structures
   5.2. Analysis of echo signal forms at different thicknesses of voids
   5.3. Analysis of echo signal forms at different void sizes
   5.4. Analysis of echo signal forms at different void shapes
   5.5. Analysis of echo signal forms in different void-filling media
6. Conclusions
7. Research limitations
8. Future prospects
Author contributions
Data availability statement
Conflicts of interest

Lu, H.; Zhao, C.; Yuan, J.; Yin, W.; Wang, Y.; Xiao, R. Study on the Properties and Benefits of a Composite Separator Layer in Airport Cement Concrete Pavement. Buildings 2022 , 12 , 2190. [ Google Scholar ] [ CrossRef ]
Zhang, Y.; Bao, F.; Tong, Z.; Ma, T.; Zhang, W.G.; Fan, J.W.; Huang, X.M. Radar Response of Heterogeneous Airport Cement Pavement Panel Void. J. Southeast Univ. (Nat. Sci. Ed.) 2023 , 53 , 137–148. [ Google Scholar ]
Tan, Y.; Ling, J.; Yuan, J.; Xu, Z.J. Influence of Voids on Stress in Airport Cement Concrete Pavement. J. Tongji Univ. (Nat. Sci. Ed.) 2010 , 38 , 552–556+568. [ Google Scholar ]
Khudoyarov, S.; Kim, N.; Lee, J.J. Three-dimensional convolutional neural network–based underground object classification using three-dimensional ground penetrating radar data. Struct. Health Monit. 2020 , 19 , 1884–1893. [ Google Scholar ] [ CrossRef ]
Liu, X.; Dong, X.; Leskovar, D.I. Ground penetrating radar for underground sensing in agriculture: A review. Int. Agrophys. 2016 , 30 , 533–543. [ Google Scholar ] [ CrossRef ]
Yu, Q.; Li, Y.; Luo, T.; Zhang, J.; Tao, L.; Zhu, X.; Zhang, Y.; Luo, L.; Xu, X. Cement pavement void detection algorithm based on GPR signal and continuous wavelet transform method. Sci. Rep. 2023 , 13 , 19710. [ Google Scholar ] [ CrossRef ]
Xiao, X.Z.; Li, Q.S. Study of method for identify void beneath cement concrete pavement slabs: A case study of meiguan expressway. J. Highw. Transp. Res. Dev. 2016 , 33 , 39–45. [ Google Scholar ]
Spears, M.; Hedjazi, S.; Taheri, H. An Evaluation of ASTM Standards for Implementation of Ground Penetrating Radar for Pavement and Bridge Deck Evaluations. J. Test. Eval. 2024 , 52 , 1234–1247. [ Google Scholar ] [ CrossRef ]
Li, C.; Li, X. Ground Penetrating Radar Image Inversion Based on Improved Attention Mechanism. J. Radio Sci. 2023 , 38 , 825–834. [ Google Scholar ]
Qiu, Z.; Zeng, J.; Tang, W.; Yang, H.; Lu, J.; Zhao, Z. Research on Real-Time Automatic Picking of Ground-Penetrating Radar Image Features by Using Machine Learning. Horticulturae 2022 , 8 , 1116. [ Google Scholar ] [ CrossRef ]
Jiang, H. Research on Detection of Underground Cavity Targets by Ground Penetrating Radar. Master’s Thesis, Institute of Technology, Harbin, China, 2017. [ Google Scholar ]
Kang, M.S.; Kim, N.; Lee, J.J.; An, Y.K. Deep learning-based automated underground cavity detection using three-dimensional ground penetrating radar. Struct. Health Monit. 2020 , 19 , 173–185. [ Google Scholar ] [ CrossRef ]
Bao, Y.W.; Gao, R.X.; Guo, D.; Bai, S.S.; Xin, X.J. Forward Modelling and Detection of GPR in Urban Road Base Disease. Chem. Eng. Trans. 2015 , 46 , 445–450. [ Google Scholar ]
He, B.; Zhang, H. Frequency Dispersion Suppression and Absorbing Boundary Improvement in Geological Radar Forward Modeling. Geol. Explor. 2000 , 59–63. [ Google Scholar ]

Layer	Material	Thickness	Relative Dielectric Constant	Electrical Conductivity
Air Layer	Air	20 cm	1	0
Surface	Cement Concrete	40 cm	8	0.005
Base	Cement-Stabilized Crushed Stone	30 cm	12	0.05

Parameter Type	Model Signal Parameters
Excitation Source	Ricker Signal
Center Frequency	800 MHz
Tx-Rx Antenna Spacing	10 cm
Time Window	15 ns
Step Size	1 cm

Parameter Type	Model Signal Parameters
Spatial Range	4 m × 0.9 m
Boundary Condition	PML
Spatial Step	0.002 m

Filling Medium	Relative Permittivity	Electrical Conductivity	Simulation Scene
Air	1	0	Complete void condition
Loose fine	2	0.01	Inter-layer non-compaction, degradation of panel bottom support
Mud slurry	40	2	Seepage and mud flushing after joint damage
Water	81	5	Rainy conditions with water pooling between layers

Yuan, J.; Jiao, H.; Wu, B.; Liu, F.; Li, W.; Du, H.; Li, J. Study of Void Detection Beneath Concrete Pavement Panels through Numerical Simulation. Buildings 2024 , 14 , 1956. https://doi.org/10.3390/buildings14071956

COMMENTS

PDF PHISHING EMAIL DETECTION BY USING MACHINE LEARNING TECHNIQUES A Thesis
recognition (OCR) is also used by Gmail to protect users from picture spam [31]. Gmail can also link hundreds of parameters to improve spam detection thanks to machine-learning algorithms built to aggregate and rank enormous collections of Google search results. Factors such as domain reputation, links i.
Detecting phishing websites using machine learning technique
More than 33,000 phishing and valid URLs were used to train the proposed system with Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The phishing detection method focused on the learning process. They extracted 14 different features that distinguish phishing websites from legitimate websites.
(PDF) "Phishing Websites Detection Using Heuristics Machine Learning
Machine Learning has attracted a lot of minds because of its potential applications in various fields. Machine Learning based phishing detection utilizes features of a phishing website. Those ...
A systematic literature review on phishing website detection techniques
Phishing is a fraud attempt in which an attacker acts as a trusted person or entity to obtain sensitive information from an internet user. In this Systematic Literature Survey (SLR), different phishing detection approaches, namely Lists Based, Visual Similarity, Heuristic, Machine Learning, and Deep Learning based techniques, are studied and ...
Detecting Phishing Domains Using Machine Learning
Phishing is an online threat where an attacker impersonates an authentic and trustworthy organization to obtain sensitive information from a victim. One example of such is trolling, which has long been considered a problem. However, recent advances in phishing detection, such as machine learning-based methods, have assisted in combatting these attacks.
Phishing Website Detection Using Machine Learning: A Review
Keywords — Phishing Detection, Machine learning ... This thesis collected a ... be prevented by using anti-phishing mechanisms to detect phishing. Machine learning is a powerful tool used to ...
Dissertation Phishing Detection Using Machine Learning
Machine learning algorithms have shown promising results [13-16]. This technique requires prior real-world data that has been classified or marked to carry out the training [17]. However, we have faced the following limitations using machine learning-based techniques to detect phishing websites in existing approaches. 1.1.1 Not privacy ...
Phishing Detection Using Machine Learning Techniques
hybrid models and machine learning-based methods is highly recommended. In this paper, we are going to use machine learning-based classifiers for detecting phishing websites. Fig. 2. An Overview of phishing detection approaches IV. MACHINE LEARNING APPROACH Machine learning provides simplified and efficient methods for data analysis.
Phishing Websites Detection Using Machine Learning
This thesis explains how to use machine learning to detect dangerous phishing websites, with an emphasis on attributes retrieved just from the URL. It starts with a description of the available data and the feature engineering process, then moves on to choosing acceptable machine learning approaches. It compares algorithm performance and ...
(PDF) Detection of Phishing Websites using Machine Learning
Department of Electrical Engineering and Computer Science. University of Toledo. Toledo, OH, US. 6 [email protected]. Abstract. Phishing sends malicious links or attachments through ...
PDF Detecting Phishing Attacks by Machine Learning
This thesis presents methods for detecting phishing attacks using machine learning techniques. The approach presents chosen machine learning models and ... The aim of this thesis is to use machine learning techniques in order to identify ... Jain, A.K et al. [2] presented a phishing detection system based on machine learning using an SVM ...
Thesis Machine Learning-based Phishing Detection Using Url Features: a
[9] 2022 PDGAN: Phishing Detection With Generative Adversarial Networks J-Q1 3.367 7 [10] 2022 Website Phishing Detection Using Machine Learning Classification Algorithms CP - - [11] 2021 Towards Lightweight URL-Based Phishing Detection J-Q2 3.638 18 [12] 2021 An Explainable Multi-Modal Hierarchical Attention Model for Developing Phishing Threat
PDF Detection of Phishing Websites Using a Machine Learning Algorithm
As members of the board of examiners, we examined this thesis entitled "Detection of phishing website using a machine learning algorism" by Mekiyas Gelanew. We hereby certify that the thesis is accepted for fulfilling the requirements for the award of the degree of Master of Science in "Information Technology". Board of Examiners
An effective detection approach for phishing websites using URL and
The most common methods for detecting phishing attacks are blacklists/whitelists [5], natural language processing [6], visual similarity [7], rules [8], machine learning techniques [9,10], etc. Techniques ...
PDF Phishing URL Detection Using Gradient Boosting: A Machine Learning Approach
By integrating these features into our machine learning model, we aim to increase its accuracy and robustness. In the following section, we describe selection and training methods for advanced phishing detection. 3.4 Machine Learning Models To achieve the most effective phishing detection system, we assessed the performance of various machine
A case study on phishing detection with a machine learning net
The goal of this work was to develop a model capable of identifying phishing emails based on machine learning approaches. The final model consisted of a neural network able to detect more than 80% of phishing emails without compromising the remaining emails sent by E-goi clients. Phishing attacks aim to steal sensitive information and, unfortunately, are becoming a common practice on the web.
Phishing Detection System Through Hybrid Machine Learning Based on URL
Currently, numerous types of cybercrime are organized through the internet. Hence, this study mainly focuses on phishing attacks. Although phishing was first used in 1996, it has become the most severe and dangerous cybercrime on the internet. Phishing utilizes email distortion as its underlying mechanism for tricky correspondences, followed by mock sites, to obtain the required data from ...
PDF Phishing Detection using Machine Learning based URL Analysis: A Survey
In the machine learning-based approach, models are created to classify a given URL as phishing or not using supervised learning algorithms. Different algorithms are trained on a dataset and then tested to measure the performance of each model. Any variation in the training data directly affects the performance of the model.
Phishing Detection Using Machine Learning Technique
Phishing is a type of website threat in which information such as login IDs, passwords, and credit card details is illegally obtained by imitating the original website. This paper proposed an efficient ...
PDF Phishing Websites Detection using Machine Learning
In detecting phishing URLs, there are two steps. The first step is to extract features from the URLs, and the second step is to classify URLs using the model that has been developed with the help of the training set data. In this work, we used the data set that provided the extracted features.
PDF Detection of Phishing Websites using Machine Learning
We have tested two machine learning algorithms on the 'Phishing Websites Dataset' and reviewed their results. We then selected the best algorithm based on its performance and built a Chrome extension for detecting phishing web pages. The extension allows easy deployment of our phishing detection model to end users.
PDF Phishing Website Detection Using Novel Machine Learning Fusion Approach
2.1 Detection of Phishing URL using Machine Learning Abstract: Phishing websites have proven to be a major security concern. Several cyberattacks risk the confidentiality, integrity, and availability of company and consumer data, and phishing is the beginning point for many of them. Many researchers have spent decades creating unique
Email Spam Detection by Machine Learning Approaches: A Review
Technological innovation is being misused for unethical and unlawful activities such as phishing and scamming. The detection of online spammers has become a ... S.C., Akhila, K., Pujita, K., Sagar, P.V., Mondal, S., Bulla, S.: Spam based email identification and detection using machine learning techniques. In: 2nd International Conference on ...
Model of detection of phishing URLs based on machine learning
Thus, using machine learning to detect phishing URLs can be an effective method to protect users from phishing-related cyberattacks. Evaluation of different phishing detection methods The problem with detecting phishing URLs is that they are designed to look like legitimate URLs, making it difficult for users to distinguish them from genuine ones.
Comparative evaluation of machine learning algorithms for phishing site
DOI: 10.7717/peerj-cs.2131 Corpus ID: 270722076; Comparative evaluation of machine learning algorithms for phishing site detection @article{Almujahid2024ComparativeEO, title={Comparative evaluation of machine learning algorithms for phishing site detection}, author={Noura Fahad Almujahid and Mohd Anul Haq and Mohammed Alshehri}, journal={PeerJ Computer Science}, year={2024}, url={https://api ...
Study of Void Detection Beneath Concrete Pavement Panels through ...
In the structure of composite pavement, the formation of voids beneath concrete panels poses significant risks to structural integrity and operational safety. Ground-Penetrating Radar (GPR) detection serves as an effective method for identifying voids beneath concrete pavement panels. This paper focuses on analyzing the morphological features of GPR echo signals.
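
Several of the abstracts listed above describe URL-based phishing detection as a two-step process: extract features from the URL, then classify it with a model built from training data. The Python sketch below illustrates that idea only. The feature set, the toy URLs (reserved example domains and documentation IP ranges), and the nearest-centroid classifier are all assumptions chosen for brevity; none of them is taken from the listed papers, which use classifiers such as SVM, Naïve Bayes, gradient boosting, or neural networks.

```python
import re
from urllib.parse import urlparse


def extract_features(url):
    """Step 1: turn a URL string into a small numeric feature vector.

    The six lexical features here are illustrative assumptions.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return [
        len(url),                                          # total URL length
        host.count("."),                                   # dots in the host name
        1 if re.fullmatch(r"\d+(\.\d+){3}", host) else 0,  # host looks like a raw IPv4 address
        1 if "@" in url else 0,                            # '@' appears anywhere in the URL
        1 if parsed.scheme == "https" else 0,              # scheme is HTTPS
        url.count("-"),                                    # number of hyphens
    ]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit(urls, labels):
    """Step 2a: 'train' by averaging the feature vectors of each class."""
    by_label = {}
    for url, label in zip(urls, labels):
        by_label.setdefault(label, []).append(extract_features(url))
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, url):
    """Step 2b: assign the label whose centroid is closest (squared Euclidean)."""
    x = extract_features(url)

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))

    return min(model, key=lambda label: sq_dist(model[label]))


# Hypothetical toy training set; a real study would use thousands of labelled URLs.
train_urls = [
    "https://example.com/login",
    "https://docs.example.org/help",
    "http://192.0.2.10/secure-login-update/account@verify",
    "http://example-secure-login.account-update.example.net/verify",
]
train_labels = ["legitimate", "legitimate", "phishing", "phishing"]

model = fit(train_urls, train_labels)
print(predict(model, "https://example.com/about"))
print(predict(model, "http://198.51.100.7/account-verify-login/update"))
```

Because the features are unscaled, URL length dominates the distance in this sketch; a practical pipeline would normalise the features and evaluate the model on a held-out test set, as the surveyed abstracts describe.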

Detecting phishing websites using machine learning technique

1. Introduction

2. Research background and related works

2.1 Classification of phishing attack techniques

2.2 Phishing detection approaches

2.2.1 Normal dataset.

2.2.2 Phishing dataset.

2.3 Research questions

3. Research methodology

3.1. Input gate

3.2. Forget gate

3.3. Output gate

4. Results and discussions

5. Conclusion

Supporting information

Acknowledgments

Phishing Websites Detection Using Machine Learning

Suhani Jain

Suhani Jain (Contact Author)

An effective detection approach for phishing websites using URL and HTML features

Related work

List-based detection

Machine learning-based detection

Proposed approach

System architecture

Feature generation

Feature vectorization

Detection module

Features extraction

URL character sequence features (F1)

HTML features

Textual content-based features (F2)

Script, CSS, img, and anchor files (F3, F4, F5, and F6)

Empty hyperlinks (F7 and F8)

Total hyperlinks feature (F9)

Internal and external hyperlinks (F10, F11, and F12)

Error in hyperlinks (F13)

Login form features (F14 and F15)

Classification algorithms

Experiments and result analysis

Performance metrics

Evaluation of features

Comparison with existing approaches

Discussion and limitations

Limitations

Conclusion and future work

Data availability

Acknowledgements

Author information

Contributions

Corresponding author

Ethics declarations

Additional information

This article is cited by

Detection of phishing URLs with deep learning based on GAN-CNN-LSTM network and swarm intelligence algorithms

A CNN-Based SIA Screenshot Method to Visually Identify Phishing Websites

Life-long phishing attack detection using continual learning

Email Spam Detection by Machine Learning Approaches: A Review

Acknowledgment

Author information

Corresponding author

Editor information
