Malicious and Phishing URL Detection Using Machine Learning

IEEE BASE PAPER TITLE:

BERT-PhishFinder: A Robust Model for Accurate Phishing URL Detection With Optimized DistilBERT

IEEE BASE PAPER ABSTRACT:

Phishing URL detection has become a critical challenge in cybersecurity, with existing methods often struggling to maintain high accuracy while generalizing across diverse datasets. In this article, we introduce BERT-PhishFinder, a novel and efficient transformer-based model designed to tackle this problem. While most traditional approaches rely heavily on lexical features or complex convolutional architectures, BERT-PhishFinder leverages the power of DistilBERT, a lightweight yet highly effective transformer, to capture rich contextual representations of URL sequences. To enhance the model’s robustness and reduce overfitting, we strategically incorporate SpatialDropout1D in the embedding layers, along with global average pooling and global max pooling techniques to extract both comprehensive and key discriminative features. The pooled representations are thoughtfully concatenated to form a comprehensive feature representation. Through this carefully crafted design, our model adopts ensemble learning, as it undergoes multiple parallel dense layers, each with distinct parameters and dropout regularization. This facilitates learning diverse patterns and features from the input URL sequence, culminating in exceptional phishing URL detection performance. Extensive evaluations against conventional deep learning algorithms, transformer models (XLNet, RoBERTa, ALBERT), and other existing methods on five benchmark datasets show that BERT-PhishFinder not only achieves the state-of-the-art real phishing URL detection but also accomplishes this with reduced label dependency.

ALGORITHM / MODEL USED:

Gradient Boosting Classifier, XGBoost Classifier, Multi-layer Perceptron Classifier, Logistic Regression, K-Nearest Neighbors, Support Vector Machine Classifier, Naive Bayes Classifier, Decision Tree Classifier, and Random Forest Classifier.

OUR PROPOSED PROJECT ABSTRACT:

Malicious and phishing URLs have become one of the most common attack vectors used by cybercriminals to steal sensitive information, distribute malware, and compromise user trust on the internet. With the rapid growth of online services such as digital banking, e-commerce, and cloud-based platforms, users are increasingly exposed to deceptive websites that closely mimic legitimate ones. This creates a strong need for an intelligent and automated system that can accurately identify whether a given URL is safe or unsafe before users interact with it, thereby reducing the risk of cyber threats and financial loss.

To address this need, this project presents a machine learning system for detecting malicious and phishing URLs. The system uses Python for backend processing; HTML, CSS, and JavaScript for the user interface; and Flask as the web framework that connects the frontend to the machine learning models.

The core contribution of this project is the independent implementation and evaluation of multiple supervised machine learning algorithms: Logistic Regression, K-Nearest Neighbors, Support Vector Machine Classifier, Naive Bayes Classifier, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, and Multi-layer Perceptron Classifier. These models are trained on URL-based features to learn patterns that distinguish safe URLs from malicious or phishing ones.

The performance of each model is analyzed in detail using both training and test datasets. Logistic Regression achieved a training accuracy of 92.7% and test accuracy of 93.4%, while K-Nearest Neighbors achieved 96.5% training accuracy and 93.8% test accuracy. The Support Vector Machine Classifier demonstrated strong generalization with 96.9% training accuracy and 96.4% test accuracy. The Naive Bayes Classifier showed comparatively lower performance with 60.5% accuracy on both training and test data. Decision Trees and Random Forest classifiers achieved high training accuracies of 99.1%, with test accuracies of 96.0% and 96.7% respectively. Gradient Boosting and XGBoost classifiers delivered robust results with test accuracies of 97.4% and 97.1%, while the Multi-layer Perceptron Classifier achieved a training accuracy of 98.8% and test accuracy of 96.9%.
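The gap between training and test accuracy is a quick way to read these figures for overfitting. The short sketch below computes that gap from the numbers reported above; Gradient Boosting and XGBoost are left out because only their test accuracies are given. The function name `generalization_gaps` is illustrative and not part of the project code.

```python
# Training and test accuracies (percent) as reported in the abstract above.
# Gradient Boosting and XGBoost are omitted because only their test
# accuracies are reported.
reported = {
    "Logistic Regression": (92.7, 93.4),
    "K-Nearest Neighbors": (96.5, 93.8),
    "Support Vector Machine": (96.9, 96.4),
    "Naive Bayes": (60.5, 60.5),
    "Decision Tree": (99.1, 96.0),
    "Random Forest": (99.1, 96.7),
    "Multi-layer Perceptron": (98.8, 96.9),
}


def generalization_gaps(scores):
    """Return {model: train - test} in percentage points, rounded to 1 dp."""
    return {name: round(train - test, 1) for name, (train, test) in scores.items()}


gaps = generalization_gaps(reported)

# List models from the largest gap (most likely overfitting) to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} gap = {gap:+.1f} points")
```

A larger positive gap suggests the model fits the training data more closely than it generalizes; a gap near zero or below, as with Logistic Regression, suggests little overfitting.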

The developed system provides an interactive user interface where users can select a preferred machine learning model (from the top three) and enter a URL for analysis. Based on the prediction, the system classifies the URL as Safe or Unsafe. If the URL is predicted as safe, users are allowed to continue directly to the website. If the URL is identified as unsafe, the system displays a warning message and provides a “Proceed with Caution” option for informed user decision-making. In addition to prediction results, the system presents performance analysis metrics such as accuracy, precision, recall, F1-score, and confusion matrix for each model.

To enhance interpretability and comparison, the project includes static visualization graphs such as accuracy comparison charts, F1-score comparison graphs, combined metric comparison graphs, phishing count distribution charts, and feature importance visualizations for Gradient Boosting and XGBoost classifiers. Feature importance using permutation methods for the full model is also provided. Overall, this project demonstrates an effective and user-centric machine learning solution for real-time malicious and phishing URL detection, combining high predictive accuracy with detailed performance analysis and intuitive visual insights.
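Permutation feature importance can be illustrated with a small, dependency-free sketch: shuffle one feature column at a time and measure how much accuracy drops. The toy model, the two-feature rows, and the function names below are assumptions made for illustration only; they are not the project's trained models or its actual feature set.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving every other feature intact.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in for a trained classifier (hypothetical): flags a URL as
# unsafe (1) when the first feature exceeds 50 and ignores the second.
def toy_model(row):
    return int(row[0] > 50)


rows = [[20, 3], [35, 7], [48, 1], [60, 5], [75, 2], [90, 8], [30, 4], [80, 6]]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the toy model never reads the second feature, shuffling that column cannot change any prediction, so its importance is zero, while shuffling the first column lowers accuracy.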

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS:

System : Intel Core i3 Processor.
Hard Disk : 20 GB.
Monitor : 15" LED.
Input Devices : Keyboard, Mouse.
RAM : 8 GB.

SOFTWARE REQUIREMENTS:

Operating System : Windows 10 / 11.
Coding Language : Python 3.12.0.
Web Framework : Flask.
Frontend : HTML, CSS, JavaScript.

REFERENCE:

Ali Aljofey, Saifullahi Aminu Bello, Jian Lu, and Chen Xu, “BERT-PhishFinder: A Robust Model for Accurate Phishing URL Detection With Optimized DistilBERT”, IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 22, NO. 4, JULY/AUGUST 2025.

Frequently Asked Questions (FAQs) and Answers

1. What is the objective of this project?

The objective of this project is to develop a machine learning-based system that can automatically detect whether a given URL is safe or malicious/phishing. The system helps users avoid fraudulent websites and reduces the risk of cyber threats by providing real-time URL safety predictions.

2. What types of URLs can the system detect?

The system is designed to detect phishing URLs and other malicious links that attempt to impersonate legitimate websites or distribute harmful content. It also recognizes genuine, safe URLs based on patterns learned from the training data.

3. Which technologies are used in this project?

The project is developed using Python for backend and machine learning implementation, Flask as the web framework, and HTML, CSS, and JavaScript for the frontend. Machine learning models are implemented using popular libraries such as Scikit-learn and XGBoost.

4. Which machine learning algorithms are implemented?

The system implements nine machine learning models: Logistic Regression, K-Nearest Neighbors, Support Vector Machine Classifier, Naive Bayes Classifier, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, and Multi-layer Perceptron Classifier.

5. How does the system classify a URL as safe or unsafe?

The system extracts relevant features from the input URL and passes them to the selected machine learning model. Based on learned patterns from training data, the model predicts whether the URL is Safe or Unsafe.
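As a rough illustration of the feature extraction step, the sketch below derives a few lexical features from a URL using only the Python standard library. The feature names and the choice of features are hypothetical; the project's actual feature set is not listed here.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a small set of lexical URL features (hypothetical feature set)."""
    # Add a scheme when missing so urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        # Raw IPv4 addresses in place of a domain name are a common warning sign.
        "host_is_ipv4": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
    }


features = extract_url_features("http://192.168.0.1/login@secure")
print(features)
```

The resulting dictionary would be converted to a numeric vector, in a fixed feature order, before being passed to the selected model.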

6. Can users choose different models for prediction?

Yes. The system allows users to select a preferred machine learning model before entering a URL. This enables users to compare predictions and understand how different models perform.

7. What happens after a URL is classified as Safe or Unsafe?

If the URL is classified as Safe, the system provides an option to continue to the website. If the URL is classified as Unsafe, a warning message is displayed along with a “Proceed with Caution” option for informed decision-making.
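The Safe/Unsafe flow described above can be sketched as a small function that maps a model prediction to what the result page would show. The function name, dictionary keys, and warning text are assumptions for illustration; the project's Flask routes and templates are not shown here.

```python
def build_result(url: str, predicted_unsafe: bool) -> dict:
    """Map a model prediction to the information shown on the result page."""
    if predicted_unsafe:
        # Unsafe: show a warning, but still let the user decide.
        return {
            "url": url,
            "label": "Unsafe",
            "warning": "This URL may be malicious or a phishing attempt.",
            "action": "Proceed with Caution",
        }
    # Safe: no warning, and the user may continue directly.
    return {
        "url": url,
        "label": "Safe",
        "warning": None,
        "action": "Continue to website",
    }


print(build_result("http://example.com", predicted_unsafe=False))
```

In a Flask application, a view function could pass this dictionary to a template that renders the label, the warning, and the appropriate button.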

8. What performance metrics are displayed in the system?

The system displays key performance evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix for each machine learning model.
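For reference, all of these metrics follow from the four cells of a binary confusion matrix. The sketch below computes them in plain Python, assuming labels are encoded as 1 for Unsafe and 0 for Safe; that encoding and the sample predictions are assumptions for illustration, not project data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute the confusion matrix and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, cols: predicted
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels only (1 = Unsafe, 0 = Safe).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For phishing detection, recall on the Unsafe class is often the metric to watch, since a missed phishing URL (a false negative) is usually costlier than a false alarm.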

9. Are visualization graphs included in the project?

Yes. The system includes static visualization graphs such as accuracy comparison graphs, F1-score comparison graphs, combined metric comparison graphs, phishing count charts, and feature importance visualizations for selected models.

10. Is the system suitable for real-time usage?

Yes. The system provides quick predictions for entered URLs and is suitable for real-time or near real-time usage in web-based environments.

11. Is any personal user data collected by the system?

No. The system only analyzes the structure and features of the entered URL. It does not collect, store, or process personal user information.

12. What is the significance of this project?

This project demonstrates how machine learning can be effectively applied to cybersecurity problems. It provides a practical solution for phishing detection while offering educational value through model comparison, performance analysis, and visual insights.
