User-Centric Phishing URL Detection Tool Powered by Interpretable Machine Learning Model.

Paper

LinkScope is a senior project focused on delivering an effective phishing detection solution through two main components: Model Development and Web Application. As a data scientist, I was responsible for developing the machine learning model and handling backend development. The Model Development component involves training a machine learning model to distinguish between legitimate and phishing URLs, while the Web Application provides a user-friendly interface for accessing these detection functionalities.
My paper, "User-Centric Phishing URL Detection Tool Powered by Interpretable Machine Learning Model," was accepted at ISCIT2024 and will be published in IEEE Xplore.

Model Development

1. Data Preparation: The process begins with data preparation, which covers data aggregation, class balancing, and preprocessing. Feature extraction is then performed on the prepared URLs.
2. Feature Engineering and Experimentation: The extracted features are used in experiments that pair different feature selection approaches with candidate models. The best-performing model is then selected for deployment in the web application. The chosen model uses the LightGBM algorithm, with Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection.
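
The RFECV step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training script: the synthetic dataset, the number of trees, the fold count, and the variable names are all assumptions made for the example. If LightGBM is not installed, the sketch falls back to a Random Forest as a stand-in estimator.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

try:
    # Estimator named in this README; hyperparameters here are illustrative.
    from lightgbm import LGBMClassifier
    estimator = LGBMClassifier(n_estimators=30, random_state=0, verbose=-1)
except (ImportError, OSError):
    # Stand-in so the sketch still runs without LightGBM installed.
    from sklearn.ensemble import RandomForestClassifier
    estimator = RandomForestClassifier(n_estimators=30, random_state=0)

# Synthetic stand-in for the URL feature matrix (assumption: 12 features,
# 4 of them informative). The real project uses features extracted from URLs.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4,
    n_redundant=0, random_state=0,
)

# Recursively drop the least important feature and score each subset
# with stratified cross-validation on accuracy.
selector = RFECV(
    estimator=estimator,
    step=1,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

selected_mask = selector.support_   # boolean mask over the input features
n_selected = selector.n_features_   # size of the best-scoring subset
print(f"Kept {n_selected} of {X.shape[1]} features")
```

The boolean mask in `selected_mask` can then be used to decide which features the application needs to extract at prediction time.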

Model Result

The results come from applying each candidate algorithm (Random Forest, LightGBM, and SVC) with different feature selection approaches. LightGBM achieved the highest accuracy among the three, so it was chosen as the model for the web application.

Figure: Accuracy of each model by feature selection approach.

However, in the context of a web application, processing time is also a critical factor. When users submit URLs for analysis, the feature extraction process takes time, adding to the overall latency. This consideration makes it important to balance accuracy with processing time.

LightGBM with all features and LightGBM with RFECV reach approximately the same top accuracy, 94.61%. However, extracting every feature is a drawback for the web application, where user experience depends on quick responses. LightGBM with RFECV is therefore chosen, which reduces the feature set to 26 features.

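| Algorithm                     | Accuracy | Precision | Recall | F1-score |
|-------------------------------|----------|-----------|--------|----------|
| LightGBM                      | 94.61%   | 95.68%    | 93.44% | 94.61%   |
| LightGBM (Hyperparameterized) | 95.07%   | 96.00%    | 94.05% | 95.07%   |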
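
For reference, the four metrics reported for each model are conventionally derived from confusion-matrix counts. The sketch below uses the standard binary definitions with phishing as the positive class; the counts in the usage line are made-up illustrative numbers, not the project's results, and the README does not state which averaging scheme was used for the reported scores.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives, and
    true negatives, with "phishing" assumed to be the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print({name: round(value, 4) for name, value in metrics.items()})
```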

The table reports the classification results of LightGBM with RFECV. Hyperparameter tuning raised the accuracy from 94.61% to 95.07%. The features were selected using Recursive Feature Elimination with Cross-Validation (RFECV), which reduced the set to 26 features: 'domainlength', 'www', 'subdomain', 'https', 'short_url', '@', '-', '=', '.', '_', '/', 'digit', 'log', 'pay', 'web', 'account', 'pcemptylinks', 'pcextlinks', 'pcrequrl', 'zerolink', 'extfavicon', 'submit2email', 'sfh', 'redirection', 'domainage', and 'domainend'. This feature set, together with the hyperparameter-tuned LightGBM model, is the one deployed in the application.
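
To give a feel for the URL-based part of this feature set, here is a minimal sketch that computes a subset of the features using only the Python standard library. The exact definitions used in the project are not given in this README, so everything below is an assumption: character features are taken as occurrence counts, keyword features as presence flags, and `extract_lexical_features` is a hypothetical helper name. Content-based and WHOIS-based features such as 'pcextlinks' or 'domainage' need page or registry lookups and are left out.

```python
from urllib.parse import urlparse

# Feature names taken from the selected set; their definitions are assumed.
SPECIAL_CHARS = ["@", "-", "=", ".", "_", "/"]
KEYWORDS = ["log", "pay", "web", "account"]


def extract_lexical_features(url: str) -> dict:
    """Return a dict of simple lexical features for one URL (illustrative)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # lowercased host, empty if it cannot be parsed

    features = {
        "domainlength": len(host),                   # assumed: length of the host
        "www": int(host.startswith("www.")),         # assumed: 1 if host starts with www.
        "https": int(parsed.scheme == "https"),      # 1 if the scheme is HTTPS
        "digit": sum(ch.isdigit() for ch in url),    # assumed: digit count in the URL
    }
    # Assumed: special-character features count occurrences in the full URL.
    for ch in SPECIAL_CHARS:
        features[ch] = url.count(ch)
    # Assumed: keyword features flag whether the word appears anywhere in the URL.
    lowered = url.lower()
    for word in KEYWORDS:
        features[word] = int(word in lowered)
    return features


example = extract_lexical_features("https://www.example.com/login?user=a_b1")
print(example)
```

Features like these are cheap to compute, whereas features that require fetching the page or querying domain records are the ones most likely to add to the response time discussed earlier.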