URL CLASSIFIER FOR PHISHING LINKS
Client: Cyber security company
Domain: Cyber Security
Length: 2 months
Goal: Analyze URLs to estimate whether a link is safe to click
Tech: Python, TensorFlow, Keras
The problem was to classify a link into one of two classes: phish and not-phish. The phish class represents malicious links and the not-phish class represents normal, benign links. The end goal was to build a web application that would run various link checks, including a machine learning check.
We were given a labeled data set that initially had 100k normal URLs and 60k phish URLs. We tried two different approaches: a classical machine learning method with feature extraction, and a deep learning solution.
The feature-based approach requires preprocessing the URLs (cleaning, encoding, etc.). After the preprocessing step we extracted features from the URLs, and then trained a machine learning model to classify new URLs as phish or not-phish.
We extracted 30 features just by looking at the URLs. Some of the features were: length of the URL, number of dots, length of the hostname, number of digits, etc.
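To make the idea concrete, here is a minimal sketch of how the four features named above could be computed. The function name `extract_features`, the feature keys, and the choice to prepend a scheme when one is missing are assumptions made for illustration; this is not the project's actual extraction code, and it covers only a small part of the 30 features.

```python
from urllib.parse import urlparse


def extract_features(url: str) -> dict:
    """Compute a few lexical features from a raw URL string."""
    # urlparse only fills in the hostname when a scheme is present,
    # so prepend one if it is missing (an assumption of this sketch).
    parsed = urlparse(url if "://" in url else "http://" + url)
    hostname = parsed.hostname or ""
    return {
        "url_length": len(url),                          # length of the URL
        "num_dots": url.count("."),                      # number of dots
        "hostname_length": len(hostname),                # length of the hostname
        "num_digits": sum(ch.isdigit() for ch in url),   # number of digits
    }


features = extract_features("http://g00gle.com/login")
print(features)
```

Each URL becomes one row of numeric features, which is the form a classical classifier expects as input.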
We trained several machine learning models, and the best one was XGBDT. The feature-based method didn't achieve satisfactory results, but it served as a baseline for future improvements.
Deep learning approach
The feature-based approach had its limitations; most notably, it wasn't aware of context. We could have added more features to improve the results, but we wanted to try a deep learning solution that can learn text features and context by itself.
As a starting point, we trained a char2vec model on the data set we had. On top of that, we trained a recurrent neural network that reads the URL character by character and outputs the probability that the entered URL is phish. The neural network was able to detect anomalies in the links, such as g00gle, gegle, etc. This approach was clearly better than the feature-based approach and achieved better results on the static test set.
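A character-level network needs each URL turned into a fixed-length sequence of integer ids before an embedding layer and recurrent layer can consume it. The sketch below shows one plausible way to do that input preparation. The vocabulary (printable ASCII), the reserved ids for padding and unknown characters, and `MAX_LEN = 200` are all assumptions for illustration, not details taken from the project.

```python
import string
from typing import List

# Assumed vocabulary: printable ASCII characters.
# Id 0 is reserved for padding, id 1 for characters outside the vocabulary.
PAD, UNKNOWN = 0, 1
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}
MAX_LEN = 200  # assumed maximum number of characters fed to the network


def encode_url(url: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a URL to a fixed-length list of character ids."""
    # Truncate long URLs, then look up each character's id.
    ids = [VOCAB.get(ch, UNKNOWN) for ch in url[:max_len]]
    # Right-pad short URLs so every sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


print(encode_url("g00gle", max_len=8))
```

Because the input is the raw character sequence, a substitution such as "o" to "0" changes the ids the network sees, which is what lets a sequence model pick up on look-alike domains.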
The final solution was a web application with a REST endpoint that reports which input links are suspicious, along with a confidence score. Behind the REST endpoint was a rule-based engine that takes a list of links and runs them through a pipeline of various checks. One of the checks was our neural network. We also implemented a retraining mechanism for the neural network; retraining can be triggered when a new data set of URLs is collected.
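The pipeline of checks can be pictured as a list of functions that each score a link, with the scores then combined into one confidence value. The sketch below is only an illustration under assumptions: the check names, the stand-in for the neural network, the `max` combination rule, and the 0.5 threshold are all hypothetical and not taken from the actual engine.

```python
from typing import Callable, Dict, List

# A check takes a URL and returns a suspicion score between 0 and 1.
Check = Callable[[str], float]


def check_many_dots(url: str) -> float:
    """Hypothetical rule: flag URLs that contain more than four dots."""
    return 1.0 if url.count(".") > 4 else 0.0


def check_model(url: str) -> float:
    """Stand-in for the neural network; a real system would call the trained model."""
    return 0.9 if "g00gle" in url else 0.1


CHECKS: List[Check] = [check_many_dots, check_model]


def score_links(urls: List[str], checks: List[Check] = CHECKS,
                threshold: float = 0.5) -> List[Dict]:
    """Run every link through every check and report the highest score."""
    results = []
    for url in urls:
        confidence = max(check(url) for check in checks)
        results.append({
            "url": url,
            "confidence": confidence,
            "suspicious": confidence >= threshold,
        })
    return results


results = score_links(["http://g00gle.com", "http://example.com"])
print(results)
```

Keeping each check as an independent function means the model can be retrained or replaced without touching the other rules.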
The final results showed that the neural network approach worked better than the feature-based approach. Even though the neural network solution was better and achieved 97% accuracy on the test set, there was a problem with overfitting. Our data lacked variety, for example URLs in languages other than English, and the neural network made mistakes on such URLs. Also, in a real-world scenario there are many more normal links than phish links, so our neural network was biased toward phish links because we trained on a pretty much balanced data set. These problems are reduced by collecting more varied URLs and by the retraining mechanism.
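The class-balance effect can be illustrated with a standard prior-shift correction, which rescales a predicted probability from the class ratio seen in training to the ratio expected in production. This is not something the project describes doing; it is a sketch to show the size of the bias. The training prior below is derived from the initial 60k phish and 100k normal URLs, and the real-world prior of 1% is a purely assumed figure.

```python
def adjust_for_prior(p_phish: float, train_prior: float, real_prior: float) -> float:
    """Rescale a predicted phish probability to a different class prior."""
    # Reweight each class by the ratio of its real-world to training frequency.
    phish = p_phish * real_prior / train_prior
    benign = (1.0 - p_phish) * (1.0 - real_prior) / (1.0 - train_prior)
    # Renormalize so the two class probabilities sum to 1.
    return phish / (phish + benign)


TRAIN_PRIOR = 60_000 / 160_000  # phish share in the initial data set
REAL_PRIOR = 0.01               # assumed phish share in real traffic

adjusted = adjust_for_prior(0.5, TRAIN_PRIOR, REAL_PRIOR)
print(round(adjusted, 4))
```

Under these assumed numbers, a raw score of 0.5 corresponds to a much lower probability once the rarity of phish links in real traffic is taken into account.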