SEMPHISH: A Phishing Detection Tool Based on Semantic Hashes
Romeo, Francesco;Blefari, Francesco;Pironti, Francesco Aurelio;Lupinacci, Matteo;Furfaro, Angelo;
2025-01-01
Abstract
Phishing is a type of social engineering attack in which users are deceived into performing specific actions, often under the guise of a legitimate organization such as Google, their employer, or a financial institution. This paper presents SEMPHISH, a phishing detection tool that leverages semantic hashes and machine learning techniques to identify webpages that visually or structurally mimic well-known legitimate websites. The underlying approach applies semantic hashing techniques to both the source code and the screenshots of webpages to compute similarity scores. The extracted similarity scores are then analyzed using machine-learning-based classifiers. To evaluate the performance of SEMPHISH, a custom dataset was built. Multiple performance metrics were evaluated through extensive experimentation with various machine-learning algorithms. This made it possible to assess the impact of each similarity score on detection performance individually, and to evaluate the hybrid approach that combines multiple scores. It also made it possible to determine the optimal algorithm and parameter configuration for detecting, preventing, and mitigating phishing threats. The SEMPHISH configuration that uses the eXtreme Gradient Boosting classifier performed best, achieving an accuracy of 95.15% and an F1-score of 94.99% on the analyzed dataset.
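
As an illustration of how a hash-based similarity score between two pages' source code might be computed, the sketch below uses a SimHash-style fingerprint over HTML tag shingles. This is an assumption made purely for illustration: the abstract does not say which hashing scheme SEMPHISH uses, and the function names (tag_shingles, simhash, similarity) are hypothetical. In a pipeline like the one described, such scores would serve as input features to a classifier.

```python
import hashlib
import re


def tag_shingles(html: str, k: int = 3) -> list[str]:
    """Return k-grams of opening-tag names as a crude structural view of a page."""
    tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]
    if len(tags) < k:
        return [" ".join(tags)] if tags else []
    return [" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)]


def simhash(tokens: list[str], bits: int = 64) -> int:
    """SimHash-style fingerprint (illustrative stand-in, not the paper's hash)."""
    weights = [0] * bits
    for tok in tokens:
        # Take 64 stable bits from MD5 so results do not vary between runs.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Similarity in [0, 1]: one minus the normalized Hamming distance."""
    return 1.0 - bin(a ^ b).count("1") / bits


if __name__ == "__main__":
    legit = "<html><body><form><input><input><button></button></form></body></html>"
    suspect = "<html><body><div><form><input><input><button></button></form></div></body></html>"
    score = similarity(simhash(tag_shingles(legit)), simhash(tag_shingles(suspect)))
    print(f"structural similarity score: {score:.3f}")
```

A score close to 1 suggests that the two pages share most of their tag structure. A comparable score could be derived from screenshots with an image hash, and the resulting vector of scores passed to a classifier.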


