SEMPHISH: A Phishing Detection Tool Based on Semantic Hashes

Romeo, Francesco; Blefari, Francesco; Pironti, Francesco Aurelio; Lupinacci, Matteo; Furfaro, Angelo
2025-01-01

Abstract

Phishing is a type of social engineering attack in which users are deceived into performing specific actions, often under the guise of a legitimate organization such as Google, their employer, or a financial institution. This paper presents SEMPHISH, a phishing detection tool that uses semantic hashes and machine learning techniques to identify webpages that visually or structurally mimic well-known legitimate websites. The approach applies semantic hashing to both the source code and the screenshots of webpages to compute similarity scores, which are then analyzed by machine learning-based classifiers. To evaluate SEMPHISH, a custom dataset was built, and multiple performance metrics were measured through extensive experimentation with various machine learning algorithms. These experiments made it possible to assess the impact of each similarity score on detection performance individually, to evaluate the hybrid approach that combines multiple scores, and to determine the best algorithm and parameter configuration for detecting, preventing, and mitigating phishing threats. The best-performing configuration of SEMPHISH, which uses the eXtreme Gradient Boosting classifier, achieved an accuracy of 95.15% and an F1-score of 94.99% on the analyzed dataset.
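As a rough illustration of how a hash-based similarity score over page source might be computed, the sketch below uses SimHash over word tokens and a normalized Hamming distance. This is an assumption made for illustration only: the abstract does not state which semantic hashing scheme SEMPHISH uses, and the function names `simhash` and `similarity` are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over lowercase word tokens.

    Illustrative stand-in only, not the hashing scheme used by SEMPHISH.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Return 1 minus the normalized Hamming distance, a score in [0, 1]."""
    return 1.0 - bin(a ^ b).count("1") / bits


if __name__ == "__main__":
    # Hypothetical page sources, invented for this example.
    reference = "<form action='/login'><input name='user'><input name='password'></form>"
    candidate = "<form action='/signin'><input name='user'><input name='password'></form>"
    score = similarity(simhash(reference), simhash(candidate))
    print(f"source-code similarity score: {score:.3f}")
```

In a pipeline like the one the abstract describes, a score of this kind computed on the source code would be paired with an analogous score computed on screenshots, and the resulting feature vector would be passed to a classifier.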
2025
classification
machine learning
phishing detection
semantic hashing
web security

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/392561