Malicious URL detection using ML and AI

Manu Khandelwal
7 min read · Jan 27, 2021
Open source tool flags malicious domains with Support Vector Machine algorithm

Problems faced by internet users

The openness of the World Wide Web (Web) has left it increasingly exposed to cyber-attacks. Attackers often carry out these attacks through malicious Uniform Resource Locators (URLs), since URLs are so widely used by internet users.

A malicious URL is a clickable link embedded within the content of an email, created with the sole purpose of compromising the recipient. By clicking an infected malicious URL, you may download malware or a Trojan that can open back doors to your devices, or you may be persuaded to provide sensitive information on a fake website, such as Office 365 credentials. The most common email scams with malicious URLs involve the delivery of targeted spam and phishing.

Malicious Web sites are a cornerstone of Internet criminal activities. They host a variety of unwanted content ranging from spam-advertised products, to phishing sites, to dangerous “drive-by” exploits that infect a visitor’s machine with malware. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. The most prominent existing approaches to the malicious URL problem are manually-constructed blacklists, as well as client-side systems that analyze the content or behavior of a Web site as it is visited. The premise of this approach is that we should be able to construct a lightweight URL classification system that simultaneously overcomes the challenges that face blacklists (whose manual updates can quickly become obsolete) and client-side systems (which are difficult to deploy on a large scale because of their high overhead).

Existing System to eliminate Malicious URLs

In the Web browsing context, blacklists are precompiled lists (or databases) that contain IP addresses, domain names or URLs of malicious sites that users should avoid. (By contrast, whitelists contain sites that are known to be safe.) To cross-reference a site against a blacklist, users submit a given IP address, domain name or URL to the blacklist’s query service and receive a response indicating whether the site is in the blacklist. The most popular implementations of blacklist query services are domain name system-based (DNS-based) blacklists, browser toolbars and network appliances.
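The cross-referencing step can be sketched as a simple lookup, assuming an in-memory set of entries; the domain names below are hypothetical placeholders:

```python
# Minimal sketch of a blacklist/whitelist query service. Real services keep
# the lists in a database or DNS zone; here they are plain Python sets, and
# every entry is a made-up placeholder domain.
BLACKLIST = {"malware-payload.example", "phish-login.example"}
WHITELIST = {"example.com"}

def check_site(domain: str) -> str:
    """Cross-reference a domain against the whitelist and blacklist."""
    domain = domain.lower().rstrip(".")
    if domain in WHITELIST:
        return "safe"        # known to be safe
    if domain in BLACKLIST:
        return "malicious"   # listed as malicious
    return "unknown"         # not listed either way

print(check_site("phish-login.example"))  # -> malicious
```

Note that the "unknown" case is exactly where blacklists fall short: a freshly registered malicious domain returns "unknown" until someone adds it to the list.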

DNS-Based Blacklists

DNS blacklists are a query service implemented on top of DNS. Users submit a query (representing the IP address or the domain name in question) to the blacklist provider’s special DNS server, and the response is an IP address that indicates whether the query was present in the blacklist.
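The query mechanics can be illustrated by constructing the DNS name a client would look up: the octets of the IP address are reversed and prepended to the provider's zone. The zone name below is a hypothetical placeholder, and no actual DNS lookup is performed:

```python
# Sketch of forming a DNS-based blacklist query name. The blacklist zone
# "dnsbl.example.org" is a made-up placeholder for a real provider's zone.
def dnsbl_query_name(ip: str, zone: str = "dnsbl.example.org") -> str:
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

# A listed address typically resolves to an answer in 127.0.0.0/8,
# while an NXDOMAIN response means the address is not on the blacklist.
print(dnsbl_query_name("203.0.113.7"))  # -> 7.113.0.203.dnsbl.example.org
```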

Browser Toolbars

Browser toolbars provide a client-side defense for users. Before a user visits a site, the toolbar intercepts the URL from the address bar and cross-references a URL blacklist, which is often stored locally on the user’s machine or on a server that the browser can query. If the URL is present on the blacklist, then the browser redirects the user to a special warning screen that provides information about the threat. The user can then decide whether to continue to the site.

Network Appliances

Dedicated network hardware is another popular option for deploying blacklists. These appliances serve as proxies between user machines within an enterprise network and the rest of the Internet. As users within an organization visit sites, the appliance intercepts outgoing connections and cross-references URLs or IP addresses against a precompiled blacklist. The operating principle is similar to browser toolbars, but the positioning of the appliance at the network gateway saves the overhead of installing special software on all machines within the organization.

A way to eliminate the problems caused by attackers

The main components of a malicious URL detection system

In this section, we provide a detailed discussion of our approach to classifying site reputation. We categorize the features that we gather for URLs as being either lexical or host-based.

Lexical features: The justification for using lexical features is that URLs of malicious sites tend to “look different” in the eyes of the users who see them. Hence, including lexical features allows us to methodically capture this property for classification purposes, and perhaps infer patterns in malicious URLs that we would otherwise miss through ad-hoc inspection. For the purpose of this discussion, we want to distinguish the two parts of a URL: the hostname and the path. As an example, with the URL www.geocities.com/usr/index.html, the hostname portion is www.geocities.com and the path portion is /usr/index.html. Lexical features are the textual properties of the URL itself (not the content of the page it references). We use a combination of features suggested by the studies of McGrath and Gupta [MG08] and Kolari et al. [KFJ06]. These properties include the length of the hostname, the length of the entire URL, as well as the number of dots in the URL — all of these are real-valued features.
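The lexical features listed above can be computed directly from the URL string; a minimal sketch in Python using the standard library:

```python
# Sketch of extracting the lexical features described above: hostname length,
# total URL length, and the number of dots in the URL. Only the URL string is
# examined, never the page content.
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    return {
        "url_length": len(url),           # length of the entire URL
        "hostname_length": len(hostname), # length of the hostname portion
        "num_dots": url.count("."),       # number of dots in the URL
    }

feats = lexical_features("http://www.geocities.com/usr/index.html")
# feats == {"url_length": 39, "hostname_length": 17, "num_dots": 3}
```

A fuller implementation following [MG08] and [KFJ06] would also tokenize the hostname and path on delimiters and treat each token as a binary feature.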

Host-based features: Host-based features describe “where” malicious sites are hosted, “who” they are managed by, and “how” they are administered. The reason for using these features is that malicious Web sites may be hosted in less reputable hosting centers, on machines that are not conventional Web hosts, or through disreputable registrars. The following are properties of the hosts (there could be multiple) that are identified by the hostname part of the URL. We note some of these features overlap with lexical properties of the URL.

1. IP address properties — Are the IPs of the A, MX or NS records in the same autonomous systems (ASes) or prefixes as one another? To what ASes or prefixes do they belong? If the hosting infrastructure surrounding malicious URLs tends to reside in a specific IP prefix or AS belonging to an Internet service provider (ISP), then we want to account for that disreputable ISP during classification.

2. WHOIS properties — What is the date of registration, update, and expiration? Who is the registrar? Who is the registrant? Is the WHOIS entry locked? If a set of malicious domains are registered by the same individual, we would like to treat such ownership as a malicious feature. Moreover, if malicious sites are taken down frequently, we would expect the registration dates to be newer than for legitimate sites.

3. Domain name properties — What is the time-to-live (TTL) value for the DNS records associated with the hostname? Additionally, the following domain name properties are used in the SpamAssassin Botnet plugin for detecting links to malicious sites in emails: Does the hostname contain “client” or “server” keywords? Is the IP address in the hostname? Is there a PTR record for the host? Does the PTR record in turn resolve to one of the host’s IP addresses?

4. Blacklist membership — Is the IP address in a blacklist? In the evaluations reported in the original study, only 55% of malicious URLs were present in blacklists. Thus, although this feature is useful, it is still not comprehensive.

5. Geographic properties — In which continent/country/city does the IP address belong? As with IP address properties, hotbeds of malicious activity could be concentrated in specific geographic regions.

6. Connection speed — What is the speed of the uplink connection (e.g., broadband, dial-up)? If some malicious sites tend to reside on compromised residential machines (connected via cable or DSL), then we want to record the host connection speed.
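As a hypothetical sketch, the host-based properties above could be encoded into a feature dictionary once the underlying lookups (WHOIS, DNS, geolocation, blacklist) have been performed. Those lookups require external services, so here their results arrive as pre-fetched inputs and only the encoding step is shown:

```python
# Hypothetical sketch of assembling host-based features. All inputs are
# assumed to come from prior WHOIS/DNS/geolocation/blacklist lookups; the
# values used below are made up for illustration.
from datetime import date

def host_features(whois_created: date, today: date,
                  asn: int, country: str, in_blacklist: bool) -> dict:
    return {
        # Newly registered domains are more suspect than long-lived ones.
        "domain_age_days": (today - whois_created).days,
        "asn": asn,                         # AS of the A record's IP
        "country": country,                 # geographic property of the IP
        "in_blacklist": int(in_blacklist),  # blacklist membership bit
    }

feats = host_features(date(2021, 1, 1), date(2021, 1, 27),
                      asn=64512, country="US", in_blacklist=True)
# feats["domain_age_days"] == 26
```

Categorical values such as the ASN or country would then be one-hot encoded before being handed to a classifier, alongside the lexical features.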

Malicious URL detection plays a critical role in many cybersecurity applications, and machine learning approaches are clearly a promising direction. We are planning to conduct a comprehensive and systematic survey of malicious URL detection using machine learning techniques, and in particular to offer a systematic formulation of malicious URL detection from a machine learning perspective.
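To make the machine learning formulation concrete, here is a toy sketch that learns a linear decision boundary over URL feature vectors. The subtitle mentions a Support Vector Machine; a real system would use a library implementation such as scikit-learn's LinearSVC, but a plain perceptron on made-up data illustrates the same idea of separating malicious from benign URLs in feature space:

```python
# Toy linear classifier over URL features. Labels: +1 = malicious,
# -1 = benign. The feature vectors and labels below are fabricated
# purely for illustration.
def train_perceptron(samples, labels, epochs=10, lr=0.1):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Features per URL: [url_length / 100, num_dots, in_blacklist]
X = [[0.2, 1, 0], [0.3, 2, 0], [0.9, 6, 1], [0.8, 5, 1]]
y = [-1, -1, 1, 1]
w, b = train_perceptron(X, y)
print(predict(w, b, [0.85, 5, 1]))  # -> 1 (classified malicious)
```

An SVM differs in that it maximizes the margin around the boundary rather than stopping at any separating hyperplane, which generally gives better generalization on unseen URLs.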


Written by:

Aman Gupta

Dev Jindani

Manu Khandelwal
