Detecting Malicious Websites
ABSTRACT:
To expose more users to threats of drive-by download attacks, attackers compromise vulnerable websites discovered by search engines and redirect clients to malicious websites created with exploit kits. Security researchers and vendors have tried to prevent the attacks by detecting malicious data, i.e., malicious URLs, web content, and redirections. However, attackers conceal a part of malicious data with evasion techniques to circumvent detection systems. In this paper, we propose a system for detecting malicious websites without collecting all malicious data. Even if we cannot observe a part of malicious data, we can always observe compromised websites. Since vulnerable websites are discovered by search engines, compromised websites have similar traits. Therefore, we built a classifier by leveraging not only malicious websites but also compromised websites. More precisely, we convert all websites observed at the time of access into a redirection graph and classify it by integrating similarities between its subgraphs and redirection subgraphs shared across malicious, benign, and compromised websites. As a result of evaluating our system with crawling data of 455,860 websites, we found that the system achieved a 91.7% true positive rate for malicious websites containing exploit URLs at a low false positive rate of 0.1%. Moreover, it detected 143 more evasive malicious websites than conventional systems.
EXISTING SYSTEM:
- From different perspectives, many systems have been proposed for detecting malicious websites launching drive-by download attacks. Here, we describe conventional systems focused on large-scale user traffic, system behavior, and web content and redirections.
- One approach for detecting malicious websites is aggregating large-scale user traffic. Attackers redirect clients to the same redirection URL from landing URLs and then redirect them to the exploit URLs targeting their environment. Geographical diversity and uniform client environments can be used as traits of malicious websites.
DISADVANTAGES OF EXISTING SYSTEM:
- At the exploit URL, vulnerabilities in browsers and/or their plugins are exploited, and the client is finally infected with malware.
- These systems require logs provided by anti-virus vendors or large ISPs, and it is generally difficult to obtain these logs.
PROPOSED SYSTEM:
- In this paper, we propose a system for detecting malicious websites without collecting all malicious data. Even if we cannot observe a part of the malicious data, e.g., exploit code and malware, we can always observe compromised websites. Since vulnerable websites are automatically discovered with search engines by attackers, compromised websites have similar traits. Therefore, we built a classifier by leveraging not only malicious websites but also compromised websites.
- More precisely, we convert all websites observed at the time of access into a redirection graph, whose vertices are URLs and edges are redirections between two URLs, and classify it with a graph mining approach. To perform classification, we integrate similarities between its subgraphs and redirection subgraphs shared across malicious, benign, and compromised websites.
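A minimal data-structure sketch of this representation is given below, written in C# to match the project's stated coding language. The type and member names are illustrative assumptions for this write-up, not identifiers from the paper.

```csharp
// Illustrative redirection-graph representation: vertices are URLs and
// directed edges are redirections between two URLs. All names here are
// assumptions made for this sketch.
using System.Collections.Generic;

public class Redirection
{
    public string FromUrl { get; set; }
    public string ToUrl { get; set; }
    public string Method { get; set; }  // e.g., HTTP redirect, iframe, script
}

public class RedirectionGraph
{
    public HashSet<string> Urls { get; } = new HashSet<string>();
    public List<Redirection> Edges { get; } = new List<Redirection>();

    public void AddRedirection(string from, string to, string method)
    {
        Urls.Add(from);
        Urls.Add(to);
        Edges.Add(new Redirection { FromUrl = from, ToUrl = to, Method = method });
    }
}
```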
ADVANTAGES OF PROPOSED SYSTEM:
- It detected more malicious websites employing evasion techniques than conventional systems did. These detected evasive websites were, for example, built by compromising a vulnerable CMS.
- These results show that our system successfully captures redirection subgraphs of compromised websites, not only those of malicious websites.
- We propose a system that detects malicious websites by leveraging all websites observed at the time of access even if all malicious data cannot be collected due to evasion techniques.
- We show that leveraging the redirection subgraphs of benign, compromised, and malicious websites enhances classification performance: the benign subgraphs, such as those of web advertisements, contribute to reducing false positives, while the compromised and malicious subgraphs, such as those of compromised CMSs and exploit kits, contribute to improving true positives.
MODULES:
- Subgraph extraction
- Template construction
- Feature extraction
- Classification
MODULE DESCRIPTIONS:
Subgraph extraction:
We collect the web content at each URL and the methods used for redirections by analyzing websites with a honey client. The websites are represented as redirection graphs, i.e., directed graphs whose vertices are URLs and whose edges are redirections. The structure of a redirection graph most important for detecting malicious websites is the path from the landing URL to the exploit URL. To reduce the computational cost, we therefore extract only subgraphs with this important structure, i.e., path-shaped subgraphs, and exclude subgraphs with branch structures. We also use the information of vertices and edges for subgraph extraction: from path p_{i,j}, we extract subgraph sg_{i,j}, a tuple of a vector containing the information of its vertices and a vector containing the information of its redirections. A redirection graph is thus decomposed into the set of subgraphs extracted from all of its paths.
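The following hedged C# sketch illustrates the decomposition into path-shaped subgraphs by enumerating every loop-free path from the landing URL. The adjacency-list format and all identifiers are assumptions for this sketch; the honey-client data collection itself is out of scope here.

```csharp
// Decompose a redirection graph into path-shaped subgraphs. Each subgraph
// sg_{i,j} is represented as the ordered URLs on a path plus the redirection
// methods along it; identifiers are illustrative only.
using System.Collections.Generic;

public static class SubgraphExtractor
{
    // adjacency: URL -> outgoing redirections as (next URL, method) pairs
    public static List<(List<string> Urls, List<string> Methods)> ExtractPaths(
        Dictionary<string, List<(string To, string Method)>> adjacency,
        string landingUrl)
    {
        var subgraphs = new List<(List<string>, List<string>)>();
        Walk(adjacency, landingUrl,
             new List<string> { landingUrl }, new List<string>(), subgraphs);
        return subgraphs;
    }

    private static void Walk(
        Dictionary<string, List<(string To, string Method)>> adjacency,
        string current, List<string> urls, List<string> methods,
        List<(List<string>, List<string>)> subgraphs)
    {
        bool extended = false;
        if (adjacency.TryGetValue(current, out var edges))
        {
            foreach (var (to, method) in edges)
            {
                if (urls.Contains(to)) continue;  // skip redirection loops
                extended = true;
                urls.Add(to);
                methods.Add(method);
                Walk(adjacency, to, urls, methods, subgraphs);
                urls.RemoveAt(urls.Count - 1);
                methods.RemoveAt(methods.Count - 1);
            }
        }
        if (!extended)
        {
            // Maximal path reached; record it as one path-shaped subgraph.
            subgraphs.Add((new List<string>(urls), new List<string>(methods)));
        }
    }
}
```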
Template construction:
We split redirection graphs into clusters composed of similar redirection graphs and construct templates from the clusters. The similarity used for clustering is defined in a manner similar to the Dice index; the Dice index D between sets A and B is defined as D = 2 × |A ∩ B| / (|A| + |B|). If the intersection of two subgraph sets were defined on the basis of both the redirection information and the maliciousness, the difference in maliciousness would not properly affect the similarity, because the maliciousness is a continuous value. For this reason, we define the intersection of two subgraph sets on the basis of the redirection information only and use the maliciousness as weighting coefficients.
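As a sketch of this weighted, Dice-style similarity, the C# below keys each subgraph by its redirection information and treats a continuous maliciousness score as its weight. The exact weighting scheme is an assumption of this sketch; the paper only states that maliciousness enters as weighting coefficients.

```csharp
// Dice-style similarity between two subgraph sets. Membership in the
// intersection is decided by redirection information (the dictionary keys)
// only; maliciousness scores act as weighting coefficients. Averaging the
// weights over the intersection is an assumption of this sketch.
using System.Collections.Generic;
using System.Linq;

public static class TemplateSimilarity
{
    // Each set maps a canonical redirection-information key to the
    // maliciousness weight of the corresponding subgraph.
    public static double WeightedDice(
        Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double intersection = a.Keys.Intersect(b.Keys)
                               .Sum(k => (a[k] + b[k]) / 2.0);
        double total = a.Values.Sum() + b.Values.Sum();
        return total == 0.0 ? 0.0 : 2.0 * intersection / total;
    }
}
```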
Feature extraction:
A high similarity between the subgraphs of a redirection graph and a template indicates that the redirection graph contains the template as a subgraph. In other words, we can encode the redirection graph of a website into numerical values by calculating the similarities between its subgraphs and the templates. We extract a feature vector x containing the similarities between a subgraph set SG and the templates T_i: x = [S(SG, T_1), …, S(SG, T_N)], where N is the number of templates.
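A short sketch of this step, assuming the similarity S is supplied as a function (here passed as a delegate, since its concrete form comes from the template-construction module):

```csharp
// Feature extraction: x = [S(SG, T_1), …, S(SG, T_N)], one similarity
// value per template. TSet stands for whatever type represents a subgraph
// set; both it and the delegate are assumptions of this sketch.
using System;
using System.Collections.Generic;
using System.Linq;

public static class FeatureExtractor
{
    public static double[] Extract<TSet>(
        TSet subgraphSet,
        IReadOnlyList<TSet> templates,
        Func<TSet, TSet, double> similarity)
    {
        return templates.Select(t => similarity(subgraphSet, t)).ToArray();
    }
}
```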
Classification:
These feature vectors are classified with supervised machine learning. We use random forest, which can classify a large amount of data accurately and quickly; specifically, we use the randomForest package in R. Note that the classification algorithm is not limited to random forest: other supervised machine learning algorithms can also be applied.
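Since the paper performs this step with the randomForest package in R, the C# sketch below only illustrates the supervised train/predict flow behind a hypothetical IClassifier abstraction; it does not correspond to any real .NET library API.

```csharp
// Hypothetical classifier interface: train on labeled template-similarity
// feature vectors, then predict whether a new website is malicious.
using System.Collections.Generic;

public interface IClassifier
{
    void Train(IReadOnlyList<double[]> featureVectors,
               IReadOnlyList<bool> isMalicious);
    bool PredictMalicious(double[] featureVector);
}

public static class ClassificationStep
{
    public static bool Classify(IClassifier model, double[] x)
    {
        // x is the feature vector produced by the feature-extraction module.
        return model.PredictMalicious(x);
    }
}
```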
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Pentium Dual Core
- Hard Disk : 120 GB
- Monitor : 15" LED
- Input Devices : Keyboard, Mouse
- RAM : 1 GB
SOFTWARE REQUIREMENTS:
- Operating System : Windows 7
- Coding Language : C#.NET
- Tool : Visual Studio 2008
- Database : SQL Server 2005