2.1. Introduction
This chapter primarily reviews the available literature in the field under study. Accordingly, it will account for the definitions of concepts and issues that affect website phishing detection using different techniques and approaches. The first part of this chapter will describe phishing and its various classifications. The second part of this chapter will deal with existing techniques and approaches that are related to detecting phishing websites. The third part discusses three types of classifier designs and their impact on website phishing detection. The fourth and the final part of this chapter reviews earlier works related to phishing detection in websites.
2.2. Phishing
The definition of phishing in this context is essentially not so fixed but can be seen like an indisputable fact that changes with respect to the way in which phishing is carried out. More particularly, the use of email and website are the two methods of phishing. Although there are some differences between this two methods but they both share their goals in common.
In addition, phishing can be said to be an online attack used by perpetrators in committing fraud through social engineering schemes via instant messages, emails, or online advertisement to lure users to phishing websites similar to a legitimate website for gaining confidential information about the victim such as password, financial account, personal identification, and financial account numbers, which can then be used for illegal profit (
Liu et al., 2010). As explained by
Abbasi and Chen (2009b), phishing websites can be divided into two common types, namely; spoof and concocted websites. Spoof sites are sham replica of existing commercial websites (
Dhamija et al., 2006,
Dinev, 2006). Commonly spoofed websites include eBay, PayPal, various banking and escrow service providers (
Abbasi and Chen, 2009a), and e-tailers. Spoof websites attempt to steal unsuspecting users’ identities; account logins, personal information, credit card numbers, and so forth. (
Dinev, 2006). Online phish repositories such as PhishTank maintain URLs for millions of verified spoof websites used in phishing attacks intended to mimic thousands of legitimate entities. Fictitious websites mislead users by attempting to give the impression of unique, legitimate commercial entities such as investment banks, escrow services, shipping companies, and online pharmacies (
Abbasi and Chen, 2009b; Abbasi et al., 2012;
Abbasi et al., 2010). The aim of fictitious websites is failure-to-ship scam; swindling customers’ of their money without keeping to their own end of the bargain (
Chua and Wareham, 2004). Both spoof and concocted websites are also commonly used to propagate malware and viruses (
Willis, 2009).
In a personal fraud survey carried out by
Jamieson et al. (2012) indicate the percentage of phishing in identity crime reclassification using publicly available data by Australia Bureau of Statistic (ABS) as a case study. The outcome showed that phishing constitutes a fraction of 0.4% which corresponded to 57,800 victims.
Figures 2.1 and
2.2 represent the survey information.
2.3. Existing anti-phishing approaches
In a study review published by Anti-Phishing Working Group (APWG), there were at least 67, 677 phishing attacks in the last 6 months of 2010 (
A.P.W.G, 2010). A lot of research has been done on anti-phishing in designing various anti-phishing approaches.
Afroz and Greenstadt (2009), categorized the current phishing detection into three main types: (1) non-content-based approaches that do not make use of site content to classify it as authentic or phishing, (2) content-based approaches that make use of site contents to catch phishing, and (3) visual similarity-based approaches that uses visual similarity with known sites to recognize phishing. These approaches are discussed in subsequent sections.
2.3.1. Non-Content-Based Approaches
In a study carried out by
Afroz and Greenstadt (2009), it was claimed that non-content-based approaches include URL and host information based classification of phishing sites, blacklisting, and whitelisting methods. In URL-based schemes, URLs are classified on the basis of both lexical and host features. Lexical features describe lexical patterns of malicious URLs. These include features such as length of the URL, the number of dots, special characters it contains. Host features of the URL include properties of IP address, the owner of the site, DNS properties such as TTL, and geographical location (
Ma et al., 2009). Using these features, a matrix is built and run through multiple classification algorithms. In real-time processing trials, this approach has success rates between 95% and 99%. According to
Afroz and Greenstadt (2009), they used lexical features of URL along with site contents and image analysis to improve performance and reduce false positive cases.
In blacklisting approaches, reports made by users or companies are used to detect phishing websites which are stored in a database. Perhaps the use of this approach by commercial toolbars such as Netcraft, Internet explorer 7, CallingID Toolbar, EarthLink Toolbar, Cloudmark Anti-Fraud Toolbar, GeoTrust TrustWatch Toolbar, Netscape Browser 8.1 has made it very popular amongst other anti-phishing approaches (
Afroz and Greenstadt, 2009). Nonetheless, as most phishing sites are temporary and often times exist for less than 20 hours (
Moore and Clayton, 2007), or change URLs frequently (fast-flux), the URL blacklisting approach fails to identify majority of phishing incidents. Furthermore, a blacklisting approach will fail to detect an attack that is aim at a specific user (spear-phishing), especially those that aim profitable but not extensively used sites such as small brokerages, company intranets, and so forth (
Afroz and Greenstadt, 2009).
Whitelisting approaches seek to identify known good sites (
Chua and Wareham, 2004,
Close, 2009;
Herzberg and Jbara, 2008), but a user must remember to inspect the interface whenever he visits any site. Some whitelisting approaches use server-side validation to add additional authentication metrics (beyond SSL) to client browsers as a proof of its benign nature, For example, dynamic security skins (
Kumaraguru et al., 2007), TrustBar (
Herzberg and Gbara, 2004), SRD (Synchronized Random Dynamic Boundaries) (
Ye et al., 2005).
2.3.2. Content-Based Approaches
According to content-based approach, phishing attacks are detected by investigating site contents. Features used in this approach comprise of password fields, spelling errors, source of the images, links, embedded links, and so forth alongside URL and host-based features. SpoofGuard (
Chou et al., 2004) and CANTINA (
Zhang et al., 2007) are two examples of content-based approach. In addition, Google’s anti-phishing filter detects phishing and malware by examining page URL, page rank, WHOIS information and contents of a page including HTML, javascript, images, iframe, and so forth (
Whittaker et al., 2010). The classifier is frequently retrained with new phishing sites to learn new trends in phishing. This classifier has high accuracy but is presently implemented offline as it takes 76 seconds on average to detect phishing. Some researchers studied fingerprinting and fuzzy logic-based approaches that use a series of hashes of websites to identify phishing sites (
Aburrous et al., 2008;
Zdziarski et al., 2006). Furthermore, experimentation of
Afroz and...