What Do Phishing Sites Look Like? (Study)


In this post we take a closer look at the phishing tactics currently employed in the wild. The trends uncovered by analyzing our new data set of 5000 recent phishing sites will change the way you think about phishing.

One of my current research projects, with Jing and a bunch of people from the University of Michigan, is to develop an in-browser defense against phishing that will be able to detect phishing sites as soon as they are created. Instead of relying on a blacklist, it will use vision and machine learning algorithms.

Before setting out to find the best way to do this, we needed to understand why detecting phishing sites is so difficult. There is little information on how phishers operate in the wild, so we ran our own experiment and analyzed around 5000 recent phishing websites. It turns out that the results of this preliminary analysis are interesting by themselves and shed new light on current phisher behavior, so I decided to share them with you via this blog post.

Methodology

Before delving into the results, let me explain how we got them. First, we collected phishing URLs via Phishtank, which is the best resource for getting phishing URLs. Next, we fed these URLs to our crawler, which took a screenshot and collected a bunch of information for each site. Then we used Amazon Mechanical Turk (as usual :) ) to have humans review each screenshot and augment our data set with “human intelligence”. To make sure our data set is clean, we had every phishing site screenshot analyzed by three different Turkers. Finally, we processed the data reported by the Turkers to compute the results that we are going to discuss. In particular, we discarded meaningless results and used a voting system to come up with a stable data set. In the end, we ended up with data about 1000 phishing websites. It might not seem like a lot of work, but trust me, it took us a lot of effort to get there :)
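To give an idea of what the discard-and-vote step can look like, here is a minimal sketch in Python. It is not our actual processing script: the label names ("fake", "scam"), the two-out-of-three agreement rule, and the example URLs are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical answer options; the real task form may have used other labels.
VALID_LABELS = {"fake", "scam"}


def majority_label(answers, min_votes=2):
    """Return the label that at least `min_votes` reviewers agree on, else None."""
    # Discard meaningless answers (blank, missing, or not a known label).
    valid = [a.strip().lower() for a in answers
             if a and a.strip().lower() in VALID_LABELS]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    return label if count >= min_votes else None


def build_dataset(raw):
    """Keep only the sites for which the reviewers reached a majority."""
    dataset = {}
    for url, answers in raw.items():
        label = majority_label(answers)
        if label is not None:
            dataset[url] = label
    return dataset


# Three answers per screenshot, one per Turker (made-up example data).
raw = {
    "http://site-a.example/login": ["fake", "fake", "scam"],
    "http://site-b.example/win": ["scam", "", "scam"],
    "http://site-c.example/x": ["fake", "scam", "n/a"],
}
clean = build_dataset(raw)
print(clean)
```

With a rule like this, a site with no two valid answers in agreement (the third one above) is dropped entirely, which is one way a data set can shrink between crawling and analysis.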

Types of Phishing

There are two kinds of phishing websites: fake sites and scam sites. Fake sites are phishing sites that clone the appearance of the targeted website in the hope that you will confuse the two and enter your credentials (login and password). Here is an example of a Paypal phishing site:

Paypal Fake website

Scam sites try to talk you into entering your credentials for one dubious reason or another. The screenshot below shows a phishing site that attempts to steal your MSN credentials by offering you software that lets you know who blocked you. Notice how the phisher makes it clear that it is safe to use…

MSN credential phishing via a scam

Accordingly, the first question that comes to mind is: which is the phishers' favorite tactic, faking or scamming? Well, it is about equal (48.2% fake sites, 51.8% scam sites), as visible in the graph below:

Phishing Sites Target Type

The next question is what kind of sites phishers are targeting. Are they trying to steal your bank account, your email, or your Facebook account?
As visible in the chart below, among the sites we were able to categorize, financial services such as Paypal and banks are, without any surprise, the most targeted. The next big targets (no surprise here either) are social networks (Facebook, Orkut…). What is surprising is that the third biggest type of target is online games (World of Warcraft in particular), not email accounts. One hypothesis that explains this trend is that reselling stolen online goods is a lucrative business.

Visual Similarity

Another question we asked Turkers was to rate how visually similar fake sites are to the target site they attempt to phish. We asked them to rate each fake phishing site on a scale from 1 to 5, with 1 being completely different and 5 being close to a perfect copy. I was expecting a majority of sites to look very similar to their target. Oh boy, how wrong I was: as visible in the chart below, in reality most fake sites are poorly executed (on purpose, to avoid detection?).

Here are some examples of phishing sites with different levels of visual similarity:

Eve-online phishing site (similarity 5/5 – high resemblance)

Visual similarity 5

World of Warcraft phishing site (similarity 5/5, very similar)

World of Warcraft phishing site (visual similarity 2/5, very few points in common with the original site)

Why Is Detecting Phishing Hard?

So why is detecting phishing hard? Well, the results of our analysis suggest at least two reasons. First, many phishing sites (51.8%) are scam sites, not fake sites, which makes them harder to classify because we don’t have a baseline for them (the real site). Second, the sites that do attempt to fake a real site are often poorly executed and are therefore hard to recognize. While I still believe that machine learning and vision algorithms can yield something (there is previous successful work on this), it is clear that we will need new ideas to deal with scam phishing sites and poorly executed fake sites. Right now, I am thinking of using image content extraction and spatial correlation, but only time will tell if it will work. There is also probably more to the data than what I discussed, so if you have an idea, let me know :)
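To make the baseline problem concrete, here is a minimal sketch of a detector that compares a page against screenshots of known real sites. It uses a simple average hash as a stand-in, which is not the image content extraction and spatial correlation idea mentioned above. The tiny 4x4 grayscale grids, the brand names, and the distance threshold are all made up for illustration.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_brand(candidate, baselines, max_distance):
    """Return (brand, distance) for the nearest baseline, or (None, distance) if too far."""
    cand_hash = average_hash(candidate)
    brand, dist = min(
        ((name, hamming(cand_hash, average_hash(img)))
         for name, img in baselines.items()),
        key=lambda pair: pair[1],
    )
    return (brand, dist) if dist <= max_distance else (None, dist)


# Hypothetical downscaled screenshots of two real login pages.
baselines = {
    "brand_a": [[200, 200, 200, 200],
                [200, 50, 50, 200],
                [200, 50, 50, 200],
                [200, 200, 200, 200]],
    "brand_b": [[200, 50, 200, 50],
                [50, 200, 50, 200],
                [200, 50, 200, 50],
                [50, 200, 50, 200]],
}

# A well-executed fake: same layout as brand_a, slightly different colors.
good_fake = [[190, 210, 200, 200],
             [200, 60, 40, 200],
             [200, 50, 50, 190],
             [200, 200, 200, 200]]

# A scam page: its own layout, not a copy of any real site.
scam_page = [[50, 50, 50, 50],
             [50, 50, 50, 50],
             [200, 200, 200, 200],
             [200, 200, 200, 200]]

fake_result = closest_brand(good_fake, baselines, max_distance=3)
scam_result = closest_brand(scam_page, baselines, max_distance=3)
print(fake_result, scam_result)
```

The well-executed fake lands right on its target, but the scam page is far from every baseline, so a comparison-only detector has nothing to say about it. A poorly executed fake would drift away from its target in the same way.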

Thanks for reading this post. If you like it, please share it with the world, it makes me happy :) You can follow me on Twitter @elie or on Google+

Elie Bursztein is a researcher at Google where he works on fixing Internet security and privacy problems.
  • Brad Wardman

    Elie, I really enjoyed reading your article but want to add (in case you are trying to publish) that there could be issues with some of your assumptions. First though, I think that you did a good job on how you collected and analyzed your data set. Academic. 

    I might add that APWG is another free avenue (for EDU) for more URLs. And to that point, your data set is too small to make assumptions that there is a 50/50 split between scam/phish. I would also recommend to use the same system to get more labeled URLs for your data set, as 5,000 is small IMHO for phishing detection papers in 2011. Although plenty of papers still get published with around the same number of URLs or emails. 

    Another thing that I have noticed in phishing websites is that phishers reuse kits from the past (that mimic past versions of the spoofed organization’s website), so that may be another reason for the lack of visual similarity. This is another reason (other than scam sites) why you need a training data set instead of just comparing against real organizational websites. 

    Image content extraction and spatial correlation is a good technique for detection, as shown by other phishing and spam researchers (sorry, I do not have the refs on me). I do know that Dr. Zhang’s group at UAB has done some of this. You may also want to include some additional features that others have used (Whittaker et al. “Large-Scale Automatic Classification…” 2010 or Xiang & Hong “A Hybrid Phish Detection…” 2009) such as the appearance of certain terms, titles, presence of login, and so forth. 

    Sorry for the ramble, just trying to give suggestions.  Looking forward to your work and again thanks for the blog post.

    • https://elie.im Elie Bursztein

      Hey Brad,
      Thanks for the suggestions, they are super useful!
      I agree we need more data, and APWG seems like a great resource. I will see if we can get data from there.

      5000 is probably too small, but these were the only ones still up when we ran the first batch of analysis. We will keep collecting them, and I expect to have a huge data set by the time we write a full-fledged research paper.

      Thanks for the references, I will read all of them. I already came across some of them (the 2009 one in particular). If any other papers come to mind, please let me know.

      Thanks again for the comment :)
