The spam business took off in 2002 and grew into a massive phenomenon, accounting for 72% of email traffic in 2004 and reaching its peak in 2010, when 89% of emails were spam. Today, user mailboxes also receive large amounts of other types of bulk email. According to Hotmail reports from 2012, newsletters and automated notification messages made up 50% of inbox messages. However, conventional anti-spam filters have difficulty automatically differentiating unsolicited from solicited bulk emails. Therefore, while most existing research studies the efficiency of anti-spam techniques, this thesis focuses on those few cases in which the existing techniques fail. We limit our study to the often overlooked area of gray emails, i.e., those ambiguous messages that cannot be clearly categorized one way or the other by automated spam filters.
We approach the study of the gray area by treating gray emails as bulk emails and focusing on the analysis of email campaigns. We propose a three-phase approach, consisting of message clustering, classification, and graph-based refinement, that relies only on email header data. During the study of the gray area, we identify three email campaign categories (commercial, newsletters, and botnet) for which our classification method works well. To identify 419 scam campaigns, a form of advance-fee fraud that relies primarily on gaining the victim's confidence, we instead propose a technique based on phone numbers. We then use this technique to identify and characterize 419 scam campaigns, describing several illustrative examples that demonstrate the diversity of such campaigns and their international geographic distribution.
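As an illustration of the phone-number idea, the following minimal sketch groups messages into candidate campaigns whenever they share a phone number. It is not the method used in the thesis: the regular expression, the normalization to a leading "+", the assumption that numbers appear in message bodies in international format, and the names `extract_phones` and `group_by_phone` are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict

# Assumed pattern: a leading "+" or "00", then 8 to 15 digits with
# optional single separators (space, dot, or hyphen) between them.
PHONE_RE = re.compile(r"(?:\+|00)\d(?:[\s.\-]?\d){7,14}")


def extract_phones(body):
    """Return the set of normalized phone numbers found in a message body."""
    phones = set()
    for match in PHONE_RE.findall(body):
        digits = re.sub(r"\D", "", match)
        if match.startswith("00"):
            digits = digits[2:]  # treat the "00" prefix like "+"
        phones.add("+" + digits)
    return phones


def group_by_phone(messages):
    """Group message ids so that messages sharing a phone number end up together.

    `messages` maps a message id to its body text. Messages without any
    recognizable phone number are left out of the result.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # phone number -> first message id that used it
    for msg_id, body in messages.items():
        phones = extract_phones(body)
        if not phones:
            continue
        parent.setdefault(msg_id, msg_id)
        for phone in phones:
            if phone in first_seen:
                union(first_seen[phone], msg_id)
            else:
                first_seen[phone] = msg_id

    groups = defaultdict(set)
    for msg_id in parent:
        groups[find(msg_id)].add(msg_id)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    sample = {
        "m1": "Call +44 7000 123456 now",
        "m2": "Contact 0044-7000-123456 or +234 803 555 0101",
        "m3": "My agent: +234.803.555.0101",
        "m4": "Reach me at +1 202 555 0199",
        "m5": "no phone here",
    }
    print(group_by_phone(sample))
```

In this sketch, two messages are linked even if they share a number only indirectly through a third message, which is one plausible reading of "campaign"; a stricter definition would need a different grouping rule.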