Wading through the noise, or how to make sense of scattered e-mail addresses

‘Making sense of email addresses on drives’ by Neil C. Rowe, Riqui Schwamm, Michael R. McCarrin (U.S. Naval Postgraduate School, Computer Science Department), and Ralucca Gera (Applied Mathematics Department)
Best Paper Award at ICDF2C 2016, 8th EAI International Conference on Digital Forensics & Cyber Crime

Investigators of cyber crime rely on many kinds of physical and digital evidence, and hard drives are among the most useful. Drives often contain email addresses, which can be used to build a picture of the social networks in which the drive owner participated. Information gathered this way is usually more reliable than what we can infer from publicly available data on online social networks, if only because every user has the ability to choose what is and what isn’t publicly available. But hard drives are large, and there is plenty of noise to wade through if you’re looking for a specific type of information. Thus, the demand for new methods that can filter data based on interestingness is huge.

Thus far, little attention has been paid to mining email addresses from drives, classifying them, or connecting them to social networks. Work has been done on classifying email messages from their headers, but headers provide significantly richer contextual information than lists of email addresses scattered over a drive. What the authors of this paper set out to do essentially equates to searching for needles in haystacks, but these needles could hold valuable information.

They worked with 2,401 drives from 36 countries, representing a range of business, government, and home users. They ran the Bulk Extractor tool to extract all email addresses, which bypasses the file system and searches the raw drive bytes for patterns that appear to be email addresses. The extraction yielded respectable numbers: 292,347,920 addresses averaging 28.4 characters each, of which 17,544,550 were distinct.
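
To make the idea of searching raw bytes for address-like patterns more concrete, here is a minimal Python sketch. It is not how Bulk Extractor is implemented and not the authors' code. The regular expression is a deliberately simplified assumption, and the function name scan_bytes and the sample buffer are hypothetical. It only illustrates why the total number of addresses found can be far larger than the number of distinct ones.

```python
import re
from collections import Counter

# Simplified, assumed pattern for email-like ASCII byte strings.
# Real forensic scanners handle many more cases (other encodings,
# compressed data, stricter validation of the domain part).
EMAIL_RE = re.compile(
    rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

def scan_bytes(data: bytes) -> Counter:
    """Count email-like strings found anywhere in a raw byte buffer."""
    counts = Counter()
    for match in EMAIL_RE.finditer(data):
        # Normalise case so repeated sightings collapse into one entry.
        counts[match.group().decode("ascii").lower()] += 1
    return counts

# Hypothetical sample standing in for raw drive bytes.
sample = (
    b"\x00\x01alice@example.com\xff\xfejunk "
    b"bob@mail.example.org\x00ALICE@example.com\x00"
)

counts = scan_bytes(sample)
total = sum(counts.values())   # every occurrence found
distinct = len(counts)         # unique addresses after normalisation
print(total, distinct)
```

On a real drive image the buffer would be read in chunks rather than held in memory, and matches that straddle chunk boundaries would need extra care. The sketch ignores both concerns.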

What followed was serious data-crunching. To learn more about the method, test setup, elimination of uninteresting addresses, and visualization of email networks and drive similarities, we recommend getting the full paper here.