Beyond precision and recall: Understanding uses (and misuses) of similarity hashes in binary analysis

Pagani, Fabio; Dell'Amico, Matteo; Balzarotti, Davide
CODASPY 2018, 8th ACM Conference on Data and Application Security and Privacy, March 19-21, 2018, Tempe, AZ, USA

Fuzzy hashing algorithms provide a convenient way of summarizing in a compact form the content of files, and of looking for similarities between them. Because of this, they are widely used in the security and forensics communities to look for similarities between binary program files; one version of them, ssdeep, is the de facto standard to share information about known malware. Fuzzy hashes are quite pervasive, but no study so far answers conclusively the question of which (if any) fuzzy hashing algorithms are suited to detect similarities between programs, where we consider as similar those programs that have code or libraries in common. We measure how four popular algorithms perform in different scenarios: when they are used to correlate statically-compiled files with the libraries they use, when compiled with different flags or different compilers, and when applied to programs that share a large part of their source code. Perhaps more importantly, we provide interpretations that explain the reasons why results vary, sometimes widely, among apparently very similar use cases. We find that the low-level details of the compilation process, together with the technicalities of the hashing algorithms, can explain surprising results such as similarities dropping to zero with the change of a single assembly instruction. More in general, we see that ssdeep, the de facto standard for this type of analysis, performs definitely worse than alternative algorithms; we also find that the best choice of algorithm to use varies depending on the particularities of the use case scenario.

Digital Security
Eurecom Ref:
© ACM, 2018. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CODASPY 2018, 8th ACM Conference on Data and Application Security and Privacy, March 19-21, 2018, Tempe, AZ, USA