The ability to accurately compute the similarity between two pieces of binary code plays an important role in a wide range of problems. Several research communities, including security, programming language analysis, and machine learning, have been working on this topic for more than five years, with hundreds of papers published on the subject. One would expect that, by now, it would be possible to answer research questions that go beyond the specific techniques presented in individual papers and generalize to the entire research field. Unfortunately, this topic is affected by a number of challenges, ranging from reproducibility issues to the opaqueness of research results, which hinder meaningful and effective progress. In this paper, we set out to perform the first measurement study on the state of the art of this research area. We begin by systematizing the existing body of research. We then identify a number of relevant approaches, which are representative of a wide range of solutions recently proposed by three different research communities. We re-implemented these approaches and created a new dataset (with binaries compiled with different compilers, optimization settings, and for three different architectures), which enabled us to perform a fair and meaningful comparison. This effort allowed us to answer a number of research questions that go beyond what could be inferred by reading the individual research papers. By releasing our entire modular framework and our datasets (with associated documentation), we also hope to inspire future work in this interesting research area.
How machine learning is solving the binary function similarity problem
USENIX 2022, 31st USENIX Security Symposium, 10-12 August 2022, Boston, MA, USA
Type: Conference
City: Boston
Date: 2022-08-10
Department: Digital Security
Eurecom Ref: 6847
Copyright: Copyright Usenix. Personal use of this material is permitted. The definitive version of this paper was published in USENIX 2022, 31st USENIX Security Symposium, 10-12 August 2022, Boston, MA, USA and is available at :
PERMALINK : https://www.eurecom.fr/publication/6847