ghidriff BSIM

Background

The introduction of BSIM in Ghidra 11.0 gives ghidriff a powerful new function correlator. The Ghidra documentation describes it as follows:

The BSim Program Correlator uses the decompiler to generate confidence scores between potentially matching functions in the source and destination programs. Function call-graphs are used to further boost the scores and distinguish between conflicting matches.

The decompiler generates a formal feature vector for a function, where individual features are extracted from the control-flow and data-flow characteristics of its normalized p-code representation.

Functions are compared by comparing their corresponding feature vectors, from which similarity and confidence scores are extracted.

A confidence score, for this correlator, is an open-ended floating-point value (ranging from -infinity to +infinity) describing the amount of correspondence between the control-flow and data-flow of two functions. A good working range for setting thresholds (below) and for describing function pairs with some matching features is 0.0 to 100.0. A score of 0.0 corresponds to functions with roughly equal amounts of similar and dissimilar features. A score of 10.0 is typical for small identical functions, and 100.0 is achieved by pairs of larger sized identical functions.

Source: Ghidra BSIM Docs
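To make the feature-vector comparison concrete, here is a minimal Python sketch. It is not BSIM's implementation: it assumes each function has already been reduced to a bag of feature hashes, scores two bags with plain cosine similarity, and then filters candidate pairs with thresholds. The feature hashes, function names, confidence values, and threshold defaults are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(count * b[feat] for feat, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def filter_matches(pairs, min_similarity=0.7, min_confidence=10.0):
    """Keep (src, dst, similarity, confidence) tuples that clear both thresholds.

    The default thresholds are illustrative assumptions, not ghidriff's values.
    """
    return [p for p in pairs if p[2] >= min_similarity and p[3] >= min_confidence]


# Hypothetical feature hashes (key) and occurrence counts (value).
f_old = Counter({0xA1: 2, 0xB2: 1, 0xC3: 1})
f_new = Counter({0xA1: 2, 0xB2: 1, 0xD4: 1})    # one feature changed
f_other = Counter({0xE5: 3})                    # unrelated function

# Confidence values here are made-up inputs; the sketch does not compute them.
candidates = [
    ("FUN_old_1", "FUN_new_1", cosine_similarity(f_old, f_new), 42.0),
    ("FUN_old_1", "FUN_new_2", cosine_similarity(f_old, f_other), -3.5),
]
kept = filter_matches(candidates)
for src, dst, sim, conf in kept:
    print(f"{src} -> {dst}: similarity={sim:.3f} confidence={conf}")
```

Only the first candidate pair survives the filter, since the second shares no features with the original function and carries a negative confidence.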

BSIM correlator first impressions

  • The BSIM correlator is great for matching. It is a net improvement for #ghidriff, although some custom #ghidriff correlators were already providing similar (if less accurate) structural matching 💪
  • Speculation: 🧐 BSIM is the reason Ghidra Version Tracking lacked structural matching heuristics, and that gap is why ghidriff has its own structural function matching. BSIM is a more accurate and powerful version of the same idea.
  • Adding BSIM to #ghidriff slows it down a bit. BSIM decompiles every function so it can match on data flow and call graphs, and #ghidriff already decompiles functions to make its own matching decisions. The integration has been optimized. 🤓

ghidriff BSIM correlation options

BSIM Options:
  --bsim, --no-bsim     Toggle using BSIM correlation (default: True)
  --bsim-full, --no-bsim-full
                        Slower but better matching. Use only when needed (default: False)
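The paired `--flag, --no-flag` layout in that help text is what Python's `argparse.BooleanOptionalAction` produces. The sketch below shows how such options behave when parsed. It assumes ghidriff defines its flags this way, which I have not confirmed from its source; the parser name `ghidriff-sketch` is hypothetical and this is not ghidriff's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser that mirrors only the two BSIM options."""
    parser = argparse.ArgumentParser(prog="ghidriff-sketch")
    group = parser.add_argument_group("BSIM Options")
    group.add_argument(
        "--bsim",
        action=argparse.BooleanOptionalAction,  # also registers --no-bsim
        default=True,
        help="Toggle using BSIM correlation",
    )
    group.add_argument(
        "--bsim-full",
        action=argparse.BooleanOptionalAction,  # also registers --no-bsim-full
        default=False,
        help="Slower but better matching. Use only when needed",
    )
    return parser


parser = build_parser()
defaults = parser.parse_args([])
no_bsim = parser.parse_args(["--no-bsim"])
full = parser.parse_args(["--bsim-full"])
print(defaults.bsim, defaults.bsim_full)
print(no_bsim.bsim, full.bsim_full)
```

With no arguments, BSIM is on and full matching is off, which matches the defaults in the help text.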

You can run ghidriff with or without BSIM. My recommendation is to leave it enabled. The --bsim-full option lets BSIM match across the full address space. Running full is generally not recommended because it is slower, but it can be worth a try on a complicated diff, as BSIM might pick up some new matches.
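A small sketch of that workflow in Python: build the default command line first, and only add --bsim-full for a second pass if the first diff looks incomplete. The helper name `ghidriff_command` and the file names are hypothetical, and the sketch assumes ghidriff takes the old and new binaries as positional arguments; it only builds the argument list and does not run anything.

```python
import shlex


def ghidriff_command(old, new, bsim=True, bsim_full=False):
    """Build an argv list for ghidriff using only the BSIM flags shown above."""
    cmd = ["ghidriff", old, new]  # assumed positional order: old binary, then new
    if not bsim:
        cmd.append("--no-bsim")       # BSIM is on by default, so only the opt-out is needed
    if bsim_full:
        cmd.append("--bsim-full")     # slower full-address-space matching
    return cmd


first_pass = ghidriff_command("old.bin", "new.bin")
second_pass = ghidriff_command("old.bin", "new.bin", bsim_full=True)

print(shlex.join(first_pass))
print(shlex.join(second_pass))
```

The resulting lists can be passed to `subprocess.run` once ghidriff is installed and on the PATH.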