Fuzzy Duplication Detection in Flay
Published 2013-03-28 @ 12:00
Tagged flay
1 2 3 4 5 6 7 8 9 10 11 |
def a f1; f2; f3; f4; f5; f6 end def b f2; f3; f4; f5; f6 end def c f2; f3; f4; f5; f6; f7 end |
Given these three methods: a
, b
, and c
, flay will currently
report that a
and c
are structurally similar.
% flay -m 5 bogus_example.rb
Total score (lower is better) = 16
1) Similar code found in :defn (mass = 16)
bogus_example.rb:1
bogus_example.rb:9
In c
, The call to f1
has been removed and a call to f7
has been
added. Structurally, the two methods are the same. What does that mean
“structurally the same”? It means that if you strip all the names and
values out, the code has the same tree shape (and node types, but
that’s a digression). Flay is just an intelligent tree matcher.
But what about when they’re not structurally the same? What happens
when a sloppy developer copies a method and comments out one or two
expressions (as in b
above)? That gets past flay currently because
the trees aren’t the same shape anymore. What flay needs is to be able
to say that they’re a “fuzzy” match and report them.
I can quickly think of 2 ways to detect that b
is similar to a
&
c
. The first is to apply some sort of tree-based distance algorithm
to the diagonal matrix of all applicable nodes and reporting any
combinations that have a distance under a certain threshold. Quite
frankly, I’m hesitant to go this route simply because of the
complexity of it. Wikipedia hints that the problem of graph
isomorphism falls into NP but it is unknown whether it falls into
NP-complete or P (tho this is a special case because ASTs are planar
graphs and that restricts a lot of the complexity). It sounds like an
interesting problem, but not necessarily one I want to take on for
flay. Flay is currently conceptually really “simple” and I’d like to
keep it that way.
The second way is to try to piggyback on flay’s current main data structure and algorithm and let it do all of the work the way it currently does. Flay stores a bunch of “buckets” of structurally similar nodes that looks something like this:
1 |
all_nodes.group_by(&:structural_hash).delete_if { |_,v| v.size == 1 } |
(boy do I wish I could actually write it that simply!)
If I could get subset matches into that structure, then I’m done. It
turns out that this isn’t that hard and only took me an afternoon to
get something scratched up that felt right and was efficient enough to
use against real code bases. Running flay rails
vs flay -f rails
only adds 30 seconds and 19 megs of RSS, but it also detected a lot
more resulting in a flay score increase of 9464 and ~700 extra lines
of report output.
So… what’s a subset? Well… in this case, I’m mostly interested in
detecting when a developer copies a method (or whatever) and simply
adds or removes one or two things. Much more than that and you’re
going to wind up with a mess in your reports. By using
Array#combination
I can generate a bunch of variations on an AST and
then structurally compare those to the original ASTs.
Something like this:
1 2 3 4 5 |
(ast.size - 1).downto(ast.size - difference) do |n| ast.combination(n).each do |subast| self.hashes[subast.structural_hash] << subast end end |
Of course, it isn’t that clean and simple. Much more went into it to produce performant code and to prevent nonsensical comparisons… but that’s the gist of it. The nice thing about this is that it uses ruby built-ins to do 90% of the work and then simply folds into flay’s current data structure. I didn’t really have to do much more than that. You can see it work here:
% flay -m 5 -f bogus_example.rb
Total score (lower is better) = 37
1) Similar code found in :defn (mass = 21)
bogus_example.rb:1 (FUZZY)
bogus_example.rb:5
bogus_example.rb:9 (FUZZY)
2) Similar code found in :defn (mass = 16)
bogus_example.rb:1
bogus_example.rb:9
I just committed my initial implementation. Take a look if you’d like.