polishing ruby by ryan davis

Fuzzy Duplication Detection in Flay

Published 2013-03-28 @ 12:00

Tagged flay

1
2
3
4
5
6
7
8
9
10
11
def a
  f1; f2; f3; f4; f5; f6
end

def b
      f2; f3; f4; f5; f6
end

def c
      f2; f3; f4; f5; f6; f7
end

Given these three methods: a, b, and c, flay will currently report that a and c are structurally similar.

    % flay -m 5 bogus_example.rb

    Total score (lower is better) = 16

    1) Similar code found in :defn (mass = 16)
      bogus_example.rb:1
      bogus_example.rb:9

In c, The call to f1 has been removed and a call to f7 has been added. Structurally, the two methods are the same. What does that mean “structurally the same”? It means that if you strip all the names and values out, the code has the same tree shape (and node types, but that’s a digression). Flay is just an intelligent tree matcher.

But what about when they’re not structurally the same? What happens when a sloppy developer copies a method and comments out one or two expressions (as in b above)? That gets past flay currently because the trees aren’t the same shape anymore. What flay needs is to be able to say that they’re a “fuzzy” match and report them.

I can quickly think of 2 ways to detect that b is similar to a & c. The first is to apply some sort of tree-based distance algorithm to the diagonal matrix of all applicable nodes and reporting any combinations that have a distance under a certain threshold. Quite frankly, I’m hesitant to go this route simply because of the complexity of it. Wikipedia hints that the problem of graph isomorphism falls into NP but it is unknown whether it falls into NP-complete or P (tho this is a special case because ASTs are planar graphs and that restricts a lot of the complexity). It sounds like an interesting problem, but not necessarily one I want to take on for flay. Flay is currently conceptually really “simple” and I’d like to keep it that way.

The second way is to try to piggyback on flay’s current main data structure and algorithm and let it do all of the work the way it currently does. Flay stores a bunch of “buckets” of structurally similar nodes that looks something like this:

1
all_nodes.group_by(&:structural_hash).delete_if { |_,v| v.size == 1 }

(boy do I wish I could actually write it that simply!)

If I could get subset matches into that structure, then I’m done. It turns out that this isn’t that hard and only took me an afternoon to get something scratched up that felt right and was efficient enough to use against real code bases. Running flay rails vs flay -f rails only adds 30 seconds and 19 megs of RSS, but it also detected a lot more resulting in a flay score increase of 9464 and ~700 extra lines of report output.

So… what’s a subset? Well… in this case, I’m mostly interested in detecting when a developer copies a method (or whatever) and simply adds or removes one or two things. Much more than that and you’re going to wind up with a mess in your reports. By using Array#combination I can generate a bunch of variations on an AST and then structurally compare those to the original ASTs.

Something like this:

1
2
3
4
5
(ast.size - 1).downto(ast.size - difference) do |n|
  ast.combination(n).each do |subast|
    self.hashes[subast.structural_hash] << subast
  end
end

Of course, it isn’t that clean and simple. Much more went into it to produce performant code and to prevent nonsensical comparisons… but that’s the gist of it. The nice thing about this is that it uses ruby built-ins to do 90% of the work and then simply folds into flay’s current data structure. I didn’t really have to do much more than that. You can see it work here:

    % flay -m 5 -f bogus_example.rb

    Total score (lower is better) = 37

    1) Similar code found in :defn (mass = 21)
      bogus_example.rb:1 (FUZZY)
      bogus_example.rb:5
      bogus_example.rb:9 (FUZZY)

    2) Similar code found in :defn (mass = 16)
      bogus_example.rb:1
      bogus_example.rb:9

I just committed my initial implementation. Take a look if you’d like.