Code Similarity

Code similarity is a special case of graph similarity, which means it is easier to solve than the general case.

Plagiarism and Clone Detection

Detecting similar code is important for plagiarism detection as well as code clone detection. Plagiarism is something schools care about, because they want students to do their own work. Code clone detection is something software companies care about, because it can help answer whether a competing product contains an unauthorized copy of proprietary work. Software developers also care about code clones, because it can help identify duplicated code that may already exist as a library. Using existing solutions rather than reinventing the same wheel is a productivity boost for most developers.

Plagiarism and clones are similar concepts, and can be detected with similar methods.

Graph Similarity

The most general-purpose data structure is the graph. Nodes stand for ideas and edges represent relationships between the ideas. Graphs can be used to represent code, and if graphs are similar, it may be reasonable to conclude that the code represented by the graphs is also similar.

One challenge of working with graphs is they are too general. Coming up with a similarity measurement often requires domain knowledge. For example, two nodes can have the same out-degree but different in-degrees - should they be considered similar or not? These kinds of design choices make a big difference in the effectiveness of graph algorithms. Graph algorithms can also be computationally expensive, often requiring traversing every node multiple times.

Graph Embeddings

Deep learning based methods like deepwalk and node2vec use random walks among neighborhoods of vertices. The path traversed by these walks is encoded as a vector of numbers, and compared to other vectors using algebraic properties like their dot product.

The randomness of these kinds of approaches creates a kind of down-sampling of the graph. Rather than exhaustively search the entire graph for similar sub-graphs of varying sizes, random walks form a (hopefully) representative distribution of all possible paths.

Graph-Based Code Similarity

If a graph representation of code exists, it can be embedded and compared to other embeddings. The trick then is in generating or obtaining this graph representation. For source code, compilers and interpreters generally construct a graph representation internally. Whether they expose this representation to the users of the compiler/interpreter varies, but for example LLVM does provide an interface to access this internal representation. Separate code analysis tools like Tree Sitter also perform analysis and emit a graph structure (a tree is a kind of graph).