Pyhidra

These are some notes on using Pyhidra for reverse engineering. Part of this post is background on what reverse engineering is, but the focus is on Pyhidra, an interface between Python and Ghidra.

Reverse Engineering

Reverse engineering (or RE) is the practice of obtaining information about an engineered system by examining it in the absence of any insider information. Essentially, whenever you systematically try to understand how some bit of tech works, you are reverse engineering that tech.

Sometimes RE is the only way to obtain information about some sort of technology. For example, some ancient tool discovered by archaeologists may not have any surviving documentation or writing to explain how it was used or how it works. So, an archaeologist might build a replica and just try to use it, and use their judgement to attempt to derive how the tool worked in its time.

Similarly, software reverse engineering may require deriving some mental model for software whose authors have disappeared and for which there is insufficient documentation.

Another side of RE is about gathering intelligence on some tech without its creators knowing. Sometimes companies try to reverse engineer their competitors' products in order to produce copies. Security researchers may try to reverse engineer malware in order to understand what it is meant to accomplish.

Ghidra

In 2019, the NSA made its software reverse engineering suite, Ghidra, available to the public. Other software reverse engineering tools existed before its release, such as IDA and Radare2. An inaccurate but useful stereotype is that professional reverse engineers working for larger companies use IDA, and hobbyists use Radare2. However, Ghidra seems to be competitive with IDA, making it an interesting alternative for people with smaller RE budgets.

Scripting

Ghidra is a Java application, and can be scripted with Java or Jython (a Java-based Python interpreter). However, Jython is not the same as CPython: it implements Python 2.7 and cannot load CPython extension modules, so scripts cannot rely on the modern Python 3 ecosystem.
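A quick way to see which interpreter a script is actually running under is the standard library's platform module. This is a minimal sketch; the function name interpreter_summary is made up for illustration, and the code sticks to syntax that should work on both interpreters.

```python
import platform
import sys

def interpreter_summary():
    """Return the implementation name and major.minor version as one string."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return "{} {}.{}".format(platform.python_implementation(), major, minor)

print(interpreter_summary())
```

Run from Ghidra's built-in Jython console and from a normal terminal, the two results should differ in both the implementation name and the version.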

The first solution to this mismatch that I found is Ghidra Bridge, an open-source interface between Ghidra's Jython interpreter and your system's CPython interpreter. This allows regular Python scripts to access Ghidra's internal data structures as if you were using Jython from inside Ghidra.

Pyhidra is a newer solution to the same problem. It also seems to have been released by the US government, so perhaps it is more "official" than Ghidra Bridge. I'm not exactly sure how Pyhidra works internally, but it feels like a "patched" Ghidra that works with normal Python.

Python-in-Ghidra

You can find example code that uses Ghidra's internal Jython environment, but such examples typically assume:

The Python interpreter is running inside of an already-open Ghidra project.
Certain global variables are already initialized for you.

For example, this answer references the variables currentProgram and monitor, which do not exist in a standard Python program. However, Ghidra defines them after you open a project.
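A script meant to run both inside and outside Ghidra can check whether those globals were injected before using them. This is a minimal sketch; the helper name ghidra_global is hypothetical and not part of Ghidra or Pyhidra.

```python
def ghidra_global(name, namespace):
    """Return the Ghidra-injected variable `name`, or None if it is absent."""
    return namespace.get(name)

# In a plain Python process, Ghidra has injected nothing into globals()
program = ghidra_global("currentProgram", globals())
if program is None:
    print("currentProgram is not defined; not running inside Ghidra")
else:
    print("Analyzing", program.getName())
```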

From inside Ghidra, you can open the Pyhidra REPL and type in the following to print the instructions for all the basic blocks in a program:

from ghidra.program.model.block import BasicBlockModel
from ghidra.util.task import TaskMonitor

# currentProgram is defined for you inside Ghidra's REPL
for block in BasicBlockModel(currentProgram).getCodeBlocks(TaskMonitor.DUMMY):
    for inst in currentProgram.getListing().getInstructions(block, True):
        print(inst.getAddressString(False, True), inst)

Pyhidra Outside Ghidra

To get the function call graph for every file in a large project, it would be more convenient to write a standalone script and not have to manually type commands into the Pyhidra REPL.

However, a standalone script has to obtain those special variables in a different way. An import ghidra statement fails unless Ghidra has been started first, so you have to start Ghidra in headless mode like this:

import pyhidra
pyhidra.start()
import ghidra  # this works now

But it's more convenient (and faster) to import the ghidra modules from inside a pyhidra context manager:

import pyhidra

def call_graphs(program):
    calls = {}      # function -> functions it calls
    called_by = {}  # function -> functions that call it
    with pyhidra.open_program(program) as flat_api:
        # import the ghidra modules inside the context manager
        from ghidra.util.task import TaskMonitor
        monitor = TaskMonitor.DUMMY
        currentProgram = flat_api.getCurrentProgram()
        funcs = currentProgram.getFunctionManager().getFunctions(True)
        for f in funcs:
            calls[f] = f.getCalledFunctions(monitor)
            called_by[f] = f.getCallingFunctions(monitor)

    return {'calls': calls, 'called_by': called_by}

The call to pyhidra.start() takes a few seconds on my machine, but only needs to run once. The purpose of the currentProgram variable is self-explanatory, but inside Ghidra its value is set for you when you open a program. In a standalone script, you set its value yourself:

import pyhidra

with pyhidra.open_program('a.out') as flat_api:
    currentProgram = flat_api.getCurrentProgram()  # same currentProgram as in typical Ghidra examples
    # currentProgram = flat_api.currentProgram  # same as above
    for f in currentProgram.getFunctionManager().getFunctions(True):
        print(f.getName())
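
For the many-files case that motivated this section, the per-file analysis can be wrapped in a loop over a directory tree. The sketch below is only one assumed way to structure that: analyze_tree and the analyze callback are hypothetical names, and in practice the callback would open each file with pyhidra.open_program as shown above. Files that fail are recorded instead of stopping the batch.

```python
import os

def analyze_tree(root, analyze):
    """Apply analyze(path) to every file under root.

    Returns (results, errors): results maps path -> analyze's return value,
    errors maps path -> the exception message for files that failed.
    """
    results, errors = {}, {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                results[path] = analyze(path)
            except Exception as exc:  # e.g. a file that cannot be loaded
                errors[path] = str(exc)
    return results, errors
```

Passing the analysis in as a callback keeps the directory walk independent of Ghidra, so it can be tried out with a stub function first.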

Complete example

To demonstrate the workflow, start with a C program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>  /* int32_t */

int math(int32_t x, float y) {return y > 2 ? -x : x+4;}

int math2(int x) {
  int tmp = math(x, 3.14159);
  return tmp-x;
}

int main(int argc, char**argv){
  if(argc<3){printf("supply 2 number arguments please.\n");return 1;}
  int x = atoi(argv[1]);
  float y = atoi(argv[2]);
  printf("Your number is %d\n", math(x, y));
  printf("Some other number is %d\n", math2(x));
  return 0;
}

I want to analyze dietlibc, so after installing it I compile with its diet wrapper like this:

diet gcc example.c

This results in an executable called a.out by default, so that's the file I will analyze with Pyhidra. For example, one feature of interest is the function call graph. Here's a way to extract a graph of functions in both directions (caller/callee):

import pyhidra

def call_graphs(program):
    calls = {}      # function -> functions it calls
    called_by = {}  # function -> functions that call it
    with pyhidra.open_program(program) as flat_api:
        from ghidra.util.task import TaskMonitor
        monitor = TaskMonitor.DUMMY
        for f in flat_api.currentProgram.getFunctionManager().getFunctions(True):
            calls[f] = f.getCalledFunctions(monitor)
            called_by[f] = f.getCallingFunctions(monitor)

    return {'calls': calls, 'called_by': called_by}

call_graphs('a.out')
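
The dictionaries returned by call_graphs hold Ghidra Function objects, which are awkward to print or save. Below is a minimal sketch of flattening one of them to plain names. name_graph and to_json are hypothetical helpers, and they rely only on getName(), which the earlier examples already call on functions. Converting inside the with block avoids depending on the objects staying valid after the program is closed. Functions that share a name would collapse into one entry here.

```python
import json

def name_graph(graph):
    """Convert {function: iterable of functions} into {name: sorted names}."""
    return {
        f.getName(): sorted(g.getName() for g in neighbors)
        for f, neighbors in graph.items()
    }

def to_json(graph):
    """Serialize the name-based graph as indented JSON text."""
    return json.dumps(name_graph(graph), indent=2, sort_keys=True)
```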

And here's a way to iterate through the basic blocks:

import pyhidra

def block_graphs(program):
    with pyhidra.open_program(program) as flat_api:
        from ghidra.program.model.block import BasicBlockModel
        from ghidra.util.task import TaskMonitor
        model = BasicBlockModel(flat_api.currentProgram)
        for b in model.getCodeBlocks(TaskMonitor.DUMMY):
            print(f'Label: {b.name}')
            print(f'  min address: {b.minAddress}')
            print(f'  max address: {b.maxAddress}')

block_graphs('a.out')
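
Printing block addresses does not yet give a graph; for that, each block's outgoing edges are needed. The sketch below assumes Ghidra's CodeBlock.getDestinations(monitor), whose result is walked with hasNext()/next(), together with CodeBlock.getFirstStartAddress() and CodeBlockReference.getDestinationAddress(). block_edges is a hypothetical helper name. It takes the blocks and the monitor as arguments, so it would be called inside the with block with BasicBlockModel(flat_api.currentProgram).getCodeBlocks(TaskMonitor.DUMMY) and TaskMonitor.DUMMY.

```python
def block_edges(blocks, monitor):
    """Collect (source, destination) start-address pairs, one per edge."""
    edges = []
    for block in blocks:
        refs = block.getDestinations(monitor)
        # Walk the reference iterator explicitly with hasNext()/next()
        while refs.hasNext():
            ref = refs.next()
            edges.append((str(block.getFirstStartAddress()),
                          str(ref.getDestinationAddress())))
    return edges
```

Storing the addresses as strings keeps the result usable as ordinary Python data.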

The ability to construct function call graphs and basic block graphs opens new possibilities for static analysis.
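
As one small example of such an analysis, once a call graph has been reduced to plain names, ordinary graph algorithms apply. The sketch below finds every function reachable from an entry point with a breadth-first search. The graph literal is a hand-written stand-in based on the C source above, not actual Ghidra output; a real dietlibc binary would contain many more functions, and unused_helper is invented purely to show an unreachable node.

```python
from collections import deque

def reachable(calls, start):
    """Return the set of names reachable from start via calls (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Hand-written stand-in for the example program's call graph
example_calls = {
    "main": ["printf", "atoi", "math", "math2"],
    "math2": ["math"],
    "math": [],
    "unused_helper": ["math"],  # hypothetical function that main never calls
}

print(sorted(reachable(example_calls, "main")))
# prints ['atoi', 'main', 'math', 'math2', 'printf']
```

Functions missing from the result, such as unused_helper here, are candidates for dead code.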