Some notes about using pyhidra for reverse engineering. Some of this post is background on what reverse engineering is, but the focus is on pyhidra, an interface between Python and Ghidra.
Reverse engineering (or RE) is the practice of obtaining information about an engineered system by examining it in the absence of any insider information. Essentially, whenever you systematically try to understand how some bit of tech works, you are reverse engineering that tech.
Sometimes RE is the only way to obtain information about some sort of technology. For example, some ancient tool discovered by archaeologists may not have any surviving documentation or writing to explain how it was used or how it works. So, an archaeologist might build a replica and just try to use it, and use their judgement to attempt to derive how the tool worked in its time.
Similarly, software reverse engineering may require deriving some mental model for software whose authors have disappeared and for which there is insufficient documentation.
Another side of RE is about gathering intelligence on some tech without its creators knowing. Sometimes companies try to reverse engineer their competitors' products in order to produce copies. Security researchers may try to reverse engineer malware in order to understand what it is meant to accomplish.
In 2019, the NSA made their software reverse engineering suite Ghidra available to the public. Prior to its release, other software reverse engineering tools existed, such as IDA and Radare2. An inaccurate but useful stereotype is that professional reverse engineers working for larger companies use IDA, and hobbyists use Radare2. However, Ghidra seems to be competitive with IDA, making it an interesting alternative for people with smaller RE budgets.
Ghidra is a Java application, and can be scripted with Java or Jython (a Java-based Python interpreter). However, Jython is not exactly the same as CPython: it implements Python 2.7 and cannot load CPython C extension modules, so this is not ideal.
The first solution to this mismatch that I found is Ghidra Bridge, an open-source interface between Ghidra's Jython interpreter and your system's CPython interpreter. This allows regular Python scripts to access Ghidra's internal data structures as if they were running in Jython inside Ghidra.
Pyhidra is a newer solution to the same problem. It also appears to have been released by the US government, so perhaps it is more "official" than Ghidra Bridge. I'm not exactly sure how Pyhidra works internally, but it feels like a "patched" Ghidra that works with normal Python.
You can find example code written for Ghidra's internal Jython environment, but such examples typically assume that they are running inside Ghidra itself.
For example, this answer references the variables currentProgram and monitor, which do not exist in a standard Python program. However, Ghidra defines them after you open a project.
From inside Ghidra, you can open the Pyhidra REPL and type in the following to print the instructions for all the basic blocks in a program:
    from ghidra.program.model.block import BasicBlockModel
    from ghidra.util.task import TaskMonitor

    for block in BasicBlockModel(currentProgram).getCodeBlocks(TaskMonitor.DUMMY):
        for inst in currentProgram.getListing().getInstructions(block, True):
            print(inst.getAddressString(False, True), inst)
To get the function call graph for every file in a large project, it would be more convenient to write a standalone script and not have to manually type commands into the Pyhidra REPL.
However, we need to obtain some special variables in a different way.
The import ghidra statement will fail without the proper prerequisite.
You have to first start Ghidra running in headless mode like this:
    import pyhidra
    pyhidra.start()
    import ghidra  # this works now
But it's more convenient (and faster) to import the ghidra modules from inside a pyhidra context manager:
    import pyhidra

    def call_graphs(program):
        calls = {}
        called_by = {}
        with pyhidra.open_program(program) as flat_api:
            from ghidra.util.task import TaskMonitor
            monitor = TaskMonitor.DUMMY
            currentProgram = flat_api.getCurrentProgram()
            funcs = currentProgram.getFunctionManager().getFunctions(True)
            for f in funcs:
                calls[f] = f.getCalledFunctions(monitor)
                called_by[f] = f.getCallingFunctions(monitor)
        return {'calls': calls, 'called_by': called_by}
The call to pyhidra.start() takes a few seconds on my machine, but only needs to run once.
Also, the currentProgram variable is self-explanatory, but note that Ghidra sets its value for you when you create a project.
For a standalone script, we set its value ourselves:
    import pyhidra

    with pyhidra.open_program('a.out') as flat_api:
        currentProgram = flat_api.getCurrentProgram()  # same currentProgram as in typical Ghidra examples
        # currentProgram = flat_api.currentProgram  # same as above
        for f in currentProgram.getFunctionManager().getFunctions(True):
            print(f.getName())
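Since the stated goal was to process every file in a large project, the per-file pattern above could be wrapped in a small batch driver. The following is only a sketch: analyze_tree and analyze_one are hypothetical names, not part of pyhidra, and the per-file function is passed in as a parameter so the driver itself does not depend on Ghidra.

```python
from pathlib import Path


def analyze_tree(root, analyze_one):
    """Apply analyze_one(path) to every regular file under root.

    Returns (results, failures), both keyed by file path.
    analyze_one is a caller-supplied function (hypothetical here).
    """
    results, failures = {}, {}
    # sorted() builds the full file list up front, so any files that an
    # analysis step creates while the loop runs are not visited.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            results[str(path)] = analyze_one(str(path))
        except Exception as exc:
            # One file that fails to load should not stop the whole batch.
            failures[str(path)] = repr(exc)
    return results, failures
```

In practice analyze_one might be a function that opens its argument with pyhidra.open_program, as in the script above, and returns whatever should be recorded for that file.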
To demonstrate the workflow, start with a C program:
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h> /* int32_t is declared here, not in <stddef.h> */

    int math(int32_t x, float y) {return y > 2 ? -x : x+4;}

    int math2(int x) {
        int tmp = math(x, 3.14159);
        return tmp-x;
    }

    int main(int argc, char**argv){
        if(argc<3){printf("supply 2 number arguments please.\n");return 1;}
        int x = atoi(argv[1]);
        float y = atoi(argv[2]);
        printf("Your number is %d\n", math(x, y));
        printf("Some other number is %d\n", math2(x));
        return 0;
    }
I want to analyze dietlibc, so after installing diet I compile like this:
diet gcc example.c
This results in an executable called a.out by default, so that's the file I will analyze with pyhidra.
For example, one feature of interest is the function call graph.
Here's a way to extract a graph of functions in both directions (caller/callee):
    import pyhidra

    def call_graphs(program):
        calls = {}
        called_by = {}
        with pyhidra.open_program(program) as flat_api:
            from ghidra.util.task import TaskMonitor
            monitor = TaskMonitor.DUMMY
            for f in flat_api.currentProgram.getFunctionManager().getFunctions(True):
                calls[f] = f.getCalledFunctions(monitor)
                called_by[f] = f.getCallingFunctions(monitor)
        return {'calls': calls, 'called_by': called_by}

    call_graphs('a.out')
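The dictionaries returned by call_graphs hold Ghidra Function objects rather than plain strings. If only names are needed, a helper along these lines could flatten them. This is a sketch under assumptions: to_name_graph is a hypothetical name, it relies only on each object having a getName() method, and it is probably safest to call it inside the with block, while the program is still open.

```python
def to_name_graph(function_map):
    """Convert {function: iterable of functions} into {name: sorted names}.

    Assumes every key and value element exposes getName(), as Ghidra
    Function objects do. Names are not guaranteed to be unique across
    namespaces, so functions that share a name would collapse into one key.
    """
    return {
        func.getName(): sorted(other.getName() for other in others)
        for func, others in function_map.items()
    }
```

The result is an ordinary dictionary of strings, which can be printed, compared, or serialized without any further access to Ghidra.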
And here's a way to iterate through the basic blocks:
    import pyhidra

    def block_graphs(program):
        with pyhidra.open_program(program) as flat_api:
            from ghidra.program.model.block import BasicBlockModel
            from ghidra.util.task import TaskMonitor
            for b in BasicBlockModel(flat_api.currentProgram).getCodeBlocks(TaskMonitor.DUMMY):
                print(f'Label: {b.name}')
                print(f'  min address: {b.minAddress}')
                print(f'  max address: {b.maxAddress}')

    block_graphs('a.out')
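block_graphs prints each block but not the edges between blocks. One possible way to collect the edges is sketched below. block_edges is a hypothetical helper; it assumes each block offers getDestinations(monitor) returning an iterator with hasNext() and next(), and that each reference offers getDestinationBlock(). That matches Ghidra's CodeBlock interface as far as I know, but it should be checked against the API documentation for your Ghidra version.

```python
def block_edges(blocks, monitor):
    """Return (source, destination) pairs of block start addresses as strings.

    blocks:  an iterable of Ghidra-style code blocks (assumed interface)
    monitor: a task monitor, e.g. TaskMonitor.DUMMY when run under pyhidra
    """
    edges = []
    for block in blocks:
        refs = block.getDestinations(monitor)
        # Java-style iteration, since the reference iterator may not be
        # directly usable in a Python for loop.
        while refs.hasNext():
            dest = refs.next().getDestinationBlock()
            edges.append((str(block.getMinAddress()), str(dest.getMinAddress())))
    return edges
```

Inside the with block of block_graphs, it could be called with the same iterator that the loop uses, passing TaskMonitor.DUMMY as the monitor.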
The ability to construct function call graphs and basic block graphs opens new possibilities for static analysis.
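As a small illustration of that kind of analysis, the sketch below computes which functions are reachable from a given function in a name-based call graph. The graph here is written by hand from the C source above, not produced by Ghidra; a real graph for a.out would also contain dietlibc internals, and the compiler may inline or substitute some calls. The names reachable and example_calls are hypothetical.

```python
from collections import deque


def reachable(calls, start):
    """Return the set of function names reachable from start via calls.

    calls maps a function name to the names it calls directly.
    start itself is included only if it lies on a cycle.
    """
    seen = set()
    queue = deque([start])
    while queue:
        name = queue.popleft()
        for callee in calls.get(name, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hand-written from the example C program (an assumption, not Ghidra output).
example_calls = {
    "main": ["atoi", "math", "math2", "printf"],
    "math2": ["math"],
    "math": [],
}

print(sorted(reachable(example_calls, "math2")))  # → ['math']
```

The same function works on the output of any helper that reduces Ghidra's call graph to plain names.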