Introduction
SAF (Static Analyzer Factory) is a Rust + WebAssembly framework for building program analysis tools. It provides whole-program pointer analysis, value-flow reasoning, and graph-based program representations -- all accessible through a Python SDK or directly in the browser via WebAssembly.
Who Is SAF For?
SAF is designed for three audiences:
- Students learning program analysis. The interactive Playground lets you paste C code, visualize control-flow graphs, call graphs, and value-flow graphs, and step through pointer analysis -- all without installing anything.
- Security researchers writing custom checkers. The Python SDK exposes a schema-driven API for authoring analyzers that detect use-after-free, double free, taint-flow violations, and other vulnerability classes. You write the checker logic; SAF handles the underlying analysis infrastructure.
- AI and agent developers. The same schema-driven API that humans use is designed for LLM agents to consume. Agents can query points-to sets, traverse value-flow edges, and emit SARIF findings programmatically.
What Can You Do With SAF?
| Capability | Description |
|---|---|
| Visualize program graphs | Render CFGs, ICFGs, call graphs, def-use chains, and value-flow graphs as interactive diagrams |
| Run pointer analysis | Andersen-style inclusion-based analysis with field sensitivity and on-the-fly call graph construction |
| Detect vulnerabilities | Built-in checkers for use-after-free, double free, memory leaks, null dereference, file descriptor leaks, and more |
| Write custom analyzers | Author new checkers in Python using the SDK, with full access to points-to and value-flow results |
| Embed analysis widgets | Embed interactive analysis <iframe> widgets into any web page for live program visualization |
| Export results | Output findings as SARIF, and graphs as the unified PropertyGraph JSON format |
Key Differentiators
- Runs in the browser. The core analysis engine compiles to WebAssembly. The playground and embeddable widgets require no server -- everything executes client-side.
- Deterministic. Given identical inputs, SAF produces byte-identical outputs. All internal data structures use ordered collections, and all IDs are derived from BLAKE3 hashes. This makes results reproducible and diffable.
- Frontend-agnostic. SAF operates on its own intermediate representation (AIR), not on LLVM IR directly. Frontends translate source languages into AIR; the analysis core never sees frontend-specific types.
- Open source. SAF is designed for extensibility: the crate architecture cleanly separates the core IR, frontends, analysis passes, and language bindings.
Quick Start
The fastest way to try SAF:
- In the browser: Open the Playground and paste a C snippet. Click "Analyze" to see the CFG, call graph, and points-to results.
- Locally with Docker:

git clone https://github.com/Static-Analyzer-Factory/static-analyzer-factory.git
cd static-analyzer-factory
make shell
# Inside the container:
python3 tutorials/memory-safety/02-use-after-free/detect.py

- With the CLI:
saf run program.ll --format sarif --output results.sarif
Where to Go Next
- Installation -- Set up SAF locally with Docker or install the Python package.
- First Analysis -- Walk through analyzing a small C program end to end.
- Playground Tour -- Learn how to use the browser-based playground.
- Tutorials -- Step-by-step guides for detecting specific vulnerability classes.
- API Reference -- Full reference for the Python SDK, CLI, and PropertyGraph format.
Architecture Overview
Input (.c / .ll / .bc / .air.json)
-> Frontend (LLVM, AIR-JSON)
-> AirBundle (canonical IR)
-> Graph builders (CFG, ICFG, CallGraph, DefUse)
-> PTA solver (Andersen)
-> ValueFlow graph
-> Queries (flows, taint_flow, points_to)
-> Export (JSON, SARIF)
SAF is structured as a set of Rust crates:
| Crate | Purpose |
|---|---|
| saf-core | AIR definitions, config, deterministic ID generation |
| saf-frontends | Frontend trait + LLVM/AIR-JSON implementations |
| saf-analysis | CFG, call graph, PTA, value-flow, checkers |
| saf-cli | Command-line interface |
| saf-python | PyO3 bindings for the Python SDK |
Installation
SAF can be used in two ways: in the browser (no install needed) or locally via Docker for full analysis capabilities.
Browser (No Install)
The fastest way to try SAF is the Playground. It runs entirely in your browser using WebAssembly -- no server, no Docker, no setup.
The browser version supports:
- Writing C code or LLVM IR
- Visualizing CFGs, call graphs, def-use chains, and value-flow graphs
- Running pointer analysis
- Exporting results as JSON
For the full Python SDK, checker framework, and LLVM bitcode support, use the local installation below.
Local Installation (Docker)
The local installation uses Docker to provide a complete development environment with Rust, Python 3.12, LLVM 18, and all dependencies pre-installed.
Prerequisites
- Docker (the development environment runs entirely in a container)
- git and make (used to clone the repository and drive the Docker workflow)
Step 1: Clone the Repository
git clone https://github.com/Static-Analyzer-Factory/static-analyzer-factory.git
cd static-analyzer-factory
Step 2: Start the Development Shell
make shell
This drops you into an interactive shell inside the Docker container. The project
directory is mounted at /workspace.
On first run, the container automatically:
- Creates a Python virtual environment
- Installs pytest and development dependencies
- Builds the SAF Python SDK with `maturin develop --release`
Subsequent runs skip the build if saf is already importable.
Step 3: Verify the Installation
Inside the Docker shell:
python3 -c "import saf; print('SAF version:', saf.version())"
You should see the SAF version printed without errors.
To run the full test suite:
# Exit Docker shell first (Ctrl+D), then:
make test
What's Included
The Docker environment provides:
| Tool | Version | Purpose |
|---|---|---|
| Rust | stable | Core analysis engine |
| Python | 3.12 | SDK and tutorials |
| LLVM/Clang | 18 | C/C++ to LLVM IR compilation |
| maturin | latest | Python-Rust bridge (PyO3) |
Available Make Commands
make help # Show all available commands
make test # Run all tests (Rust + Python) in Docker
make lint # Run clippy + rustfmt check in Docker
make fmt # Auto-format all Rust code in Docker
make shell # Open interactive dev shell in Docker
make build # Build minimal runtime Docker image
make clean # Remove Docker volumes and local target/
Next Steps
- First Analysis -- Analyze a C program end to end
- Playground Tour -- Learn the browser-based playground
First Analysis
This guide walks you through analyzing a small C program from start to finish using the SAF Python SDK.
The Vulnerable Program
We will detect a command injection vulnerability (CWE-78) -- user-controlled input flowing directly to a shell command:
#include <stdlib.h>
int main(int argc, char *argv[]) {
if (argc < 2) return 1;
// SOURCE: argv[1] is user-controlled
char *user_cmd = argv[1];
// SINK: system() executes the string as a shell command
return system(user_cmd);
}
If an attacker runs ./program "rm -rf /", the program executes that command.
The Analysis Pipeline
vulnerable.c --> clang-18 --> LLVM IR (.ll) --> SAF --> Findings
                (compile)     (.ll file)     (analyze)
SAF works on LLVM IR, not source code directly. The pipeline compiles C to LLVM IR first, then runs analysis on the IR.
Step 1: Write the Detection Script
Create a file called detect.py:
import subprocess
from pathlib import Path
from saf import Project, sources, sinks
def main() -> None:
# 1. Compile C to LLVM IR
subprocess.run(
["clang-18", "-S", "-emit-llvm", "-O0", "-g",
"-o", "vulnerable.ll", "vulnerable.c"],
check=True,
)
print("Compiled: vulnerable.c -> vulnerable.ll")
# 2. Load the LLVM IR into SAF
proj = Project.open("vulnerable.ll")
print("Project loaded")
# 3. Create a query context
q = proj.query()
# 4. Run taint flow analysis
findings = q.taint_flow(
sources=sources.function_param("main", 1), # argv
sinks=sinks.call("system", arg_index=0), # system()'s argument
)
# 5. Print results
print(f"\nFound {len(findings)} taint flow(s):")
for i, f in enumerate(findings):
print(f" [{i}] finding_id={f.finding_id}")
if f.trace:
print(f" trace steps: {len(f.trace.steps)}")
if __name__ == "__main__":
main()
Step 2: Run Inside Docker
make shell
python3 detect.py
Expected output:
Compiled: vulnerable.c -> vulnerable.ll
Project loaded
Found 1 taint flow(s):
[0] finding_id=0x...
trace steps: 3
SAF found one taint flow: data from argv[1] reaches system().
Understanding the Key Concepts
Sources and Sinks
- Source: Where untrusted data enters. `sources.function_param("main", 1)` selects the second parameter of `main` (which is `argv`).
- Sink: Where untrusted data is dangerous. `sinks.call("system", arg_index=0)` selects the first argument to any call to `system()`.
Taint Flow
q.taint_flow(sources, sinks) performs a breadth-first search over the
ValueFlow graph, looking for paths from any source to any sink. Each path found
is reported as a Finding with:
- `finding_id` -- a deterministic identifier
- `trace` -- the step-by-step data flow path
Project.open()
Project.open("vulnerable.ll") does several things internally:
- Detects the `.ll` extension and selects the LLVM frontend
- Parses the LLVM IR into SAF's Analysis IR (AIR)
- Builds the call graph, def-use chains, and points-to analysis
- Constructs the ValueFlow graph
Step 3: Explore the Graphs
You can also inspect the program's structure:
# Export graphs
graphs = proj.graphs()
# Call graph
cg = graphs.export("callgraph")
print(f"Functions: {len(cg['nodes'])}")
print(f"Call edges: {len(cg['edges'])}")
# ValueFlow graph
vf = graphs.export("valueflow")
print(f"VF nodes: {len(vf['nodes'])}, edges: {len(vf['edges'])}")
Step 4: Use the Checker Framework
Instead of manually specifying sources and sinks, use the built-in checkers for common vulnerability patterns:
import saf
proj = saf.Project.open("vulnerable.ll")
# Run all built-in checkers (memory-leak, use-after-free, double-free, etc.)
all_findings = proj.check_all()
for f in all_findings:
print(f"[{f.severity}] {f.checker}: {f.message}")
# Or run a specific checker
leak_findings = proj.check("memory-leak")
Next Steps
- Playground Tour -- Try the same analysis in the browser
- Browser vs Full SAF -- Understand the differences
- Tutorials -- Detect UAF, double-free, leaks, and more
Playground Tour
The SAF Playground is a browser-based interactive environment for writing code, running analysis, and visualizing program graphs. Everything runs client-side via WebAssembly -- no server needed.
Opening the Playground
Navigate to the Playground in your browser. You will see three main areas:
- Code Editor (left) -- Write C code or LLVM IR
- Graph Viewer (right) -- Interactive visualization of analysis results
- Toolbar (top) -- Controls for compilation, analysis, and graph selection
Writing Code
The editor supports two input modes:
- C/C++: Write standard C code. The playground compiles it to LLVM IR using the Compiler Explorer API, then analyzes the resulting IR.
- LLVM IR: Write `.ll` format directly for full control over the IR structure.
Example: Command Injection
Paste this into the editor:
#include <stdlib.h>
int main(int argc, char *argv[]) {
if (argc < 2) return 1;
char *cmd = argv[1];
return system(cmd);
}
Running Analysis
Click Analyze to start the pipeline:
- C code is compiled to LLVM IR via Compiler Explorer
- The IR is parsed by tree-sitter-llvm (in WASM)
- The parsed IR is converted to AIR JSON
- SAF's WASM module runs the full analysis pipeline
- Results are displayed as interactive graphs
Viewing Graphs
Use the graph selector tabs to switch between:
| Graph | What It Shows |
|---|---|
| CFG | Control flow within each function -- basic blocks and branches |
| Call Graph | Which functions call which -- the program's call hierarchy |
| Def-Use | Where values are defined and where they are used |
| Value Flow | How data flows through the program, including across functions |
| Points-To | What each pointer may point to after analysis |
Interacting with Graphs
- Pan: Click and drag the background
- Zoom: Scroll wheel or pinch gesture
- Select: Click a node to highlight it and its connections
- Details: Selected nodes show their properties in a detail panel
Built-In Examples
The playground includes pre-loaded examples demonstrating different analysis scenarios:
- Command Injection -- Taint flow from `argv` to `system()`
- Use-After-Free -- Pointer used after `free()`
- Memory Leak -- Allocation without deallocation
- Double Free -- Same pointer freed twice
- Struct Field Access -- Field-sensitive pointer analysis
Select an example from the dropdown to load it into the editor.
Limitations
The browser playground uses a simplified analysis pipeline compared to the full Docker-based SAF:
| Feature | Browser | Full SAF |
|---|---|---|
| C/C++ compilation | Compiler Explorer API | Local clang-18 |
| LLVM bitcode (.bc) | Not supported | Supported |
| Python SDK | Not available | Full access |
| Checker framework | Not available | 9 built-in checkers |
| Z3 path refinement | Not available | Supported |
| SARIF export | Not available | Supported |
| Graph visualization | Built-in | Export to external tools |
See Browser vs Full SAF for a detailed comparison.
Next Steps
- Browser vs Full SAF -- Understand what each mode offers
- First Analysis -- Set up the full local environment
- Concepts -- Learn about the underlying analysis
Browser vs Full SAF
SAF is available in two forms: the browser-based playground (WebAssembly) and the full local installation (Docker). This page compares the two.
Feature Comparison
| Feature | Browser (WASM) | Full SAF (Docker) |
|---|---|---|
| Setup | None -- open a URL | Docker + make shell |
| LLVM frontend | No (tree-sitter) | Yes (inkwell, full LLVM API) |
| Z3 solver | No | Yes |
| Context-sensitive PTA | Limited | Yes |
| Large programs (>1000 LOC) | Slow | Fast |
| Python SDK | Pyodide (limited) | Full |
| Custom specs | Yes | Yes |
| Checker framework | 9 built-in checkers (via query protocol) | 9 built-in checkers + custom checkers |
| SARIF export | Not available | Standards-compliant output via CLI |
| Graph visualization | Built-in Cytoscape.js | Export JSON for external tools |
| Batch scanning | Single file | Multi-file projects |
| Offline use | Requires Compiler Explorer API for C | Fully offline |
When to Use the Browser
The browser playground is ideal for:
- Learning: Experiment with program analysis concepts interactively
- Quick checks: Paste a snippet and see the CFG or call graph immediately
- Demonstrations: Show analysis capabilities without requiring setup
- Embedding: Include analysis widgets in documentation or blog posts
When to Use Full SAF
The full installation is needed for:
- Production analysis: Scanning real codebases for vulnerabilities
- Custom checkers: Writing Python scripts that use the SDK
- CI/CD integration: Automated scanning with SARIF output
- Advanced analysis: Z3-based path refinement, IFDS taint tracking
- Bitcode input: Analyzing pre-compiled `.bc` files
- Batch scanning: Processing multiple files or entire projects
Architecture Differences
Browser
C code --> Compiler Explorer API --> LLVM IR (.ll text)
--> tree-sitter-llvm (WASM) --> CST
--> TypeScript converter --> AIR JSON
--> saf-wasm (WASM) --> PropertyGraph JSON
--> Cytoscape.js visualization
The browser path uses tree-sitter to parse LLVM IR text and a TypeScript layer
to convert the parse tree into AIR JSON. The saf-wasm crate contains
saf-core and saf-analysis compiled to WebAssembly, without LLVM, Z3, or
threading dependencies.
Full SAF
C code --> clang-18 --> LLVM IR (.ll or .bc)
--> inkwell (LLVM C API) --> AIR
--> saf-analysis (native) --> Analysis results
--> Python SDK / CLI / JSON / SARIF
The full path uses inkwell to parse LLVM IR/bitcode natively with full access to LLVM's type system and metadata. Analysis runs in native Rust with threading support.
Migrating from Browser to Full SAF
If you start with the playground and want to move to the full SDK:
- Install Docker and clone the repository (see Installation)
- Save your C code to a `.c` file
- Write a Python detection script using the same analysis concepts
- Run it inside `make shell`
The analysis concepts (sources, sinks, graph types) are the same in both environments. The full SDK simply provides more control and additional features.
Analysis IR (AIR)
AIR (Analysis Intermediate Representation) is SAF's canonical, frontend-agnostic intermediate representation. All analysis passes operate on AIR, never on frontend-specific types like LLVM IR or Clang ASTs.
Why a Separate IR?
SAF supports multiple frontends (LLVM bitcode, AIR JSON, and potentially source-level frontends in the future). Rather than coupling analysis algorithms to any specific input format, SAF defines AIR as a common target:
LLVM bitcode (.bc/.ll) --> LLVM Frontend --> AIR
AIR JSON (.air.json) --> JSON Frontend --> AIR
(future: Clang AST) --> AST Frontend --> AIR
|
v
Analysis passes
(CFG, PTA, ValueFlow, ...)
This design means adding a new frontend requires only implementing the mapping to AIR -- no changes to analysis algorithms.
Structure
An AIR module contains:
| Entity | Description |
|---|---|
| Module | Top-level container with a fingerprint and metadata |
| Functions | Named functions with parameters, return types, and basic blocks |
| Basic Blocks | Sequences of instructions with a terminator |
| Instructions | Individual operations (alloc, load, store, call, etc.) |
| Values | SSA registers, function parameters, constants, and globals |
| Objects | Memory objects (stack allocas, heap allocations, globals) |
Operations
AIR supports the following operation types:
| Category | Operations |
|---|---|
| Allocation | Alloca (stack), Global, HeapAlloc (malloc/calloc) |
| Memory | Load, Store, GEP (field/element access), Memcpy, Memset |
| Control | Br (branch), Switch, Ret (return) |
| SSA | Phi, Select |
| Calls | CallDirect, CallIndirect |
| Transforms | Cast, BinaryOp (arithmetic, bitwise) |
Deterministic IDs
Every AIR entity has a deterministic u128 ID derived from BLAKE3 hashes.
IDs are serialized as 0x followed by 32 lowercase hex characters (e.g.,
0x1a2b3c4d5e6f...).
The ID derivation hierarchy:
ModuleFingerprint = hash(FrontendId, input_fingerprint_bytes)
FunctionId = hash(ModuleFingerprint, "fn", function_key)
BlockId = hash(FunctionId, "bb", block_index)
InstId = hash(BlockId, "inst", inst_index, opcode_tag)
ValueId = derived from instruction results, args, globals, constants
ObjId = derived from allocas, heap allocators, globals
LocId = hash(ObjId, "loc", field_path)
This means identical inputs always produce identical IDs, regardless of when or where the analysis runs. Debug information does not affect structural IDs by default.
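The derivation scheme can be illustrated with a short Python sketch. This uses `hashlib.blake2b` truncated to 16 bytes purely as a stand-in for BLAKE3 (which is not in the standard library), and the key strings are illustrative, not SAF's actual encoding:

```python
import hashlib

def derive_id(parent: bytes, *parts: str) -> bytes:
    # Hash the parent ID together with tagged components; 16 bytes ~ u128.
    # NOTE: blake2b is a stand-in for BLAKE3 in this illustration.
    h = hashlib.blake2b(digest_size=16)
    h.update(parent)
    for part in parts:
        h.update(part.encode())
    return h.digest()

def render(raw: bytes) -> str:
    # Serialize as 0x followed by 32 lowercase hex characters.
    return "0x" + raw.hex()

module_fp = derive_id(b"", "llvm-frontend", "input-fingerprint-bytes")
fn_id = derive_id(module_fp, "fn", "main")
blk_id = derive_id(fn_id, "bb", "0")
inst_id = derive_id(blk_id, "inst", "0", "call")

# Identical inputs always yield identical IDs.
assert fn_id == derive_id(module_fp, "fn", "main")
print(render(inst_id))
```

Because each ID is seeded by its parent's ID, changing the module fingerprint changes every derived ID below it, while two runs on the same input agree byte for byte.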
Source Metadata
AIR instructions can carry optional source-level metadata:
- Span: File, line, column, byte offsets for source location
- Symbol: Display name, mangled name, namespace path
- Type representation: Frontend-specific type string
This metadata enables source-level error reporting without coupling the analysis to any particular frontend.
Next Steps
- Control Flow Graphs -- How AIR is organized into CFGs
- Points-To Analysis -- How AIR objects are analyzed for aliasing
Control Flow Graphs
A control flow graph (CFG) represents the possible execution paths within a function. An interprocedural control flow graph (ICFG) extends this to the whole program by connecting function calls to their targets.
CFG Basics
A CFG is a directed graph where:
- Nodes are basic blocks -- straight-line sequences of instructions with no branches except at the end
- Edges represent possible control flow between blocks
Key Concepts
| Concept | Description |
|---|---|
| Entry block | Where function execution begins (no predecessors) |
| Exit block | Ends with a return instruction (no successors) |
| Branch block | Has multiple successors (if/while/switch) |
| Merge block | Has multiple predecessors (join point after a branch) |
| Back-edge | Edge to an earlier block, indicating a loop |
Example
int abs(int x) {
if (x < 0) // Block 0: entry, branch
x = -x; // Block 1: negate
return x; // Block 2: merge, exit
}
The CFG for this function has three blocks:
[Block 0: entry] --true--> [Block 1: negate]
| |
+---false--> [Block 2: return] <--+
Block 2 is a merge point (two predecessors) and an exit point (return).
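As a sketch, the same three-block CFG can be represented as a plain edge list and queried for the entry, merge, and exit properties defined above (block numbering follows the example; this is illustrative, not SAF's data model):

```python
# Toy CFG for abs(): blocks 0 (entry/branch), 1 (negate), 2 (merge/exit).
edges = [(0, 1), (0, 2), (1, 2)]  # (src, dst) control-flow edges

preds = {b: [] for b in (0, 1, 2)}
succs = {b: [] for b in (0, 1, 2)}
for src, dst in edges:
    succs[src].append(dst)
    preds[dst].append(src)

entry_blocks = [b for b in preds if not preds[b]]       # no predecessors
exit_blocks = [b for b in succs if not succs[b]]        # no successors
merge_blocks = [b for b in preds if len(preds[b]) > 1]  # join points

assert entry_blocks == [0]
assert exit_blocks == [2]
assert merge_blocks == [2]
```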
ICFG
The ICFG connects CFGs across functions by adding call and return edges:
- Call edge: From a call instruction to the callee's entry block
- Return edge: From the callee's return to the instruction after the call
main:
[Block 0] --call--> validate: [entry]
...
validate: [exit] --return--> [Block 1]
The ICFG enables interprocedural analysis -- tracking data flow across function boundaries.
SAF's PropertyGraph Format
SAF exports CFGs in the unified PropertyGraph JSON format:
{
"schema_version": "0.1.0",
"graph_type": "cfg",
"nodes": [
{
"id": "0x...",
"labels": ["Block", "Entry"],
"properties": {
"name": "entry",
"function": "main"
}
}
],
"edges": [
{
"src": "0x...",
"dst": "0x...",
"edge_type": "FLOWS_TO",
"properties": {}
}
]
}
- Nodes have `labels` including `"Block"`, and optionally `"Entry"` for entry blocks
- The `properties.function` field groups blocks by function
- Edges have `edge_type: "FLOWS_TO"`
Exporting with the Python SDK
from saf import Project
proj = Project.open("program.ll")
graphs = proj.graphs()
cfg = graphs.export("cfg")
# Group blocks by function
from collections import defaultdict
by_function = defaultdict(list)
for node in cfg["nodes"]:
fn = node["properties"]["function"]
by_function[fn].append(node)
for fn, blocks in by_function.items():
entry = [b for b in blocks if "Entry" in b["labels"]]
print(f"{fn}: {len(blocks)} blocks, entry={entry[0]['id']}")
Why CFGs Matter
CFGs are the foundation for:
- Reachability analysis: Is a particular code path possible?
- Loop detection: Back-edges in the CFG reveal loops
- Dominance: Which blocks must execute before others?
- Dead code: Blocks with no path from entry are unreachable
- Taint propagation: Data flows follow control flow paths
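For example, back-edges (and hence loops) can be found with a depth-first search over the block graph. This is a generic sketch on a toy adjacency list, not SAF's implementation:

```python
# Toy CFG with a loop: entry -> header -> body -> header, header -> exit.
succs = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["header"],   # edge back to the loop header
    "exit": [],
}

def find_back_edges(succs, entry):
    """Return CFG edges that target a block currently on the DFS stack."""
    back_edges = []
    visited, on_stack = set(), set()

    def dfs(block):
        visited.add(block)
        on_stack.add(block)
        for succ in succs[block]:
            if succ in on_stack:          # edge to an ancestor: a back-edge
                back_edges.append((block, succ))
            elif succ not in visited:
                dfs(succ)
        on_stack.discard(block)

    dfs(entry)
    return back_edges

assert find_back_edges(succs, "entry") == [("body", "header")]
```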
Next Steps
- Call Graphs -- Function-level structure
- Value Flow -- Data flow built on top of CFGs
Call Graphs
A call graph is a directed graph representing the calling relationships between functions in a program. It answers the question: "which functions can call which other functions?"
Structure
- Nodes represent functions
- Edges represent call sites (function A calls function B)
Key Concepts
| Concept | Description |
|---|---|
| Entry point | Function with no callers (e.g., main) |
| Leaf function | Function that calls no other user-defined functions |
| Fan-out | Number of distinct functions a function calls |
| Fan-in | Number of distinct callers a function has |
| Strongly connected component | Group of functions that (transitively) call each other (recursion) |
Example
main
|-- log_error (shared utility)
\-- parse_request
|-- validate_input
| \-- log_error
|-- process_data
\-- send_response
Here, main is the entry point, and log_error has fan-in of 2 (called by both
main and validate_input). process_data, send_response, and log_error
are leaf functions.
Direct vs Indirect Calls
SAF distinguishes between:
- Direct calls: The target function is known statically (e.g.,
foo()) - Indirect calls: The target is a function pointer (e.g.,
fptr())
For indirect calls, SAF uses points-to analysis to resolve the possible targets. The call graph is refined as PTA discovers which functions a pointer may refer to.
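The refinement step can be sketched as follows: each function in the pointer's points-to set becomes a candidate call edge. The names and dictionary shapes here are illustrative only:

```python
# PTA result (illustrative): the function pointer `fptr` may point to foo or bar.
points_to = {"fptr": {"foo", "bar"}}

# Indirect call sites: (calling function, pointer used at the call site).
indirect_calls = [("main", "fptr")]

call_edges = set()
for caller, ptr in indirect_calls:
    for target in sorted(points_to.get(ptr, ())):  # sorted for determinism
        call_edges.add((caller, target))

assert call_edges == {("main", "foo"), ("main", "bar")}
```

As the PTA solver discovers more targets for a pointer, the corresponding indirect call sites gain edges, which in turn can expose new constraints to the solver -- hence "on-the-fly" call graph construction.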
SAF's PropertyGraph Format
{
"schema_version": "0.1.0",
"graph_type": "callgraph",
"nodes": [
{
"id": "0x...",
"labels": ["Function"],
"properties": {
"name": "main",
"kind": "defined"
}
}
],
"edges": [
{
"src": "0x...",
"dst": "0x...",
"edge_type": "CALLS",
"properties": {}
}
]
}
- Nodes have `labels: ["Function"]` and `properties.name` for the function name
- The `properties.kind` field indicates `"defined"` (has a body) or `"external"`
- Edges have `edge_type: "CALLS"`
Exporting with the Python SDK
from saf import Project
proj = Project.open("program.ll")
graphs = proj.graphs()
cg = graphs.export("callgraph")
nodes = cg["nodes"]
edges = cg["edges"]
# Build adjacency
callees = {}
callers = {}
for node in nodes:
nid = node["id"]
name = node["properties"]["name"]
callees[nid] = []
callers[nid] = []
for edge in edges:
callees[edge["src"]].append(edge["dst"])
callers[edge["dst"]].append(edge["src"])
# Find entry points and leaves
for node in nodes:
nid = node["id"]
name = node["properties"]["name"]
if not callers[nid]:
print(f"Entry point: {name}")
if not callees[nid]:
print(f"Leaf: {name}")
Why Call Graphs Matter
Call graphs are essential for:
- Attack surface analysis: Entry points and functions that process external input
- Dead code detection: Functions with no callers may be unused
- Interprocedural analysis: Taint and value flow follow call edges
- Modular analysis: Analyze one function at a time, using summaries for callees
- Dependency analysis: Understanding how changes propagate through a codebase
Next Steps
- Points-To Analysis -- Resolving indirect call targets
- Value Flow -- Data flow across function calls
Points-To Analysis
Points-to analysis (PTA) determines what memory locations each pointer in a program may refer to. This is fundamental to understanding aliasing, resolving indirect calls, and building accurate data flow graphs.
The Problem
Consider:
int x = 10, y = 20;
int *p = &x;
int *q = p; // q also points to x
*q = 30; // modifies x through q
Without PTA, an analysis cannot determine that modifying *q affects x.
Points-to analysis computes that both p and q point to the same object x,
enabling the analysis to track data flow through memory.
Andersen's Analysis
SAF implements Andersen-style inclusion-based points-to analysis. This is a whole-program, flow-insensitive, context-insensitive analysis that computes a conservative approximation of the points-to sets.
Constraint Types
PTA works by extracting constraints from the program and solving them:
| Constraint | Meaning | Example |
|---|---|---|
| Addr | p points to object o | p = &x |
| Copy | p points to everything q points to | p = q |
| Load | p points to what *q points to | p = *q |
| Store | *p points to what q points to | *p = q |
| GEP | p points to a field of what q points to | p = &q->field |
Solving
The solver iterates until a fixed point:
- Initialize points-to sets from `Addr` constraints
- Process `Copy` constraints: propagate points-to sets
- Process `Load`/`Store` constraints: follow indirections
- Repeat until no points-to set changes
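The fixed-point loop can be sketched in a few lines of Python. This toy solver handles the four basic constraint kinds over string-named values and objects; it illustrates the algorithm only and is not SAF's solver:

```python
from collections import defaultdict

def solve(constraints):
    """Naive Andersen-style fixed point over (kind, a, b) constraints."""
    pts = defaultdict(set)
    for kind, a, b in constraints:
        if kind == "addr":            # a = &b
            pts[a].add(b)
    changed = True
    while changed:
        changed = False
        for kind, a, b in constraints:
            if kind == "copy":        # a = b
                new = pts[b] - pts[a]
            elif kind == "load":      # a = *b
                new = set().union(*[pts[o] for o in pts[b]]) - pts[a] if pts[b] else set()
            elif kind == "store":     # *a = b
                for o in list(pts[a]):
                    extra = pts[b] - pts[o]
                    if extra:
                        pts[o] |= extra
                        changed = True
                continue
            else:
                continue              # "addr" already seeded
            if new:
                pts[a] |= new
                changed = True
    return {v: s for v, s in pts.items() if s}

# p = &x; q = p; *q = &y  -->  p and q both point to x, and x holds y.
result = solve([("addr", "p", "x"), ("copy", "q", "p"),
                ("addr", "t", "y"), ("store", "q", "t")])
assert result["p"] == {"x"} and result["q"] == {"x"}
assert result["x"] == {"y"}
```

Re-scanning every constraint each round is quadratic in the worst case; production solvers use worklists and difference propagation, but the fixed-point structure is the same.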
Field Sensitivity
SAF supports field-sensitive analysis, distinguishing different fields of a struct:
struct Pair { int *a; int *b; };
struct Pair s;
s.a = &x; // s.a -> x
s.b = &y; // s.b -> y (distinct from s.a)
Without field sensitivity, s.a and s.b would be conflated, causing false
aliasing.
Querying Points-To Results
Python SDK
from saf import Project
proj = Project.open("program.ll")
q = proj.query()
# What does pointer p point to? (pass the hex ID of the pointer value)
pts = q.points_to("0x00000000000000000000000000000001")
print(f"p points to: {pts}")
# Do p and q alias? (pass hex IDs of both pointer values)
alias = q.may_alias("0x00000000000000000000000000000001",
"0x00000000000000000000000000000002")
print(f"p and q may alias: {alias}")
Export Format
pta = proj.pta_result().export()
# Returns: {"points_to": [{"value": "0x...", "locations": ["0x...", ...]}, ...]}
Impact on Other Analyses
PTA results feed into multiple downstream analyses:
| Analysis | How PTA Helps |
|---|---|
| Call graph | Resolves indirect call targets (function pointers) |
| ValueFlow | Determines which stores affect which loads |
| Taint analysis | Tracks taint through pointer indirections |
| Alias analysis | Identifies when two pointers may refer to the same memory |
Without PTA, these analyses must conservatively assume that any pointer could point to any object, leading to excessive false positives.
Configuration
PTA behavior can be tuned via configuration:
{
"pta": {
"enabled": true,
"field_sensitivity": "struct_fields",
"max_objects": 100000
}
}
- `field_sensitivity`: `"none"` or `"struct_fields"`
- `max_objects`: Safety cap; when exceeded, the analysis collapses to a conservative approximation using `unknown_mem`
Next Steps
- Value Flow -- How PTA enables precise data flow tracking
- Taint Analysis -- Using PTA-aware taint propagation
Value Flow
The ValueFlow graph is SAF's central data flow representation. It tracks how values move through a program -- from where they are created to where they are used -- across function boundaries and through memory operations.
From Def-Use to ValueFlow
Def-Use Chains
A def-use chain connects a value's definition to its uses within a single function:
int x = 10; // definition of x
printf("%d", x); // use of x
return x; // another use of x
Def-use chains are intraprocedural (within one function) and track SSA values.
ValueFlow Graph
The ValueFlow graph extends def-use chains with:
- Interprocedural edges: Data flowing across function calls (arguments and return values)
- Memory modeling: Data flowing through stores and loads, resolved by PTA
- Transform edges: Data modified by arithmetic, casts, or other operations
Edge Types
| Edge Type | Meaning | Example |
|---|---|---|
| DefUse | SSA def-use chain (direct assignment) | y = x |
| Store | Value written to memory | *p = x |
| Load | Value read from memory | y = *p |
| CallArg | Value passed as function argument | foo(x) |
| Return | Value returned from function | return x |
| Transform | Value modified by an operation | y = x + 1 |
Node Types
| Node Kind | Meaning |
|---|---|
| Value | An SSA register value |
| Location | A memory location (from pointer analysis) |
| UnknownMem | Unknown or external memory |
Example
char *buf = malloc(64); // Value: malloc return
strcpy(buf, "hello"); // CallArg: buf -> strcpy arg 0
log_message(buf); // CallArg: buf -> log_message arg 0
free(buf); // CallArg: buf -> free arg 0
The ValueFlow graph captures all of these flows:
malloc() return
|
+--[CallArg]--> strcpy arg 0
+--[CallArg]--> log_message arg 0
+--[CallArg]--> free arg 0
PropertyGraph Format
{
"schema_version": "0.1.0",
"graph_type": "valueflow",
"metadata": {
"node_count": 42,
"edge_count": 58
},
"nodes": [
{
"id": "0x...",
"labels": ["Value"],
"properties": { "kind": "Value" }
}
],
"edges": [
{
"src": "0x...",
"dst": "0x...",
"edge_type": "DEFUSE",
"properties": {}
}
]
}
Exporting with the Python SDK
from saf import Project
proj = Project.open("program.ll")
graphs = proj.graphs()
# Def-use chains
defuse = graphs.export("defuse")
definitions = [e for e in defuse["edges"] if e["edge_type"] == "DEFINES"]
uses = [e for e in defuse["edges"] if e["edge_type"] == "USED_BY"]
print(f"Definitions: {len(definitions)}, Uses: {len(uses)}")
# Full ValueFlow graph
vf = graphs.export("valueflow")
print(f"Nodes: {len(vf['nodes'])}, Edges: {len(vf['edges'])}")
# Count edge types
from collections import Counter
edge_types = Counter(e["edge_type"] for e in vf["edges"])
for kind, count in edge_types.most_common():
print(f" {kind}: {count}")
How ValueFlow Enables Analysis
The ValueFlow graph is the foundation for SAF's query capabilities:
| Analysis | How It Uses ValueFlow |
|---|---|
| Taint flow | BFS from source nodes to sink nodes |
| Memory leak | Check if allocation nodes reach exit without passing through free |
| Use-after-free | Check if freed pointer reaches a dereference |
| Double free | Check if freed pointer reaches another free |
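Each of these checks reduces to reachability over the edge list. As an illustration, a simplified memory-leak check asks whether an allocation node can reach a `free` node at all (toy graph with made-up node names, not SAF's internals):

```python
from collections import deque

# Toy ValueFlow adjacency: malloc's return flows to strcpy, log_message, free.
edges = {
    "malloc_ret": ["strcpy_arg0", "log_arg0", "free_arg0"],
    "strcpy_arg0": [],
    "log_arg0": [],
    "free_arg0": [],
}

def reaches(edges, start, target):
    """BFS reachability over a ValueFlow-style adjacency list."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# The allocation reaches free, so no leak is reported on this path.
assert reaches(edges, "malloc_ret", "free_arg0")
```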
Next Steps
- Taint Analysis -- Querying the ValueFlow graph for vulnerabilities
- Points-To Analysis -- How PTA makes ValueFlow precise
Taint Analysis
Taint analysis tracks the flow of untrusted data through a program to determine if it can reach security-sensitive operations. It is SAF's primary technique for detecting injection vulnerabilities, information leaks, and other data-flow security issues.
Core Concepts
Sources, Sinks, and Sanitizers
| Concept | Definition | Examples |
|---|---|---|
| Source | Where untrusted ("tainted") data enters the program | argv, getenv(), read(), fgets() |
| Sink | A dangerous function that should never receive tainted data | system(), execve(), printf() format arg |
| Sanitizer | A function that validates or cleans data, removing the taint | Input validation, bounds checking, escaping |
The Question
Taint analysis answers: "Can data from a source reach a sink without passing through a sanitizer?"
If the answer is yes, a finding is reported -- a potential vulnerability.
How SAF Performs Taint Analysis
SAF implements taint analysis as a graph reachability query over the ValueFlow graph:
- Identify source nodes in the ValueFlow graph (e.g., the argv parameter, the getenv() return value)
- Identify sink nodes (e.g., the system() argument)
- Identify sanitizer nodes (optional -- nodes that "clean" the taint)
- BFS traversal from sources to sinks, skipping paths through sanitizers
- Report findings with deterministic trace paths
BFS vs IFDS
SAF provides two taint analysis modes:
| Mode | Method | Precision | Speed |
|---|---|---|---|
| BFS | q.taint_flow() | Flow-insensitive | Fast |
| IFDS | proj.ifds_taint() | Context-sensitive, flow-sensitive | Slower |
BFS is sufficient for most vulnerability detection. IFDS provides higher precision when false positives from flow-insensitive analysis are a concern.
Using the Python SDK
Basic Taint Flow
from saf import Project, sources, sinks
proj = Project.open("program.ll")
q = proj.query()
# Find flows from argv to system()
findings = q.taint_flow(
sources=sources.function_param("main", 1), # argv
sinks=sinks.call("system", arg_index=0), # system()'s first arg
)
for f in findings:
print(f"Finding: {f.finding_id}")
if f.trace:
for step in f.trace.steps:
print(f" -> {step}")
Available Selectors
Source selectors:
| Selector | Description |
|---|---|
sources.function_param(name, index) | Function parameter by name and position |
sources.function_return(name) | Return value of a named function |
sources.call(name) | Return value from calls to a named function |
Sink selectors:
| Selector | Description |
|---|---|
sinks.call(name, arg_index=N) | Argument N of calls to a named function |
sinks.arg_to(name, index) | Argument at index passed to a named function |
With Sanitizers
from saf import sources, sinks
findings = q.taint_flow(
sources=sources.function_param("main", 1),
sinks=sinks.call("system", arg_index=0),
sanitizers=sources.function_return("validate_input"), # Paths through this function are safe
)
Common Vulnerability Patterns
| Vulnerability | CWE | Source | Sink |
|---|---|---|---|
| Command injection | CWE-78 | argv, getenv() | system(), execve() |
| Format string | CWE-134 | User input | printf() format arg |
| SQL injection | CWE-89 | HTTP parameters | SQL query functions |
| Path traversal | CWE-22 | User input | fopen(), open() |
| Buffer overflow | CWE-120 | malloc() return | Unchecked memory write |
Checker Framework
For common patterns, SAF provides built-in checkers that pre-configure the appropriate sources, sinks, and modes:
# Instead of manually specifying sources and sinks:
findings = proj.check("memory-leak")
findings = proj.check("use-after-free")
findings = proj.check("double-free")
# Or run all 9 built-in checkers at once
all_findings = proj.check_all()
The checker framework supports 9 built-in checkers covering memory safety, information flow, and resource management. See the Python SDK reference for the full list.
Next Steps
- Tutorials -- Hands-on guides for UAF, leaks, double-free, and taint analysis
- Python SDK Reference -- Full API reference for selectors, checkers, and queries
Python SDK
The SAF Python SDK (import saf) provides full access to SAF's static analysis
capabilities from Python. It is built with PyO3 and installed via maturin.
Installation
The SDK is built automatically when entering the Docker environment:
make shell
# SDK is available immediately:
python3 -c "import saf; print(saf.version())"
For manual installation inside the dev container:
maturin develop --release
Core API
Project
The Project class is the main entry point for all analysis operations.
from saf import Project
# Open a project from LLVM IR
proj = Project.open("program.ll")
# Open from AIR-JSON
proj = Project.open("program.air.json")
# Open with analysis tuning parameters
proj = Project.open(
"program.ll",
vf_mode="precise", # "fast" (default) or "precise"
pta_solver="worklist", # "worklist" (default) or "datalog"
pta_max_iterations=20000, # default: 10000
field_sensitivity_depth=3, # default: 2 (0 = disabled)
max_refinement_iterations=5, # default: 10
)
Project.open() signature:
Project.open(
path: str,
*,
vf_mode: str = "fast",
pta_solver: str = "worklist",
pta_max_iterations: int | None = None,
field_sensitivity_depth: int | None = None,
max_refinement_iterations: int | None = None,
) -> Project
| Parameter | Description |
|---|---|
path | Path to input file (.air.json, .ll, or .bc). Frontend is selected automatically by extension. |
vf_mode | "fast" routes all memory through a single unknown node for robust taint analysis. "precise" uses points-to analysis to resolve memory locations (may miss flows through unresolved pointers). |
pta_solver | "worklist" uses the imperative worklist-based solver. "datalog" uses the Ascent Datalog fixpoint solver. |
pta_max_iterations | Maximum PTA solver iterations. Default: 10000. |
field_sensitivity_depth | Field sensitivity depth. 0 = disabled, default: 2. Higher values track deeper nested struct fields. |
max_refinement_iterations | Maximum CG refinement iterations. Default: 10. |
Raises: FrontendError if the input file cannot be parsed or the required frontend is not available.
Schema Discovery
schema = proj.schema()
# Returns a dict with structured information about:
# - tool_version, schema_version
# - frontends (air-json, llvm with extensions and descriptions)
# - graphs (cfg, callgraph, defuse, valueflow)
# - queries (taint_flow, flows, points_to, may_alias with parameters)
# - selectors (sources, sinks, sanitizers)
Query Context
from saf import sources, sinks, sanitizers
q = proj.query()
# Taint flow analysis
findings = q.taint_flow(
sources=sources.function_param("main", 1),
sinks=sinks.call("system", arg_index=0),
)
# With sanitizers (accepts a Selector or SelectorSet, not strings)
findings = q.taint_flow(
sources=sources.function_param("main", 1),
sinks=sinks.call("system", arg_index=0),
sanitizers=sanitizers.call("validate_input"),
limit=500, # default: 1000
)
# Data flows (without sanitizer filtering)
findings = q.flows(
sources=sources.function_param("read_data"),
sinks=sinks.arg_to("write_output", 0),
limit=100,
)
# Points-to query (takes hex value ID string)
pts = q.points_to("0x00000000000000000000000000000001")
# Alias query (takes hex value ID strings)
alias = q.may_alias("0x00000001", "0x00000002")
Query method signatures:
| Method | Parameters | Returns |
|---|---|---|
taint_flow(sources, sinks, sanitizers=None, *, limit=1000) | Selector/SelectorSet for each | list[Finding] |
flows(sources, sinks, *, limit=1000) | Selector/SelectorSet for each | list[Finding] |
points_to(ptr) | Hex string value ID | list[str] (location IDs) |
may_alias(p, q) | Hex string value IDs | bool |
Graph Export
graphs = proj.graphs()
# List available graph types
print(graphs.available()) # ["cfg", "callgraph", "defuse", "valueflow"]
# Export to PropertyGraph dict
cfg = graphs.export("cfg")
cfg_main = graphs.export("cfg", function="main") # single function
cg = graphs.export("callgraph")
du = graphs.export("defuse")
vf = graphs.export("valueflow")
# Export to Graphviz DOT string
dot_str = graphs.to_dot("callgraph")
# Export to interactive HTML (Cytoscape.js)
html_str = graphs.to_html("cfg", function="main")
All export() calls return a unified PropertyGraph dict:
{
"schema_version": "0.1.0",
"graph_type": "callgraph",
"metadata": {},
"nodes": [{"id": "0x...", "labels": [...], "properties": {...}}, ...],
"edges": [{"src": "0x...", "dst": "0x...", "edge_type": "...", "properties": {}}, ...],
}
Source and Sink Selectors
Selectors identify values in the program for taint analysis. They can be
combined using the | operator to form a SelectorSet.
Sources (saf.sources)
| Function | Description |
|---|---|
function_param(function, index=None) | Select function parameters by name pattern (glob-style). index is 0-based; None selects all parameters. |
function_return(function) | Select function return values by name pattern. |
call(callee) | Select return values of calls to a function. |
argv() | Select command-line arguments (shortcut for function_param("main", None)). |
getenv(name=None) | Select environment variable reads (shortcut for call("getenv")). |
from saf import sources
src = sources.function_param("main", 1)
src = sources.function_param("read_*") # glob pattern
src = sources.function_return("get_input")
src = sources.call("getenv")
src = sources.argv()
# Combine with |
combined = sources.argv() | sources.getenv()
Sinks (saf.sinks)
| Function | Description |
|---|---|
call(callee, *, arg_index=None) | Select calls to a function. If arg_index is given, selects that argument; otherwise selects the call result. |
arg_to(callee, index) | Select arguments passed to a function (0-based index). |
from saf import sinks
sink = sinks.call("system", arg_index=0)
sink = sinks.call("printf", arg_index=0)
sink = sinks.arg_to("free", 0)
# Without arg_index, selects the call result
sink = sinks.call("dangerous_function")
Sanitizers (saf.sanitizers)
| Function | Description |
|---|---|
call(callee, *, arg_index=None) | Select calls to a sanitizing function. If arg_index is given, selects that argument; otherwise the return value is considered sanitized. |
arg_to(callee, index) | Select arguments passed to a sanitizing function. |
from saf import sanitizers
san = sanitizers.call("escape_html", arg_index=0)
san = sanitizers.call("sanitize_input")
# Combine sanitizers
combined = sanitizers.call("sanitize") | sanitizers.call("escape")
Module-Level Selector Factories
The following factory functions are also available directly from saf._saf
(used internally by the sources, sinks, and sanitizers modules):
- function_param(function, index=None) -- Select function parameters.
- function_return(function) -- Select function return values.
- call(callee) -- Select call results.
- arg_to(callee, index=None) -- Select arguments to a callee.
Checker Framework
Running Built-In Checkers
# Run a specific checker
findings = proj.check("memory-leak")
# Run multiple checkers at once (pass a list)
findings = proj.check(["memory-leak", "use-after-free", "double-free"])
# Run all 9 built-in checkers
findings = proj.check_all()
# List available checkers and their metadata
schema = proj.checker_schema()
for checker in schema["checkers"]:
print(f"{checker['name']}: {checker['description']} (CWE-{checker['cwe']})")
Built-In Checker Table
| Checker Name | CWE | Description |
|---|---|---|
memory-leak | 401 | Allocated memory never freed |
use-after-free | 416 | Memory accessed after being freed |
double-free | 415 | Memory freed more than once |
null-deref | 476 | Null pointer dereference |
file-descriptor-leak | 403 | Opened file never closed |
uninit-use | 457 | Use of uninitialized memory |
stack-escape | 562 | Returning stack address |
lock-not-released | 764 | Mutex not unlocked |
generic-resource-leak | N/A | Custom resource tracking |
Custom Checkers
# Define a custom checker with source/sink/sanitizer roles
findings = proj.check_custom(
"my-custom-leak",
mode="must_not_reach", # "may_reach", "must_not_reach", or "never_reach_sink"
source_role="allocator", # resource role for sources
source_match_return=True, # match return value (True) or first arg (False)
sink_is_exit=True, # sinks are function exits
sink_role=None, # or a resource role string
sanitizer_role="deallocator", # or None
sanitizer_match_return=False,
cwe=401, # optional CWE ID
severity="warning", # "info", "warning", "error", "critical"
)
Path-Sensitive Checking (Z3)
# Run checkers with Z3-based path feasibility filtering
result = proj.check_path_sensitive("null-deref", z3_timeout_ms=2000, max_guards=64)
# Or run all checkers with path sensitivity
result = proj.check_all_path_sensitive(z3_timeout_ms=1000, max_guards=64)
# Result has feasible, infeasible, and unknown findings
print(f"Real bugs: {len(result.feasible)}")
print(f"False positives filtered: {len(result.infeasible)}")
print(f"Unknown: {len(result.unknown)}")
print(result.diagnostics) # dict with Z3 statistics
# Post-filter existing findings
raw_findings = proj.check_all()
result = proj.filter_infeasible(raw_findings, z3_timeout_ms=1000, max_guards=64)
CheckerFinding Attributes
Each item returned by check(), check_all(), or check_custom() is a
CheckerFinding:
| Attribute | Type | Description |
|---|---|---|
checker | str | Checker name that produced this finding |
severity | str | "info", "warning", "error", or "critical" |
cwe | int | None | CWE ID if applicable |
message | str | Human-readable description |
source | str | Source SVFG node hex ID |
sink | str | Sink SVFG node hex ID |
trace | list[str] | Path from source to sink as hex node IDs |
sink_traces | list[dict] | Per-sink traces for multi-reach findings (e.g., double-free). Each dict has "sink" and "trace" keys. |
for f in proj.check("use-after-free"):
print(f.checker, f.severity, f.message)
print(f" CWE-{f.cwe}: {f.source} -> {f.sink}")
print(f" Trace length: {len(f.trace)}")
# Convert to dict
d = f.to_dict()
Finding Objects
The taint_flow() and flows() query methods return Finding objects, which
are distinct from CheckerFinding objects.
Finding Attributes
| Attribute | Type | Description |
|---|---|---|
finding_id | str | Deterministic hex identifier |
source_location | str | Source location (file:line:col or value ID) |
sink_location | str | Sink location (file:line:col or value ID) |
source_id | str | Source value ID (hex) |
sink_id | str | Sink value ID (hex) |
rule_id | str | None | Optional rule identifier |
trace | Trace | Step-by-step data flow path |
for f in q.taint_flow(sources.argv(), sinks.call("system", arg_index=0)):
print(f"{f.source_location} -> {f.sink_location}")
print(f.trace.pretty()) # human-readable trace
print(f"Steps: {len(f.trace)}")
d = f.to_dict() # convert to dict
Trace and TraceStep
A Trace contains a list of TraceStep objects. Each step represents one
hop in the value-flow graph:
| TraceStep Attribute | Type | Description |
|---|---|---|
from_id | str | Source node ID |
from_kind | str | Source node kind |
from_symbol | str | None | Symbol name at source |
from_location | str | None | Source file:line:col |
edge | str | Edge kind (def_use, transform, store, load, etc.) |
to_id | str | Target node ID |
to_kind | str | Target node kind |
to_symbol | str | None | Symbol name at target |
to_location | str | None | Target file:line:col |
for step in finding.trace.steps:
print(f" {step.from_symbol or step.from_id} --{step.edge}-> "
f"{step.to_symbol or step.to_id}")
Resource Table
The resource table maps function names to resource management roles. It ships with built-in entries for C stdlib, C++ operators, POSIX I/O, and pthreads.
table = proj.resource_table()
table.has_role("malloc", "allocator") # True
table.has_role("free", "deallocator") # True
table.has_role("fopen", "acquire") # True
table.has_role("fclose", "release") # True
# Add custom entries
table.add("my_alloc", "allocator")
table.add("my_free", "deallocator")
# Inspect
print(table.size) # number of entries
print(table.function_names()) # sorted list of function names
entries = table.export() # list of {"name": ..., "roles": [...]}
Available roles: allocator, deallocator, reallocator, acquire,
release, lock, unlock, null_source, dereference.
Advanced Analysis
IFDS Taint Analysis
Precise interprocedural taint tracking using the IFDS framework (Reps/Horwitz/Sagiv tabulation algorithm):
result = proj.ifds_taint(
sources=sources.function_param("main", 0),
sinks=sinks.call("system", arg_index=0),
sanitizers=sanitizers.call("validate"), # optional
)
Typestate Analysis
Track per-resource state machines using the IDE framework:
# Built-in specs: "file_io", "mutex_lock", "memory_alloc"
result = proj.typestate("file_io")
# Custom typestate spec: pass a previously constructed TypestateSpec instance
from saf import TypestateSpec
result = proj.typestate_custom(spec)  # spec: a TypestateSpec
Flow-Sensitive Pointer Analysis
More precise than Andersen's flow-insensitive analysis for programs with pointer reassignment:
fs_result = proj.flow_sensitive_pta(pts_repr="auto")
# pts_repr options: "auto", "btreeset", "bitvector", "bdd"
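As a toy illustration of the gain (not SAF's algorithm), consider p = &a; use(p); p = &b; use(p). A flow-insensitive analysis merges both assignments into one points-to set, while a flow-sensitive one applies a strong update at each reassignment:

```python
# Toy program as (op, var, target) tuples; "assign" rebinds p, "use" queries it.
prog = [("assign", "p", "a"), ("use", "p", None),
        ("assign", "p", "b"), ("use", "p", None)]

def flow_insensitive(prog):
    # One points-to set per variable for the whole program.
    pts = {}
    for op, var, tgt in prog:
        if op == "assign":
            pts.setdefault(var, set()).add(tgt)
    return [sorted(pts[var]) for op, var, _ in prog if op == "use"]

def flow_sensitive(prog):
    # Points-to sets recomputed in statement order; assignment kills old targets.
    pts, results = {}, []
    for op, var, tgt in prog:
        if op == "assign":
            pts[var] = {tgt}            # strong update
        else:
            results.append(sorted(pts[var]))
    return results

print(flow_insensitive(prog))  # [['a', 'b'], ['a', 'b']] -- both uses imprecise
print(flow_sensitive(prog))    # [['a'], ['b']] -- each use sees the current target
```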
Context-Sensitive Pointer Analysis (k-CFA)
Distinguishes calls to the same function from different call sites:
cs_result = proj.context_sensitive_pta(k=1, pts_repr="auto")
Demand-Driven Pointer Analysis
Computes points-to information only for explicitly queried pointers:
dda = proj.demand_pta(
max_steps=100_000,
max_context_depth=10,
timeout_ms=5000,
enable_strong_updates=True,
pts_repr="auto",
)
Memory SSA and SVFG
mssa = proj.memory_ssa() # Memory SSA representation
svfg = proj.svfg() # Sparse Value-Flow Graph
Call Graph Refinement
Iterative CHA + PTA-based indirect call resolution:
result = proj.refine_call_graph(entry_points="all", max_iterations=10)
Abstract Interpretation
Numeric interval analysis with widening/narrowing:
result = proj.abstract_interp(
max_widening=100,
narrowing_iterations=3,
use_thresholds=True,
)
Numeric Checkers
# Individual numeric checker
findings = proj.check_numeric("buffer_overflow") # CWE-120
findings = proj.check_numeric("integer_overflow") # CWE-190
findings = proj.check_numeric("division_by_zero") # CWE-369
findings = proj.check_numeric("shift_count") # CWE-682
# All numeric checkers at once
findings = proj.check_all_numeric()
Combined PTA + Abstract Interpretation
Alias-aware numeric analysis with bidirectional refinement:
result = proj.analyze_combined(
enable_refinement=True,
max_refinement_iterations=3,
)
interval = result.interval_at("0x1234...")
alias = result.may_alias("0x5678...", "0x9abc...")
Z3 Path Refinement
# Prove/disprove assertions
result = proj.prove_assertions(z3_timeout_ms=1000, max_guards=64)
# Refine alias query with path constraints
result = proj.refine_alias("0xP", "0xQ", at_block="0xB", func_id="0xF")
# Check if a feasible path exists between two blocks
result = proj.check_path_reachable(
from_block="0xB1", to_block="0xB2", func_id="0xF",
z3_timeout_ms=1000, max_guards=64, max_paths=100,
)
JSON Protocol (LLM Agent Interface)
import json
resp = proj.request('{"action": "schema"}')
data = json.loads(resp)
AIR Module Access
The AirModule provides mid-level access to the intermediate representation:
air = proj.air()
print(air.name) # module name
print(air.id) # module ID (hex)
print(air.function_count) # number of functions
print(air.global_count) # number of globals
print(air.function_names()) # list of function names
print(air.global_names()) # list of global names
Visualization
The saf.viz module provides dependency-free graph visualization:
from saf import viz
pg = proj.graphs().export("callgraph")
# Graphviz DOT string (no dependencies)
dot_str = viz.to_dot(pg)
# Interactive HTML with Cytoscape.js (no dependencies)
html_str = viz.to_html(pg)
# Open in browser or save to file
viz.visualize(pg) # opens in browser
viz.visualize(pg, output="callgraph.html") # saves to file
# Cytoscape.js JSON (for ipycytoscape in Jupyter)
cy_json = viz.to_cytoscape_json(pg)
# NetworkX DiGraph (requires networkx)
G = viz.to_networkx(pg)
# graphviz.Digraph object (requires graphviz package)
gv = viz.to_graphviz(pg)
gv.render("graph", format="svg")
Exceptions
All SAF exceptions inherit from SafError and carry .code and .details
attributes for structured error handling:
| Exception | Description |
|---|---|
SafError | Base exception for all SAF errors |
FrontendError | Frontend ingestion errors (parsing, I/O, unsupported features) |
AnalysisError | Analysis errors (PTA timeout, ValueFlow build error) |
QueryError | Query execution errors (invalid selector, no match) |
ConfigError | Configuration errors (invalid field, incompatible options) |
from saf import Project, SafError, FrontendError
try:
proj = Project.open("nonexistent.ll")
except FrontendError as e:
print(f"Error: {e}")
Module Exports
The saf package exports the following names:
from saf import (
# Core classes
Project, Query, Finding, Trace, TraceStep,
# Selectors
Selector, SelectorSet,
# Selector modules
sources, sinks, sanitizers, viz,
# Checker types
CheckerFinding, PathSensitiveResult, ResourceTable,
# Typestate types
TypestateResult, TypestateFinding, TypestateSpec, typestate_specs,
# Exceptions
SafError, FrontendError, AnalysisError, QueryError, ConfigError,
# Resource role constants
Allocator, Deallocator, Reallocator, Acquire, Release,
NullSource, Dereference,
# Reachability mode constants
MayReach, MustNotReach,
# Severity constants
Info, Warning, Error, Critical,
# Functions
version,
)
CLI
The SAF command-line interface provides access to analysis capabilities from the
terminal. The binary is named saf.
Global Options
| Option | Description |
|---|---|
--json-errors | Output errors as JSON for machine-readable consumption |
Commands
saf index
Index input files via a frontend to produce AIR-JSON.
saf index program.ll
saf index program.ll --output bundle.air.json
saf index --frontend air-json bundle.air.json
| Option | Type | Default | Description |
|---|---|---|---|
<inputs> | positional, required | | Input files to index |
--frontend <frontend> | llvm, air-json | llvm | Frontend to use for ingestion |
--output <path> | path | stdout | Write AIR-JSON output to file instead of stdout |
saf run
Run analysis passes on input programs. This is the primary command -- it ingests input files, builds internal representations (CFG, call graph, pointer analysis), runs the selected checkers, and produces findings.
saf run program.ll
saf run program.ll --mode fast
saf run program.ll --format sarif --output report.sarif
saf run program.ll --checkers memory-leak,null-deref --pta cspta --pta-k 2
Input Options
| Option | Type | Default | Description |
|---|---|---|---|
<inputs> | positional, required | | Input files to analyze |
--frontend <frontend> | llvm, air-json | llvm | Frontend to use for ingestion |
--specs <path> | path | | Additional spec files or directories |
Analysis Options
| Option | Type | Default | Description |
|---|---|---|---|
--mode <mode> | fast, precise | precise | Analysis mode |
--checkers <list> | string | all | Checkers to run (comma-separated, or all / none) |
--combined | flag | | Run combined PTA + abstract interpretation |
Pointer Analysis Options
| Option | Type | Default | Description |
|---|---|---|---|
--pta <variant> | andersen, cspta, fspta, dda | andersen | PTA variant |
--solver <backend> | worklist, datalog | worklist | PTA solver backend |
--pts-repr <repr> | auto, btreeset, fxhash, roaring, bdd | auto | Points-to set representation |
--pta-k <n> | integer | 2 | k-CFA depth (cspta only) |
--field-sensitivity <level> | struct-fields, array-index, flat | struct-fields | Field sensitivity level |
--max-pta-iterations <n> | integer | | Maximum PTA iterations |
Z3 Options
| Option | Type | Default | Description |
|---|---|---|---|
--path-sensitive | flag | | Enable Z3 path-sensitive checker filtering |
--z3-prove | flag | | Prove assertions via Z3 |
--z3-refine-alias | flag | | Refine alias results via Z3 |
--z3-check-reachability | flag | | Check path reachability via Z3 |
--z3-timeout <ms> | integer | 5000 | Z3 solver timeout in milliseconds |
Typestate and Taint Options
| Option | Type | Default | Description |
|---|---|---|---|
--typestate <name> | string | | Run built-in typestate spec |
--typestate-custom <path> | path | | Run custom typestate spec from YAML |
--ifds-taint <path> | path | | Run IFDS taint analysis with config file |
Output Options
| Option | Type | Default | Description |
|---|---|---|---|
--format <fmt> | human, json, sarif | human | Output format |
--output <path> | path | stdout | Write output to file |
--diagnostics | flag | | Include checker/PTA diagnostics |
--verbose | flag | | Show timing, resource table, stats |
saf query
Query analysis results. Requires --input and a subcommand.
saf query points-to --input program.ll 0x00ab1234...
saf query alias --input program.ll 0x001a... 0x002b...
saf query flows --input program.ll 0xsource... 0xsink...
saf query taint --input program.ll 0xsource... 0xsink...
saf query reachable --input program.ll 0xfunc1... 0xfunc2...
| Option | Type | Default | Description |
|---|---|---|---|
--input <files> | path(s), required | | Input files to analyze |
--frontend <frontend> | llvm, air-json | llvm | Frontend to use for ingestion |
Subcommands
| Subcommand | Arguments | Description |
|---|---|---|
points-to | <pointer> (hex value ID) | Points-to set for a value |
alias | <p> <q> (hex value IDs) | May-alias check between two pointers |
flows | <source> <sink> (hex value IDs) | Data-flow reachability |
taint | <source> <sink> (hex value IDs) | Taint-flow query |
reachable | <func_ids>... (hex function IDs) | Call-graph reachability from functions |
saf export
Export graphs or findings in various formats.
saf export cfg --input program.ll
saf export callgraph --input program.ll --format dot --output cg.dot
saf export findings --input program.ll --format sarif --output report.sarif
saf export valueflow --input program.ll --format json --output vfg.json
| Option | Type | Default | Description |
|---|---|---|---|
<target> | positional, required | | Export target (see below) |
--input <files> | path(s), required | | Input files to analyze |
--format <fmt> | json, sarif, dot, html | json | Output format |
--output <path> | path | stdout | Write output to file instead of stdout |
--function <name> | string | | Filter to a specific function (for CFG export) |
--frontend <frontend> | llvm, air-json | llvm | Frontend to use for ingestion |
Export Targets
| Target | Description |
|---|---|
cfg | Control-flow graph |
callgraph | Call graph (inter-procedural) |
defuse | Def-use chains |
valueflow | Value-flow graph |
svfg | Sparse Value-Flow Graph |
findings | Analysis findings / bug reports |
pta | Points-to analysis results |
saf schema
Print the SAF schema (supported frontends, queries, checkers).
saf schema
saf schema --checkers
saf schema --frontends
saf schema --format json
| Option | Type | Default | Description |
|---|---|---|---|
--checkers | flag | | List available checkers only |
--frontends | flag | | List available frontends only |
--format <fmt> | human, json, sarif | human | Output format |
saf specs
Manage function specifications (list, validate, look up).
saf specs list
saf specs list --verbose
saf specs validate path/to/specs.yaml
saf specs lookup malloc
Subcommands
| Subcommand | Arguments | Description |
|---|---|---|
list | [--verbose] | List loaded function specifications |
validate | <path> (required) | Validate spec file or directory |
lookup | <name> (required) | Look up the spec for a function by name |
saf incremental
Run incremental analysis with caching. Only recomputes what changed between runs.
saf incremental src/*.ll
saf incremental src/*.ll --cache-dir .saf-cache
saf incremental src/*.ll --plan
saf incremental src/*.ll --clean
saf incremental src/*.ll --export-summaries summaries.yaml
| Option | Type | Default | Description |
|---|---|---|---|
<inputs> | positional, required | | Input files to analyze |
--frontend <frontend> | llvm, air-json | llvm | Frontend to use for ingestion |
--mode <mode> | sound, best-effort | best-effort | Precision mode for incremental analysis |
--cache-dir <path> | path | .saf-cache | Cache directory for incremental state |
--plan | flag | | Dry-run: show what would be recomputed without running analysis |
--clean | flag | | Clear the cache before analysis |
--export-summaries <path> | path | | Export computed summaries as YAML to the given path |
saf help
Show topic-based help guides.
saf help # Overview of all commands and topics
saf help run # How to run analyses and configure passes
saf help checkers # Built-in bug checkers and how to select them
saf help pta # Pointer analysis variants and configuration
saf help typestate # Typestate checking and custom protocol specs
saf help taint # Taint analysis modes and configuration
saf help z3 # Z3-based path sensitivity and refinement
saf help export # Export targets, formats, and examples
saf help specs # Function specification format and discovery
saf help incremental # Incremental analysis and caching
saf help examples # Common usage patterns and recipes
Available Topics
| Topic | Description |
|---|---|
run | How to run analyses and configure passes |
checkers | Built-in bug checkers and how to select them |
pta | Pointer analysis variants and configuration |
typestate | Typestate checking and custom protocol specs |
taint | Taint analysis modes and configuration |
z3 | Z3-based path sensitivity and refinement |
export | Export targets, formats, and examples |
specs | Function specification format and discovery |
incremental | Incremental analysis and caching |
examples | Common usage patterns and recipes |
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (no findings, or findings exported) |
| 1 | Error (invalid input, analysis failure) |
| 2 | Findings detected (when used in CI mode) |
Examples
Quick Scan
# Run all checkers with defaults (Andersen PTA, precise mode)
saf run program.ll
CI/CD Integration
# Run all checkers and output SARIF for GitHub Code Scanning
saf run program.ll --format sarif --output results.sarif
# Check exit code for pass/fail
if saf run program.ll --format json | jq '.findings | length' | grep -q '^0$'; then
echo "No findings"
else
echo "Findings detected"
exit 1
fi
Batch Analysis
# Analyze multiple files
for f in src/*.ll; do
saf run "$f" --format sarif --output "results/$(basename "$f" .ll).sarif"
done
Path-Sensitive Analysis
# Reduce false positives with Z3-based path feasibility filtering
saf run program.ll --path-sensitive --z3-timeout 10000
Export Call Graph as DOT
saf export callgraph --input program.ll --format dot --output cg.dot
dot -Tpng cg.dot -o cg.png
Incremental Analysis
# First run caches results; subsequent runs only re-analyze changed files
saf incremental src/*.ll --cache-dir .saf-cache
Demand-Driven Points-To Query
saf query points-to --input program.ll 0x00ab1234abcd5678
PropertyGraph Format
All SAF graph exports use a unified PropertyGraph JSON format. This format is shared across graph types (CFG, call graph, def-use, value-flow, SVFG, PTA) for consistent downstream processing. SAF also exports findings (checker results) and a native PTA format, documented below.
Schema
{
"schema_version": "0.1.0",
"graph_type": "<type>",
"metadata": {},
"nodes": [
{
"id": "0x...",
"labels": ["Label1", "Label2"],
"properties": { "key": "value" }
}
],
"edges": [
{
"src": "0x...",
"dst": "0x...",
"edge_type": "EDGE_TYPE",
"properties": {}
}
]
}
Top-Level Fields
| Field | Type | Description |
|---|---|---|
schema_version | string | Format version (currently "0.1.0") |
graph_type | string | One of cfg, callgraph, defuse, valueflow, svfg, pta |
metadata | object | Graph-specific metadata (e.g., node/edge counts) |
nodes | array | List of node objects |
edges | array | List of edge objects |
Node Fields
| Field | Type | Description |
|---|---|---|
id | string | Deterministic hex ID (0x + 32 hex chars) |
labels | array[string] | Node type labels (e.g., ["Function"], ["Block", "Entry"]) |
properties | object | Type-specific properties |
Edge Fields
| Field | Type | Description |
|---|---|---|
src | string | Source node ID |
dst | string | Destination node ID |
edge_type | string | Edge type label |
properties | object | Edge-specific properties |
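These field contracts are easy to sanity-check when consuming an export. The sketch below uses only the documented fields; check_property_graph is a hypothetical helper, not part of the SDK:

```python
def check_property_graph(pg):
    """Return a list of problems in a PropertyGraph dict: missing required
    fields, and edges whose endpoints are not declared in "nodes"."""
    problems = []
    ids = set()
    for n in pg.get("nodes", []):
        if not {"id", "labels", "properties"} <= n.keys():
            problems.append(f"node missing fields: {n}")
        else:
            ids.add(n["id"])
    for e in pg.get("edges", []):
        if not {"src", "dst", "edge_type", "properties"} <= e.keys():
            problems.append(f"edge missing fields: {e}")
            continue
        for endpoint in (e["src"], e["dst"]):
            if endpoint not in ids:
                problems.append(f"dangling endpoint {endpoint} on {e['edge_type']} edge")
    return problems

# A call-graph fragment with one undeclared edge endpoint (hypothetical IDs).
pg = {
    "schema_version": "0.1.0",
    "graph_type": "callgraph",
    "metadata": {},
    "nodes": [{"id": "0x01", "labels": ["Function"], "properties": {"name": "main"}}],
    "edges": [{"src": "0x01", "dst": "0x02", "edge_type": "CALLS", "properties": {}}],
}
print(check_property_graph(pg))  # ['dangling endpoint 0x02 on CALLS edge']
```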
Graph Types
Call Graph (callgraph)
| Element | Details |
|---|---|
| Node labels | ["Function"] |
| Node properties | name (function name), kind ("defined" or "external") |
| Edge type | "CALLS" |
CFG (cfg)
| Element | Details |
|---|---|
| Node labels | ["Block"], optionally ["Block", "Entry"] |
| Node properties | name (block name), function (owning function) |
| Edge type | "FLOWS_TO" |
Def-Use (defuse)
| Element | Details |
|---|---|
| Node labels | ["Value"] or ["Instruction"] |
| Node properties | Varies |
| Edge types | "DEFINES", "USED_BY" |
Value Flow (valueflow)
| Element | Details |
|---|---|
| Node labels | ["Value"], ["Location"], or ["UnknownMem"] |
| Node properties | kind ("Value", "Location", "UnknownMem") |
| Edge types | "Direct", "Store", "Load", "CallArg", "Return", "Transform" |
| Metadata | node_count, edge_count |
SVFG (svfg)
The Sparse Value-Flow Graph captures both direct (top-level SSA) and indirect (memory, via MSSA) value flows. It is the foundation for SVFG-based checkers such as null-pointer dereference and use-after-free detectors.
| Element | Details |
|---|---|
| Node labels | ["Value"] or ["MemPhi"] |
| Node properties | kind ("value" or "mem_phi") |
| Edge types | DIRECT_DEF, DIRECT_TRANSFORM, CALL_ARG, RETURN, INDIRECT_DEF, INDIRECT_STORE, INDIRECT_LOAD, PHI_FLOW |
| Metadata | node_count, edge_count |
Edge type descriptions:
| Edge type | Category | Description |
|---|---|---|
DIRECT_DEF | Direct | SSA def-use chain (including phi incoming, select, copy) |
DIRECT_TRANSFORM | Direct | Binary/unary/cast/GEP operand to result |
CALL_ARG | Direct | Actual argument to formal parameter |
RETURN | Direct | Callee return value to caller result |
INDIRECT_DEF | Indirect | Store value to load result (clobber is a store) |
INDIRECT_STORE | Indirect | Store value to MemPhi node |
INDIRECT_LOAD | Indirect | MemPhi node to load result |
PHI_FLOW | Indirect | MemPhi to MemPhi (nested phi chaining) |
Example:
{
"schema_version": "0.1.0",
"graph_type": "svfg",
"metadata": { "node_count": 3, "edge_count": 2 },
"nodes": [
{ "id": "0x00000000000000000000000000000001", "labels": ["Value"], "properties": { "kind": "value" } },
{ "id": "0x00000000000000000000000000000064", "labels": ["MemPhi"], "properties": { "kind": "mem_phi" } },
{ "id": "0x00000000000000000000000000000002", "labels": ["Value"], "properties": { "kind": "value" } }
],
"edges": [
{ "src": "0x00000000000000000000000000000001", "dst": "0x00000000000000000000000000000064", "edge_type": "INDIRECT_STORE", "properties": {} },
{ "src": "0x00000000000000000000000000000064", "dst": "0x00000000000000000000000000000002", "edge_type": "INDIRECT_LOAD", "properties": {} }
]
}
Note: The SVFG also has a native (non-PropertyGraph) export format with an SvfgExport schema that includes a diagnostics object with construction statistics (direct_edge_count, indirect_edge_count, mem_phi_count, skipped_call_clobbers, skipped_live_on_entry). The PropertyGraph format shown above is what saf export svfg produces.
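Indirect value flow in this export always passes through a MemPhi node: an INDIRECT_STORE edge carries a stored value into the phi, and an INDIRECT_LOAD edge carries it out to a load result. Chaining the two recovers (stored value, loaded value) pairs, as in the example above. A sketch of that traversal, assuming only the PropertyGraph shape documented here (not part of the SAF SDK):

```python
from collections import defaultdict

def indirect_value_flows(graph):
    """Find (store value, load result) pairs connected through a MemPhi.

    Assumes the SVFG PropertyGraph shape: INDIRECT_STORE edges enter a
    MemPhi node, INDIRECT_LOAD edges leave it. Illustrative sketch only;
    nested PHI_FLOW chains would need an extra transitive step.
    """
    into_phi = defaultdict(list)    # MemPhi id -> values stored into it
    out_of_phi = defaultdict(list)  # MemPhi id -> load results fed by it
    for e in graph["edges"]:
        if e["edge_type"] == "INDIRECT_STORE":
            into_phi[e["dst"]].append(e["src"])
        elif e["edge_type"] == "INDIRECT_LOAD":
            out_of_phi[e["src"]].append(e["dst"])
    pairs = []
    for phi, sources in into_phi.items():
        for src in sources:
            for dst in out_of_phi.get(phi, []):
                pairs.append((src, dst))
    return pairs
```

Run against the three-node example above, this yields a single pair connecting the stored value to the load result.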
PTA (pta) — PropertyGraph Format
The points-to analysis can be exported as a PropertyGraph. Pointer values and abstract locations become nodes, connected by POINTS_TO edges.
| Element | Details |
|---|---|
| Node labels | ["Pointer"] or ["Location"] |
| Node properties | Location nodes carry obj (object ID hex) and optionally path (field path) |
| Edge type | "POINTS_TO" |
Example:
{
"schema_version": "0.1.0",
"graph_type": "pta",
"metadata": {},
"nodes": [
{ "id": "0x00000000000000000000000000000001", "labels": ["Pointer"], "properties": {} },
{ "id": "0x00000000000000000000000000000064", "labels": ["Location"], "properties": { "obj": "0x000000000000000000000000000000c8", "path": [".0"] } }
],
"edges": [
{ "src": "0x00000000000000000000000000000001", "dst": "0x00000000000000000000000000000064", "edge_type": "POINTS_TO", "properties": {} }
]
}
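A common consumer task is turning this edge list back into per-pointer points-to sets, from which may-alias queries follow directly (two pointers may alias if their sets intersect). A sketch over the documented shape, not part of the SAF SDK:

```python
def points_to_map(pta_graph):
    """Build {pointer id: set of location ids} from a PTA PropertyGraph.

    Illustrative sketch assuming the node labels and POINTS_TO edge type
    documented above; not part of the SAF SDK.
    """
    pointers = {n["id"] for n in pta_graph["nodes"] if "Pointer" in n["labels"]}
    pts = {p: set() for p in pointers}
    for e in pta_graph["edges"]:
        if e["edge_type"] == "POINTS_TO" and e["src"] in pts:
            pts[e["src"]].add(e["dst"])
    return pts

def may_alias(pts, a, b):
    """Two pointers may alias if their points-to sets intersect."""
    return bool(pts.get(a, set()) & pts.get(b, set()))
```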
PTA Native Export
The PTA also has a richer native export format (not PropertyGraph) that includes
analysis configuration, all abstract locations, and diagnostics. This is the
format returned by PtaResult::export() in the Rust API:
{
"schema_version": "0.1.0",
"config": {
"enabled": true,
"field_sensitivity": "struct_fields(max_depth=2)",
"max_objects": 100000,
"max_iterations": 100
},
"locations": [
{
"id": "0x...",
"obj": "0x...",
"path": [".0", "[2]"]
}
],
"points_to": [
{
"value": "0x...",
"locations": ["0x...", "0x..."]
}
],
"diagnostics": {
"iterations": 12,
"iteration_limit_hit": false,
"collapse_warning_count": 0,
"constraint_count": 150,
"location_count": 45
}
}
| Field | Type | Description |
|---|---|---|
config | object | PTA configuration used for the analysis run |
locations | array | All abstract locations with object ID and field path |
points_to | array | Points-to sets: each entry maps a value to its locations |
diagnostics | object | Solver statistics (iterations, limits, constraint/location counts) |
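The diagnostics object is the place to check whether the solver actually converged: iteration_limit_hit means results may be incomplete, and a nonzero collapse_warning_count signals lost field precision. A small health-check sketch over the field names documented above (not part of the SAF SDK):

```python
def pta_health(export):
    """Summarize solver health from a PTA native export's diagnostics.

    Illustrative sketch; field names follow the native export schema
    documented above. Returns a list of human-readable warnings.
    """
    d = export["diagnostics"]
    warnings = []
    if d.get("iteration_limit_hit"):
        warnings.append(
            f"solver hit the iteration limit after {d['iterations']} "
            "iterations; points-to results may be incomplete"
        )
    if d.get("collapse_warning_count", 0) > 0:
        warnings.append(
            f"{d['collapse_warning_count']} field-collapse warning(s); "
            "field sensitivity was degraded for some objects"
        )
    return warnings
```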
Findings Export
The findings export (saf export findings) produces a JSON array of checker
findings. This is not a PropertyGraph -- it is a flat list of diagnostic
results from all enabled checkers.
Each finding has the following structure:
[
{
"check": "null_deref",
"severity": "error",
"cwe": 476,
"message": "Pointer may be null when dereferenced",
"path": [
{ "location": "main:5", "event": "NULL assigned" },
{ "location": "main:10", "event": "pointer dereferenced" }
],
"object": "p"
}
]
| Field | Type | Description |
|---|---|---|
check | string | Name of the checker that produced this finding |
severity | string | One of info, warning, error, critical |
cwe | number? | CWE ID if applicable (omitted when not set) |
message | string | Human-readable description of the issue |
path | array | Trace events from source to sink (omitted when empty) |
object | string? | Affected object name if applicable (omitted when not set) |
Each entry in path is a PathEvent:
| Field | Type | Description |
|---|---|---|
location | string | Source location description |
event | string | What happened at this point |
state | string? | Typestate label (omitted when not applicable) |
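Because several fields are optional (cwe, path, object, state), a renderer has to handle their absence. The following sketch formats one finding as a compact text report, assuming only the fields documented above (not part of the SAF SDK):

```python
def format_finding(finding):
    """Render one finding as a compact text report.

    Illustrative sketch over the findings schema documented above.
    Optional fields (cwe, path, state) are handled when present.
    """
    lines = [f"[{finding['severity']}] {finding['check']}: {finding['message']}"]
    if "cwe" in finding:
        lines[0] += f" (CWE-{finding['cwe']})"
    for i, ev in enumerate(finding.get("path", []), 1):
        step = f"  {i}. {ev['location']}: {ev['event']}"
        if "state" in ev:
            step += f" [{ev['state']}]"
        lines.append(step)
    return "\n".join(lines)
```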
Determinism
All PropertyGraph exports are deterministic:
- Nodes are sorted by (node_kind, referenced_id_hex)
- Edges are sorted by (edge_kind, src_id_hex, dst_id_hex, label_hash_hex)
- No timestamps or wall-clock-dependent data
- Identical inputs produce byte-identical JSON output
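Byte-identical output makes change detection cheap: hashing the serialized graph is enough to tell whether anything changed between runs. A sketch (the sort_keys canonicalization is only a safety net for graphs that were re-parsed and re-serialized outside SAF):

```python
import hashlib
import json

def graph_digest(graph):
    """Stable digest of a parsed PropertyGraph export.

    Since SAF's own output is deterministic, hashing the bytes on disk is
    equivalent; this helper re-canonicalizes via json.dumps(sort_keys=True)
    so it also works on graphs round-tripped through other tools.
    """
    blob = json.dumps(graph, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()
```

Comparing digests of yesterday's and today's export of the same graph type gives a quick "did the program's structure change?" signal for CI.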
Working with PropertyGraph
Python
import json
from saf import Project
proj = Project.open("program.ll")
graphs = proj.graphs()
cg = graphs.export("callgraph")
# Access nodes and edges directly
for node in cg["nodes"]:
print(node["properties"]["name"])
for edge in cg["edges"]:
print(f"{edge['src']} -> {edge['dst']}")
# Save to file
with open("callgraph.json", "w") as f:
json.dump(cg, f, indent=2)
jq
# List function names
jq '.nodes[] | .properties.name' callgraph.json
# Count edges by type
jq '[.edges[] | .edge_type] | group_by(.) | map({type: .[0], count: length})' graph.json
# Find high fan-out nodes
jq '[.edges[] | .src] | group_by(.) | map({node: .[0], out: length}) | sort_by(-.out) | .[0:5]' callgraph.json
NetworkX
import json
import networkx as nx
with open("callgraph.json") as f:
data = json.load(f)
G = nx.DiGraph()
for node in data["nodes"]:
G.add_node(node["id"], **node.get("properties", {}))
for edge in data["edges"]:
G.add_edge(edge["src"], edge["dst"], edge_type=edge.get("edge_type", ""))
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Components: {nx.number_weakly_connected_components(G)}")
SAF Widgets
SAF provides embeddable widgets that bring interactive program analysis into any web page. Widgets are implemented as iframes pointing to the SAF playground in embed mode.
Basic Usage
Add a widget to any HTML page:
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=taint_flow&graph=cfg"
loading="lazy"
></iframe>
</div>
URL Parameters
| Parameter | Values | Description |
|---|---|---|
embed | true | Enables embed mode (hides toolbar, navigation) |
split | true | Shows both source editor and graph side by side |
example | taint_flow, pointer_alias, complex_cfg, indirect_call, etc. | Pre-loaded example slug |
graph | cfg, callgraph, defuse, valueflow, pta | Initial graph to display |
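When generating widget markup from a script or static-site build, it helps to assemble the src URL from these parameters rather than hand-editing query strings. A small helper sketch (the relative base path is whatever location the playground is hosted at from your page; the function itself is illustrative, not part of SAF):

```python
from urllib.parse import urlencode

def widget_src(example, graph, split=True, base="../../playground/"):
    """Build a playground embed URL from the parameters documented above.

    `base` is the playground's location relative to the embedding page;
    adjust for your site layout. Illustrative helper, not part of SAF.
    """
    params = {"embed": "true"}
    if split:
        params["split"] = "true"
    params["example"] = example
    params["graph"] = graph
    return f"{base}?{urlencode(params)}"
```

For example, widget_src("taint_flow", "cfg") reproduces the src attribute used in the basic-usage snippet below.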
Styling
Add the SAF widget styles to your CSS:
.saf-widget {
border: 1px solid #333;
border-radius: 8px;
overflow: hidden;
margin: 1em 0;
}
.saf-widget iframe {
width: 100%;
height: 400px;
border: none;
}
Adjust the height to fit your content. For split mode with source + graph,
500px or more is recommended.
mdBook Integration
In mdBook pages, use raw HTML:
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=taint_flow&graph=valueflow"
loading="lazy"
></iframe>
</div>
The custom.css file included in this book already provides the widget styles.
Examples
CFG Visualization
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=complex_cfg&graph=cfg"
loading="lazy"
></iframe>
</div>
Call Graph with Indirect Calls
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=indirect_call&graph=callgraph"
loading="lazy"
></iframe>
</div>
Points-To Analysis
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=pointer_alias&graph=pta"
loading="lazy"
></iframe>
</div>
Value Flow (Taint)
<div class="saf-widget">
<iframe
src="../../playground/?embed=true&split=true&example=taint_flow&graph=valueflow"
loading="lazy"
></iframe>
</div>
Sizing Recommendations
| Mode | Recommended Height | Description |
|---|---|---|
Graph only (embed=true) | 400px | Shows graph visualization only |
Split view (embed=true&split=true) | 500px+ | Shows source code + graph side by side |
| Full height | 600px+ | Best for complex graphs with many nodes |
How It Works
When embed=true is set:
- The playground hides the top navigation bar
- The specified example is loaded automatically
- The specified graph type is selected
- With split=true, both the source editor and graph panel are visible
The widget runs entirely client-side via WebAssembly. No server calls are made for the analysis itself (C compilation still uses Compiler Explorer if the example contains C code).
Responsive Design
Widgets adapt to their container width. For narrow containers, consider increasing the height:
@media (max-width: 768px) {
.saf-widget iframe {
height: 300px;
}
}
Security
Widgets use iframe sandboxing. The embedded playground cannot access the parent page's DOM or cookies. Communication between the parent page and widget is not currently supported.