Tracing — getting traces into the simulator¶
Companion to trace-format.md, which specs the on-disk binary formats (ChampSim v1 + CasimV2). This doc is about where the traces come from — given a workload you want to study, which path produces files the simulator can replay?
The sim accepts three trace input forms today:
| Form | Use when | Format | Driver flag |
|---|---|---|---|
| Per-core ChampSim binary | You have an existing ChampSim corpus (DPC-3 / IPC-1, single-thread workloads sharded across cores) | <dir>/p<i>.champsimtrace |
--trace-dir DIR |
| Manifest of mixed traces | You want heterogeneous workloads across cores (different traces per core) | line-per-file manifest | --trace-list FILE |
| Multi-thread program | You want one logical program with N threads sharing memory + sync | CasimV2 .casim per thread + manifest |
--program FILE |
Producing traces¶
Option A: synthetic via casim_synth (works today)¶
The fastest way to get a meaningful multi-thread trace into the
sim. tools/casim_synth/ is a small C++ library + four example
binaries (synth_lock_chain, synth_lock_chain_mem,
synth_dot_product, synth_prodcon). Each one constructs a
program description programmatically and writes t<N>.casim files
plus a *.manifest ready for sim --program.
Pros: instant feedback loop, deterministic, no external deps, covers the sync subsystem's interesting cases. Cons: instruction streams are synthesized (ALU filler between sync events), not real program behavior — useful for studying cache coherence and lock contention; less useful for studying frontend / branch prediction in real code.
See tools/casim_synth/programs/*.cpp for templates. Adding a
new synthetic workload is one C++ file plus a CMakeLists entry.
Option B: ChampSim corpus (works today)¶
If you have an existing ChampSim trace (e.g. one of the DPC-3
SPEC traces), drop it as <dir>/p0.champsimtrace and run with
--trace-dir <dir> --cores 1. For multi-core sweeps with
homogeneous workloads, copy the same trace to p0..pN-1. For
heterogeneous mixes, use --trace-list with a manifest.
This path doesn't produce inter-core coherence traffic (each core runs an independent address space). It's the right starting point for cache + branch-predictor sweeps; not the right one for contention studies.
Option C: DynamoRIO-based tracer (not implemented)¶
A DR client that captures per-thread CasimV2 traces from real pthread programs. This is the highest-fidelity option and the gap between "we have a sim" and "we have real-workload numbers from real programs."
Status: not implemented. Sketched here so a future implementation has a clear scope.
Required:
- DR 11+ on Linux x86_64 (DR on macOS arm64 is too patchy).
- A DR client (~800-1200 LOC C++) using drmgr + drwrap:
- per-thread output file open in dr_thread_init_event
- basic-block instrumentation that emits Instr records with
IP, branch info, register IDs, memory addresses (via
drutil_insert_get_mem_addr)
- drwrap hooks on pthread_mutex_lock, pthread_mutex_unlock,
pthread_barrier_init, pthread_barrier_wait,
pthread_create, pthread_join that emit SyncRecord /
LifecycleRecord entries
- per-mutex / per-barrier sequence-number tracking with
dr_mutex-protected atomic counters
- A Dockerfile that pins Ubuntu + DR + g++ + the target benchmark
source (SPLASH-2 is the canonical starting point).
- A runner script that mounts a host volume for trace output and
invokes drrun -c client.so -- /path/to/benchmark.
Validation path: run a small pthread program with two threads
contending on one mutex; confirm the produced .casim files
roundtrip cleanly through sim --program and that the
utilization.rpt shows a cascading sync-stall pattern matching
the lock-chain synth example.
Option D: Intel Pin (not implemented, possible alternative to DR)¶
Pin offers similar dynamic-instrumentation capabilities to DR.
Some ChampSim corpora were originally captured with Pin tools.
We don't have a Pin client today; structurally it would mirror
the DR sketch above with Pin's INS_InsertCall /
RTN_InsertCall APIs.
Option E: project2 → champsim converter (works today, niche)¶
tools/proj2_to_champsim/ converts the textual project2 trace
format into ChampSim binary. Useful only if you have project2-
style traces from a course context.
Why not just one path¶
Each option occupies a different point on the speed-vs-fidelity curve:
- casim_synth: full control, full speed, low fidelity. Best for unit / property testing of the sim itself, and for controlled experiments where you want to isolate one variable.
- ChampSim corpus: real workload behavior at the frontend level, no inter-core coherence interaction. Best for uniprocessor or homogeneous-multicore sweeps.
- DR client (future): real pthread programs, real coherence traffic, real cache pressure. Best for any result claim that appears in a paper — "speedup of X% on SPLASH-2 LU.b under protocol Y" — assuming the DR pipeline gets built.
Trace file layout summary¶
report dir layout (output) trace dir layout (input)
--------------------------- ------------------------
report/<run>/ ChampSim per-core:
report.rpt <dir>/
config.rpt p0.champsimtrace
stats.rpt p1.champsimtrace
coherence.rpt ...
utilization.rpt pN-1.champsimtrace
report.csv
Trace-list manifest:
<file>.txt
├─ /path/to/trace0
├─ /path/to/trace1
└─ ...
(one path per line; '#' comments)
CasimV2 program:
<dir>/
prog.manifest
├─ program: name
├─ threads: N
├─ t0: t0.casim
├─ t1: t1.casim
└─ ...
t0.casim
t1.casim
...
Known limitations¶
The non-trace-format limitations live in STATUS.md. Trace-related ones:
- No condition-variable capture path. Even when the DR client lands, the sim doesn't model cond-wait pair-matching, so the client can either skip cond-var emission or emit records that pass the gate trivially.
- No I-stream capture in casim_synth. The synthetic programs emit a sequence of "filler" ALU records with rotating dest registers between sync events. Realistic frontend behavior (branch prediction, I-cache pressure) is not modeled.
- Trace size. Real SPLASH-2 LU.b runs ~1B instructions per
thread. At 64 bytes per CasimV2 record that's ~64GB per
thread uncompressed. Future work: in-trace bracket
primitives (
trace_begin/trace_end) so the client only captures the region of interest, plus on-the-fly gzip.