Running the Multicore OoO Simulator
A step-by-step guide for running this simulator end to end: how to build it, how to feed it a workload, where every output file lands on disk, and what the numbers in those files mean.
The text in "What this means in architecture terms" boxes is background on what the simulator is actually modeling. If you're already comfortable with Tomasulo, MESI, ROBs, etc., skim those and focus on commands and file paths.
Table of contents
- The 30-second version
- What this simulator models
- Prerequisites
- Building the simulator
- Inputs: configs and traces
- Running a single simulation
- Where output files go
- Reading report.rpt like an architect
- Sweeps: running many configs at once
- Cleanup
- Troubleshooting
- File-path cheat sheet
1. The 30-second version
make build # compile (into build-release/)
make run TRACE=traces/core_4 # run one simulation
ls report/core_4_mesi_c4/ # report.rpt is the human report
For a full validation sweep:
make smoke # tiny end-to-end: synth traces + sweep + summary
If make smoke finishes without errors and report/_sweep/smoke/summary.md
says Errors: 0, the simulator is working on your machine.
2. What this simulator models
This is a chip-multiprocessor (CMP) simulator: several out-of-order cores sharing memory through a directory-based cache coherence protocol over a ring interconnect.
+-----------------------------+
|      Coherent Network       |  (ring)
| (directory + interconnect)  |
+--+-------+-------+-------+--+
   |       |       |       |
 Core 0  Core 1  Core 2  Core N-1
 +----+  +----+  +----+  +----+
 |OoO |  |OoO |  |OoO |  |OoO |
 | L1 |  | L1 |  | L1 |  | L1 |
 | L2 |  | L2 |  | L2 |  | L2 |
 +----+  +----+  +----+  +----+
               |
             DRAM
What this means in architecture terms. Each core is a Tomasulo-style superscalar pipeline. The simulator models five stages: fetch → dispatch (decode+rename) → schedule (issue) → execute → retire (state update). Results broadcast on the CDB at the end of execute — there is no dedicated writeback stage. Stores don't write back to a register file at all; they sit in the LSU queue and drain to memory. A real branch predictor (Yeh-Patt or perceptron) drives speculation; a reorder buffer (ROB) holds in-flight instructions so they can commit in program order even though they execute out of order. Each core has private L1 and L2 caches. When several cores read or write the same memory block, a coherence protocol (MI / MSI / MESI / MOSI / MOESIF) keeps their views consistent. Misses that escape L2 hit a simple DRAM model with a fixed latency.
You drive the simulator with a trace: a recorded stream of instructions per core (one file per core). The simulator replays each core's trace through its pipeline, accounting for every cycle a structural hazard, cache miss, branch mispredict, or coherence stall costs you.
3. Prerequisites
| Thing | Why |
|---|---|
| C++20 compiler | Builds the simulator |
| CMake ≥ 3.21 | Build system |
| Python 3 | Sweep harness, synthetic trace generator |
| bash + curl (macOS/Linux) | scripts/fetch_traces.sh for ChampSim traces |
The first CMake configure downloads three small dependencies (nlohmann/json,
CLI11, Catch2) into build-release/_deps/.
4. Building the simulator
The repo's Makefile wraps CMake so you don't need to remember the flags.
make build # default: Debug-ish, skip if binaries already exist
make build FAST=1 # release-style: -O2 (much faster simulations)
Under the hood make build runs:
cmake -S . -B build-release [-DCASIM_FAST=ON]
cmake --build build-release -j
The two binaries you'll actually use end up at:
| Binary | Purpose |
|---|---|
| build-release/src/sim | The simulator itself |
| build-release/tools/gen_trace/gen_trace | Synthetic-trace generator (used by make traces) |
Tip. Use FAST=1 for any real sweep. Debug builds are 5–10× slower and long-tier sweeps will take all night.
For finer-grained CMake options (AddressSanitizer, profiling, treat-warnings-as-errors) see the README's build options table.
5. Inputs: configs and traces
A simulator run needs two things:
- A machine config — a JSON file describing the chip (cores, cache geometries, FU counts, predictor type, coherence protocol, ...).
- Per-core traces — one trace file per core, in ChampSim binary trace format.
5.1 The machine config
The default lives at configs/baseline.json. Highlights:
{
"cores": 4, // number of cores
"core": {
"fetch_width": 4, // instructions fetched per cycle
"rob_entries": 96, // reorder buffer size
"schedq_entries_per_fu": 2, // scheduler entries per FU
"alu_fus": 3, "mul_fus": 2, "lsu_fus": 2,
"predictor": { "type": "yeh_patt", "history_bits": 10, "pattern_bits": 5 }
},
"l1": { "size_kb": 32, "assoc": 8, "replacement": "lru", "hit_latency": 2 },
"l2": { "size_kb": 256, "assoc": 8, "replacement": "lip", "hit_latency": 10 },
"memory": { "latency": 100 },
"interconnect":{ "topology": "ring", "link_latency": 1 },
"coherence": { "protocol": "mesi" }
}
What this means in architecture terms. Every line is a knob the hardware designer would turn. rob_entries caps how far ahead of the commit point you can look: a bigger ROB means more in-flight instructions and more memory-level parallelism, but also more area and a longer rename pipeline. assoc is set associativity (the conflict-miss tradeoff). latency for memory is the DRAM round-trip in cycles, the figure you've all seen on slides as "~100 cycles." protocol picks which states the coherence FSM has (MI, MSI, MESI, MOSI, MOESIF; each richer protocol adds states that reduce write-upgrade traffic in different sharing patterns).
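To make the cache knobs concrete, here is a minimal sketch of how size_kb and assoc translate into a set count. The 64-byte line size is an assumption for illustration only; the config excerpt above does not show a line-size field, so check the real config before trusting these numbers.

```python
# Sketch: derive cache set counts from the baseline.json highlights.
# LINE_BYTES = 64 is an ASSUMPTION; the excerpt does not show a line size.
LINE_BYTES = 64

def num_sets(size_kb: int, assoc: int, line_bytes: int = LINE_BYTES) -> int:
    """Sets = capacity / (ways * line size)."""
    total_lines = (size_kb * 1024) // line_bytes
    return total_lines // assoc

l1_sets = num_sets(size_kb=32, assoc=8)    # L1 values from the excerpt
l2_sets = num_sets(size_kb=256, assoc=8)   # L2 values from the excerpt
print(f"L1 sets: {l1_sets}, L2 sets: {l2_sets}")
```

Doubling assoc at a fixed size_kb halves the set count, which is the conflict-miss tradeoff in one line of arithmetic.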
You can override any single field on the command line without editing JSON:
./build-release/src/sim \
--config configs/baseline.json \
--cores 8 \
--protocol moesif
5.2 The traces
Three sources of trace data live under traces/:
| Path | What it is |
|---|---|
| traces/core_4/ | A 4-core project3 coherence-regression fixture (committed in repo). |
| traces/synth/ | Synthetic patterns (random / sequential / stream / loop) generated by make gen-synth. |
| traces/champsim/ | Real SPEC-style traces, fetched on demand by make fetch-traces. |
Read TRACES.md before reasoning about any number in a report. It documents each trace's reuse %, working-set size, expected miss rate, and known measurement anomalies. A 100% L1 miss rate on traces/core_4/ is a property of the workload, not a simulator bug; knowing that up front saves an afternoon of debugging.
Generate the synth + champsim trace bundle for a given size tier with:
make traces TIER=smoke # ~1 s, tiny
make traces TIER=short # ~30 s, small
make traces TIER=medium # ~5 min
make traces TIER=long # ~20 min, 100 M-instruction synth
What this means in architecture terms. A trace is the input workload. Different patterns stress different parts of the memory hierarchy: sequential is friendly to spatial-locality predictors; stream blows through cache like an iota loop; random defeats prefetching; loop exercises the branch predictor and stresses temporal locality. Mixing trace flavors across cores (see --trace-list below) lets you study how heterogeneous workloads interact through coherence.
6. Running a single simulation
The simulator binary takes a config and a trace source, and prints a summary to stdout while writing detailed report files to disk. The Makefile wraps this into a one-liner so you don't have to type the full CLI.
6.1 The easy way: make run
make run TRACE=traces/core_4
That's it. Defaults to configs/baseline.json, picks
up cores from the trace directory automatically, and writes reports to
report/core_4_mesi_c4/.
Common variations:
# Tag the run so re-runs don't overwrite each other
make run TRACE=traces/core_4 TAG=baseline-v1
# Try a different coherence protocol on the same trace
make run TRACE=traces/core_4 TAG=mosi PROTOCOL=mosi
# Use a custom config
make run TRACE=traces/core_4 CONFIG=configs/my_tweak.json
# Lowercase variable names work too
make run trace=traces/synth/random_small tag=v1
What make run actually does:
- Builds build-release/src/sim if it's missing.
- Inspects TRACE:
  - If it's a directory, passes --trace-dir and auto-counts p*.champsimtrace files inside to set --cores.
  - If it's a file, passes --trace-list (treats it as a manifest).
- Forwards TAG / PROTOCOL / CORES / CONFIG to the simulator if set.
- Echoes the full command before executing so you can copy it for debugging.
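The TRACE dispatch described above can be paraphrased in a few lines of Python. This is a sketch of the described behavior, not the Makefile's actual recipe; the function name trace_args is made up for illustration, and the demo runs against a throwaway temp directory rather than the real traces/ tree.

```python
import tempfile
from pathlib import Path

def trace_args(trace: Path) -> list:
    """Hypothetical helper: directory -> --trace-dir plus a core count,
    file -> --trace-list (treated as a manifest)."""
    if trace.is_dir():
        cores = len(list(trace.glob("p*.champsimtrace")))
        return ["--trace-dir", str(trace), "--cores", str(cores)]
    return ["--trace-list", str(trace)]

# Demo on a scratch directory that mimics traces/core_4/.
with tempfile.TemporaryDirectory() as tmp:
    trace_dir = Path(tmp) / "core_4"
    trace_dir.mkdir()
    for i in range(4):
        (trace_dir / f"p{i}.champsimtrace").touch()
    manifest = Path(tmp) / "mix_4core.txt"
    manifest.write_text("")
    dir_args = trace_args(trace_dir)
    file_args = trace_args(manifest)

print(dir_args[0], dir_args[-2:], file_args[0])
```

If the auto-counted core count ever looks wrong, the echoed command line is the place to check what make run actually passed.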
6.2 The raw command
If you need to pass flags make run doesn't expose, run the binary directly:
./build-release/src/sim \
--config configs/baseline.json \
--trace-dir traces/core_4
This:
- Parses configs/baseline.json into an in-memory SimConfig.
- Opens traces/core_4/p0.champsimtrace, p1.champsimtrace, ... one per core.
- Builds 4 OoO cores, each with private L1+L2, all hanging off a ring directory running MESI.
- Ticks the global clock until every core hits trace EOF and the network drains.
- Prints a short overview to stdout and writes the full reports under report/<run-name>/ (path rules in §7).
6.3 The CLI flags you'll actually use
| Flag | Purpose |
|---|---|
| --config FILE | Required. Machine config JSON. |
| --trace-dir D | Per-core directory; expects D/p0.champsimtrace, D/p1.champsimtrace, … |
| --trace-list F | Manifest file: one trace path per line. Lets you mix workloads across cores. |
| --cores N | Override cores from the config. |
| --protocol P | One of mi, msi, mesi, mosi, moesif. Overrides cfg.coherence.protocol. |
| --tag NAME | Suffix appended to the report directory name (see §7). |
| --mode M | Run a subsystem in isolation: cache, predictor, ooo, coherence. Default = full multicore. |
| --log-level L | One of trace, debug, info, warn, error, off. Default info. |
| --out FILE | Dump the merged config to JSON (sanity check) and exit. |
6.4 Heterogeneous traces (mixing workloads)
A common experiment: run different workloads on different cores to see how they interact through the coherence directory. Make a manifest:
# traces/mix_4core.txt — one path per line, blank/'#' ignored.
# Relative paths are resolved against THIS FILE'S directory (traces/),
# not your shell's cwd. Absolute paths are also fine.
synth/random_tiny/p0.champsimtrace
synth/random_tiny/p1.champsimtrace
synth/stream_tiny/p0.champsimtrace
synth/stream_tiny/p1.champsimtrace
Then, from the repo root:
./build-release/src/sim --config configs/baseline.json \
--trace-list traces/mix_4core.txt
or via the wrapper:
make run TRACE=traces/mix_4core.txt TAG=mix
The number of manifest entries must equal cores. The resolution rule
(manifest_dir / entry) is at
src/full/full_mode.cpp:147. Keep it in mind
if you put the manifest somewhere other than traces/, or if you write
entries relative to the repo root and they fail to open.
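To see why repo-root-relative entries fail, here is a small sketch of the manifest_dir / entry rule as described above. It is a Python paraphrase under stated assumptions, not the code at full_mode.cpp:147; resolve_manifest and the /abs/elsewhere path are hypothetical.

```python
from pathlib import PurePosixPath

def resolve_manifest(manifest_path: str, text: str) -> list:
    """Hypothetical helper: resolve each entry against the manifest's own directory."""
    base = PurePosixPath(manifest_path).parent
    resolved = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank lines and '#' comments are ignored
        path = PurePosixPath(entry)
        resolved.append(path if path.is_absolute() else base / path)
    return resolved

manifest_text = """\
# comment line
synth/random_tiny/p0.champsimtrace

/abs/elsewhere/p1.champsimtrace
traces/synth/stream_tiny/p0.champsimtrace
"""
resolved = resolve_manifest("traces/mix_4core.txt", manifest_text)
for path in resolved:
    print(path)
```

The last entry shows the classic mistake: written relative to the repo root, it resolves to traces/traces/synth/..., which does not exist.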
6.5 Subsystem modes
When you only want to study one piece in isolation:
| --mode | What it runs |
|---|---|
| cache | Just the L1/L2 cache hierarchy on a single trace. |
| predictor | Just the branch predictor. |
| ooo | Single-core OoO pipeline (no coherence, no other cores). |
| coherence | Multicore + coherence + caches, but with a trivial core (no OoO). |
| (omitted) | Full multicore OoO + coherence. This is what you want by default. |
What this means in architecture terms. This is the same factoring you see in textbook chapters: cache hierarchy first, predictor second, OoO pipeline third, coherence fourth. The full mode glues them all together into the chip.
7. Where output files go
This is the part that's easy to get wrong. There are three different destinations depending on what you're running.
7.1 Single-run output (what you get from ./build-release/src/sim ...)
Every full-mode run writes a folder under report/ named:
report/<trace-stem>_<protocol>_c<cores>[_<tag>]/
where:
- <trace-stem> = the last component of --trace-dir (or the filename without extension of --trace-list)
- <protocol> = mi / msi / mesi / mosi / moesif
- <cores> = the value of cores after CLI overrides
- <tag> = the value of --tag (omitted if absent)
Concrete example. This command:
./build-release/src/sim --config configs/baseline.json \
--trace-dir traces/core_4 \
--tag baseline
writes to:
report/core_4_mesi_c4_baseline/
├── report.rpt ← human-readable run report (the one you read first)
├── config.rpt ← just the configuration section
├── stats.rpt ← detailed per-core/cache/predictor stats
├── coherence.rpt ← coherence FSM transition counts and traffic
└── report.csv ← machine-readable per-core row data
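The naming rule can be sanity-checked with a short sketch. This is a paraphrase of the rule as documented here, not a port of build_run_dir_pre; the function name run_dir and its is_manifest parameter are hypothetical.

```python
from pathlib import PurePosixPath

def run_dir(trace_source: str, protocol: str, cores: int,
            tag: str = "", is_manifest: bool = False) -> str:
    """Hypothetical helper: report/<trace-stem>_<protocol>_c<cores>[_<tag>]/"""
    path = PurePosixPath(trace_source)
    # Directory: last path component. Manifest: filename without extension.
    stem = path.stem if is_manifest else path.name
    name = f"{stem}_{protocol}_c{cores}"
    if tag:
        name += f"_{tag}"
    return f"report/{name}/"

tagged = run_dir("traces/core_4", "mesi", 4, tag="baseline")
untagged = run_dir("traces/core_4", "mesi", 4)
mixed = run_dir("traces/mix_4core.txt", "mesi", 4, tag="mix", is_manifest=True)
print(tagged, untagged, mixed, sep="\n")
```

The first result matches the tree shown above; the second is why an untagged re-run on the same trace, protocol, and core count lands in the same folder.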
If you set the LOG=1 environment variable, you also get:
report/<...>/log.rpt ← per-instruction commit trace (first 50 dyn instructions per core)
So:
LOG=1 ./build-release/src/sim --config configs/baseline.json --trace-dir traces/core_4
writes a log.rpt alongside the other reports. The file format (LSU
issue events, RETIRE events, branch metadata) is documented in
docs/log-format.md, and a header block at the top
of log.rpt itself summarizes the same. Note LOG=1 is an env var, not
a CLI flag; --log-level only affects what prints to stderr and is
unrelated to log.rpt.
7.2 The source of those file names
The output directory is computed in
src/full/full_mode.cpp:705 (build_run_dir_pre),
and the per-file writes happen at
src/full/full_mode.cpp:870-883. Read those
~20 lines if a path looks wrong; they're authoritative.
7.3 Sweep output (what you get from make smoke / short / medium / long)
A sweep runs many configs against many traces in parallel. It produces two kinds of artifacts:
(a) One per-run folder per (config × trace) combination
These live at the same report/<trace-stem>_<protocol>_c<cores>_<tag>/ paths
as in §7.1. Each contains its own report.rpt, config.rpt, stats.rpt,
coherence.rpt, report.csv. The sweep tags them with the variant name
(e.g. baseline, cores_2, cap500).
(b) One sweep-wide aggregation folder
report/_sweep/<SWEEP_ID>/
├── summary.md ← human-readable: violation list + caveats. Read first.
├── summary.csv ← machine-readable: one row per run, all metrics.
├── progress.tsv ← live progress log (status of every run as it ran)
├── configs/ ← every per-run config JSON the sweep generated
│ ├── baseline__synth_random_small.json
│ └── ...
└── logs/ ← stdout/stderr of every run + meta JSON
├── baseline__synth_random_small.out
├── baseline__synth_random_small.err
└── baseline__synth_random_small.meta.json
<SWEEP_ID> defaults to the tier name (smoke, short, …) but you can
override it: make short SWEEP_ID=v3 writes to report/_sweep/v3/.
What this means in architecture terms. A sweep is a design space exploration: vary one hardware parameter at a time, hold workload constant, see which knob actually changes IPC. summary.md is where you'll first notice that your fancier coherence protocol shaved 3% off cycles, or that nothing changed because all the traces hit in L1 anyway.
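A sweep's run list is essentially a cross product of variants and workloads, named with the <variant>__<workload> convention visible in the configs/ and logs/ listings. The sketch below only illustrates that naming; it is not the sweep harness, and the second workload name is a made-up placeholder.

```python
from itertools import product

SWEEP_ID = "smoke"
variants = ["baseline", "cores_2", "cap500"]   # variant names mentioned in (a)
workloads = [
    "synth_random_small",                      # appears in the listing above
    "synth_stream_small",                      # HYPOTHETICAL second workload
]

# One run per (variant, workload) pair, joined with a double underscore.
run_names = [f"{v}__{w}" for v, w in product(variants, workloads)]
config_paths = [f"report/_sweep/{SWEEP_ID}/configs/{name}.json" for name in run_names]

for path in config_paths:
    print(path)
```

Three variants against two workloads already gives six runs, which is why the larger tiers get slow quickly.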
8. Reading report.rpt like an architect
Here's an actual run report header (from report/core_4_mesi_c4_baseline/report.rpt):
Multicore OoO Simulator -- Run Report
================================================================================
Trace : traces/core_4
Cores : 4
Protocol : MESI_PRO
Status : Simulation complete
Total cycles : 122273
Then a Configuration section (echo of the merged config) and a Per-core results section, one block per core:
[ Core 0 ]
Pipeline
cycles : 122274
instructions retired : 1000
instructions fetched : 1000
IPC : 0.008
What to look at and what it tells you:
| Metric | Architecture meaning |
|---|---|
| IPC | Instructions Per Cycle. Single most important number. The achievable peak is min(fetch_width, dispatch_width, total FU count, retire_width); in this sim fetch_width caps both fetch and dispatch, and retire is unbounded, so the practical ceiling is min(fetch_width, alu_fus + mul_fus + lsu_fus). |
| CPI | 1 / IPC. Easier to reason about as a sum of stalls. |
| MPKI | Branch Mispredictions Per Kilo-Instructions. Predictor quality. |
| L1 / L2 miss rate | Fraction of accesses that escaped each level. Drives AAT. |
| L1 / L2 AAT | Average Access Time = hit_latency + miss_rate × miss_penalty. |
| coherence transitions | Per-state FSM event counts. Tells you whether MESI's E-state actually saved upgrades vs MSI. |
| Network traffic | Per-link byte counts on the ring. Stresses the interconnect-bandwidth budget. |
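As a worked example of the AAT and MPKI rows, the sketch below plugs the baseline latencies into the formula. The miss rates and the misprediction count are invented for illustration, and nesting the L2 AAT as the L1 miss penalty is the textbook convention; whether the simulator computes it exactly this way is an assumption.

```python
def aat(hit_latency: float, miss_rate: float, miss_penalty: float) -> float:
    """Average Access Time = hit_latency + miss_rate * miss_penalty."""
    return hit_latency + miss_rate * miss_penalty

def mpki(mispredictions: int, instructions: int) -> float:
    """Branch mispredictions per thousand retired instructions."""
    return 1000.0 * mispredictions / instructions

L1_HIT, L2_HIT, MEM_LATENCY = 2, 10, 100    # cycles, from the baseline config
l1_miss_rate, l2_miss_rate = 0.10, 0.50     # HYPOTHETICAL miss rates

l2_aat = aat(L2_HIT, l2_miss_rate, MEM_LATENCY)   # an L2 miss pays the DRAM latency
l1_aat = aat(L1_HIT, l1_miss_rate, l2_aat)        # an L1 miss pays the L2 AAT
example_mpki = mpki(12, 1000)                     # HYPOTHETICAL: 12 mispredicts in 1000 instrs

print(f"L2 AAT = {l2_aat:.1f}, L1 AAT = {l1_aat:.1f}, MPKI = {example_mpki:.1f}")
```

Even with a 90% L1 hit rate, the L1 AAT here is four times the hit latency, because every miss drags in the L2 and DRAM costs.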
What this means in architecture terms. The achievable peak IPC is bounded by the narrowest pipeline stage and by FU availability, i.e. min(fetch_width, dispatch_width, sum of FU counts, retire_width), not just by fetch_width alone. When measured IPC sits well below that ceiling, something is stalling the pipeline. Walk down the report: high MPKI → predictor; high L1 miss rate but low L2 → working set blew L1; high L2 miss rate → memory-bound; low miss rates everywhere but still low IPC → check the FU mix (instruction-type imbalance against your alu_fus / mul_fus / lsu_fus) or ROB size (in-flight cap throttling MLP). Coherence transitions matter when several cores write the same line: you'll see lots of M→I transitions (write invalidations) and the c2c transfer counts go up.
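Here is that ceiling-versus-measured comparison as arithmetic, using the baseline config values and the Core 0 numbers from the sample report. It is a sketch for checking your own reports by hand, assuming the sample run used configs/baseline.json unmodified.

```python
# Core section of the baseline config excerpt.
core_cfg = {"fetch_width": 4, "alu_fus": 3, "mul_fus": 2, "lsu_fus": 2}

# Practical ceiling per the IPC row: fetch width versus total FU count.
total_fus = core_cfg["alu_fus"] + core_cfg["mul_fus"] + core_cfg["lsu_fus"]
peak_ipc = min(core_cfg["fetch_width"], total_fus)

# Core 0 numbers from the sample report block.
cycles, retired = 122274, 1000
ipc = retired / cycles
cpi = cycles / retired

print(f"peak IPC {peak_ipc}, measured IPC {ipc:.3f}, CPI {cpi:.1f}")
print(f"fraction of ceiling reached: {ipc / peak_ipc:.4f}")
```

The measured value rounds to the 0.008 shown in the report, a tiny fraction of the ceiling of 4, which is the cue to start walking down the miss-rate and MPKI lines.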
If you'd rather slurp the data into a spreadsheet, every number in
report.rpt is also a column in report.csv (and aggregated across all
runs in report/_sweep/<SWEEP_ID>/summary.csv).
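If a spreadsheet is overkill, the standard csv module is enough. The column names in this sketch (core, cycles, instructions_retired) are hypothetical stand-ins; print the real header of your report.csv first and adjust. The sample row reuses the Core 0 numbers from the report above.

```python
import csv
import io

# HYPOTHETICAL header: the real report.csv columns may be named differently.
sample = io.StringIO(
    "core,cycles,instructions_retired\n"
    "0,122274,1000\n"
)

rows = list(csv.DictReader(sample))
print("columns:", list(rows[0].keys()))

for row in rows:
    ipc = int(row["instructions_retired"]) / int(row["cycles"])
    print(f"core {row['core']}: IPC = {ipc:.3f}")
```

To run it on a real file, swap the StringIO for open(path, newline="") with the path of the report.csv you care about.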
9. Sweeps: running many configs at once
Sweeps are driven by configs/sweep.json, which defines
tiers (smoke, short, medium, long) and axes (which knobs to
vary). The Makefile chains them:
make smoke # ~1 min, tiny synth, proto axis only
make short # ~10 min, tiny+small + champsim, all axes
make medium # ~1 hour
make long # overnight, 100M synth
Each tier shortcut runs four phases: build → traces → sweep → aggregate.
You can also invoke them separately:
make build FAST=1
make traces TIER=short
make sweep TIER=short SWEEP_ID=experiment-1 JOBS=4 TIMEOUT=900
make aggregate SWEEP_ID=experiment-1
JOBS controls parallelism; TIMEOUT is the per-run wallclock budget
(seconds). To preview which runs would execute without actually running:
make dry-run TIER=short
The output of any sweep ends up under report/_sweep/<SWEEP_ID>/ (see §7.3).
What this means in architecture terms. Sweeps are how you turn a simulator into evidence for an argument. "MESI saves 8% over MSI on this workload set." That's a sweep, plus a summary.csv lookup, plus an honest paragraph about which workloads contributed.
10. Cleanup
The Makefile is scoped by SWEEP_ID to make it hard to nuke the wrong
thing. Four targets:
| Command | Removes |
|---|---|
| make clean SWEEP_ID=<id> | report/_sweep/<id>/ plus any per-run dir matching report/*_<id>__* |
| make clean-reports | Everything under report/: sweep aggregations, sweep per-run dirs, and manual single-run dirs |
| make clean-all | Alias for clean-reports |
| make clean-traces | traces/synth/ and traces/champsim/ (regen via make traces TIER=…; keeps traces/core_4/) |
make clean SWEEP_ID=v3 # remove one sweep's artifacts (scoped, safe)
make clean-reports # remove every report (manual + sweep)
make clean-all # same as clean-reports
make clean-traces # remove generated trace data (not reports)
make clean refuses to run without an explicit SWEEP_ID (or with
SWEEP_ID=all) — that's deliberate, to prevent thumb-fumble disasters
during long sweeps. Use clean-reports (or its alias clean-all) when
you really do want a clean slate under report/.
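To preview which per-run directories a scoped clean would match, you can apply the report/*_<id>__* pattern yourself. The sketch below is a Python illustration of that glob, not the Makefile recipe; is_sweep_run_dir is a hypothetical name, and the directory names are the ones used as examples in this guide.

```python
from fnmatch import fnmatchcase

def is_sweep_run_dir(name: str, sweep_id: str) -> bool:
    """Hypothetical helper: does a report/ entry match *_<id>__* ?"""
    return fnmatchcase(name, f"*_{sweep_id}__*")

report_entries = [
    "loop_small_mesi_c2_v3__cores_2",   # sweep run tagged by sweep v3
    "core_4_mesi_c4_baseline",          # manual single run
    "_sweep",                           # aggregation parent folder
]

matched = [d for d in report_entries if is_sweep_run_dir(d, "v3")]
print(matched)
```

Only the sweep run matches; the manual run survives. Per the table above, make clean SWEEP_ID=v3 also removes report/_sweep/v3/, which this pattern does not cover.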
10.1 What gets cleaned
clean-reports deletes every entry directly under report/ (it preserves
the report/ directory itself, so subsequent runs can still write into
it). Concretely:
| Folder | Removed by clean-reports / clean-all? |
|---|---|
| report/_sweep/<id>/ (sweep aggregation) | yes |
| report/loop_small_mesi_c2_v3__cores_2/ (sweep run) | yes |
| report/core_4_mesi_c4_baseline/ (manual run) | yes |
10.2 Recipes
# Remove just one sweep (other sweeps and manual runs untouched)
make clean SWEEP_ID=v3
# Full reset of all reports
make clean-reports
# Also reclaim the trace data (forces a re-fetch / re-gen next time)
make clean-traces
11. Troubleshooting
| Symptom | What's likely wrong |
|---|---|
| default mode requires --trace-dir DIR or --trace-list FILE | You passed --trace (single-trace flag); full mode is per-core. Use --trace-dir or --trace-list. |
| interconnect.topology=xbar is not supported | Only ring is implemented. Edit the config back to ring. |
| Status: Simulation terminated in report.rpt | Hit the global cycle cap (kGlobalCap in full_mode.cpp); usually a coherence deadlock. The sweep's summary.md will flag it as deadlock. |
| Sweep run shows exit -6 in summary.md | Hit the Python sweep TIMEOUT (wallclock, not cycles). Either raise TIMEOUT= or rebuild with FAST=1. |
| IPC looks too low (~0.008) on tiny traces | Synth traces have very low retire rates relative to memory latency. Generate larger traces or use champsim. |
| Per-run dir has no report.rpt | Earlier crash before reports were written. Check report/_sweep/<id>/logs/*.err for that run. |
| clang: command not found (macOS) | Install Xcode CLT: xcode-select --install. |
12. File-path cheat sheet
| What | Where |
|---|---|
| Default config | configs/baseline.json |
| Sweep matrix | configs/sweep.json |
| Sim binary | build-release/src/sim |
| Synth-trace generator | build-release/tools/gen_trace/gen_trace |
| Committed smoke trace | traces/core_4/ (p0.champsimtrace … p3.champsimtrace) |
| Generated synth traces | traces/synth/<pattern>_<size>/p<i>.champsimtrace |
| Fetched ChampSim traces | traces/champsim/ |
| Single-run reports (full mode) | report/<trace-stem>_<protocol>_c<cores>[_<tag>]/report.rpt (+ config.rpt, stats.rpt, coherence.rpt, report.csv, optional log.rpt) |
| Sweep aggregation | report/_sweep/<SWEEP_ID>/summary.md + summary.csv + progress.tsv |
| Per-run sweep configs | report/_sweep/<SWEEP_ID>/configs/<variant>__<workload>.json |
| Per-run sweep stdout/stderr | report/_sweep/<SWEEP_ID>/logs/<variant>__<workload>.{out,err,meta.json} |
That's the whole pipeline: config + traces in, report.rpt and friends out.