15 — Baseline Characterization (2026-05-08)
What the simulator says about its workloads under the default configs/baseline.json — 4 cores, MESI_PRO, ring interconnect, 32 KB L1 / 256 KB L2, Yeh-Patt predictor, no prefetcher. This is the headline number sheet a reader can scan to know "where the simulator is" today.
The companion docs that produced these numbers are 12-heterogeneous-traces.md (mix mechanism) and 13-log-mode-and-rpt-split.md (synth miss-rate reality check).
1. Configuration
Single config, unchanged from the repo default:
| Knob | Value |
|---|---|
| Cores | 4 |
| Coherence | MESI_PRO |
| Interconnect | ring (link latency 1, link width 2³ = 8 B/flit) |
| L1 | 32 KB / 8-way / 64 B / writeback / LRU / 8 MSHRs |
| L2 | 256 KB / 8-way / 64 B / writeback / LIP |
| DRAM | 100-cycle uniform-latency oracle |
| ROB | 64 entries |
| Schedule queue | 2 entries / FU |
| Functional units | 8 ALU / 4 MUL / 2 LSU |
| Branch predictor | Yeh-Patt (H=10, P=5) |
| Global cycle cap | 100 M (src/full/full_mode.cpp:62) |
Each synth run retires exactly 400 K instructions (4 cores × 100 K records). Real-trace cache-mode runs walk to EOF on the 4 GiB-truncated traces (~67 M records each) per scripts/fetch_traces.sh:48-65.
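A few quantities follow directly from the table and are useful when reading the later sections. The sketch below derives them in Python. The variable names are illustrative and are not fields of configs/baseline.json, and the schedule-queue total assumes that "2 entries / FU" means two slots per functional unit.

```python
# Derived quantities from the baseline table. Names are illustrative only;
# they are not taken from configs/baseline.json.
KIB = 1024

def num_sets(capacity_bytes: int, ways: int, block_bytes: int) -> int:
    """Sets in a set-associative cache = capacity / (ways * block size)."""
    return capacity_bytes // (ways * block_bytes)

l1_sets = num_sets(32 * KIB, ways=8, block_bytes=64)
l2_sets = num_sets(256 * KIB, ways=8, block_bytes=64)

functional_units = {"ALU": 8, "MUL": 4, "LSU": 2}
total_fus = sum(functional_units.values())
sched_entries = 2 * total_fus  # assumption: 2 schedule-queue slots per unit

flit_bytes = 2 ** 3                   # link width from the table
flits_per_block = 64 // flit_bytes    # flits needed to move one 64 B block

retired_per_run = 4 * 100_000         # 4 cores x 100 K records

print(l1_sets, l2_sets, total_fus, sched_entries, flits_per_block, retired_per_run)
```

Under these assumptions the L1 has 64 sets, the L2 has 512, and one cache block crosses a ring link as 8 flits.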
2. Synthetic patterns (full mode, MESI, 4 cores)
Each row is a homogeneous run: all four cores read the same trace flavor from traces/synth/<pattern>_tiny/ (per-core distinct streams over disjoint 1 TiB-spaced address ranges; see scripts/gen_synth.py:48).
| Trace flavor | cycles | Per-core IPC | L1 miss | L2 miss | branch MPKI | Mem reads | Mem writes | C2C | Inval |
|---|---|---|---|---|---|---|---|---|---|
| synth/loop_tiny | 108 K | 0.926 | 0.19 % | 100 % (cold) | 75.0 | 256 | 0 | 0 | 0 |
| synth/sequential_tiny | 13.87 M | 0.007 | 100 % | 92 % | 75.5 | 135 896 | 21 558 | 0 | 0 |
| synth/stream_tiny | 13.82 M | 0.007 | 100 % | 92 % | 75.0 | 135 498 | 21 480 | 0 | 0 |
| synth/random_tiny | 13.16 M | 0.008 | 99.80 % | 91 % | 75.0 | 128 936 | 21 543 | 0 | 0 |
Reading the table:
- `loop_tiny` is the diagnostic that the OoO core works. ~1 IPC per core, near-zero L1 miss, two orders of magnitude faster wall-clock. The 256 system-wide cache misses are exactly the cold-start compulsory fills (4 cores × 64 unique blocks each = 256). After cold, the 4 KB working set fits trivially in 32 KB L1.
- The other three are stress tests, not workloads. L1 miss ≈ 100 %, L2 miss ≈ 92 %, IPC two orders of magnitude lower than loop. This is by design: generator addresses march forward at stride 64 with no reuse, so the working set blows past every cache. See tools/gen_trace/gen_trace.cpp:64-71.
- No coherence work in any homogeneous run. Zero C2C transfers, zero invalidations across all four flavors. That's a property of the trace layout: the 1 TiB-per-core stride in `gen_synth.py` makes cores' address ranges fully disjoint, so even with the most permissive protocol there is nothing to share.
- Branch MPKI ~75 across all four flavors. That's a property of the generator, not the predictor: synthetic traces emit a branch every ~13 instructions over essentially-random PCs. Real workloads (§4) are very different.
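The per-core IPC column and the cold-miss count can be re-derived from the table. A minimal check, assuming each core retires the 100 K records stated in §1 and using the rounded cycle counts above (so the results are only as precise as the table):

```python
RECORDS_PER_CORE = 100_000  # from §1: 4 cores x 100 K records

# Cycle counts as printed in the table (rounded).
cycles = {
    "loop_tiny": 108_000,
    "sequential_tiny": 13_870_000,
    "stream_tiny": 13_820_000,
    "random_tiny": 13_160_000,
}

def per_core_ipc(total_cycles: int) -> float:
    """Instructions retired by one core divided by the run length in cycles."""
    return RECORDS_PER_CORE / total_cycles

ipc = {name: round(per_core_ipc(c), 3) for name, c in cycles.items()}

# loop_tiny cold-start accounting: 64 unique 64 B blocks per core.
blocks_per_core = 64
cold_fills = 4 * blocks_per_core
working_set_bytes = blocks_per_core * 64

# How much longer the stress patterns run than loop_tiny.
slowdown = cycles["sequential_tiny"] / cycles["loop_tiny"]

print(ipc)
print(cold_fills, working_set_bytes, round(slowdown))
```

The rounded IPC values reproduce the table column, and the sequential run takes roughly 128 times as many cycles as `loop_tiny`, which is the "two orders of magnitude" quoted above.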
3. Heterogeneous synth mix
traces/mix_4core.txt — random_tiny/{p0,p1}
on cores 0/1, stream_tiny/{p0,p1} on cores 2/3.
| Metric | Value |
|---|---|
| cycles | 13.28 M |
| Aggregate IPC | 0.030 |
| Per-core IPC (c0/c1/c2/c3) | 0.008 / 0.008 / 0.008 / 0.008 |
| L1 miss rate (c0/c1/c2/c3) | 0.998 / 0.998 / 1.000 / 1.000 |
| Mem reads | 130 052 |
| Mem writes | 21 882 |
| Cache-to-cache transfers | 1 942 |
| Coherence invalidations | 534 |
This is the first row in this report with non-zero coherence activity.
The reason is subtle: although cores 0/2 (random p0 and stream p0) are
on different generators, both files use addr_base = 0x10000000
because the 1 TiB per-core stride in gen_synth.py is applied
within a single pattern dir, not across dirs. So core 0 (random
p0) and core 2 (stream p0) walk overlapping address ranges with
different patterns — incidental sharing. Same story for cores 1 and 3.
The IPC stays flat at 0.008 because both component traces are high-miss-rate. The interesting signal here is the 1942 C2C transfers — proof the simulator's coherence path moves data between cores, not just to/from main memory, when the workload actually shares addresses.
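The incidental sharing can be reproduced on paper. The sketch below assumes the base-address rule quoted in §5 (`0x10000000 + i*2^40`, with `i` counted within one pattern directory); the function and variable names are hypothetical and are not taken from scripts/gen_synth.py.

```python
from collections import defaultdict

ADDR_BASE = 0x10000000
CORE_STRIDE = 1 << 40  # 1 TiB between streams of the same pattern dir

def stream_base(index_in_dir: int) -> int:
    """Base address of stream p<index> inside a single pattern directory."""
    return ADDR_BASE + index_in_dir * CORE_STRIDE

# Layout of traces/mix_4core.txt: (core, pattern dir, stream index in that dir)
mix = [
    (0, "random_tiny", 0),
    (1, "random_tiny", 1),
    (2, "stream_tiny", 0),
    (3, "stream_tiny", 1),
]

bases = {core: stream_base(idx) for core, _pattern, idx in mix}

# Group cores by base address; any group with more than one core overlaps.
by_base = defaultdict(list)
for core, base in bases.items():
    by_base[base].append(core)
overlapping = sorted(cores for cores in by_base.values() if len(cores) > 1)

print(overlapping)  # [[0, 2], [1, 3]]
```

Because the stride resets per directory, the stream index alone decides the base, so p0 of one pattern collides with p0 of the other.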
4. Real SPEC2017 traces — cache mode (works)
Single-core, single-trace runs through the cache hierarchy in
isolation (--mode cache). These bypass the OoO core and the
coherence directory, so they characterize the trace itself plus the
L1+L2+DRAM path.
| Trace | Records walked | L1 miss | L1 writebacks | DRAM accesses |
|---|---|---|---|---|
| champsim/perlbench | 30.32 M | 0.04 % | 1 762 | ~14 K |
| champsim/leela | 26.33 M | 0.58 % | 29 539 | ~182 K |
| champsim/xz | 20.71 M | 1.95 % | 289 184 | ~692 K |
| champsim/mcf | 3.20 M | 53.87 % | 11 258 | 1.73 M |
These match the bands published in the CRC-2 / DPC-3 / IPC-1 literature for these SimPoints. The spread from 0.04 % (perlbench's hot interpreter loop) to 53.87 % (mcf's pointer-chasing) covers about three orders of magnitude and includes intermediate values, which the synth patterns never produce (they sit at either ≈0 % or ≈100 %), and it tracks the published MPKI ranking. The trace ingestion path and cache machinery handle real traces correctly.
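The columns of the cache-mode table can be cross-checked against each other. In the sketch below, the relationship "DRAM accesses ≈ L1 misses + L1 writebacks" is an observation about these four rows, not a documented property of the simulator, and every input is a rounded table value.

```python
# (records walked, L1 miss rate, L1 writebacks, reported DRAM accesses)
rows = {
    "perlbench": (30.32e6, 0.0004, 1_762, 14e3),
    "leela":     (26.33e6, 0.0058, 29_539, 182e3),
    "xz":        (20.71e6, 0.0195, 289_184, 692e3),
    "mcf":       (3.20e6, 0.5387, 11_258, 1.73e6),
}

rel_err = {}
for name, (records, miss_rate, writebacks, dram) in rows.items():
    l1_misses = records * miss_rate
    estimate = l1_misses + writebacks  # observed pattern, not a documented identity
    rel_err[name] = abs(estimate - dram) / dram
    print(f"{name:10s} misses={l1_misses:12.0f} est={estimate:12.0f} "
          f"err={rel_err[name]:.2%}")

spread = 0.5387 / 0.0004  # highest over lowest L1 miss rate
print(f"miss-rate spread: {spread:.0f}x")
```

With the rounded inputs, every estimate lands within about 1 % of the reported DRAM column, and the miss-rate spread comes out near 1350×.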
5. Real SPEC2017 traces — full mode (BROKEN)
Same traces, full mode (OoO + coherence). Two distinct failure modes:
| Trace | Cores | Result |
|---|---|---|
| champsim/mcf | 1 | Deadlock at cycle 1 000 006 (rob=29, sq=28, dispq=15, retired=0, fetched=44, in_mispred=1) |
| champsim/mcf | 4 | Same deadlock, 4-core flavor |
| champsim/perlbench | 1 | Segfault (exit 139) |
| champsim/leela | 1 | Hang (no progress; killed at 30 s wall clock) |
| champsim/xz | 1 | Hang (same) |
The [ERROR] OoO core deadlock watchdog is the
stage_state_update check. The
pipeline state at deadlock tells the story:
- 44 fetched, 0 retired
- ROB has 29 entries waiting; the head is not ready
- Store queue has 28 entries (high store fraction)
- A mispredicted branch was fetched and is sitting in the ROB
- Fetch is stalled by `in_mispred=1` waiting for the mispredict to retire
- The mispredict can only retire after older instructions retire
- The oldest instruction is a load that's never received its fill
The first L1 miss never gets filled. Cache mode on the same addresses fills correctly, so the bug is in the OoO ↔ coherence integration, not the cache itself. The plan that produced this report (report_doc/14 leftovers, this session's plan file) explicitly gates further work on this finding.
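The counters in the deadlock dump can be sanity-checked against each other and against the §1 configuration. This is a sketch under two assumptions that the source does not confirm: that every fetched-but-unretired instruction sits in either the dispatch queue or the ROB, and that "2 entries / FU" gives the total schedule-queue capacity.

```python
# Counters copied from the mcf single-core deadlock row.
dump = {"rob": 29, "sq": 28, "dispq": 15,
        "retired": 0, "fetched": 44, "in_mispred": 1}

# Assumption: in-flight instructions live in the dispatch queue or the ROB.
in_flight = dump["fetched"] - dump["retired"]
accounted = dump["rob"] + dump["dispq"]

# Assumption: schedule-queue capacity = 2 entries per FU, 8 ALU + 4 MUL + 2 LSU.
sched_capacity = 2 * (8 + 4 + 2)

print(in_flight, accounted, sched_capacity, dump["sq"])
```

Both checks line up: 29 + 15 accounts for all 44 in-flight instructions, and `sq=28` equals the schedule-queue capacity implied by §1. Whether `sq` in the dump denotes the schedule queue or a store queue is worth confirming against the source, since a full schedule queue would by itself explain why dispatch stops.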
Three suspect places to investigate, in priority order, captured in memory/project_real_trace_deadlock.md:
- MSHR allocation on real-trace addresses. Synth addresses cluster narrowly around `0x10000000+i*2^40`; real traces scatter across 64-bit virtual addresses. If MSHR slot allocation hashes poorly, an in-flight load might never see its fill. (src/cache/cache.cpp MSHR machinery.)
- Directory state on first-touch lines. Synth pre-populates the directory hash table via the early predictable stream; real traces touch fresh addresses constantly. A bad sticky state on a fresh block could swallow the response. (src/coherence/directory.cpp.)
- `cache_fill('R')` on a Store miss. The doc'd known issue at src/coherence/coherence_adapter.cpp:115-121 and report_doc/13: adapter calls `cache_fill(... 'R')` even when the original miss was `Op::Write`. This may interact badly with realistic store-mix traces in a way that synth's all-load patterns hide.
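To make the third suspect concrete, here is a toy model of a fill that ignores the original miss type. It is not the simulator's code: `ToyCache`, `complete_miss`, and the rule that a `'W'` fill installs a dirty line are all hypothetical, chosen only to show the kind of state difference a hard-coded `'R'` could cause.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCache:
    """Hypothetical stand-in for the L1: tracks only a dirty bit per block."""
    dirty: dict = field(default_factory=dict)

    def cache_fill(self, block_addr: int, op: str) -> None:
        # Assumption: a 'W' fill installs the line dirty, an 'R' fill clean.
        self.dirty[block_addr] = (op == "W")

def complete_miss(cache: ToyCache, block_addr: int, original_op: str,
                  propagate_op: bool) -> bool:
    """Fill a missed block and return whether it ended up dirty."""
    # propagate_op=False mimics the reported behavior (always fill with 'R').
    fill_op = original_op if propagate_op else "R"
    cache.cache_fill(block_addr, fill_op)
    return cache.dirty[block_addr]

store_block = 0x10000000 >> 6  # block number of a 64 B line
hardcoded = complete_miss(ToyCache(), store_block, "W", propagate_op=False)
propagated = complete_miss(ToyCache(), store_block, "W", propagate_op=True)

print(hardcoded, propagated)  # False True
```

In this toy, a store miss completed with a hard-coded `'R'` leaves the line clean, while propagating the original op marks it dirty. Whether the real `cache_fill` distinguishes the two in this way has to be checked in the adapter itself.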
6. What the report says, in one paragraph
The simulator is correct on the four subsystems in isolation —
cache mode produces published-quality miss rates for both synthetic
and real workloads, predictor mode trains and reports MPKI
sensibly, and the OoO core retires near-1 IPC on a workload that
fits in cache (loop_tiny). The simulator is also correct on
multicore synthetic stress tests — 4-core synth runs all complete,
the heterogeneous mix shows expected coherence activity, and ctest
runs 128/128 green. What does not yet work is full mode (OoO +
coherence) on real ChampSim traces: the first L1 miss never gets
its fill response, fetch stalls behind an early mispredict, and the
deadlock watchdog fires. Until that is fixed, real benchmark numbers
remain blocked. The trace ingestion infrastructure shipped this
session is ready and waiting on that bug.
7. Reproducer commands
```bash
# Synth, all four flavors:
for p in loop sequential stream random; do
  make run TRACE=traces/synth/${p}_tiny TAG=rpt
done

# Heterogeneous synth mix:
make run TRACE=traces/mix_4core.txt TAG=rpt

# Cache-mode characterization of real traces (works):
for b in mcf perlbench leela xz; do
  build-release/src/sim --config configs/baseline.json \
    --trace traces/champsim/${b}/raw.champsimtrace --mode cache
done

# Full-mode attempt on real traces (will deadlock or segfault):
make run TRACE=traces/champsim/mcf TAG=rpt_real CORES=1
```
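Because the full-mode runs can hang, it helps to wrap them in a timeout and classify the outcome the way the §5 table does. The helper below is a sketch, not part of the repo: `classify` and `run_with_timeout` are hypothetical names, and the watchdog's own exit code is unknown here, so a deadlock is reported only as a generic `"error"`.

```python
import signal
import subprocess

def classify(returncode, timed_out=False):
    """Map a process result onto the outcome labels used in the §5 table."""
    if timed_out:
        return "hang"
    if returncode == 0:
        return "ok"
    # Python reports death by signal as -SIGNUM; a shell reports 128 + SIGNUM,
    # which is where "exit 139" for SIGSEGV (11) comes from.
    if returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        return "segfault"
    return "error"  # includes the watchdog exit, whose code is not known here

def run_with_timeout(cmd, seconds=30):
    """Run cmd, killing it after `seconds` of wall-clock time."""
    try:
        proc = subprocess.run(cmd, timeout=seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        return classify(None, timed_out=True)
    return classify(proc.returncode)

# Example, using the command from the reproducer list:
# run_with_timeout(["make", "run", "TRACE=traces/champsim/mcf",
#                   "TAG=rpt_real", "CORES=1"])
```

The 30-second default matches the wall-clock limit used for the leela and xz rows.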
Cross-references
- TRACES.md: what each trace file is and how it was built.
- RUNNING.md: invocation reference, `make run`, log levels.
- docs/log-format.md: `LOG=1` per-instruction trace format.
- report_doc/13-log-mode-and-rpt-split.md: earlier finding that motivated the synth-vs-real split.
- report_doc/14-writeback-and-private-synth.md: the gen_synth rewrite that produced today's `_tiny` directories.