LOG=1 trace mode and multi-file report split¶
Status: done.
Files touched: src/full/full_mode.cpp, src/ooo/core.cpp, include/comparch/ooo/core.hpp, src/ooo/CMakeLists.txt. New: include/comparch/ooo/trace_logger.hpp, src/ooo/trace_logger.cpp.
Tests: 128 / 128 passing on build-release and build-werror-rel.
1. Why this was needed¶
Two complaints, one investigation, then two changes:
- "Cache hit rate looks suspiciously low on some traces." I audited the cache + coherence + OoO integration and the answer turned out to be the workload, not a bug (see §2). But to make this easy for anyone to verify on their own runs, I added a per-instruction execution log so the user can read off exactly what the cache does on the early access stream.
- "Split
report.rptinto more focused files with deeper academic-style stats." The previous report.rpt was one ~80-line dump with header / config / per-core / system glued together. That made diffing two runs noisy and hid the metrics worth tracking (AMAT, MPKI, occupancy %, branch accuracy, …). The split below mirrors how gem5 / ChampSim / ZSim organize their per-run output.
2. Code review — is the simulator under-counting hits?¶
No. I traced the cache → coherence path end-to-end and the bookkeeping is correct.
| What I checked | Result |
|---|---|
Cache::access() increments accesses exactly once per call |
✓ |
Cache::issue() calls access() once on a fresh allocate; merge fast-path skips access() entirely so secondaries are not double-counted as misses |
✓ (cache.cpp:437-485) |
L1 miss → L2 access → coherence sink path doesn't lose the fill: adapter calls cache_fill(L2) and cache_fill(L1) on DATA arrival, and mark_block_ready flips every MSHR slot tied to that block |
✓ (coherence_adapter.cpp:108-130) |
Tag/index used to fill matches what access() looks up later (block-index form, idempotent through block_in()) |
✓ |
So a 100% miss rate on traces/core_4/p0 looked alarming, but it's the trace's own property. I dumped the trace and counted unique 64 B blocks against memory-op count:
| Trace | mem ops | unique 64 B blocks | reuse |
|---|---|---|---|
traces/core_4/p0.champsimtrace |
458 | 458 | 0.0 % |
synth/sequential_tiny |
34,039 | 34,039 | 0.0 % |
synth/stream_tiny |
34,042 | 34,042 | 0.0 % |
synth/random_tiny |
34,382 | 32,270 | 6.1 % |
synth/loop_tiny |
34,114 | 64 | 99.8 % |
loop_* is the only family with temporal locality by construction, and on it we observe L1 miss rate = 0.0019 — exactly what the validation report (10-validation-bugs.md, line ~101) predicts. So the headline number is reporting reality.
Two model-fidelity items I noticed (neither affects hit rate)¶
- OoO under-counts mem traffic vs. cache mode. src/ooo/inst.cpp:50-61 takes only the first non-zero entry of
source_memory[]/destination_memory[]. A ChampSim record can encode 4 sources / 2 destinations (e.g.rep movs-shaped instructions).--mode cacheiterates all of them; OoO collapses to one. Doesn't change miss rate, but means OoO accesses≤cache-mode accesses for the same trace. - Coherence-managed write-allocate fills clean. coherence_adapter.cpp:115-121 calls
cache_fill(... 'R')even when the original miss was aOp::Write. The comment promises a follow-upOp::Writeto set dirty, but no follow-up exists in the codebase — so stores via the coherence path may not mark the line dirty, which can under-count later writebacks. Acknowledged in the comment but unresolved.
Filed for later — not blocking.
3. LOG=1 execution-trace mode¶
3.1 What it does¶
When the simulator runs full mode and the environment variable LOG is set to anything truthy (1, on, true, yes, or any non-empty value other than 0/off/false/no), it writes a log.rpt file to the run directory containing a per-instruction execution trace for each core's first 50 dynamic instructions.
# Default — no log.rpt:
./build-release/src/sim --config configs/baseline.json --trace-dir traces/core_4
# Logged:
LOG=1 ./build-release/src/sim --config configs/baseline.json --trace-dir traces/core_4
# -> report/.../log.rpt
3.2 What's in the file¶
Two event kinds, both prefixed with [c<core> cy=<cycle> dyn=<dyn>]:
LSU <op> pc=… addr=… -> L1 <hit|miss>— fired when a load or store enters the L1 MSHR. The hit/miss outcome is the synchronous resultCache::access()reported whenCache::issue()allocated the slot, captured viaMSHR::peek(id)->result.hit. So the user sees how the cache responded at issue time on the early access stream.RETIRE <op> pc=… [addr=…] [branch info]— fired when an instruction commits at the head of the ROB. Includes branch outcome and mispredict flag for branch instructions, and the memory address for loads/stores.
Subtracting an instruction's LSU cycle from its RETIRE cycle gives the observed memory latency for that access, which is exactly what the user wanted: "how the cache is acting" on early instructions.
3.3 Sample (truncated)¶
# Multicore OoO Simulator -- per-core execution trace
# trace : traces/core_4
# protocol : MESI_PRO
# window : first 50 dynamic instructions per core (dyn_count <= 50)
# format : [c<core> cy=<cycle> dyn=<dyn>] <event> ...
…
[c0 cy=00000002 dyn=00003] LSU LOAD pc=0x400008 addr=0x10000080 -> L1 miss
[c2 cy=00000002 dyn=00002] LSU LOAD pc=0x400004 addr=0x10000040 -> L1 miss
[c1 cy=00000004 dyn=00001] RETIRE BRANCH pc=0x400000 branch=T pred=N *MISPRED*
[c0 cy=00000223 dyn=00003] RETIRE LOAD pc=0x400008 addr=0x10000080
[c0 cy=00000644 dyn=00006] RETIRE BRANCH pc=0x400014 branch=T pred=N *MISPRED*
The dyn=00003 load above issued at cycle 2 and retired at cycle 223 → 221 cycles of memory latency, consistent with a 100-cycle DRAM round-trip plus ring + directory hops.
3.4 Implementation¶
include/comparch/ooo/trace_logger.hpp declares a small TraceLogger class:
class TraceLogger {
public:
TraceLogger(std::ostream& out, std::size_t cores, std::size_t max_per_core = 50);
void write_header(const std::string& trace_label, const std::string& proto_label);
bool active(int core_id, std::uint64_t dyn_count) const;
void on_lsu_issue(int core_id, std::uint64_t cycle, std::uint64_t dyn_count,
std::uint64_t pc, std::uint64_t addr, bool is_load, bool hit);
void on_retire(int core_id, std::uint64_t cycle, std::uint64_t dyn_count,
std::uint64_t pc, const char* opcode, std::uint64_t mem_addr,
bool is_branch, bool taken, bool predicted_taken, bool mispredict);
};
The budget is a simple dyn_count <= max_per_core_ check inside active(), so the same instruction's issue and retire both make it through as long as it's in the first 50 of its core. No per-core counter needed.
OooCore (include/comparch/ooo/core.hpp) gained one accessor:
void set_trace_logger(TraceLogger* logger, int core_id);
Default is null; existing test sites and --mode ooo are unaffected.
Hook sites in src/ooo/core.cpp:
stage_state_update— both retire paths (mispredicted-branch retire + regular retire) calltrace_logger_->on_retire(...)if the logger is set.stage_schedule— afterl1d_->issue(req)succeeds, peek the just-allocated MSHR slot and calltrace_logger_->on_lsu_issue(...)withmshr->result.hit.
Driver (src/full/full_mode.cpp):
log_trace_enabled()readsgetenv("LOG")with a permissive parse.- The run directory is resolved up-front (so
log.rptcan be opened before tick() runs) via the newbuild_run_dir_pre()helper. - If LOG is on, an
std::ofstream log_streamand astd::unique_ptr<TraceLogger>are constructed; the latter is wired into everyOooCoreviaset_trace_logger. - Both objects outlive the simulation loop.
The hot path overhead when LOG is off is one null-pointer compare per retire and per LSU issue. When LOG is on, formatting cost is bounded by cores × 50 × 2 events ≈ a few hundred lines.
4. Multi-file report split¶
4.1 What changed¶
Before, the run directory had:
report/<trace>_<proto>_c<N>_<tag>/
├── report.rpt # everything: header + config + per-core + system
└── report.csv # per-core scalars (sweeper input)
After:
report/<trace>_<proto>_c<N>_<tag>/
├── report.rpt # short overview: aggregate IPC + per-core 6-line summary
├── config.rpt # full configuration dump (no sim numbers)
├── stats.rpt # detailed per-core: pipeline / branch / L1 / L2 / pipeline resources / stalls
├── coherence.rpt # protocol counters + derived ratios (C2C/miss, system miss rate, total invals)
├── log.rpt # only if LOG=1: per-instruction execution trace (first 50 per core)
└── report.csv # unchanged — sweep tooling stays compatible
Stdout shows report.rpt's short overview only — easier to grep across many runs.
4.2 Stats added (industry / academic standard)¶
Each metric below was either already computed but not exposed, or trivially derivable from the existing OooStats / CacheStats / CoherenceStats packs.
Pipeline throughput
- cycles, instructions fetched, instructions retired, IPC, CPI
- deadlock state: clean or DEADLOCKED@<cycle> from the per-core watchdog.
Branch predictor
- branches, mispredictions, branch accuracy %, mispredict rate %
- branches/kinst and MPKI (mispred) — the standard branch-density and branch-MPKI numbers in any architecture paper.
Cache (per level)
- accesses, reads / writes, hits, misses
- hit rate %, miss rate %
- APKI (accesses per kilo-instruction) and MPKI (misses per kilo-instruction) — the two pressure metrics gem5/ChampSim sweep papers report side-by-side
- AMAT — average memory access time (Hennessy & Patterson Ch. 2, two-level form hit_lat + miss_rate × AMAT_next_level)
- For L2: local miss rate (L2 misses / L2 accesses) and global miss rate (L2 misses / L1 accesses). The two-rate distinction is standard in cache-hierarchy sections of the architecture textbooks.
- writebacks, coherence invals (broken out from the system-wide pack)
- Prefetcher accounting (issued / hits / misses) shown only when a prefetcher is configured.
Pipeline resources
- ROB / SchedQ / DispQ rows reporting avg / max / capacity / occupancy % — gives a one-shot picture of how full each structure was during the run.
Stalls
- no-fire cycles — fraction of cycles with zero issues from stage_schedule.
- ROB-full dispatch stalls — fraction of cycles where dispatch was blocked because the ROB was full. Together these are the textbook "front-end vs. back-end stall" classification, scaled to a single core.
Coherence (coherence.rpt)
- Cache accesses, Cache misses, Silent upgrades, C2C transfers, Memory reads, Memory writes — the existing system-wide counters.
- C2C / miss — fraction of misses satisfied by a peer cache. A protocol-comparison metric.
- System miss rate — cache_misses / cache_accesses aggregated.
- Coherence invalidations — sum of L1+L2 coherence_invals across cores.
config.rpt is exactly the existing write_config() output, factored out so config sweeps can diff it without churning stats.
4.3 Implementation notes¶
- I split the old monolithic
write_rpt()into 4 section writers (write_overview,write_config_only,write_stats,write_coherence_file), one per output file, plus a small genericwrite_to(name, fn)helper inrun_full_modethat opens and writes each one. CoreMetrics(src/full/full_mode.cpp:191-260) was extended to carry the derived rates / percentages so bothstats.rptandreport.csvuse identical numbers.report.csvschema is unchanged — every header column from before still appears in the same order. The sweep aggregator (scripts/aggregate.py) will keep working without changes.
5. Test surfaces¶
| Test layer | Result |
|---|---|
ctest --test-dir build-release |
128 / 128 passing |
ctest --test-dir build-werror-rel |
128 / 128 passing |
--mode cache regression (project1 fixtures) |
unchanged; cache code untouched |
--mode coherence regression (project3 fixtures × proto × cores) |
unchanged; coherence code untouched |
tests/full/test_full_mode.cpp (5 cases) |
passing — the test that asserted "instructions retired" in stdout still finds it (the label is preserved in the new short overview) |
Sanity run: core_4 × MESI |
4 cores × 1 K instr, 100 % L1 miss as expected (workload property) |
Sanity run: synth/loop_tiny × MESI |
4 cores × 100 K instr, L1 miss rate = 0.0019 — confirms hit rate works when the trace has reuse |
6. Out of scope, surfaced for later¶
- The
(Op::Write)coherence-fill issue noted in §2 — fix would be to plumb the original op through tocache_fill, or to issue a synthetic dirty-marking pass in the adapter. Not addressed here. - The OoO multi-mem record collapse noted in §2 — would require expanding one trace record into multiple in-flight memory ops, with attendant LSU / MSHR pressure changes. Different design discussion.
- Per-core breakdown of coherence counters (today they're system-wide). The agent FSMs already track a few of these per-block; surfacing them per core is a small refactor in
CoherenceStats. - Optionally make
LOG_WINDOW(currently 50) configurable via env var or--log-window=Nif 50 turns out to be too small for some debugging workflow.