Store-miss writeback fix and per-core private synth traces¶
Status: done.
Files touched: include/comparch/cache/mem_req.hpp, src/cache/cache.cpp, src/ooo/core.cpp, include/comparch/coherence/coherence_adapter.hpp, src/coherence/coherence_adapter.cpp, scripts/gen_synth.py, scripts/sanity_rules.py, scripts/aggregate.py.
Tests: 128 / 128 passing on build-release (including the project3 bit-for-bit regression suite).
Smoke sweep: 20 runs, 0 errors, 0 warnings, 0 info.
This pair of fixes closes two long-standing items: the write-allocate-fills-clean bug flagged but unresolved at the end of 13-log-mode-and-rpt-split.md §2, and the synthetic-trace sharing artefact that was producing nonsensical cross-protocol IPC spreads on the validation sweep.
Table of contents¶
- Symptoms observed
- Bug A — store misses fill clean (the writeback bug)
- Bug B — synth traces are fully shared instead of private
- Why these two are coupled
- Implementation details
- Verification
- Sanity-rule and caveat updates
- Follow-up items
1. Symptoms observed¶
A user reading report/loop_tiny_moesif_c4_smoke__proto_moesif/stats.rpt
noticed that all four cores reported L1 writebacks: 0 under every
protocol and asked the obvious question: "is there really no case
where a dirty block gets evicted and written back to memory?"
Sweeping every smoke run confirmed it wasn't isolated to loop_tiny:
| Trace | L1 writebacks (sum across cores) |
|---|---|
loop_tiny |
0 (all 5 protocols) |
sequential_tiny |
0 (all 5 protocols) |
stream_tiny |
0 (all 5 protocols) |
random_tiny |
472 (MESI/MSI/MOSI/MOESIF), 204 (MI) |
loop_tiny's zero is correct (working set of 64 blocks fits trivially
in a 32 KB / 8-way L1 — there are no evictions to write back). But
sequential_tiny and stream_tiny have ~34 000 unique blocks each and
~10% of records are stores; with 100% miss rate, roughly half the
evicted lines ought to be dirty. That zero was a real bug, not a
workload property.
In parallel, the smoke sweep was reporting two
proto_invariance_private warnings:
proto IPC spread 47.80% > 1% on private-address trace; values=
{baseline: 0.0082, proto_mi: 0.0146, proto_moesif: 0.0086,
proto_mosi: 0.0076, proto_msi: 0.0076}
That's two anomalies in one: a 47% spread (rule says < 1%), AND MI running ~80% faster than MESI on what was supposed to be a private workload — which is impossible if the trace is genuinely private, because MI can only ever be slower than the more sophisticated protocols.
2. Bug A — store misses fill clean (the writeback bug)¶
2.1 Mechanism¶
In a writeback cache, the dirty bit tells the cache "I owe a writeback when I evict this line." No dirty bit set → eviction is silent. So for writebacks to fire, two things have to happen, in order:
- A store to that line marks it dirty.
- The line later gets evicted while still dirty.
A store-miss in this simulator follows this path:
core executes Op::Write to address A
↓
L1 lookup → MISS
↓
LSU sends a coherence request to the directory (via the adapter)
↓
[network round-trip; data eventually arrives]
↓
CoherenceAdapter::tick() inserts the line into L1 + L2
↓
... cycles later, line A gets evicted ...
↓
cache sees: dirty bit is 0 → silent drop
The fill site at
coherence_adapter.cpp::tick()
called cache_fill(*l1d_, cache_block, /*rw=*/'R') unconditionally —
even when the originating miss was a store. The line landed clean. The
dirty bit was never set, so the eviction silently dropped what should
have been a writeback.
The fill comment promised a "follow-up Op::Write through the L1 hit
path" to set dirty correctly. No such follow-up existed anywhere in the
codebase. The OoO core's store-completion path
(core.cpp:226-231) just marks the ROB
ready and erases the schedQ entry — it does not re-issue the store
to L1 to set the dirty bit.
2.2 Why random_tiny partly escaped¶
random_tiny reported 472 writebacks across 4 cores while
sequential_tiny reported 0. The difference is the 6.1 % reuse rate:
~2 000 of random_tiny's 34 000 memory ops hit in cache rather than
miss. Write hits don't go through the adapter fill path — they
update the resident line directly via the L1 hit-write splice
(cache.cpp:306), which sets the dirty bit
correctly. When those hit-dirtied lines later get evicted by the next
~32 000 misses, they trigger writebacks. That's the 472.
sequential_tiny and stream_tiny have 0% reuse — every access is
unique → every store is a miss → every fill goes through the buggy
path → no dirty bits ever get set → no writebacks ever fire.
2.3 The plumbing problem with the obvious fix¶
The "obvious" fix — pass 'W' to cache_fill when the original op was
a store — runs into a layered-cache issue. L1 forwards a write-miss to
L2 as Op::Read, not Op::Write, because L2 also runs the WBWA
write-allocate path: passing Op::Write to L2 would land on its
hit-write splice and incorrectly mark L2 dirty, when L2 is just a
storage tier holding a copy. The convention is that L1 reads the line
from L2, then dirties it locally on its own write hit.
Concretely:
// src/cache/cache.cpp — L1 forwarding to L2 on a miss
if (cfg_.next_level) {
const auto sub = cfg_.next_level->access(
MemReq{block_addr << cfg_.b, Op::Read, req.pc}); // <- always Read
...
}
By the time the request reaches L2's coherence_sink->on_miss(...)
call, req.op has already been laundered to Op::Read. The adapter
has no way to tell the original L1 op was a store.
2.4 The fix: thread originating_op through MemReq¶
Add a separate originating_op field to MemReq that travels alongside
the local op. L1 sets originating_op = req.op when forwarding to
L2; L2 forwards it through; the adapter inspects originating_op to
decide whether to fill L1 dirty.
// include/comparch/cache/mem_req.hpp
struct MemReq {
std::uint64_t addr = 0;
Op op = Op::Read;
std::uint64_t pc = 0;
Op originating_op = Op::Read; // top-level op (set by caller)
};
// src/cache/cache.cpp — L1 forwarding to L2 (write-allocate fetch)
const auto sub = cfg_.next_level->access(
MemReq{block_addr << cfg_.b, Op::Read, req.pc,
/*originating_op=*/req.op});
// src/cache/cache.cpp — L2's miss path into the coherence sink
} else if (cfg_.coherence_sink) {
cfg_.coherence_sink->on_miss(block_addr, req.originating_op);
suspended = true;
}
// src/ooo/core.cpp — the OoO core sets it at the top of the chain
req.op = u.is_load ? cache::Op::Read : cache::Op::Write;
req.originating_op = req.op;
auto id = l1d_->issue(req);
The CoherenceAdapter then tracks pending store misses and consults the set when a fill arrives:
// include/comparch/coherence/coherence_adapter.hpp
std::unordered_set<std::uint64_t> pending_stores_;
// src/coherence/coherence_adapter.cpp::on_miss
if (op == cache::Op::Write) {
pending_stores_.insert(byte_block);
}
// src/coherence/coherence_adapter.cpp::tick (fill site)
const bool was_store = pending_stores_.erase(byte_block) > 0;
cache_fill(*l2d_, cache_block, /*rw=*/'R'); // L2 stays clean
cache_fill(*l1d_, cache_block, /*rw=*/was_store ? 'W' : 'R'); // L1 dirty for stores
insert_new_block's existing write-allocate path
(cache.cpp:194-196) already sets
new_block.dirty = true when called with 'W', so the fix lands
exactly where the existing dirty bookkeeping expects it.
2.5 Why this doesn't break the project3 regression suite¶
Coherence semantics are tracked by the directory, not by the local
cache's dirty bit. The directory's handle_writeback distinguishes
dirty from clean by its own per-block ownership state, and the
adapter's on_evict always emits MessageKind::DATA_WB regardless of
the local dirty flag
(coherence_adapter.cpp:83-94
— (void)dirty; is the relevant line). So the writeback fix is a
local-stats correctness fix; it does not alter the cross-cache messages
that the regression suite pins on. All 128 tests still pass bit-for-bit
against dirsim reference outputs.
3. Bug B — synth traces are fully shared instead of private¶
3.1 What the rule was checking¶
The harness's cross-run sanity rule (sanity_rules.py) said: on a workload where each core touches its own private addresses, all five coherence protocols should produce IPC within 1% of each other. That follows directly from what coherence does — if no two cores ever touch the same block, the protocol never has to invalidate, downgrade, or transfer anything, so MI / MSI / MESI / MOSI / MOESIF should all degenerate to the same per-core L1 hit-rate / miss-rate / round-trip behavior.
The rule classified synth/sequential_* and synth/random_* as
private, based on filename.
3.2 What the trace generator was actually producing¶
scripts/gen_synth.py called
build-release/tools/gen_trace/gen_trace once per (pattern × size),
producing one raw.champsimtrace, then symlinked
p0..p3.champsimtrace → raw.champsimtrace so the directory looked like
a 4-core trace dir to the simulator:
traces/synth/sequential_tiny/
├── raw.champsimtrace ← actual data (single stream)
├── p0.champsimtrace -> raw.champsimtrace
├── p1.champsimtrace -> raw.champsimtrace
├── p2.champsimtrace -> raw.champsimtrace
└── p3.champsimtrace -> raw.champsimtrace
When the simulator opened p0, p1, p2, p3, it read the same byte
stream four times. All four cores executed byte-identical instruction
streams over byte-identical addresses. Far from being private, this is
the most shared possible workload — every memory op on core 0 touched
the same address as the corresponding op on cores 1, 2, and 3. The
rule's filename-based "private" classification didn't match the
generator's output.
3.3 Why MI was faster than MESI under sharing¶
MI has only two states (Modified, Invalid). Reads and writes both yank the line into M state, evicting it from whoever held it. MESI/MSI/MOSI/MOESIF have a Shared state that a read can leave the line in, but then a subsequent write requires an upgrade (broadcast invalidate, wait for acks, transition to M) — a directory round-trip per upgrade.
On four cores running lockstep over identical addresses with stores mixed in, MESI's S → M upgrade path costs strictly more than MI's read-yanks-the-line semantics. The line ping-pongs under MI too, but MI skips the upgrade step. So MI looking faster wasn't an MI bug — it was MESI doing extra work that paid off only on workloads with unsharing of an S-state line, which lockstep traces never produce.
The "anomaly" was real, but it was a property of the trace, not of the simulator.
3.4 The fix: per-core distinct streams¶
scripts/gen_synth.py now calls gen_trace
four times per (pattern × size), once per core, with two distinct
per-core knobs:
--seed (seed_base + i)— different RNG sequences per core, so the load/store/branch decisions diverge even for deterministic patterns.--addr-base (DEFAULT_ADDR_BASE + i × 1 TB)— disjoint base address per core. Sequential / stream walks march through a different region of the 64-bit address space; random / loop windows live in non-overlapping regions.
traces/synth/sequential_tiny/
├── p0.champsimtrace ← seed = S+0, addr_base = B + 0 TB
├── p1.champsimtrace ← seed = S+1, addr_base = B + 1 TB
├── p2.champsimtrace ← seed = S+2, addr_base = B + 2 TB
└── p3.champsimtrace ← seed = S+3, addr_base = B + 3 TB
The 1 TB stride dwarfs the largest per-core working set (long-tier sequential = 6.4 GB / core; random = 16 MB window; loop = 4 KB), so collisions are impossible across all sweep tiers.
gen_trace already exposed both knobs (--seed and --addr-base); no
C++ change was needed — just rewrite of the Python generator. The
legacy raw.champsimtrace is removed during regeneration so old
symlink layouts don't shadow the new files.
3.5 Trade-off: synth no longer exercises shared coherence¶
By making every synth trace genuinely private, no synth trace exercises shared-line coherence anymore. The only ways to stress shared coherence today are:
- The
tests/coherence/fixtures/proj3/traces/core_4/core_8/... fixtures used by the bit-for-bit regression suite. - A user-built
--trace-listmanifest that points multiple cores at the same trace file (see TRACES.md §2).
A future gen_synth_shared.py (or a --shared flag on the existing
generator) would close this gap. Tracked under TODO in
TRACES.md §10.4.
4. Why these two are coupled¶
The writeback bug had been visible for months in the 0 writebacks on
sequential_tiny symptom but didn't surface as a sweep failure because
the sweep was running fully-shared synth traces, where MI's protocol
quirks dominated the IPC numbers and the writeback-stats anomaly was
just one statistic among many. Fixing the trace generator to produce
private streams revealed the writeback bug clearly: with sharing
removed, the only thing protocols could differ on was per-core local
behavior, and the missing writebacks became conspicuous.
Conversely, fixing only the writeback bug would have left the
proto_invariance_private warnings firing forever. Fixing only the
trace generator would have left
sequential_tiny/stream_tiny/loop_tiny reporting 0 writebacks —
this time correctly for loop_tiny (working set fits, no evictions)
but still incorrectly for the sequential / stream / random patterns
that now run private streams with massive eviction churn.
So both fixes belong in the same pass.
5. Implementation details¶
5.1 Files changed (C++)¶
| File | Change |
|---|---|
| include/comparch/cache/mem_req.hpp | Added originating_op field to MemReq. |
| src/cache/cache.cpp | L1's miss-forward to L2 sets originating_op = req.op. L2's coherence_sink->on_miss call uses req.originating_op. |
| src/ooo/core.cpp | OoO core sets req.originating_op = req.op when issuing into L1 (the topmost MemReq in the chain). |
| include/comparch/coherence/coherence_adapter.hpp | Added std::unordered_set<std::uint64_t> pending_stores_ member. |
| src/coherence/coherence_adapter.cpp | on_miss(... Op::Write) inserts into pending_stores_. tick() consults the set on fill arrival and passes 'W' to cache_fill(*l1d_, ...) for store-fills. L2 still fills clean (write-allocate dirty is L1-only). |
5.2 Files changed (Python harness)¶
| File | Change |
|---|---|
| scripts/gen_synth.py | Generates 4 distinct trace files per (pattern × size) with per-core seed offset and per-core 1 TB addr_base stride. Removes legacy raw.champsimtrace and any old symlinks during regeneration. already_done rejects symlinks so a forced re-gen always replaces them. |
| scripts/sanity_rules.py | All synth/* traces now classified as private (no shared synth exists anymore). proto_invariance_private tolerance bumped 1% → 5% to absorb per-core RNG noise on tiny traces. l2_l1_miss_mismatch tolerance bumped 10% → 100% because L2 now correctly sees L1 writeback traffic in addition to demand misses. |
| scripts/aggregate.py | Removed the stale shared-traces ABORT/INTERVENE caveat (no shared synth traces exist). Replaced with the private-by-default note. |
5.3 Why pending_stores_ is a set keyed by byte_block¶
Because the L1 MSHR merges loads and stores on the same block into one outstanding request. If any of the merged ops on a block is a store, the resulting line owes a writeback once it gets evicted. A set naturally handles "any store seen on this block" idempotently. Erasing on fill returns whether the block was in the set, which doubles as the "was a store" flag.
A potential edge case — block being evicted before its fill arrives —
isn't reachable under standard MSHR semantics: the MSHR holds the slot
open until mark_ready fires, and the cache can't evict an in-flight
block.
6. Verification¶
6.1 Writeback fix¶
Before:
$ make run TRACE=traces/synth/sequential_tiny TAG=before
L1 wb total: 0
L2 wb total: 0
After:
$ make run TRACE=traces/synth/sequential_tiny TAG=after
L1 wb total: 29743
L2 wb total: 255
The L2 writeback count is small because L1 writebacks under coherence
flow to the directory via on_evict, not through L2 directly; L2 only
sees writebacks when an L1-evicted dirty line happens to land in an
L2-resident-clean slot and dirties it.
6.2 Per-core distinct streams¶
Before:
proto_invariance_private [synth/random_tiny]:
spread 47.80% > 1%
values = {baseline: 0.0082, proto_mi: 0.0146, ...}
proto_invariance_private [synth/sequential_tiny]:
spread 50.21% > 1%
values = {baseline: 0.0077, proto_mi: 0.0145, ...}
After:
proto_invariance_private: no warnings
random_tiny IPC spread: 1.04% (statistical noise)
sequential IPC spread: < 1%
MI runtime on the smoke tier collapsed from 200+ seconds (synth sequential / stream / random) to ~6 seconds, because the network ping-pong pathology that emerges under heavy line-sharing is gone.
6.3 Test suite and sweep¶
$ ctest --test-dir build-release
100% tests passed, 0 tests failed out of 128
Total Test time (real) = 1.65 sec
$ make smoke
runs=20 errors=0 warnings=0 info=0
The 128-test suite includes all
(MSI / MESI / MOSI / MOESIF) × (4 / 8 / 12 / 16 cores) project3
bit-for-bit regression cases. Bit-for-bit equivalence is preserved
because:
- The writeback fix is local-stats only (the directory tracks dirty independently — the cross-cache message stream is unchanged).
- The synth-trace change touches
traces/synth/*only; the project3 fixtures live undertests/coherence/fixtures/proj3/and weren't touched.
7. Sanity-rule and caveat updates¶
Two rule thresholds and one caveat were updated to reflect the new correct behavior, so the harness doesn't fire false-positives:
proto_invariance_private: 1% → 5% spread tolerance. Per-core RNG variance on tiny traces produces a few percent IPC spread even on truly private workloads (a branch outcome here, a store-vs-load there). The original 1% threshold was tuned for the symlinked byte-identical-stream world where any spread had to be a protocol effect; with per-core seeds, RNG noise is in the same band, and 1% is too tight.l2_l1_miss_mismatch: 10% → 100% ratio tolerance. The expected relationship isl2_accesses ≈ l1_misses + l1_writebacks, and worst case writeback rate equals miss rate, sol2_acccan approach2 × l1_misses. The old 10% threshold assumed the buggy zero-writebacks world. (Threadingl1_writebacksintoCoreRowand checking the precise relationship is a more exact fix; the loose-threshold version is simpler and still catches the genuinely suspicious case where L2 sees traffic that doesn't trace back to any L1 event.)- The summary-level "ABORT/INTERVENE knob is TODO" caveat was rewritten to match the new private-by-default reality. The proto_invariance_shared code path is preserved as a placeholder for the future shared-trace family; today it never fires because no synth trace is classified as shared.
8. Follow-up items¶
-
Shared-coherence synth traces. A future
gen_synth_shared.py(or a--sharedflag on the existing generator) should produce traces where 2+ cores share a working set, so sweeps can stress shared-line coherence the way the project3 fixtures do, but at arbitrary tier sizes. Tracked in TRACES.md §10.4. -
Plumb
l1_writebacksintoCoreRow. Thel2_l1_miss_mismatchrule could check the exact relationshipl2_acc - l1_writebacks ≈ l1_missesinstead of the loose 100% tolerance. Requires extending the report.csv columns and the parser inaggregate.py. -
OoO under-counts mem traffic vs cache mode. Same item flagged in 13-log-mode-and-rpt-split.md §2: src/ooo/inst.cpp:50-61 takes only the first non-zero entry of
source_memory[]/destination_memory[], while--mode cacheiterates all of them. Doesn't change miss rate, but means OoO'saccessescount is≤cache-mode for the same trace. Still unresolved.