Phase 5B: Default mode (multi-core OoO + coherence)
Goal: make the default invocation (no --mode flag) run the full
multi-core OoO simulator. N OoO cores, each with a private finite L1+L2,
all connected through the Phase 5A coherence ring + directory + agents.
After this phase the simulator engine is feature-complete: every microarchitectural piece in the original plan runs together. What remains is the Phase 1 leftover work (real workload tracing) and Phase 6 (polish, plots, public release).
What was built
| Piece | Where it lives now | Notes |
|---|---|---|
| `CoherenceSink` interface | `include/comparch/cache/coherence_sink.hpp` | Tiny abstract base. Lives in the cache module so the cache layer doesn't pull coherence headers. |
| `Cache::Config::coherence_sink` slot | `include/comparch/cache/cache.hpp` | Wired into 8 miss/eviction call sites in `cache.cpp`. With sink null, behavior is byte-identical to Phase 4. |
| `Cache::mark_ready` / `coherence_invalidate` | `src/cache/cache.cpp` | External MSHR completion + silent block drop on directory invalidation. |
| `kCoherenceSuspendedLatency` sentinel | `include/comparch/cache/coherence_sink.hpp` | `unsigned int` max returned by `Cache::access` when a miss has been routed to the sink. Propagates through L1→L2 chains so L1's `issue()` parks the MSHR with `due_cycle = UINT64_MAX`. |
| `CpuPort` abstract type | `include/comparch/coherence/cpu_port.hpp` | Replaces the `FiciCpu*` field in `coherence::Cache`. `FiciCpu` (5A) and `CoherenceAdapter` (5B) both inherit. |
| `CoherenceAdapter` | `src/coherence/coherence_adapter.cpp` | The per-core shim. Implements `CoherenceSink` (L2 calls it on miss/evict) and `CpuPort` (the agent's `send_DATA_proc` writes into it). Owns the per-core `coherence::Cache` (the unbounded protocol-state map from Phase 5A) and holds non-owning pointers to L1+L2. |
| `MessageKind::DATA_WB` repurposed as WRITEBACK | `src/coherence/directory.cpp` | New shared `DirectoryController::handle_writeback` helper; each protocol's tick gets a leading branch. Dirty M/O/F evictions increment `memory_writes`; clean drops just clear presence. |
| `Network` second constructor | `include/comparch/coherence/network.hpp`, `src/coherence/network.cpp` | Phase 5A constructor stays for `--mode coherence`. The 5B constructor takes pre-built `(CpuPort*, coherence::Cache*)` pairs and a `DirectoryController` and just stitches them into the ring. |
| Async OoO stores | `src/ooo/core.cpp` | Phase 4's synchronous `l1d_->access(req)` for stores is replaced by `l1d_->issue(req)` + `peek`/`complete` polling on the LSU FU, the same path as loads. Required because under coherence a store may need a network round-trip to acquire M-state. |
| `run_full_mode` driver | `src/full/full_mode.cpp` | Builds N OoO cores, N L1+L2 caches, N adapters, 1 directory, 1 Network. Per-cycle loop: tick all cores, then network, then tock. Global cycle cap as a backstop. |
| CLI: drop "full" from `--mode` validator | `src/common/cli.cpp` | Default invocation (no `--mode`) → `Mode::Full` → `run_full_mode`. `parse_mode("full")` still resolves internally for round-trip tests. |
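To make the sink/sentinel interplay in the table concrete, here is a minimal sketch. Only the names `CoherenceSink`, `kCoherenceSuspendedLatency`, `mark_ready`, and `coherence_invalidate` come from the table; the method names `on_miss`/`on_evict`, the `ToyCache` and `RecordingSink` types, and all signatures are assumptions made for illustration and are not the project's actual API.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_set>

// Sentinel described in the table: unsigned int max signals "miss routed to the sink".
constexpr unsigned kCoherenceSuspendedLatency = std::numeric_limits<unsigned>::max();

// Hypothetical shape of the abstract base; both method names are assumptions.
struct CoherenceSink {
    virtual ~CoherenceSink() = default;
    virtual void on_miss(std::uint64_t block_addr, bool is_write) = 0;
    virtual void on_evict(std::uint64_t block_addr, bool dirty) = 0;
};

// Toy cache level (not the project's Cache class): a set of resident blocks
// with fixed hit/miss latencies and an optional sink.
class ToyCache {
public:
    ToyCache(unsigned hit_latency, unsigned miss_latency, CoherenceSink* sink)
        : hit_latency_(hit_latency), miss_latency_(miss_latency), sink_(sink) {}

    unsigned access(std::uint64_t block_addr, bool is_write) {
        if (resident_.count(block_addr) != 0) return hit_latency_;
        if (sink_ != nullptr) {
            // Sink wired: hand the miss over and report "suspended".
            sink_->on_miss(block_addr, is_write);
            return kCoherenceSuspendedLatency;
        }
        // Sink null: fill locally and return the fixed miss latency.
        resident_.insert(block_addr);
        return miss_latency_;
    }

    // External completion: called once the coherence side has delivered data.
    void mark_ready(std::uint64_t block_addr) { resident_.insert(block_addr); }

    // Silent drop on directory invalidation.
    void coherence_invalidate(std::uint64_t block_addr) { resident_.erase(block_addr); }

private:
    unsigned hit_latency_;
    unsigned miss_latency_;
    CoherenceSink* sink_;
    std::unordered_set<std::uint64_t> resident_;
};

// Test double that only counts notifications.
struct RecordingSink : CoherenceSink {
    int misses = 0;
    int evictions = 0;
    void on_miss(std::uint64_t, bool) override { ++misses; }
    void on_evict(std::uint64_t, bool) override { ++evictions; }
};
```

With a null sink the toy cache behaves like a plain fixed-latency cache, which mirrors the "byte-identical to Phase 4" property claimed for the real slot.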
What was dropped:
- The synchronous-store path in the OoO core. Under coherence it's fundamentally unsound (a store needs M-state, which may take a network round-trip). The async path subsumes both single-core and multi-core configurations; Phase 4 `--mode ooo` tests still pass with the async path because the `due_cycle` logic still drives completion in the absence of a coherence sink.
- The `Cache::issue` `Op::Write` rejection. It was a precondition designed to prevent the merge fast-path from losing dirty-bit semantics; now writes simply skip the merge path and always allocate a fresh MSHR slot.
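The issue-then-poll pattern that replaces the synchronous store path can be sketched as follows. This is a toy model under stated assumptions: `ToyMshrFile`, `ToyMshr`, `kParked`, and every signature here are hypothetical; only the ideas of `issue`, `peek`, `complete`, `mark_ready`, and a `due_cycle` parked at `UINT64_MAX` come from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// A parked request has no known completion time yet (UINT64_MAX).
constexpr std::uint64_t kParked = std::numeric_limits<std::uint64_t>::max();

struct ToyMshr {
    std::uint64_t block_addr;
    std::uint64_t due_cycle;
    bool valid;
};

class ToyMshrFile {
public:
    // Every request gets a fresh slot. Without a coherence sink the due cycle
    // is known up front; with one, the request is parked until mark_ready.
    std::size_t issue(std::uint64_t block_addr, std::uint64_t now,
                      std::uint64_t latency, bool suspended) {
        const std::uint64_t due = suspended ? kParked : now + latency;
        slots_.push_back(ToyMshr{block_addr, due, true});
        return slots_.size() - 1;
    }

    // Polled by the load/store unit each cycle.
    bool peek(std::size_t id, std::uint64_t now) const {
        return slots_[id].valid && slots_[id].due_cycle <= now;
    }

    // External completion for parked requests to this block.
    void mark_ready(std::uint64_t block_addr, std::uint64_t now) {
        for (ToyMshr& m : slots_) {
            if (m.valid && m.block_addr == block_addr && m.due_cycle == kParked) {
                m.due_cycle = now;
            }
        }
    }

    // Retire the slot once the core has consumed the result.
    void complete(std::size_t id) { slots_[id].valid = false; }

private:
    std::vector<ToyMshr> slots_;
};
```

The same polling code serves both configurations: in the sink-free case `due_cycle` alone drives completion, which is why a single async path can cover single-core and multi-core runs.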
What was deferred to Phase 6 cleanup (acknowledged correctness gaps):
- Store on S-state hit: A store on a line resident in L1 in S-state hits the cache (`block_in() = true`) and completes synchronously without consulting the agent. Under coherence it should issue a GETM to upgrade S→M first. Phase 5B's tests don't exercise this pattern (synthetic shared-load and producer-consumer cases land blocks in M directly), so the gap doesn't surface in the suite, but a real workload doing read-then-write on shared data would hit it. Fix: route writes through a write-aware path that consults agent state, OR mark write-hits to non-M lines as forced misses.
- RECALL_GOTO_S clean-bit propagation: When the agent transitions M→O / E→F via RECALL_GOTO_S, the line stays resident in L1+L2 but its dirty bit doesn't get cleared. The next eviction will spuriously count it as a dirty writeback. This inflates `memory_writes` slightly; it doesn't cause incorrect coherence behavior.
- Adapter outbound queue under deep contention: L2 misses in the same cycle from multiple LSU FUs are buffered in the adapter's `outbound_proc_` deque and drained one per cycle. Acceptable for Phase 5B, but a real bottleneck-modeling pass would expose this as a configurable per-cycle dispatch limit.
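The second fix proposed for the S-state store gap ("mark write-hits to non-M lines as forced misses") could look roughly like the sketch below. The enum names and the `classify_store` function are hypothetical and not part of the codebase; the rule is applied literally, so a resident E-state line is also forced to miss here.

```cpp
// Coherence states as seen by the per-core agent (names are illustrative).
enum class LineState { I, S, E, O, F, M };

// What the cache should do with a store.
enum class StoreOutcome { Hit, ForcedMiss, Miss };

// A store counts as a hit only when the block is resident AND the agent
// already holds it in M. Any other resident state becomes a forced miss, so
// the request goes through the miss path and the agent can upgrade the line.
inline StoreOutcome classify_store(bool block_resident, LineState agent_state) {
    if (!block_resident) return StoreOutcome::Miss;
    if (agent_state == LineState::M) return StoreOutcome::Hit;
    return StoreOutcome::ForcedMiss;
}
```

A forced miss trades some latency for correctness: the store takes the network round-trip the S→M upgrade requires instead of completing synchronously.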
Architecture
    Per core i (i ∈ [0, cores)):

    OooCore_i ──► l1d_i ──► l2_i ──► CoherenceAdapter_i
                                            │
                                            ▼
                          coherence::Cache_i (state-table, agents)
                                            │
                                 Network (Phase 5A, RING)
                                            │
                                ┌───────────┴───────────┐
                                ▼                       ▼
                       DirectoryController        (other cores)

Per-cycle ordering (in run_full_mode's main loop):
- Tick every OoO core. The core advances its pipeline; on an L2 miss the adapter queues a LOAD/STORE Message into its outbound buffer.
- Tick the Network. Each Node ticks: the adapter drains one queued message into `coh_cache->cpu_in_next`, the agent processes any message that landed in `cpu_in` last tock, and ring movement happens.
- Tock the Network. Half-cycle staging buffers shift forward.

Termination: every core reports done AND the Network has no in-flight messages. A 5M-cycle global cap catches deadlock-shaped hangs.
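The ordering and termination rule above can be sketched with toy stand-ins. `ToyCore`, `ToyNetwork`, `RunResult`, and `run_loop` are hypothetical names; the real `run_full_mode` drives actual OoO cores and the Phase 5A ring, and its exact structure is not shown in this document.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy core: finishes after a fixed number of cycles of work.
struct ToyCore {
    std::uint64_t remaining;
    void tick() { if (remaining > 0) --remaining; }
    bool done() const { return remaining == 0; }
};

// Toy network: delivers at most one in-flight message per cycle.
struct ToyNetwork {
    std::uint64_t in_flight;
    std::uint64_t staged = 0;
    void tick() { staged = (in_flight > 0) ? 1 : 0; }  // choose what moves
    void tock() { in_flight -= staged; staged = 0; }   // commit the move
    bool idle() const { return in_flight == 0; }
};

struct RunResult {
    std::uint64_t cycles;
    bool hit_cap;  // true when the global cap fired before normal termination
};

inline RunResult run_loop(std::vector<ToyCore>& cores, ToyNetwork& net,
                          std::uint64_t cycle_cap) {
    // Done only when every core is done AND nothing is in flight.
    auto finished = [&] {
        return net.idle() &&
               std::all_of(cores.begin(), cores.end(),
                           [](const ToyCore& c) { return c.done(); });
    };
    std::uint64_t cycle = 0;
    while (cycle < cycle_cap && !finished()) {
        for (ToyCore& c : cores) c.tick();  // 1. tick every core
        net.tick();                         // 2. tick the network
        net.tock();                         // 3. tock the network
        ++cycle;
    }
    return RunResult{cycle, !finished()};
}
```

Checking the network as well as the cores matters: a core can report done while a message it triggered is still travelling the ring.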
CLI changes
Before:

    --mode {full,cache,predictor,ooo,coherence}

After:

    --mode {cache,predictor,ooo,coherence}   # omit for full multi-core simulator

Mode::Full stays as the internal default (CliArgs::mode's initializer).
parse_mode("full") continues to work for to_string/parse_mode
round-trip tests. Passing --mode full is now rejected with exit 1
and a clear error.
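The split between what `parse_mode` accepts and what the `--mode` validator accepts can be sketched like this. `Mode::Full`, `parse_mode`, and `to_string` are named in the text; the other enumerator names, the signatures, and the `resolve_cli_mode` helper are assumptions for illustration.

```cpp
#include <initializer_list>
#include <optional>
#include <string_view>

enum class Mode { Full, Cache, Predictor, Ooo, Coherence };

constexpr std::string_view to_string(Mode m) {
    switch (m) {
        case Mode::Full:      return "full";
        case Mode::Cache:     return "cache";
        case Mode::Predictor: return "predictor";
        case Mode::Ooo:       return "ooo";
        case Mode::Coherence: return "coherence";
    }
    return "full";  // not reached for valid enumerators
}

// Internal parser: still resolves "full" so to_string/parse_mode round-trips.
inline std::optional<Mode> parse_mode(std::string_view s) {
    for (Mode m : {Mode::Full, Mode::Cache, Mode::Predictor, Mode::Ooo,
                   Mode::Coherence}) {
        if (to_string(m) == s) return m;
    }
    return std::nullopt;
}

// CLI-facing resolution (hypothetical helper). An empty result means
// "reject"; the real CLI exits with status 1 and an error in that case.
inline std::optional<Mode> resolve_cli_mode(std::optional<std::string_view> flag) {
    if (!flag) return Mode::Full;              // no --mode: default to full
    if (*flag == "full") return std::nullopt;  // explicit --mode full: rejected
    return parse_mode(*flag);                  // unknown names are also rejected
}
```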
The default invocation:
    ./build/src/sim --config configs/baseline.json \
                    --cores 4 \
                    --trace-dir traces/myworkload/

`traces/myworkload/p0.champsimtrace` ... `p3.champsimtrace` are the expected per-core ChampSim binary traces.
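As an illustration of the naming convention above (one `p<i>.champsimtrace` file per core inside the trace directory), a helper along these lines would enumerate the expected files. `per_core_trace_paths` is a hypothetical name; how the simulator actually locates its traces is not shown in this document.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Build the expected trace path for each core: <trace_dir>/p<i>.champsimtrace.
inline std::vector<std::filesystem::path> per_core_trace_paths(
        const std::filesystem::path& trace_dir, unsigned cores) {
    std::vector<std::filesystem::path> paths;
    paths.reserve(cores);
    for (unsigned i = 0; i < cores; ++i) {
        paths.push_back(trace_dir / ("p" + std::to_string(i) + ".champsimtrace"));
    }
    return paths;
}
```

A caller could check each returned path with `std::filesystem::exists` before starting a run, so a missing per-core trace fails fast.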
Verification
Test count: 122 (Phase 5A) → 126 (Phase 5B). New tests:
- `test_cli_dispatch.cpp` (Phase 5A, 5 tests): `Mode::Full` is the default, `parse_mode("full")` still resolves, `--mode full` is rejected, and a no-flag invocation reaches `Mode::Full`.
- `test_coherence_sink.cpp` (4 tests): `Cache::access` returns the suspended sentinel when a sink is wired; `mark_ready` flips the MSHR to ready; `coherence_invalidate` drops resident blocks; sink-wired evictions notify on both dirty and clean victims.
- `test_writeback.cpp` (5 tests): one per protocol (MSI/MESI/MOSI/MOESIF plus MI). They verify that a dirty-owner WRITEBACK increments `memory_writes`, a clean-sharer WRITEBACK is silent, presence drops, and state collapses to I when the last sharer leaves.
- `test_full_mode.cpp` (4 tests): 1-core ALU-only smoke; 2-core ALU-only; 4-core private loads under MESI; all 5 protocols run a 2-core private-load trace.

    cmake --build build -j && ctest --test-dir build --output-on-failure
    # 126 / 126 tests passed
    cmake --preset ci && cmake --build build-ci -j && ctest --test-dir build-ci
    # 126 / 126, clean under -Werror on Apple Clang
    cmake --preset release && cmake --build build-release -j
    # 126 / 126, -O3 build

Phase 5A's 16/16 proj3 parity combos still pass: the WRITEBACK
directory branch never fires under FICI traces, which never produce
evictions on the unbounded `coherence::Cache` state-table.
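The directory-side WRITEBACK behaviour that the tests check (dirty owner counts a memory write, clean sharer is silent, presence drops, state collapses to I when the last sharer leaves) can be sketched as below. This is a simplified model, not `DirectoryController::handle_writeback` itself: `DirEntry`, `DirStats`, the three-state `DirState`, and the 64-core presence vector are all assumptions.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Simplified directory state; real protocols track more states (E/O/F).
enum class DirState { I, S, M };

struct DirEntry {
    DirState state = DirState::I;
    std::bitset<64> presence;  // one bit per core (64 is an arbitrary bound)
};

struct DirStats {
    std::uint64_t memory_writes = 0;
};

// Handle a WRITEBACK from `core` for the block described by `entry`.
inline void handle_writeback(DirEntry& entry, std::size_t core, bool dirty,
                             DirStats& stats) {
    if (!entry.presence.test(core)) return;  // core is not a holder: ignore
    if (dirty) ++stats.memory_writes;        // dirty eviction reaches memory
    entry.presence.reset(core);              // the core no longer holds the block
    if (entry.presence.none()) {
        entry.state = DirState::I;           // last holder left
    }
}
```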
What's next
The simulator engine is done. The remaining roadmap items are:
- Phase 1 finish-out: the real DynamoRIO-based tracer (`tools/tracer/drmem2champsim`), public trace fetcher (`scripts/fetch_traces.sh`), end-to-end validation on a pthreads matmul workload, optional Pin-based ChampSim-tracer wrapper. Until this lands, Phase 5B can only run synthetic and project-converted traces.
- Phase 6 polish: correctness gaps listed above (S→M store upgrade, dirty-bit propagation on RECALL_GOTO_S), an interesting "result" plot or two (IPC vs cache size, MPKI per predictor, scaling vs core count under MSI/MESI/MOSI/MOESIF), `docs/architecture.md` with timing diagrams, public-release decision.