Architecture¶
Multicore-OoO-sim is a multi-core, out-of-order, cache-coherent CMP simulator
in modern C++20. Each core models a Tomasulo-style out-of-order pipeline with
a pluggable branch predictor and a private L1/L2 cache hierarchy. Cores share
a directory over a ring interconnect, with five interchangeable coherence
protocols (MI, MSI, MESI, MOSI, MOESIF). There is one binary, one JSON config,
and one trace format: the simulator runs ChampSim v1 traces and the project's
CasimV2 multi-thread extension.
System block diagram¶
┌─ Core 0 ──┐   ┌─ Core 1 ──┐  ...  ┌─ Core N-1 ─┐
│ Fetch     │   │           │       │            │
│ Decode    │   │           │       │            │
│ Rename    │   │           │       │            │
│ ROB / RS  │   │           │       │            │
│ Exec / LSU│   │           │       │            │
│ L1 D + I  │   │           │       │            │
│ L2 priv.  │   │           │       │            │
└─────┬─────┘   └─────┬─────┘       └─────┬──────┘
      │               │                   │
      └──────── Ring interconnect ────────┘
                        │
             ┌───── Directory ─────┐
             │  (sharer vectors)   │
             └──────────┬──────────┘
                        │
                      ┌─┴──┐
                      │DRAM│
                      └────┘
Each core is a full OoO pipeline (fetch → decode → rename → dispatch → schedule
→ execute → writeback → retire). Each core has private L1 (I + D) and L2 caches.
The L1s on different cores stay coherent through a directory-based protocol;
the directory and the ring live in src/coherence/.
Per-module breakdown¶
Fetch / decode / rename¶
src/ooo/core.cpp, src/ooo/inst.cpp, src/ooo/rat.cpp.
The front-end pulls instructions from the trace reader, decodes them into the
internal inst representation, and renames their source/destination registers
through the Register Alias Table (RAT). Fetch and decode widths, ROB size,
RS size, and number of functional units are all config knobs.
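The rename step described above can be pictured roughly as follows. This is a minimal sketch, assuming the RAT maps each architectural register either to "value already committed" or to the ROB tag of its in-flight producer. The names `Rat`, `RenamedInst`, and `kNumArchRegs` are hypothetical and are not taken from src/ooo/rat.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sketch: none of these names come from src/ooo/rat.cpp.
constexpr int kNumArchRegs = 32;  // assumed architectural register count

struct RenamedInst {
    std::optional<uint32_t> src1_tag;  // ROB tag of the producer; empty if the value is committed
    std::optional<uint32_t> src2_tag;
    uint32_t dest_tag;                 // ROB tag allocated to this instruction
};

class Rat {
public:
    // Read both source mappings first, then point the destination at the new
    // tag, so that r1 = r1 + r2 depends on the older producer of r1.
    RenamedInst rename(int src1, int src2, int dest, uint32_t new_rob_tag) {
        assert(src1 >= 0 && src1 < kNumArchRegs);
        assert(src2 >= 0 && src2 < kNumArchRegs);
        assert(dest >= 0 && dest < kNumArchRegs);
        RenamedInst out{map_[src1], map_[src2], new_rob_tag};
        map_[dest] = new_rob_tag;
        return out;
    }

    // At retirement, clear the mapping only if no younger instruction has
    // renamed the same register in the meantime.
    void retire(int dest, uint32_t rob_tag) {
        assert(dest >= 0 && dest < kNumArchRegs);
        if (map_[dest] == rob_tag) map_[dest].reset();
    }

private:
    std::array<std::optional<uint32_t>, kNumArchRegs> map_{};
};
```

Reading the sources before updating the destination is the detail that matters here: it keeps an instruction from depending on itself when it reads and writes the same register.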
Out-of-order backend (ROB + reservation stations + LSU)¶
src/ooo/rob.cpp, src/ooo/schedq.cpp.
A Tomasulo-style scheduler with a unified scheduling queue (RS) and a separate Reorder Buffer for in-order retirement. Operands are tracked by ROB tag; when a producer broadcasts on the Common Data Bus, dependent entries in the RS capture the value and become ready to issue. Loads and stores flow through the LSU; stores wait until commit to write the cache.
Once an instruction retires from the ROB head, its architectural state is final. In-order retirement is what makes the simulator's IPC numbers precise and keeps its branch-mispredict and memory-stall behavior faithful.
Study material: Tomasulo's algorithm, Reorder Buffer, LSQ and store-to-load forwarding (written progressively — see the study index for current status).
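The tag-based wakeup described in this section can be sketched as below. It is an illustration only, assuming each RS entry has two source operands that either hold a value or wait on a producer's ROB tag. `Operand`, `RsEntry`, and `broadcast` are hypothetical names and do not come from src/ooo/schedq.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <optional>
#include <vector>

// Hypothetical sketch; the names do not come from src/ooo/schedq.cpp.
struct Operand {
    std::optional<uint32_t> waiting_on;  // producer's ROB tag; empty once the value is captured
    uint64_t value = 0;
    bool ready() const { return !waiting_on.has_value(); }
};

struct RsEntry {
    uint32_t rob_tag;  // tag this instruction broadcasts when it completes
    Operand src1, src2;
    bool ready() const { return src1.ready() && src2.ready(); }
};

// Broadcast one result on the CDB: every waiting operand whose tag matches
// captures the value. Returns how many entries became ready to issue.
inline int broadcast(std::vector<RsEntry>& rs, uint32_t tag, uint64_t value) {
    int woken = 0;
    for (RsEntry& e : rs) {
        const bool was_ready = e.ready();
        for (Operand* op : {&e.src1, &e.src2}) {
            if (op->waiting_on == tag) {
                op->value = value;
                op->waiting_on.reset();
            }
        }
        if (!was_ready && e.ready()) ++woken;
    }
    return woken;
}
```

An entry that waits on two different producers captures the first value on one broadcast and only becomes ready after the second.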
Branch predictor¶
src/predictor/ —
pluggable through factory.cpp.
Five predictors available, selected from JSON config:
| Predictor | File | Notes |
|---|---|---|
| always-taken | always_taken.cpp | Baseline / sanity check. |
| Yeh-Patt | yeh_patt.cpp | Two-level adaptive (BHR + PHT). |
| perceptron | perceptron.cpp | Long-history capture via linear classifier. |
| hybrid | hybrid.cpp | Combines two component predictors. |
| tournament | (via hybrid.cpp) | Per-branch chooser between components. |
Predictor-only mode (--mode predictor) was bit-for-bit validated against the
project2 reference output. See
Phase 3 — Branch predictor for the regression
methodology.
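To make the "BHR + PHT" entry in the table concrete, here is a minimal two-level predictor sketch. It assumes a single global history register indexing a table of 2-bit saturating counters (the GAg arrangement). The organization in yeh_patt.cpp may differ, and `TwoLevelPredictor` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a two-level predictor with one global history
// register. The layout in src/predictor/yeh_patt.cpp may differ.
class TwoLevelPredictor {
public:
    // history_bits must be in 1..31.
    explicit TwoLevelPredictor(unsigned history_bits)
        : mask_((1u << history_bits) - 1u),
          pht_(std::size_t{1} << history_bits, uint8_t{1}) {  // start weakly not-taken
        assert(history_bits > 0 && history_bits < 32);
    }

    // Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
    bool predict() const { return pht_[bhr_] >= 2; }

    // Train the counter selected by the current history, then shift the
    // actual outcome into the history register.
    void update(bool taken) {
        uint8_t& ctr = pht_[bhr_];
        if (taken && ctr < 3) ++ctr;
        if (!taken && ctr > 0) --ctr;
        bhr_ = ((bhr_ << 1) | (taken ? 1u : 0u)) & mask_;
    }

private:
    uint32_t mask_;
    uint32_t bhr_ = 0;
    std::vector<uint8_t> pht_;
};
```

With two history bits, a strictly alternating taken / not-taken branch is learned after a short warm-up, because each history pattern selects its own counter.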
L1 / L2 caches¶
Private L1 and L2 per core. Both are non-blocking, MSHR-backed
(mshr.cpp),
with pluggable replacement
(replacement.cpp:
LRU / LIP / MIP), write policy
(write_policy.cpp:
WBWA), and prefetcher
(prefetcher_plus_one.cpp / prefetcher_markov.cpp / prefetcher_hybrid.cpp).
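As a rough illustration of what "non-blocking, MSHR-backed" implies, the sketch below tracks outstanding misses per block and merges later misses to the same block into the existing entry. It is written under stated assumptions and is not the interface of mshr.cpp: `Mshr`, `MissKind`, and the integer request ids are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch; not the interface of mshr.cpp.
enum class MissKind { Primary, Secondary, Stall };

class Mshr {
public:
    explicit Mshr(std::size_t entries) : capacity_(entries) { assert(entries > 0); }

    // Record a miss on block_addr for request req_id.
    MissKind on_miss(uint64_t block_addr, int req_id) {
        auto it = pending_.find(block_addr);
        if (it != pending_.end()) {        // block already in flight: merge
            it->second.push_back(req_id);
            return MissKind::Secondary;
        }
        if (pending_.size() == capacity_)  // no free entry: the cache must stall
            return MissKind::Stall;
        pending_[block_addr] = {req_id};   // new entry: one request goes to the next level
        return MissKind::Primary;
    }

    // The fill arrived: release the entry and return every request waiting on it.
    std::vector<int> on_fill(uint64_t block_addr) {
        std::vector<int> waiters;
        auto it = pending_.find(block_addr);
        if (it != pending_.end()) {
            waiters = std::move(it->second);
            pending_.erase(it);
        }
        return waiters;
    }

private:
    std::size_t capacity_;
    std::unordered_map<uint64_t, std::vector<int>> pending_;
};
```

Only a primary miss sends a request to the next level; secondary misses wait on the same fill, and a full table is the source of MSHR-induced stalls.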
Cache geometry — sets, ways, block size, latency — is fully configurable per
level. The cache-only mode (--mode cache) is bit-for-bit validated against
project1's reference output. See
Phase 2 — Cache subsystem for AAT methodology
and reference numbers.
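For readers who have not met AAT (average access time) before, the textbook two-level form is AAT = HT_L1 + MR_L1 × (HT_L2 + MR_L2 × memory latency). The helper below is a sketch of that standard formula only; the Phase 2 report may define additional terms, and `LevelStats` and `average_access_time` are hypothetical names rather than config keys or project functions.

```cpp
#include <cassert>

// Hypothetical helper; field names are illustrative, not config keys.
struct LevelStats {
    double hit_time;   // cycles to service a hit at this level
    double miss_rate;  // local miss rate: misses / accesses at this level, in [0, 1]
};

// AAT = HT_L1 + MR_L1 * (HT_L2 + MR_L2 * memory_latency)
inline double average_access_time(LevelStats l1, LevelStats l2, double memory_latency) {
    assert(l1.miss_rate >= 0.0 && l1.miss_rate <= 1.0);
    assert(l2.miss_rate >= 0.0 && l2.miss_rate <= 1.0);
    const double l2_penalty = l2.hit_time + l2.miss_rate * memory_latency;
    return l1.hit_time + l1.miss_rate * l2_penalty;
}
```

For example, a 2-cycle L1 with a 25% miss rate, a 10-cycle L2 with a 50% local miss rate, and a 100-cycle memory give 2 + 0.25 × (10 + 0.5 × 100) = 17 cycles.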
Coherence agents¶
src/coherence/agent_*.cpp —
one agent per protocol, dispatched through
factory.cpp.
A coherence agent sits between each core's L1 and the ring. It tracks the per-line protocol state (M / E / S / I / O / F depending on protocol), issues requests, snoops responses, and updates state on transitions. Five protocols:
| Protocol | Agent file | States |
|---|---|---|
| MI | agent_mi.cpp | Modified, Invalid |
| MSI | agent_msi.cpp | M, S, I |
| MESI | agent_mesi.cpp | M, E, S, I |
| MOSI | agent_mosi.cpp | M, O, S, I |
| MOESIF | agent_moesif.cpp | M, O, E, S, I, F |
The coherence layer is bit-for-bit validated across all 16 protocol × topology combos against project3. See Phase 5A — Cache coherence.
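The per-line state tracking that an agent performs can be sketched for the simplest shared-state protocol, MSI. The code below covers stable states only and assumes silent eviction of Shared lines. A real agent such as agent_msi.cpp also needs transient states for in-flight requests, and every name here (`State`, `Event`, `Action`, `step`) is hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch of stable-state MSI transitions only.
enum class State { I, S, M };
enum class Event { Load, Store, RemoteRead, RemoteWrite, Evict };
enum class Action { None, GetS, GetM, SendData };

struct Transition {
    State next;
    Action action;  // message the agent would send as part of the transition
};

inline Transition step(State s, Event e) {
    switch (s) {
    case State::I:
        if (e == Event::Load)  return {State::S, Action::GetS};
        if (e == Event::Store) return {State::M, Action::GetM};
        return {State::I, Action::None};
    case State::S:
        if (e == Event::Store) return {State::M, Action::GetM};  // upgrade
        if (e == Event::RemoteWrite || e == Event::Evict)
            return {State::I, Action::None};  // assumed silent eviction
        return {State::S, Action::None};      // loads and remote reads need no message
    case State::M:
        if (e == Event::RemoteRead) return {State::S, Action::SendData};
        if (e == Event::RemoteWrite || e == Event::Evict)
            return {State::I, Action::SendData};  // dirty data must leave the cache
        return {State::M, Action::None};          // local loads and stores hit
    }
    assert(false && "unreachable");
    return {State::I, Action::None};
}
```

The richer protocols add states to this skeleton: E avoids the upgrade message on a private line, O lets a dirty line be shared without writing it back, and F names one sharer as the responder.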
Directory¶
src/coherence/directory.cpp
plus protocol-specific directory_*.cpp files.
Central directory with per-line sharer vectors. Serves requests from agents, tracks owner / sharers, generates invalidations and downgrades. Each protocol family has its own directory implementation because the state machine is protocol-specific (e.g. MOESIF tracks the Forward responder explicitly).
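A sharer vector and the invalidations it drives can be sketched as follows. This is an illustration under assumptions, not the layout used by directory.cpp: it models a single line with MSI-like behavior, and `DirEntry`, `handle_read`, `handle_write`, and `kMaxCores` are hypothetical names.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sketch; kMaxCores and all names are assumptions.
constexpr std::size_t kMaxCores = 64;

struct DirEntry {
    std::bitset<kMaxCores> sharers;    // bit i set: core i holds a shared copy
    std::optional<std::size_t> owner;  // core holding the line exclusively, if any
};

// Read request: add the requester as a sharer. Returns the previous owner,
// which must be downgraded first, if there was one.
inline std::optional<std::size_t> handle_read(DirEntry& e, std::size_t requester) {
    assert(requester < kMaxCores);
    if (e.owner == requester) return std::nullopt;  // owner already holds the line
    const std::optional<std::size_t> downgrade = e.owner;
    if (e.owner) {
        e.sharers.set(*e.owner);  // the old owner keeps a shared copy
        e.owner.reset();
    }
    e.sharers.set(requester);
    return downgrade;
}

// Write request: returns every other core that must receive an invalidation,
// then records the requester as the sole owner.
inline std::vector<std::size_t> handle_write(DirEntry& e, std::size_t requester) {
    assert(requester < kMaxCores);
    std::vector<std::size_t> invalidate;
    for (std::size_t core = 0; core < kMaxCores; ++core)
        if (core != requester && (e.sharers.test(core) || e.owner == core))
            invalidate.push_back(core);
    e.sharers.reset();
    e.owner = requester;
    return invalidate;
}
```

A protocol-specific directory would extend the entry, for instance with an explicit Forward responder for MOESIF as mentioned above.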
Ring interconnect¶
src/coherence/network.cpp, src/coherence/message.cpp, src/coherence/node.cpp.
Ring topology with separate message classes (request, response, snoop) to avoid protocol deadlock. Per-hop latency configurable; ring contention is modeled — agents stall when the ring slot is occupied.
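The effect of per-hop latency can be estimated with a small helper. The sketch assumes a unidirectional ring, which this page does not state, and it computes only the uncontended latency; `ring_hops` and `ring_latency` are hypothetical names and network.cpp may route differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch assuming a unidirectional ring.
inline uint32_t ring_hops(uint32_t src, uint32_t dst, uint32_t nodes) {
    assert(nodes > 0 && src < nodes && dst < nodes);
    return (dst + nodes - src) % nodes;  // wraps past the last node back to node 0
}

// Uncontended delivery latency in cycles. Stalls caused by an occupied ring
// slot would be added on top of this.
inline uint64_t ring_latency(uint32_t src, uint32_t dst, uint32_t nodes,
                             uint32_t per_hop_cycles) {
    return static_cast<uint64_t>(ring_hops(src, dst, nodes)) * per_hop_cycles;
}
```

On a 4-node ring, node 0 reaches node 3 in three hops under this assumption, while node 3 reaches node 0 in one.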
DRAM¶
A single shared channel with a fixed access latency. This is the largest deliberate simplification in the sim: there is no rank / bank queueing, no row-buffer modeling, and no controller scheduling. See "What's modeled vs. what's abstracted" below.
Operating modes¶
--mode selects which subsystem the binary exercises. Default is full.
| Mode | Exercises | Validated against |
|---|---|---|
| cache | L1 + L2 + MSHR + replacement + prefetcher | project1 reference output, bit-for-bit |
| predictor | Branch predictor only (one predictor per trace) | project2 reference output, bit-for-bit |
| ooo | Single-core OoO pipeline (fetch → retire) + private caches | Internal tests; see Phase 4 review |
| coherence | Multi-core caches + directory + ring (no OoO timing) | project3 reference output, bit-for-bit (16/16 combos) |
| full | Everything: N OoO cores + coherence + ring + DRAM | Internal regressions; SPEC2017 IPC matches published baselines |
Modes are gateways into the same code paths — they exist so each subsystem can be regressed independently against its course-project reference, and so a failing test can be localized.
Configuration surface¶
The JSON config in configs/baseline.json is the single source of truth. CLI flags override individual fields. Major knobs:
| Knob | Why you'd tune it |
|---|---|
| cores | Scale studies. IPC saturates at 4 cores in the baseline (see config sweep). |
| coherence.protocol | Compare MI / MSI / MESI / MOSI / MOESIF on the same workload. |
| cache.l1.size, assoc, block_size | AAT and miss-rate studies; capacity vs. associativity tradeoff. |
| cache.l2.size, etc. | Same for L2; inclusion behavior. |
| cache.replacement | LRU / LIP / MIP: replacement-policy ablations. |
| cache.prefetcher | none / plus_one / markov / hybrid: coverage vs. accuracy. |
| dram.latency | Sensitivity studies; how memory-bound is the workload? |
| ooo.rob_size, rs_size, width | OoO sizing tradeoffs; instruction-window vs. IPC. |
| predictor.type | Compare always-taken / Yeh-Patt / perceptron / hybrid / tournament. |
See Running for the full CLI and config schema.
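The "config file first, CLI overrides second" layering can be illustrated with a small sketch. It assumes the config has already been flattened into dotted keys like the knob names in the table, and that each override arrives as a `key=value` string. Both are assumptions made for illustration: the simulator's real flag syntax and schema are the ones documented in Running, and `Config` and `apply_overrides` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: a flattened config and "key=value" overrides. Neither
// is the simulator's actual schema or CLI syntax.
using Config = std::map<std::string, std::string>;

// Returns a copy of base with each override applied in order, so the caller's
// baseline stays untouched and later overrides win over earlier ones.
inline Config apply_overrides(Config base, const std::vector<std::string>& overrides) {
    for (const std::string& o : overrides) {
        const std::size_t eq = o.find('=');
        assert(eq != std::string::npos && eq > 0);  // malformed override
        base[o.substr(0, eq)] = o.substr(eq + 1);
    }
    return base;
}
```

Taking the baseline by value keeps a sweep simple: each run starts from the same baseline and changes only the fields under study.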
What's modeled vs. what's abstracted¶
Modeled:
- OoO pipeline timing (fetch / decode / rename / dispatch / issue / exec / writeback / retire).
- ROB-bounded instruction window and in-order retirement (so the IPC numbers are real).
- Non-blocking caches with MSHR-induced stalls and secondary-miss combining.
- Coherence state machines including races (e.g. concurrent upgrades, eviction during snoop).
- Directory serialization and the contention it implies.
- Ring contention and per-hop latency.
- Multi-thread traces (CasimV2): per-core trace streams with sync/lifecycle records.
Abstracted:
- DRAM as a fixed-latency channel — no rank/bank queueing, no row-buffer locality, no controller scheduling.
- No TLB, no page-walk modeling, no virtual memory translation cost.
- No OS effects (context switches, page faults, interrupts).
- No frequency scaling, voltage, or power modeling.
- Functional execution: instructions are timed but not actually computed — values come from the trace.
These limits are stated honestly in Status so the IPC numbers are read in context.
Code-to-concept map¶
| Concept | Source | Phase report |
|---|---|---|
| Trace reader + formats | tools/gen_trace/, src/coherence/fici_cpu.cpp | Phase 1 — Traces |
| L1/L2 cache + MSHR | src/cache/ | Phase 2 — Cache |
| Branch prediction | src/predictor/ | Phase 3 — Predictor |
| OoO pipeline (ROB / RS / LSU) | src/ooo/ | Phase 4 review |
| Coherence agents + directory | src/coherence/agent_*.cpp, directory_*.cpp | Phase 5A — Coherence |
| Ring + message classes | src/coherence/network.cpp, message.cpp, node.cpp | Phase 5B — Full integration |
| Multi-core integration | src/full/, src/main.cpp | Phase 5B, Config sweep |
| LLS shared cache + NINE | src/coherence/lls_cache.cpp | LLS + hybrid coherence, LLS study guide |
Where to go next¶
- How to run it — build, configs, CLI, trace formats.
- How it was built — phase-by-phase development journal.
- Concepts explained — concept-first study material on Tomasulo, ROB, coherence protocols, predictors, AAT, consistency, and the ring.