Architecture

Multicore-OoO-sim is a multi-core, out-of-order, cache-coherent chip-multiprocessor (CMP) simulator written in C++20. Each core models a Tomasulo-style out-of-order pipeline with a pluggable branch predictor and a private L1/L2 cache hierarchy. Cores share a directory over a ring interconnect, with five interchangeable coherence protocols (MI, MSI, MESI, MOSI, MOESIF). One binary, one JSON config, one trace format: the simulator runs ChampSim v1 traces and the project's CasimV2 multi-thread extension.

System block diagram

   ┌─ Core 0 ──┐  ┌─ Core 1 ──┐  ...  ┌─ Core N-1 ─┐
   │ Fetch     │  │           │       │            │
   │ Decode    │  │           │       │            │
   │ Rename    │  │           │       │            │
   │ ROB / RS  │  │           │       │            │
   │ Exec / LSU│  │           │       │            │
   │ L1 D + I  │  │           │       │            │
   │ L2 priv.  │  │           │       │            │
   └─────┬─────┘  └─────┬─────┘       └─────┬──────┘
         │              │                    │
         └──────── Ring interconnect ────────┘
                          │
                  ┌───── Directory ─────┐
                  │   (sharer vectors)  │
                  └──────────┬──────────┘
                             │
                           ┌─┴──┐
                           │DRAM│
                           └────┘

Each core is a full OoO pipeline (fetch → decode → rename → dispatch → schedule → execute → writeback → retire). Each core has private L1 (I + D) and L2 caches. The L1s on different cores stay coherent through a directory-based protocol; the directory and the ring live in src/coherence/.

Per-module breakdown

Fetch / decode / rename

src/ooo/core.cpp, src/ooo/inst.cpp, src/ooo/rat.cpp.

The front-end pulls instructions from the trace reader, decodes them into the internal inst representation, and renames their source/destination registers through the Register Alias Table (RAT). Fetch and decode widths, ROB size, RS size, and number of functional units are all config knobs.
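
The renaming step described above can be sketched as follows. This is a minimal illustration, not the code in src/ooo/rat.cpp: the names `RegisterAliasTable`, `RenamedInst`, and `kNumArchRegs`, the 32-register file, and the two-source instruction shape are all assumptions made for the example. It assumes sources are renamed to the ROB tag of their newest in-flight producer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical register count; the real value depends on the traced ISA.
constexpr std::size_t kNumArchRegs = 32;

struct RenamedInst {
    std::optional<std::uint32_t> src1_tag;  // producer's ROB tag, empty if the value is architectural
    std::optional<std::uint32_t> src2_tag;
    std::uint32_t dest_tag;                 // ROB tag allocated to this instruction
};

class RegisterAliasTable {
public:
    // Look up both sources BEFORE claiming the destination, so that
    // "r3 = r3 + r3" reads the previous producer of r3, not itself.
    RenamedInst rename(std::size_t src1, std::size_t src2, std::size_t dest,
                       std::uint32_t rob_tag) {
        assert(src1 < kNumArchRegs && src2 < kNumArchRegs && dest < kNumArchRegs);
        RenamedInst out{map_[src1], map_[src2], rob_tag};
        map_[dest] = rob_tag;
        return out;
    }

    // On retirement, clear the mapping only if this instruction is still the
    // newest producer; a younger writer must keep its mapping.
    void retire(std::size_t dest, std::uint32_t rob_tag) {
        assert(dest < kNumArchRegs);
        if (map_[dest] == rob_tag) map_[dest].reset();
    }

private:
    std::array<std::optional<std::uint32_t>, kNumArchRegs> map_{};
};
```

Reading the sources before writing the destination is the one ordering constraint that matters here; swapping the two steps would make an instruction depend on itself.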

Out-of-order backend (ROB + reservation stations + LSU)

src/ooo/rob.cpp, src/ooo/schedq.cpp.

A Tomasulo-style scheduler with a unified scheduling queue (RS) and a separate Reorder Buffer for in-order retirement. Operands are tracked by ROB tag; when a producer broadcasts on the Common Data Bus, dependent entries in the RS capture the value and become ready to issue. Loads and stores flow through the LSU; stores wait until commit to write the cache.
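
A rough sketch of the tag-broadcast wakeup described above. The types `Operand`, `RsEntry`, and `SchedQueueSketch` are hypothetical and do not mirror src/ooo/schedq.cpp. Only readiness is tracked, since the simulator takes values from the trace rather than computing them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Operand {
    bool ready = true;       // true once the producer has broadcast
    std::uint32_t tag = 0;   // producer's ROB tag, meaningful only while !ready
};

struct RsEntry {
    std::uint32_t dest_tag;
    Operand src1;
    Operand src2;
    bool ready() const { return src1.ready && src2.ready; }
};

class SchedQueueSketch {
public:
    void insert(const RsEntry& e) { entries_.push_back(e); }

    // Broadcast a completing producer's tag on the Common Data Bus.
    // Every waiting operand with a matching tag becomes ready.
    // Returns how many entries became issuable as a result.
    int broadcast(std::uint32_t tag) {
        int woken = 0;
        for (RsEntry& e : entries_) {
            const bool was_ready = e.ready();
            if (!e.src1.ready && e.src1.tag == tag) e.src1.ready = true;
            if (!e.src2.ready && e.src2.tag == tag) e.src2.ready = true;
            if (!was_ready && e.ready()) ++woken;
        }
        return woken;
    }

    int ready_count() const {
        int n = 0;
        for (const RsEntry& e : entries_) n += e.ready() ? 1 : 0;
        return n;
    }

private:
    std::vector<RsEntry> entries_;
};
```

An entry waiting on two different producers needs two broadcasts before it can issue, which is why the wakeup count is computed per entry rather than per operand.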

Once an instruction retires from the ROB head, its architectural state is final. Because retirement is strictly in order, a stalled head (for example a mispredicted branch or a long-latency load) holds back everything behind it, so the reported IPC directly reflects branch-mispredict and memory-stall behavior.
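
In-order retirement from the ROB head can be sketched like this. The class below is illustrative only: `ReorderBufferSketch`, its linear-scan `mark_done`, and the `width` parameter are assumptions, not the interface of src/ooo/rob.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

struct RobEntry {
    std::uint32_t tag;
    bool done = false;  // set when the instruction finishes execution
};

class ReorderBufferSketch {
public:
    explicit ReorderBufferSketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Allocate at the tail in program order. Returns false when the ROB is
    // full, which is the point where dispatch would stall.
    bool allocate(std::uint32_t tag) {
        if (entries_.size() >= capacity_) return false;
        entries_.push_back(RobEntry{tag});
        return true;
    }

    // Completion can arrive in any order.
    void mark_done(std::uint32_t tag) {
        for (RobEntry& e : entries_)
            if (e.tag == tag) e.done = true;
    }

    // Retire up to `width` instructions per cycle, stopping at the first
    // unfinished one: younger finished instructions must wait behind it.
    std::size_t retire(std::size_t width) {
        std::size_t retired = 0;
        while (retired < width && !entries_.empty() && entries_.front().done) {
            entries_.pop_front();
            ++retired;
        }
        return retired;
    }

private:
    std::size_t capacity_;
    std::deque<RobEntry> entries_;
};
```

The case worth noticing is a finished instruction sitting behind an unfinished head: it occupies a ROB slot but cannot retire, which is how a single long-latency miss can fill the window.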

Study material: Tomasulo's algorithm, Reorder Buffer, LSQ and store-to-load forwarding (written progressively — see the study index for current status).

Branch predictor

src/predictor/ — pluggable through factory.cpp.

Five predictors available, selected from JSON config:

Predictor	File	Notes
always-taken	`always_taken.cpp`	Baseline / sanity check.
Yeh-Patt	`yeh_patt.cpp`	Two-level adaptive (BHR + PHT).
perceptron	`perceptron.cpp`	Long-history capture via linear classifier.
hybrid	`hybrid.cpp`	Combines two component predictors.
tournament	(via `hybrid.cpp`)	Per-branch chooser between components.
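
To make the "two-level adaptive (BHR + PHT)" row concrete, here is a minimal sketch of one two-level scheme. It assumes a single global history register indexing a table of 2-bit saturating counters; the actual organization in yeh_patt.cpp (per-branch vs. global history, table sizes, counter initialization) may differ, and the class name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class TwoLevelPredictorSketch {
public:
    explicit TwoLevelPredictorSketch(unsigned history_bits) {
        assert(history_bits >= 1 && history_bits <= 16);
        mask_ = (1u << history_bits) - 1u;
        // One 2-bit counter per history pattern, initialized to weakly not-taken (1).
        pht_.assign(std::size_t{1} << history_bits, 1);
    }

    // Level 1: the branch history register (BHR) selects a pattern.
    // Level 2: that pattern's counter in the pattern history table (PHT) votes.
    bool predict() const { return pht_[bhr_] >= 2; }

    // Train the selected counter, then shift the outcome into the history.
    void update(bool taken) {
        std::uint8_t& ctr = pht_[bhr_];
        if (taken && ctr < 3) ++ctr;
        if (!taken && ctr > 0) --ctr;
        bhr_ = ((bhr_ << 1) | (taken ? 1u : 0u)) & mask_;
    }

private:
    std::uint32_t mask_ = 0;
    std::uint32_t bhr_ = 0;
    std::vector<std::uint8_t> pht_;
};
```

With two bits of history, a strictly alternating taken / not-taken branch becomes fully predictable after a short warm-up, whereas the always-taken baseline would stay at 50% on the same stream.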

Predictor-only mode (--mode predictor) was bit-for-bit validated against the project2 reference output. See Phase 3 — Branch predictor for the regression methodology.

L1 / L2 caches

src/cache/.

Private L1 and L2 per core. Both are non-blocking, MSHR-backed (mshr.cpp), with pluggable replacement (replacement.cpp: LRU / LIP / MIP), write policy (write_policy.cpp: WBWA, i.e. write-back with write-allocate), and prefetcher (prefetcher_plus_one.cpp / prefetcher_markov.cpp / prefetcher_hybrid.cpp).
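
As an illustration of why the replacement choice matters, the sketch below models one cache set with LRU eviction and two insertion policies. It assumes LIP and MIP carry their usual meaning from the insertion-policy literature (new lines inserted at the LRU position vs. the MRU position); whether replacement.cpp defines them exactly this way is an assumption, and `CacheSetSketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

enum class Insertion { MIP, LIP };  // insert at MRU position / at LRU position

class CacheSetSketch {
public:
    CacheSetSketch(std::size_t ways, Insertion policy) : ways_(ways), policy_(policy) {
        assert(ways > 0);
    }

    // Returns true on a hit. Hits always promote the line to MRU.
    bool access(std::uint64_t tag) {
        auto it = std::find(order_.begin(), order_.end(), tag);
        if (it != order_.end()) {
            order_.splice(order_.begin(), order_, it);  // move to front (MRU)
            return true;
        }
        if (order_.size() == ways_) order_.pop_back();   // evict the LRU line
        if (policy_ == Insertion::MIP) order_.push_front(tag);
        else order_.push_back(tag);
        return false;
    }

private:
    std::size_t ways_;
    Insertion policy_;
    std::list<std::uint64_t> order_;  // front = MRU, back = LRU
};
```

On a cyclic working set one line larger than the set (A, B, C over 2 ways), MRU insertion thrashes and never hits, while LRU-position insertion keeps A resident and hits on it every round after the first.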

Cache geometry (sets, ways, block size, latency) is fully configurable per level. The cache-only mode (--mode cache) is bit-for-bit validated against project1's reference output. See Phase 2 — Cache subsystem for the average access time (AAT) methodology and reference numbers.
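
For readers unfamiliar with AAT, the helper below shows the textbook two-level average access time formula. It is only a sketch: the Phase 2 report defines the methodology actually used, and the struct and function names here are hypothetical. The L2 miss rate is taken to be local, i.e. the fraction of L2 accesses that miss.

```cpp
#include <cassert>

struct LevelStats {
    double hit_time;   // cycles to service a hit at this level
    double miss_rate;  // local miss rate in [0, 1]
};

// AAT = HT_L1 + MR_L1 * (HT_L2 + MR_L2 * memory_latency)
double average_access_time(const LevelStats& l1, const LevelStats& l2,
                           double memory_latency) {
    assert(l1.miss_rate >= 0.0 && l1.miss_rate <= 1.0);
    assert(l2.miss_rate >= 0.0 && l2.miss_rate <= 1.0);
    const double l1_miss_penalty = l2.hit_time + l2.miss_rate * memory_latency;
    return l1.hit_time + l1.miss_rate * l1_miss_penalty;
}
```

For example, a 1-cycle L1 with a 10% miss rate, a 10-cycle L2 with a 50% local miss rate, and a 100-cycle memory give 1 + 0.1 × (10 + 0.5 × 100) = 7 cycles.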

Coherence agents

src/coherence/agent_*.cpp — one agent per protocol, dispatched through factory.cpp.

A coherence agent sits between each core's L1 and the ring. It tracks the per-line protocol state (M / E / S / I / O / F depending on protocol), issues requests, snoops responses, and updates state on transitions. Five protocols:

Protocol	Agent file	States
MI	`agent_mi.cpp`	Modified, Invalid
MSI	`agent_msi.cpp`	M, S, I
MESI	`agent_mesi.cpp`	M, E, S, I
MOSI	`agent_mosi.cpp`	M, O, S, I
MOESIF	`agent_moesif.cpp`	M, O, E, S, I, F
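
As a concrete instance of the per-line state tracking, here is a sketch of the stable-state transitions of textbook MESI as seen by one cache. It is not taken from agent_mesi.cpp: transient states, races, and message handling are omitted, and the `Event` names and the `other_sharers` flag are assumptions for illustration.

```cpp
#include <cassert>

enum class State { I, S, E, M };
enum class Event { Load, Store, RemoteRead, RemoteWrite };

// Next stable state for a line. `other_sharers` only matters for a load
// that misses: the line arrives Exclusive if nobody else holds it.
State next_state(State s, Event e, bool other_sharers) {
    switch (e) {
        case Event::Load:
            if (s == State::I) return other_sharers ? State::S : State::E;
            return s;                       // S, E, M all satisfy loads locally
        case Event::Store:
            return State::M;                // every store ends in Modified
        case Event::RemoteRead:
            return (s == State::I) ? State::I : State::S;  // M and E downgrade
        case Event::RemoteWrite:
            return State::I;                // another writer invalidates us
    }
    return s;  // not reached; keeps compilers quiet
}

// Does the local access need a message to the directory?
bool needs_request(State s, Event e) {
    if (e == Event::Load) return s == State::I;
    if (e == Event::Store) return s == State::I || s == State::S;
    return false;
}
```

The E state is what separates MESI from MSI here: a store to an Exclusive line upgrades to Modified without any request, whereas a store to a Shared line must first ask for ownership.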

The coherence layer is bit-for-bit validated across all 16 protocol × topology combos against project3. See Phase 5A — Cache coherence.

Directory

src/coherence/directory.cpp plus protocol-specific directory_*.cpp files.

Central directory with per-line sharer vectors. Serves requests from agents, tracks owner / sharers, generates invalidations and downgrades. Each protocol family has its own directory implementation because the state machine is protocol-specific (e.g. MOESIF tracks the Forward responder explicitly).
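
The sharer-vector bookkeeping can be sketched as below for an MSI-like protocol. Everything here is illustrative: `DirEntry`, the handler names, and the 64-core limit are assumptions, and the real directory_*.cpp files carry protocol-specific state that this omits.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

constexpr std::size_t kMaxCores = 64;  // hypothetical upper bound

struct DirEntry {
    std::bitset<kMaxCores> sharers;     // one bit per core holding the line
    std::optional<std::size_t> owner;   // core with write permission, if any
};

// Read request: if some other core owns the line, it must be downgraded.
// Returns that core, or empty if no downgrade is needed.
std::optional<std::size_t> handle_get_shared(DirEntry& e, std::size_t requester) {
    assert(requester < kMaxCores);
    std::optional<std::size_t> downgrade;
    if (e.owner && *e.owner != requester) downgrade = e.owner;
    e.owner.reset();            // the previous owner stays on as a sharer
    e.sharers.set(requester);
    return downgrade;
}

// Write request: every other sharer must be invalidated.
// Returns the list of cores to send invalidations to.
std::vector<std::size_t> handle_get_exclusive(DirEntry& e, std::size_t requester) {
    assert(requester < kMaxCores);
    std::vector<std::size_t> invalidate;
    for (std::size_t c = 0; c < kMaxCores; ++c)
        if (e.sharers.test(c) && c != requester) invalidate.push_back(c);
    e.sharers.reset();
    e.sharers.set(requester);
    e.owner = requester;
    return invalidate;
}
```

A write to a widely shared line produces one invalidation per sharer, which is one reason directory traffic grows with the sharer count.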

Ring interconnect

src/coherence/network.cpp, src/coherence/message.cpp, src/coherence/node.cpp.

Ring topology with separate message classes (request, response, snoop) to avoid protocol deadlock. Per-hop latency configurable; ring contention is modeled — agents stall when the ring slot is occupied.
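
The uncontended part of ring latency is just hop count times per-hop latency. The sketch below computes it for both a unidirectional and a bidirectional ring, since this page does not say which one network.cpp implements; the function names are hypothetical and slot contention is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of hops from node `src` to node `dst` on a ring of `nodes` stops.
std::size_t ring_hops(std::size_t src, std::size_t dst, std::size_t nodes,
                      bool bidirectional) {
    assert(nodes > 0 && src < nodes && dst < nodes);
    const std::size_t forward = (dst + nodes - src) % nodes;
    if (!bidirectional) return forward;
    // A bidirectional ring can take the shorter way round.
    return std::min(forward, nodes - forward);
}

// Uncontended traversal latency in cycles; contention would add stall
// cycles on top of this.
std::uint64_t ring_latency(std::size_t src, std::size_t dst, std::size_t nodes,
                           std::uint64_t per_hop_latency, bool bidirectional) {
    return static_cast<std::uint64_t>(ring_hops(src, dst, nodes, bidirectional)) *
           per_hop_latency;
}
```

On a 4-node unidirectional ring, node 0 reaches node 3 in three hops while node 3 reaches node 0 in one, so latency between a pair of nodes is asymmetric unless the ring is bidirectional.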

DRAM

src/cache/main_memory.cpp.

A single shared channel with a fixed access latency. This is the largest deliberate simplification in the sim: there is no rank / bank queueing, no row-buffer modeling, and no controller scheduling. See "What's modeled vs. what's abstracted" below.

Operating modes

--mode selects which subsystem the binary exercises. Default is full.

Mode	Exercises	Validated against
`cache`	L1 + L2 + MSHR + replacement + prefetcher	project1 reference output, bit-for-bit
`predictor`	Branch predictor only (one predictor per trace)	project2 reference output, bit-for-bit
`ooo`	Single-core OoO pipeline (fetch → retire) + private caches	Internal tests; see Phase 4 review
`coherence`	Multi-core caches + directory + ring (no OoO timing)	project3 reference output, bit-for-bit (16/16 combos)
`full`	Everything: N OoO cores + coherence + ring + DRAM	Internal regressions; SPEC2017 IPC matches published baselines

Modes are gateways into the same code paths — they exist so each subsystem can be regressed independently against its course-project reference, and so a failing test can be localized.

Configuration surface

The JSON config in configs/baseline.json is the single source of truth. CLI flags override individual fields. Major knobs:

Knob	Why you'd tune it
`cores`	Scale studies. IPC saturates at 4 cores in the baseline (see config sweep).
`coherence.protocol`	Compare MI / MSI / MESI / MOSI / MOESIF on the same workload.
`cache.l1.size`, `assoc`, `block_size`	AAT and miss-rate studies; capacity vs. associativity tradeoff.
`cache.l2.size`, etc.	Same for L2; inclusion behavior.
`cache.replacement`	`LRU` / `LIP` / `MIP` — replacement-policy ablations.
`cache.prefetcher`	`none` / `plus_one` / `markov` / `hybrid` — coverage vs. accuracy.
`dram.latency`	Sensitivity studies; how memory-bound is the workload?
`ooo.rob_size`, `rs_size`, `width`	OoO sizing tradeoffs; instruction-window vs. IPC.
`predictor.type`	Compare always-taken / Yeh-Patt / perceptron / hybrid / tournament.

See Running for the full CLI and config schema.

What's modeled vs. what's abstracted

Modeled:

OoO pipeline timing (fetch / decode / rename / dispatch / issue / exec / writeback / retire).
ROB-bounded instruction window and in-order retirement (so the IPC numbers reflect window-capacity and retirement stalls).
Non-blocking caches with MSHR-induced stalls and secondary-miss combining.
Coherence state machines including races (e.g. concurrent upgrades, eviction during snoop).
Directory serialization and the contention it implies.
Ring contention and per-hop latency.
Multi-thread traces (CasimV2): per-core trace streams with sync/lifecycle records.

Abstracted:

DRAM as a fixed-latency channel — no rank/bank queueing, no row-buffer locality, no controller scheduling.
No TLB, no page-walk modeling, no virtual memory translation cost.
No OS effects (context switches, page faults, interrupts).
No frequency scaling, voltage, or power modeling.
Functional execution: instructions are timed but not actually computed — values come from the trace.

These limits are also listed in Status so that the IPC numbers can be read in context.

Code-to-concept map

Concept	Source	Phase report
Trace reader + formats	`tools/gen_trace/`, `src/coherence/fici_cpu.cpp`	Phase 1 — Traces
L1/L2 cache + MSHR	`src/cache/`	Phase 2 — Cache
Branch prediction	`src/predictor/`	Phase 3 — Predictor
OoO pipeline (ROB / RS / LSU)	`src/ooo/`	Phase 4 review
Coherence agents + directory	`src/coherence/agent_*.cpp`, `directory_*.cpp`	Phase 5A — Coherence
Ring + message classes	`src/coherence/network.cpp`, `message.cpp`, `node.cpp`	Phase 5B — Full integration
Multi-core integration	`src/full/`, `src/main.cpp`	Phase 5B, Config sweep
LLS shared cache + NINE	`src/coherence/lls_cache.cpp`	LLS + hybrid coherence, LLS study guide

Where to go next

How to run it — build, configs, CLI, trace formats.
How it was built — phase-by-phase development journal.
Concepts explained — concept-first study material on Tomasulo, ROB, coherence protocols, predictors, AAT, consistency, and the ring.