Heterogeneous per-core traces¶
Goal: let each core read a different trace file, instead of the homogeneous lockstep regime imposed by --trace-dir DIR/p<i>.champsimtrace. The validation report (11-validation-bugs.md) called this out twice as the "real fix" for the tail-latency / cycle-cap pathologies it documented:
"The 'real' fix is either heterogeneous traces or a finer-grained no-progress watchdog" — full_mode.cpp:57
"...exactly the kind of pathological case that goes away with heterogeneous 4-core traces (the planned next step)." — 11-validation-bugs.md:353
This phase delivers the feature and runs an experiment that quantifies how much contention the lockstep regime was inflating.
What was built¶
| Piece | Where it lives | Notes |
|---|---|---|
--trace-list FILE CLI flag |
include/comparch/cli.hpp, src/common/cli.cpp | Manifest input. Mutually exclusive with --trace and --trace-dir. |
| Manifest format | (text file, no schema) | One trace path per line. Lines starting with # are comments. Blank lines skipped. Whitespace trimmed. Relative paths resolve against the manifest's parent directory (so manifests are portable). |
resolve_per_core_traces() helper |
src/full/full_mode.cpp (anonymous ns, near the top) | Single resolution path for both --trace-dir and --trace-list. Hard error if entry count != cfg.cores. |
trace_label() helper |
src/full/full_mode.cpp | Centralizes the "what to print in the report header / use as the run-dir stem" decision now that two CLI shapes both feed the per-core path vector. |
| Tests | tests/full/test_full_mode.cpp | Two new cases: (1) manifest mixing two trace flavors across 4 cores runs cleanly, (2) entry-count mismatch throws trace::TraceError. Existing 9 full-mode tests still pass — --trace-dir path is unchanged. |
| Example manifests | traces/synth/mix_loop_seq_tiny.txt, traces/synth/mix_stream_rand_tiny.txt, traces/synth/mix_full_tiny.txt, traces/synth/mix_full_small.txt | Used by the experiment below. Also serve as documentation-by-example. |
What was not built (deferred):
- Coherence-mode wiring.
--mode coherencestill only accepts--trace-dir. The Network constructor that loads project-3.tracefiles takes a directory path; passing it a list would touch src/coherence/network.cpp, which is more invasive than the full-mode change and the validation pathology lives in full mode anyway. - Sweep harness integration. scripts/run_sweep.py iterates over
--trace-dirpaths; teaching it to emit and consume manifests is its own piece of work. The follow-up section sketches what that would look like. - JSON config integration. Traces stay a CLI concept, separate from the machine-config JSON. Mixing workload-shape into the machine config seemed like the wrong coupling.
Why this matters: the lockstep contention regime¶
Phase 5B's full-mode driver originally constructed per-core trace paths by hardcoded naming convention:
// pre-change, full_mode.cpp:401-408
for (int i = 0; i < cfg.cores; ++i) {
const auto p = *cli.trace_dir /
("p" + std::to_string(i) + ".champsimtrace");
...
}
This works for single-workload sweeps but has a built-in trap: in traces/synth/, the per-core files are symlinks to the same raw.champsimtrace:
$ ls -la traces/synth/sequential_tiny/
p0.champsimtrace -> raw.champsimtrace
p1.champsimtrace -> raw.champsimtrace
p2.champsimtrace -> raw.champsimtrace
p3.champsimtrace -> raw.champsimtrace
Every "4-core run" of a synthetic trace was four cores reading the exact same byte sequence. The address streams gen_trace produces depend only on record index i (or seed), so all four cores walk identical address sequences in lockstep. That's the unifying root cause of the cap-hits and FSM crashes catalogued in 11-validation-bugs.md: a contention regime that doesn't occur in real multi-program workloads.
The runtime always supported per-core distinct content — every core constructs its own trace::Reader (full_mode.cpp:441-442) — but there was no configuration mechanism to point each core at a different file. --trace-list is that mechanism.
Experiment¶
Setup. configs/baseline.json, --cores 4, all 5 protocols (MI / MSI / MESI / MOSI / MOESIF). Two phases:
- Phase A — homogeneous baselines. Each of the four
synth/*_tinyflavors via--trace-dir, all 5 protocols. 20 runs. - Phase B — heterogeneous mixes via
--trace-list. Three manifests, all 5 protocols. 15 runs. mix_loop_seq_tiny— 2× loop + 2× sequential (private + shared)mix_stream_rand_tiny— 2× stream + 2× random (both shared, different access patterns)mix_full_tiny— 1× each of {loop, sequential, stream, random}
Every run retires exactly 400K instructions total (4 cores × 100K records each). The interesting metric is cycles: how long the simulator takes to drain 400K retirements.
Phase A — homogeneous baselines (cycles, M)¶
| Trace | MI | MSI | MESI | MOSI | MOESIF |
|---|---|---|---|---|---|
loop_tiny |
1.69 | 0.11 | 0.10 | 0.11 | 0.09 |
sequential_tiny |
6.90 | 13.87 | 12.91 | 13.87 | 12.43 |
stream_tiny |
6.89 | 13.87 | 12.90 | 13.87 | 12.43 |
random_tiny |
6.87 | 13.16 | 12.13 | 13.16 | 11.69 |
All 20 runs report Status: Simulation complete. No FSM crashes — the bucket 1A/1B/1C/1D fixes from 11-validation-bugs.md hold up on _tiny data.
The split is dramatic: loop_tiny runs ~100× faster than the other three on every directory protocol. Loop traces have private working sets per core (each core sweeps a different region), so coherence traffic is near-zero. The other three flavors all touch a shared address space, and lockstep × 4 cores produces the contention regime we're trying to characterize.
A few sub-points worth flagging:
- MI is faster than MSI / MOSI on shared-style traces (~6.9 M vs ~13.9 M). MI has no S state — every read is exclusive — so there's no SS→SM upgrade dance, just ping-pong invalidations. That ping-pong is cheap when there's only one shared line in flight, which is what stride-64 lockstep produces.
- MSI ≡ MOSI exactly in cycles. The O state never gets entered: the workload doesn't transition through clean-shared-then-modified, so the additional protocol state is unused.
- MESI ≈ MOESIF (MOESIF slightly cheaper) — Forward state shaves ~4% when there are many sharers.
- MI on
loop_tinyis anomalously slow (1.69 M vs ~0.10 M for everyone else). With private working sets there's no real sharing, but MI's read-acquires-exclusive policy still costs a directory round-trip on every cold miss; the others coalesce cold misses into a single shared/exclusive grant.
Phase B — heterogeneous mixes (cycles, M)¶
| Mix | MI | MSI | MESI | MOSI | MOESIF |
|---|---|---|---|---|---|
mix_loop_seq_tiny |
5.48 | 6.95 | 6.58 | 6.95 | 6.58 |
mix_stream_rand_tiny |
9.74 | 13.52 | 12.70 | 13.52 | 12.53 |
mix_full_tiny |
9.87 | 10.24 | 9.96 | 10.24 | 9.93 |
All 15 runs complete. The picture varies sharply across mixes:
-
mix_loop_seq_tinyis the headline. Compare againstsequential_tinyhomogeneous on the same protocol: MESI drops from 12.91 M to 6.58 M, a 2× reduction. The mechanism is straightforward — replacing 2 of the 4 contending cores with private-working-set loop cores cuts the directory's transient-state pressure roughly in half. Each shared line now has 2 sharers competing, not 4. This is the lockstep-contention pathology getting partially defused. -
mix_stream_rand_tinyshows almost no improvement. MESI: 12.70 M (mix) vs 12.13 M (random homogeneous). Both flavors are shared-style, so 4 cores still hammer shared addresses — just two streams of them instead of one. The directory's combinatorial transient-state load barely budges. -
mix_full_tinyis in between. MESI: 9.96 M, ~22% belowrandom_tinyand ~23% belowsequential_tinyhomogeneous. One private (loop) core lifts the others; three contending shared-style cores still anchor the cycle count.
A couple of secondary observations:
- MI flips sign on
mix_stream_rand. It runs worse than the homogeneous baseline (9.74 M vs 6.87–6.89 M for stream / random homogeneous). Heterogeneous shared-style streams hit different lines, so MI's invalidate-on-every-read strategy churns across more distinct cache lines instead of ping-ponging a small working set. Lockstep is better for MI on this workload — a counter-intuitive but consistent result. - MSI / MOSI continue to match cycle-for-cycle across all mixes. The O state is still unused because the synthetic traces don't produce read-after-write-after-read patterns.
Phase C — small-trace pathology check¶
The validation report's recommendation section flagged the homogeneous _small MI runs as the only remaining failure mode after the FSM fixes — pure tail-latency from 4 cores × 1 M records of lockstep contention. A targeted comparison (MESI only, the protocol with the cleanest baseline data):
| Run | Cycles (M) | Wallclock | Status |
|---|---|---|---|
[results filled in once the small comparison completes]
CPI summary (MESI, headline protocol)¶
Putting Phase A and Phase B side by side at MESI, the protocol whose behavior is closest to "what a real CMP would use":
| Workload | Cycles (M) | CPI (cycles / 400K retired) | vs sequential_tiny |
|---|---|---|---|
loop_tiny |
0.10 | 0.26 | -99.2% |
sequential_tiny |
12.91 | 32.3 | — (baseline) |
stream_tiny |
12.90 | 32.3 | -0.1% |
random_tiny |
12.13 | 30.3 | -6.0% |
mix_loop_seq_tiny |
6.58 | 16.5 | -49% |
mix_stream_rand_tiny |
12.70 | 31.8 | -1.6% |
mix_full_tiny |
9.96 | 24.9 | -22.8% |
The heterogeneity wins are real but workload-dependent. Mixing private with shared roughly halves contention. Mixing two shared streams doesn't help much. Mixing all four (one private, three shared) lands in between.
Conclusions¶
-
The feature works and the runtime was already heterogeneity-ready. The change touches one flag in the CLI, ~50 lines in the full-mode driver (helper extraction + report-formatting fallbacks), and adds two integration tests. Nothing in the OoO core, caches, adapter, network, or directory needed any change — every core has always run an independent
trace::Reader. Only the configuration plumbing was missing. -
Lockstep
_tinyruns are not a worst-case adversarial regime — they're an artifact of the symlinked trace layout. The cycle counts in Phase A's table are systematically inflated relative to a realistic 4-program workload. Future tier sweeps should run heterogeneous mixes by default; the existing_tinynumbers stay useful as a "lockstep stress" reference, not as a representative measurement. -
mix_loop_seqis the simplest contention-reducing mix. Halving directory pressure by spending 2 of 4 cores on private working sets is a clean way to characterize how much of the homogeneous cycle count is contention vs intrinsic memory latency. The 49% MESI cycle reduction is the empirical answer. -
MI is the protocol most sensitive to access-pattern uniformity. It's faster than MSI / MOSI / MESI / MOESIF when the working set is small and lockstep (a few lines, ping-pong); it's slower when the working set spreads out across many lines (heterogeneous shared streams). The validation report's MI tail-latency outliers are consistent with this: small, lockstep, but spread address range.
Follow-ups¶
-
Coherence mode (
--mode coherence) — same treatment.Network's 5A constructor takes atrace_dirpath and constructs its own per-core readers internally; switching it to take avector<fs::path>is mechanical but touches src/coherence/network.cpp and the project-3 regression fixtures, which expect the directory layout. -
Sweep harness — extend scripts/run_sweep.py with a
--mix-tiermode that generates manifests on the fly (e.g. allC(4, 4)combinations of the four synth flavors as 4-core mixes) and runs them as a third tier alongsidetinyandsmall. Should close out the 2 remaining MI tail-latency outliers from 11-validation-bugs.md:350-353 by giving them a non-lockstep regime to run in. -
Real workload mixes. Once the ChampSim-format converter (tools/proj2_to_champsim) is exercised against a real corpus (gem5 / Pin / SimPoint outputs of SPEC, server, etc.), heterogeneous mixing becomes the natural way to study CMP-style multiprogrammed workloads instead of staying inside the synthetic-pathology regime.