Apex Core Benchmark Report

This report measures the performance of the core components of the Apex Core framework and quantifies how architectural choices (SPSC vs MPSC, custom allocators vs malloc, zero-copy vs memcpy, and so on) differ in performance.

System Information

CPU	Intel Core i7-14700 (Raptor Lake Hybrid)
RAM	65253 MB
Cores	20C/28T
Cache	L1D 48 KB (per-core) / L2 2 MB (per-core) / L3 33 MB (shared)
Version	v0.6.5.0
Commit	
Date	2026-03-27
Compiler	MSVC 19.44, C++23, Release
Benchmarks	20 files
Baseline	v0.6.1.0

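This report covers three days of changes, from v0.6.1.0 (2026-03-24) to v0.6.5.0 (2026-03-27). The key architectural change in this period is NUMA binding plus core affinity (BACKLOG-40), which introduced 1:1 pinning to physical cores, P/E core classification, NUMA set_mempolicy, and topology discovery.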

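The microbenchmarks separate cleanly into areas where affinity makes a large difference (Cross-core, FramePipeline, Session Create, TimingWheel) and areas where it makes little difference (SPSC Queue, RingBuffer, FrameCodec). This indicates that operations confined to a single core were already near optimal, and that the bottleneck lay in cross-core and memory-hierarchy interactions.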

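The effect of P/E classification plus pinning is especially large on the i7-14700's Raptor Lake Hybrid architecture (8 P-cores + 12 E-cores). This suggests that the benefit of NUMA node separation could be even larger when the same framework is deployed on server-class CPUs such as Xeon or EPYC.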

Queue Performance — SPSC & MPSC

SPSC Queue (Wait-free)

Benchmark	CPU Time	Real Time	Iterations	Throughput
SpscQueue_Throughput/1024	3.4 ns	3.4 ns	203,636,364	296.2M items/s
SpscQueue_Throughput/4096	3.4 ns	3.4 ns	203,636,364	296.2M items/s
SpscQueue_Throughput/32768	3.3 ns	3.4 ns	203,636,364	303.1M items/s
SpscQueue_Throughput/65536	3.5 ns	3.5 ns	203,636,364	289.6M items/s
SpscQueue_Latency	1.9 ns	1.9 ns	344,615,385	537.9M items/s
SpscQueue_Backpressure	1.5 ns	1.5 ns	497,777,778	-
SpscQueue_ConcurrentThroughput/repeats:5_mean	13.2 ns	13.2 ns	5	75.6M items/s
SpscQueue_ConcurrentThroughput/repeats:5_median	13.2 ns	13.3 ns	5	75.9M items/s
SpscQueue_ConcurrentThroughput/repeats:5_stddev	0.3 ns	0.4 ns	5	1.8M items/s
SpscQueue_ConcurrentThroughput/repeats:5_cv	0.0 ns	0.0 ns	5	0 items/s

MPSC Queue (Lock-free)

Benchmark	CPU Time	Real Time	Iterations	Throughput
MpscQueue_1P1C/1024	6.6 ns	6.5 ns	112,000,000	152.5M items/s
MpscQueue_1P1C/4096	6.4 ns	6.6 ns	112,000,000	155.8M items/s
MpscQueue_1P1C/32768	6.8 ns	6.8 ns	112,000,000	146.3M items/s
MpscQueue_1P1C/65536	6.1 ns	6.6 ns	112,000,000	162.9M items/s
MpscQueue_2P1C/repeats:5_mean	63.1 ns	63.1 ns	5	15.9M items/s
MpscQueue_2P1C/repeats:5_median	62.8 ns	62.7 ns	5	15.9M items/s
MpscQueue_2P1C/repeats:5_stddev	0.6 ns	0.8 ns	5	154,861 items/s
MpscQueue_2P1C/repeats:5_cv	0.0 ns	0.0 ns	5	0 items/s
MpscQueue_Backpressure	1.5 ns	1.5 ns	448,000,000	-

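The SPSC Queue holds a stable ~3.4 ns per operation in both v0.6.1.0 and v0.6.5.0: 3.50 ns → 3.38 ns at 1K and 3.37 ns → 3.46 ns at 65K, equal within measurement error. Latency (1.89 ns → 1.90 ns) and Backpressure (1.52 ns in both) are also unchanged. ConcurrentThroughput drops from 55 ns to 13.2 ns (-76%), but v0.6.5.0 reports repeats:5 aggregates, which stabilizes the measurement, so most of this gap reflects the change in measurement methodology rather than a real performance difference.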
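The percentage deltas and items/s figures quoted throughout this report follow from simple arithmetic on the per-operation times. The helpers below are a minimal sketch of that arithmetic; the function names `delta_percent` and `items_per_second` are illustrative and are not part of Apex Core or the benchmark harness.

```cpp
#include <cassert>

// Relative change from a baseline time to a current time, in percent.
// A negative result means the current version takes less time per operation.
inline double delta_percent(double baseline_ns, double current_ns) {
    assert(baseline_ns > 0.0);
    return (current_ns - baseline_ns) / baseline_ns * 100.0;
}

// Items per second implied by a per-operation time given in nanoseconds.
inline double items_per_second(double ns_per_op) {
    assert(ns_per_op > 0.0);
    return 1e9 / ns_per_op;
}
```

For example, `delta_percent(55.0, 13.2)` evaluates to about -76, matching the ConcurrentThroughput delta above, and `items_per_second(13.2)` is about 75.8M, in line with the reported 75.6M items/s once rounding of the displayed time is taken into account.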

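MPSC Queue 1P1C is unchanged at about 6.5 ns across all sizes. 2P1C goes from 163 ns to 63 ns (-61%), but here too the switch to repeats:5 aggregates dominates. Backpressure is nearly identical (1.52 ns → 1.50 ns). These results confirm that the SPSC path, the core of the shared-nothing architecture, had already reached its optimum in v0.6.1.0.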
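To illustrate why a single-producer path can be this cheap, here is a minimal sketch of a bounded SPSC ring queue. It is not Apex Core's queue: the class name `SpscRing` and its interface are assumptions made for this example. Each side writes only its own index and needs one acquire load and one release store per operation, with no compare-and-swap loop. A multi-producer queue generally needs an atomic read-modify-write so that producers can claim slots, which is one plausible source of the extra cost seen in the MPSC numbers.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical bounded single-producer/single-consumer ring queue.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity_pow2)
        : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {
        // Capacity must be a power of two so index wrap-around is a bit mask.
        assert(capacity_pow2 >= 2 && (capacity_pow2 & mask_) == 0);
    }

    // Producer thread only. Returns false when the queue is full (backpressure).
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == buf_.size()) return false;
        buf_[head & mask_] = value;
        // Release publishes the slot write before the new head becomes visible.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns std::nullopt when the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = buf_[tail & mask_];
        // Release hands the slot back to the producer after the read is done.
        tail_.store(tail + 1, std::memory_order_release);
        return value;
    }

private:
    std::vector<T> buf_;
    std::size_t mask_;
    // Separate cache lines so the producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

The 64-byte alignment is an assumed cache-line size, not a value taken from the framework.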

SPSC vs MPSC Methodology

Version Comparison

Memory Allocators — Slab, Bump, Arena, malloc, make_shared

Benchmark	CPU Time	Real Time	Iterations	Throughput
SlabAllocator_AllocDealloc/64	3.4 ns	3.5 ns	186,666,667	291.4M items/s
SlabAllocator_AllocDealloc/256	3.5 ns	3.5 ns	194,782,609	283.3M items/s
SlabAllocator_AllocDealloc/1024	3.5 ns	3.5 ns	194,782,609	283.3M items/s
Malloc_AllocFree/64	22.9 ns	23.2 ns	32,000,000	43.6M items/s
Malloc_AllocFree/256	21.3 ns	21.2 ns	34,461,538	46.9M items/s
Malloc_AllocFree/1024	20.9 ns	20.9 ns	34,461,538	47.9M items/s
MakeShared_AllocDealloc	32.2 ns	32.1 ns	21,333,333	31.0M items/s
BumpAllocator_Alloc/64/16384	3.2 ns	3.3 ns	203,636,364	310.3M items/s
BumpAllocator_Alloc/64/65536	3.4 ns	3.4 ns	213,333,333	296.8M items/s
BumpAllocator_Alloc/64/262144	3.4 ns	3.4 ns	203,636,364	296.2M items/s
BumpAllocator_Alloc/256/16384	3.3 ns	3.3 ns	203,636,364	303.1M items/s
BumpAllocator_Alloc/256/65536	3.2 ns	3.4 ns	203,636,364	310.3M items/s
BumpAllocator_Alloc/256/262144	3.3 ns	3.3 ns	203,636,364	303.1M items/s
BumpAllocator_Alloc/1024/16384	3.3 ns	3.3 ns	213,333,333	303.4M items/s
BumpAllocator_Alloc/1024/65536	3.6 ns	3.6 ns	203,636,364	277.3M items/s
BumpAllocator_Alloc/1024/262144	3.3 ns	3.4 ns	194,782,609	304.1M items/s
ArenaAllocator_Alloc/64/1024	4.5 ns	4.9 ns	144,516,129	220.2M items/s
ArenaAllocator_Alloc/64/4096	4.6 ns	4.5 ns	154,482,759	219.7M items/s
ArenaAllocator_Alloc/64/16384	4.3 ns	4.3 ns	165,925,926	230.9M items/s
ArenaAllocator_Alloc/256/1024	5.2 ns	5.5 ns	112,000,000	193.7M items/s
ArenaAllocator_Alloc/256/4096	5.0 ns	4.9 ns	144,516,129	201.1M items/s
ArenaAllocator_Alloc/256/16384	4.6 ns	4.6 ns	154,482,759	219.7M items/s
ArenaAllocator_Alloc/1024/1024	7.5 ns	7.6 ns	74,666,667	132.7M items/s
ArenaAllocator_Alloc/1024/4096	5.2 ns	5.1 ns	100,000,000	193.9M items/s
ArenaAllocator_Alloc/1024/16384	4.9 ns	4.9 ns	149,333,333	203.3M items/s
BumpAllocator_RequestCycle/16384	39.0 ns	39.2 ns	17,230,769	25.6M items/s
BumpAllocator_RequestCycle/65536	37.5 ns	38.8 ns	17,920,000	26.7M items/s
BumpAllocator_RequestCycle/262144	40.5 ns	41.2 ns	16,592,593	24.7M items/s
ArenaAllocator_TransactionCycle/1024	306.9 ns	310.0 ns	2,240,000	3.3M items/s
ArenaAllocator_TransactionCycle/4096	139.5 ns	142.5 ns	5,600,000	7.2M items/s
ArenaAllocator_TransactionCycle/16384	138.1 ns	141.8 ns	4,977,778	7.2M items/s

The Slab Allocator improves slightly over v0.6.1.0: 3.79 ns → 3.48 ns at 64 B (-8%) and 3.77 ns → 3.50 ns at 1024 B (-7%), reaching a uniform ~3.5 ns across all block sizes. System malloc is unchanged (64 B 22.5 ns → 23.2 ns, 1024 B 20.7 ns → 20.9 ns), so the gap relative to Slab stays at about 6.6x.
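The size-independent ~3.5 ns is what one would expect from a fixed-size free-list design, where allocation and deallocation are each a single pointer swap. The sketch below shows that technique under stated assumptions: the class name `Slab` and its interface are hypothetical, it is single-threaded (as a per-core allocator would be), and it is not Apex Core's SlabAllocator.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size slab: every block is carved from one buffer up front
// and recycled through an intrusive free list stored inside the free blocks.
class Slab {
public:
    Slab(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        assert(block_size > 0 && block_count > 0);
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i) {
            deallocate(storage_.data() + i * block_size_);
        }
    }
    Slab(const Slab&) = delete;
    Slab& operator=(const Slab&) = delete;

    // Pop the head of the free list; nullptr when the slab is exhausted.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* node = free_;
        free_ = node->next;
        return node;
    }

    // Push the block back onto the free list (most recently freed is reused first).
    void deallocate(void* p) { free_ = ::new (p) Node{free_}; }

private:
    struct Node { Node* next; };

    // Round up so every block stays suitably aligned and can hold a Node.
    static std::size_t round_up(std::size_t n) {
        constexpr std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<std::byte> storage_;
    Node* free_ = nullptr;
};
```

Neither operation depends on the block size, which is consistent with the flat 64 B to 1024 B results, whereas a general-purpose malloc has to select a size class and handle thread safety.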

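BumpAllocator measures 3.3 to 3.6 ns across all configurations, about 5% better than v0.6.1.0 (3.5 to 3.7 ns). RequestCycle regressed, from 34.3 ns to 39.2 ns (+14%) with a 16K arena and from 34.3 ns to 41.2 ns (+20%) with 262K. The likely cause is NUMA-aware handling overhead in the memory allocation path after NUMA binding and core affinity were applied.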
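For reference, a bump allocator hands out memory by advancing an offset and releases everything at once with a reset, which is the pattern a request cycle exercises. The following is a minimal sketch under assumptions: `BumpArena` and its methods are illustrative names, not the framework's BumpAllocator API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump arena: allocation only moves an offset forward, and
// reset() releases every allocation at once (for example at the end of a request).
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : storage_(capacity) {}

    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        // Power-of-two alignment no stricter than the buffer's own alignment.
        assert(align != 0 && (align & (align - 1)) == 0);
        assert(align <= alignof(std::max_align_t));
        const std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned > storage_.size() || size > storage_.size() - aligned) {
            return nullptr;  // arena exhausted
        }
        offset_ = aligned + size;
        return storage_.data() + aligned;
    }

    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::vector<std::byte> storage_;
    std::size_t offset_ = 0;
};
```

In this sketch nothing on the allocation path touches the operating system, so any NUMA-related overhead of the kind suspected above would have to come from how and where the backing buffer is obtained, not from the bump step itself.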

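ArenaAllocator TransactionCycle is unchanged at 1024 B (309 ns → 310 ns) but more than doubles at 16384 B (60 ns → 142 ns, +137%), so the version-to-version change depends strongly on arena size. make_shared is nearly identical (31.5 ns → 32.1 ns).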

5 Allocators Comparison

Version Comparison

Frame Processing — FrameCodec

Benchmark	CPU Time	Real Time	Iterations	Throughput
FrameCodec_Encode/64	15.3 ns	15.6 ns	44,800,000	5.0 GB/s
FrameCodec_Encode/512	18.0 ns	18.5 ns	37,333,333	29.1 GB/s
FrameCodec_Encode/4096	36.8 ns	37.3 ns	20,363,636	111.5 GB/s
FrameCodec_Encode/16384	219.7 ns	220.8 ns	3,200,000	74.6 GB/s
FrameCodec_Decode/64	33.7 ns	34.0 ns	21,333,333	2.3 GB/s
FrameCodec_Decode/512	36.8 ns	36.4 ns	18,666,667	14.2 GB/s
FrameCodec_Decode/4096	60.0 ns	60.5 ns	11,200,000	68.5 GB/s
FrameCodec_Decode/16384	314.2 ns	311.6 ns	2,635,294	52.2 GB/s

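FrameCodec Encode is nearly equal in both versions, at 15 to 16 ns (64 B) and 210 to 220 ns (16 KB): 64 B 15.4 ns → 15.6 ns, 512 B 16.8 ns → 18.5 ns (+10%), 4 KB 32.2 ns → 37.3 ns (+16%), 16 KB 210 ns → 221 ns (+5%). Small payloads are equal. The slowdown at medium sizes in v0.6.5.0 may be caused by thermal throttling, since core affinity concentrates execution on a single core.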

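Decode is slightly slower across the board in v0.6.5.0: 64 B 32.6 ns → 34.0 ns (+4%), 512 B 35.2 ns → 36.4 ns (+3%), 4 KB 56.3 ns → 60.5 ns (+7%), 16 KB 260 ns → 312 ns (+20%). The larger gap at 16 KB appears to come from a change in L1d cache contention patterns. In absolute terms throughput at 16 KB remains high (52 GB/s for Decode, 74 GB/s for Encode).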

The frame codec algorithm itself is identical to v0.6.1.0, so these differences are an indirect effect of the system-level changes (NUMA, affinity).

Encode vs Decode Throughput Scaling

Version Comparison

Serialization — FlatBuffers vs Heap

Benchmark	CPU Time	Real Time	Iterations	Throughput
FlatBuffers_Build/64	40.8 ns	40.7 ns	17,230,769	1.6 GB/s
FlatBuffers_Build/512	53.1 ns	53.8 ns	10,000,000	9.6 GB/s
FlatBuffers_Build/4096	68.4 ns	67.5 ns	11,200,000	59.9 GB/s
HeapAlloc_Build/64	24.0 ns	24.4 ns	28,000,000	2.7 GB/s
HeapAlloc_Build/512	29.6 ns	29.7 ns	26,352,941	17.3 GB/s
HeapAlloc_Build/4096	50.0 ns	48.7 ns	10,000,000	81.9 GB/s
FlatBuffers_Read/64	3.4 ns	3.4 ns	203,636,364	19.0 GB/s
FlatBuffers_Read/512	3.4 ns	3.4 ns	213,333,333	148.7 GB/s
FlatBuffers_Read/4096	3.5 ns	3.5 ns	203,636,364	1186.3 GB/s
HeapAlloc_Read/64	22.0 ns	21.7 ns	32,000,000	2.9 GB/s
HeapAlloc_Read/512	25.1 ns	25.8 ns	28,000,000	20.4 GB/s
HeapAlloc_Read/4096	42.5 ns	43.7 ns	15,438,769	96.4 GB/s

FlatBuffers Build improves dramatically: 64 B 86.1 ns → 40.7 ns (-53%), 512 B 88.7 ns → 53.8 ns (-39%), 4 KB 126.3 ns → 67.5 ns (-47%). In bytes/sec this is 784 MB/s → 1.57 GB/s at 64 B (2x) and 32.6 GB/s → 59.9 GB/s at 4 KB (1.8x). FlatBuffers Read converges to ~3.4 ns at all sizes: 64 B 4.9 ns → 3.4 ns (-31%), 512 B 5.4 ns → 3.4 ns (-37%), 4 KB 4.7 ns → 3.5 ns (-26%).

HeapAlloc Build also roughly halves at every size: 64 B 49.6 ns → 24.4 ns (-51%), 512 B 56.7 ns → 29.7 ns (-48%), 4 KB 100.6 ns → 48.7 ns (-52%). HeapAlloc Read improves by a similarly large margin: 64 B 50.2 ns → 21.7 ns (-57%), 512 B 70.7 ns → 25.8 ns (-64%).

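Every serialization path improves, by 26% to 64% depending on operation and payload size. This is attributed to higher L1/L2 cache hit rates and better memory-access locality from NUMA binding plus core affinity.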

Build vs Read Comparison

Version Comparison

Hash Map — flat_map vs std::unordered_map Large-scale Iteration Comparison

Benchmark	CPU Time	Real Time	Iterations	Throughput
Dispatcher_Lookup/10	2.3 ns	2.5 ns	280,000,000	426.7M items/s
Dispatcher_Lookup/100	2.5 ns	2.5 ns	280,000,000	407.3M items/s
Dispatcher_Lookup/1000	2.5 ns	2.5 ns	280,000,000	407.3M items/s
FlatMap_SessionLookup/100	1.6 ns	1.7 ns	407,272,727	606.2M items/s
FlatMap_SessionLookup/1000	1.7 ns	1.8 ns	407,272,727	579.2M items/s
FlatMap_SessionLookup/10000	1.7 ns	1.7 ns	373,333,333	582.8M items/s
FlatMap_SessionLookup/100000	1.7 ns	1.8 ns	448,000,000	585.1M items/s
StdMap_SessionLookup/100	2.2 ns	2.2 ns	320,000,000	455.1M items/s
StdMap_SessionLookup/1000	2.2 ns	2.4 ns	298,666,667	444.5M items/s
StdMap_SessionLookup/10000	2.2 ns	2.3 ns	298,666,667	444.5M items/s
StdMap_SessionLookup/100000	2.3 ns	2.4 ns	320,000,000	435.7M items/s
FlatMap_SessionIterate/100	67.0 ns	70.5 ns	7,466,667	1493.3M items/s
FlatMap_SessionIterate/1000	1.1 us	1.1 us	640,000	930.9M items/s
FlatMap_SessionIterate/10000	9.8 us	10.3 us	64,000	1024.0M items/s
StdMap_SessionIterate/100	104.6 ns	103.2 ns	7,466,667	955.7M items/s
StdMap_SessionIterate/1000	1.3 us	1.3 us	640,000	758.5M items/s
StdMap_SessionIterate/10000	31.4 us	31.7 us	22,400	318.6M items/s

MessageDispatcher Lookup stays at a stable ~2.5 ns in both versions, consistent with O(1) behavior: 10 handlers 2.49 ns → 2.50 ns, 100 handlers 2.48 ns → 2.49 ns, 1000 handlers 2.45 ns → 2.49 ns.

FlatMap SessionLookup improves by 13% to 16% at every size: 100 sessions 2.06 ns → 1.74 ns (-15%), 1K 2.04 ns → 1.75 ns (-14%), 10K 2.05 ns → 1.73 ns (-16%), 100K 2.05 ns → 1.78 ns (-13%), reaching a uniform ~1.75 ns. FlatMap SessionIterate also improves slightly: 100 entries 74 ns → 71 ns (-5%), 10K 11.1 us → 10.3 us (-7%).

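StdMap (unordered_map) SessionLookup, by contrast, regressed in v0.6.5.0, from 1.73 ns to 2.21 ns at 100 sessions (+28%), although in absolute terms it is still in the 2 ns range. StdMap Iterate at 10K improves substantially, 53.2 us → 31.7 us (-40%), which is attributed to more stable cache warm-up once the thread is pinned by core affinity.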

flat_map vs std::unordered_map — Session Iteration

Version Comparison

Session & Timer

intrusive_ptr vs shared_ptr

TimingWheel — O(1) Timeout

Benchmark	CPU Time	Real Time	Iterations	Throughput
TimingWheel_ScheduleTick/1000	12.3 us	13.4 us	74,667	81.0M items/s
TimingWheel_ScheduleTick/10000	125.6 us	133.1 us	5,600	79.6M items/s
TimingWheel_ScheduleTick/50000	1.05 ms	1.07 ms	640	47.6M items/s
TimingWheel_ScheduleOnly	26.0 ns	25.9 ns	26,408,421	38.4M items/s

Session Lifecycle

Benchmark	CPU Time	Real Time	Iterations	Throughput
Session_Create	131.1 us	131.6 us	5,600	7,626 items/s
SessionPtr_Copy	10.5 ns	10.5 ns	74,666,667	95.6M items/s
SharedPtr_Copy	10.5 ns	10.5 ns	64,000,000	95.3M items/s

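Session Create drops from 298 us to 132 us (-56%), cutting the cost of creating a session to less than half. The number of sessions that can be created per second rises 2.25x, from 3,389 to 7,626. Spread over the 20 cores of the i7-14700 this corresponds to about 381 sessions/s per core (7,626 / 20), which directly improves server responsiveness in thundering-herd scenarios.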

SessionPtr Copy (intrusive_ptr) goes from 10.9 ns to 10.5 ns (-4%) and SharedPtr Copy stays at 10.5 ns, so both converge at ~10.5 ns. The performance difference between intrusive_ptr and shared_ptr was already marginal in v0.6.1.0 and has disappeared entirely in v0.6.5.0.
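As background for this comparison, an intrusive pointer keeps the reference count inside the object itself, so a copy is one atomic increment on memory the caller is already about to use. The sketch below illustrates the idea; the `Session` layout and the `SessionPtr` interface shown here are assumptions for the example and do not reproduce Apex Core's types.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical session type with an embedded (intrusive) reference count.
struct Session {
    std::atomic<std::uint32_t> refs{0};
    int id = 0;
};

// Minimal intrusive pointer: no separate control block is allocated or touched.
class SessionPtr {
public:
    SessionPtr() = default;
    explicit SessionPtr(Session* s) : p_(s) { retain(); }
    SessionPtr(const SessionPtr& other) : p_(other.p_) { retain(); }
    SessionPtr& operator=(SessionPtr other) {  // copy-and-swap
        Session* tmp = p_;
        p_ = other.p_;
        other.p_ = tmp;
        return *this;
    }
    ~SessionPtr() { release(); }

    Session* get() const { return p_; }
    Session* operator->() const {
        assert(p_ != nullptr);
        return p_;
    }
    std::uint32_t use_count() const {
        return p_ ? p_->refs.load(std::memory_order_relaxed) : 0;
    }

private:
    void retain() {
        if (p_) p_->refs.fetch_add(1, std::memory_order_relaxed);
    }
    void release() {
        // acq_rel so the thread that deletes sees writes made by other owners.
        if (p_ && p_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete p_;
        }
    }

    Session* p_ = nullptr;
};
```

A std::shared_ptr copy performs the same kind of atomic increment, only on a control block, which fits the observation that the two measure the same when the count is uncontended.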

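TimingWheel shows one of the largest improvements: ScheduleTick 1K 25.9 us → 13.4 us (-48%), 10K 275 us → 133 us (-52%), 50K 1.89 ms → 1.07 ms (-43%), and ScheduleOnly 57 ns → 26 ns (-55%). Better cache locality from core affinity speeds up the array traversal inside the timer wheel, and the cost of managing idle timeouts for tens of thousands of sessions is roughly halved.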
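The O(1) scheduling and array traversal mentioned above can be pictured with a single-level timing wheel. This is a simplified sketch, not the framework's implementation: the class below, its slot layout, and the restriction to timeouts shorter than one wheel revolution are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical single-level timing wheel keyed by an opaque timer id.
class TimingWheel {
public:
    explicit TimingWheel(std::size_t slots) : slots_(slots) {
        assert(slots > 1);
    }

    // O(1): append the id to the slot the cursor reaches after `ticks` ticks.
    // This sketch only supports 1 <= ticks < slot count (no extra rounds).
    void schedule(std::uint64_t id, std::size_t ticks) {
        assert(ticks >= 1 && ticks < slots_.size());
        slots_[(cursor_ + ticks) % slots_.size()].push_back(id);
    }

    // Advance one tick and return the ids that expire on this tick.
    std::vector<std::uint64_t> tick() {
        cursor_ = (cursor_ + 1) % slots_.size();
        return std::exchange(slots_[cursor_], {});
    }

private:
    std::vector<std::vector<std::uint64_t>> slots_;
    std::size_t cursor_ = 0;
};
```

Each tick walks one contiguous slot, so the work is dominated by sequential memory access. That is the kind of access pattern that benefits when the thread stays on one core and keeps its caches warm.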

Version Comparison

Buffer — RingBuffer

Benchmark	CPU Time	Real Time	Iterations	Throughput
RingBuffer_WriteRead/64	9.8 ns	9.8 ns	74,666,667	6.5 GB/s
RingBuffer_WriteRead/512	10.7 ns	10.9 ns	64,000,000	47.7 GB/s
RingBuffer_WriteRead/4096	22.0 ns	22.6 ns	32,000,000	186.4 GB/s
RingBuffer_Linearize/64	10.7 ns	11.1 ns	64,000,000	6.0 GB/s
RingBuffer_Linearize/512	17.4 ns	17.1 ns	44,800,000	29.4 GB/s
RingBuffer_Linearize/4096	30.5 ns	30.5 ns	23,578,947	134.4 GB/s
NaiveBuffer_CopyWrite/64	22.5 ns	22.4 ns	32,000,000	2.8 GB/s
NaiveBuffer_CopyWrite/512	31.4 ns	31.3 ns	24,888,889	16.3 GB/s
NaiveBuffer_CopyWrite/4096	44.9 ns	45.0 ns	16,000,000	91.2 GB/s

RingBuffer WriteRead is nearly equal to v0.6.1.0: 64 B 9.74 ns → 9.78 ns, 512 B 10.9 ns (unchanged), 4 KB 21.3 ns → 22.6 ns (+6%). In bytes/sec, both versions sustain high throughput, 6.5 GB/s at 64 B and 186 to 195 GB/s at 4 KB. The slightly slower 4 KB result in v0.6.5.0 is below the level of statistical significance.

Linearize varies by payload size but is equal overall: 64 B 11.5 ns → 11.1 ns (-4%), 512 B 14.4 ns → 17.1 ns (+19%), 4 KB 34.0 ns → 30.5 ns (-10%). NaiveBuffer CopyWrite is also almost unchanged: 64 B 21.5 ns → 22.4 ns (+4%), 4 KB 42.7 ns → 45.0 ns (+5%).

The gap between RingBuffer and NaiveBuffer holds at 2.3x for 64 B (9.8 ns vs 22.4 ns) and 2.0x for 4 KB (22.6 ns vs 45.0 ns). The structural advantage of the zero-copy read path is firmly established.
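A zero-copy read path in a ring buffer typically means the reader receives a view into the ring's own storage instead of a copied block. The sketch below shows that idea under assumptions: `ByteRing`, `View`, and the method names are hypothetical and may differ from the framework's RingBuffer API. When the readable bytes wrap around the end of the storage, only the first contiguous run is exposed; a linearize step, not shown here, would be needed to see all of them as a single range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical byte ring that exposes readable data as a view, without copying.
class ByteRing {
public:
    struct View {
        const std::byte* data;
        std::size_t size;
    };

    explicit ByteRing(std::size_t capacity) : buf_(capacity) {
        assert(capacity > 0);
    }

    // Copy n bytes into the ring; returns false if there is not enough free space.
    bool write(const void* data, std::size_t n) {
        if (n > buf_.size() - size_) return false;
        if (n == 0) return true;
        const auto* src = static_cast<const std::byte*>(data);
        const std::size_t tail = (head_ + size_) % buf_.size();
        const std::size_t first = std::min(n, buf_.size() - tail);
        std::memcpy(buf_.data() + tail, src, first);
        if (n > first) std::memcpy(buf_.data(), src + first, n - first);
        size_ += n;
        return true;
    }

    // Zero-copy view of the readable bytes that are contiguous in memory.
    View contiguous_read() const {
        return View{buf_.data() + head_, std::min(size_, buf_.size() - head_)};
    }

    // Mark n bytes as consumed after the caller has processed the view.
    void consume(std::size_t n) {
        assert(n <= size_);
        head_ = (head_ + n) % buf_.size();
        size_ -= n;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::byte> buf_;
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};
```

A naive buffer would copy the payload out on every read, which matches the direction of the roughly 2x gap reported for NaiveBuffer_CopyWrite.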

Zero-copy vs Naive memcpy (Throughput GB/s)

Version Comparison

Summary

All Components Delta %

Methodology Comparison Highlights

Comparison	Approach A	Approach B	Ratio
SPSC vs MPSC (1P1C)	SPSC: 3.4 ns	MPSC: 6.6 ns	1.9x
Slab vs malloc (64B)	Slab: 3.4 ns	malloc: 22.9 ns	6.7x
intrusive_ptr vs shared_ptr	intrusive: 10.5 ns	shared: 10.5 ns	1.0x
Zero-copy vs Naive (512B)	ZeroCopy: 47.7 GB/s	Naive: 16.3 GB/s	2.9x
FlatBuffers vs Heap Build (512B)	FB: 53.1 ns	Heap: 29.6 ns	1.8x
FlatBuffers vs Heap Read (512B)	FB: 3.4 ns	Heap: 25.1 ns	7.3x

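The v0.6.1.0 → v0.6.5.0 comparison (i7-14700, same hardware) has NUMA binding plus core affinity as its key variable. At the microbenchmark level the results differ by component, but the integration benchmarks and system-level metrics show consistent, large improvements.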

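Areas that stand out:
• FlatBuffers/HeapAlloc Build+Read: 26-64% improvement across all sizes, attributed to better cache locality
• TimingWheel: 43-55% improvement, from more cache-efficient array traversal
• Session Create: 56% improvement, a 2.25x gain in capacity to absorb connection bursts
• Cross-core RTT: 42% improvement, as core pinning removes scheduler noise
• Cross-core PostThroughput: 67% improvement, reaching 2.9M msg/sec
• FramePipeline: 49-50% improvement across all sizes, halving the end-to-end pipeline time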

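By contrast, single-core paths that already fit entirely in the L1d cache, such as SPSC Queue, RingBuffer, and FrameCodec, are essentially unchanged regardless of affinity (FrameCodec is the partial exception, running 1-20% slower depending on payload size). This shows that the NUMA/affinity optimization acts selectively on cross-core communication and memory allocation paths.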

Key takeaways from the methodology comparison (v0.6.5.0, i7-14700):
• SPSC vs MPSC: SPSC 3.4 ns vs MPSC 6.5 ns. Single-producer is about 1.9x faster, which justifies preferring SPSC in a shared-nothing architecture
• Slab vs Malloc vs make_shared: Slab 3.5 ns vs Malloc 22 ns vs make_shared 32 ns. On these rounded averages Slab is about 6.3x faster than malloc (6.7x at 64 B) and 9.1x faster than make_shared, a decisive advantage for per-core dedicated allocators in a 20-core environment
• FlatBuffers vs HeapAlloc: FlatBuffers is 1.4-1.8x slower on Build but 6-12x faster on Read (3.4 ns vs 22-44 ns), so it is clearly preferable for read-heavy message processing
• flat_map vs unordered_map: on Lookup, FlatMap 1.75 ns vs StdMap 2.3 ns (24% faster); on Iterate at 10K, FlatMap 10.3 us vs StdMap 31.7 us (3.1x faster). The cache-locality advantage is clear
• intrusive_ptr vs shared_ptr: both are equal at ~10.5 ns. A single-core atomic ref-count sees no contention
• RingBuffer (zero-copy) vs NaiveBuffer (memcpy): on WriteRead, RingBuffer is 2.3x faster at 64 B and 2.0x faster at 4 KB, a structural advantage of the zero-copy design
• Affinity ON vs OFF: cross-core paths improve by 42-67% while single-core paths are unchanged, demonstrating the selective effect of P/E core classification plus pinning on Raptor Lake Hybrid

Integration — End-to-end Pipeline

Throughput Scaling — Per-core vs Shared io_context

Cross-core Latency (RTT)

Cross-core Message Throughput

Frame Pipeline & Session Echo Throughput

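Cross-core RTT (round-trip time) improves dramatically, 14.8 us → 8.6 us (-42%), with one-way latency going from 6.5 us to 4.1 us (-37%). The likely explanation is that NUMA binding plus core affinity suppress the OS scheduler's core migrations during cross-core communication, which stabilizes the L3 cache-coherence path.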

Cross-core PostThroughput rises from 1.72M ops/s to 2.88M ops/s (+67%), about 2.9M msg/sec. On the 20-core i7-14700 this figure is close to the practical upper bound for message passing between a single pair of cores.

FramePipeline (the full Encode→Queue→Decode path) is halved at every size: 64 B 2.83 us → 1.44 us (-49%), 512 B 2.99 us → 1.52 us (-49%), 4 KB 3.02 us → 1.52 us (-50%). Session EchoRoundTrip improves dramatically for small messages, 64 B 13.0 us → 6.4 us (-51%), while 512 B (13.2 us → 13.3 us) and 4 KB (14.8 us → 14.9 us) are unchanged. Overall, the end-to-end effect of system-level architectural changes (NUMA, affinity) outweighs that of micro-optimizations.

Cross-core Latency

Benchmark	CPU Time	Real Time	Iterations	Throughput
CrossCore_Latency/iterations:10000/real_time	6.2 us	8.6 us	10,000	-

Cross-core Message Passing

Benchmark	CPU Time	Real Time	Iterations	Throughput
CrossCore_PostThroughput/real_time	354.2 ns	347.3 ns	2,073,212	2.9M items/s

Frame Pipeline

Benchmark	CPU Time	Real Time	Iterations	Throughput
FramePipeline/64	1.4 us	1.4 us	448,000	53.1 MB/s
FramePipeline/512	1.5 us	1.5 us	497,778	347.8 MB/s
FramePipeline/4096	1.5 us	1.5 us	497,778	2.7 GB/s

Session Throughput

Benchmark	CPU Time	Real Time	Iterations	Throughput
Session_EchoRoundTrip/64	6.4 us	6.4 us	112,000	11.8 MB/s
Session_EchoRoundTrip/512	12.9 us	13.3 us	49,778	40.7 MB/s
Session_EchoRoundTrip/4096	14.8 us	14.9 us	56,000	277.8 MB/s

Architecture Comparison

Benchmark	CPU Time	Real Time	Iterations	Throughput
PerCore_Stateful/1/real_time	24.88 ms	25.94 ms	27	1.9M items/s
PerCore_Stateful/2/real_time	27.85 ms	29.93 ms	23	3.3M items/s
PerCore_Stateful/3/real_time	33.48 ms	33.91 ms	21	4.4M items/s
PerCore_Stateful/4/real_time	35.36 ms	35.94 ms	19	5.6M items/s
PerCore_Stateful/8/real_time	52.88 ms	57.07 ms	13	7.0M items/s
PerCore_Stateful/16/real_time	115.62 ms	136.24 ms	5	5.9M items/s
Shared_Stateful/1/real_time	25.11 ms	24.96 ms	28	2.0M items/s
Shared_Stateful/2/real_time	52.46 ms	52.14 ms	14	1.9M items/s
Shared_Stateful/3/real_time	67.19 ms	73.54 ms	10	2.0M items/s
Shared_Stateful/4/real_time	83.98 ms	87.55 ms	8	2.3M items/s
Shared_Stateful/8/real_time	208.33 ms	205.17 ms	3	1.9M items/s
Shared_Stateful/16/real_time	531.25 ms	569.55 ms	1	1.4M items/s
Shared_LockFree_Stateful/1/real_time	25.11 ms	25.18 ms	28	2.0M items/s
Shared_LockFree_Stateful/2/real_time	51.68 ms	52.01 ms	13	1.9M items/s
Shared_LockFree_Stateful/3/real_time	68.75 ms	68.33 ms	10	2.2M items/s
Shared_LockFree_Stateful/4/real_time	83.98 ms	83.87 ms	8	2.4M items/s
Shared_LockFree_Stateful/8/real_time	214.84 ms	215.16 ms	4	1.9M items/s
Shared_LockFree_Stateful/16/real_time	593.75 ms	609.79 ms	1	1.3M items/s