Apex Core Benchmark Report

This report measures the performance of the core components of the Apex Core framework and quantifies the performance differences between architectural choices (SPSC vs MPSC, custom allocators vs malloc, zero-copy vs memcpy, and so on).

System Information

CPU

Intel Core i5-9300H (Coffee Lake)

RAM

12168 MB

Cores

4C/8T

Cache

L1D 32 KB (per-core) / L2 256 KB (per-core) / L3 8 MB (shared)

Version

v0.6.5.0

Commit

Date

2026-03-27

Compiler

MSVC 19.44, C++23, Release

Benchmarks

17 files

Baseline

v0.5.10.0

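The interval from v0.5.10.0 (2026-03-21) to v0.6.5.0 (2026-03-27) was an intensive optimization cycle of about six days. It included several system-level architectural changes, among them NUMA binding + Core Affinity (BACKLOG-40), Whisper O(1) core routing (BACKLOG-149), and per-IP connection limits in the Acceptor (BACKLOG-256).

Beyond algorithmic improvements to individual components, build optimizations (refined Release settings, LTO, etc.) and benchmark harness improvements (repeats:5 aggregates, stabilized iteration counts) raised measurement accuracy and reproducibility. The v0.6.5.0 benchmarks were also measured with core-affinity pinning applied, so OS scheduler noise is suppressed in these results.

Overall, micro components improved by 40~65% and the integrated pipeline by 8~14%, a noticeable gain that establishes the performance foundation for the v1.0.0.0 framework-completion milestone.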

Queue Performance — SPSC & MPSC

SPSC Queue (Wait-free)

Benchmark	CPU Time	Real Time	Iterations	Throughput
SpscQueue_Throughput/1024	3.6 ns	3.6 ns	194,782,609	277.0M items/s
SpscQueue_Throughput/4096	4.6 ns	4.6 ns	194,782,609	218.7M items/s
SpscQueue_Throughput/32768	4.2 ns	4.1 ns	186,666,667	238.9M items/s
SpscQueue_Throughput/65536	3.6 ns	3.6 ns	186,666,667	277.8M items/s
SpscQueue_Latency	3.6 ns	3.6 ns	248,888,889	279.5M items/s
SpscQueue_Backpressure	1.9 ns	1.9 ns	497,777,778	-
SpscQueue_ConcurrentThroughput/repeats:5_mean	24.9 ns	24.9 ns	5	40.6M items/s
SpscQueue_ConcurrentThroughput/repeats:5_median	25.1 ns	25.2 ns	5	39.8M items/s
SpscQueue_ConcurrentThroughput/repeats:5_stddev	2.6 ns	2.6 ns	5	4.5M items/s
SpscQueue_ConcurrentThroughput/repeats:5_cv	0.1	0.1	5	0.1

MPSC Queue (Lock-free)

Benchmark	CPU Time	Real Time	Iterations	Throughput
MpscQueue_1P1C/1024	9.6 ns	9.5 ns	74,666,667	103.9M items/s
MpscQueue_1P1C/4096	9.2 ns	9.3 ns	74,666,667	108.6M items/s
MpscQueue_1P1C/32768	9.2 ns	9.2 ns	74,666,667	108.6M items/s
MpscQueue_1P1C/65536	9.2 ns	9.3 ns	74,666,667	108.6M items/s
MpscQueue_2P1C/repeats:5_mean	69.8 ns	69.8 ns	5	14.4M items/s
MpscQueue_2P1C/repeats:5_median	71.5 ns	72.7 ns	5	14.0M items/s
MpscQueue_2P1C/repeats:5_stddev	5.9 ns	5.9 ns	5	1.4M items/s
MpscQueue_2P1C/repeats:5_cv	0.1	0.1	5	0.1
MpscQueue_Backpressure	1.8 ns	1.8 ns	407,272,727	-

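In v0.6.5.0 the SPSC Queue achieved roughly 2.5~3.3x higher throughput across all sizes. Throughput improved from 10.1 ns to 3.6 ns (-64%) at 1K and from 11.9 ns to 3.6 ns (-70%) at 65K, and stays within 3.6~4.6 ns across queue sizes, so performance is largely independent of capacity. Latency also improved substantially, from 7.8 ns to 3.6 ns (-54%), and the Backpressure path went from 3.2 ns to 1.9 ns (-42%), so a full queue is detected quickly.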

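ConcurrentThroughput (producer and consumer accessing the queue concurrently) went from 46 ns to 25 ns (-46%), which suggests reduced cache-line bouncing. The MPSC Queue also improved across all 1P1C sizes, from 14 ns to 9.2~9.6 ns (about -32%), and its Backpressure path improved dramatically, from 5.9 ns to 1.8 ns (-69%). The 2P1C contention scenario, however, went from 31 ns to 70 ns; this increase appears to be driven largely by the switch to repeats:5 aggregate reporting.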
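The repeats:5 rows in the tables above are aggregates over five runs. A minimal sketch of how such aggregates can be derived is shown below; `RepeatStats` and `summarize` are hypothetical names, and the use of the sample (n-1) standard deviation is an assumption rather than a statement about the harness.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct RepeatStats {
    double mean;
    double median;
    double stddev;
    double cv;  // coefficient of variation: stddev / mean, a unitless ratio
};

// Summarizes per-repeat measurements (e.g. the five values behind a repeats:5 row).
// Assumes at least two samples; uses the sample (n-1) standard deviation.
inline RepeatStats summarize(std::vector<double> xs) {
    const double n = static_cast<double>(xs.size());
    double sum = 0.0;
    for (double x : xs) sum += x;
    const double mean = sum / n;

    std::sort(xs.begin(), xs.end());
    const std::size_t mid = xs.size() / 2;
    const double median =
        (xs.size() % 2 == 1) ? xs[mid] : 0.5 * (xs[mid - 1] + xs[mid]);

    double sq = 0.0;
    for (double x : xs) sq += (x - mean) * (x - mean);
    const double stddev = std::sqrt(sq / (n - 1.0));

    return {mean, median, stddev, stddev / mean};
}
```

With the reported SPSC figures (mean 24.9 ns, stddev 2.6 ns) the ratio stddev / mean is about 0.10, which is the value the cv row expresses.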

In a shared-nothing architecture the SPSC Queue is the critical path for inter-core message passing, so improvements here have a direct multiplier effect on overall pipeline throughput.
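For readers unfamiliar with the structure being measured, the following is a minimal sketch of a bounded single-producer/single-consumer ring. It is not the Apex Core implementation: the class name `SpscRingSketch`, the power-of-two capacity requirement, and the 64-byte cache-line size are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical bounded SPSC ring; NOT the Apex Core SpscQueue.
// Capacity must be a power of two so that index wrapping is a bit mask.
template <typename T>
class SpscRingSketch {
public:
    explicit SpscRingSketch(std::size_t capacity_pow2)
        : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {}

    // Producer thread only. Returns false when the ring is full (backpressure).
    bool try_push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == buf_.size()) return false;
        buf_[head & mask_] = v;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only. Returns nullopt when the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T v = buf_[tail & mask_];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return v;
    }

private:
    std::vector<T> buf_;
    std::size_t mask_;
    // Separate cache lines (64 bytes assumed) so the two counters do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the consumer
};
```

Each side writes only its own counter, so neither operation needs a compare-and-swap loop; the full-queue check in `try_push` corresponds to the kind of path a Backpressure benchmark exercises.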

SPSC vs MPSC Methodology

Version Comparison

Memory Allocators — Slab, Bump, Arena, malloc, make_shared

Benchmark	CPU Time	Real Time	Iterations	Throughput
SlabAllocator_AllocDealloc/64	6.4 ns	6.4 ns	112,000,000	155.8M items/s
SlabAllocator_AllocDealloc/256	6.1 ns	6.1 ns	100,000,000	164.1M items/s
SlabAllocator_AllocDealloc/1024	6.1 ns	6.2 ns	112,000,000	162.9M items/s
Malloc_AllocFree/64	51.6 ns	51.7 ns	10,000,000	19.4M items/s
Malloc_AllocFree/256	53.1 ns	53.8 ns	10,000,000	18.8M items/s
Malloc_AllocFree/1024	51.6 ns	52.1 ns	10,000,000	19.4M items/s
MakeShared_AllocDealloc	69.8 ns	70.1 ns	8,960,000	14.3M items/s
BumpAllocator_Alloc/64/16384	4.5 ns	4.5 ns	160,000,000	222.6M items/s
BumpAllocator_Alloc/64/65536	4.7 ns	4.6 ns	154,482,759	214.9M items/s
BumpAllocator_Alloc/64/262144	4.5 ns	4.5 ns	160,000,000	222.6M items/s
BumpAllocator_Alloc/256/16384	4.9 ns	4.9 ns	149,333,333	203.3M items/s
BumpAllocator_Alloc/256/65536	4.6 ns	4.6 ns	149,333,333	217.2M items/s
BumpAllocator_Alloc/256/262144	4.5 ns	4.5 ns	149,333,333	222.3M items/s
BumpAllocator_Alloc/1024/16384	5.2 ns	5.1 ns	100,000,000	193.9M items/s
BumpAllocator_Alloc/1024/65536	4.6 ns	4.6 ns	149,333,333	217.2M items/s
BumpAllocator_Alloc/1024/262144	4.6 ns	4.5 ns	154,482,759	219.7M items/s
ArenaAllocator_Alloc/64/1024	8.8 ns	8.7 ns	74,666,667	113.8M items/s
ArenaAllocator_Alloc/64/4096	7.8 ns	7.8 ns	89,600,000	127.4M items/s
ArenaAllocator_Alloc/64/16384	7.5 ns	7.5 ns	112,000,000	132.7M items/s
ArenaAllocator_Alloc/256/1024	9.4 ns	9.5 ns	74,666,667	106.2M items/s
ArenaAllocator_Alloc/256/4096	8.7 ns	8.7 ns	89,600,000	114.7M items/s
ArenaAllocator_Alloc/256/16384	7.5 ns	7.7 ns	89,600,000	133.4M items/s
ArenaAllocator_Alloc/1024/1024	15.1 ns	15.1 ns	49,777,778	66.4M items/s
ArenaAllocator_Alloc/1024/4096	9.4 ns	9.5 ns	74,666,667	106.2M items/s
ArenaAllocator_Alloc/1024/16384	8.4 ns	8.5 ns	74,666,667	119.5M items/s
BumpAllocator_RequestCycle/16384	65.6 ns	66.5 ns	11,200,000	15.3M items/s
BumpAllocator_RequestCycle/65536	65.6 ns	65.4 ns	11,200,000	15.3M items/s
BumpAllocator_RequestCycle/262144	64.2 ns	65.1 ns	11,200,000	15.6M items/s
ArenaAllocator_TransactionCycle/1024	610.4 ns	618.4 ns	896,000	1.6M items/s
ArenaAllocator_TransactionCycle/4096	256.7 ns	252.4 ns	2,800,000	3.9M items/s
ArenaAllocator_TransactionCycle/16384	125.6 ns	126.6 ns	4,977,778	8.0M items/s

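The Slab Allocator recorded latency reductions of roughly 40~45% across all size classes: 11.1 ns → 6.4 ns at 64 B and 10.0 ns → 6.1 ns at 1024 B, giving a uniform ~6 ns regardless of size. The gap to the system malloc also widened: Slab is now about 8x faster than malloc (6.4 ns vs 51.6 ns at 64 B).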
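A size-independent cost of this kind is what a free-list slab typically produces, because allocation is a pointer pop no matter how large the block is. The sketch below illustrates the idea only; `SlabSketch` is a hypothetical class, not the Apex Core SlabAllocator, and it assumes a non-zero block size and single-threaded (per-core) use.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size slab; NOT the Apex Core SlabAllocator.
// Single-threaded by design, mirroring a per-core allocator that needs no locks.
class SlabSketch {
public:
    SlabSketch(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)), storage_(block_size_ * block_count) {
        // Thread every block onto the intrusive free list.
        for (std::size_t i = 0; i < block_count; ++i) {
            deallocate(storage_.data() + i * block_size_);
        }
    }

    // O(1): pop the head of the free list. Returns nullptr when exhausted.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // O(1): push the block back; the block's own bytes store the link.
    void deallocate(void* p) { free_ = ::new (p) Node{free_}; }

private:
    struct Node { Node* next; };

    // Round up so every block is aligned and large enough to hold a Node.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<std::byte> storage_;
    Node* free_ = nullptr;
};
```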

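In v0.6.5.0 the Bump Allocator benchmark was split by arena size (16K/64K/256K) and shows stable 4.5~5.2 ns performance throughout, about 40% better than the previous 7~10 ns. The Arena Allocator benchmark was likewise extended by arena size and runs in the 7.5~15 ns range; a small allocation (64 B) takes 8.8 ns, 40% faster than the previous 14.6 ns.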

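make_shared improved from 153 ns to 70 ns (-54%). BumpAllocator RequestCycle (64~66 ns) and Arena TransactionCycle (126~618 ns) are newly added benchmarks that measure allocation cost in realistic request-handling scenarios. These results reflect the lock-free design of the dedicated per-core allocators.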
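The request-cycle pattern mentioned above (allocate several objects while handling a request, then release them all at once) can be sketched with a bump allocator as follows. `BumpSketch` and its interface are hypothetical and are not taken from the Apex Core sources; alignment is computed relative to the start of the backing buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bump allocator; NOT the Apex Core BumpAllocator.
class BumpSketch {
public:
    explicit BumpSketch(std::size_t capacity) : storage_(capacity) {}

    // Aligns the cursor, then advances it. `align` must be a power of two.
    // Returns nullptr when the remaining space is insufficient.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        const std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned > storage_.size() || size > storage_.size() - aligned) {
            return nullptr;
        }
        offset_ = aligned + size;
        return storage_.data() + aligned;
    }

    // End of request: every allocation is released by rewinding the cursor.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<std::byte> storage_;
    std::size_t offset_ = 0;
};
```

Because there is no per-object free, the cost of a whole cycle is a handful of cursor updates plus one `reset()`.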

5 Allocators Comparison

Version Comparison

Frame Processing — FrameCodec

Benchmark	CPU Time	Real Time	Iterations	Throughput
FrameCodec_Encode/64	24.0 ns	25.0 ns	28,000,000	3.2 GB/s
FrameCodec_Encode/512	27.8 ns	27.6 ns	23,578,947	18.8 GB/s
FrameCodec_Encode/4096	98.4 ns	99.5 ns	7,466,667	41.8 GB/s
FrameCodec_Encode/16384	313.9 ns	316.5 ns	2,240,000	52.2 GB/s
FrameCodec_Decode/64	51.6 ns	51.1 ns	10,000,000	1.5 GB/s
FrameCodec_Decode/512	54.4 ns	55.2 ns	11,200,000	9.6 GB/s
FrameCodec_Decode/4096	135.0 ns	135.1 ns	4,977,778	30.4 GB/s
FrameCodec_Decode/16384	399.0 ns	401.9 ns	1,723,077	41.1 GB/s

FrameCodec Encode is 1.7~2.6x faster across payload sizes: 64 B 57.5 ns → 24.0 ns (-58%), 512 B 73.2 ns → 27.8 ns (-62%), 4 KB 164.9 ns → 98.4 ns (-40%), 16 KB 609 ns → 314 ns (-48%). In bytes/sec, throughput rose from 1.3 GB/s to 3.2 GB/s at 64 B and from 26.9 GB/s to 52.2 GB/s at 16 KB.

Decode improved similarly: 64 B 109.9 ns → 51.6 ns (-53%), 512 B 122.8 ns → 54.4 ns (-56%), 4 KB 194.9 ns → 135.0 ns (-31%), 16 KB 725 ns → 399 ns (-45%). The gains are larger for small payloads, which is consistent with reduced header-parsing overhead.

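Frame encoding and decoding sit immediately before and after network I/O, so this improvement directly reduces the latency of the entire message-processing pipeline. The effect is greatest in real workloads dominated by small messages (chat, heartbeats).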
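To make the encode and decode steps concrete, here is a minimal length-prefixed framing sketch. The wire format (a 4-byte big-endian payload length followed by the payload) and the names `encode_frame`, `decode_frame`, and `DecodedFrame` are assumptions for illustration; the actual FrameCodec header layout is not described in this report.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical wire format (an assumption, not the Apex Core header):
// [u32 payload length, big-endian][payload bytes]
constexpr std::size_t kHeaderSize = 4;

// Appends one frame (header + payload) to `out`.
inline void encode_frame(const std::uint8_t* payload, std::size_t size,
                         std::vector<std::uint8_t>& out) {
    const auto len = static_cast<std::uint32_t>(size);
    out.push_back(static_cast<std::uint8_t>(len >> 24));
    out.push_back(static_cast<std::uint8_t>(len >> 16));
    out.push_back(static_cast<std::uint8_t>(len >> 8));
    out.push_back(static_cast<std::uint8_t>(len));
    out.insert(out.end(), payload, payload + size);
}

struct DecodedFrame {
    const std::uint8_t* payload;  // points into the input buffer, no copy
    std::size_t payload_size;
    std::size_t consumed;         // header + payload bytes
};

// Returns nullopt when the buffer does not yet hold a complete frame.
inline std::optional<DecodedFrame> decode_frame(const std::uint8_t* in,
                                                std::size_t size) {
    if (size < kHeaderSize) return std::nullopt;
    const std::uint32_t len = (std::uint32_t{in[0]} << 24) |
                              (std::uint32_t{in[1]} << 16) |
                              (std::uint32_t{in[2]} << 8) |
                              std::uint32_t{in[3]};
    if (size - kHeaderSize < len) return std::nullopt;
    return DecodedFrame{in + kHeaderSize, len, kHeaderSize + len};
}
```

The header work is a fixed cost per frame, so it dominates at 64 B and becomes negligible at 16 KB, where the payload copy dominates.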

Encode vs Decode Throughput Scaling

Version Comparison

Serialization — FlatBuffers vs Heap

Benchmark	CPU Time	Real Time	Iterations	Throughput
FlatBuffers_Build/64	97.7 ns	96.6 ns	8,960,000	655.4 MB/s
FlatBuffers_Build/512	104.6 ns	103.8 ns	7,466,667	4.9 GB/s
FlatBuffers_Build/4096	161.1 ns	162.8 ns	4,072,727	25.4 GB/s
HeapAlloc_Build/64	59.4 ns	59.0 ns	10,000,000	1.1 GB/s
HeapAlloc_Build/512	61.4 ns	62.3 ns	11,200,000	8.3 GB/s
HeapAlloc_Build/4096	119.6 ns	117.6 ns	6,400,000	34.2 GB/s
FlatBuffers_Read/64	3.5 ns	3.6 ns	154,482,759	18.1 GB/s
FlatBuffers_Read/512	3.5 ns	3.5 ns	203,636,364	145.1 GB/s
FlatBuffers_Read/4096	3.5 ns	3.5 ns	203,636,364	1160.5 GB/s
HeapAlloc_Read/64	57.2 ns	57.6 ns	11,200,000	1.1 GB/s
HeapAlloc_Read/512	65.6 ns	65.2 ns	11,200,000	7.8 GB/s
HeapAlloc_Read/4096	112.3 ns	114.0 ns	6,400,000	36.5 GB/s

FlatBuffers Build is essentially unchanged: 64 B 100 ns → 98 ns, 512 B 105 ns → 105 ns, 4 KB 160 ns → 161 ns. FlatBuffers Read also stays in the 3.4~3.6 ns range; because zero-copy reads are dominated by L1d cache hits, there is little room left for further optimization.

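HeapAlloc Build improved slightly: 64 B 62.5 ns → 59.4 ns (-5%), 512 B 68.4 ns → 61.4 ns (-10%). HeapAlloc Read also improved at 64 B, 64.2 ns → 57.2 ns (-11%), an indirect effect of the memory allocator improvements. In the methodology comparison the existing trade-off still holds: FlatBuffers Build is 1.3~1.7x slower than HeapAlloc, but FlatBuffers Read is 16~32x faster.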

The serialization path was already well optimized. The bottlenecks were in other components (frame codec, queues), so improvements there contributed more to the overall performance gain.
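The Build/Read asymmetry comes from where the copying happens. The sketch below contrasts a zero-copy view, which only reads fields in place, with a parse step that allocates and copies. It deliberately does not use the FlatBuffers API (real FlatBuffers buffers use vtables and offsets); the fixed layout and the names `MessageView`, `OwnedMessage`, and `parse_owned` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical fixed layout (assumption): [u32 id][u32 body_len][body bytes],
// stored in host byte order.

// Zero-copy: the view only points into the receive buffer.
struct MessageView {
    const std::uint8_t* base;

    std::uint32_t id() const {
        std::uint32_t v;
        std::memcpy(&v, base, sizeof v);  // safe unaligned read
        return v;
    }
    std::uint32_t body_len() const {
        std::uint32_t v;
        std::memcpy(&v, base + 4, sizeof v);
        return v;
    }
    const std::uint8_t* body() const { return base + 8; }
};

// Heap-based: every read first materializes an owning object.
struct OwnedMessage {
    std::uint32_t id;
    std::vector<std::uint8_t> body;
};

inline std::unique_ptr<OwnedMessage> parse_owned(const std::uint8_t* buf) {
    const MessageView v{buf};
    auto m = std::make_unique<OwnedMessage>();  // allocation 1
    m->id = v.id();
    m->body.assign(v.body(), v.body() + v.body_len());  // allocation 2 + copy
    return m;
}
```

Reading through the view costs a few loads whatever the body size, whereas the owning parse grows with payload size, which matches the shape of the Read rows in the table.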

Build vs Read Comparison

Version Comparison

Hash Map: flat_map vs std::unordered_map, large-scale iteration comparison

Benchmark	CPU Time	Real Time	Iterations	Throughput
Dispatcher_Lookup/10	5.0 ns	5.0 ns	100,000,000	200.0M items/s
Dispatcher_Lookup/100	5.0 ns	5.0 ns	100,000,000	200.0M items/s
Dispatcher_Lookup/1000	5.0 ns	5.0 ns	100,000,000	200.0M items/s
FlatMap_SessionLookup/100	3.8 ns	3.7 ns	194,782,609	265.2M items/s
FlatMap_SessionLookup/1000	3.8 ns	3.7 ns	194,782,609	265.2M items/s
FlatMap_SessionLookup/10000	3.8 ns	3.8 ns	186,666,667	265.5M items/s
FlatMap_SessionLookup/100000	3.8 ns	3.8 ns	186,666,667	265.5M items/s
StdMap_SessionLookup/100	3.4 ns	3.4 ns	194,782,609	296.8M items/s
StdMap_SessionLookup/1000	3.4 ns	3.3 ns	213,333,333	296.8M items/s
StdMap_SessionLookup/10000	3.4 ns	3.4 ns	194,782,609	296.8M items/s
StdMap_SessionLookup/100000	3.4 ns	3.4 ns	213,333,333	296.8M items/s
FlatMap_SessionIterate/100	138.1 ns	137.4 ns	4,977,778	724.0M items/s
FlatMap_SessionIterate/1000	2.3 us	2.4 us	280,000	426.7M items/s
FlatMap_SessionIterate/10000	19.3 us	19.1 us	37,333	519.4M items/s
StdMap_SessionIterate/100	109.9 ns	109.9 ns	6,400,000	910.2M items/s
StdMap_SessionIterate/1000	2.8 us	3.0 us	235,789	359.3M items/s
StdMap_SessionIterate/10000	64.2 us	63.7 us	11,200	155.8M items/s

MessageDispatcher Lookup converges to 5.0 ns at every scale (10/100/1000 handlers), up to about 14% better than the previous 5.2~5.8 ns. The O(1) behavior, independent of handler count, is now even clearer.

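FlatMap SessionLookup is now uniform at about 3.8 ns across the whole 100~100K session range. Previously it varied (5.2 ns at 100 sessions, 7.1 ns at 1,000 sessions); v0.6.5.0 removes that variation. FlatMap SessionIterate also improved, from 180 ns to 138 ns (-23%) at 100 sessions and from 26.2 us to 19.3 us (-26%) at 10K sessions. The benefit of the cache-friendly layout holds as the session count grows.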

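StdMap (unordered_map) SessionLookup also improved, to about 3.4 ns, and the gap between the two maps has narrowed (StdMap is now marginally faster than FlatMap's 3.8 ns). In iteration, however, FlatMap retains its cache-locality advantage at larger sizes (2.3 us vs 2.8 us at 1,000 sessions, 19.3 us vs 64.2 us at 10,000 sessions), although StdMap is faster at 100 sessions (109.9 ns vs 138.1 ns). The case for choosing FlatMap for per-core session management therefore still holds.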
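The iteration gap is easiest to see from the data layout. The sketch below assumes that the flat map is an open-addressing table stored in a single contiguous array; the report does not state the actual design, so `FlatMapSketch`, its fixed power-of-two capacity, and the absence of erase and rehash are all illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical open-addressing map; NOT the Apex Core flat_map.
// Fixed power-of-two capacity, no erase, no rehash: illustration only.
class FlatMapSketch {
public:
    explicit FlatMapSketch(std::size_t capacity_pow2)
        : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {}

    // Inserts or updates. Returns false only when the table is full.
    bool insert(std::uint64_t key, int value) {
        std::size_t i = index_of(key);
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask_) {
            Slot& s = slots_[i];
            if (!s.used) { s = Slot{key, value, true}; ++size_; return true; }
            if (s.key == key) { s.value = value; return true; }
        }
        return false;
    }

    // Linear probing: collisions are resolved in neighbouring slots.
    const int* find(std::uint64_t key) const {
        std::size_t i = index_of(key);
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask_) {
            const Slot& s = slots_[i];
            if (!s.used) return nullptr;
            if (s.key == key) return &s.value;
        }
        return nullptr;
    }

    // Iteration is a linear scan over one array, with no pointer chasing.
    template <typename Fn>
    void for_each(Fn&& fn) const {
        for (const Slot& s : slots_) {
            if (s.used) fn(s.key, s.value);
        }
    }

    std::size_t size() const { return size_; }

private:
    struct Slot {
        std::uint64_t key = 0;
        int value = 0;
        bool used = false;
    };

    std::size_t index_of(std::uint64_t key) const {
        return std::hash<std::uint64_t>{}(key) & mask_;
    }

    std::vector<Slot> slots_;
    std::size_t mask_;
    std::size_t size_ = 0;
};
```

A node-based std::unordered_map typically follows a pointer per element during iteration; once the working set no longer fits in cache that costs a miss per node, which is one plausible reading of the 10,000-session row.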

flat_map vs std::unordered_map: Session Iteration

Version Comparison

Session & Timer

intrusive_ptr vs shared_ptr

TimingWheel — O(1) Timeout

Benchmark	CPU Time	Real Time	Iterations	Throughput
TimingWheel_ScheduleTick/1000	28.8 us	28.5 us	32,000	34.7M items/s
TimingWheel_ScheduleTick/10000	304.8 us	293.0 us	2,358	32.8M items/s
TimingWheel_ScheduleTick/50000	1.76 ms	1.61 ms	498	28.5M items/s
TimingWheel_ScheduleOnly	62.8 ns	62.9 ns	11,200,000	15.9M items/s

Session Lifecycle

Benchmark	CPU Time	Real Time	Iterations	Throughput
Session_Create	214.8 us	218.3 us	3,200	4,655 items/s
SessionPtr_Copy	12.3 ns	12.3 ns	56,000,000	81.5M items/s
SharedPtr_Copy	12.3 ns	12.3 ns	56,000,000	81.5M items/s

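Session Create dropped from 443 us to 215 us (-52%), roughly halving the cost of creating a session. Sessions created per second roughly doubled, from 2,256 to 4,655. This directly improves server responsiveness under a sudden connection surge (thundering herd).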

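SessionPtr Copy (intrusive_ptr) rose from 6.3 ns to 12.3 ns, while SharedPtr Copy fell from 18.0 ns to 12.3 ns. The fact that both pointer types converge on an identical 12.3 ns in v0.6.5.0 is presumed to reflect benchmark code changes or differences in compiler optimization.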
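For context on what a pointer copy involves, here is a minimal intrusive reference-counting sketch. `RefCounted`, `IntrusivePtrSketch`, and `SessionSketch` are hypothetical names and are not the Apex Core SessionPtr; the point is only that the counter lives inside the object, whereas shared_ptr keeps it in a separate control block.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Hypothetical base class holding the embedded reference count.
class RefCounted {
public:
    void add_ref() noexcept { count_.fetch_add(1, std::memory_order_relaxed); }
    // Returns true when the caller dropped the last reference.
    bool release() noexcept {
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
    std::size_t use_count() const noexcept {
        return count_.load(std::memory_order_relaxed);
    }

protected:
    ~RefCounted() = default;

private:
    std::atomic<std::size_t> count_{0};
};

template <typename T>
class IntrusivePtrSketch {
public:
    explicit IntrusivePtrSketch(T* p = nullptr) : p_(p) {
        if (p_) p_->add_ref();
    }
    // Copy: one atomic increment on the object itself.
    IntrusivePtrSketch(const IntrusivePtrSketch& o) : p_(o.p_) {
        if (p_) p_->add_ref();
    }
    IntrusivePtrSketch(IntrusivePtrSketch&& o) noexcept
        : p_(std::exchange(o.p_, nullptr)) {}
    IntrusivePtrSketch& operator=(IntrusivePtrSketch o) noexcept {
        std::swap(p_, o.p_);
        return *this;
    }
    ~IntrusivePtrSketch() {
        if (p_ && p_->release()) delete p_;
    }

    T* get() const noexcept { return p_; }
    T* operator->() const noexcept { return p_; }

private:
    T* p_;
};

// Hypothetical session type used only for this example.
struct SessionSketch : RefCounted {
    int id = 0;
};
```

In both designs a copy reduces to a single atomic increment, so near-identical copy timings are plausible when the benchmark measures only that increment.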

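TimingWheel improved dramatically at every scale. ScheduleTick went from 64 us to 29 us (-55%) with 1K timers, from 725 us to 305 us (-58%) with 10K, and from 3.65 ms to 1.76 ms (-52%) with 50K. ScheduleOnly also improved, from 159 ns to 63 ns (-60%). Even with large timer pools the cost grows roughly linearly with the number of timers (about 29~35 ns per timer), which makes it suitable for managing idle timeouts for tens of thousands of sessions.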
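The O(1) scheduling cost follows from the data structure: a timer is appended to the slot it will expire in, with no ordering work. The single-level wheel below is a sketch under stated assumptions (delays shorter than one wheel revolution, no cancellation, no hierarchy); `TimingWheelSketch` is a hypothetical name and not the Apex Core TimingWheel.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-level wheel; NOT the Apex Core TimingWheel.
// Delays must be in [1, slot_count - 1] ticks.
class TimingWheelSketch {
public:
    explicit TimingWheelSketch(std::size_t slot_count) : slots_(slot_count) {}

    // O(1): compute the target slot and append the callback.
    bool schedule(std::size_t delay_ticks, std::function<void()> cb) {
        if (delay_ticks == 0 || delay_ticks >= slots_.size()) return false;
        slots_[(cursor_ + delay_ticks) % slots_.size()].push_back(std::move(cb));
        return true;
    }

    // Advances one tick and fires every timer in the slot under the cursor.
    // Returns the number of timers fired.
    std::size_t tick() {
        cursor_ = (cursor_ + 1) % slots_.size();
        std::vector<std::function<void()>> due;
        due.swap(slots_[cursor_]);  // callbacks may safely schedule new timers
        for (auto& cb : due) cb();
        return due.size();
    }

private:
    std::vector<std::vector<std::function<void()>>> slots_;
    std::size_t cursor_ = 0;
};
```

A tick costs time proportional to the number of timers that expire in that slot, so the total work for N timers grows about linearly with N.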

Version Comparison

Buffer — RingBuffer

Benchmark	CPU Time	Real Time	Iterations	Throughput
RingBuffer_WriteRead/64	13.8 ns	14.2 ns	49,777,778	4.6 GB/s
RingBuffer_WriteRead/512	16.5 ns	16.6 ns	40,727,273	31.0 GB/s
RingBuffer_WriteRead/4096	51.6 ns	51.5 ns	10,000,000	79.4 GB/s
RingBuffer_Linearize/64	16.5 ns	17.1 ns	40,727,273	3.9 GB/s
RingBuffer_Linearize/512	23.4 ns	23.5 ns	32,000,000	21.8 GB/s
RingBuffer_Linearize/4096	64.2 ns	64.2 ns	11,200,000	63.8 GB/s
NaiveBuffer_CopyWrite/64	59.4 ns	58.3 ns	10,000,000	1.1 GB/s
NaiveBuffer_CopyWrite/512	65.6 ns	65.6 ns	11,200,000	7.8 GB/s
NaiveBuffer_CopyWrite/4096	114.7 ns	113.8 ns	6,400,000	35.7 GB/s

RingBuffer WriteRead throughput is 2~2.5x higher at every size: 64 B 34.5 ns → 13.8 ns (-60%), 512 B 40.6 ns → 16.5 ns (-59%), 4 KB 102.5 ns → 51.6 ns (-50%). In bytes/sec, throughput rose from 1.85 GB/s to 4.63 GB/s at 64 B and from 40 GB/s to 79.4 GB/s at 4 KB.

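Linearize (recovering from wrap-around) also improved dramatically: 64 B 43.5 ns → 16.5 ns (-62%), 512 B 67.0 ns → 23.4 ns (-65%), 4 KB 132.5 ns → 64.2 ns (-52%). The control, NaiveBuffer CopyWrite, improved as well (64 B 156 ns → 59 ns, -62%), but RingBuffer WriteRead remains about 4x faster at 64 B and 512 B and about 2.2x faster at 4 KB. The advantage of the zero-copy read path is intact.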

RingBuffer is central to reassembling data that crosses frame boundaries on the TCP receive path, so this improvement directly raises the ceiling on network I/O throughput.
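The Linearize operation can be sketched as follows: when the readable bytes wrap past the end of the storage, they are rotated into one contiguous region so a frame can be parsed in place. `ByteRingSketch` is a hypothetical class that assumes a non-zero capacity; it is not the Apex Core RingBuffer, and the rotate-based approach is only one possible way to linearize.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical byte ring; NOT the Apex Core RingBuffer.
class ByteRingSketch {
public:
    explicit ByteRingSketch(std::size_t capacity) : buf_(capacity) {}

    std::size_t size() const { return size_; }

    // Copies n bytes in, splitting the copy in two if it crosses the end.
    bool write(const std::uint8_t* src, std::size_t n) {
        if (n > buf_.size() - size_) return false;
        const std::size_t tail = (head_ + size_) % buf_.size();
        const std::size_t first = std::min(n, buf_.size() - tail);
        std::memcpy(buf_.data() + tail, src, first);
        std::memcpy(buf_.data(), src + first, n - first);
        size_ += n;
        return true;
    }

    // Drops up to n bytes from the front without copying.
    void consume(std::size_t n) {
        n = std::min(n, size_);
        head_ = (head_ + n) % buf_.size();
        size_ -= n;
    }

    // Makes the readable region contiguous and returns a pointer to it.
    // Pays for a rotate only when the data actually wraps.
    const std::uint8_t* linearize() {
        if (head_ + size_ > buf_.size()) {
            std::rotate(buf_.begin(),
                        buf_.begin() + static_cast<std::ptrdiff_t>(head_),
                        buf_.end());
            head_ = 0;
        }
        return buf_.data() + head_;
    }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t head_ = 0;  // index of the first readable byte
    std::size_t size_ = 0;  // number of readable bytes
};
```

In the common non-wrapping case `linearize()` returns a pointer without moving any bytes, which is the path a zero-copy read relies on.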

Zero-copy vs Naive memcpy (Throughput GB/s)

Version Comparison

Summary

All Components Delta %

Methodology Comparison Highlights

Comparison	Approach A	Approach B	Ratio
SPSC vs MPSC (1P1C)	SPSC: 3.6 ns	MPSC: 9.6 ns	2.7x
Slab vs malloc (64B)	Slab: 6.4 ns	malloc: 51.6 ns	8.0x
intrusive_ptr vs shared_ptr	intrusive: 12.3 ns	shared: 12.3 ns	1.0x
Zero-copy vs Naive (512B)	ZeroCopy: 31.0 GB/s	Naive: 7.8 GB/s	4.0x
FlatBuffers vs Heap Build (512B)	FB: 104.6 ns	Heap: 61.4 ns	1.7x
FlatBuffers vs Heap Read (512B)	FB: 3.5 ns	Heap: 65.6 ns	18.6x
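The Ratio column and the percentage deltas quoted throughout this report are simple arithmetic on the table values. A small sketch of that arithmetic, using hypothetical helper names (`ratio`, `delta_percent`, `round1`):

```cpp
#include <cmath>

// Latency ratio: how many times slower B is than A (same unit for both).
// For throughput comparisons, divide the larger throughput by the smaller.
inline double ratio(double a, double b) { return b / a; }

// Relative change from a baseline to a new measurement, in percent.
// Negative means the new value is smaller (faster, for latencies).
inline double delta_percent(double baseline, double current) {
    return (current - baseline) / baseline * 100.0;
}

// Rounds to one decimal place, the precision used in the Ratio column.
inline double round1(double x) { return std::round(x * 10.0) / 10.0; }
```

For example, `round1(ratio(3.6, 9.6))` gives 2.7 for the SPSC vs MPSC row, and `delta_percent(10.1, 3.6)` is about -64.4 for the SPSC 1K throughput change. Because the inputs are themselves rounded, a recomputed ratio can differ from the printed one in the last digit.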

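v0.5.10.0 → v0.6.5.0 delivered systematic performance improvements across every component of the framework. In the microbenchmarks most critical paths show a 40~65% latency reduction, which translates into an 8~14% end-to-end improvement in the integration benchmarks.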

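Areas that stand out:
• SPSC Queue: 64~70% latency reduction in the Throughput benchmarks (the critical path for inter-core communication)
• FrameCodec: Encode improved 40~62%, Decode 31~56% (direct optimization of the I/O path)
• RingBuffer: 50~65% latency reduction (raises the performance ceiling of the TCP receive buffer)
• TimingWheel: 52~60% improvement (lower timer-management cost for large session counts)
• Session Create: 52% improvement (about 2x the capacity to absorb connection surges)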

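It is normal for the integration benchmarks to improve less than the microbenchmarks, because nondeterministic factors such as OS scheduling, cache coherence, and thread synchronization account for a large share of their cost. Cross-core PostThroughput exceeding 1 million msg/sec on a 4-core machine is a milestone that confirms production-grade message-passing performance.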

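Key takeaways from the methodology comparison:
• SPSC vs MPSC: SPSC 3.6 ns vs MPSC 9.6 ns (1P1C). The single-producer queue is about 2.7x faster, which supports preferring SPSC in a shared-nothing architecture
• Slab vs malloc vs make_shared: Slab 6.4 ns vs malloc 51.6 ns vs make_shared 69.8 ns (64 B). Slab is about 8x faster than malloc and about 11x faster than make_shared, a decisive advantage for dedicated per-core allocators
• FlatBuffers vs HeapAlloc: FlatBuffers is 1.3~1.7x slower in Build but 16~32x faster in Read, so it is clearly preferable for read-heavy message processing
• flat_map vs unordered_map: Lookup is nearly equal (3.4~3.8 ns). In Iterate, flat_map is about 1.2x faster at 1,000 sessions and about 3.3x faster at 10,000 sessions thanks to cache locality, while unordered_map is slightly faster at 100 sessions
• intrusive_ptr vs shared_ptr: both converge on 12.3 ns in v0.6.5.0 (intrusive_ptr up from 6.3 ns, shared_ptr down from 18.0 ns). The convergence is presumed to come from benchmark code changes or compiler optimization differences
• RingBuffer (zero-copy) vs NaiveBuffer (memcpy): RingBuffer is 4.3x faster in WriteRead at 64 B (13.8 ns vs 59.4 ns), a decisive advantage for the zero-copy design on the network I/O path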

Integration — End-to-end Pipeline

Throughput Scaling — Per-core vs Shared io_context

Cross-core Latency (RTT)

Cross-core Message Throughput

Frame Pipeline & Session Echo Throughput

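Cross-core RTT (Round-Trip Time) improved slightly, from 17.1 us to 16.6 us (-2.6%). Nondeterministic factors such as OS scheduling and cache coherence dominate here, so differences as large as those in the microbenchmarks are unlikely.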

Cross-core PostThroughput rose from 976K ops/s to 1,053K ops/s (+7.9%), passing 1 million msg/sec. On a 4-core i5-9300H this figure confirms that framework-level message-passing overhead is sufficiently low.

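FramePipeline (the full Encode→Queue→Decode path) is essentially flat at 64 B (6.1 us → 6.4 us, a slight increase) and improves for medium and large payloads: 512 B 6.8 us → 6.3 us (-8%), 4 KB 6.8 us → 6.1 us (-10%). Session EchoRoundTrip improved by roughly 10% or more at every size: 64 B 8.2 us → 7.1 us (-12%), 512 B 8.2 us → 7.3 us (-11%), 4 KB 8.5 us → 7.7 us (-10%). These results reflect the cumulative effect of per-component micro-optimizations on the end-to-end pipeline.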

Cross-core Latency

Benchmark	CPU Time	Real Time	Iterations	Throughput
CrossCore_Latency/iterations:10000/real_time	10.9 us	23.7 us	10,000	-

Cross-core Message Passing

Benchmark	CPU Time	Real Time	Iterations	Throughput
CrossCore_PostThroughput/real_time	948.9 ns	949.6 ns	724,507	1.1M items/s

Frame Pipeline

Benchmark	CPU Time	Real Time	Iterations	Throughput
FramePipeline/64	6.4 us	6.3 us	112,000	11.8 MB/s
FramePipeline/512	6.3 us	6.2 us	112,000	83.5 MB/s
FramePipeline/4096	6.1 us	6.2 us	112,000	669.2 MB/s

Session Throughput

Benchmark	CPU Time	Real Time	Iterations	Throughput
Session_EchoRoundTrip/64	7.1 us	7.3 us	89,600	10.6 MB/s
Session_EchoRoundTrip/512	7.3 us	7.3 us	89,600	71.5 MB/s
Session_EchoRoundTrip/4096	7.7 us	7.8 us	89,600	535.4 MB/s

Architecture Comparison

Benchmark	CPU Time	Real Time	Iterations	Throughput
PerCore_Stateful/1/real_time	89.84 ms	88.59 ms	8	564,393 items/s
PerCore_Stateful/2/real_time	98.21 ms	97.59 ms	7	1.0M items/s
PerCore_Stateful/3/real_time	100.45 ms	101.37 ms	7	1.5M items/s
PerCore_Stateful/4/real_time	104.91 ms	110.65 ms	7	1.8M items/s
PerCore_Stateful/8/real_time	167.97 ms	169.35 ms	4	2.4M items/s
PerCore_Stateful/16/real_time	164.06 ms	337.01 ms	2	2.4M items/s
Shared_Stateful/1/real_time	87.89 ms	88.09 ms	8	567,605 items/s
Shared_Stateful/2/real_time	150.00 ms	148.86 ms	5	671,793 items/s
Shared_Stateful/3/real_time	192.71 ms	195.33 ms	3	767,940 items/s
Shared_Stateful/4/real_time	244.79 ms	245.88 ms	3	813,392 items/s
Shared_Stateful/8/real_time	460.94 ms	463.20 ms	2	863,555 items/s
Shared_Stateful/16/real_time	859.38 ms	926.98 ms	1	863,013 items/s
Shared_LockFree_Stateful/1/real_time	85.94 ms	86.84 ms	8	575,785 items/s
Shared_LockFree_Stateful/2/real_time	143.75 ms	144.25 ms	5	693,248 items/s
Shared_LockFree_Stateful/3/real_time	195.31 ms	194.55 ms	4	771,025 items/s
Shared_LockFree_Stateful/4/real_time	239.58 ms	239.60 ms	3	834,733 items/s
Shared_LockFree_Stateful/8/real_time	445.31 ms	452.32 ms	2	884,321 items/s
Shared_LockFree_Stateful/16/real_time	828.12 ms	914.97 ms	1	874,342 items/s