ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
By Yuseon Choi 1, Jingu Lee 1, Jungjun Oh 1, Sunjoo Whang 1, Byeongcheol Kim 2, Minsung Kim 1
1 KAIST
2 Samsung Electronics
Abstract
Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE’s low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit for MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW–SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE—expert and bit—and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6× speedup and 4.4× energy efficiency gain over naïve MoE serving on xPU across batch sizes 1–16, and delivers 2.2× speedup and 1.4× energy efficiency gain over the best-performing prior accelerator baseline.
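To make the idea of "elastic" self-speculative decoding concrete, the sketch below shows one possible shape of the scheme in plain Python: the draft model owns no separate weights, but reuses the target MoE's own weights with fewer active experts (expert elasticity) and a truncated bit-width (bit elasticity), then the full configuration verifies the drafted tokens. All names here (truncate_bits, moe_logits, elastic_self_spec_decode, the toy dimensions and the top-1/4-bit vs. top-2/8-bit settings) are hypothetical illustrations, not the paper's actual algorithm, hardware mapping, or numerology.

```python
# Illustrative sketch only: a toy greedy self-speculative decoding loop where the
# draft reuses the target MoE's weights with fewer experts and lower precision.
# None of these identifiers or parameter choices come from the ELMoE-3D paper.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_EXPERTS = 64, 32, 8

# One toy MoE block shared by target and draft (the draft owns no separate weights).
router = rng.standard_normal((DIM, N_EXPERTS))
experts = rng.standard_normal((N_EXPERTS, DIM, DIM)) / np.sqrt(DIM)
unembed = rng.standard_normal((DIM, VOCAB)) / np.sqrt(DIM)
embed = rng.standard_normal((VOCAB, DIM))

def truncate_bits(w, bits):
    """Crude fixed-point truncation standing in for bit elasticity."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def moe_logits(token, top_k, bits):
    """Forward one token through the toy MoE with `top_k` experts at `bits` precision."""
    h = embed[token]
    gate = h @ router
    chosen = np.argsort(gate)[-top_k:]            # expert elasticity: use only the top-k experts
    w = np.mean([truncate_bits(experts[e], bits) @ h for e in chosen], axis=0)
    return (h + w) @ unembed

def greedy(token, top_k, bits):
    return int(np.argmax(moe_logits(token, top_k, bits)))

def elastic_self_spec_decode(prompt_token, n_tokens, draft_len=4):
    """Greedy self-speculative decoding: draft with (top-1, 4-bit), verify with (top-2, 8-bit)."""
    out, cur = [], prompt_token
    while len(out) < n_tokens:
        # 1) Draft phase: the cheap elastic configuration proposes draft_len tokens.
        drafts, t = [], cur
        for _ in range(draft_len):
            t = greedy(t, top_k=1, bits=4)
            drafts.append(t)
        # 2) Verify phase: the full configuration re-scores; accept the longest agreeing prefix.
        #    (A real system scores all drafted positions in a single target pass; this toy
        #    verifies token by token purely for clarity.)
        t, accepted = cur, []
        for d in drafts:
            v = greedy(t, top_k=2, bits=8)
            accepted.append(v)
            if v != d:
                break                              # first mismatch: keep the target's token, stop
            t = v
        out.extend(accepted)
        cur = out[-1]
    return out[:n_tokens]

print(elastic_self_spec_decode(prompt_token=3, n_tokens=8))
```

Because the draft is carved out of the target's own experts and bit-slices, it stays closely aligned with the target distribution, which is the property the abstract attributes to Elastic-SD; the hardware-side bit-nested execution on the HB stack is not modeled here.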
Keywords
Near Memory Processing, Hybrid-Bonding, Mixture-of-Experts, Speculative Decoding, On-premises Serving