Research Alert: China's Silicon Reawakening: Meituan's LongCat Lab and the Ascend-Powered Dawn of Sovereign AI
In the crucible of technological containment, China's AI ascent has assumed the character of a sophisticated national symphony—platform capital conducting hardware ingenuity, academic sparsity orchestrating around bandwidth austerity.
The latest movement features LongCat-2.0, Meituan's audacious 1.6-trillion-parameter MoE titan: the first publicly chronicled frontier model whose entire pre-training and inference lifecycle transpired on 50,000 Ascend 910C cards within CloudMatrix 384 Superpods.
Pre-trained on over 35 trillion tokens with dynamic activation of 33–56 billion parameters per token (averaging ~48B), it offers native 1M context via LongCat Sparse Attention (LSA), an evolution of DeepSeek DSA that deploys lighter indexers for near-linear scaling, and carries 135B N-gram Embedding parameters at 5-gram depth. It delivers 59.5 on SWE-Bench Pro and 70.8% on Terminal-Bench 2.1.
This is no mere parameter flex; it is the super-app's declaration that intelligence must be endogenous to command the physical world's chaotic data streams.

Meituan's dominion as China's lifestyle super-app, melding instantaneous delivery, mobility, hospitality, and discovery into one pulsating interface, renders generic frontier models insufficient.
Its proprietary graph of real-time inventories, 1.3 billion reviews, and hyper-frequent user intents demands agents capable of autonomous orchestration: parsing vague spoken directives into executed bookings against live merchant states.
Meituan's Xiaomei assistant already prototypes this; LongCat-2.0 elevates it.
In an ecosystem where super-apps like WeChat, Douyin, and Taobao vie for primacy in daily autopilot existence, ceding the foundational layer equates to strategic surrender.
Meituan's investment, born of survival calculus amid subsidy wars and margin compression, echoes yet surpasses Meta's proprietary stake: here, the LLM becomes the invisible conductor of commerce's physical symphony.

The human filament threading this narrative glows with quiet poignancy.
In early 2023, Meituan co-founder Wang Huiwen, fresh from his tenure at the company, poured his personal fortune into Light Years Beyond, envisioning China's OpenAI and igniting a cohort of labs including DeepSeek and Moonshot.
Wang's mental health struggles led to Meituan's acquisition of the startup for roughly 2.065 billion RMB, absorbing its talent and vision into LongCat Lab.
What emerged transcends the original cadre: MOPD (Multi-Expert On-Policy Distill) fuses dedicated Agent (tool-calling, self-correction), Reasoning (STEM depth), and Interaction (nuanced fidelity) experts through on-policy distillation, dynamically gated within a single model.
Paired with zero-computation experts that route simple tokens at negligible cost and with N-gram embeddings (which hash sequences via decomposed sub-tables, linear projections, and amplification to sidestep collisions while preserving residual pathway potency), this orthogonal sparsity dimension outperforms pure expert scaling on empirical Pareto frontiers.

Huawei's CloudMatrix384 supplies the architectural loom.
A 16-rack supernode uniting 384 dual-die Ascend 910C NPUs (each yielding 752 TFLOPS BF16, 128GB HBM at 3.2 TB/s, seven 224 Gbps transceivers) with 192 Kunpeng CPUs and tiered UB switches, it forges a peer-to-peer fabric where inter-node bandwidth holds within 3% of intra-node and latency penalty stays under 1µs.
Three complementary planes—UB for supernode-scale all-to-all (memory-semantic load/store/atomic), RDMA/RoCE for expansion, VPC for integration—enable global DRAM pooling across CPUs for KV cache, model weights, and checkpoints.
CloudMatrix-Infer disaggregates prefill, decode, and caching with EP32+ expert parallelism, attaining 6,688 tokens/s/NPU prefill and 1,943 tokens/s decode.
CloudMatrix384's optics-dominant, multi-rack expanse trades per-FLOP efficiency for systemic amplitude: nearly double the BF16 dense compute, 3.6x the memory capacity, and 2.1x the bandwidth, at higher power. It is nonetheless well attuned to domestic realities: SMIC 7nm continuity, CXMT HBM3 maturation, YMTC NAND abundance.
UB-Mesh precursors unify CPU/NPU/NIC pooling, extending beyond NVLink's GPU-centric domain into true heterogeneous memory semantics.
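The per-NPU figures quoted above can be rolled up into supernode totals with simple arithmetic. The sketch below only multiplies numbers stated in this article (384 NPUs, 752 TFLOPS BF16, 128 GB HBM, 3.2 TB/s, 50,000 cards). The function and constant names are illustrative, the results are peak values rather than sustained ones, and decimal units are assumed.

```python
# Per-NPU and cluster figures as quoted in the article.
NPUS_PER_SUPERNODE = 384
BF16_TFLOPS_PER_NPU = 752
HBM_GB_PER_NPU = 128
HBM_TBPS_PER_NPU = 3.2
CLUSTER_CARDS = 50_000


def supernode_totals() -> dict:
    """Aggregate peak figures for one supernode (decimal units, peak not sustained)."""
    return {
        "bf16_pflops": NPUS_PER_SUPERNODE * BF16_TFLOPS_PER_NPU / 1_000,
        "hbm_tb": NPUS_PER_SUPERNODE * HBM_GB_PER_NPU / 1_000,
        "hbm_bandwidth_pb_per_s": NPUS_PER_SUPERNODE * HBM_TBPS_PER_NPU / 1_000,
        # How many full supernodes the quoted 50,000-card cluster corresponds to.
        "supernodes_for_cluster": CLUSTER_CARDS / NPUS_PER_SUPERNODE,
    }


if __name__ == "__main__":
    for name, value in supernode_totals().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, one supernode works out to roughly 289 PFLOPS of peak BF16 compute and about 49 TB of HBM, and 50,000 cards correspond to about 130 supernodes.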
LongCat's training stack leveraged 6D parallelism (adding EMBP to TP/CP/EP/DP/PP), pipeline scheduling, operator kernel fusion, and bitwise determinism to achieve a 1.5x MFU uplift, steady-state throughput above 1T tokens/day, and a reduction of more than 70% in daily fault incidence: triumphs over hardware volatility, communication anomalies, and numerical drift.

These co-evolutions erode the HBM bottleneck through layered ingenuity.
LSA intelligently prunes attention to key tokens in million-context regimes; N-gram branches expand embedding expressivity ~100x while offloading to DRAM; DeepSeek-derived MLA/DSA/CSA/HCA slash KV cache by up to 98% and decode bandwidth by 66%.
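To make the N-gram embedding idea concrete, here is a minimal toy sketch of a hashed N-gram embedding: each trailing n-gram is hashed into several small sub-tables with different seeds, the looked-up vectors are concatenated, and a linear projection maps them to the output width. This is not LongCat's implementation. The class name, table sizes, hash constants, and initialization are all assumptions chosen for illustration, and the "amplification" step mentioned in the article is not modeled.

```python
import random


class NGramHashEmbedding:
    """Toy hashed n-gram embedding (hypothetical, for illustration only)."""

    def __init__(self, n=5, num_subtables=4, table_size=1024,
                 sub_dim=8, out_dim=16, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.table_size = table_size
        # One hash seed per sub-table, so a collision in one table is
        # unlikely to repeat in all the others.
        self.seeds = [rng.randrange(1, 2**31) for _ in range(num_subtables)]
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(sub_dim)]
             for _ in range(table_size)]
            for _ in range(num_subtables)
        ]
        in_dim = num_subtables * sub_dim
        # Projection matrix of shape (in_dim, out_dim).
        self.proj = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)]
                     for _ in range(in_dim)]

    def _bucket(self, ngram, seed):
        # Seeded polynomial rolling hash over the token ids (arbitrary constants).
        h = seed
        for tok in ngram:
            h = (h * 1_000_003 + tok) % (2**61 - 1)
        return h % self.table_size

    def embed(self, token_ids):
        """Return one out_dim vector per position, from its trailing n-gram."""
        out = []
        for pos in range(len(token_ids)):
            start = max(0, pos - self.n + 1)  # shorter n-grams at the start
            ngram = tuple(token_ids[start:pos + 1])
            concat = []
            for seed, table in zip(self.seeds, self.tables):
                concat.extend(table[self._bucket(ngram, seed)])
            # Linear projection: concat (in_dim) times proj (in_dim x out_dim).
            out.append([sum(c * w for c, w in zip(concat, col))
                        for col in zip(*self.proj)])
        return out


if __name__ == "__main__":
    vectors = NGramHashEmbedding().embed([1, 2, 3, 4, 5, 9, 1, 2, 3, 4, 5])
    print(len(vectors), len(vectors[0]))
```

Because the tables are indexed by hash rather than by a full n-gram vocabulary, they are plain lookups with no matrix multiply, which is one reason such tables are candidates for placement in DRAM rather than HBM.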
Huawei's EMS and disaggregated memory amplify cache hit rates under bursty loads.
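To put the KV-cache reductions cited above in perspective, the following back-of-the-envelope sketch sizes a plain KV cache at 1M-token context and applies the article's "up to 98%" reduction. The model shape (64 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical placeholder, not LongCat-2.0's published configuration; only the 98% figure and the 128 GB per-NPU HBM capacity come from the text.

```python
def kv_cache_gb(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Size of a plain (uncompressed) KV cache for one sequence, in decimal GB."""
    # Factor 2 accounts for storing both keys and values.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * seq_len / 1e9


# Hypothetical model shape, NOT LongCat-2.0's published configuration.
baseline_gb = kv_cache_gb(seq_len=1_000_000, layers=64, kv_heads=8, head_dim=128)
reduced_gb = baseline_gb * (1 - 0.98)  # "up to 98%" reduction quoted in the article
HBM_GB_PER_NPU = 128  # per-NPU HBM quoted in the article

print(f"baseline: {baseline_gb:.1f} GB, fits in one NPU: {baseline_gb < HBM_GB_PER_NPU}")
print(f"reduced:  {reduced_gb:.1f} GB, fits in one NPU: {reduced_gb < HBM_GB_PER_NPU}")
```

With this assumed shape, a single 1M-token sequence needs about 262 GB of uncompressed KV cache, more than one NPU's 128 GB of HBM, while a 98% reduction brings it to roughly 5.2 GB.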
With a stockpile of 1.6 million 910C chips feasible and CXMT and YMTC scaling "good-enough" volumes, the ecosystem, bolstered by recurrent self-improvement pathways in peers like GLM, has attained escape velocity.
Training correctness via self-designed deterministic operators and variance alignment further cements reproducibility on non-ideal silicon.

Meituan's financial architecture, despite episodic net losses from competitive subsidies and Middle East forays, rests on formidable cash generation from core operations and an HKEX-listed valuation in the tens of billions.
Revenues hover near $50 billion TTM with proven capital markets access.
In China's super-app crucible—where platform lock-in hinges on predictive execution of daily errands—AI constitutes non-negotiable core competency.
LongCat's open-sourcing ethos (MIT-licensed weights and code) and agentic specialization signal sustained commitment; rivals cannot lag without ceding user mindshare to more autonomous interfaces.

Ultimately, this convergence transcends hardware substitution.
It reveals a civilizational wager: that distributed ingenuity—platform pragmatism fused with spa