Artificial Intelligence is rapidly becoming ubiquitous, powering applications ranging from large-scale cloud deployments to low-resource edge devices like smartphones and laptops. Despite impressive advancements, most current small models optimized for edge deployment, such as SmolLM2 (Allal et al., 2025), the Phi models (Abdin et al., 2024), and Llama 3.2 1B (Grattafiori et al., 2024), predominantly rely on Transformer architectures built around attention operators, owing to their parallelizable computations and efficient kernels (Vaswani et al., 2017). Optimizing architectures, even for GPUs, is exceptionally challenging. While hybrid architectures have been shown to deliver quality improvements, they are often slower at inference than highly optimized Transformers, particularly in regimes critical to edge deployment, such as short prompts. This underscores the importance of jointly optimizing model architecture and inference runtime, with performance metrics tailored to the target hardware.
Today, we introduce a Liquid architecture called Hyena Edge, a convolution-based multi-hybrid model that outperforms strong Transformer-based baselines in both computational efficiency and model quality on edge hardware, benchmarked on the Samsung S24 Ultra smartphone. To design Hyena Edge, we used our recently proposed end-to-end automated model design framework.
We plan to open-source a series of Liquid foundation models in the coming months, including Hyena Edge. Stay tuned as we continue pushing the boundaries of what's possible at the AI edge.
To systematically explore and optimize our architecture, we used STAR (Thomas et al., 2024), our recently introduced automated architecture optimization framework presented at ICLR ’25. STAR combines evolutionary principles with linear systems theory to efficiently navigate architectural search spaces towards optimal trade-offs between efficiency and quality.
We initialized STAR with a population of 16 candidate architectures and evolved them over 24 generations. The search space included multiple variants of convolutional operators inspired by Hyena (Poli et al., 2023; Ku et al., 2025):

- Hyena (Full): includes convolutions in the gating mechanism alongside Hyena’s inner convolution.
- Hyena-X (Chandrasegaran et al., 2025): excludes the inner convolution.
- Hyena-Y (Chandrasegaran et al., 2025): excludes convolutions in the feature groups (gates).

In addition to spanning these three Hyena types (sketched in code below), we also varied the length of their learned short, explicit (SE) convolution filters (3 to 128), resulting in a total of 18 convolutional operators. The search space further included variants of GQA (with varying numbers of KV heads; Shazeer, 2019) and SwiGLU (with varying inner widths; Shazeer, 2020).
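To make the distinction between the three variants concrete, here is a minimal PyTorch sketch of a gated short-convolution block in which the inner convolution and the gate-branch convolutions can be toggled independently. The module name, the three-way input projection, and the use of causal depthwise Conv1d layers for the SE filters are illustrative assumptions, not Liquid AI's implementation.

```python
import torch
import torch.nn as nn


def depthwise_conv(dim: int, kernel_size: int) -> nn.Conv1d:
    # Short explicit (SE) filter: a causal depthwise 1-D convolution.
    return nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)


class HyenaVariantBlock(nn.Module):
    """variant='full': gate convs + inner conv; 'x': no inner conv;
    'y': no convs on the gate branches."""

    def __init__(self, dim: int, se_len: int = 7, variant: str = "full"):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)   # split into gate1, gate2, value
        self.out_proj = nn.Linear(dim, dim)
        use_gate_convs = variant != "y"
        use_inner_conv = variant != "x"
        self.gate_conv1 = depthwise_conv(dim, se_len) if use_gate_convs else None
        self.gate_conv2 = depthwise_conv(dim, se_len) if use_gate_convs else None
        self.inner_conv = depthwise_conv(dim, se_len) if use_inner_conv else None

    def forward(self, u: torch.Tensor) -> torch.Tensor:  # u: (batch, seq, dim)
        T = u.shape[1]
        g1, g2, v = self.in_proj(u).chunk(3, dim=-1)
        g1, g2, v = (t.transpose(1, 2) for t in (g1, g2, v))  # -> (batch, dim, seq)
        if self.gate_conv1 is not None:
            g1 = self.gate_conv1(g1)[..., :T]   # trim padding to keep causality
            g2 = self.gate_conv2(g2)[..., :T]
        y = g1 * v                               # first elementwise gate
        if self.inner_conv is not None:
            y = self.inner_conv(y)[..., :T]      # Hyena's "inner" convolution
        y = g2 * y                               # second elementwise gate
        return self.out_proj(y.transpose(1, 2))
```

For example, `HyenaVariantBlock(512, se_len=7, variant="y")` keeps only the inner convolution and drops the gate-branch convolutions, which corresponds to the Hyena-Y configuration described above.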
STAR iteratively evolves the population of architectures towards the efficiency-quality frontier for latency, memory usage, and model quality, informed by initial profiling of individual operator latencies and memory usage on the S24 Ultra and by perplexity during training¹.
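For intuition, the loop below shows the kind of multi-objective evolutionary selection this involves. It is a simplified sketch, not the published STAR algorithm: the architecture encoding, the mutation operator, and the profile_on_device and train_and_eval_ppl callables (standing in for on-device latency/memory profiling and training-perplexity evaluation) are hypothetical placeholders.

```python
import random

POP_SIZE, GENERATIONS = 16, 24


def dominates(a, b):
    # a, b are (latency, memory, perplexity) tuples; lower is better on every axis.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def evolve(init_population, mutate, profile_on_device, train_and_eval_ppl):
    population = list(init_population)
    for _ in range(GENERATIONS):
        # Score each candidate on (latency, memory, perplexity).
        scored = [(arch, (*profile_on_device(arch), train_and_eval_ppl(arch)))
                  for arch in population]
        # Keep the non-dominated set: the current efficiency-quality frontier.
        frontier = [a for a, s in scored
                    if not any(dominates(t, s) for _, t in scored)]
        # Refill the population by mutating members of the frontier.
        population = frontier + [mutate(random.choice(frontier))
                                 for _ in range(POP_SIZE - len(frontier))]
    return frontier
```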
Interestingly, STAR progressively favored Hyena-Y convolutions as architectures approached the efficiency-quality frontier, indicating that this variant strikes the best balance across our latency, memory, and quality metrics. Leveraging this insight, our final Hyena Edge architecture strategically replaces two-thirds of the GQA operators of a state-of-the-art GQA-Transformer++ baseline with optimized gated convolutions from the Hyena-Y family.
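As a toy illustration of what replacing two-thirds of the GQA operators can look like at the block-stacking level, the snippet below lays out a simple repeating pattern; the actual interleaving in Hyena Edge is chosen by STAR and is not reproduced here.

```python
def hybrid_layout(num_layers: int = 24) -> list[str]:
    # Keep every third block as GQA attention; use Hyena-Y gated convolutions elsewhere.
    return ["gqa" if i % 3 == 2 else "hyena_y" for i in range(num_layers)]


print(hybrid_layout(6))  # ['hyena_y', 'hyena_y', 'gqa', 'hyena_y', 'hyena_y', 'gqa']
```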
We evaluated Hyena Edge's performance against a parameter-matched GQA-Transformer++ baseline, focusing on latency, memory usage, and language modeling benchmarks after training both models on the same set of 100 billion tokens.
Hyena Edge outperforms the Transformer-based baseline throughout.
Efficiency: Hyena Edge consistently achieved lower prefill and decode latencies on the Samsung S24 Ultra. Prefill latencies were notably lower across common sequence lengths, while decode latencies were lower for sequences longer than 256 tokens. Importantly, Hyena Edge's latency scales better with sequence length, with up to 30% lower decode and prefill latencies at longer sequences, and lower prefill latencies even at the shortest sequence lengths. This is a milestone for alternative architectures, many of which improve in latency only at significantly longer sequences. In addition, Hyena Edge uses less memory during deployment than the GQA-Transformer++ baseline across all sequence lengths.
Model Quality: Across common language modeling benchmarks for small language models, including WikiText, LAMBADA, HellaSwag, Winogrande, PIQA, ARC-Easy, and ARC-Challenge, Hyena Edge consistently outperformed the GQA-Transformer++ baseline.
Hyena Edge marks a step forward in AI edge deployment. By demonstrating that convolution-based multi-hybrid architectures can outperform traditional Transformer models on key efficiency and quality metrics for edge devices, we open the door to broader adoption of alternative computational primitives optimized for practical edge applications.
References: