Artificial Intelligence is rapidly becoming ubiquitous, powering applications ranging from large-scale cloud deployments to low-resource edge devices like smartphones and laptops. Despite impressive advancements, most current small models optimized for edge deployment, such as SmolLM2 (Allal et al., 2025), the Phi models (Abdin et al., 2024), and Llama 3.2 1B (Grattafiori et al., 2024), predominantly rely on Transformer architectures built around attention operators, owing to their parallelizable computations and efficient kernels (Vaswani et al., 2017). Optimizing architectures, even for GPUs, is exceptionally challenging. While hybrid architectures have been shown to deliver quality improvements, they are often slower at inference than highly optimized Transformers, particularly in regimes critical to edge deployment, such as short prompts. This underscores the importance of jointly optimizing model architecture and inference runtime, with performance metrics tailored to the target hardware.
Today, we introduce a Liquid architecture called Hyena Edge, a convolution-based multi-hybrid model that outperforms strong Transformer-based baselines in both computational efficiency and model quality on edge hardware, benchmarked on the Samsung S24 Ultra smartphone. To design Hyena Edge, we used our recently proposed end-to-end automated model design framework.
We plan to open-source a series of Liquid foundation models in the coming months, including Hyena Edge. Stay tuned as we continue pushing the boundaries of what's possible at the AI edge.
To systematically explore and optimize our architecture, we used STAR (Thomas et al., 2024), our recently introduced automated architecture optimization framework presented at ICLR ’25. STAR combines evolutionary principles with linear systems theory to efficiently navigate architectural search spaces towards optimal trade-offs between efficiency and quality.
We initialized STAR with a population of 16 candidate architectures and evolved them over 24 generations. The search space included multiple variants of convolutional operators inspired by Hyena (Poli et al., 2023; Ku et al., 2025):

- Hyena (Full): includes convolutions in the gating mechanism alongside Hyena’s inner convolution.
- Hyena-X (Chandrasegaran et al., 2025): excludes the inner convolution.
- Hyena-Y (Chandrasegaran et al., 2025): excludes convolutions in the feature groups (gates).

In addition to spanning these three Hyena types (sketched in code below), we also varied the length of their learned short, explicit (SE) convolution filters (3 to 128), resulting in a total of 18 convolutional operators. The search space further included variants of GQA (with varying numbers of KV heads; Shazeer, 2019) and SwiGLU (with varying inner widths; Shazeer, 2020).
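To make the distinction between the three variants concrete, here is a minimal PyTorch sketch of a gated short-convolution block in which the inner convolution and the gate-branch convolutions can be toggled independently. The module name, the three-way input projection, and the use of causal depthwise Conv1d layers for the SE filters are illustrative assumptions, not Liquid AI's implementation.

```python
import torch
import torch.nn as nn


def depthwise_conv(dim: int, kernel_size: int) -> nn.Conv1d:
    # Short explicit (SE) filter: a causal depthwise 1-D convolution.
    return nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)


class HyenaVariantBlock(nn.Module):
    """variant='full': gate convs + inner conv; 'x': no inner conv;
    'y': no convs on the gate branches."""

    def __init__(self, dim: int, se_len: int = 7, variant: str = "full"):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)   # split into gate1, gate2, value
        self.out_proj = nn.Linear(dim, dim)
        use_gate_convs = variant != "y"
        use_inner_conv = variant != "x"
        self.gate_conv1 = depthwise_conv(dim, se_len) if use_gate_convs else None
        self.gate_conv2 = depthwise_conv(dim, se_len) if use_gate_convs else None
        self.inner_conv = depthwise_conv(dim, se_len) if use_inner_conv else None

    def forward(self, u: torch.Tensor) -> torch.Tensor:  # u: (batch, seq, dim)
        T = u.shape[1]
        g1, g2, v = self.in_proj(u).chunk(3, dim=-1)
        g1, g2, v = (t.transpose(1, 2) for t in (g1, g2, v))  # -> (batch, dim, seq)
        if self.gate_conv1 is not None:
            g1 = self.gate_conv1(g1)[..., :T]   # trim padding to keep causality
            g2 = self.gate_conv2(g2)[..., :T]
        y = g1 * v                               # first elementwise gate
        if self.inner_conv is not None:
            y = self.inner_conv(y)[..., :T]      # Hyena's "inner" convolution
        y = g2 * y                               # second elementwise gate
        return self.out_proj(y.transpose(1, 2))
```

For example, `HyenaVariantBlock(512, se_len=7, variant="y")` keeps only the inner convolution and drops the gate-branch convolutions, which corresponds to the Hyena-Y configuration described above.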
STAR iteratively evolves the population of architectures towards the efficiency-quality frontier for latency, memory usage, and model quality, informed by initial profiling of individual operator latencies and memory usage on the S24 Ultra and by perplexity during training¹.
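For intuition, the loop below shows the kind of multi-objective evolutionary selection this involves. It is a simplified sketch, not the published STAR algorithm: the architecture encoding, the mutation operator, and the profile_on_device and train_and_eval_ppl callables (standing in for on-device latency/memory profiling and training-perplexity evaluation) are hypothetical placeholders.

```python
import random

POP_SIZE, GENERATIONS = 16, 24


def dominates(a, b):
    # a, b are (latency, memory, perplexity) tuples; lower is better on every axis.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def evolve(init_population, mutate, profile_on_device, train_and_eval_ppl):
    population = list(init_population)
    for _ in range(GENERATIONS):
        # Score each candidate on (latency, memory, perplexity).
        scored = [(arch, (*profile_on_device(arch), train_and_eval_ppl(arch)))
                  for arch in population]
        # Keep the non-dominated set: the current efficiency-quality frontier.
        frontier = [a for a, s in scored
                    if not any(dominates(t, s) for _, t in scored)]
        # Refill the population by mutating members of the frontier.
        population = frontier + [mutate(random.choice(frontier))
                                 for _ in range(POP_SIZE - len(frontier))]
    return frontier
```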
Interestingly, STAR progressively favored Hyena-Y convolutions as architectures approached the efficiency-quality frontier, indicating that this variant strikes the best balance across our latency, memory, and quality metrics. Leveraging this insight, our final Hyena Edge architecture strategically replaces two-thirds of the GQA operators of a state-of-the-art GQA-Transformer++ baseline with optimized gated convolutions from the Hyena-Y family.
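As a toy illustration of what replacing two-thirds of the GQA operators can look like at the block-stacking level, the snippet below lays out a simple repeating pattern; the actual interleaving in Hyena Edge is chosen by STAR and is not reproduced here.

```python
def hybrid_layout(num_layers: int = 24) -> list[str]:
    # Keep every third block as GQA attention; use Hyena-Y gated convolutions elsewhere.
    return ["gqa" if i % 3 == 2 else "hyena_y" for i in range(num_layers)]


print(hybrid_layout(6))  # ['hyena_y', 'hyena_y', 'gqa', 'hyena_y', 'hyena_y', 'gqa']
```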
We evaluated Hyena Edge's performance against a parameter-matched GQA-Transformer++ baseline, focusing on latency, memory usage, and language modeling benchmarks after training both models on the same set of 100 billion tokens.
Hyena Edge outperforms the Transformer-based baseline throughout.
Efficiency: Hyena Edge consistently achieved lower prefill and decode latencies on the Samsung S24 Ultra. Prefill latencies were notably lower across common sequence lengths, while decode latencies were lower for sequences longer than 256 tokens. Importantly, Hyena Edge's latency scales better with sequence length, with up to 30% lower decode and prefill latencies at longer sequences, and lower prefill latencies even at the shortest sequence lengths. This is a milestone for alternative architectures, many of which improve in latency only at significantly longer sequences. In addition, Hyena Edge uses less memory during deployment than the GQA-Transformer++ baseline across all sequence lengths.
Model Quality: Across common language modeling benchmarks for small language models, including WikiText, LAMBADA, HellaSwag, Winogrande, PIQA, ARC-Easy, and ARC-Challenge, Hyena Edge consistently outperformed the GQA-Transformer++ baseline.
Hyena Edge marks a step forward in AI edge deployment. By demonstrating that convolution-based multi-hybrid architectures can outperform traditional Transformer models on key efficiency and quality metrics for edge devices, we open the door to broader adoption of alternative computational primitives optimized for practical edge applications.
References: