Today, we report advances in automated neural network architecture design and customization. We developed algorithms for the synthesis of tailored architectures (STAR): evolutionary algorithms applied to a numerical representation of model architectures derived from a new design theory. STAR automates the process of architecture discovery and optimization, turning it into an end-to-end process. With these methods, we have been able to tailor architectures to custom tasks, metrics, and hardware. We used STAR to synthesize hundreds of designs that outperform strong Transformer and hybrid architectures in quality, with smaller caches and fewer parameters.
Model architecture design is a fundamental pillar of AI, shaping everything from scaling capabilities and efficiency to the foundations of pretraining, alignment, and inference. A critical challenge in architecture design is balancing quality with hardware constraints - particularly latency and memory costs - to ensure AI systems can be deployed effectively across different environments.
Designing performant architectures remains a highly nontrivial combinatorial problem, even when the search is restricted to models optimized to run fast on GPUs alone. This complexity has often led AI labs and companies to commit to particular designs early, as manual heuristics are limited in their ability to predict performance trade-offs. Moreover, requirements can vary greatly across application domains: for example, a language model designed for edge use cases should have a small memory footprint, low energy consumption, and good performance on the specific target hardware (e.g., fast prefill on CPU), whereas a language model designed for cloud usage typically prioritizes quality and GPU latency over memory footprint. In practice, demands get even more nuanced and complex.
The foundation for an architecture’s performance is laid by the computational units it is built from and how these units are interconnected. Most current deep learning architectures are built by sequentially interleaving attention operators and gated linear units. These layers are simple examples of a much larger class of computational units, which we call linear input-varying systems (LIVs). LIVs are structured operators whose action is modulated pointwise by the input itself, and they provide abstractions that generalize diverse classes of computational units such as attention variants, linear attention, (gated) convolutions, (gated) recurrences with linear state transitions, state-space layers, and (gated) linear units.
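To make the LIV view concrete, here is a minimal PyTorch sketch, under our own simplifications, of two familiar layers read as LIVs: a gated linear unit, where a channel-mixing linear map is gated elementwise by features of the input, and an unmasked attention-style token mixer, where the mixing matrix itself is built from the input. The class names and simplifications are illustrative, not taken from any STAR code.

```python
import torch
import torch.nn as nn

class GatedLinearUnit(nn.Module):
    """A simple LIV: a channel-mixing linear map whose action is
    modulated elementwise by features computed from the same input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        return self.value(x) * torch.sigmoid(self.gate(x))

class InputVaryingTokenMixer(nn.Module):
    """Another LIV: a token-mixing operator y = T(x) @ x whose mixing
    matrix T(x) depends on the input. Softmax attention is one way to
    build T(x); gated convolutions and linear recurrences correspond to
    differently structured, input-dependent choices of T."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        T = torch.softmax(scores, dim=-1)  # input-dependent mixing matrix
        return T @ x
```

Both units share the same template, a linear operator whose entries are functions of the input, which is what makes a single design space over all of them possible.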
Evidence already exists for the potential of "beyond Transformer" architectures. Models combining simple LIVs, such as gated convolutions and recurrences, with self-attention in striped hybrid patterns have demonstrated modest quality improvements, more efficient scaling to longer sequences, and faster inference (earlier this year, we reported the first scaling laws for hybrids).
While new computational units and interconnection strategies open up a new frontier of model performance, they also pose a key challenge for architecture design: the number of possible designs is vast. Rather than relying on manual optimization and heuristics applied to specific sub-classes of computational units (e.g., attention and convolution) or interconnection strategies (weight sharing, KV-sharing, parallel interconnection), we leverage evolutionary algorithms tailored to LIVs.
One of the core innovations of STAR is representing model architectures as hierarchical numeric sequences called STAR genomes, which we evolve using principles from evolutionary optimization. The process is iterative: we compile genomes into concrete architectures, evaluate them, then select and recombine the best-performing architectures to create the next generation.
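Schematically, the loop can be pictured as the sketch below; every helper passed in is a placeholder standing in for STAR's actual compilation, evaluation, and recombination operators, which are described in the paper.

```python
import random

def evolve(initial_genomes, num_generations, population_size,
           compile_genome, evaluate, crossover, mutate):
    """Schematic evolutionary loop: compile genomes into architectures,
    evaluate them, then select and recombine the best performers.
    All function arguments are illustrative placeholders."""
    population = list(initial_genomes)
    for _ in range(num_generations):
        # Compile each genome into a concrete architecture and score it.
        scored = [(evaluate(compile_genome(g)), g) for g in population]
        scored.sort(key=lambda pair: pair[0])  # lower score is better here
        parents = [g for _, g in scored[: population_size // 2]]

        # Recombine and mutate the best performers to refill the population.
        children = []
        while len(parents) + len(children) < population_size:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return population
```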
Importantly, the evolutionary process can be guided by both static and dynamic objectives. Static objectives are given by the specific configuration of an architecture, such as its parameter count or cache size. Dynamic objectives, on the other hand, require evaluating the architecture, for example by measuring its perplexity after training on a given dataset, or its latency on the target hardware.
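As a rough sketch of how such a mixed objective could be scored, assuming the compiled architecture behaves like a standard PyTorch module: the helpers `kv_cache_size`, `train_and_eval_ppl`, and `profile_latency_ms` are hypothetical stand-ins for the corresponding measurements.

```python
def fitness(genome, compile_genome, train_and_eval_ppl, profile_latency_ms):
    """Illustrative mixed objective: static terms are read off the compiled
    architecture's configuration, dynamic terms require running it."""
    arch = compile_genome(genome)  # assumed to return a torch.nn.Module

    # Static objectives: available without running the model.
    num_params = sum(p.numel() for p in arch.parameters())
    cache_bytes = arch.kv_cache_size()  # hypothetical helper on the model

    # Dynamic objectives: require evaluating the architecture.
    perplexity = train_and_eval_ppl(arch)  # e.g., a short pretraining run
    latency_ms = profile_latency_ms(arch)  # measured on the target hardware

    return (perplexity, num_params, cache_bytes, latency_ms)
```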
To ensure architecture candidates are novel and performant, the genome encoding relies on our design theory of LIVs, the general class of computational units introduced above. We have identified and built abstractions around the fundamental mechanisms that govern how modern computational units in deep learning modulate their computation based on input context. The framework, grounded in tensor networks and system theory, characterizes LIVs through two key aspects: their structure (the token- and channel-mixing structure of the operator) and their featurization (the functional form of the operator's input dependence). To represent sophisticated architecture designs, our framework treats operator composition as a first-class concern, opening new pathways for building architectures beyond the sequential stacking of layers. More details on our design theory will follow.
The STAR genome allows us to map the LIV design space to a hierarchical numerical encoding suitable for evolutionary optimization. It defines the characteristics of each computational unit employed by the encoded architecture as well as how these units are interconnected.
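Purely as an illustration (the real STAR genome hierarchy is richer and fully specified in the paper), one can picture the encoding as nested integers: a backbone level describing depth and interconnections, and a unit level indexing each LIV's structure and featurization, flattened into the numeric sequence that mutation and crossover act on.

```python
# Illustrative vocabularies; the LIV taxonomy in the paper is more detailed.
TOKEN_MIXING = ["attention", "gated_conv", "gated_recurrence", "identity"]
CHANNEL_MIXING = ["glu", "dense_mlp", "grouped"]

example_genome = {
    "backbone": {
        "num_blocks": 8,
        # Which earlier block each block shares state/weights with
        # (-1 means no sharing); patterns such as KV-sharing live here.
        "sharing": [-1, -1, 1, -1, 3, -1, 5, -1],
    },
    "units": [
        # (token_mixing_id, channel_mixing_id, featurization knobs ...)
        (0, 0, 8, 64),  # attention + GLU, 8 heads, head dimension 64
        (1, 0, 4, 3),   # gated conv + GLU, 4 groups, kernel size 3
    ],
}

def flatten(genome):
    """Serialize the hierarchy into the flat numeric sequence that
    mutation and crossover operate on."""
    flat = [genome["backbone"]["num_blocks"], *genome["backbone"]["sharing"]]
    for unit in genome["units"]:
        flat.extend(unit)
    return flat
```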
We started by evaluating STAR in the design of improved language modeling architectures, optimizing for three mixtures of objectives: (i) quality (perplexity after training), (ii) quality and parameter efficiency, and (iii) quality and cache efficiency.
After as few as two or three rounds of evolution, most of the architectures outperform staples such as Transformers and strong hybrid baselines, with consistent improvements as more rounds are executed. In particular, when optimizing for quality only, we find that all evaluated STAR-evolved architectures outperform attention-recurrence hybrids on downstream evaluation benchmarks, with improvements twice as large as those of hybrids over Transformers. We take this result as strong evidence for the effectiveness of evolutionary search across our design space: hybrids have been designed and refined manually with a significant investment of resources, whereas STAR can generate architectures in less than a day, with a >90% hit rate.
The search also supports multi-objective optimization. When jointly optimizing for quality and model size, the evolved architectures consistently outperform both Transformers and striped hybrids while simultaneously reducing parameter counts, allowing us to shrink models for edge and resource-constrained environments. We experimented with different methods to transfer results from evolution and evaluation across scales, and generally find that optimizing on thin, deep architecture candidates yields better transfer than optimizing architecture motifs at the target width.
In a similar vein, we have used STAR to balance quality, model size, and latency on target hardware, with latency obtained by profiling directly on the inference stack. This is possible because STAR does not require gradients of the metrics: it is compatible with any mixture of static and dynamic metrics computed on the architecture compiled from the genome, including detailed profiling passes to minimize latency and communication overheads.
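A dynamic hardware metric can be as simple as measured wall-clock latency. The sketch below shows one way such a measurement could be collected for a compiled candidate; the shapes, warmup counts, and timing protocol are illustrative, not the profiling passes used in STAR.

```python
import time
import torch

def measure_latency_ms(model, seq_len=2048, d_model=1024,
                       n_warmup=3, n_runs=10, device="cuda"):
    """Gradient-free dynamic metric: average forward-pass latency of a
    compiled architecture, measured directly on the target device."""
    model = model.to(device).eval()
    x = torch.randn(1, seq_len, d_model, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):  # warm up kernels and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1e3  # milliseconds
```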
Beyond optimizing architectures for specific objectives, STAR also provides an analysis tool to identify recurring architecture motifs that emerge during evolution and drive the observed performance gains. Interestingly, previously proposed manual interconnection patterns, such as KV-sharing and some forms of weight sharing, emerge naturally, alongside completely new ones.
The capabilities we've demonstrated with STAR only hint at its full potential. Thanks to the ability to optimize any mixture of metrics, combined with the versatility of LIVs, we're witnessing continuous improvements in both the diversity and quality of synthesized designs. With an improved understanding of which patterns and objectives co-occur, we now look to further refine STAR’s evolutionary algorithm and initial populations so that every generation is better than the last, including running optimization at the lowest levels of the genome hierarchy. We are also interested in applying a similar methodology to other domains where modular design spaces can be constructed.
For all the details, refer to the paper: “STAR: Synthesis of Tailored Architectures”.