Takeaways

  • We unveil LFM-7B, the best-performing model in its size class on the market.
  • LFM-7B uses a non-transformer Liquid Foundation Model architecture, delivering high throughput with the lowest memory footprint in its class.
  • LFM-7B is the natural choice of a language model for local deployment and for latency-bound or cost-constrained tasks.
  • LFM-7B offers best-in-class multilingual performance in English, Arabic, and Japanese.
  • Try LFM-7B today on Liquid Playground, and soon on OpenRouter, Perplexity Playground, Lambda API, and AWS Marketplace.
  • LFM-7B comes with inference and customization stacks for enterprises. Get in touch with us to learn more.


Chat Capabilities

LFM-7B is specifically optimized for response quality, accuracy, and usefulness. To assess its chat capabilities, we use a diverse jury of frontier LLMs to compare responses generated by LFM-7B against those of other models in the 7B-8B parameter category. Relying on a jury rather than a single judge reduces individual biases and produces more reliable comparisons.
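
As a rough illustration of the protocol, the toy sketch below aggregates pairwise preferences by majority vote. The judge here is a deliberately naive stand-in, not our actual jury of frontier LLMs, which scores answers against a rubric via API calls.

```python
# Toy sketch of an LLM-as-a-jury pairwise evaluation. Real judges would be
# frontier-LLM API calls scoring accuracy, helpfulness, and response quality;
# here each judge is just a callable returning "A" or "B".
from collections import Counter
from typing import Callable

Judge = Callable[[str, str, str], str]  # (prompt, answer_a, answer_b) -> "A" | "B"

def jury_vote(judges: list[Judge], prompt: str, a: str, b: str) -> str:
    """Majority vote across several judges, reducing any single judge's bias."""
    votes = Counter(judge(prompt, a, b) for judge in judges)
    return votes.most_common(1)[0][0]

def win_rate(judges: list[Judge], rows: list[tuple[str, str, str]]) -> float:
    """Share of prompts where the jury preferred answer A (head-to-head style)."""
    wins = sum(jury_vote(judges, p, a, b) == "A" for p, a, b in rows)
    return wins / len(rows)

# Demo with a deliberately naive stand-in judge (prefers the longer answer).
toy_judge: Judge = lambda p, a, b: "A" if len(a) >= len(b) else "B"
rows = [("What is 2+2?", "4. Because 2+2=4.", "4")]
print(win_rate([toy_judge] * 3, rows))  # -> 1.0
```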

We compared answers to English prompts drawn from curated business use cases (such as instruction following), questions from Arena-Hard-Auto (Li et al.), and real-world conversations (Zheng et al.). Thanks to our comprehensive preference alignment process, LFM-7B outperforms every LLM in the same size category.

Fig. 1. LLM-as-a-jury evaluation of chat capabilities in English.

The following head-to-head evaluation shows the proportion of times the LLM jury preferred answers generated by LFM-7B over those from other models, using the exact same set of English prompts.

Fig. 2. Head-to-head evaluation of chat capabilities in English.

Automated Benchmarks

LFM-7B retains the expansive knowledge and reasoning capabilities of our other models. In addition to its enhanced conversational skills, it also showcases improved coding and instruction-following abilities.

Fig. 3. Average score across thirteen automated benchmarks (MMLU, HellaSwag, ARC-C, TruthfulQA, IFEval, MMLU-Pro, MATH Lvl 5, GPQA, MuSR, HumanEval, HumanEval+, MBPP, MBPP+).

The following scores were obtained on standard automated benchmarks using EleutherAI's Language Model Evaluation Harness v0.4.5. We only compare post-trained models.
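
A run of this kind can be reproduced with the harness's Python entry point. The sketch below shows the MMLU row as an example; the Hugging Face model identifier is a placeholder, not an official repository name.

```python
# Sketch: reproducing the MMLU row of Table 1 with EleutherAI's
# lm-evaluation-harness (v0.4.5). The model identifier is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face transformers backend
    model_args="pretrained=your-org/your-7b-model,dtype=bfloat16",  # placeholder id
    tasks=["mmlu"],                              # one task from Table 1
    num_fewshot=5,                               # matches the 5-shot setting
)
print(results["results"]["mmlu"])                # aggregated MMLU metrics
```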

| Benchmark | LFM-7B (Liquid AI) 7.7B | Ministral (Mistral AI) 8.0B | Llama 3.1 (Meta) 8.0B | Command R7B (Cohere) 8.0B | Qwen 2.5 (Alibaba) 7.6B | OLMo 2 (AI2) 7.3B |
|---|---|---|---|---|---|---|
| Context length (tokens) | 32k | 128k | 128k | 128k | 128k | 4k |
| MMLU (5-shot) | 69.34 | 64.66 | 67.92 | 70.44 | 74.31 | 62.18 |
| HellaSwag (10-shot) | 83.07 | 80.58 | 80.00 | 80.53 | 81.37 | 85.77 |
| ARC-C (25-shot) | 70.56 | 61.77 | 60.58 | 66.55 | 67.24 | 68.09 |
| TruthfulQA (0-shot) | 63.89 | 48.65 | 54.02 | 55.38 | 64.76 | 54.50 |
| IFEval (0-shot) | 60.72 | 29.17 | 50.70 | 34.56 | 63.71 | 59.26 |
| MMLU-Pro (5-shot) | 42.42 | 35.04 | 37.72 | 36.55 | 44.65 | 29.66 |
| MATH Lvl 5 (4-shot) | 21.42 | 13.62 | 11.77 | 19.07 | 23.77 | 9.82 |
| GPQA (0-shot) | 32.29 | 31.01 | 33.26 | 29.55 | 32.45 | 28.53 |
| MuSR (0-shot) | 40.79 | 42.75 | 39.72 | 43.33 | 42.90 | 39.44 |
| HumanEval (pass@1) | 63.41 | 25.61 | 64.02 | 55.49 | 26.83 | 41.46 |
| HumanEval+ (pass@1) | 56.71 | 24.39 | 59.15 | 48.78 | 23.17 | 37.80 |
| MBPP (pass@1) | 51.60 | 31.60 | 52.20 | 51.20 | 50.80 | 26.00 |
| MBPP+ (pass@1) | 55.56 | 45.24 | 57.41 | 61.64 | 52.91 | 36.51 |
Table 1. Performance of LLMs across automated benchmarks.

Multilingual Capabilities

LFM-7B supports English, Spanish, French, German, Chinese, Arabic, Japanese, and Korean. While evaluating our models, we observed that automated benchmarks like MMMLU add confounding factors (e.g., world knowledge) and do not require any writing skills in the target language. On the other hand, arena evaluations specifically focus on producing grammatically correct and relevant answers. This is why we built language-specific arenas in Arabic and Japanese to assess the quality of models in a fair and relevant manner.

For the Arabic arena, we use a curated subset of real-world conversations (Zheng et al.) in Arabic. LFM-7B is fluent in Arabic and significantly preferred over other models in the same size category.

Fig. 4. LLM-as-a-jury evaluation of chat capabilities in Arabic.

For the Japanese arena, we use a combination of ELYZA-tasks-100 (Sasaki et al.) and real-world prompts curated by our partner ITOCHU-CTC. This creates a diverse set of prompts representative of business use cases. LFM-7B also leads our Japanese arena by a significant margin.

Fig. 5. LLM-as-a-jury evaluation of chat capabilities in Japanese.

Memory Efficiency

Like our previous models, LFM-7B has a minimal memory footprint compared to models built on other architectures.

Fig. 6. Memory requirements for language model inference for different models as a function of combined input and generation sequence length. All models use bfloat16 precision without quantization. LFM-7B offers significant memory savings over other models. Memory usage can be reduced further through quantization techniques.

The memory efficiency of LFM-7B allows for several key features, including long-context understanding, energy-efficient inference, and high-throughput deployments on local devices. LFM-7B can also be efficiently customized to any knowledge or task using our on-premise fine-tuning stack. Consequently, LFM-7B significantly increases value for end users in applications such as private enterprise chat, secure code generation, fast instruction following, long document analysis, energy-efficient on-device AI assistants, and multi-step agentic workflows.
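
To see where the gap in Fig. 6 comes from, consider the KV cache that a comparable attention-based model must keep per token of context. The back-of-the-envelope sketch below uses Llama 3.1 8B's published shapes (32 layers, 8 KV heads, head dimension 128) as the reference point.

```python
# Back-of-the-envelope KV-cache growth for a grouped-query-attention
# transformer, using Llama 3.1 8B's published shapes as the reference point.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2  # bfloat16 = 2 bytes

# Each token stores one key and one value vector per layer and per KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(bytes_per_token / 1024)  # -> 128.0 KiB of cache per token of context

for seq_len in (4_096, 32_768, 131_072):
    gib = seq_len * bytes_per_token / 2**30
    print(f"{seq_len:>7} tokens -> {gib:4.1f} GiB of KV cache")
# -> 0.5 GiB at 4k, 4.0 GiB at 32k, 16.0 GiB at 128k, on top of ~15 GiB of
# bfloat16 weights. A fixed-size recurrent state avoids this per-token growth,
# which is the gap Fig. 6 illustrates.
```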

Beyond processing long inputs efficiently, LFM-7B can also retrieve from and reason over long contexts effectively. We validated this across all stages of development via our specialized internal long-context evals, and additionally via two public long-context evals: RULER (Hsieh et al.) and LongBench v2 (Bai et al.). In RULER, a context length is considered "effective" when its corresponding score is higher than 85.6; by this criterion, LFM-7B has an effective context length of 32k.

| Model | LongBench v2 | Claimed length | Effective length | RULER 4k | RULER 8k | RULER 16k | RULER 32k | RULER 64k |
|---|---|---|---|---|---|---|---|---|
| Ministral (Mistral AI) 8.0B | 26.1 | 128k | 32k | 96.0 | 93.5 | 90.6 | 86.4 | 37.0 |
| Llama 3.1 (Meta) 8.0B | 35.0 | 128k | 32k | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 |
| Qwen 2.5 (Alibaba) 7.6B | 36.1 | 128k | 32k | 95.3 | 93.0 | 92.2 | 90.2 | 74.5 |
| LFM-7B (Liquid AI) 7.7B | 36.1 | 32k | 32k | 91.3 | 89.2 | 87.7 | 88.5 | - |
Table 2. Long-context performance measured by LongBench v2 and RULER.
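
Applying the 85.6 threshold mechanically to the scores in Table 2 reproduces the "Effective length" column; the sketch below simply transcribes the table's numbers.

```python
# Effective context length per RULER: the longest tested length at which the
# score still exceeds the 85.6 threshold (scores transcribed from Table 2).
THRESHOLD = 85.6

ruler_scores = {
    "Ministral 8.0B": {4_096: 96.0, 8_192: 93.5, 16_384: 90.6, 32_768: 86.4, 65_536: 37.0},
    "Llama 3.1 8.0B": {4_096: 95.5, 8_192: 93.8, 16_384: 91.6, 32_768: 87.4, 65_536: 84.7},
    "Qwen 2.5 7.6B":  {4_096: 95.3, 8_192: 93.0, 16_384: 92.2, 32_768: 90.2, 65_536: 74.5},
    "LFM-7B 7.7B":    {4_096: 91.3, 8_192: 89.2, 16_384: 87.7, 32_768: 88.5},  # 64k untested
}

def effective_length(scores: dict[int, float]) -> int:
    """Largest length such that every tested length up to it clears the threshold."""
    effective = 0
    for length in sorted(scores):
        if scores[length] > THRESHOLD:
            effective = length
        else:
            break
    return effective

for model, scores in ruler_scores.items():
    print(f"{model}: effective length {effective_length(scores) // 1024}k")
# All four models come out at 32k, matching the "Effective length" column.
```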

Partner With Liquid

To chat with LFMs, go to Playground.liquid.ai.

Coming soon:

  • For testing our models via API, get in touch with us or try them on Lambda API.
  • To build with our models via API, go to OpenRouter (see the sketch after this list).
  • For enterprise usage via API, go to AWS Marketplace.
  • If you like our model and want to license or purchase it for on-device or on-prem applications, contact us.
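
Once the OpenRouter listing is live, calling LFM-7B could look like the sketch below. It relies on OpenRouter's OpenAI-compatible endpoint; the model identifier is our assumption, not a confirmed listing.

```python
# Sketch: querying LFM-7B through OpenRouter's OpenAI-compatible API.
# The model id "liquid/lfm-7b" is an assumption pending the official listing.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="liquid/lfm-7b",  # assumed identifier, check the live listing
    messages=[{"role": "user", "content": "Summarize the key risks in this clause: ..."}],
)
print(response.choices[0].message.content)
```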

Sales

If your enterprise has use cases that need the efficient and high-throughput performance of our LFMs in order to do more with less, get in touch with us to discuss licensing or purchasing our models.

Talent

If our mission aligns with your personal goals and ambitions, we invite you to join our team and drive this vision forward. We are very early on this journey and actively innovating across various aspects of foundation model development and deployment.

Feedback

We invite enthusiastic users to share their experiences and criticism, and to join our red-teaming efforts to continuously refine the capabilities of our models. Send your feedback here.

FAQ

  • As an enterprise, can we purchase full local access to LFMs?
  • Can we fine-tune LFMs?
  • What languages does LFM-7B support?
  • Where can I learn more about Liquid Foundation Models?

