⚛️ Monad
Monad is a 56-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
As of 2025, Monad is the best contender for the smallest viable language model. Despite being less than half the size of GPT-2, Monad not only answers in consistent English but also performs significantly above chance on MMLU and other major industry benchmarks.
Monad's name is a reference to Leibniz’ concept and the general idea of the smallest possible unit of intelligence.
🔧 Compatibility Note (Chat Template Fix)
Important
This repository hosts the same model as
👉 https://huggingface.co/PleIAs/Monad
There are no architectural changes, no re-training, and no weight modifications.
The only difference is that this version includes a fixed / updated chat template compatible with newer versions of:
- 🤗 transformers
- trl
- lm-eval

This resolves recent breaking changes around chat templating and instruction formatting, ensuring correct behavior when using `tokenizer.apply_chat_template(...)`.

If you are using older versions of the Hugging Face stack, both repositories behave identically.
If you are using recent versions, this repository is recommended.
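For reference, here is a minimal sketch of how the updated template is exercised, assuming the standard `transformers` `AutoTokenizer` interface and the `PleIAs/Monad` identifier from the link above (swap in this repository's id if you are using the fixed-template copy):

```python
from transformers import AutoTokenizer

# Hypothetical model id: adjust to the repository you actually cloned.
tokenizer = AutoTokenizer.from_pretrained("PleIAs/Monad")

messages = [{"role": "user", "content": "Who are you?"}]

# With the fixed template, this renders the Qwen-style format shown in the
# "Use and deployment" section without breaking on recent transformers versions.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```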
Features
Monad has been natively trained for instructions with thinking traces. We implemented a series of dedicated pipelines for:
- Memorization of encyclopedic knowledge (50,000 vital articles from Wikipedia), though at this size hallucinations are to be expected.
- Retrieval-Augmented Generation with grounding (following on our initial experiments with the Pleias-RAG series).
- Arithmetic and simple math problem solving.
- Editing tasks.
- Information extraction.
- Creative writing, including unusual synthetic exercises like lipograms or layout poems.
Monad is strictly monolingual in English.
We trained a custom tokenizer (likely one of the smallest to date, with fewer than 8,000 tokens), exclusively on SYNTH to maintain a relatively strong compression ratio.
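As a quick illustration, assuming the tokenizer ships with the model repository and loads through the standard `AutoTokenizer` interface, its size and compression behavior can be inspected directly:

```python
from transformers import AutoTokenizer

# Hypothetical id; point this at the repository you are actually using.
tokenizer = AutoTokenizer.from_pretrained("PleIAs/Monad")

# Vocabulary size: expected to be below 8,000 entries.
print("vocab size:", len(tokenizer))

# Rough compression check: tokens per word on a short English sample.
sample = "Monad is a small reasoning model trained on fully synthetic data."
ids = tokenizer.encode(sample)
print("tokens:", len(ids), "words:", len(sample.split()))
```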
Model design and training
Monad is a 56M-parameter decoder-only model with a standard Qwen/LLaMA-like design, except for its extremely compact size and an opinionated depth-first architecture (64 layers).
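As a rough sanity check (not the official configuration: the hidden and feed-forward dimensions below are assumptions chosen for illustration), a very narrow but 64-layer-deep LLaMA-style stack does land in the ~56M range:

```python
# Back-of-the-envelope parameter count for a narrow, 64-layer LLaMA-style decoder.
# Only the layer count and vocabulary size come from this card; the rest is assumed.
vocab_size = 8_000       # "fewer than 8,000 tokens"
n_layers = 64            # depth-first design
d_model = 256            # assumed hidden size
d_ff = 768               # assumed SwiGLU feed-forward size

embeddings = vocab_size * d_model        # tied input/output embeddings assumed
attention = 4 * d_model * d_model        # q, k, v, o projections
mlp = 3 * d_model * d_ff                 # gate, up, down projections
per_layer = attention + mlp

total = embeddings + n_layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # roughly 56M under these assumptions
```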
Monad was trained on 16 H100 GPUs on Jean Zay (compute plan n°A0191016886).
Full pre-training took slightly less than 6 hours.
Evaluation
Monad attains performance on MMLU significantly beyond chance (25% for four-option multiple choice), with close to 30% accuracy.
We also observe non-random performance on:
- GSM8K: ~8%
- HotPotQA: ~8%
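These results can be approximated with lm-eval; a minimal sketch, assuming the harness's `simple_evaluate` Python entry point and the `PleIAs/Monad` identifier (task names and few-shot settings may differ from the ones used for the numbers above):

```python
import lm_eval

# Hypothetical invocation for a quick local check.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=PleIAs/Monad",
    tasks=["mmlu", "gsm8k"],
)
print(results["results"])
```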
To our knowledge, there is no remotely comparable model in this size range to evaluate against.
Both spiritually and practically, Monad remains unique.
Use and deployment
Monad has been trained using the standard Qwen instruction format.
```
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
<think>
```
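A minimal generation sketch under this format, assuming the standard `transformers` generation API (the `PleIAs/Monad` id and the sampling settings are illustrative, not recommended values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical id; adjust to the repository you are using.
model_id = "PleIAs/Monad"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)

# The model opens its answer with a <think> reasoning trace before the final reply.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=False))
```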
Monad does not yet support multi-turn conversations.
A major envisioned use case for Monad is explainability, as the model provides a unique trade-off between observability and actual reasoning performance.