⚛️ Monad

Blog announcement

Monad is a 56-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.

As of 2025, Monad is the best contender for the smallest viable language model. Despite being less than half the size of GPT-2, Monad not only answers in consistent English but also performs significantly above chance on MMLU and other major industry benchmarks.

Monad's name is a reference to Leibniz’ concept and the general idea of the smallest possible unit of intelligence.


🔧 Compatibility Note (Chat Template Fix)

Important
This repository hosts the same model as
👉 https://huggingface.co/PleIAs/Monad

There are no architectural changes, no re-training, and no weight modifications.

The only difference is that this version includes a fixed / updated chat template compatible with newer versions of:

  • 🤗 transformers
  • trl
  • lm-eval

This resolves recent breaking changes around chat templating and instruction formatting, ensuring correct behavior when using:

tokenizer.apply_chat_template(...)

If you are using older versions of the Hugging Face stack, both repositories behave identically.
If you are using recent versions, this repository is recommended.
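
As a quick sanity check, here is a minimal sketch of rendering the template, assuming the Shekswess/Monad repository id and a recent transformers release (the repo id and message content are illustrative):

from transformers import AutoTokenizer

# Illustrative repository id; PleIAs/Monad hosts the same weights with the older template.
tokenizer = AutoTokenizer.from_pretrained("Shekswess/Monad")

messages = [{"role": "user", "content": "Who are you?"}]

# With the updated template, this should render the Qwen-style <|im_start|> markers
# and open the assistant turn so generation can begin with a thinking trace.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)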


Features

Monad has been natively trained to follow instructions with thinking traces. We implemented a series of dedicated pipelines for:

  • Memorization of encyclopedic knowledge (50,000 vital articles from Wikipedia), though at this size hallucinations are to be expected.
  • Retrieval-Augmented Generation with grounding (following on our initial experiments with the Pleias-RAG series).
  • Arithmetic and simple math problem solving.
  • Editing tasks.
  • Information extraction.
  • Creative writing, including unusual synthetic exercises like lipograms or layout poems.

Monad is strictly monolingual in English.
We trained a custom tokenizer (likely one of the smallest to date, with fewer than 8,000 tokens), exclusively on SYNTH to maintain a relatively strong compression ratio.
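
As an illustrative check (a sketch assuming the tokenizer is loaded from this Hugging Face repository), the vocabulary size can be read directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shekswess/Monad")  # illustrative repo id
print(len(tokenizer))  # expected to be below 8,000, per the description above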


Model design and training

Monad is a 56M-parameter decoder-only model with a standard Qwen/LLaMA-like design, except for its extremely compact size and an opinionated depth-first architecture (64 layers).
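
For reference, a minimal sketch of inspecting that shape through the standard transformers config, assuming Qwen/LLaMA-style field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shekswess/Monad")  # illustrative repo id
# Depth-first design: many layers for a very narrow width.
print(config.num_hidden_layers)  # expected: 64
print(config.hidden_size)        # narrow; exact value not stated in this card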

Monad was trained on 16 H100 GPUs on Jean Zay (compute plan n°A0191016886).
Full pre-training took slightly less than 6 hours.


Evaluation

Monad attains performance on MMLU significantly beyond chance, with close to 30% accuracy.
We also observe non-random performance on:

  • GSM8K: ~8%
  • HotPotQA: ~8%
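
Figures of this kind should be reproducible with lm-eval (see the compatibility note above); the sketch below uses the harness's Python entry point, with the repo id, few-shot setting, and batch size as assumptions rather than the exact evaluation protocol:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Shekswess/Monad,dtype=bfloat16",  # illustrative repo id
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,   # assumed setting, not necessarily the one behind the numbers above
    batch_size=16,
)
print(results["results"])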

To our knowledge, no other model in this size range is remotely comparable on these evaluations.
Both spiritually and practically, Monad remains unique.


Use and deployment

Monad has been trained using the standard Qwen instruction format.

<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
<think>

Monad does not yet support multi-turn conversations.
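
For single-turn use, a minimal generation sketch, assuming the Shekswess/Monad repo id and the chat template shipped with this repository:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Shekswess/Monad"  # illustrative; PleIAs/Monad hosts the same weights
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)

# The assistant turn opens with a <think> block before the final answer.
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=False))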

A major envisioned use case for Monad is explainability, as the model provides a unique trade-off between observability and actual reasoning performance.
