---
library_name: transformers
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - gpt_oss
  - moe
  - fp8
  - conversational
  - english
---

# Model Card for KJML/gpt-oss-20b-FP8-Dynamic

This repository provides an FP8-dynamic quantized variant of OpenAI’s gpt-oss-20b model.
It is intended for users who want the reasoning capabilities of gpt-oss-20b with a smaller memory footprint and faster inference on modern GPUs with FP8 support.

⚠️ This model is not trained or fine-tuned further; it is a post-training quantization of the original openai/gpt-oss-20b weights.


## Model Details

### Model Description

- **Base model:** openai/gpt-oss-20b
- **Architecture:** Mixture-of-Experts (MoE) Transformer language model (≈21B total parameters, ≈3.6B active per token, inherited from the base model)
- **Quantization:** FP8 dynamic (weights + activations) for inference
- **Context length:** same as the base gpt-oss-20b (long-context, Harmony-format chat)
- **Language(s):** primarily English; inherits multilingual capability from the base model
- **License:** Apache 2.0 (inherited from the base model)
- **Model type:** causal language model for text / chat generation
- **Developer of this variant:** KJML
- **Finetuned from model:** openai/gpt-oss-20b (no additional training; quantization only)

The original gpt-oss-20b is an open-weight reasoning model from OpenAI, designed for agentic workflows, tool use, and configurable reasoning effort. This FP8-dynamic variant preserves those capabilities while targeting more efficient deployment.
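
As a rough illustration of what "FP8 dynamic" means, the toy sketch below quantizes a tensor with a scale computed from its current value range at call time, rather than from an offline calibration pass. This is only a conceptual PyTorch example (it needs a recent PyTorch with float8 dtypes); real inference backends use fused, per-layer FP8 kernels, and this is not how the checkpoint itself was produced or is stored.

```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    # "Dynamic" scaling: derive the per-tensor scale from the current values,
    # cast to the FP8 e4m3 format, and keep the scale for dequantization.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # ~448 for e4m3
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8)
x_fp8, scale = fp8_dynamic_quantize(x)
x_dequant = x_fp8.to(torch.float32) * scale               # dequantize to compare
print("max abs error:", (x - x_dequant).abs().max().item())
```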

### Model Sources

- **Base model:** [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)

## Uses

### Direct Use

Typical direct-use scenarios (without additional fine-tuning):

- General chat and assistant-style dialogue (English-first)
- Reasoning and analysis (step-by-step / chain-of-thought) for:
  - Technical explanations
  - Brainstorming and ideation
  - Code reasoning and pseudo-code (light coding assistance)
- Agentic / tool-using setups:
  - Function calling and structured outputs
  - Retrieval-augmented generation (RAG) backends
  - Local “AI PC” / workstation deployments where FP8 is supported
Note: The model is trained on OpenAI’s Harmony response format. For best results, use a chat template that applies the Harmony format (e.g. tokenizer.apply_chat_template in Transformers) when prompting.
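
If you want to see what the Harmony format looks like before sending anything to the model, a minimal sketch (assuming the tokenizer ships the same chat template as the base model) is to render the template without tokenizing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KJML/gpt-oss-20b-FP8-Dynamic")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]

# tokenize=False returns the Harmony-formatted prompt as plain text
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```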

### Downstream Use

The FP8-dynamic variant can be used as a drop-in replacement for openai/gpt-oss-20b in:

- Custom backends built on vLLM, TGI, or other inference servers (see the sketch below)
- Local desktop apps (LM Studio, Ollama-style setups, etc.) that support FP8
- RAG systems where latency and VRAM usage are important
- Multi-agent frameworks where many concurrent contexts are needed
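
As an example of the first point, here is a minimal vLLM sketch. It assumes your vLLM build and GPU support this checkpoint's FP8 format; in practice you would also want Harmony-formatted prompts, e.g. via the chat template or vLLM's OpenAI-compatible server.

```python
from vllm import LLM, SamplingParams

# Assumption: this FP8-dynamic checkpoint loads directly in your vLLM version.
llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain retrieval-augmented generation in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```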

If you fine-tune or adapt this model further, treat it as you would the base gpt-oss-20b model, but keep in mind that quantization can slightly change numeric behavior, especially for very long generations.

### Out-of-Scope Use

The model (and this quantized variant) is not recommended for:

- High-stakes decision making without human review, e.g.:
  - Medical, legal, or financial advice
  - Safety-critical environments (autonomous driving, industrial control, etc.)
- Generating content that violates laws or platform policies
- Acting as the sole decision-maker in any context where errors could cause harm to people or property

Users should always keep a human in the loop for sensitive or impactful applications.


## Bias, Risks, and Limitations

This model inherits all biases, risks, and limitations of the base gpt-oss-20b model. As a large language model trained on internet-scale data, it may:

- Produce biased or stereotypical content, including along axes such as gender, race, nationality, or religion.
- Hallucinate facts, references, or citations.
- Overstate its own certainty.
- Generate unsafe or undesirable content if prompted adversarially or without proper safety layers.

The FP8-dynamic quantization may also:

- Introduce small quality degradations vs. the BF16 / MXFP4 versions, particularly for:
  - Very long generations
  - Numerically sensitive edge cases
- Behave slightly differently from the base model, even with identical prompts.

### Recommendations

- Do not rely on this model as a single source of truth.
- Add safety filters and/or a moderation layer around generations.
- Use human review for any high-impact or user-facing deployment.
- Evaluate the FP8-dynamic variant on your own tasks and data before using it in production (see the sketch below).
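
A minimal sketch of such an evaluation, assuming you have the VRAM to load each 20B checkpoint (they are loaded one after the other) and substituting your own prompts:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPTS = ["Summarize the main trade-offs of FP8 inference."]  # replace with your own eval prompts

def generate_all(model_id: str) -> list[str]:
    # Load a checkpoint, run every prompt through the chat template, and decode the replies.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    results = []
    for prompt in PROMPTS:
        inputs = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        out = model.generate(inputs, max_new_tokens=128)
        results.append(tok.decode(out[0], skip_special_tokens=True))
    return results

base_outputs = generate_all("openai/gpt-oss-20b")
fp8_outputs = generate_all("KJML/gpt-oss-20b-FP8-Dynamic")
for base_text, fp8_text in zip(base_outputs, fp8_outputs):
    print("BASE:", base_text, "\nFP8 :", fp8_text, "\n---")
```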

## How to Get Started with the Model

Basic usage with 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",     # Will use FP8 where supported
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization is in simple terms."},
]

# Apply the Harmony chat template and move the input IDs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```