---
library_name: transformers
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - gpt_oss
  - moe
  - fp8
  - conversational
  - english
---

# Model Card for KJML/gpt-oss-20b-FP8-Dynamic

This repository provides an FP8-dynamic quantized variant of OpenAI’s gpt-oss-20b model.
It is intended for users who want the reasoning capabilities of gpt-oss-20b with a smaller memory footprint and faster inference on modern GPUs with FP8 support.

⚠️ This model is not trained or fine-tuned further; it is a post-training quantization of the original openai/gpt-oss-20b weights.


## Model Details

### Model Description

- **Base model:** openai/gpt-oss-20b
- **Architecture:** Mixture-of-Experts (MoE) Transformer language model (≈21B total parameters, ≈3.6B active per token, inherited from the base model)
- **Quantization:** FP8 dynamic (weights + activations) for inference
- **Context length:** same as the base gpt-oss-20b (long-context, Harmony-format chat)
- **Language(s):** primarily English; inherits multilingual capability from the base model
- **License:** Apache 2.0 (inherited from the base model)
- **Model type:** causal language model for text / chat generation
- **Developer of this variant:** KJML
- **Finetuned from model:** openai/gpt-oss-20b (no additional training; quantization only)

The original gpt-oss-20b is an open-weight reasoning model from OpenAI, designed for agentic workflows, tool use, and configurable reasoning effort. This FP8-dynamic variant preserves those capabilities while targeting more efficient deployment.
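
As a rough illustration of what "FP8 dynamic" means, the toy sketch below quantizes a tensor with a scale computed from its current value range at call time, rather than from an offline calibration pass. This is only a conceptual PyTorch example (it needs a recent PyTorch with float8 dtypes); real inference backends use fused, per-layer FP8 kernels, and this is not how the checkpoint itself was produced or is stored.

```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    # "Dynamic" scaling: derive the per-tensor scale from the current values,
    # cast to the FP8 e4m3 format, and keep the scale for dequantization.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # ~448 for e4m3
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8)
x_fp8, scale = fp8_dynamic_quantize(x)
x_dequant = x_fp8.to(torch.float32) * scale               # dequantize to compare
print("max abs error:", (x - x_dequant).abs().max().item())
```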

### Model Sources

- **Base model:** [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)

## Uses

### Direct Use

Typical direct-use scenarios (without additional fine-tuning):

- General chat and assistant-style dialogue (English-first)
- Reasoning and analysis (step-by-step / chain-of-thought) for:
  - Technical explanations
  - Brainstorming and ideation
  - Code reasoning and pseudo-code (light coding assistance)
- Agentic / tool-using setups:
  - Function calling and structured outputs
  - Retrieval-augmented generation (RAG) backends
  - Local “AI PC” / workstation deployments where FP8 is supported
Note: The model is trained on OpenAI’s Harmony response format. For best results, use a chat template that applies the Harmony format (e.g. tokenizer.apply_chat_template in Transformers) when prompting.
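
If you want to see what the Harmony format looks like before sending anything to the model, a minimal sketch (assuming the tokenizer ships the same chat template as the base model) is to render the template without tokenizing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KJML/gpt-oss-20b-FP8-Dynamic")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]

# tokenize=False returns the Harmony-formatted prompt as plain text
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```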

### Downstream Use

The FP8-dynamic variant can be used as a drop-in replacement for openai/gpt-oss-20b in:

- Custom backends built on vLLM, TGI, or other inference servers (see the sketch below)
- Local desktop apps (LM Studio, Ollama-style setups, etc.) that support FP8
- RAG systems where latency and VRAM usage are important
- Multi-agent frameworks where many concurrent contexts are needed
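
As an example of the first point, here is a minimal vLLM sketch. It assumes your vLLM build and GPU support this checkpoint's FP8 format; in practice you would also want Harmony-formatted prompts, e.g. via the chat template or vLLM's OpenAI-compatible server.

```python
from vllm import LLM, SamplingParams

# Assumption: this FP8-dynamic checkpoint loads directly in your vLLM version.
llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain retrieval-augmented generation in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```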

If you fine-tune or adapt this model further, treat it as you would the base gpt-oss-20b model, but keep in mind that quantization can slightly change numeric behavior, especially for very long generations.

### Out-of-Scope Use

The model (and this quantized variant) is not recommended for:

- High-stakes decision making without human review, e.g.:
  - Medical, legal, or financial advice
  - Safety-critical environments (autonomous driving, industrial control, etc.)
- Generating content that violates laws or platform policies
- Acting as the sole decision-maker in any context where errors could cause harm to people or property

Users should always keep a human in the loop for sensitive or impactful applications.


## Bias, Risks, and Limitations

This model inherits all biases, risks, and limitations of the base gpt-oss-20b model. As a large language model trained on internet-scale data, it may:

- Produce biased or stereotypical content, including along axes such as gender, race, nationality, or religion.
- Hallucinate facts, references, or citations.
- Overstate its own certainty.
- Generate unsafe or undesirable content if prompted adversarially or without proper safety layers.

The FP8-dynamic quantization may also:

- Introduce small quality degradations vs. the BF16 / MXFP4 versions, particularly for:
  - Very long generations
  - Numerically sensitive edge cases
- Behave slightly differently from the base model, even with identical prompts.

### Recommendations

- Do not rely on this model as a single source of truth.
- Add safety filters and/or a moderation layer around generations.
- Use human review for any high-impact or user-facing deployment.
- Evaluate the FP8-dynamic variant on your own tasks and data before using it in production (see the sketch below).
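
A minimal sketch of such an evaluation, assuming you have the VRAM to load each 20B checkpoint (they are loaded one after the other) and substituting your own prompts:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPTS = ["Summarize the main trade-offs of FP8 inference."]  # replace with your own eval prompts

def generate_all(model_id: str) -> list[str]:
    # Load a checkpoint, run every prompt through the chat template, and decode the replies.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    results = []
    for prompt in PROMPTS:
        inputs = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        out = model.generate(inputs, max_new_tokens=128)
        results.append(tok.decode(out[0], skip_special_tokens=True))
    return results

base_outputs = generate_all("openai/gpt-oss-20b")
fp8_outputs = generate_all("KJML/gpt-oss-20b-FP8-Dynamic")
for base_text, fp8_text in zip(base_outputs, fp8_outputs):
    print("BASE:", base_text, "\nFP8 :", fp8_text, "\n---")
```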

## How to Get Started with the Model

Basic usage with 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",     # Will use FP8 where supported
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization is in simple terms."},
]

# Apply the Harmony chat template and move the input IDs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```