Waypoint-1-Small is a 2.3-billion-parameter control- and text-conditioned causal diffusion model. It uses a transformer architecture with rectified flow, distilled via Self Forcing with DMD (distribution matching distillation). The model autoregressively generates new frames given historical frames, actions, and text.
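The autoregressive loop can be pictured as follows. This is a minimal sketch, not the actual API: the `WaypointModel.step` interface, the `rollout` helper, and the 16-frame context window are all hypothetical placeholders for illustration.

```python
# Minimal sketch of autoregressive frame generation. `model.step`,
# `rollout`, and the 16-frame context window are hypothetical
# placeholders; the real Waypoint-1-Small API may differ.
from typing import List, Sequence

import torch


def rollout(
    model,                                # hypothetical WaypointModel
    seed_frames: List[torch.Tensor],
    actions: Sequence[torch.Tensor],
    prompt_embedding: torch.Tensor,
) -> List[torch.Tensor]:
    """Generate one frame per action, feeding each new frame back
    into the causal context so later frames depend on earlier ones."""
    context = list(seed_frames)
    generated = []
    for action in actions:
        frame = model.step(
            frames=torch.stack(context[-16:]),  # recent history only
            action=action,
            prompt_embedding=prompt_embedding,
        )
        context.append(frame)
        generated.append(frame)
    return generated
```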
Capabilities:
- Can generate worlds in realtime on high-end consumer hardware
- Allows for exploration of and interaction with worlds via control inputs
- Allows for guidance of the generated world via text prompts
- Can be prompted with any number of starting frames and controls
Usage:
To simply use Waypoint-1-Small, we recommend Biome for local play, the Overworld streaming client, or the hosted Gradio Space on Hugging Face.
To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.
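To sanity-check throughput on your own hardware, you can time the generation loop directly. A rough sketch, reusing the hypothetical `rollout` helper from above:

```python
# Rough FPS measurement for a local generation loop, reusing the
# hypothetical `rollout` sketch above; not an official benchmark.
import time


def measure_fps(model, seed_frames, actions, prompt_embedding) -> float:
    start = time.perf_counter()
    rollout(model, seed_frames, actions, prompt_embedding)
    elapsed = time.perf_counter() - start
    return len(actions) / elapsed
```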
Keywords
To properly explain limitations and misuse, we must first define some terms. While the model can be used for general interactive video generation tasks, we herein refer to interacting with the model by sending controls and receiving new frames as “playing” the model, and to the agent or user inputting the controls as the “player”.

The model has two forms of output: continuations and generations. Continuations occur when seed frames are provided and no inputs are given; they roughly correspond to moving around information that already exists in the given context frames. For example, if a scene contains fire or water, they may evolve progressively in the generated frames even without any action; likewise, if you seed with an image of a humanoid entity, the entity will persist on screen as you move or look around. Generations, by contrast, occur when the player plays with the model extensively, for example by moving around, turning around fully, or interacting with objects and items; they correspond to creating entirely new information.
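In code, the distinction is simply whether the action stream carries real player input. A sketch under the same hypothetical interface as above (the no-op action and its dimensionality are assumptions):

```python
# Continuation vs. generation with the hypothetical interface above.
import torch

ACTION_DIM = 16                    # assumed size of an action vector
NOOP = torch.zeros(ACTION_DIM)     # assumed "do nothing" control


def continuation(model, seed_frames, prompt_embedding, num_frames):
    """No player input: the model only evolves what the seed frames
    already contain (fire flickers, water flows, entities persist)."""
    return rollout(model, seed_frames, [NOOP] * num_frames, prompt_embedding)


def generation(model, seed_frames, prompt_embedding, player_actions):
    """Extensive play (moving, turning fully around, using objects)
    forces the model to synthesize entirely new information."""
    return rollout(model, seed_frames, player_actions, prompt_embedding)
```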
Limitations
- Continuations can plausibly model any inputted scene or photo, and quality will depend largely on the seed frame given.
- For generations, the model may occasionally:
  - Ignore the given text prompt
  - Ignore certain controls in specific contexts
  - Fail to generate realistic text or interactive HUD/UI elements
  - Fail to generate human/animal entities
  - Fail to generate realistic motion for given entities
  - Fail to generate faces
- Prompt adherence is heavily dependent on prompting strategy
Out of Scope Usage
The model and its derivatives must not be used:
- For harassment or bullying
- For the purpose of exploiting or harming minors in any way
- For simulating extremely violent acts
- For generating violent/gory video
- For facilitating large-scale disinformation campaigns
- For the purpose of generating any sexually explicit or suggestive material