CODA: Contrastive Object-centric Diffusion Alignment

Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) introduces register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot–image correspondence.
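
At a high level, the contrastive alignment loss can be viewed as an InfoNCE-style objective that pulls each image's pooled slot representation toward the features of its own image and pushes it away from the other images in a batch. The PyTorch sketch below only illustrates that idea; the function name, tensor shapes, pooling choice, and temperature are assumptions, not the CODA implementation.

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(slots, image_feats, temperature=0.07):
    # Hypothetical InfoNCE-style slot-image alignment loss.
    #   slots:       (B, K, D) object slots per image
    #   image_feats: (B, D)    pooled image features
    slot_repr = F.normalize(slots.mean(dim=1), dim=-1)  # pool slots per image
    img_repr = F.normalize(image_feats, dim=-1)

    # Similarity of every pooled slot representation to every image in the batch.
    logits = slot_repr @ img_repr.t() / temperature     # (B, B)

    # Matching slot/image pairs lie on the diagonal.
    targets = torch.arange(slots.size(0), device=slots.device)

    # Symmetric cross-entropy, as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))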

🚀 Installation

The training and evaluation code requires PyTorch. Clone the repository, then install the dependencies from requirements.txt:

pip install -r requirements.txt

Data preparation

All datasets will be downloaded and stored under $USER_DATA. Run the following commands to fetch the data.

# define where to store data 
export USER_DATA=...

# download the datasets
bash preprocess/download.sh voc coco movi-c movi-e

🎮 Training

We use the following script for training.

bash scripts/train.sh <dataset>

where dataset can be one of [voc, coco, movi-c, movi-e].
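For example, to train on COCO:

bash scripts/train.sh coco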

To enable logging with Weights & Biases (wandb), place your API key in a .key file.
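
The key can also be used to authenticate manually with the wandb Python client, as sketched below; whether the training scripts consume the .key file exactly this way is an assumption.

import wandb

# Read the API key from the .key file and authenticate with wandb
# (assumes the key is stored as plain text in .key).
with open(".key") as f:
    wandb.login(key=f.read().strip())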

๐Ÿ“ Evaluation

The diffusion pipeline can be loaded as follows.

import torch

from src.model.pipeline import DiffusionPipeline

image = <image_tensor>                  # preprocessed input image tensor
model_path = <path_to_pretrained_model>

# Load the pretrained pipeline and move it to the GPU.
model = DiffusionPipeline.from_pretrained(model_path).to("cuda")

with torch.no_grad():
    # Encode the image into object slots.
    slots = model.encoder(image)
    # Decode the slots back into an image at 512x512 resolution.
    image_rec = model.sample(slots, resolution=512)
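
To inspect the result, the reconstruction can be written to disk, for example with torchvision. This assumes image_rec is a batched (B, 3, H, W) tensor with values in [0, 1], which is not guaranteed by the snippet above.

from torchvision.utils import save_image

# Save the reconstructed image for visual inspection
# (assumes image_rec values are already scaled to [0, 1]).
save_image(image_rec, "reconstruction.png")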

We use the following script for evaluation.

bash scripts/eval.sh <dataset>

where dataset can be one of [voc, coco, movi-c, movi-e].
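
For example, to evaluate on VOC:

bash scripts/eval.sh voc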

📥 Pretrained models are available.

| Dataset | FG-ARI ⬆️ | mBOi ⬆️ | mBOc ⬆️ | mIoUi ⬆️ | mIoUc ⬆️ | Download |
|---------|-----------|---------|---------|----------|----------|----------|
| MOVi-C  | 59.19     | 46.55   | –       | 51.94    | –        | Hugging Face Spaces |
| MOVi-E  | 59.04     | 43.45   | –       | 45.21    | –        | Hugging Face Spaces |
| VOC     | 32.23     | 55.38   | 61.32   | 50.77    | 56.30    | Hugging Face Spaces |
| COCO    | 47.54     | 36.61   | 41.43   | 36.41    | 42.60    | Hugging Face Spaces |

๐Ÿ“ Citation

Please cite our paper if you find it useful in your research:

@article{nguyen2026coda,
  title={Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment},
  author={Bac Nguyen and Yuhta Takida and Naoki Murata and Chieh-Hsin Lai and Toshimitsu Uesaka and Stefano Ermon and Yuki Mitsufuji},
  journal={arXiv preprint arXiv:2601.01224},
  year={2026}
}

Acknowledgement

We thank the authors of SlotDiffusion, Latent Slot Diffusion and Latent Diffusion Models for making their implementations publicly available.

License

CODA is released under the Apache License 2.0. See the LICENSE file for more details.
