Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents Paper • 2510.24702 • Published Oct 28 • 27
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published Oct 29 • 45
Simulating Environments with Reasoning Models for Agent Training Paper • 2511.01824 • Published Nov 3 • 1
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models Paper • 2512.07783 • Published 11 days ago • 32
RefineBench: Evaluating Refinement Capability of Language Models via Checklists Paper • 2511.22173 • Published 22 days ago • 12
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research Paper • 2511.19399 • Published 25 days ago • 58
Grounding Multilingual Multimodal LLMs With Cultural Knowledge Paper • 2508.07414 • Published Aug 10 • 1
SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions Paper • 2506.23046 • Published Jun 29 • 1
Evaluating Vision-Language Models as Evaluators in Path Planning Paper • 2411.18711 • Published Nov 27, 2024
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published Mar 13 • 24
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators Paper • 2503.19877 • Published Mar 25 • 1
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge Paper • 2504.10342 • Published Apr 14 • 10
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time Paper • 2504.12329 • Published Apr 12