ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents • Paper • 2601.12294 • Published about 1 month ago • 17
MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models • Paper • 2505.16700 • Published May 22, 2025 • 1
MIRage • Collection • Official model collection of paper: Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models • 2 items • Updated Feb 4, 2025 • 1