@etemiz on Hugging Face: "Grok 2 is worse than 1 but better than 3. This was already measured using API…"

etemiz

posted an update Oct 9, 2025

Grok 2 is worse than 1 but better than 3. This was already measured through the API, but now we have measured the LLM itself and the results are similar.

GLM keeps ranking higher with each new version. Nice trend!

etemiz

Oct 9, 2025

Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top-ranked and the worst-ranked models: answers from the top models get a score of +1, and answers from the worst models get -1. Ask many times at a higher temperature to generate more answers. What do you think?
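To make the idea concrete, here is a rough Python sketch of how such a labeled set could be collected. It is only an illustration under assumptions: `generate` is a hypothetical callable you would supply yourself (it wraps whatever API or local runtime serves the models), and `ScoredAnswer`, `build_reward_samples`, `samples_per_model`, and the model lists are made-up names, not part of the leaderboard.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: (model_name, question, temperature) -> answer text.
# The caller provides this; it is not a real library function.
Generate = Callable[[str, str, float], str]


@dataclass
class ScoredAnswer:
    question: str
    model: str
    answer: str
    reward: int  # +1 for a top-ranked model, -1 for a worst-ranked one


def build_reward_samples(
    questions: List[str],
    top_models: List[str],
    worst_models: List[str],
    generate: Generate,
    samples_per_model: int = 4,
    temperature: float = 1.0,
) -> List[ScoredAnswer]:
    """Ask every question to every model several times and label each answer."""
    overlap = set(top_models) & set(worst_models)
    if overlap:
        raise ValueError(f"models listed as both top and worst: {sorted(overlap)}")

    # Reward depends only on which group the answering model belongs to.
    rewards = {model: 1 for model in top_models}
    rewards.update({model: -1 for model in worst_models})

    samples: List[ScoredAnswer] = []
    for question in questions:
        for model, reward in rewards.items():
            # Repeated sampling at a higher temperature yields varied answers.
            for _ in range(samples_per_model):
                answer = generate(model, question, temperature)
                samples.append(ScoredAnswer(question, model, answer, reward))
    return samples
```

The resulting list of `ScoredAnswer` records could then feed a reward model or a preference-based RL step. Note that the label follows the model's rank, not the quality of the individual answer, so a good answer from a low-ranked model would still get -1 in this sketch.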

In this post

etemiz Emin Temiz