etemiz 
posted an update Oct 9, 2025
Grok 2 is worse than Grok 1 but better than Grok 3. This was already measured through the API; now we have measured the model itself, and the results are similar.

GLM keeps ranking higher with each new version. Nice trend!



Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top-ranked and the worst-ranked models: answers from the top models get a score of +1, and answers from the worst models get -1. Ask many times at a higher temperature to generate more answers. What do you think?
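
Here is a rough sketch of how that labeling could look in Python. It is only an illustration of the idea, not our actual pipeline: `build_reward_samples`, the `generate` callable, and `fake_generate` are hypothetical names, and the real call to a model (API or local inference) is left to whatever you plug in as `generate`.

```python
import random
from typing import Callable, Dict, List

# Hypothetical signature: generate(model_name, question, temperature) -> answer text.
GenerateFn = Callable[[str, str, float], str]


def build_reward_samples(
    question: str,
    top_models: List[str],
    worst_models: List[str],
    generate: GenerateFn,
    n_samples: int = 4,
    temperature: float = 1.0,
) -> List[Dict[str, object]]:
    """Ask every model the same question n_samples times and label each answer.

    Answers from top-ranked models get reward +1, answers from
    worst-ranked models get reward -1.
    """
    samples: List[Dict[str, object]] = []
    for models, reward in ((top_models, 1), (worst_models, -1)):
        for model in models:
            for _ in range(n_samples):
                answer = generate(model, question, temperature)
                samples.append(
                    {
                        "question": question,
                        "model": model,
                        "answer": answer,
                        "reward": reward,
                    }
                )
    return samples


def fake_generate(model: str, question: str, temperature: float) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return f"{model} answer #{random.randint(0, 999)} (T={temperature})"


if __name__ == "__main__":
    # Placeholder model names; in practice these would come from the leaderboard.
    data = build_reward_samples(
        "Is fasting healthy?",
        top_models=["top-model-a", "top-model-b"],
        worst_models=["worst-model-a"],
        generate=fake_generate,
        n_samples=3,
        temperature=1.2,
    )
    for row in data:
        print(row["reward"], row["model"], row["answer"])
```

The resulting rows (question, answer, reward) could then feed a reward model or a preference-style trainer. How the +1 and -1 labels are best consumed depends on the RL method you pick.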
