I find it hilarious how all these comparative LLM benchmarks show models getting better and better, but when you open the benchmarks' source data, it's basically assessors saying the same thing over and over:
"Model A's output is complete bullshit, 2 stars out of 5."
"Model B's output is not even relevant, 1 star out of 5."