@baldur You are absolutely correct about this, but to be fair, those models are super brittle and can be "broken" by over-tuning during post-training, where some cases start to perform worse while others may or may not improve.

They're definitely not degrading models on purpose, as that would mean shipping a worse product at the same inference cost.