This seems vaguely familiar, but I don't think I've read it yet: https://arxiv.org/abs/2603.09678v1
"We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier"
In reply to
Mark Gritter
@markgritter@mathstodon.xyz
Software Engineer at Thirdlaw. Previously co-founded Tintri, on Vault team at HashiCorp, founding engineer at Akita Software, Principal Engineer at Postman. Big nerd. he/him
mathstodon.xyz
Apr 14, 2026