
Which LLM is best for creating K-12 quizzes?
We built a web app that runs multiple language models on the same quiz-generation task, scores the outputs with judge models, and compares quality, consistency, correctness, and cost. Here's the context, how it works, and what we found.