Reply to Hu et al.: Applying different evaluation standards to humans vs. Large Language Models overestimates AI performance