On Measuring Large Language Models Performance with Inferential Statistics