Cross-entropy loss measures the divergence between predicted and true token probabilities:
$$L = -\sum_{i} y_i \log(\hat{y}_i) \tag{6}$$
Here $y_i$ is the true (one-hot) probability of token $i$ and $\hat{y}_i$ is the model's predicted probability. The loss penalizes assigning low probability to the correct token, so minimizing it pushes the model to place high probability on the correct next token at each position, which is exactly what language modeling requires.
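As a minimal sketch, the snippet below computes the loss for a single prediction step over a toy vocabulary; the vocabulary, the predicted distribution, and the choice of "sat" as the correct next token are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hypothetical toy vocabulary and model output for one prediction step.
vocab = ["the", "cat", "sat", "mat"]

# Predicted distribution over the next token (softmax output).
predicted = np.array([0.05, 0.10, 0.70, 0.15])

# One-hot true distribution: the correct next token is "sat".
true = np.array([0.0, 0.0, 1.0, 0.0])

# L = -sum_i y_i * log(y_hat_i); only the true token's log-probability contributes.
loss = -np.sum(true * np.log(predicted))
print(f"cross-entropy loss: {loss:.4f}")  # -log(0.70) ≈ 0.3567
```

Because the true distribution is one-hot, the sum collapses to the negative log-probability of the correct token, so the loss shrinks only as the model concentrates probability on that token.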