Label Consistency: Inter-Annotator vs. Intra-Annotator Agreement
Date: Aug 22, 2025
Author: Elsie Piao
In supervised learning, your model is only as good as your labels. But ground truth isn’t always absolute; it depends on how consistently humans apply labeling guidelines. Two key measures capture this: inter-annotator agreement and intra-annotator agreement.
Inter-Annotator Agreement
Inter-annotator agreement measures how much different annotators agree with each other. If multiple people label the same sample, do they reach the same conclusion?
High agreement suggests clear guidelines and an unambiguous task.
Low agreement often means either the task is inherently subjective (e.g., sentiment labeling) or the instructions aren’t specific enough.
Common metrics: Cohen’s kappa (two annotators), Fleiss’ kappa (more than two annotators), Krippendorff’s alpha (handles missing labels and mixed data types). A minimal sketch of Cohen’s kappa follows below.
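To make the numbers concrete, here is a minimal from-scratch sketch of Cohen’s kappa for two annotators. The label lists are made-up illustrative data; in practice you would typically use a library implementation such as scikit-learn’s cohen_kappa_score, which should give the same result.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)

    # Observed agreement: fraction of items where the two annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())

    return (p_o - p_e) / (1 - p_e)

# Made-up sentiment labels from two annotators over the same eight samples.
annotator_1 = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"]
annotator_2 = ["pos", "neg", "neu", "neu", "pos", "pos", "neu", "pos"]
print(f"Cohen's kappa: {cohen_kappa(annotator_1, annotator_2):.3f}")  # 0.600
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the annotators agree less often than chance would predict.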
Intra-Annotator Agreement
Intra-annotator agreement measures how consistently a single annotator labels over time. If you give the same person the same sample twice, do they respond the same way?
High agreement shows that the annotator is applying criteria reliably.
Low agreement suggests fatigue, confusion, or unclear task definitions.
Intra-annotator agreement is less often reported, but it can expose issues like drifting interpretations or inconsistent training of annotators.
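The same metric works for intra-annotator agreement: treat the annotator’s first and second pass over the same samples as the two “raters.” A brief sketch, with made-up data and assuming scikit-learn is installed:

```python
from sklearn.metrics import cohen_kappa_score

# Two labeling passes by the same annotator over the same eight samples.
pass_1 = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"]
pass_2 = ["pos", "neg", "pos", "pos", "pos", "neg", "neu", "neg"]

# Raw self-agreement: how often the annotator repeats their own label.
raw = sum(a == b for a, b in zip(pass_1, pass_2)) / len(pass_1)
print(f"Raw self-agreement: {raw:.2f}")

# Chance-corrected self-agreement.
print(f"Intra-annotator kappa: {cohen_kappa_score(pass_1, pass_2):.3f}")
```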
Why Both Matter
Inter-annotator agreement tells you about consensus across people. Intra-annotator agreement tells you about consistency within a person. Together, they provide a fuller picture of label reliability.
If both are low: the labeling task itself may be ill-posed.
If intra is high but inter is low: annotators are individually consistent but disagree with one another. You likely need clearer definitions in the guidelines.
If inter is high but intra is low: annotators align as a group, but some individuals are unreliable. Better training or quality control may help; a rough diagnostic sketch follows this list.
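The three cases above can be wired into a rough triage rule. The sketch below is a toy diagnostic, assuming kappa-style scores and the commonly cited 0.6 threshold for “substantial” agreement (Landis & Koch); both the threshold and the diagnose function are illustrative choices, not a standard API.

```python
def diagnose(inter_kappa: float, intra_kappa: float, threshold: float = 0.6) -> str:
    """Map inter/intra agreement scores to one of the cases above."""
    inter_ok = inter_kappa >= threshold
    intra_ok = intra_kappa >= threshold
    if not inter_ok and not intra_ok:
        return "Both low: the labeling task itself may be ill-posed."
    if intra_ok and not inter_ok:
        return "Intra high, inter low: consistent annotators who disagree; clarify definitions."
    if inter_ok and not intra_ok:
        return "Inter high, intra low: some individuals are unreliable; improve training or QC."
    return "Both high: labels look reliable."

print(diagnose(inter_kappa=0.42, intra_kappa=0.78))
```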
Label consistency is a foundational quality check in dataset creation. Inter-annotator and intra-annotator agreement directly determine whether your labels represent a trustworthy ground truth, or just a noisy consensus.