[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]
Change Logs:
- 2023-09-21: First draft. This paper appears at KDD 2023. The co-lead author – Sarah Masud – has published numerous papers on hate speech detection.
Additional Notes
-
Measuring Dataset Difficulty
The authors compare the difficulty of different datasets using the Jensen-Shannon (JS) divergence between Laplace-smoothed unigram distributions of the texts under each pair of labels. The lower the divergence, the closer the two unigram distributions are, and the harder the texts under that label pair are to distinguish.
For example, the proposed datasets have 4 labels, which leads to \binom{4}{2} = 6 pairwise divergence measures.
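The pairwise computation described above can be sketched as follows. This is only a minimal illustration, not the authors' code: the whitespace tokenizer, the add-one smoothing constant `alpha`, the base-2 logarithm, the function names, and the toy labels `A` to `D` are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper may differ.
    return text.lower().split()


def smoothed_unigram(texts, vocab, alpha=1.0):
    """Laplace (add-alpha) smoothed unigram distribution over a shared vocab."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so values lie in [0, 1]."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}

    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_js(texts_by_label, alpha=1.0):
    """JS divergence for every unordered pair of labels."""
    # The vocabulary is shared across labels so the distributions are comparable.
    vocab = sorted(
        {tok for texts in texts_by_label.values() for t in texts for tok in tokenize(t)}
    )
    dists = {
        label: smoothed_unigram(texts, vocab, alpha)
        for label, texts in texts_by_label.items()
    }
    return {
        (a, b): js_divergence(dists[a], dists[b])
        for a, b in combinations(sorted(dists), 2)
    }


# Toy data with four hypothetical labels; real inputs would be the dataset's texts.
toy = {
    "A": ["you are awful people", "awful awful people"],
    "B": ["that was a rude remark", "rude and awful remark"],
    "C": ["why do they always do this", "they always start it"],
    "D": ["the weather is nice today", "nice game today"],
}
scores = pairwise_js(toy)
for pair, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(pair, round(value, 3))
```

With four labels the dictionary holds six entries, one per label pair. Pairs printed first have the lowest divergence and would be read as the hardest to tell apart under this measure.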
- Matthews Correlation Coefficient (MCC)