[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

2023-09-21: First draft. This paper appears at KDD 2023. The co-lead author, Sarah Masud, has published numerous papers on hate speech detection.

Additional Notes

Measuring Dataset Difficulty

The authors compare the difficulty of different datasets using the Jensen-Shannon (JS) divergence between the Laplace-smoothed unigram distributions of the texts under each pair of labels. The lower the divergence, the closer the two unigram distributions are, and the harder it is to distinguish texts under that label pair.

For example, the proposed dataset has 4 labels, which leads to $\binom{4}{2} = 6$ divergence measures.
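
To make the measure concrete, here is a minimal sketch of how such pairwise divergences could be computed. It is not the authors' code: the whitespace tokenizer, the smoothing constant `alpha=1.0`, the base-2 logarithm, the helper names, and the toy texts and label names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def tokenize(text):
    # Assumption: lowercase + whitespace split; the paper's tokenizer may differ.
    return text.lower().split()


def smoothed_unigram(texts, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}

    def kl(a):
        # Smoothing keeps every probability positive, so log2 is always defined.
        return sum(a[w] * log2(a[w] / m[w]) for w in a)

    return 0.5 * kl(p) + 0.5 * kl(q)


def pairwise_difficulty(texts_by_label, alpha=1.0):
    """JS divergence for every unordered label pair (lower = harder to separate)."""
    vocab = sorted({tok for texts in texts_by_label.values()
                    for t in texts for tok in tokenize(t)})
    dists = {label: smoothed_unigram(texts, vocab, alpha)
             for label, texts in texts_by_label.items()}
    return {(a, b): js_divergence(dists[a], dists[b])
            for a, b in combinations(sorted(dists), 2)}


# Toy corpus with 4 placeholder labels (hypothetical data, not from the paper).
toy = {
    "label_a": ["the match was great", "great game tonight"],
    "label_b": ["the match was awful", "awful game tonight"],
    "label_c": ["rain expected tomorrow", "sunny skies tomorrow"],
    "label_d": ["new phone released", "phone sales rise"],
}
scores = pairwise_difficulty(toy)
for pair, d in sorted(scores.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

With 4 labels the dictionary holds 6 entries, one per label pair; sorting by value lists the pairs from most to least similar in vocabulary.
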
Matthews Correlation Coefficient (MCC)

Tag: MoE

Reading Notes | Revisiting Hate Speech Benchmarks – From Data Curation to System Deployment
