Sound event detection for smart homes — verifying annotations, quantifying agreement, characterizing the label and feature space, and identifying biases for the modeling phase.
Team A-C · Julian Schmidt · Paul Breburda · GitHub
Agreement on verified recordings ranged from 24 % to 93 %, worst on polyphonic multi-class clips.
Systematic omission
Annotators missed more than half of the events in some clips; training on such labels biases classifiers toward precision at the expense of recall.
Acoustic confusion
bell_ringing / phone_ringing and door_open_close / wardrobe_drawer_open_close repeatedly swapped due to similar mechanical resonances.
Boundary disagreement
Merge vs. split near ~1 s pauses produced different region counts even when class labels matched.
Transient bias
Brief events (keychain, light_switch) missed disproportionately compared to sustained sounds (vacuum, running water).
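The boundary disagreement above can be illustrated with a small sketch. It assumes regions are `(onset, offset)` pairs in seconds for a single class, and that a "merging" annotator bridges any pause of up to about 1 s. The function name `merge_regions`, the `max_gap` parameter, and the example timestamps are hypothetical, not taken from the annotation tool.

```python
def merge_regions(regions, max_gap=1.0):
    """Merge same-class regions whose silent gap is at most max_gap seconds."""
    merged = []
    for onset, offset in sorted(regions):
        if merged and onset - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged


# Hypothetical footsteps annotations for the same clip.
splitter = [(0.0, 2.0), (2.8, 4.5), (7.0, 8.0)]  # splits at the 0.8 s pause
merger = [(0.0, 4.5), (7.0, 8.0)]                # bridges the 0.8 s pause

# Applying the merge rule to the splitter's regions reproduces the merger's regions,
# so the two annotators agree on content but disagree on region count (3 vs. 2).
print(len(splitter), len(merge_regions(splitter, max_gap=1.0)))
```

Normalizing both annotators with the same gap rule before counting regions is one possible way to separate boundary style from genuine disagreement about what was heard.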
Agreement drops monotonically with complexity — below 40 % for polyphonic clips, near 90 % for single-source recordings.
File 002871: two reviewers, polyphonic domestic audio. Agreement: 24.06 %.
| Class | Reviewer 1 | Reviewer 2 |
|---|---|---|
| door_open_close | 0 | 2 |
| footsteps | 1 | 2 |
| keyboard_typing | 1 | 1 |
| keychain | 0 | 1 |
| Total regions | 2 | 6 |
Overlapping sound classes mask each other acoustically — majority vote with only two annotators drops any event marked by exactly one reviewer.
Segment-level agreement compares the two annotators' binary masks per class; on a single pair of intervals this is the same idea as a 1D intersection-over-union.
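A minimal sketch of the intersection-over-union idea, first on two intervals and then on per-segment binary masks for one class. The function names and the example values are illustrative assumptions; the exact agreement formula used for the reported percentages is not specified here.

```python
def interval_iou(a, b):
    """Intersection-over-union of two intervals given as (onset, offset) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    # Degenerate zero-length intervals are scored as 0.0 (an assumption of this sketch).
    return inter / union if union > 0 else 0.0


def mask_iou(mask_a, mask_b):
    """The same ratio on two equal-length binary masks (one entry per segment)."""
    inter = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    union = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    # If neither annotator marked anything, treat it as full agreement (an assumption).
    return inter / union if union > 0 else 1.0


# Two hypothetical annotations of the same event.
print(interval_iou((1.0, 4.0), (2.0, 6.0)))                 # 0.4
print(mask_iou([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1]))     # 0.4
```

With 1 s segments, the mask version is simply the interval version after discretization, which is why both calls give the same value here.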
For one segment/class pair, each annotator casts a binarized vote; the majority label flips once the share of positive votes crosses the threshold line.
~17 % of files have a single annotator; those labels pass through directly after binarization, without majority arbitration.
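The voting behavior described above can be sketched as follows. The strict `>` comparison against a 0.5 threshold is an assumption, chosen because it reproduces both stated effects: with two annotators an event marked by only one is dropped, and with a single annotator the label passes through unchanged. The function name `majority_label` is hypothetical.

```python
def majority_label(votes, threshold=0.5):
    """Return 1 if the share of positive votes strictly exceeds threshold, else 0.

    votes: list of binarized annotator votes (0 or 1) for one segment/class pair.
    """
    if not votes:
        raise ValueError("at least one annotator vote is required")
    return int(sum(votes) / len(votes) > threshold)


print(majority_label([1, 0]))     # 0: two annotators, one vote -> event dropped
print(majority_label([1, 1]))     # 1: both annotators agree
print(majority_label([1]))        # 1: single annotator passes through
print(majority_label([1, 0, 1]))  # 1: a third annotator breaks the tie
```

Under this rule a 1:1 split always resolves to "no event", so two-annotator files systematically lose recall; a tie-break toward the positive label would trade that for more false positives.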
Footsteps dominates (~15.3 %); light_switch is rarest (~0.6 %) — a 24:1 ratio. Multi-label output requires per-class sigmoid heads, not softmax.
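To see why multi-label output calls for independent sigmoids, the sketch below applies both activations to the same logits. The logit values are made up for illustration, and the 0.5 decision threshold is an assumption; only the class names come from the dataset.

```python
import math


def sigmoid(logits):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Mutually exclusive class probabilities; they always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["footsteps", "keyboard_typing", "light_switch"]
logits = [3.0, 2.5, -4.0]  # hypothetical clip where two classes are active at once

sig = sigmoid(logits)
soft = softmax(logits)

active_sigmoid = [c for c, p in zip(classes, sig) if p > 0.5]
active_softmax = [c for c, p in zip(classes, soft) if p > 0.5]

print(active_sigmoid)  # both overlapping classes are detected
print(active_softmax)  # softmax forces the classes to compete, so only one survives
```

Because softmax probabilities sum to 1, at most one class can exceed 0.5, which cannot represent polyphonic segments.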
Kitchen-heavy skew (~26 %) mechanically over-represents kitchen-associated classes.
Power ranges up to 11,140 while flatness and ZCR stay below 1. Converting to z-scores puts all features on a common scale; per-feature normalization is essential before any distance-based classifier.
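A minimal sketch of per-feature z-score normalization with NumPy. The feature values are invented for illustration (only the 11,140 upper bound for power comes from the data), and the helper name `zscore` is hypothetical.

```python
import numpy as np


def zscore(features):
    """Standardize each column to zero mean and unit standard deviation."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant features map to 0 instead of dividing by zero
    return (features - mean) / std


# Columns: power, flatness, zcr (hypothetical frames).
X = np.array([
    [11140.0, 0.02, 0.05],
    [350.0, 0.40, 0.30],
    [12.0, 0.75, 0.60],
])
Z = zscore(X)

# Before scaling, Euclidean distance is driven almost entirely by power.
raw_dist = np.linalg.norm(X[0] - X[1])
scaled_dist = np.linalg.norm(Z[0] - Z[1])
print(raw_dist, scaled_dist)
```

In a modeling pipeline the mean and standard deviation would normally be estimated on the training split only and then reused for validation and test data.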
Energy–flux r = 0.93; MFCC–log-mel r = 0.82. ZCR, centroid, bandwidth, and rolloff form a spectral-shape cluster (r > 0.6). Delta/delta-delta MFCCs are nearly independent — a temporal-dynamics subspace.
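Correlation structure of this kind can be inspected with a Pearson correlation matrix. The sketch below uses synthetic data, not the project's features: `flux` is constructed to depend strongly on `energy`, while `delta_mfcc` is drawn independently, so the numbers only mimic the qualitative pattern reported above. The helper `correlated_pairs` and the 0.6 cutoff applied to it are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three features.
energy = rng.normal(size=n)
flux = energy + 0.4 * rng.normal(size=n)  # strongly coupled to energy by construction
delta_mfcc = rng.normal(size=n)           # independent of the others by construction

names = ["energy", "flux", "delta_mfcc"]
X = np.column_stack([energy, flux, delta_mfcc])

# rowvar=False: columns are variables, rows are observations.
corr = np.corrcoef(X, rowvar=False)


def correlated_pairs(corr, names, threshold=0.6):
    """List feature pairs whose absolute Pearson r exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


for a, b, r in correlated_pairs(corr, names):
    print(f"{a} / {b}: r = {r:.2f}")
```

Pairs flagged this way are candidates for dropping one member or for decorrelating (for example with PCA) before training a model that is sensitive to redundant inputs.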
AI disclosure: Claude Opus 4.6 used for analysis code, LaTeX editing, and this deck. All quantitative claims verified against raw data. Source code