I’d become fascinated with the information-theoretic ideas about perception and redundancy originated by people like Horace Barlow, and particularly their application to sound by researchers like Paris Smaragdis. I’d originally been interested in ICA, but Smaragdis and others were starting to apply NMF to sound around that time.
The basic idea of the paper is: the brain is able to pick apart distinct sounds from the jumbled mixture that hits our eardrums because audio signals generally have structure, musical audio particularly so. If you take a time-frequency transform like a spectrogram of a piece of piano music, you find that it has approximately:
- A sparse representation: each note has a characteristic frequency profile. The complete spectrogram is made by scaling and adding the profiles of all the notes playing simultaneously. In the basis formed by these profiles, the sparse representation consists of the activation level of each note.
- Scaling invariance: scaling a vibrating system physically up or down is equivalent to a translation in log-frequency.
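The two properties above can be sketched in a toy numpy example – all the shapes, partial positions, and activation values here are made up for illustration: a single harmonic template, translated in log-frequency to make each note, scaled by sparse activations, summed into a spectrogram.

```python
import numpy as np

n_bins, n_frames, n_notes = 48, 20, 3
shift_per_note = 4  # log-frequency bins between adjacent notes (arbitrary)

# One toy spectral template: a fundamental plus two partials at
# fixed log-frequency offsets (the offsets and amplitudes are invented)
template = np.zeros(n_bins)
template[[0, 12, 19]] = [1.0, 0.5, 0.25]

# Each note's profile is the SAME template translated in log-frequency
# (shifts here are small enough that np.roll never wraps around)
W = np.stack([np.roll(template, k * shift_per_note) for k in range(n_notes)],
             axis=1)                      # (n_bins, n_notes)

# Sparse activations: each note sounds in only a few frames
H = np.zeros((n_notes, n_frames))
H[0, 0:5] = 1.0
H[1, 5:10] = 0.8
H[2, 10:15] = 0.6

# The spectrogram is the scaled sum of the simultaneously playing notes
V = W @ H                                 # (n_bins, n_frames)
```

Every entry of `V` is non-negative, and `H` is mostly zeros – exactly the structure NMF is built to exploit.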
NMF is a way of obtaining a low-rank approximation of a matrix as the product of two others, subject to the constraint that every element must be non-negative (zero or positive). Amazingly, this simple constraint is enough to discover all kinds of useful structure, including the separation into spectral bases like the note profiles in a spectrogram.
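A minimal sketch of how such a factorization can be computed, using the classic Lee–Seung multiplicative updates for the Frobenius-norm objective (the matrix sizes and iteration count are arbitrary choices, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative data matrix with exact rank-3 structure
V = rng.random((48, 3)) @ rng.random((3, 20))

rank, eps = 3, 1e-9
W = rng.random((48, rank))
H = rng.random((rank, 20))

# Multiplicative updates: every factor in each update is non-negative,
# so W and H stay non-negative automatically
for _ in range(300):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The appeal of the multiplicative form is that non-negativity never has to be enforced explicitly: starting from positive random matrices, the elementwise ratios keep every entry at or above zero.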
Left to its own devices, straight NMF on the spectrogram of the piano piece would likely approximate the spectral profile of each note, but in no particular order. The additional step in this paper is to require that there be only one profile, translated in log-frequency – and this is indeed enough to recover recognizable notes.
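To illustrate the effect of the translation constraint (this is not the paper's algorithm – the paper also learns the template, whereas here it is fixed and only the activations are estimated): build a dictionary whose columns are all log-frequency translations of one template, then fit the activations with the same multiplicative update.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins, n_frames = 48, 20

# A single toy harmonic template (offsets and amplitudes are invented)
template = np.zeros(n_bins)
template[[0, 12, 19]] = [1.0, 0.5, 0.25]

# Dictionary: one column per log-frequency translation of the template
max_shift = n_bins - 20
W = np.stack(
    [np.concatenate([np.zeros(k), template[:n_bins - k]]) for k in range(max_shift)],
    axis=1)                               # (n_bins, max_shift)

# Synthesize a spectrogram from three of those translations
H_true = np.zeros((max_shift, n_frames))
H_true[0, :7] = 1.0
H_true[4, 7:14] = 0.8
H_true[8, 14:] = 0.5
V = W @ H_true

# With W fixed to translated copies of one profile, only H is updated
H = rng.random((max_shift, n_frames))
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)

error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The recovered activations peak at the translations actually used – which is the sense in which the shift constraint turns anonymous basis vectors into recognizable notes.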
There is an enormous and booming literature on sparsity in signal processing, in areas like compressed sensing – the Nuit Blanche blog is a good place to find out more.