
Wow, I find it incredible that this works. As I understand it, the approach is to do a Fourier transform on a couple of seconds of the song to create a 128x128 pixel spectrogram. Each horizontal pixel represents a 20 ms slice in time, and each vertical pixel represents 1/128 of the frequency domain.

Then, treating these spectrograms as images, train a neural net to classify them using pre-labelled samples. Then take samples from the unknown songs and let it classify them. I find it incredible that 2.5 seconds of sound represented as a tiny picture captures enough information for reliable classification, but apparently it does!
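A minimal sketch of that preprocessing step, assuming a 22.05 kHz mono signal; the function name, window choice, and defaults are my own assumptions, not the article's exact values:

```python
import numpy as np

def make_slice_spectrogram(samples, sample_rate=22050,
                           n_time=128, n_freq=128, hop_ms=20):
    """Turn ~2.56 s of mono audio into a 128x128 log-magnitude
    spectrogram (one "slice"), roughly as described above."""
    hop = int(sample_rate * hop_ms / 1000)   # samples per 20 ms column
    win = hop * 2                            # analysis window (assumption)
    cols = []
    for i in range(n_time):
        frame = samples[i * hop : i * hop + win]
        if len(frame) < win:
            frame = np.pad(frame, (0, win - len(frame)))
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(win)))
        # keep the lowest n_freq bins, compress dynamic range with log
        cols.append(np.log1p(spectrum[:n_freq]))
    return np.stack(cols, axis=1)            # shape (n_freq, n_time)

# smoke test: 2.6 s of a 440 Hz tone
t = np.arange(int(22050 * 2.6)) / 22050
img = make_slice_spectrogram(np.sin(2 * np.pi * 440 * t))
print(img.shape)
```

Each column is one 20 ms hop, and 128 columns of 20 ms give the ~2.5 s per slice mentioned above.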



One reason might be that the mentioned genres are highly formulaic to begin with. The standard rap song contains about 2 bars of unique music stretched out over 3 minutes with slight variations. Same with dubstep and techno; all highly repetitive. Classical music has no drums, so you can detect that. Metal has guitar distortion all over the spectrum. So with these examples the spectral images should have enough distinctive features that can be learned. Why should it be different from 'normal' pictures? Also, it looks like they take four 128x128 guesses per song.


If they can write some code that can classify metal into one of its 72 sub-genres then I'll be truly impressed :)

Although I wonder what that would do to the metal scene if their main topic of discussion and contention got completely solved.


Haha, that would be awesome! I guess we'd need a lot of data, and probably a much more detailed spectrogram (both time-wise and frequency-wise).


It's quite possible that it's mainly using even more surface-level audio features, before getting to whether the genres are formulaic or not. For example, if specific mastering studios have telltale production features visible in the audio (choice of dynamic range compression algorithms, mixing approaches, etc.), and some mastering studios mainly master, say, country, you can learn to classify country with pretty high accuracy by just recognizing a half-dozen studios' production signatures, without learning anything fundamental about the genre. Whether this happens depends a lot on your choice of data set and validation method.

There's more on that (and some other pitfalls) in a paper linked elsewhere in the comments here: https://news.ycombinator.com/item?id=13085651


It's true that having very different genres helps the model a lot. It would be much more difficult to distinguish between closer genres, especially when people don't really know which is which and argue all the time about it.


From the description in the walkthrough, it doesn't. The final output looks to be based on 5 of these slices, each providing a probability distribution that influences the final classification.


Sorry if the 5 slices are misleading; that was only for readability. The average song has about 70 slices, which are all classified and used for voting.
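The voting over per-slice predictions could look something like this; a hypothetical sketch, since the article's actual voting code isn't shown here:

```python
from collections import Counter

def vote(slice_predictions):
    """Given the predicted genre for each of a song's ~70 slices,
    return the majority winner as the song-level label."""
    return Counter(slice_predictions).most_common(1)[0][0]

# hypothetical song: 70 slice-level predictions
print(vote(["metal"] * 40 + ["rock"] * 25 + ["techno"] * 5))
```

Even if a single 2.5 s slice is often misclassified, a clear majority across ~70 slices makes the song-level label much more robust.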


I guess the spectrogram behaves like an image in that translating any feature by an arbitrary distance (dx, dy) preserves its predictive power.

But please correct me if I'm wrong.


Yep, you got it right, except that the voting system adds a lot of reliability, because we can't trust any single slice's classification (2.5 s) too much.


I wonder if training another net on top of the slices would work better than voting for a single winner. I'd presume that there are genres that are well characterized by the distribution and progression of their spectrograms. Probably expand/compress the collection of slices to a standard length before training?
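A middle ground between hard voting and a full second-stage net would be "soft" voting: average the per-slice probability distributions before taking the argmax. This is a sketch of that idea, not anything from the article; a stacked model could likewise consume the same matrix of distributions as features:

```python
import numpy as np

def soft_vote(slice_probs):
    """Average per-slice probability distributions and pick the
    genre with the highest mean probability. A second-stage
    classifier could instead take the (n_slices, n_genres)
    matrix as input features."""
    slice_probs = np.asarray(slice_probs)    # shape (n_slices, n_genres)
    return int(slice_probs.mean(axis=0).argmax())

# hypothetical: 3 slices, 2 genres
probs = [[0.6, 0.4], [0.4, 0.6], [0.9, 0.1]]
print(soft_vote(probs))
```

Soft voting keeps the confidence information that hard voting throws away, though unlike a learned second stage it still ignores the order and progression of the slices.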

(Nice to see you show up for the discussion. I was worried that you'd given up hope before your article hit the front page.)



