Music Genre Classification

Kartik Vijayvergiya | 150108017
Sarvesh Raj | 150108033
Sasank Gurajapu | 150108034

Music genre classification has been of interest for a variety of reasons, including the management of large music collections. As these collections become increasingly digital, genre classification is essential for building recommendation systems that can benefit collective music libraries, social networks, and similar services.
The term "genre" is subject to interpretation, and genres are often fuzzy in their definition. While traditionally some genres are classified based on sound and related features, there are several widely accepted music genres that are instead defined on the basis of region, time period, and so on. Despite the lack of a standard criterion for defining genres, classification of music by genre remains one of the broadest and most widely used groupings.
Genre classification has, until now, been done manually, by appending the genre to the metadata of audio files or including it in album information. Our project, however, aims at content-based classification, focusing on information within the audio rather than extraneously appended information. We follow the traditional machine learning approach to classification: create a dataset and extract suitable features from it, train a classifier on the feature data, and make predictions.

1.1 Introduction to Problem

In our project, we aim to classify music by genre using feature extraction followed by machine learning classifiers. Our main focus is to study and analyse how different categories of audio signals carry different information in both the time and the frequency domain, and to exploit this to build features for accurate classification.

We use a standard dataset and extract various features from each soundtrack. These features are then fed to our machine-learning-based classifiers, which are trained and thereafter used to score the possible outcomes.

1.2 Literature Review

Music genre classification is an interesting and relevant problem that people have tackled with a variety of approaches. One prominent piece of research, which set a benchmark for this field, was done by G. Tzanetakis and P. Cook in IEEE Transactions on Speech and Audio Processing, 2002. It used spectral as well as rhythmic features to train classifiers for prediction. The employed feature set has become publicly available as part of the MARSYAS framework (Music Analysis, Retrieval and SYnthesis for Audio Signals) and has been widely used for music genre recognition. Other characteristics, such as Inter-Onset Interval Histogram Coefficients, Rhythm Patterns, and their derivatives Statistical Spectrum Descriptors and Rhythm Histograms, have been proposed in the literature more recently. Several basic spectral features are common across many papers; these include spectrogram-based features such as the roll-off point, spectral centroid, and kurtosis. Another important feature is the Mel-Frequency Cepstral Coefficients (MFCCs). After extracting the features, machine learning classifiers are applied. In addition to traditional K-means, spherical clustering, and bag-of-words based classification (for spectral analysis [5]), improved rates have been reported with Hidden Markov Models or with combinations of several traditional models [6].

1.3 Figure

1.4 Report Organization

  • Introduction
    1. Introduction to Problem
    2. Literature Review
    3. Figure
  • Proposed Approach
  • Methodology
    1. Dataset description
    2. Preprocessing
    3. Framing
    4. Windowing
    5. Extraction of features
    6. Classification
  • Code
  • Experiments and Results
  • Conclusions
    1. Analysis
    2. Summary
    3. Future Extensions
  • References

Building on the approaches surveyed above, our proposed approach is the following:

  • We first proceed with the collection of the dataset. This part was straightforward, as the standard GTZAN dataset is readily available; however, we reduced the database to only 4 genres, since our prime focus is feature extraction.
  • After data collection, we segment the data and reduce its length, since a 30-second audio clip is too bulky a signal to process in one piece.
  • After splitting the signal into short-time frames, we apply a window function to each frame to reduce spectral leakage and filter out discrepant frequencies.
  • The next step is to extract features from the framed, windowed signal; the detailed procedure is described in the Methodology section.
  • Finally, we feed the resulting feature vector into the classifier; the results produced are shown in the Results section.

3.1 Dataset description

We used the GTZAN dataset from the MARSYAS website. It contains 10 music genres; each genre has 100 audio clips in .au format. Since our project mainly focuses on feature extraction, we decided to run the experiment over 4 main genres, i.e., classical, jazz, rock, and metal. Each audio clip is 30 seconds long, a 22050 Hz, mono, 16-bit file. The dataset incorporates samples from a variety of sources, such as CDs, radio, and microphone recordings.

3.2 Preprocessing

The preprocessing involved converting the audio from .au format to .wav format to make it compatible with Python's WAV-reading modules. The free and open-source tool FFmpeg was used for this conversion. The next step was segmenting the audio files into smaller frames to reduce computation time and power.
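Below is a minimal sketch of this conversion step, assuming FFmpeg is installed and available on the system path; the directory names are placeholders for our local layout:

    import subprocess
    from pathlib import Path

    # Convert every GTZAN .au clip to .wav with FFmpeg (assumes `ffmpeg` is on PATH).
    # Directory names are placeholders for the local layout.
    src_dir, dst_dir = Path("gtzan_au"), Path("gtzan_wav")
    dst_dir.mkdir(exist_ok=True)

    for au_file in src_dir.glob("*.au"):
        wav_file = dst_dir / (au_file.stem + ".wav")
        # -y overwrites existing output; FFmpeg infers formats from the extensions.
        subprocess.run(["ffmpeg", "-y", "-i", str(au_file), str(wav_file)], check=True)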

3.3 Framing

After reading the WAV files, we need to split the signal into short-time frames. The rationale behind this step is that frequencies in a signal change over time, so in most cases it does not make sense to take the Fourier transform across the entire signal: we would lose the frequency contours of the signal over time. Instead, we can safely assume that frequencies in a signal are stationary over a very short period of time. By taking a Fourier transform over each short-time frame, we obtain a good approximation of the frequency contours of the signal by concatenating adjacent frames. Typical frame sizes in speech processing range from 20 ms to 40 ms with roughly 50% (+/-10%) overlap between consecutive frames. The settings we used are a 25 ms frame size, frame_size = 0.025, and a 10 ms stride between frame starts (15 ms overlap).
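A sketch of this framing step in NumPy, reading a converted clip with scipy.io.wavfile (the filename is illustrative):

    import numpy as np
    from scipy.io import wavfile

    sample_rate, signal = wavfile.read("classical.00000.wav")  # illustrative filename

    frame_size, frame_stride = 0.025, 0.01               # 25 ms frames, 10 ms stride
    frame_length = int(round(frame_size * sample_rate))  # samples per frame
    frame_step = int(round(frame_stride * sample_rate))  # samples between frame starts
    num_frames = 1 + int(np.ceil((len(signal) - frame_length) / frame_step))

    # Zero-pad so the last frame is complete, then gather frames by index.
    pad_length = (num_frames - 1) * frame_step + frame_length
    padded = np.append(signal, np.zeros(pad_length - len(signal)))
    indices = (np.tile(np.arange(frame_length), (num_frames, 1)) +
               np.tile(np.arange(0, num_frames * frame_step, frame_step),
                       (frame_length, 1)).T)
    frames = padded[indices]              # shape: (num_frames, frame_length)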

3.4 Windowing

After slicing the signal into frames, we apply a window function, such as the Hamming window, to each frame. A Hamming window has the form w[n] = 0.54 − 0.46 cos(2πn/(N−1)), where 0 ≤ n ≤ N−1 and N is the window length. There are several reasons to apply a window function to the frames, notably to counteract the FFT's implicit assumption that the data is infinite and to reduce spectral leakage.
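Continuing the sketch, applying the window is a single broadcast multiplication; np.hamming evaluates exactly the equation above:

    import numpy as np

    # Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) for N points.
    frames = frames * np.hamming(frame_length)   # window every frame at once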

3.5 Extraction of features

  • Fourier transform and power spectrum: We can now take an N-point FFT of each frame to calculate the frequency spectrum, also called the Short-Time Fourier Transform (STFT), where N is typically 256 or 512 (we used NFFT = 512), and then compute the power spectrum (periodogram) using P = |FFT(x_i)|^2 / N, where x_i is the i-th frame of signal x. (A NumPy sketch of this and the following feature computations appears after this list.)

  • Spectral centroid: This describes where the "centre of mass" of the sound is; it is essentially the weighted mean of the frequencies present in the sound. Consider two songs, one blues and one metal. A blues song is generally consistent throughout its length, while a metal song usually has more frequencies accumulated towards its end, so the spectral centroid of the blues song will lie somewhere near the middle of its spectrum, while that of the metal song will usually lie towards the end.

  • Mean and variance of the spectral centroid: This feature describes the centre frequency at which most of the power in the signal (in the time frame examined) is found. Music signals have high-frequency noise and percussive sounds that result in a high spectral mean, whereas in speech signals the pitch stays in a narrower range of low values; as a result, music has a higher spectral centroid than speech. The spectral centroid for a frame occurring at time t is computed as C_t = Σ_k k·X_t(k) / Σ_k X_t(k), where k is an index corresponding to a frequency and X_t(k) is the power of the signal at the corresponding frequency band.

  • Spectral roll-off: This feature is the value of the frequency below which 95% of the power of the signal resides. As mentioned before, the power in music is concentrated in the higher frequencies, whereas speech has a range of low-frequency power. The frequency value V is found from Σ_{k ≤ V} X(k) = 0.95 · Σ_k X(k), where X(k) is the DFT of x(t); the left-hand side is the sum of the power below the frequency value V, and the right-hand side is 95% of the total signal power of the time frame.
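The following NumPy sketch continues from the frames array built in the framing and windowing steps and computes the above features; the small epsilon guarding against silent frames is our own addition:

    import numpy as np

    NFFT = 512
    # STFT magnitude of each windowed frame, then the periodogram:
    # P = |FFT(x_i)|^2 / NFFT.
    mag_frames = np.abs(np.fft.rfft(frames, NFFT))    # (num_frames, NFFT//2 + 1)
    pow_frames = (mag_frames ** 2) / NFFT

    # Spectral centroid per frame: power-weighted mean frequency, then its
    # mean and variance across frames as two summary features.
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / sample_rate)   # bin centres in Hz
    centroids = pow_frames @ freqs / (pow_frames.sum(axis=1) + 1e-12)
    centroid_mean, centroid_var = centroids.mean(), centroids.var()

    # Spectral roll-off per frame: lowest frequency V with 95% of the
    # frame's power at or below it.
    cum_power = np.cumsum(pow_frames, axis=1)
    rolloff_bins = np.argmax(cum_power >= 0.95 * cum_power[:, -1:], axis=1)
    rolloff_freqs = freqs[rolloff_bins]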

3.6 Classification

Support Vector Machine (SVM) and Back-Propagation Neural Network (BPNN) learning algorithms were used to classify the genre classes of music by learning from the training data. Experimental results show that the proposed audio classification scheme is very effective: the BPNN achieved an accuracy rate of 95%, compared to 83% for the SVM. Neural networks and SVMs were chosen for their good classification and training accuracy among machine learning algorithms.
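A minimal sketch of this classification stage with scikit-learn; the feature matrix X and label vector y here are randomly generated placeholders, and the hyperparameters are illustrative rather than the exact settings behind the accuracies quoted above:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder data so the sketch runs standalone; in the project each row
    # holds the spectral features extracted from one clip.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 8))     # 400 clips x 8 features (placeholder)
    y = np.repeat(["classical", "jazz", "rock", "metal"], 100)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    bpnn = make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64,),
                                       max_iter=1000, random_state=0))

    for name, clf in [("SVM", svm), ("BPNN", bpnn)]:
        clf.fit(X_train, y_train)
        print(name, "test accuracy:", clf.score(X_test, y_test))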

All the code used to perform the tasks described above can be found here.

[Figures: FFT, power spectral density, spectral roll-off, and spectral centroid plots for each of the four genres (Classical, Rock, Jazz, Metal).]

6.1 Analysis

Fast Fourier Transform & Power Spectral Density:

Looking at the FFT and PSD, we can observe the highest-magnitude components at the extreme frequencies in classical and jazz, whereas in rock and metal they lie in the midrange frequencies. The magnitude of jazz is surprisingly low, at almost one tenth that of the other genres. The midrange frequencies have higher magnitude in rock, indicating that rock songs carry more energy there than those of all the other genres.

Spectral Roll-off:

We can observe that the roll-off of classical tends to remain constant over certain frequency ranges throughout the spectrum, which suggests that classical music changes in intensity gradually over time. On the other hand, metal tends to fluctuate very rapidly, reflecting its frequent rises and falls in musical notes. Comparing rock and jazz, we observe that rock has several impulses across the spectrum against the limited number of impulses in jazz.

Spectral Centroid:

The spectral centroid of jazz clearly shows that the frequency components are concentrated around the central frequency, resulting in high values. The spectral centroid of rock has higher magnitudes at low frequencies, while that of metal has lower magnitudes at high frequencies. The spectral centroid of classical is unevenly spread across the spectrum, with irregular patterns and impulses at various frequencies.

6.2 Summary

Summarising our work, we learned how to deal with the real-life challenges of working with digital audio. Our main focus was to examine how audio from different genres differs in its spectral content. We extracted the basic relevant spectral features for different audio clips and compared the differences, first by manual analysis and then by feeding the data into machine learning classifiers. In the end, the features extracted were the fast Fourier transform, power spectral density, spectral centroid, and spectral roll-off.

6.3 Future Extensions

While we have classified a restricted number of genres from a given dataset, we could use additional, more powerful features, such as Local Binary Patterns and octave-based spectral features, to improve genre accuracy and detect more genres.

After improving at the feature level, we could also use better classifiers. Convolutional neural networks (CNNs) could be trained to classify with improved accuracy, and existing classifiers could be combined into ensembles.

Once higher accuracy is achieved and all standard accepted genres are detected with commendable accuracy, the system can be applied to real-life projects. Genre classification has a lot of scope in industry and also at the consumer end.

It can be used to tag and classify millions of songs across streaming services and databases like MusixMatch and Spotify. Consumers can also benefit when machine learning algorithms incorporate genre classification and related methods to improve song recommendations across various services.

We can also try classifying region-specific genres (like Indian classical, Bollywood, and world music) into more conventional genres (rock, jazz, etc.) to standardise genres across all platforms and improve recommendations.

With suitable modifications, such as improved robustness to ambient noise, this system could even detect the genres of songs at live band performances.

These are some of the ways this project can be extended into more accurate, consumer-oriented applications.