Audio-fingerprinting via the K-SVD algorithm


Bibliographic record details
Main author: Σαραβάνου, Χριστίνα
Other authors: Saravanou, Christina
Language: English
Published: 2020
Subjects:
Available online: http://hdl.handle.net/10889/14185
Description
Abstract: Music has always played an elemental and unique role in human entertainment and communication. At the end of the 19th century, music was first employed, and has since been established, as a fundamental tool for scientific, medical and educational purposes. In the mid-1960s, a novel research field known as Music Information Retrieval (MIR) emerged, which aspires to solve various music-related problems, such as song identification and music genre classification, by combining signal processing and Information Retrieval (IR) techniques. Song identification has always been one of the most challenging problems in MIR. Over the years several approaches have been proposed to solve it, the most promising being the audio-fingerprinting scheme. The audio-fingerprinting paradigm, introduced in the 1990s, aims to construct a unique and concise representation of an audio track's signal content, similar to a human fingerprint (hence the name). In the last twenty or so years, several variants of the original audio-fingerprinting scheme have been proposed. Some rely on conventional signal processing and statistical approaches, e.g. the Short-Time Fourier Transform (STFT), while others employ methods and concepts from more recent schemes, such as the Matching Pursuit (MP) algorithm and time-frequency dictionaries, which are used by the Compressive Sensing (CS) and Dictionary Learning (DL) paradigms. This thesis introduces a new variant of the audio-fingerprinting scheme, which employs the Orthogonal Matching Pursuit (OMP) and K-SVD algorithms, two state-of-the-art techniques from the CS and DL paradigms respectively, to construct unique and concise representations of audio signals in order to identify their content.
In particular, the suggested approach first creates a global dictionary via the K-SVD algorithm from several tens of audio tracks, which make up the database. The dictionary is intended not only to capture the acoustic/perceptual attributes of the audio signals, but also to support the descriptiveness, robustness and discriminability of the audio fingerprints. Next, the songs in the database and the audio excerpt of an unknown track, which is used during the querying process, are each segmented into audio frames. The sparse representations of both the audio tracks and the clip are then computed via the OMP algorithm and the dictionary. The atoms, i.e. the coefficients of an audio signal's sparse representation, with the highest weight values are extracted and treated as equivalent to the most descriptive points of the respective signal's spectrogram. The landmark pairs, namely four-point peak-pairs, are constructed from the selected atoms and are used to build hash tables. The proposed scheme constructs a separate hash table for every audio track in the database and for the audio clip. During the querying process, several voting methods are employed to determine which song the audio segment was extracted from, and the metadata of that song is returned. During the evaluation of the proposed variant, several experiments were performed with dictionaries of different dimensions. The dictionaries were constructed from the content of various audio tracks, extracted from two datasets of different sizes, in both the temporal and the spectral domains. The proposed technique was first assessed to determine which dictionary, i.e. which dictionary size and domain, provides the most accurate results.
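The sparse-coding and atom-selection steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the dictionary here is a fixed toy dictionary (unit spikes plus DCT cosines) standing in for one learned with K-SVD, and the names `omp`, `strongest_atoms`, `n_nonzero` and `n_keep` are hypothetical. The sketch assumes NumPy and a dictionary with unit-norm columns.

```python
import numpy as np


def omp(D, x, n_nonzero):
    """Greedy OMP: approximate x with at most n_nonzero columns (atoms) of D.

    Assumes D has unit-norm columns and shape (frame_len, n_atoms).
    """
    residual = x.astype(float).copy()
    support = []                       # indices of the atoms chosen so far
    sol = np.zeros(0)                  # coefficients of the chosen atoms
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break                      # no new atom improves the fit
        support.append(k)
        # Re-fit all chosen atoms jointly (the "orthogonal" step).
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    coefs = np.zeros(D.shape[1])
    coefs[support] = sol
    return coefs


def strongest_atoms(coefs, n_keep):
    """Indices of the n_keep largest-magnitude nonzero coefficients, strongest first."""
    order = np.argsort(-np.abs(coefs))[:n_keep]
    return [int(i) for i in order if coefs[i] != 0.0]


# Toy dictionary (assumption): 64 unit spikes followed by 64 orthonormal DCT-II cosines.
n = 64
t = np.arange(n)
C = np.sqrt(2.0 / n) * np.cos(np.pi * (t[:, None] + 0.5) * t[None, :] / n)
C[:, 0] /= np.sqrt(2.0)
D = np.hstack([np.eye(n), C])          # shape (64, 128)

# One "audio frame" built from a spike (atom 5) and a cosine (atom 64 + 10).
x = 2.0 * D[:, 5] - 1.5 * D[:, 74]
coefs = omp(D, x, n_nonzero=2)
print(strongest_atoms(coefs, 2))       # [5, 74]
```

In a full pipeline each frame of every track and of the query clip would be coded this way, and the selected atom indices per frame would play the role that spectrogram peaks play in classical landmark-based fingerprinting.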
Moreover, the suggested audio-fingerprinting technique was evaluated for robustness, namely whether an audio clip that has been distorted by ambient noise can be identified. Furthermore, the scheme was tested on whether an audio clip extracted from a song that did not take part in the learning process can be identified. In every case, the suggested variant of the audio-fingerprinting scheme produced promising results.
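The landmark hashing and voting steps mentioned in the abstract can be sketched as follows. The abstract does not specify the hash layout or the voting rules, so this sketch assumes a common landmark design: each peak is a (frame index, atom index) point, a landmark pairs two peaks, the hash key is (first atom, second atom, frame gap) and the stored value is the anchor frame. Voting here counts landmarks that agree on a single time offset, which is only one possible voting method. All function names and the parameters `fan_out` and `max_dt` are hypothetical.

```python
from collections import Counter, defaultdict


def landmarks(peaks, fan_out=3, max_dt=20):
    """Pair each peak with up to fan_out later peaks at most max_dt frames ahead.

    peaks: list of (frame_index, atom_index). Yields ((a1, a2, dt), t1).
    """
    peaks = sorted(peaks)
    for i, (t1, a1) in enumerate(peaks):
        paired = 0
        for t2, a2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt or paired == fan_out:
                break
            if dt > 0:
                paired += 1
                yield (a1, a2, dt), t1


def build_table(peaks):
    """Hash table for one recording: landmark key -> list of anchor frames."""
    table = defaultdict(list)
    for key, t1 in landmarks(peaks):
        table[key].append(t1)
    return table


def identify(clip_peaks, track_tables):
    """Return (best_track, votes), where votes counts the matching landmarks
    that agree on the most common track-minus-clip time offset."""
    clip_table = build_table(clip_peaks)
    scores = {}
    for name, table in track_tables.items():
        offsets = Counter()
        for key, clip_times in clip_table.items():
            for t_track in table.get(key, ()):
                for t_clip in clip_times:
                    offsets[t_track - t_clip] += 1
        scores[name] = max(offsets.values(), default=0)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Synthetic peaks (assumption): one selected atom per frame for two made-up songs.
song_a = [(t, (7 * t) % 50) for t in range(40)]
song_b = [(t, (11 * t + 3) % 50) for t in range(40)]
tables = {"song_a": build_table(song_a), "song_b": build_table(song_b)}

# Query clip: frames 10..24 of song_a, re-timed to start at frame 0.
clip = [(t - 10, a) for t, a in song_a[10:25]]
best, votes = identify(clip, tables)
print(best, votes)                     # song_a 39
```

Voting on a consistent time offset, rather than on the raw number of matching keys, is what lets a short excerpt taken from the middle of a track still line up with the track's table.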