
A Knowledge Discovery Model of Identifying Musical Pitches and Instrumentations in Polyphonic Sounds

Rory A. Lewis, Xin Zhang, Zbigniew W. Raś
University of North Carolina, Comp. Science Dept., 9201 Univ. City Blvd., Charlotte, NC 28223, USA

Abstract

Pitch and timbre detection methods applicable to monophonic digital signals are common. Conversely, successful detection of multiple pitches and timbres in polyphonic time-invariant music signals remains a challenge. A review of these methods, sometimes called "Blind Signal Separation", is presented in this paper. We analyze how musically trained human listeners overcome resonance, noise, and overlapping signals to identify and isolate which instruments are playing, and then which pitch each instrument is playing. The part of the instrument- and pitch-recognition system presented in this paper that identifies the dominant instrument in a base signal uses temporal features proposed by Wieczorkowska [1] in addition to the standard 11 MPEG-7 features. After retrieving a semantic match for that dominant instrument from the database, the system creates a foreign set of features to form a new synthetic base signal that no longer bears the previously extracted dominant sound. The system may repeat this process until all recognizable dominant instruments in the segment are accounted for. The proposed methodology incorporates Knowledge Discovery, MPEG-7 segmentation, and Inverse Fourier Transforms.

Key words: MPEG-7; Polyphonic; MIR; Fourier Transforms; Pitch Detection; Independent Component Analysis; Instrument Detection; Blind Signal Separation.

1 Introduction

Blind Signal Separation (BSS) and Blind Audio Source Separation (BASS) have recently emerged as the subjects of intense work in the fields of Signal Analysis and Music Information Retrieval. This paper focuses on the separation of harmonic signals of musical instruments from a polyphonic domain for the purpose of music information retrieval.
First, this paper reviews the state of the art in signal analysis, particularly Independent Component Analysis and Sparse Decompositions. Next, it reviews music information retrieval systems that blindly identify sound signals. We then present a new approach to the separation of harmonic musical signals in a polyphonic time-invariant music domain and, second, the construction of new correlating signals which include the inherent remaining noise. These signals represent new objects which, when included in the database, improve with its continued growth the accuracy of the classifiers used for automatic indexing.

This paper was not presented at any IFAC meeting. Corresponding author Zbigniew W. Raś, Tel. (704) 687-4567, Fax (704) 687-3516. Email addresses: (Rory A. Lewis), (Xin Zhang), (Zbigniew W. Raś).

1.1 Signal Analysis

In 1986, Jutten and Herault proposed the concept of Blind Signal Separation (see Appendix A) as a novel tool to capture clean individual signals from noisy signals containing unknown, multiple, and overlapping signals [9]. The Jutten and Herault model comprised a recursive neural network for finding the clean signals, based on the assumption that the noisy source signals were statistically independent. Researchers in the field began to refer to this noise as the "cocktail party" property, after the undefinable buzz of incoherent sounds present at a large cocktail party. By the mid-1990s, researchers in neural computation, finance, brain signal processing, general biomedical signal processing, and speech enhancement, to name a few, embraced the algorithm. Two models dominate the field: Independent Component Analysis (ICA) [3] and Sparse Decompositions (SD) [19].

Preprint submitted to Automatica, 11 March 2006

1.1.1 Independent Component Analysis

ICA originally began as a statistical method that expressed a set of multidimensional observations as a combination of unknown latent variables [9].
The principal idea behind ICA is to reconstruct these latent, sometimes called dormant, signals as hypothesized independent sequences, where k is the number of unknown independent mixtures from the unobserved independent source signals:

x = f(Θ, s),  (1)

where x = (x_1, x_2, ..., x_m) is an observed vector and f is a general unknown function with parameters Θ [2] that operates on the variables listed in the vector s = (s_1, ..., s_n):

s(t) = [s_1(t), ..., s_k(t)]^T.  (2)

Here a data vector x(t) is observed at each time point t, such that given any multivariate data, ICA can decorrelate the original noisy signal and produce a clean linear coordinate system using

x(t) = A s(t),  (3)

where A is an n × k full-rank scalar matrix. For instance (Fig. 1), if a microphone receives input from a noisy environment containing a jet fighter, an ambulance, people talking, and a speaker-phone, then x_i(t) = a_i1 s_1(t) + a_i2 s_2(t) + a_i3 s_3(t) + a_i4 s_4(t), with i = 1, ..., 4. Rewriting this in vector notation, it becomes x = A s. For example, for a two-dimensional vector x = [x_1 x_2]^T, ICA finds the decomposition

[x_1; x_2] = [a_11; a_21] s_1 + [a_12; a_22] s_2,  (4)

x = a_1 s_1 + a_2 s_2,  (5)

where a_1, a_2 are basis vectors and s_1, s_2 are basis coefficients.

1.1.2 Sparse Decomposition

Sparse decomposition was first introduced in the field of image analysis by Field and Olshausen [18]. Nowadays, the most general SD algorithm is probably Zibulevsky's, where the resulting optimization is based on two factors: the output vector's entropy and its sparseness. Similar to ICA, in SD the observed signal x(t) is the product of the unknown n × k mixing matrix A and the source signals s(t), plus additive noise ξ(t), where n is the number of sensors and k is the number of unknown scalar source signals:

x(t) = A s(t) + ξ(t).  (6)

Fig. 1. A noisy cocktail party
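To make the noiseless mixing model of Eqs. (3)-(5) concrete, the following is a minimal sketch, not part of the authors' system: two synthetic, statistically independent sources are mixed by a known full-rank matrix A, and a small from-scratch FastICA iteration (a standard ICA algorithm, named here as a substitute since the text does not fix one) recovers them from the mixtures alone. All signals and values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, non-Gaussian sources (Eq. 2): a sine and a sawtooth.
t = np.linspace(0, 8, 4000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),
               2 * (3 * t % 1) - 1])

# "Unknown" full-rank mixing matrix A (Eq. 3): the observed mixtures x = A s.
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
x = A @ s

# Whiten the observations: zero mean, identity covariance.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = (E @ np.diag(d ** -0.5) @ E.T) @ x

# FastICA with a tanh nonlinearity, one component at a time (deflation).
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(w @ z)
        w_new = (z * g).mean(axis=1) - (1 - g ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # stay orthogonal to found rows
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-9
        w = w_new
        if converged:
            break
    W[i] = w

# Recovered sources, up to permutation and sign (the classic ICA ambiguity).
s_hat = W @ z
```

Each recovered row of s_hat correlates almost perfectly with one of the true sources, which is all ICA can promise: the order, sign, and scale of the components are not identifiable.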
The signals are "sparsely" represented in a signal dictionary [24]:

s_i(t) = Σ_{k=1}^{K} C_{ik} ϕ_k(t),  (7)

where the C_{ik} are coefficients and the ϕ_k are the atoms of the dictionary.

1.2 Music Information Retrieval

In the field of Music Information Retrieval, systems that analyze polyphonic time-invariant music signals operate in the time domain [7], the frequency domain [21], or both domains simultaneously [13]. Kostek takes a different approach and instead divides BSS algorithms into those operating on multichannel sources and those operating on single-channel sources: multichannel sources detect signals from various sensors, whereas single-channel sources are typically harmonic [6]. For clarity, the experiments presented herein switch between the time and frequency domains but, more importantly, under Kostek's taxonomy they fall into the multichannel category because, at this point of experimentation, two harmonic signals are presented for BSS. In the future, a polyphonic signal containing a harmonic and a percussive signal may be presented.

1.2.1 BSS in MIR: A Brief Review

In 2000, Fujinaga and MacMillan created a real-time system for recognizing orchestral instruments using an exemplar-based learning system that incorporated a k-nearest-neighbor classifier (k-NNC) [8] with a genetic algorithm to recognize monophonic tones in a database of 39 timbres taken from 23 instruments. Also in 2000, Eronen and Klapuri created a musical instrument recognition system that modeled the temporal and spectral characteristics of sound signals [11]. The classification system used thirty-two spectral and temporal features and signal-processing algorithms that measured the features of the acoustic signals. The Eronen system was a step forward in BSS because it was pitch-independent and successfully isolated tones of musical instruments across the full pitch range of 30 orchestral instruments played with different articulations.
Both hierarchical and direct forms of classification were evaluated using 1498 test tones obtained from the McGill University Master Samples (MUMS) CDs, including "home-made" recordings from amateur musicians.

In 2001, Zhang constructed a multi-stage system that segmented the music into its individual notes, performed harmonic partial estimation on a polyphonic source, and then normalized the features for loudness, length, and pitch [23]. The features included 1) temporal features accounting for rising speed, degree of sustain, degree of vibration, and releasing speed, and 2) spectral features accounting for the spectral energy distribution between low, middle, and high frequency sub-bands and for partial-harmonic properties such as brightness, inharmonicity, tristimulus, odd-partial ratio, irregularity, and dormant tones. Zhang's system successfully identified instruments playing in polyphonic music pieces; in one case the polyphonic source contained 12 instruments: cello, viola, violin, guitar, flute, horn, trumpet, piano, organ, erhu, zheng, and sarod. The significance of Zhang's system lay in the manner in which it used artificial neural networks to find the dominant instrument. First it segmented each piece into notes and then categorized the music based on which instrument played the most notes, weighting this number by the likelihood value of each note when classified to that instrument. For example, if all the notes in the music piece were grouped into K subsets I_1, I_2, ..., I_K, with I_i corresponding to the i-th instrument, then a score for each instrument was computed as

s_{I_i} = Σ_{x ∈ I_i} O_i(x),  i = 1, ..., K,  (8)

where x denotes a note in the music piece and O_i(x) is the likelihood that x will be classified to the i-th instrument. Next, Zhang normalized the scores, dividing each by their sum, to satisfy the condition

Σ_{i=1}^{K} s_{I_i} = 1.  (9)

It is interesting to note the similarity between this and Zibulevsky's Eq. (7) above.
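Zhang's dominant-instrument scoring of Eqs. (8) and (9) can be sketched as follows. The instruments and per-note likelihood values below are hypothetical, chosen only to illustrate the sum-and-normalize step.

```python
# Hypothetical per-note likelihoods O_i(x): each note in the piece has been
# assigned to an instrument subset I_i with a classification likelihood.
notes_by_instrument = {
    "violin": [0.91, 0.85, 0.88, 0.79],   # I_1: four notes
    "piano":  [0.72, 0.64],               # I_2: two notes
    "flute":  [0.55],                     # I_3: one note
}

# Eq. (8): the raw score of instrument i is the sum of the likelihoods
# of the notes classified to it.
scores = {inst: sum(liks) for inst, liks in notes_by_instrument.items()}

# Eq. (9): divide by the total so the scores sum to 1, then take the
# dominant instrument as the one with the highest normalized score.
total = sum(scores.values())
scores = {inst: sc / total for inst, sc in scores.items()}
dominant = max(scores, key=scores.get)   # "violin"
```

Normalizing makes scores comparable across pieces of different lengths, which is what lets a single threshold or ranking be applied uniformly.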
Zhang used 287 monophonic and polyphonic music pieces and reached an accuracy of 80% in identifying the dominant instrument, and 90% if intra-family confusions could be dismissed. Classification in Zhang's system incorporated a Kohonen self-organizing map to select the optimal structure of each feature vector.

In 2002, Wieczorkowska, collaborating with Ślęzak, Wróblewski, and Synak [1], used MPEG-7-based features to create a testing database for training classifiers used to identify musical instrument sounds. She used seventeen MPEG-7 temporal and spectral descriptors, observing the trends in the evolution of the descriptors over the duration of a musical tone, their combinations, and other features. Wieczorkowska compared the classification performance of the k-NNC and rough-set classifiers using various combinations of features. Her results showed that the k-NNC classifier outperformed the rough-set classifiers by far.

In 2003, Eronen and Agostini, in separate tests, both examined the viability of decision-tree classifiers in Music Information Retrieval, and both found that decision-tree classifiers worsened the classification results. Eronen's system recognized groups of musical instruments from isolated notes using Hidden Markov Models [4]; it classified the instruments into groups such as strings or woodwinds, not as individual instruments. Agostini's system [16] tested a monophonic base of 27 instruments using eighteen temporal and spectral features with a number of classification procedures, to determine which procedure worked most effectively. The experimentation used several classical methods, including canonical discriminant analysis, quadratic discriminant analysis, and support vector machines. Agostini's support-vector tests yielded 70% accuracy on individual instruments and 81% accuracy on groups of instruments. As in this paper's experiments, Agostini's classifiers were MPEG-7 based.
The experiments used 18 descriptors for each tone, computing the mean and standard deviation of 9 features over the length of each tone. Agostini's system used a 46 ms window for the zero-crossing rate, procuring measurements directly from the waveform as the number of sign inversions. To obtain a usable number of harmonics, a pitch-tracking algorithm controlled each signal by first analyzing it at a low frequency resolution and repeating at smaller resolutions until a sufficient number of harmonics was estimated. Interestingly, they used a variable window size to obtain a frequency resolution of at least 1/24 of an octave. The team evaluated the harmonic structure of their signals with FFTs using half-overlapping windows.

In 2004, Kostek developed a three-stage classification system that successfully identified up to twelve instruments played under a diverse range of articulations [12]. The manner in which Kostek designed her stages of signal preprocessing, feature extraction, and classification may prove to be the standard in BSS MIR. In the preprocessing stage Kostek incorporates 1) the average magnitude difference function and 2) Schroeder's histogram for purposes of pitch detection. Her feature-extraction stage extracts three distinct sets of features: fourteen FFT-based features, MPEG-7 standard feature parameters, and wavelet-analysis features. In the final stage, for classification, Kostek incorporates a multi-layer ANN classifier. Importantly, Kostek concluded that she retrieved the strongest results when employing a combination of both MPEG-7 and wavelet features; performance deteriorated as the number of instruments increased.

2 Experiments

Stepping back and reviewing Kostek, Zhang, and Agostini, it became apparent to the authors that BSS works diametrically in opposition to the manner in which trained human listeners segment polyphonic sources of music.
When presented with a polyphonic source signal, trained humans overcome resonance, noise, and the complexity of instruments playing simultaneously to identify and isolate which instruments are playing, and then identify which pitch each instrument is playing. The basis for the BSS system presented in this paper began with the authors thinking very carefully about how humans, versus classical MIR systems, identify sounds in polyphonic sources. A small, anecdotal test formed the seed for the system presented herein.

2.0.2 Trained Human Beings and BSS: A Mini Experiment

In the Fall of 2006, in order to get a sense of how humans listen to music, one of the authors, Lewis, took an original piece of music he had composed and performed with his band, changed it slightly, and tested the band members as follows. Lewis knew these results would be anecdotal and unscientific, but he was intrigued by what the outcome would be. Each band member was intimately familiar with the song and its instrumentation: they were present when Lewis composed it, they recorded it over the course of weeks in a studio, and they performed it live in front of audiences many hundreds of times. Lewis made four new versions. Version 1 omitted the kick-drum and cymbal tracks from the drums. Version 2 changed some bass notes and omitted others. Version 3 swapped horn sections around and changed the pitch of the horn at six sections. Finally, in Version 4, Lewis extracted the guitar part and inserted three never-before-played chords into the song. Lewis asked each member to listen to the versions of the song, except the version in which Lewis had changed that member's own instrument. For example, Version 3 contained changes to the horn section, so the horn player listened to Versions 1, 2, and 4, not Version 3, where he would immediately hear that his horn solos were swapped.
As the horn player listened to Versions 1, 2, and 4, he began to get bored. Upon being asked to listen carefully for what had changed, he could not hear the missing drum tracks in Version 1, the missing and changed bass guitar in Version 2, or the changed guitar tracks in Version 4. In fact, no member of the band could hear any changes to the other instruments, even when asked specifically to listen for them, except in one instance: the bass player identified one of the 14 changes in the guitar track and asked whether it was an "earlier" version in which Lewis had played the guitar track differently.

The authors concluded that trained musicians practically block out instruments they are not interested in. The bass player was interested in one particular guitar section because he cued one of his solos off the timing of the missing note; at that moment he would tune into the guitar and then block it out as he played his solo. The issue became: how do musicians block out sound? How do New Yorkers block out the constant horn honking and the ambulance and police sirens so they can fall asleep, and, conversely, how do farmers block out animal sounds so they can fall asleep? The answer, for the purposes of this paper, is that we do not know how humans block out sound, but clearly they do. Moreover, as an in-depth study of Kostek, Zhang, and Agostini shows, the systems developed to date transmute the signal into the frequency domain and manipulate it by focusing on the dominant timbres, pitches, cepstra, tristimuli, and frequencies, to name a few. The common factor in all of the above is that only the original sound source is used. In other words, none of the above inserts a foreign entity into the equation, as humans probably do. Nor does any of the above approaches train its classifiers using artificial samples of music objects produced by an MIR system.

2.1 Trained Human Beings and New Instruments

A human who has never heard a South African Zulu penny whistle cannot not hear it until he or she has heard it a few times.
Typically, Lewis's band members, like most experienced musicians in bands, can hear a song, listen to the counterpart instruments playing in it, and play it almost immediately, except when the song contains an instrument they have never heard and thus cannot block out. This became evident when Lewis brought back to the USA recordings of songs he had purchased in Johannesburg. The band members were not able to focus on anything, let alone their own instrument parts, because the new instrument, the Zulu penny whistle, could not be blocked out. Why?

The authors believe the answer lies in the fact that because the band members had never heard a Zulu penny whistle, they had no past data of Zulu penny whistle sounds that could be used to block them out and enable them to focus on their counterpart in the song. Again, this led the authors to believe that humans use a set of sounds in their heads to block out noise in a song so they can focus on exactly the portion of the song they want to listen to. The seminal question the authors asked is the one that led them to develop the system presented in this paper: a system that uses foreign entities to block out signals in polyphonic signals.

Fig. 2. 5C Piano @ 44,100 Hz, 16 bit, stereo

2.2 Overview of the System

In short, when the system reads a polyphonic source, it identifies a dominant aspect of the source, finds its match in the database, and inserts this foreign entity into the polyphonic source to do what humans do, i.e., block out the portion of the original sound it is not interested in.

To perform the experiments, the system analyzes four separate versions of a polyphonic source (see the samples in Figures 3 to 6 below) containing two harmonic continuous signals obtained from the McGill University Master Samples (MUMS) CDs. These versions contain a mix of samples one and two with various levels of noise. Specifically, the first sample contains a C at octave 5 played on a nine-foot Steinway, recorded at 44,100 Hz in 16-bit stereo (Fig. 2). The second sample contains an A at octave 3 played on a B-flat clarinet, recorded at 44,100 Hz in 16-bit stereo (Fig. 3). The third sample contains a mix of the first and second samples with no noise added, produced with Sony's Sound Forge 8.0 as a pure mix recorded at 44,100 Hz in 16-bit stereo (Fig. 4). Similarly, the fourth sample contains a mix of the first and second samples with noise added at -17.8 dB (-12.88%) (Fig. 5). The fifth sample contains a mix of the first and second samples with noise added at -36.05 dB (-1.58%) (Fig. 6). Finally, the sixth sample contains a mix of the first and second samples with noise added at -8.5 dB (-37.58%) (Fig. 7).

2.2.1 Formal Procedure

In explaining the system procedures, reference will be made to the two foreign samples housed in the database (Fig. 2, Fig. 3), containing the piano 5C and the clarinet 3A. The polyphonic input to the system consists of the four variations of the mix of piano 5C and clarinet 3A. For the purpose of this discussion it is also assumed that the clarinet 3A is the dominant feature of all four variations of the mix. The system reads the input and uses an FFT to transform it into the frequency domain.

Fig. 3. 3A B-flat Clarinet @ 44,100 Hz, 16 bit, stereo

Fig. 4. Piano and Clarinet @ 44,100 Hz, 16 bit, stereo - No Noise

Fig. 5. Piano and Clarinet @ 44,100 Hz, 16 bit, stereo - Noise at -17.8 dB (-12.88%)

In the frequency domain it determines that the fundamental frequency of 3A with a woodwind-like timbre is dominant (Fig. 8). The system searches the database and first extracts all 3A pitches of each instrument. Next it separates all woodwind-like sounds into the 3A temporary cache. At this point it uses the MPEG-7-descriptor-based classifier to find the 3A clarinet closest to the one identified. It then extracts the wave of the 3A clarinet and performs an FFT on this foreign sound entity. It subtracts the resultant of the foreign entity's FFT from the input entity's FFT, leaving an FFT that, when subjected to an IFFT, pro-
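The find-the-dominant-fundamental and FFT-subtraction steps described above can be sketched in miniature. This is not the authors' implementation: the two MUMS recordings are replaced by pure tones (piano 5C near 523.25 Hz, clarinet 3A at 220 Hz), so the database ("foreign") sample matches the component in the mix exactly and the removal is perfect; with real recordings the database sample only approximates the mixed component and a residual error remains.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr        # one second of audio

# Stand-ins for the two MUMS samples: piano 5C (C5 ~ 523.25 Hz) and the
# dominant clarinet 3A (A3 = 220 Hz), reduced here to pure tones.
piano_5c    = 0.6 * np.sin(2 * np.pi * 523.25 * t)
clarinet_3a = 1.0 * np.sin(2 * np.pi * 220.0 * t)
mix = piano_5c + clarinet_3a  # the polyphonic input

# Step 1: FFT the input and locate the dominant fundamental.
spec  = np.fft.rfft(mix)
freqs = np.fft.rfftfreq(len(mix), d=1 / sr)
f0 = freqs[np.argmax(np.abs(spec))]   # ~220 Hz: the clarinet dominates

# Step 2: subtract the FFT of the matched database ("foreign") sample from
# the input's FFT and apply an IFFT to the difference; the residual signal
# no longer bears the previously extracted dominant sound.
foreign  = np.fft.rfft(clarinet_3a)
residual = np.fft.irfft(spec - foreign, n=len(mix))
```

In the described system the residual would then be re-analyzed in the same way until all recognizable dominant instruments in the segment are accounted for.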