Perceptual Harmonic Cepstral Coefficients as the Front-End for Speech Recognition

Signal Compression Laboratory Research Project


Researcher: Liang Gu
Faculty: Prof. Kenneth Rose
Research Focus: Mel-frequency cepstral coefficients (MFCC) are the most commonly used front-end analysis technique for automatic speech recognition. Despite their generally superior performance, two problems plague conventional MFCC techniques. The first concerns the vocal tract transfer function, whose accurate description (unlike that of the excitation details) is crucial to good recognition. In the MFCC approach, the spectral envelope is computed from energy averaged over each mel-scaled filter. This may not work well for voiced sounds with quasi-periodic features: the formant frequencies tend to be biased toward pitch harmonics, and the formant bandwidths may be misestimated. Experiments show that this mismatch substantially increases the spectral variance within an utterance, compared to a harmonics-based spectrum. The second problem is the lack of perceptual spectrum estimation in standard MFCC, which puts it at a disadvantage relative to perceptual linear predictive (PLP) analysis. The log power representation of MFCC is nevertheless attractive because of its gain-invariance properties: a change in overall gain shifts every log filter energy by the same constant, which the cepstrum confines to the zeroth coefficient.
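For reference, the standard MFCC chain critiqued above can be sketched in a few lines: per-filter energy averaging of the short-time power spectrum, a log, and a DCT. This is a minimal illustrative sketch, not the project's actual code; the sample rate, FFT size, and filter count are assumed parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-scaled filters over the FFT bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, sr=8000, n_fft=256, n_filters=20, n_ceps=12):
    # Power spectrum of a windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Energy averaged over each mel filter -- the smoothing criticized above.
    energies = mel_filterbank(n_filters, n_fft, sr) @ spec
    # Log makes a gain change additive; the DCT pushes it into coefficient 0.
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)  # DCT-II basis
    return dct @ log_e
```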

A new approach, inspired by ideas borrowed from speech coding, is proposed to overcome these shortcomings. Rather than averaging the energy within each filter, which yields a smoothed spectrum as in MFCC, we derive harmonic cepstral coefficients (HCC) for voiced speech from the spectral envelope sampled at the pitch harmonic locations. Extracting HCCs requires accurate and robust pitch estimation. We adopted a spectro-temporal autocorrelation (STA) method previously developed for sinusoidal speech coders (sketched below). STA pitch estimation is based on a weighted sum of temporal and spectral autocorrelation values, and efficiently reduces multiple and sub-multiple pitch errors. The computed (weighted) correlation is also useful for voiced-unvoiced-transitional speech detection. For voiced speech, the harmonic locations are predicted from the estimated pitch, and a peak-picking algorithm finds the actual harmonic peaks near the predicted positions. For transitional speech, a fixed pitch is used in the peak-picking process. The resulting harmonic spectrum is passed through mel-scaled band-pass filters and transformed into cepstra by the discrete cosine transform.

The HCC representation is further improved by applying cubic-root amplitude compression within each filter, to obtain the "perceptual" HCC (PHCC) representation. By the psychophysical intensity-loudness power law, this reduces the spectral amplitude variation within each filter without degrading the desired gain-invariance properties, since the filter energy levels are still represented on a logarithmic scale.

We tested PHCC against standard MFCC on the OGI E-set database and on a speaker-independent isolated Mandarin digit database. On E-set, with 7-state continuous HMMs, the test-set recognition rate increased from 84.7% (MFCC) to 87.8% (PHCC), a 20% error rate reduction. With 21-state tied-mixture HMMs (TMHMM), accuracy improved from 92.7% to 93.8%, a 15% error rate reduction. On the Mandarin digit database, the error rate with 9-state TMHMMs decreased from 2.1% to 1.1%, a considerable 48% error rate reduction.
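To make the pitch stage concrete, here is a minimal sketch of an STA-style estimator. The fixed combining weight w, the frame length, the FFT size, the pitch search range, and the use of the peak score as a voicing confidence are all assumptions of this sketch; the actual method's weighting and decision logic may differ.

```python
import numpy as np

def sta_pitch(frame, sr=8000, f0_min=60.0, f0_max=400.0, w=0.5):
    """Spectro-temporal autocorrelation (STA) pitch sketch: score each
    candidate lag by a weighted sum of normalized temporal and spectral
    autocorrelation. The frame should be longer than sr/f0_min samples."""
    frame = frame - frame.mean()
    n_fft = 1024
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    best_lag, best_score = None, -np.inf
    for lag in range(int(sr / f0_max), int(sr / f0_min) + 1):
        # Temporal autocorrelation at this lag, normalized to [-1, 1].
        a, b = frame[:-lag], frame[lag:]
        t_corr = (a @ b) / (np.sqrt((a @ a) * (b @ b)) + 1e-10)
        # Spectral autocorrelation: shift the magnitude spectrum by one
        # harmonic spacing (n_fft / lag bins) and correlate with itself.
        shift = int(round(n_fft / lag))
        s_a, s_b = spec[:-shift], spec[shift:]
        s_corr = (s_a @ s_b) / (np.sqrt((s_a @ s_a) * (s_b @ s_b)) + 1e-10)
        score = w * t_corr + (1.0 - w) * s_corr  # weighted STA score
        if score > best_score:
            best_lag, best_score = lag, score
    # best_score can be thresholded for voiced/unvoiced/transitional detection.
    return sr / best_lag, best_score
```

The design rationale, per the description above, is that the multiple and sub-multiple errors of the temporal and spectral terms tend not to coincide, so the weighted sum suppresses both, and its peak value doubles as a voicing measure.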
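And a hedged sketch of the voiced-frame PHCC chain itself: peak-picking near the predicted harmonic positions, a harmonics-only spectrum through the mel filters, cubic-root compression within each filter, then log energies and a DCT. The ±3-bin search window, the exact point where the compression is applied, and the omission of the unvoiced/transitional branches are assumptions of this sketch; it reuses mel_filterbank from the MFCC sketch above.

```python
import numpy as np

def phcc(frame, pitch_hz, sr=8000, n_fft=1024, n_filters=20, n_ceps=12):
    """PHCC sketch for a voiced frame, given an STA pitch estimate."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    bin_hz = sr / n_fft
    # Peak-pick the actual harmonic near each predicted pitch multiple.
    harmonics, k = [], 1
    while k * pitch_hz < sr / 2:
        center = int(round(k * pitch_hz / bin_hz))
        lo, hi = max(center - 3, 1), min(center + 4, len(spec))
        harmonics.append(lo + int(np.argmax(spec[lo:hi])))
        k += 1
    # Spectrum sampled only at the picked harmonic locations.
    harm_spec = np.zeros_like(spec)
    harm_spec[harmonics] = spec[harmonics]
    fb = mel_filterbank(n_filters, n_fft, sr)  # from the MFCC sketch above
    # Cubic root of power (|X|^2)^(1/3) = |X|^(2/3), per the
    # intensity-loudness power law; the paper's exact within-filter
    # ordering may differ from this placement.
    compressed = fb @ (harm_spec ** (2.0 / 3.0))
    log_e = np.log(compressed + 1e-10)  # log keeps the gain-invariance
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)
    return dct @ log_e

# Illustrative routing (the voicing threshold 0.5 is an assumption):
# f0, voicing = sta_pitch(frame)
# ceps = phcc(frame, f0) if voicing > 0.5 else mfcc(frame)
```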

Presentation:

Perceptual Harmonic Cepstral Coefficients as the Front-End for Speech Recognition