Objective:
Develop new techniques that allow the harmonic speech coder to achieve
toll quality at around 4 kb/s.
Problem Statement and Proposed Solutions:
(1) Speech Model:
Sinusoidal-model-based harmonic coders are well suited to representing
the quasi-periodic signals typical of voiced speech and the noise-like signals
typical of unvoiced speech. However, this model is ineffective for representing
transition regions such as voicing onsets/offsets, plosives, and non-periodic
pulses. To improve model accuracy for these segments, a frequency-domain
model was proposed to represent transition speech. This model preserves
the temporal information that is important to the perception of transition
speech. The model is also amenable to a closed-loop analysis-by-synthesis
procedure for parameter estimation. To further improve the harmonic
representation of voiced speech, we also proposed a new voicing model that
includes the pitch jitter of natural speech.
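As a rough illustration of the harmonic (sinusoidal) representation of a voiced frame discussed above, the following sketch sums sinusoids at multiples of the pitch frequency. It is a generic textbook form, not the coder described here: the function names, the 8 kHz sampling rate, and the fixed per-frame parameters are all assumptions, and it does not attempt the transition model or the jitter-aware voicing model, whose details are not given in this summary.

```python
import numpy as np

def num_harmonics(f0_hz, fs_hz=8000.0):
    """Number of pitch harmonics up to half the sampling rate (assumed 8 kHz)."""
    return int(np.floor((fs_hz / 2.0) / f0_hz))

def synthesize_harmonic_frame(f0_hz, magnitudes, phases, num_samples, fs_hz=8000.0):
    """Sum of harmonically related sinusoids for one frame (hypothetical helper)."""
    n = np.arange(num_samples)
    w0 = 2.0 * np.pi * f0_hz / fs_hz           # fundamental in radians/sample
    k = np.arange(1, len(magnitudes) + 1)       # harmonic indices 1..K
    # Row k-1 holds the k-th harmonic; summing over rows gives the frame.
    harmonics = magnitudes[:, None] * np.cos(np.outer(k, n) * w0 + phases[:, None])
    return harmonics.sum(axis=0)

# Example: a 20 ms frame at 8 kHz with three unit-magnitude, zero-phase harmonics.
frame = synthesize_harmonic_frame(200.0, np.ones(3), np.zeros(3), 160)
```

Note that the number of harmonics depends on the pitch, which is why the magnitude vector changes length from frame to frame.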
(2) Model Parameter Estimation:
Because harmonic coders are parametric coders, estimation errors in the
model parameters can significantly degrade speech quality. One way to
improve the accuracy and robustness of parameter estimation is to use
closed-loop analysis-by-synthesis techniques. However, when low-rate
harmonic coders are used to synthesize speech, no phase information is
transmitted, which results in a loss of time alignment between the original
speech and the synthesized speech. This loss of time alignment makes it
difficult for the coder to perform waveform matching, and interferes with
time-domain closed-loop parameter estimation. We proposed a generalized
time-domain analysis-by-synthesis parameter estimation scheme in the
harmonic coding framework. This scheme uses a time-scale signal modification
technique to allow for waveform matching in harmonic coding. This concept
is demonstrated in our Analysis-by-Synthesis Multimode Harmonic Coder
(AbS-MHC) with a specific method for efficient closed-loop pitch estimation
and speech classification.
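To make the alignment problem concrete, here is a minimal sketch of closed-loop pitch selection when the synthesized waveform carries no transmitted phase. It is a deliberate simplification: instead of the time-scale signal modification technique mentioned above, it only searches over integer time shifts before scoring each pitch candidate by normalized correlation. All function names, the zero-phase synthesis, and the 8 kHz sampling rate are assumptions made for illustration.

```python
import numpy as np

FS_HZ = 8000.0  # assumed narrowband sampling rate

def zero_phase_harmonics(f0_hz, num_harm, num_samples):
    """Synthesize harmonics with zero phase, as if no phase were transmitted."""
    n = np.arange(num_samples)
    k = np.arange(1, num_harm + 1)[:, None]
    return np.cos(2.0 * np.pi * f0_hz / FS_HZ * k * n).sum(axis=0)

def align_and_score(original, synthesized, max_shift):
    """Return the integer shift (and correlation) that best aligns the signals."""
    n = len(original)
    core = original[max_shift:n - max_shift]
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        cand = synthesized[max_shift + shift:n - max_shift + shift]
        denom = np.linalg.norm(core) * np.linalg.norm(cand)
        corr = float(core @ cand) / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr

def closed_loop_pitch(original, candidates_hz, num_harm, max_shift):
    """Pick the pitch candidate whose aligned synthesis best matches the original."""
    best = None
    for f0 in candidates_hz:
        synth = zero_phase_harmonics(f0, num_harm, len(original))
        shift, corr = align_and_score(original, synth, max_shift)
        if best is None or corr > best[2]:
            best = (f0, shift, corr)
    return best

# Example: a 100 Hz test signal delayed by 13 samples, three pitch candidates.
n = np.arange(320)
k = np.arange(1, 6)[:, None]
original = np.cos(2.0 * np.pi * 100.0 / FS_HZ * k * (n - 13)).sum(axis=0)
f0, shift, corr = closed_loop_pitch(original, [90.0, 100.0, 110.0], 5, 40)
```

Without the shift search, even the correct pitch candidate would score poorly against the delayed original, which is the waveform-matching difficulty the paragraph describes.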
(3) Parameter Quantization:
The parameters that must be quantized in harmonic coders are generally the
LPC parameters (LSFs), pitch, voicing information, harmonic spectral
magnitudes, and gains. Efficient quantization of LSFs is no longer a problem:
our experiments obtain a spectral distortion (SD) of less than 1 dB using
22-24 bits/20 ms for narrowband speech. In fact, among all the parameters
in a harmonic coder, quantization of the harmonic spectral magnitudes is
the most challenging task. Since the spectral magnitude vector in a harmonic
coder is obtained by sampling the speech magnitude spectrum or the LP residual
magnitude spectrum at multiples of the pitch frequency, the dimension of
the spectral magnitude vector varies as pitch varies from frame to frame.
Standard fixed-dimension VQ techniques are therefore difficult to apply
directly to the quantization of spectral magnitude vectors. We studied a
quantization technique called Weighted Non-Square Transform Vector
Quantization (WNSTVQ), which addresses the problems associated with
variable-dimension vector quantization by combining a fixed-dimension vector
quantizer with a variable-sized non-square transform. We show that WNSTVQ
is the generalized form of all linear dimension conversion methods. We find
that the choice of transform, the choice of the fixed dimension, and the
number of codebooks used determine the tradeoffs between complexity, memory
requirement, and performance for several WNSTVQ systems.
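The idea of pairing a fixed-dimension codebook with a variable-sized non-square transform can be sketched as follows. This is one plausible construction, not necessarily the one used in our WNSTVQ systems: the non-square matrix is assumed to be an orthonormal DCT-II truncated or zero-padded to the fixed dimension, and the weighted search is assumed to be carried out in the original variable-dimension domain. The names `dct2_matrix`, `nonsquare_dct`, and `wnstvq_search` are hypothetical.

```python
import numpy as np

def dct2_matrix(size):
    """Orthonormal DCT-II matrix of shape (size, size)."""
    n = np.arange(size)
    k = n[:, None]
    mat = np.sqrt(2.0 / size) * np.cos(np.pi * (2 * n + 1) * k / (2 * size))
    mat[0, :] /= np.sqrt(2.0)
    return mat

def nonsquare_dct(var_dim, fixed_dim):
    """(fixed_dim x var_dim) transform: DCT-II rows, truncated or zero-padded."""
    full = dct2_matrix(var_dim)
    out = np.zeros((fixed_dim, var_dim))
    rows = min(fixed_dim, var_dim)
    out[:rows, :] = full[:rows, :]
    return out

def wnstvq_search(x, weights, codebook):
    """Index of the fixed-dimension codeword with least weighted error on x."""
    a = nonsquare_dct(len(x), codebook.shape[1])
    recon = codebook @ a                  # row i is A^T c_i, back in dimension len(x)
    err = ((x - recon) ** 2 * weights).sum(axis=1)
    return int(np.argmin(err))

# Example: one 16-dimensional codebook serves vectors of dimension 10 and 30.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((4, 16))
x_short = codebook[2] @ nonsquare_dct(10, 16)   # 10 harmonics (high pitch)
x_long = codebook[1] @ nonsquare_dct(30, 16)    # 30 harmonics (low pitch)
idx_short = wnstvq_search(x_short, np.ones(10), codebook)
idx_long = wnstvq_search(x_long, np.ones(30), codebook)
```

In this sketch the same codebook is reused for every pitch; only the transform matrix changes size with the number of harmonics.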
Demos: (.wav files)
Female 1: sentence1, sentence2, sentence3
Male 1: sentence1, sentence2, sentence3
Female 2: sentence1, sentence2, sentence3
Male 2: sentence1, sentence2, sentence3
Note:
Sentence1: Original speech sentence (modified IRS filtered speech sentence)
Sentence2: AbS-MHC coder output without quantization. Features of the AbS-MHC coder include: an enhanced frequency-domain transition model used in conjunction with sinusoidal-model-based harmonic coding of voiced/unvoiced speech; a new voicing model for voiced speech; closed-loop pitch estimation using combined time-domain and frequency-domain pitch candidates; and closed-loop pitch estimation/classification using a time-scale signal modification technique.
Sentence3: AbS-MHC coder output with quantized LSFs (24 bits/20 ms), quantized harmonic spectral magnitudes (14 bits/10 ms, DCT-II transform based WNSTVQ quantization scheme), and quantized gain (6 bits/10 ms).
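For reference, the bit allocations quoted for Sentence3 can be converted to bit rates with a few lines of arithmetic. The helper name `bits_per_second` is hypothetical, and the sketch covers only the three parameters whose allocations are stated here; pitch and voicing bits are not listed, so the total below is a partial rate, not the coder's full rate.

```python
def bits_per_second(bits, frame_ms):
    """Convert a per-frame bit allocation into a bit rate in b/s."""
    return bits * 1000.0 / frame_ms

# (bits, update interval in ms), as quoted for Sentence3.
allocation = {
    "LSFs": (24, 20),
    "harmonic magnitudes": (14, 10),
    "gain": (6, 10),
}

rates = {name: bits_per_second(b, ms) for name, (b, ms) in allocation.items()}
total = sum(rates.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} b/s")
print(f"listed parameters total: {total:.0f} b/s")
```

The listed parameters account for 1200 + 1400 + 600 = 3200 b/s. If the 4 kb/s objective is taken as the total budget, that would leave roughly 800 b/s for the remaining parameters, though how those bits are actually spent is not stated here.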