Sunday, August 23, 2009

Silence modelling in transducer

2.1. General practice
Silence models are necessary in ASR to accommodate periods
of silence at the beginning and end of utterances, and between
words. Typically, any sound not included in the phone set of the
decoder is included in the definition of silence; it might better
be termed non-speech or noise. There may also be an element
of garbage modelling in a silence model. The key, however,
is the way the model is used. Silence is (trivially) placed at
the beginning and end of the grammar, and can be included in
G. Silence is also placed between words; the most convenient
way to enable this is to duplicate lexicon entries such that each
pronunciation has one unmodified phonetic spelling, and one
either beginning or ending in silence.
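
As a minimal sketch of this duplication, assume a lexicon stored as
(word, phone list) pairs and a silence phone named sil. The data layout,
the function name and the toy pronunciations are illustrative assumptions,
not taken from any particular toolkit.

```python
def add_silence_variants(lexicon, silence="sil"):
    """Duplicate every pronunciation: one copy unmodified, one ending in silence."""
    expanded = []
    for word, phones in lexicon:
        expanded.append((word, list(phones)))              # unmodified spelling
        expanded.append((word, list(phones) + [silence]))  # spelling ending in silence
    return expanded


# Hypothetical toy lexicon; the pronunciations are for illustration only.
lexicon = [("NO", ["n", "ow"]), ("YES", ["y", "eh", "s"])]
expanded = add_silence_variants(lexicon)

for word, phones in expanded:
    print(word, " ".join(phones))
```

The sketch appends silence at the end of each word; placing it at the
beginning instead would be the mirror-image choice.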
The HTK system, as described by Young et al. [3], advocates
the use of two silence models:
1. A silence model, sil, with the same structure as the
other phonetic models, and contextually a monophone;
i.e., silence acts as a context, but is itself context independent.
2. A short pause model, sp, that is essentially tied to the
silence model, but is context free, and has a ‘skip’ transition
that optionally omits any emitting states.
The silence model is used at the beginning and end of an utterance,
or when prescribed in the grammar by a specific token.
The short pause model is used in the lexicon at the end of every
word to allow optional silence states that do not break context.
The use of the short pause model at the end of each word
was revised by Hain et al. [6] to advocate the use of both short
pause and silence at the word ends. This gives the option of
either breaking context or not between words at decode time,
and leads to a small improvement in recognition accuracy. In
fact, Hain et al. distinguish the case where neither silence nor
short pause is used, although this case is already covered by the
skip transition of the short pause model.
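
The difference between the two word-end choices can be sketched by
expanding a phone string into HTK-style triphone labels (l-c+r). The
function below is a hypothetical illustration, not code from HTK or from
Hain et al. It assumes that sp is transparent to context and that sil is
a monophone that still serves as context for its neighbours, as described
above.

```python
def to_triphones(phones, context_free=("sp",), context_independent=("sil",)):
    """Expand a phone sequence into l-c+r labels.

    Context-free phones (sp) are skipped when looking for neighbours;
    context-independent phones (sil) keep their own name but act as context.
    """
    labels = []
    for i, phone in enumerate(phones):
        if phone in context_free or phone in context_independent:
            labels.append(phone)
            continue
        # Nearest neighbours, ignoring anything that is context free.
        left = next((q for q in reversed(phones[:i]) if q not in context_free), None)
        right = next((q for q in phones[i + 1:] if q not in context_free), None)
        label = phone
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        labels.append(label)
    return labels


# Toy pronunciations of "NO YES"; the phone strings are assumptions.
with_sp = to_triphones(["n", "ow", "sp", "y", "eh", "s"])
with_sil = to_triphones(["n", "ow", "sil", "y", "eh", "s"])
print(with_sp)
print(with_sil)
```

With sp the context runs across the word boundary (n-ow+y), whereas with
sil the boundary phones see silence as their context (n-ow+sil).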
2.2. Silence in WFSTs
Silence in WFSTs is discussed briefly by Allauzen et al. [7].
In that paper, silence is represented as a loop that can be placed
at the end of each word in the lexicon. Further, it is stressed
that the loop must be weighted to allow weight pushing in the
composition process. The silence class transducer, figure 4 in
[7], bears a close similarity to the short pause phone of the HTK
system. Both allow zero or more instances of a silence state
after each word, where the transition probabilities are learned
from data.
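
To make the role of the loop weight concrete, the sketch below assumes a
single silence state with self-loop probability p_loop and exit
probability 1 - p_loop, and expresses both as negative log costs as is
usual in the tropical semiring. This parameterisation and the value 0.3
are assumptions for illustration; neither is taken from [7] or from HTK.

```python
import math


def silence_loop_costs(p_loop):
    """Return (loop_cost, exit_cost) as negative log probabilities."""
    return -math.log(p_loop), -math.log(1.0 - p_loop)


def path_cost(k, p_loop):
    """Cost of taking the silence loop k times and then leaving it."""
    loop_cost, exit_cost = silence_loop_costs(p_loop)
    return k * loop_cost + exit_cost


p_loop = 0.3  # hypothetical value; in practice it would be learned from data
total = sum(math.exp(-path_cost(k, p_loop)) for k in range(200))
print(total)
```

Because the loop and exit probabilities sum to one, the probabilities of
zero, one, two, ... silence instances form a geometric distribution, which
is what keeps the total path mass finite when weights are pushed.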
It follows that including silence in the grammar is often not
feasible as the grammar does not generally contain probability
information for silence. However, including a short pause
model after each word in the lexicon both implements the HTK
short pause model as described and has the same effect as the
AT&T silence transducer. A silence model can be trivially included
after each word in the lexicon in the same manner as the
short pause model.
2.3. Juicer
For the common case of a three emitting state model, an HTK
HMM actually has five states with the first and last being non-emitting.
The short pause model is normally implemented with
a ‘skip’ transition from the first state to the last, allowing it to be
skipped completely. Juicer was designed to be compatible with
HTK style HMMs. However, for simplicity in the decoder, skip
transitions were not considered. Rather, a skip transition could
easily be included in the WFST at the lexicon level.
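
One possible way to move a skip out of the HMM and into the lexicon is
sketched below. It is an assumption about how this could be done, not a
description of Juicer's code. The transition matrix is a made-up short
pause model with one emitting state between non-emitting entry and exit
states. The skip probability is removed from the HMM and reappears as the
cost of an epsilon arc that sits in parallel with the sp arc.

```python
import math


def split_skip(trans):
    """Remove the entry-to-exit skip from a transition matrix.

    Returns (p_skip, matrix_without_skip); the first row is renormalised
    so that the remaining entry transitions still sum to one.
    """
    p_skip = trans[0][-1]
    stripped = [row[:] for row in trans]  # copy, leave the input untouched
    stripped[0][-1] = 0.0
    remaining = sum(stripped[0])
    if remaining > 0:
        stripped[0] = [p / remaining for p in stripped[0]]
    return p_skip, stripped


# Hypothetical sp model: entry state, one emitting state, exit state.
sp_trans = [
    [0.0, 0.7, 0.3],  # entry: 0.7 into the emitting state, 0.3 skip
    [0.0, 0.6, 0.4],  # emitting state: self-loop or leave
    [0.0, 0.0, 0.0],  # exit
]

p_skip, sp_no_skip = split_skip(sp_trans)
skip_cost = -math.log(p_skip)      # cost of the epsilon (no pause) arc in L
sp_cost = -math.log(1.0 - p_skip)  # cost of the arc that enters sp in L
print(skip_cost, sp_cost)
```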
The AMI system uses the double silence method of Hain
et al. [6]. This was implemented in Juicer as shown in figure
1. The example is for just the word ‘NO’ in the lexicon WFST
L. Notice that the other symbols are standard in the WFST
literature: the symbol ε refers to an epsilon transition
and #1 is an auxiliary symbol to distinguish otherwise identical
pronunciations.
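
Since the figure itself is not reproduced here, the following is only a
guess at what a lexicon entry of this kind could look like, written in
the AT&T/OpenFst text format (source state, destination state, input
label, output label). The pronunciation "n ow", the state numbering and
the three parallel word-end arcs (no pause, sp, sil) are assumptions, and
auxiliary symbols such as #1 are left out because only one word is shown.

```python
def word_to_arcs(word, phones, endings=("<eps>", "sp", "sil")):
    """Return text-format arcs for one word with alternative word endings."""
    lines = []
    state = 0
    for i, phone in enumerate(phones):
        # Emit the word label on the first phone, epsilon afterwards.
        olabel = word if i == 0 else "<eps>"
        lines.append(f"{state} {state + 1} {phone} {olabel}")
        state += 1
    final = state + 1
    for ending in endings:
        # Parallel arcs: skip silence, short pause, or full silence.
        lines.append(f"{state} {final} {ending} <eps>")
    lines.append(str(final))  # final state line
    return lines


arcs = word_to_arcs("NO", ["n", "ow"])
print("\n".join(arcs))
```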
