Sunday, August 23, 2009

Silence Models in Weighted Finite-State Transducers

Abstract
We investigate the effects of different silence modelling
strategies in Weighted Finite-State Transducers for Automatic
Speech Recognition. We show that the choice of silence models,
and the way they are included in the transducer, can have
a significant effect on the size of the resulting transducer; we
present a means to prevent particularly large silence overheads.
Our conclusions include that context-free silence modelling fits
well with transducer-based grammars, whereas modelling silence
as a monophone and as a context incurs larger overheads.
Index Terms: speech recognition, weighted finite-state transducer,
silence model
Introduction
A recent trend in Automatic Speech Recognition (ASR) research
has been to use decoders with precompiled grammars.
Such grammars are generated using the Weighted Finite-State
Transducer (WFST) methodology of Mohri et al. [1]. The advantage
over traditional decoders is that various optimisations
such as language model lookahead, prefixing and suffixing are
subsumed into generic WFST operations such as composition
and determinisation. This in turn can vastly reduce the complexity
required in the decoder itself. The composition process
typically deals with four transducers: a grammar, G; a lexicon,
L; a context dependency graph, C; and a Hidden Markov Model
(HMM), H. The four transducers are composed into a single
transducer in an operation that is normally written H . C . L . G.
The decoder then (typically) only has to maximise the likelihood
of a path through the combined transducer given acoustic
observation likelihoods.
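As a rough sketch of this cascade (an illustration using the OpenFst toolkit, not the exact recipe behind the experiments reported here), the composition and determinisation steps might look as follows; the file names are placeholders, and a real lexicon needs auxiliary disambiguation symbols for the determinisation of L . G to terminate.

    // Sketch of the composition cascade with OpenFst.  File names are
    // placeholders and the optimisation sequence is an assumption.
    #include <fst/fstlib.h>
    #include <memory>

    int main() {
      using namespace fst;

      // Component transducers: context dependency, lexicon and grammar.
      std::unique_ptr<StdVectorFst> C(StdVectorFst::Read("C.fst"));
      std::unique_ptr<StdVectorFst> L(StdVectorFst::Read("L.fst"));
      std::unique_ptr<StdVectorFst> G(StdVectorFst::Read("G.fst"));
      if (!C || !L || !G) return 1;

      // Sort arcs so that composition can match the output labels of L
      // against the input labels of G efficiently.
      ArcSort(L.get(), OLabelCompare<StdArc>());
      ArcSort(G.get(), ILabelCompare<StdArc>());

      // L . G, then determinise the result.
      StdVectorFst LG, detLG;
      Compose(*L, *G, &LG);
      Determinize(LG, &detLG);

      // C . det(L . G): a word-to-model network of the kind a C . L . G
      // decoder uses.  Further label encoding, determinisation and
      // minimisation are usually applied at this point.
      ArcSort(&detLG, ILabelCompare<StdArc>());
      StdVectorFst CLG;
      Compose(*C, detLG, &CLG);
      CLG.Write("CLG.fst");
      return 0;
    }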
Juicer [2] is a WFST decoder developed at IDIAP. It is designed
to handle both HMMs typically produced by HTK [3],
and HMM/MLP hybrids where the observations are posterior
probabilities. Juicer uses a type of WFST denoted C . L . G;
that is, the HMM graph is handled in the decoder, and the WFST
transduces from words to models (rather than to states or PDFs,
as would an H . C . L . G transducer). This type of transducer
is described by Mohri et al. [4], where the authors state that
the final transducer should have around 2.1 times as many arcs
as G for a bigram grammar, and around 2.5 times as many for a trigram.
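Ratios like these are easy to check on a particular build: OpenFst's fstinfo tool reports arc counts directly, or they can be computed with a few lines of code such as the illustrative helper below (our own code, with placeholder file names).

    // Count arcs in the final network and in G, and report their ratio.
    // Illustrative helper only; file names are placeholders.
    #include <fst/fstlib.h>
    #include <cstddef>
    #include <cstdio>
    #include <memory>

    using fst::Fst;
    using fst::StateIterator;
    using fst::StdArc;

    static std::size_t CountArcs(const Fst<StdArc> &f) {
      std::size_t arcs = 0;
      for (StateIterator<Fst<StdArc>> siter(f); !siter.Done(); siter.Next())
        arcs += f.NumArcs(siter.Value());
      return arcs;
    }

    int main() {
      std::unique_ptr<Fst<StdArc>> clg(Fst<StdArc>::Read("CLG.fst"));
      std::unique_ptr<Fst<StdArc>> g(Fst<StdArc>::Read("G.fst"));
      if (!clg || !g) return 1;

      const std::size_t clg_arcs = CountArcs(*clg);
      const std::size_t g_arcs = CountArcs(*g);
      std::printf("CLG arcs: %zu, G arcs: %zu, ratio: %.2f\n",
                  clg_arcs, g_arcs,
                  static_cast<double>(clg_arcs) / static_cast<double>(g_arcs));
      return 0;
    }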
This paper is motivated by our work on using Juicer in
the AMI (Augmented Multi-party Interaction) system [5]. The
AMI language model is typically a 50,000 word trigram, pruned
to fit speed and memory constraints. In building even heavily
pruned language model WFSTs, however, we were finding
that the process was using several gigabytes of core memory
and producing WFSTs significantly larger than predicted in [4].
Although some of the difficulties could be alleviated by careful
tuning of the composition process, one significant problem
turned out to be related to silence modelling. Our investigation
followed an initial observation that removal of the silence models
resulted in almost a 50% reduction in the size of the final
transducer.
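For concreteness, one common way silence enters such a network (not necessarily the exact configuration investigated here) is as an optional path appended to every pronunciation in L, so that every word in the lexicon acquires the extra arcs and these are then multiplied out through composition with C and G. The toy OpenFst fragment below, with made-up symbol ids and unit weights, shows what one such lexicon entry looks like.

    // Toy lexicon entry with an optional word-end silence path, one
    // common way of introducing silence into L.  Symbol ids, topology
    // and (unit) weights are made up for illustration.
    #include <fst/fstlib.h>

    fst::StdVectorFst ToyLexiconEntry() {
      using namespace fst;
      const int kPhone1 = 1, kPhone2 = 2, kSil = 3;  // input (phone) ids
      const int kWord = 10;                          // output (word) id
      const int kEps = 0;                            // epsilon label

      StdVectorFst L;
      const auto s0 = L.AddState();
      const auto s1 = L.AddState();
      const auto s2 = L.AddState();
      const auto s3 = L.AddState();
      L.SetStart(s0);

      // Pronunciation "phone1 phone2", emitting the word on the first arc.
      L.AddArc(s0, StdArc(kPhone1, kWord, TropicalWeight::One(), s1));
      L.AddArc(s1, StdArc(kPhone2, kEps, TropicalWeight::One(), s2));

      // Optional silence: either skip to the final state via an epsilon
      // arc, or take the silence arc.  In practice the two branches
      // would carry silence/non-silence probabilities rather than One().
      L.AddArc(s2, StdArc(kEps, kEps, TropicalWeight::One(), s3));
      L.AddArc(s2, StdArc(kSil, kEps, TropicalWeight::One(), s3));
      L.SetFinal(s3, TropicalWeight::One());
      return L;
    }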
In this paper, we discuss silence modelling in general and
in the context of WFSTs. We show how different silence modelling
strategies affect the size of the resulting WFSTs, and discuss
implications for the decoder, and for the ASR system in
general.
