Towards spoken-document retrieval for the enterprise: Approximate word-lattice indexing with text indexers
Seide, F.
Peng Yu
Yu Shi
Microsoft Research Asia, Beijing
This paper appears in: IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007
Publication Date: 9-13 Dec. 2007
On page(s): 629-634
Location: Kyoto
ISBN: 978-1-4244-1746-9
INSPEC Accession Number: 9832827
Digital Object Identifier: 10.1109/ASRU.2007.4430185
Current Version Published: 2008-01-14
Abstract
Enterprise-scale search engines are generally designed for linear text. Linear text is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. We propose two methods to enable text indexers to approximately index lattices with little or no code change: "TMI" (Time-based Merging for Indexing) aims at lattice-index size reduction, and the "sausage"-like "TALE" (Time-Anchored Lattice Expansion) approximation requires no indexer-code or data-format changes at all. On four enterprise-type data sets (meetings, phone calls, lectures, and voicemail), TMI and TALE improve accuracy by 30-60% for multi-word phrase searches and by 130% for two-term AND queries, compared to indexing linear text.
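
To illustrate why indexing alternate recognition candidates can help phrase and AND queries, here is a minimal sketch. It is not the TMI or TALE algorithm from the paper, whose details the abstract does not give. It assumes a simplified "sausage" in which each word position holds a list of candidate words (best candidate first), and a toy positional index in which several words may share one position. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(slots):
    """Build a toy positional index: word -> set of positions.

    `slots` is a list of candidate-word lists, one list per position.
    Several words may share a position (an assumption of this sketch,
    loosely mimicking a sausage-like lattice approximation).
    """
    index = defaultdict(set)
    for pos, candidates in enumerate(slots):
        for word in candidates:
            index[word].add(pos)
    return index


def phrase_match(index, phrase):
    """Return True if the words of `phrase` occur at consecutive positions."""
    words = phrase.split()
    if not words:
        return False
    return any(
        all(start + i in index.get(w, set()) for i, w in enumerate(words))
        for start in index.get(words[0], set())
    )


def and_match(index, terms):
    """Return True if every term occurs somewhere in the document."""
    return all(term in index for term in terms)


# Hypothetical recognizer output: the top candidate for the first word is wrong.
sausage = [["peach", "speech"], ["recognition"], ["works"]]
one_best = [[slot[0]] for slot in sausage]  # linear text: best candidate only

linear_index = build_index(one_best)
lattice_index = build_index(sausage)

print(phrase_match(linear_index, "speech recognition"))
print(phrase_match(lattice_index, "speech recognition"))
print(and_match(linear_index, ["speech", "works"]))
print(and_match(lattice_index, ["speech", "works"]))
```

In this toy example the linear (1-best) index misses both the phrase query and the AND query because the recognizer's top choice was "peach", while the index built from all candidates finds both. A real system would also have to weigh candidates by their posterior probabilities and control index size, which this sketch ignores.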