Sphinx 4 Architecture
Published by admin July 5th, 2006 in Notes
(Diagram taken from Sphinx-4: A Flexible Open Source Framework for Speech Recognition)
NOTE: This post is not complete.
The beauty of the Sphinx 4 architecture is its modularity and pluggability. Previously, speech recognition programs were built to fulfill specific roles: continuous vs. non-continuous speech, large vs. small vocabulary, etc. With Sphinx 4, an XML configuration file lets you vary the behavior of the speech engine without modifying the source code or recompiling.
The diagram above (from this .pdf) shows how an application plugs into the Sphinx framework without having to delve deeply into the code of Sphinx itself. The speech engine of Sphinx, called the Recognizer, consists of three main modules: the Front End, the Decoder, and the Linguist. The behavior of each of these modules can be configured individually in the XML configuration file. I will try to cover these three parts in separate posts once I understand them more fully, but for now, let’s focus on the application basics and configuration file setup.
The Sphinx-4 Application Programmer’s Guide and Configuration Management for Sphinx-4 are must-reads on this topic, but I’ll try to convey what I can here.
Any Java application incorporating Sphinx appears to go through five basic steps, as in the HelloDigits demo app that comes with Sphinx-4. (The code snippets below were taken from the Application Programmer’s Guide, except for the portion from the Transcriber demo.) Very concisely stated, here are those five steps in order:
1) Load the XML configuration file.
2) Create a ConfigurationManager, which interprets the XML.
3) Call lookup() on the ConfigurationManager to create the Recognizer object (the speech engine) and the audio input data stream.
4) Enter a loop driven by audio events, in which the Recognizer analyzes speech and returns Results.
5) Call Result methods to convert the speech analysis into text strings (or do more advanced operations…).
1) First, the file path of the XML config file is specified and loaded into the application as a URL object.
2) This URL is handed to an object called the ConfigurationManager, which interprets the XML and prepares the Sphinx speech engine to behave as the config file specifies.
try {
    URL url;
    if (args.length > 0) {
        url = new File(args[0]).toURI().toURL();
    } else {
        url = HelloDigits.class.getResource("hellodigits.config.xml");
    }
    ConfigurationManager cm = new ConfigurationManager(url);
3) Once the ConfigurationManager is created, we call lookup() on it to obtain a Recognizer (the speech engine containing the Front End, Decoder, and Linguist mentioned earlier) and an audio input (a microphone in the case of HelloDigits).
    Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
    Microphone microphone = (Microphone) cm.lookup("microphone");
Alternatively, a pre-recorded audio source can be converted into an audio stream and used as the input, as in the Transcriber demo:
    URL audioURL;
    if (args.length > 0) {
        audioURL = new File(args[0]).toURI().toURL();
    } else {
        audioURL = Transcriber.class.getResource("10001-90210-01803.wav");
    }
    AudioInputStream ais = AudioSystem.getAudioInputStream(audioURL);
    StreamDataSource reader = (StreamDataSource) cm.lookup("streamDataSource");
    reader.setInputStream(ais, audioURL.getFile());
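To make the idea behind steps 2 and 3 more concrete, here is a minimal sketch of how a configuration manager could map component names in an XML file to objects. This is not the Sphinx-4 ConfigurationManager: ToyConfigManager, ToyConfigDemo, and the sample XML are hypothetical names made up for illustration, the component types are plain JDK classes so the sketch runs without Sphinx on the classpath, and returning null for an unknown name is a choice of this sketch only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Toy stand-in for a configuration manager; NOT the Sphinx-4 class. */
class ToyConfigManager {
    // component name -> fully qualified class name declared in the XML
    private final Map<String, String> types = new HashMap<>();
    // component name -> instance, created lazily on first lookup
    private final Map<String, Object> instances = new HashMap<>();

    ToyConfigManager(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList components = doc.getElementsByTagName("component");
        for (int i = 0; i < components.getLength(); i++) {
            Element c = (Element) components.item(i);
            types.put(c.getAttribute("name"), c.getAttribute("type"));
        }
    }

    /** Returns the component registered under the name, or null if unknown (sketch-only behavior). */
    Object lookup(String name) throws Exception {
        String type = types.get(name);
        if (type == null) {
            return null;
        }
        Object instance = instances.get(name);
        if (instance == null) {
            // Instantiate by reflection, so swapping classes only requires editing the XML.
            instance = Class.forName(type).getDeclaredConstructor().newInstance();
            instances.put(name, instance);
        }
        return instance;
    }
}

class ToyConfigDemo {
    // Hypothetical config; JDK classes stand in for real speech components.
    static final String XML =
          "<config>"
        + "<component name=\"buffer\" type=\"java.lang.StringBuilder\"/>"
        + "<component name=\"words\" type=\"java.util.ArrayList\"/>"
        + "</config>";

    public static void main(String[] args) throws Exception {
        ToyConfigManager cm = new ToyConfigManager(XML);
        StringBuilder buffer = (StringBuilder) cm.lookup("buffer");
        buffer.append("configured without recompiling");
        System.out.println(buffer);
    }
}
```

The point of the sketch is only the design choice: because the application asks for components by name, changing the class named in the XML changes the behavior without touching or recompiling the application code.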
4) Once the Recognizer and the Front End audio input (either Microphone or StreamDataSource) are created, we allocate the resources the Recognizer needs:
recognizer.allocate();
and then the application can go into a loop where calling recognizer.recognize() will try to return Results while the audio input is available.
    if (microphone.startRecording()) {
        System.out.println
            ("Say any digit(s): e.g. \"two oh oh four\", " +
             "\"three six five\".");
        while (true) {
            System.out.println
                ("Start speaking. Press Ctrl-C to quit.\n");
            /*
             * This method will return when the end of speech
             * is reached. Note that the endpointer will determine
             * the end of speech.
             */
            Result result = recognizer.recognize();
            if (result != null) {
                String resultText = result.getBestResultNoFiller();
                System.out.println("You said: " + resultText + "\n");
            } else {
                System.out.println("I can't hear what you said.\n");
            }
        }
    } else {
        System.out.println("Cannot start microphone.");
        recognizer.deallocate();
        System.exit(1);
    }
5) Results are the objects that the Recognizer returns when speech is detected. The Recognizer analyzes speech by hypothesizing probable matches for what the user has said. The Result object contains all of the “search paths” that the Recognizer has traversed for a given block of speech: paths that have reached their “final state” (meaning the speech is probably at the end of a sentence or a long pause) as well as “active paths” that haven’t yet reached a final state. In short, the Result object is a collection of scored guesses about what has been said, with methods that let your application mine this collection in different ways.
getBestFinalResultNoFiller() is the method used most in the demos; it avoids partial sentences in the text output, because the program waits until it is confident a phrase is finished before handing off a textual guess. Another method, getBestResultNoFiller(), seems a bit more forgiving (but perhaps less accurate): it tries to return the highest-scoring result that has reached a final state, but if there is none, it is happy to return the active result with the highest score. There are many other methods for working with the Result returned by the Recognizer, including ways to dig into the search paths, words, and scores of Tokens to find the N-best results.
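The difference between the two methods can be sketched with a toy model. The classes below (ToyPath, ToyResult, ToyResultDemo) are hypothetical and are not the Sphinx-4 Result or Token classes; the scores, the "higher is better" convention, and the empty string returned when nothing matches are assumptions of this sketch, which only mirrors the selection rules as described above.

```java
import java.util.ArrayList;
import java.util.List;

/** One scored hypothesis; a toy stand-in, not the Sphinx-4 Token class. */
class ToyPath {
    final String text;
    final double score;     // higher is better in this sketch
    final boolean isFinal;  // true if the path reached a final state

    ToyPath(String text, double score, boolean isFinal) {
        this.text = text;
        this.score = score;
        this.isFinal = isFinal;
    }
}

/** A collection of scored guesses; a toy stand-in for a recognition result. */
class ToyResult {
    private final List<ToyPath> paths = new ArrayList<>();

    void add(ToyPath p) {
        paths.add(p);
    }

    /** Text of the best path that reached a final state, or "" if none did. */
    String bestFinal() {
        ToyPath best = best(true);
        return best == null ? "" : best.text;
    }

    /** Best final path if one exists; otherwise the best path of any kind. */
    String bestWithFallback() {
        ToyPath best = best(true);
        if (best == null) {
            best = best(false);
        }
        return best == null ? "" : best.text;
    }

    // Highest-scoring path, optionally restricted to final paths.
    private ToyPath best(boolean finalOnly) {
        ToyPath best = null;
        for (ToyPath p : paths) {
            if (finalOnly && !p.isFinal) {
                continue;
            }
            if (best == null || p.score > best.score) {
                best = p;
            }
        }
        return best;
    }
}

class ToyResultDemo {
    public static void main(String[] args) {
        ToyResult result = new ToyResult();
        // Only active (unfinished) paths so far.
        result.add(new ToyPath("one two", -10.0, false));
        result.add(new ToyPath("one too", -12.0, false));
        System.out.println("final only: [" + result.bestFinal() + "]");
        System.out.println("with fallback: [" + result.bestWithFallback() + "]");
    }
}
```

With only active paths in the collection, the final-only method has nothing to return, while the fallback method still offers its best partial guess; once a final path is added, both agree on it.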
Resources:
For info on the architecture of Sphinx 4:
Sphinx 4 for the Java platform
Architecture Notes
Sphinx-4: A Flexible Open Source Framework for Speech Recognition
For info on the configuration of Sphinx 4 applications:
Sphinx-4 Application Programmer’s Guide
Configuration Management for Sphinx-4