speak and be turned into bits
As it comes with the Sphinx-4 package, the HelloWorld.jar example only recognizes the following words:
To expand this vocabulary, we need to modify a grammar file that the JSGFGrammar class imports, and then we need to rebuild the HelloWorld.jar using ant.
In the sphinx4-1.0beta/demo/sphinx/helloworld/ directory that came with the Sphinx-4 package, you should see a file called hello.gram. If you open this grammar file, you see the following code:
public <greet> = (Good morning | Hello)
( Bhiksha | Evandro | Paul | Philip | Rita | Will );
So, this grammar file tells the JSGFGrammar in this application to look for phrases made of two parts: a greeting from the first group followed by a name from the second. To change what words our application can recognize, we simply add to these word groups. For instance, if we want to be able to say congratulations to all these people, we simply change the public <greet> rule so that its first group also includes Congratulations.
Once this change is made, you may also want to mirror it in the HelloWorld.java program so that its println statement prints the new word possibilities to the terminal. Then, to rebuild the application, simply change directory to the demo/helloworld/ in Terminal and type:
Your system must have ant installed for this to work. This command finds the “build.xml” file in that directory, which tells ant where all the files needed to build HelloWorld.jar are. Once the program has been built, you should be able to go to sphinx4-1.0beta/bin/, run java -mx312m -jar HelloWorld.jar, and say Congratulations Paul (or whoever).
Using grammars like this obviously provides little flexibility in what the program understands. There are cases where this is an advantage, however, such as menu navigation or a survey, where only a few answers are expected. JSGF grammar files can be quite extensive, though. Take a look at the developer’s guide to see how they can import rules from other grammar files, reference rules within rules, weight certain words above others, and more.
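To make the "two word groups" idea concrete, here is a minimal sketch in plain Java that lists every phrase the <greet> rule above can match. It does not use Sphinx-4 at all; the class name GreetGrammarSketch and the method expand() are made up for illustration, and only the word lists come from hello.gram.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Sphinx-4: it only illustrates how a rule
// made of two alternative groups multiplies out into full phrases.
class GreetGrammarSketch {

    // Pair every alternative of the first group with every alternative of the second.
    static List<String> expand(String[] greetings, String[] names) {
        List<String> phrases = new ArrayList<>();
        for (String greeting : greetings) {
            for (String name : names) {
                phrases.add(greeting + " " + name);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Word groups copied from the <greet> rule in hello.gram.
        String[] greetings = {"Good morning", "Hello"};
        String[] names = {"Bhiksha", "Evandro", "Paul", "Philip", "Rita", "Will"};
        for (String phrase : expand(greetings, names)) {
            System.out.println(phrase);
        }
    }
}
```

With the stock grammar this gives 2 x 6 = 12 phrases. Adding one more greeting to the first group would give 18, which is why a small edit to the grammar file goes a long way.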
But, I don’t see the grammar file or the JSGFGrammar class in the HelloWorld.java file. Where do these get called?
All of this is specified in the “helloworld.config.xml” file, which the application loads and the ConfigurationManager acts on. In that xml file, you will see in the section commented as “The Grammar Configuration”:
So, this xml tells the application where to find the hello.gram file in the .jar’s resource path as well as the grammarName that was specified by the line “grammar hello;” in the grammar file.
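To get a feel for what "reading the grammar configuration" involves, here is a rough sketch that pulls name/value properties out of a component entry using only the standard Java XML parser. This is not how Sphinx-4's ConfigurationManager is implemented, and the XML fragment is a stand-in I made up: the component name jsgfGrammar, the type ExampleGrammarType, the property name grammarLocation, and its value are all assumptions. Only grammarName = hello comes from the explanation above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, NOT the Sphinx-4 ConfigurationManager.
class GrammarConfigSketch {

    // Illustrative fragment only: names and values are placeholders,
    // except grammarName, which matches "grammar hello;" in hello.gram.
    static final String SAMPLE_XML =
          "<config>"
        + "<component name=\"jsgfGrammar\" type=\"ExampleGrammarType\">"
        + "<property name=\"grammarLocation\" value=\"resource:/example/path/\"/>"
        + "<property name=\"grammarName\" value=\"hello\"/>"
        + "</component>"
        + "</config>";

    // Collect the property name/value pairs of the component with the given name.
    static Map<String, String> readProperties(String xml, String componentName) {
        Map<String, String> props = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList components = doc.getElementsByTagName("component");
            for (int i = 0; i < components.getLength(); i++) {
                Element component = (Element) components.item(i);
                if (!componentName.equals(component.getAttribute("name"))) {
                    continue; // some other component, skip it
                }
                NodeList properties = component.getElementsByTagName("property");
                for (int j = 0; j < properties.getLength(); j++) {
                    Element property = (Element) properties.item(j);
                    props.put(property.getAttribute("name"), property.getAttribute("value"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(readProperties(SAMPLE_XML, "jsgfGrammar"));
    }
}
```

The point is just that the Java source never has to mention hello.gram: the name and location live in the config file as plain properties, and something reads them at startup.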

(Diagram taken from Sphinx-4: A Flexible Open Source Framework for Speech Recognition)
NOTE: This post is not complete.
The beauty of the Sphinx 4 architecture is its modularity and pluggability. Previously, speech recognition programs were built to fulfill specific roles: continuous speech vs. non-continuous, large vocabulary vs. smaller vocabulary, etc. Now with Sphinx 4, an xml-configuration file allows varied & dynamic behavior from the speech engine without a need for modifying the source code or recompiling.
The diagram above (from this .pdf) shows how an application plugs into the Sphinx framework without having to delve deeply into the code of Sphinx itself. The speech engine of Sphinx, called the Recognizer, consists of 3 main modules: the Front End, the Decoder, & the Linguist. The behavior of each of these modules can be individually configured in the xml configuration file. I will try to delve into these three parts in separate posts once I understand them more fully, but for now, let’s focus on the application basics and configuration file setup.
The Sphinx-4 Application Programmer’s Guide and Configuration Management for Sphinx-4 are must reads on this topic, but I’ll try to convey what I can here.
Any Java application incorporating Sphinx appears to go through 5 basic steps, as in the HelloDigits app demo that comes with Sphinx-4. (Code snippets below were taken from the Application Programmer’s Guide, except for the portion from the Transcriber demo.) Very concisely stated, here are those 5 steps in order:
1) Load in xml configuration file.
2) Create ConfigurationManager which interprets xml.
3) Use lookup() on ConfigurationManager to create Recognizer object (the speech engine) and audio input data stream.
4) Go into loop that is based on audio events where the Recognizer analyzes speech to return Results.
5) Call Result methods to convert speech analysis into text strings ( or do more advanced operations… )
1) First, the file path of the xml config file is specified and loaded into the application via the URL object.
2) This URL is handed to an object called the ConfigurationManager which interprets the xml and gets the Sphinx speech engine ready to behave as the config file specifies that it should.
3) Once the Configuration Manager is created, we call lookup() on it in order to make a Recognizer ( the speech engine containing the Front End, Decoder, and Linguist previously mentioned ) and an audio input ( a microphone in the case of HelloDigits.)
Alternatively, a pre-recorded audio source can be converted into an audiostream and used as an input, as in the Transcriber demo:
4) Once the Recognizer and Front End audio input (either Microphone or StreamDataSource) are created, we allocate necessary memory resources to the Recognizer
and then the application can go into a loop where calling recognizer.recognize() will try to return Results while the audio input is available.
// The opening "if" was missing from this snippet; it is restored here to match
// the "Cannot start microphone" branch below. "microphone" and "recognizer"
// are the objects obtained from the ConfigurationManager in step 3.
if (microphone.startRecording()) {
    System.out.println
        ("Say any digit(s): e.g. \"two oh oh four\", " +
         "\"three six five\".");
    while (true) {
        System.out.println
            ("Start speaking. Press Ctrl-C to quit.\n");
        /*
         * This method will return when the end of speech
         * is reached. Note that the endpointer will determine
         * the end of speech.
         */
        Result result = recognizer.recognize();
        if (result != null) {
            String resultText = result.getBestResultNoFiller();
            System.out.println("You said: " + resultText + "\n");
        } else {
            System.out.println("I can't hear what you said.\n");
        }
    }
} else {
    System.out.println("Cannot start microphone.");
    recognizer.deallocate();
    System.exit(1);
}
5) Results are the objects that the Recognizer returns when speech is detected. The Recognizer analyzes speech by hypothesizing on probable matches for what a user has said. The Result object actually contains all of the “search paths” that the Recognizer has traversed for a given block of speech. It contains paths that have reached their “final state” ( meaning that it’s probably at the end of a sentence or long pause ) as well as “active paths” that haven’t yet reached final state. Basically, the Result object is a collection of scored guesses that the computer makes about what has been said, and the object has methods for your application to mine through this collection in different ways.
The getBestFinalResultNoFiller() method is the one used most in the demos, and it is used to avoid any partial sentences in the text output. Basically, the program waits until it is certain of a finished phrase before it hands off a textual guess. Another possible method, getBestResultNoFiller(), seems a bit more forgiving (but perhaps less accurate): it first tries to return the highest scored result that has reached a final state, but if it doesn’t find one, it is happy to return the active result with the highest score. There are many other methods for manipulating the Result returned by the Recognizer, including ways to dig into the search paths, words, and scores of Tokens to find the N-best results.
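Here is a small plain-Java model of the difference between those two methods as I understand it. It is not Sphinx-4 code: the Hypothesis class, the scores, and the method names bestFinal() and bestWithFallback() are all invented for illustration, and I am assuming that a higher score means a better guess and that "no final result" comes back as an empty string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a Result as a bag of scored guesses. NOT the Sphinx-4 API.
class ResultSketch {

    // One scored guess; isFinal marks a path that reached a final state.
    static final class Hypothesis {
        final String text;
        final double score;
        final boolean isFinal;

        Hypothesis(String text, double score, boolean isFinal) {
            this.text = text;
            this.score = score;
            this.isFinal = isFinal;
        }
    }

    // Highest-scoring guess, optionally restricted to final paths; null if none qualifies.
    static Hypothesis pick(List<Hypothesis> paths, boolean finalOnly) {
        Hypothesis best = null;
        for (Hypothesis h : paths) {
            if (finalOnly && !h.isFinal) {
                continue;
            }
            if (best == null || h.score > best.score) {
                best = h;
            }
        }
        return best;
    }

    // Models the stricter behavior: only finished phrases count.
    static String bestFinal(List<Hypothesis> paths) {
        Hypothesis h = pick(paths, true);
        return h == null ? "" : h.text;
    }

    // Models the forgiving behavior: prefer a final path, else fall back to the best active one.
    static String bestWithFallback(List<Hypothesis> paths) {
        Hypothesis h = pick(paths, true);
        if (h == null) {
            h = pick(paths, false);
        }
        return h == null ? "" : h.text;
    }

    public static void main(String[] args) {
        List<Hypothesis> paths = new ArrayList<>();
        paths.add(new Hypothesis("two oh", -120.0, false));
        paths.add(new Hypothesis("two oh oh", -150.0, false));
        System.out.println("strict:    [" + bestFinal(paths) + "]");
        System.out.println("forgiving: [" + bestWithFallback(paths) + "]");
    }
}
```

With only active paths in the collection, the strict version returns nothing while the forgiving one returns its best partial guess. Once a final path is present, both versions return it, even if some active path has a higher score.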
Resources:
For info on the architecture of Sphinx 4
Sphinx 4 for the Java platform
Architecture Notes
Sphinx-4: A Flexible Open Source Framework for Speech Recognition
For info on the configuration of Sphinx 4 applications
Sphinx-4 Application Programmer’s Guide
Configuration Management for Sphinx-4
Bluetooth headsets could turn out to be the most widely available interface for speaking to the computer, so we wanted to do some tests to see if the demos that came with Sphinx-4 would work with a headset.
One useful thing to note is that I had to jump through some hoops to get the Bluetooth headset to pair with my Powerbook G4 that I purchased in 2003. Basically, I had to update my bluetooth software AND FIRMWARE for it to be able to pair with the NOKIA HDW-3 headset that we wanted to use. Dig into the Apple forums here and here for useful tips on this update process.
Once you have the headset paired with the computer, change your line-in settings in System Preferences / Sound to your bluetooth headset so that the computer is listening through the headset instead of its internal mic.
I tested several of the demo applications from Sphinx-4 with the Bluetooth headset and the mic built in to the laptop with very similar results. Neither input achieves total accuracy with the demos, but both are frequently recognizable by the computer. I’m not sure how to run the diagnostic applications yet to compare exactly what the difference is, but both seem to work to some degree. There’s something I don’t understand yet with the timing of when the computer is actually listening in the demos, so that is something to look into.
Sphinx is an HMM-based speech recognition system developed at CMU. There are 4 versions of Sphinx. For the moment, I’m exploring Sphinx-4 because it is written in Java, and it would probably combine most easily with the Video Comments work being done at ITP.
Getting everything you need to run Sphinx 4
First, download the bin (or the source if you want to compile everything yourself) of Sphinx 4 at sourceforge.
Then, to run any speech applications in Java, whether they be speech recognition or text to speech, you must get the Java Speech API set up.
After downloading the Sphinx-4 bin, you need to “unpack” the jsapi.jar by signing a BCL license. Instructions for setting up the JSAPI 1.0 for UNIX and Windows systems are here, but I’ll repeat how I set up on my Mac right here for good measure:
In Terminal, change directory to the lib folder in the Sphinx-4 package that you downloaded where the jsapi.sh file sits.
cd sphinx4-1/sphinx4-1.0beta/lib
If you type ls, you should see a file called jsapi.sh in this directory.
Then type chmod +x ./jsapi.sh
Then type sh ./jsapi.sh
A long document that is the BCL license should show up. Scroll down to the end of it and agree to it by typing ‘y’ when prompted to do so. Then, when you press Enter, the jsapi.jar should be unpacked and ready in this same lib directory.
Move this jsapi.jar file to your Java Extensions folder ( yourComputer: System/Library/Java/Extensions ). Now your computer should know how to talk with Java!
Now run some demos!
To make sure everything is set up correctly, we can now run the demos that came with the Sphinx-4 download. In the bin directory, let’s run the HelloWorld.jar that came with the package. In this application, there is a fixed vocabulary of words that you can speak for your computer to hear.
Run it in Terminal by changing directory to the bin folder ( sphinx4-1/sphinx4-1.0beta/bin ) and then launch the HelloWorld app by typing java -mx312m -jar HelloWorld.jar. A more in-depth tutorial about running the HelloWorld.jar and what the app does is on the Sphinx website.
I found one useful summary of what speech recognition is here. It details the types of speech recognizers, including speaker-independent, speaker-dependent, continuous speech recognition, isolated speech recognition, and vocabulary constrained system. As the technology exists now, it seems that one has to figure out a compromise between vocabulary size that the computer can recognize and flexibility of the system to recognize different speakers and natural ways of speaking. From what Shawn has told me about what we’re looking for, we definitely need a continuous speech program with a rather large vocabulary. Allowing natural ways of speaking seems to be the most important element since we want this system to maintain a conversational feel. So, the accuracy of the recognition doesn’t have to be perfect.
This site is a repository for research about speech recognition done for academic projects at ITP