The Speech Zone: 2008

Friday, 3 October 2008

Store the audio data from a recognition in a file

In a real-world speech application, one might need to store the audio data from a recognition, to check afterwards whether what the user said matches what the application recognized.

It's pretty easy to do this using just SAPI objects (no other external goop).

Apparently, one can get the audio data only in the case of a successful recognition.

In terms of code, one should put the following in the SPEI_RECOGNITION case from the previous post (hr and pRecoResult come from that code), and then one can use this functionality.

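//note: if GetAudio reports that the result has no audio data, audio retention may first
//need to be enabled on the context, e.g. pRecoContext->SetAudioOptions(SPAO_RETAIN_AUDIO, NULL, NULL)

CComPtr<ISpVoice> pSpVoice;

//create a TTS object
hr = pSpVoice.CoCreateInstance(CLSID_SpVoice);
//check hr

CSpStreamFormat cAudioFmt;
hr = cAudioFmt.AssignFormat(SPSF_11kHz16BitMono);
//check hr

CComPtr<ISpStream> pSpStream;
//wide string literal; the backslash must be escaped, otherwise "\t" is read as a tab
hr = SPBindToFile(L"D:\\temp.wav", SPFM_CREATE_ALWAYS, &pSpStream, &cAudioFmt.FormatId(), cAudioFmt.WaveFormatExPtr());
//check hr

//set TTS output to wav file
hr = pSpVoice->SetOutput(pSpStream, TRUE);
//check hr

SPPHRASE* pPhrase = 0;
//get a phrase object (it describes the recognized elements, including their count)
hr = pRecoResult->GetPhrase(&pPhrase);
//check hr

//get audio for all elements of the phrase
CComPtr<ISpStreamFormat> pStreamFormat;
hr = pRecoResult->GetAudio(0, pPhrase->Rule.ulCountOfElements, &pStreamFormat);
//check hr

//speak audio data to file
hr = pSpVoice->SpeakStream(pStreamFormat, SPF_DEFAULT, NULL);
//check hr

//clean up (the phrase was allocated by SAPI with CoTaskMemAlloc)
::CoTaskMemFree(pPhrase);
pSpStream->Close();
pSpStream.Release();
pSpVoice.Release();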
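To get a feeling for how big the stored files become, here is a small portable sketch (not SAPI code) that estimates the size of the audio data for a given format. It assumes that SPSF_11kHz16BitMono stands for 11,025 samples per second, 16 bits per sample and one channel; the names PcmFormat, bytesPerSecond and pcmDataBytes are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Parameters of a plain PCM stream (hypothetical helper type, not part of SAPI).
struct PcmFormat {
    std::uint32_t samplesPerSecond;
    std::uint16_t bitsPerSample;
    std::uint16_t channels;
};

// Assumption: this is what SPSF_11kHz16BitMono describes.
const PcmFormat kFormat11kHz16BitMono = {11025, 16, 1};

// Bytes of audio produced per second of speech.
inline std::uint64_t bytesPerSecond(const PcmFormat& f) {
    assert(f.bitsPerSample % 8 == 0);  // whole bytes per sample only
    return static_cast<std::uint64_t>(f.samplesPerSecond) * (f.bitsPerSample / 8) * f.channels;
}

// Approximate size of the audio data for an utterance of the given length.
// The WAV header is not counted.
inline std::uint64_t pcmDataBytes(const PcmFormat& f, std::uint64_t milliseconds) {
    return bytesPerSecond(f) * milliseconds / 1000;
}
```

Under these assumptions a three-second utterance takes about 66 KB, so keeping every recognition on disk is cheap for a command and control application.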

Have fun!

Thursday, 2 October 2008

Simple ASR application

A basic speech recognition application should take the user input from the microphone (the audio signal) and recognize some text from a grammar. This is a typical scenario for command and control applications.

For this, one needs a recognizer object, a context object (used to listen for different events from the SR engine), an audio input object (one can take the default object or choose one of those available on the machine), and one or more grammars (which can be compiled or XML grammars; the SAPI documentation has more information about these).

Ok, if that is clear, it's time to post the code that does the job.

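#include <windows.h>
#include <atlbase.h>   //CComPtr
#include <stdio.h>     //printf, getchar
#include "sphelper.h"

CComPtr<ISpRecognizer> pRecoEngine;
CComPtr<ISpRecoContext> pRecoContext;
CComPtr<ISpRecoGrammar> pRecoGrammar;
CComPtr<ISpObjectToken> pInputToken;
CComPtr<ISpRecoResult> pRecoResult;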

int main(void)
{
    HRESULT hr = S_OK;

    hr = ::CoInitialize(NULL);
    if(SUCCEEDED(hr))
    {
        //create an inproc recognizer - for this type of recognizer you have to set the audio input object manually
        hr = pRecoEngine.CoCreateInstance(CLSID_SpInprocRecognizer);
        if(FAILED(hr))
        {
            printf("--- FAILED to create InProcRecognizer \n");
            return -1;
        }

        //create context
        hr = pRecoEngine->CreateRecoContext(&pRecoContext);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Context \n");
            return -1;
        }

        //create grammar
        hr = pRecoContext->CreateGrammar(0, &pRecoGrammar);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Grammar \n");
            return -1;
        }

        //set object which will receive notifications from the engine
        hr = pRecoContext->SetNotifyWin32Event();
        if(FAILED(hr))
        {
            printf("--- FAILED to SetNotifyWin32Event() \n");
            return -1;
        }

        ULONGLONG events = SPFEI(SPEI_RECOGNITION) |
                           SPFEI(SPEI_FALSE_RECOGNITION) |
                           SPFEI(SPEI_PHRASE_START) |
                           SPFEI(SPEI_SOUND_START);

        //set events we want to receive from the engine
        hr = pRecoContext->SetInterest(events, events);
        if(FAILED(hr))
        {
            printf("--- FAILED to SetInterest() \n");
            return -1;
        }

        //get default audio input object
        hr = SpGetDefaultTokenFromCategoryId(SPCAT_AUDIOIN, &pInputToken);
        if(FAILED(hr))
        {
            printf("--- FAILED to Get default input token \n");
            return -1;
        }

        //the input object was obtained successfully, so use it
        hr = pRecoEngine->SetInput(pInputToken, FALSE);
        if(FAILED(hr))
        {
            printf("--- FAILED to SET default input token \n");
            return -1;
        }

        GUID grammarGUID;
        ::CoCreateGuid(&grammarGUID);

        //load a grammar - this is an example with proprietary grammars, which require GUIDs
        //xml grammars do not require GUIDs, one could just use integer grammar ids
        //(the GUID is passed by pointer)
        hr = pRecoGrammar->LoadCmdFromProprietaryGrammar(&grammarGUID, L"digit", NULL, 0, SPLO_STATIC);
        if(FAILED(hr))
        {
            printf("--- FAILED to Load Proprietary grammar \n");
            return -1;
        }

        //set engine state to active
        pRecoEngine->SetRecoState(SPRST_ACTIVE);
        //enable context
        pRecoContext->SetContextState(SPCS_ENABLED);
        //enable grammar
        pRecoGrammar->SetGrammarState(SPGS_ENABLED);
        //activate grammar rules
        pRecoGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
        //these four steps are necessary for the SR engine to be able to start listening for user speech

        bool bDone = false;
        CSpEvent event;

        while(!bDone)
        {
            //wait for 5 seconds for an event from the engine
            hr = pRecoContext->WaitForNotifyEvent(5000);
            if(hr == S_FALSE)
            {
                printf("--- Operation timeout \n");
            }
            else if(hr == S_OK)
            {
                //get the event from the context
                if(event.GetFrom(pRecoContext) == S_OK)
                {
                    switch(event.eEventId)
                    {
                    case SPEI_SOUND_START:
                        {
                            printf("--- SPEI_SOUND_START \n");
                        }
                        break;
                    case SPEI_RECOGNITION:
                        {
                            //handle a successful recognition (print the result)
                            printf("--- SPEI_RECOGNITION \n");
                            WCHAR* pRecogStr = 0;
                            pRecoResult = event.RecoResult();
                            hr = pRecoResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &pRecogStr, NULL);
                            if(SUCCEEDED(hr))
                            {
                                printf("--- Result is: %ls \n", pRecogStr);
                                //the text is allocated by SAPI and must be freed by the caller
                                ::CoTaskMemFree(pRecogStr);
                            }
                            bDone = true;
                        }
                        break;
                    case SPEI_FALSE_RECOGNITION:
                        {
                            printf("--- SPEI_FALSE_RECOGNITION \n");
                            bDone = true;
                        }
                        break;
                    default:
                        break;
                    }
                }
            }
            else
            {
                //the wait itself failed, so stop instead of looping forever
                printf("--- FAILED to wait for an event \n");
                bDone = true;
            }
        }

        //release all SAPI objects before uninitializing COM
        pRecoResult.Release();
        pInputToken.Release();
        pRecoGrammar.Release();
        pRecoContext.Release();
        pRecoEngine.Release();

        ::CoUninitialize();
    }

    getchar();
    return 0;
}

This should get one started with speech recognition. One should check the SAPI docs for more information on each function in the API.
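The event mask passed to SetInterest above is just a 64-bit value with one bit per event type. The following portable sketch (not SAPI code) shows the idea under the assumption that SPFEI turns an event id into a bit flag; the DEMO_* ids and their numeric values are placeholders made up for this example, not the real SPEI_* values from the SAPI headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical event ids for illustration only; the real SPEI_* values
// come from the SAPI headers and are not reproduced here.
enum DemoEventId : unsigned {
    DEMO_SOUND_START = 3,
    DEMO_PHRASE_START = 5,
    DEMO_RECOGNITION = 6,
    DEMO_FALSE_RECOGNITION = 11
};

// One bit per event id, in the spirit of the SPFEI macro.
inline std::uint64_t demoFlag(DemoEventId id) {
    assert(id < 64);  // the id must fit into a 64-bit mask
    return std::uint64_t(1) << id;
}

// True if the given event type is part of the interest mask.
inline bool isInterested(std::uint64_t mask, DemoEventId id) {
    return (mask & demoFlag(id)) != 0;
}
```

Combining flags with | builds the mask, and a single & tells whether a given event type was asked for, which is why a set of events fits into one ULONGLONG.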

Cheers!

Monday, 29 September 2008

Simple TTS application

OK, so this is my first post on this blog, which is dedicated mainly to speech-related topics, although it may drift into other topics as well.

First of all, allow me to introduce myself: I am a software developer (working with embedded systems at the moment) who has some experience with SAPI. Since the documentation for SAPI is not properly maintained, I have long thought about making a web page where I can share my knowledge. And what better way to do this than a blog (it's somewhat fashionable)?

I've had a lot of trouble using SAPI for some advanced stuff, with lots and lots of tests, and I wouldn't wish for anyone else to go through these trials of patience and nerve-crushing despair (OK, I'm exaggerating, but it was at least annoying at times).

Good, now that we're done with all that, it's time to get to the good stuff: SAPI (Speech Application Programming Interface) is an interface provided by Microsoft, which can be used to develop speech enabled applications. It provides a level of abstraction between the application and the speech engines (ASR and TTS - I'll explain what's with these later).

There are two ways for an application to become speech enabled: one is for the application to speak a piece of text (TTS - Text To Speech), the other is for the application to recognize the text of a spoken sentence (ASR - Automatic Speech Recognition). I'm not going to explain how these two work; if you want to find out more, you can use Google. It's enough to say that SAPI provides the tools to use these lower layers (speech engines) to create complex speech applications.

SAPI is a very cool thingy, but don't expect to just call a few methods and get an excellent speech application. If you want to do something complex, there's a lot of stuff that the documentation just "omits", and you have to do a lot of tests to get to the bottom of it.

Ok, that's enough praising, it's time to do something practical.

The first step, if you want to create something that contains speech, is to download the Microsoft Speech SDK (it's free). You can download it from the Microsoft site. It installs the binaries and header files for SAPI, some sample applications, a few voices for TTS, and a recognition engine from Microsoft, which is not very good but does the job.

I'm getting bored just talking about the interface, so I'm going to post some code now. It's just a simple TTS console application that tries to speak something:

#include <windows.h>
#include <atlbase.h>   //CComPtr
#include <stdio.h>     //printf
#include "sphelper.h"

int main(void)
{
    HRESULT hr = E_FAIL;

    CComPtr<ISpVoice> cpVoice;

    //initialize COM
    hr = ::CoInitialize(NULL);
    if(FAILED(hr))
    {
        printf("CoInitialize FAILED \n");
        return -1;
    }

    //create the ISpVoice object: this is the TTS object
    hr = cpVoice.CoCreateInstance(CLSID_SpVoice);
    if(FAILED(hr))
    {
        printf("CoCreateInstance FAILED \n");
        return -1;
    }

    //set events that you are interested in
    hr = cpVoice->SetInterest(SPFEI_ALL_TTS_EVENTS, SPFEI_ALL_TTS_EVENTS);
    if(FAILED(hr))
    {
        printf("SetInterest FAILED \n");
    }

    //set object that will receive notifications from the TTS engine
    hr = cpVoice->SetNotifyWin32Event();
    if(FAILED(hr))
    {
        printf("SetNotifyWin32Event FAILED \n");
        return -1;
    }

    //speak some text (SPF_DEFAULT: the call returns after the text has been spoken)
    hr = cpVoice->Speak(L"Hello World", SPF_DEFAULT, NULL);
    if(FAILED(hr))
    {
        printf("Speak FAILED \n");
        return -1;
    }

    //release the voice before uninitializing COM
    cpVoice.Release();

    ::CoUninitialize();

    return 0;
}

Ok, that was all. This should speak "Hello World". Pretty simple, eh? Maybe tomorrow I'll have time to make a more complex application, with event handling, and other stuff.
