The Speech Zone: October 2008

In a real-world speech application, one might need to store the audio data from a recognition, in order to check afterwards whether what the user said matches what the application recognized.

It's pretty easy to do this using just SAPI objects (no other external goop).

Apparently one can get the audio data only in the case of a successful recognition.

Code-wise, one should put the following in the SPEI_RECOGNITION case from the previous post (inside the braces of that case, where hr and pRecoResult are already available).

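CComPtr<ISpVoice> pSpVoice;

//create a TTS object
hr = pSpVoice.CoCreateInstance(CLSID_SpVoice);
//check hr

CSpStreamFormat cAudioFmt;
hr = cAudioFmt.AssignFormat(SPSF_11kHz16BitMono);
//check hr

CComPtr<ISpStream> pSpStream;

//create the wav file (wide string literal, with the backslash escaped)
hr = SPBindToFile(L"D:\\temp.wav", SPFM_CREATE_ALWAYS, &pSpStream, &cAudioFmt.FormatId(), cAudioFmt.WaveFormatExPtr());
//check hr

//set TTS output to wav file
hr = pSpVoice->SetOutput(pSpStream, TRUE);
//check hr

SPPHRASE* pPhrase = 0;

//get a phrase object (this gives the number of elements in the result)
hr = pRecoResult->GetPhrase(&pPhrase);
//check hr

CComPtr<ISpStreamFormat> pStreamFormat;

//get audio for all the elements of the phrase
//note: as far as I know this only works if audio retention was enabled on the context
//beforehand, e.g. pRecoContext->SetAudioOptions(SPAO_RETAIN_AUDIO, NULL, NULL)
hr = pRecoResult->GetAudio(0, pPhrase->Rule.ulCountOfElements, &pStreamFormat);
//check hr

//speak audio data to file
hr = pSpVoice->SpeakStream(pStreamFormat, SPF_DEFAULT, NULL);
//check hr

//clean up (the phrase was allocated by SAPI and has to be freed by the caller)
::CoTaskMemFree(pPhrase);
pSpStream->Close();
pStreamFormat.Release();
pSpStream.Release();
pSpVoice.Release();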
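
As a rough sanity check on the size of the wav file, one could work out the data rate implied by the chosen format. The sketch below assumes that SPSF_11kHz16BitMono means 11,025 Hz, 16-bit, mono PCM; the PcmFormat struct and the helper names are made up for illustration and are not part of SAPI.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Plain stand-in for the fields SAPI fills into a WAVEFORMATEX.
// Illustrative only: this struct is not a SAPI type.
struct PcmFormat {
    std::uint32_t samplesPerSec;  // sampling rate in Hz
    std::uint16_t bitsPerSample;  // width of one sample
    std::uint16_t channels;       // 1 = mono, 2 = stereo
};

// Bytes occupied by one sample frame (all channels together)
inline std::uint32_t blockAlign(const PcmFormat& f) {
    assert(f.bitsPerSample % 8 == 0);
    return f.channels * (f.bitsPerSample / 8u);
}

// Bytes of audio produced per second of speech
inline std::uint32_t avgBytesPerSec(const PcmFormat& f) {
    return f.samplesPerSec * blockAlign(f);
}

// Approximate size of the audio payload, not counting the wav header
inline std::uint64_t dataBytes(const PcmFormat& f, double seconds) {
    return static_cast<std::uint64_t>(avgBytesPerSec(f) * seconds);
}

// Assumed values for SPSF_11kHz16BitMono
const PcmFormat kFmt11kHz16BitMono = {11025, 16, 1};

// Prints the estimated payload size for an utterance of the given length
inline void printEstimate(double seconds) {
    std::printf("%.1f s of audio is about %llu bytes\n", seconds,
                static_cast<unsigned long long>(dataBytes(kFmt11kHz16BitMono, seconds)));
}
```

Under these assumptions a two-second utterance comes to roughly 44 KB of audio data, so keeping one file per recognition is cheap.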

Have fun!

A basic speech recognition application should take the user input from the microphone (the audio signal) and recognize some text from a grammar. This is a typical scenario for command and control applications.

For this, one needs a recognizer object, a context object (used to listen for different events from the SR engine), an audio input object (one can get the default object or choose one of those available on the machine), and one or more grammars (either compiled or XML grammars; the SAPI documentation has more information about these).
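
To give an idea of what an XML grammar might look like, here is a minimal sketch in the SAPI 5 text grammar format, as far as I remember it. The rule name, the word list and the file name digits.xml are assumptions made up for illustration, and writeGrammarFile is a hypothetical helper that only writes the text to disk; the SAPI calls that would load the file are shown as comments because they only build on Windows.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// A minimal text grammar with a single top-level rule.
// <L> is a list of alternatives, <P> is one phrase.
// Rule name and words are illustrative assumptions.
const char* const kDigitGrammarXml = R"(<GRAMMAR LANGID="409">
  <RULE NAME="digit" TOPLEVEL="ACTIVE">
    <L>
      <P>one</P>
      <P>two</P>
      <P>three</P>
    </L>
  </RULE>
</GRAMMAR>
)";

// Writes the grammar text to the given path; returns true on success
inline bool writeGrammarFile(const std::string& path) {
    assert(!path.empty());
    std::ofstream out(path.c_str(), std::ios::binary);
    if (!out) {
        return false;
    }
    out << kDigitGrammarXml;
    return out.good();
}

// On Windows the file could then be loaded through the grammar object:
//   hr = pRecoContext->CreateGrammar(1, &pRecoGrammar);             // plain integer id
//   hr = pRecoGrammar->LoadCmdFromFile(L"digits.xml", SPLO_STATIC);
```

With a grammar like this, the recognizer would only accept the words listed in the rule, which is what one usually wants for command and control.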

OK, with that clear, it's time to post the code that does the job.

#include <windows.h>
#include <atlbase.h>
#include <sphelper.h>
#include <stdio.h>

CComPtr<ISpRecognizer> pRecoEngine;
CComPtr<ISpRecoContext> pRecoContext;
CComPtr<ISpRecoGrammar> pRecoGrammar;
CComPtr<ISpObjectToken> pInputToken;
CComPtr<ISpRecoResult> pRecoResult;

int main(void)
{
    HRESULT hr = S_OK;

    hr = ::CoInitialize(NULL);
    if(SUCCEEDED(hr))
    {
        //create an inproc recognizer - for this type of recognizer you have to set the audio input object manually
        hr = pRecoEngine.CoCreateInstance(CLSID_SpInprocRecognizer);
        if(FAILED(hr))
        {
            printf("--- FAILED to create InProcRecognizer \n");
            return -1;
        }

        //create context
        hr = pRecoEngine->CreateRecoContext(&pRecoContext);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Context \n");
            return -1;
        }

        //create grammar
        hr = pRecoContext->CreateGrammar(0, &pRecoGrammar);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Grammar \n");
            return -1;
        }

        //set object which will receive notifications from the engine
        hr = pRecoContext->SetNotifyWin32Event();
        if(FAILED(hr))
        {
            printf("--- FAILED to SetNotifyWin32Event() \n");
            return -1;
        }

        ULONGLONG events = SPFEI(SPEI_RECOGNITION) |
                           SPFEI(SPEI_FALSE_RECOGNITION) |
                           SPFEI(SPEI_PHRASE_START) |
                           SPFEI(SPEI_SOUND_START);

        //set events we want to receive from the engine
        hr = pRecoContext->SetInterest(events, events);
        if(FAILED(hr))
        {
            printf("--- FAILED to SetInterest() \n");
            return -1;
        }

        //get default audio input object
        hr = SpGetDefaultTokenFromCategoryId(SPCAT_AUDIOIN, &pInputToken);
        if(FAILED(hr))
        {
            printf("--- FAILED to Get default input token \n");
            return -1;
        }
        else
        {
            //if input object got successfully, use it
            hr = pRecoEngine->SetInput(pInputToken, FALSE);
            if(FAILED(hr))
            {
                printf("--- FAILED to SET default input token \n");
            }

            GUID grammarGUID;
            ::CoCreateGuid(&grammarGUID);

            //load a grammar - this is an example with proprietary grammars, which require GUIDs
            //xml grammars do not require GUIDs, one could just use integer grammar ids
            hr = pRecoGrammar->LoadCmdFromProprietaryGrammar(grammarGUID, L"digit", NULL, 0, SPLO_STATIC);
            if(FAILED(hr))
            {
                printf("--- FAILED to Load Proprietary grammar \n");
                return -1;
            }

            //set engine state to active
            pRecoEngine->SetRecoState(SPRST_ACTIVE);
            //enable context
            pRecoContext->SetContextState(SPCS_ENABLED);
            //enable grammar
            pRecoGrammar->SetGrammarState(SPGS_ENABLED);
            //activate grammar rules
            pRecoGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
            //these four steps are necessary for the SR engine to be able to start listening for user speech

            bool bDone = false;
            CSpEvent event;

            while(!bDone)
            {
                //wait for 5 seconds for an event from the engine
                hr = pRecoContext->WaitForNotifyEvent(5000);
                if(hr == S_FALSE)
                {
                    printf("--- Operation timeout \n");
                }
                if(hr == S_OK)
                {
                    //get the event from the context
                    if(event.GetFrom(pRecoContext) == S_OK)
                    {
                        switch(event.eEventId)
                        {
                        case SPEI_SOUND_START:
                            {
                                printf("--- SPEI_SOUND_START \n");
                            }
                            break;
                        case SPEI_RECOGNITION:
                            {
                                //handle a successful recognition (print the result)
                                printf("--- SPEI_RECOGNITION \n");
                                WCHAR* pRecogStr = 0;
                                pRecoResult = event.RecoResult();
                                pRecoResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &pRecogStr, NULL);
                                printf("--- Result is: %ls \n", pRecogStr);
                                //the string is allocated by SAPI and has to be freed by the caller
                                ::CoTaskMemFree(pRecogStr);
                                bDone = true;
                            }
                            break;
                        case SPEI_FALSE_RECOGNITION:
                            {
                                printf("--- SPEI_FALSE_RECOGNITION \n");
                                bDone = true;
                            }
                            break;
                        default:
                            break;
                        }
                    }
                }
            }
        }

        //release all SAPI objects before shutting COM down
        pRecoResult.Release();
        pInputToken.Release();
        pRecoGrammar.Release();
        pRecoContext.Release();
        pRecoEngine.Release();
        ::CoUninitialize();
    }

    getchar();
    return 0;
}

This should get one started with speech recognition. One should check the SAPI docs for more information on each function in the API.
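
The listing repeats the same if(FAILED(hr)) block after almost every call, so one might fold it into a small helper. The sketch below is only one way to do it: checkHr is a hypothetical name, and the typedef and macros in the #else branch are stand-ins that mirror the usual Win32 definitions so that the snippet also compiles away from Windows.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins for the Win32 definitions (assumption: same meaning as on Windows)
typedef std::int32_t HRESULT;
#define S_OK ((HRESULT)0)
#define S_FALSE ((HRESULT)1)
#define E_FAIL ((HRESULT)0x80004005)
#define FAILED(hr) (((HRESULT)(hr)) < 0)
#endif

// Hypothetical helper: prints a message when hr is a failure code.
// Returns true when the call succeeded (S_OK, S_FALSE, ...), false otherwise.
inline bool checkHr(HRESULT hr, const char* what) {
    assert(what != nullptr);
    if (FAILED(hr)) {
        std::printf("--- FAILED to %s (hr = 0x%08lX) \n", what,
                    static_cast<unsigned long>(static_cast<std::uint32_t>(hr)));
        return false;
    }
    return true;
}
```

In the listing, a call site could then read something like: if(!checkHr(pRecoEngine->CreateRecoContext(&pRecoContext), "create Context")) return -1;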

Cheers!


Friday, October 3, 2008

Store the audio data from the recognition into file

Thursday, October 2, 2008

Simple ASR application
