Friday, October 3, 2008

Storing the audio data from a recognition in a file

In a real-world speech application, one might need to store the audio data from a recognition, so that one can check afterwards whether what the user said matches what the application recognized.
It's pretty easy to do this using just SAPI objects (no other external goop).
Apparently the audio data is available only after a successful recognition.
In terms of code, one should put the following in the SPEI_RECOGNITION case from the previous post to get this functionality.

CComPtr<ISpVoice> pSpVoice;
//create a TTS object
hr = pSpVoice.CoCreateInstance(CLSID_SpVoice);
//check hr

CSpStreamFormat cAudioFmt;
hr = cAudioFmt.AssignFormat(SPSF_11kHz16BitMono);
//check hr
CComPtr<ISpStream> pSpStream;
//the file name is a wide string, and the backslash has to be escaped
hr = SPBindToFile(L"D:\\temp.wav", SPFM_CREATE_ALWAYS, &pSpStream, &cAudioFmt.FormatId(), cAudioFmt.WaveFormatExPtr());
//check hr

//set TTS output to wav file
hr = pSpVoice->SetOutput( pSpStream, TRUE );
//check hr

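SPPHRASE* pPhrase = 0;
//get the phrase structure (it describes the recognized elements)
hr = pRecoResult->GetPhrase(&pPhrase);
//check hr

//get audio for all elements of the phrase
//note: the context has to retain audio for this to work, e.g. by calling
//pRecoContext->SetAudioOptions(SPAO_RETAIN_AUDIO, NULL, NULL) during setup
CComPtr<ISpStreamFormat> pStreamFormat;
hr = pRecoResult->GetAudio(0, pPhrase->Rule.ulCountOfElements, &pStreamFormat);
//check hr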

//speak audio data to file
hr = pSpVoice->SpeakStream(pStreamFormat, SPF_DEFAULT, NULL);
//check hr

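//clean up
//GetPhrase allocates the SPPHRASE with CoTaskMemAlloc, so the caller frees it
::CoTaskMemFree(pPhrase);
pSpStream->Close();
pSpStream.Release();
pSpVoice.Release();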
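
To get a feel for how big the saved file gets: SPSF_11kHz16BitMono stands for 11,025 samples per second, 16 bits per sample, one channel. The sketch below is plain portable C++, not SAPI code. `PcmFormat` and the helper functions are names made up for this illustration and only mirror a few of the WAVEFORMATEX fields.

```cpp
#include <cstdint>

// Hypothetical stand-in for a few WAVEFORMATEX fields; not the Windows type.
struct PcmFormat {
    std::uint32_t samplesPerSec;  // sampling rate in Hz
    std::uint16_t bitsPerSample;  // width of one sample
    std::uint16_t channels;       // 1 = mono
};

// The layout that SPSF_11kHz16BitMono stands for.
constexpr PcmFormat k11kHz16BitMono{11025, 16, 1};

// Bytes per sample frame (one sample for every channel).
constexpr std::uint32_t blockAlign(PcmFormat f) {
    return f.channels * (f.bitsPerSample / 8u);
}

// Bytes of PCM data produced per second of audio.
constexpr std::uint32_t bytesPerSecond(PcmFormat f) {
    return f.samplesPerSec * blockAlign(f);
}

// Approximate PCM payload size for an utterance of the given length.
constexpr std::uint64_t payloadBytes(PcmFormat f, std::uint32_t milliseconds) {
    return static_cast<std::uint64_t>(bytesPerSecond(f)) * milliseconds / 1000u;
}
```

For a two-second utterance, `payloadBytes(k11kHz16BitMono, 2000)` comes to 44100 bytes of audio data, plus the wav header.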

Have fun!

Thursday, October 2, 2008

Simple ASR application

A basic speech recognition application takes the user's input from the microphone (the audio signal) and recognizes text defined by a grammar. This is a typical scenario for command and control applications.
For this, one needs a recognizer object, a context object (used to listen for events from the SR engine), an audio input object (either the default one or any of those available on the machine), and one or more grammars (compiled or XML grammars; the SAPI documentation has more information about these).
OK, if that is clear, here is the code that does the job.

#include "sphelper.h"
#include "windows.h"
#include <stdio.h> //printf, getchar

CComPtr<ISpRecognizer> pRecoEngine;
CComPtr<ISpRecoContext> pRecoContext;
CComPtr<ISpRecoGrammar> pRecoGrammar;
CComPtr<ISpObjectToken> pInputToken;
CComPtr<ISpRecoResult> pRecoResult;


int main(void)
{
    HRESULT hr = S_OK;

    hr = ::CoInitialize(NULL);
    if(SUCCEEDED(hr))
    {
        //create an inproc recognizer - for this type of recognizer you have to set the audio input object manually
        hr = pRecoEngine.CoCreateInstance(CLSID_SpInprocRecognizer);
        if(FAILED(hr))
        {
            printf("--- FAILED to create InProcRecognizer \n");
            return -1;
        }
        //create context
        hr = pRecoEngine->CreateRecoContext(&pRecoContext);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Context \n");
            return -1;
        }
        //create grammar
        hr = pRecoContext->CreateGrammar(0, &pRecoGrammar);
        if(FAILED(hr))
        {
            printf("--- FAILED to create Grammar \n");
            return -1;
        }
        //use a Win32 event to receive notifications from the engine
        hr = pRecoContext->SetNotifyWin32Event();
        if(FAILED(hr))
        {
            printf("--- FAILED to SetNotifyWin32Event() \n");
            return -1;
        }
        ULONGLONG events = SPFEI(SPEI_RECOGNITION)|
                           SPFEI(SPEI_FALSE_RECOGNITION)|
                           SPFEI(SPEI_PHRASE_START)|
                           SPFEI(SPEI_SOUND_START);
        //set events we want to receive from the engine
        hr = pRecoContext->SetInterest(events, events);
        if(FAILED(hr))
        {
            printf("--- FAILED to SetInterest() \n");
            return -1;
        }
        //get default audio input object
        hr = SpGetDefaultTokenFromCategoryId(SPCAT_AUDIOIN, &pInputToken);
        if(FAILED(hr))
        {
            printf("--- FAILED to Get default input token \n");
            return -1;
        }
        else
        {
            //if the input object was retrieved successfully, use it
            hr = pRecoEngine->SetInput(pInputToken, FALSE);
            if(FAILED(hr))
            {
                printf("--- FAILED to SET default input token \n");
            }
        }
        GUID grammarGUID;
        ::CoCreateGuid(&grammarGUID);

        //load a grammar - this is an example with proprietary grammars, which require GUIDs
        //xml grammars do not require GUIDs, one could just use integer grammar ids
        hr = pRecoGrammar->LoadCmdFromProprietaryGrammar(grammarGUID, L"digit", NULL, 0, SPLO_STATIC);
        if(FAILED(hr))
        {
            printf("--- FAILED to Load Proprietary grammar \n");
            return -1;
        }
        //set engine state to active
        pRecoEngine->SetRecoState(SPRST_ACTIVE);
        //enable context
        pRecoContext->SetContextState(SPCS_ENABLED);
        //enable grammar
        pRecoGrammar->SetGrammarState(SPGS_ENABLED);
        //activate grammar rules
        pRecoGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
        //these four steps are necessary for the SR engine to be able to start listening for user speech

        bool bDone = false;
        CSpEvent event;
        while(!bDone)
        {
            //wait for 5 seconds for an event from the engine
            hr = pRecoContext->WaitForNotifyEvent(5000);
            if(FAILED(hr))
            {
                //stop instead of looping forever if the wait itself fails
                printf("--- FAILED to wait for an event \n");
                bDone = true;
            }
            else if(hr == S_FALSE)
            {
                printf("--- Operation timeout \n");
            }
            else if(hr == S_OK)
            {
                //get the event from the context
                if(event.GetFrom(pRecoContext) == S_OK)
                {
                    switch(event.eEventId)
                    {
                    case SPEI_SOUND_START:
                        {
                            printf("--- SPEI_SOUND_START \n");
                        }
                        break;
                    case SPEI_RECOGNITION:
                        {
                            //handle a successful recognition (print the result)
                            printf("--- SPEI_RECOGNITION \n");

                            WCHAR* pRecogStr = 0;
                            pRecoResult = event.RecoResult();
                            hr = pRecoResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &pRecogStr, NULL);
                            if(SUCCEEDED(hr))
                            {
                                printf("--- Result is: %ls \n", pRecogStr);
                                //GetText allocates the string with CoTaskMemAlloc, so the caller frees it
                                ::CoTaskMemFree(pRecogStr);
                            }
                            bDone = true;
                        }
                        break;
                    case SPEI_FALSE_RECOGNITION:
                        {
                            printf("--- SPEI_FALSE_RECOGNITION \n");
                            bDone = true;
                        }
                        break;
                    default:
                        break;
                    }
                }
            }
        }
    }
    //release every COM pointer before COM is uninitialized
    pRecoResult.Release();
    pInputToken.Release();
    pRecoGrammar.Release();
    pRecoContext.Release();
    pRecoEngine.Release();

    ::CoUninitialize();

    getchar();

    return 0;
}
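
One detail in the wait loop is worth spelling out: S_FALSE, which WaitForNotifyEvent returns on timeout, is a success code, so SUCCEEDED(hr) alone cannot tell a timeout from a real event. The sketch below reproduces that convention in portable C++ so it can be tried anywhere. `HResult`, `succeeded`, `failed`, `WaitOutcome` and `classifyWait` are stand-ins invented for this example and are not part of SAPI or the Windows headers.

```cpp
#include <cstdint>

// Portable stand-ins for the Windows HRESULT definitions (assumed names).
using HResult = std::int32_t;
constexpr HResult kS_OK    = 0;            // same value as S_OK
constexpr HResult kS_FALSE = 1;            // same value as S_FALSE
constexpr HResult kE_FAIL  = -2147467259;  // same bit pattern as E_FAIL (0x80004005)

// Mirrors the SUCCEEDED / FAILED macros: only the sign of the code matters.
constexpr bool succeeded(HResult hr) { return hr >= 0; }
constexpr bool failed(HResult hr)    { return hr < 0; }

enum class WaitOutcome { EventReady, TimedOut, Error };

// Classify the result of a wait call the way the loop above needs it:
// a failure ends the loop, S_FALSE is a timeout, anything else means an event is ready.
constexpr WaitOutcome classifyWait(HResult hr) {
    return failed(hr) ? WaitOutcome::Error
         : (hr == kS_FALSE ? WaitOutcome::TimedOut : WaitOutcome::EventReady);
}
```

With this, `succeeded(kS_FALSE)` is true even though no event arrived, which is why the loop compares against S_FALSE and S_OK explicitly.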

This should be enough to get one started with speech recognition. One should check the SAPI docs for more information on each function in the API.

Cheers!