As we explained in a previous post, Cognitive Services allows developers to quickly and easily create smarter applications. Speech enables the integration of speech processing capabilities into any app or service. You can convert spoken language into text, or produce natural-sounding speech from text using standard or customizable voice fonts. Speech services can also be used to identify a speaker or translate between languages.

The Speech service is divided into four areas: Speech to Text, Text to Speech, Speaker Recognition, and Speech Translation.

Why should you consider including Speech to Text in your applications?

Covid-19 is pushing a lot of metaphorical buttons right now. The world as it was, where the mobility of people was all but taken for granted (regardless of the environmental costs of so much commuting and indulged wanderlust), may never return to “business as usual.”

The coronavirus crisis offers an opportunity to rethink how we structure our societies and economies.

This article is about taking advantage of voice recognition, a key feature that is now in every major mobile operating system. Artificial intelligence has “cracked the code on voice,” but we should consider speech recognition as more than a service that helps us type faster or translate; we should start thinking of “hands-free” as a superior form of tech interaction.

Speech to text can be used to:

  • Hear and understand customers.
  • Custom acoustic models make it possible to understand customers even in noisy, hard-to-hear environments.
  • Capturing your customer interactions and analyzing the findings lets you pinpoint the issues people need the most help with, making the customer support process much more efficient.
  • Speech to text allows for a different interaction point for end-user applications.
  • Let your users dictate their input to your application through an intuitive and natural process.
  • Make your application accessible to the hearing impaired.

Speech to text provides advanced speech recognition technology, the same technology used in Office, Cortana, and other Microsoft products. It transcribes audio to text in real time. You can create an application that receives intermediate results for the words recognized so far, and the service automatically detects the end of the incoming speech. Users can also choose additional formatting options, including capitalization and punctuation, profanity masking, and inverse text normalization. More than 30 languages and dialects are available at the moment. Speech to text allows customizing both the language and the acoustic models, which lets you tailor your application to your users’ domain vocabulary, speaking environment, and way of speaking.
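To illustrate a few of these options, here is a minimal sketch using the Speech SDK (assuming the Microsoft.CognitiveServices.Speech NuGet package and placeholder key/region values). The Recognizing event surfaces the intermediate results mentioned above, while SetProfanity and OutputFormat control the formatting options:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

class RecognitionOptionsDemo
{
    static void Main()
    {
        // Placeholder values; use the key and region from your Azure subscription
        var config = SpeechConfig.FromSubscription("YourKey", "YourRegion");

        config.SpeechRecognitionLanguage = "en-US";  // one of the 30+ supported languages
        config.SetProfanity(ProfanityOption.Masked); // mask profanity in the transcript
        config.OutputFormat = OutputFormat.Detailed; // request detailed recognition results

        using (var recognizer = new SpeechRecognizer(config))
        {
            // Intermediate results: fires repeatedly while the user is still speaking
            recognizer.Recognizing += (s, e) =>
                Console.WriteLine($"Partial: {e.Result.Text}");

            // Final results: fires when the service detects the end of an utterance
            recognizer.Recognized += (s, e) =>
                Console.WriteLine($"Final: {e.Result.Text}");

            recognizer.StartContinuousRecognitionAsync().Wait();
            Console.WriteLine("Speak, then press a key to stop...");
            Console.ReadKey();
            recognizer.StopContinuousRecognitionAsync().Wait();
        }
    }
}
```

This sketch uses continuous recognition rather than the single-shot RecognizeOnceAsync shown later, since continuous recognition is what makes the intermediate results useful for dictation scenarios.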


What are the Speech to Text SDK capabilities?


Transcribe a short utterance in a console app.

1. Create a new C# console app in Visual Studio.

2. Right-click your project and select the option “Manage NuGet Packages”.


3. Install the Microsoft.CognitiveServices.Speech package.


4. Open Program.cs and replace the content with the following code.

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class Program
{
    public static async Task RecognizeSpeechAsync()
    {
        // Replace "key" and "region" with the values provided by your Azure subscription
        var config = SpeechConfig.FromSubscription("key", "region");

        using (var recognizer = new SpeechRecognizer(config))
        {
            Console.WriteLine("Say something...");
            var result = await recognizer.RecognizeOnceAsync();

            if (result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine($"Text Recognized: {result.Text}");
            else if (result.Reason == ResultReason.NoMatch)
                Console.WriteLine("No speech recognized");
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellationDetails = CancellationDetails.FromResult(result);
                Console.WriteLine($"Speech recognition canceled: {cancellationDetails.Reason}");

                if (cancellationDetails.Reason == CancellationReason.Error)
                {
                    Console.WriteLine($"Error: {cancellationDetails.ErrorCode}");
                    Console.WriteLine($"ErrorDetails: {cancellationDetails.ErrorDetails}");
                }
            }
        }
    }

    static void Main(string[] args)
    {
        RecognizeSpeechAsync().Wait();
        Console.WriteLine("Please press a key to continue");
        Console.ReadKey();
    }
}

5. The result.


Transcribe a WAV file using the REST API.

1. Install Postman and open it.

2. In the URL field, paste the following (make sure you update the region with the one provided by your Azure subscription):

3. Include the following parameters.


4. Include the following header.


5. Set the method to POST, and in the Body tab select “binary” and upload a .wav file.


6. The result.
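The same request can also be issued outside Postman with curl. This is a sketch assuming the standard short-audio recognition endpoint, an en-US recognition language, and a 16 kHz PCM WAV file named sample.wav; replace YourRegion and YourKey with the values from your own subscription:

```shell
curl -X POST \
  "https://YourRegion.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US" \
  -H "Ocp-Apim-Subscription-Key: YourKey" \
  -H "Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000" \
  --data-binary @sample.wav
```

A successful response is a JSON body whose DisplayText field contains the transcription.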


Transcribe a short utterance in a Mixed Reality App.

1. Open Unity and create a new project.

2. Set up your project to work with MRTK using this tutorial.

3. With the MRTK selected in your scene, select the DefaultHoloLens2ConfigurationProfile, then clone it.


Select the Spatial Awareness option and check Enable Spatial Awareness System. Clone the profile, then open the Windows Mixed Reality Spatial Mesh Observer and clone it as well.


Once this is done, at the bottom of the options, change the Display Option to Occlusion.


Select Input, clone the profile, then clone the Speech profile and set Start Behavior to “Manual Start”.


4. Import the following assets.

5. Select MRTK.Tutorials.AzureSpeechServices – prefabs – Lunarcom and drag it into your scene.


6. Assign the script LunarcomController to the Lunarcom object. Then update the fields with the following information:

API Key: replace with the key provided by your Azure subscription

Region: replace with the region provided by your Azure subscription

Terminal: drag and drop the “Terminal” object located inside Lunarcom

ConnectionLight: drag and drop the “Connection Light” object located inside Terminal

Buttons – Size: 1 (we are using only one button)

Buttons – Element 0: drag and drop the MicButton object located inside Buttons

Your project should look like this.


7. Locate the LunarcomSpeechRecognizer script and also assign it to the Lunarcom object.


8. Select Edit – Project Settings – Player – XR Settings, add “Windows Mixed Reality”, and set the Depth Format to 16-bit.


9. Select Audio, uncheck “Disable Unity Audio”, and select MS HRTF Spatializer under Spatializer Plugin.


10. The result.


Finally, we can use this same project and, following this tutorial, export it to different platforms.

Microsoft Azure Speech to Text: Plans and pricing

Using Microsoft Azure Speech to Text, you can transcribe up to five hours of audio for free and create one custom voice model per month. However, with the free plan, only a single concurrent audio request is available at a time, meaning this option isn’t viable for most businesses.

If you want to transcribe more than one speech clip at once, you’ll need to upgrade to the standard Azure pricing system. This costs $1 per hour of audio and supports up to 20 concurrent requests. Additional charges are involved if you need to use a custom audio model or transcribe multichannel sound files. These extra services cost $1.40 and $2.10 per audio hour, respectively.

Although Microsoft lists its prices in a “per audio hour” format, as is the industry standard, billing is actually split into one-second increments, so you won’t pay for more processing time than required. For example, a 90-second clip is billed as 90 seconds of audio (about $0.025 at the standard rate), not rounded up to a full hour.


Using AI and machine learning, Cognitive Services solve business problems from unstructured data that conventional algorithms cannot. Speech services can convert spoken audio into text or written text into natural-sounding audio, enable the use of voice for verification, or add speaker recognition to an app, all in a variety of languages and voices.

This is very helpful for companies. Beyond the modest cost of the Azure service, it helps a company stand out by producing great software applications that are also worth marketing.

The competition

Amazon Transcribe, Google Cloud Speech-to-Text, and Watson Speech to Text are direct competitors to Microsoft Azure. These three platforms are also all capable of performing high-volume batch transcriptions accurately. Google Cloud is the only close competitor capable of working with more languages than Azure, but it is more expensive, with a starter rate of $0.006 per 15 seconds ($0.024 per minute), compared to Azure’s $0.017 per minute ($0.00425 per 15 seconds).