Learn How to Convert Audio to Text

Converting speech or other audio to text has many uses and can add advanced capabilities to your applications.

Imagine you are running a call center with thousands of simultaneous calls. You would like to identify trends, such as whether callers are having problems with a particular product or feature, or whether they sound frustrated or unhappy about something.

You might also be looking for particular words in the conversation that are being repeated and also need to know the frequency. Being able to analyze such information is vital to businesses. For example, if you identified that callers sound frustrated and the word “broken” is repeated all the time – you can take actions to improve the user experience. First, you can quickly teach the support team how to help with this particular problem, offer a solution or a workaround. Second, you can fix or improve the product.
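Once calls have been transcribed, counting how often particular words come up takes only a few lines. The following is a minimal Python sketch, not part of any Watson SDK; the `word_frequencies` helper and the sample transcripts are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(transcripts, watch_words):
    """Count how often each watched word appears across all transcripts."""
    counts = Counter()
    for text in transcripts:
        # Lowercase the text and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w in watch_words)
    return counts


# Hypothetical transcripts, invented for this example.
calls = [
    "my screen is broken and the app keeps crashing",
    "the update is broken it is really broken",
]
freq = word_frequencies(calls, {"broken", "crashing"})
print(freq.most_common())
```

In a real call center you would feed in the transcripts returned by the service and choose the watch list yourself.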

In general, almost any audio can be converted to text, and the text can then be analyzed for the trends and analytics that matter to you. One tool that you can use to analyze text is the Watson Tone Analyzer service.

Speech to Text API

How do you transcribe audio to text? One option is to use the Watson Speech to Text API. The API is easy to use: you point it to an audio file and get back JSON containing the text and some additional metadata. That’s the simplest and fastest way to use the API.

You can also use more advanced features such as uploading a custom model. A custom model helps to transcribe audio from a specific domain. For example, let’s say you need to transcribe audio from the medical field. The audio might use domain-specific words (such as disease names) that the out-of-the-box API might not fully understand. By uploading a custom model, you can teach the API to transcribe better and, most importantly, correctly.

Let’s look at a few examples and I will cover a custom model in another blog post.

Creating Speech to Text Service

In this section you will create a new Watson Speech to Text service.

  1. Register for a free IBM Cloud account or sign into your existing account
  2. Navigate to the Services Catalog
  3. From the left menu, click on AI
  4. Locate the Speech to Text API box and click on it
  5. On the next page you will see the service name (you can change it if you want). Click Create to create a Speech to Text service.

Once the service is created, you will see the following page. You can click the Show link to display the service credentials.

Speech to Text service

Now that you have created a service, it’s time to try it!

Running the Speech to Text Service

The fastest way to run this service is from a command line using the cURL program. You will do that next. Keep in mind that Watson offers 10 SDKs for various languages that you can use. You can see and try the SDKs on this GitHub page.

You first need an audio file. For testing you can download this file.

From a terminal window, navigate to the directory where you saved the file and run the following cURL command. You need to replace the username and password with the information from your service. You can see this information by clicking the Show link.

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @audio-file.flac \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

This example is not using any extra parameters.

You should see the following output:

{
   "results": [
      {
         "alternatives": [
            {
               "confidence": 0.889,
               "transcript": "several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday "
            }
         ],
         "final": true
      }
   ],
   "result_index": 0
}
  • The transcript field is the text that was transcribed
  • The confidence field is the service’s confidence in the transcript, in the range of 0 to 1. The closer the number is to 1, the more confident the service is that the transcription is correct
  • The alternatives field might show alternative transcriptions (none in this example)
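If you want to use these fields in a program rather than read them by eye, the response can be parsed with any JSON library. Here is a minimal Python sketch that works on the response shown above; the `best_alternative` helper is a name made up for this example, and it assumes that the first entry in alternatives is the one you want.

```python
import json

# Response body copied from the example above (transcript kept on one line).
response_text = """
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.889,
          "transcript": "several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0
}
"""


def best_alternative(result):
    """Return (transcript, confidence) for the first alternative of a result."""
    top = result["alternatives"][0]
    # Assumption: the first alternative is the preferred transcription.
    return top["transcript"].strip(), top.get("confidence")


data = json.loads(response_text)
transcript, confidence = best_alternative(data["results"][0])
print(f"{confidence:.3f}  {transcript}")
```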

If you want to try another example, you can download a longer audio file from here and then run the command again (note that I renamed the file):

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @Tim.oga \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

The output in this case would be:

{
    "results": [
       {
          "alternatives": [
             {
                "confidence": 0.845, 
                "transcript": "what is not a replacement for the web the web continues but when you think of the files on your computer the the documents into them the email messages and letters and things are things that you can put on the web now but then but their data files on their light kept calendars and downloaded spread sheets and things which you can't really put on the web because if you put a condom on the way we have to put it up on this document and "
             }
          ], 
          "final": true
       }, 
       {
          "alternatives": [
             {
                "confidence": 0.787, 
                "transcript": "with a computer you got all the things you need to do with data live with the kind of the need to get a look at in a debut month you need to compare the other calendars and see what you're doing at same time so the problem is that the moment that the data that's out there isn't in a form that we can actually post as it and use it so we not using it powerful enough and it's sort of in Dayton form for day to day life but also its but also for scientists and people use lots of data "
             }
          ],
          "final": true
       }
    ], 
    "result_index": 0
}

The output from this call has two transcriptions with slightly different confidence scores.
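When a response contains several results like this, you will usually want to stitch the transcripts together and note which parts the service was less sure about. The sketch below shows one way to do that in Python. The `summarize` helper and the 0.8 threshold are assumptions chosen for illustration, and the transcripts are shortened versions of the two shown above.

```python
def summarize(response, min_confidence=0.8):
    """Join all transcripts and list the results below a confidence threshold."""
    parts = []
    low_confidence = []
    for index, result in enumerate(response["results"]):
        top = result["alternatives"][0]
        parts.append(top["transcript"].strip())
        if top["confidence"] < min_confidence:
            low_confidence.append(index)
    return " ".join(parts), low_confidence


# Shortened stand-in for the two-result response shown above.
response = {
    "results": [
        {
            "alternatives": [
                {"confidence": 0.845,
                 "transcript": "what is not a replacement for the web "}
            ],
            "final": True,
        },
        {
            "alternatives": [
                {"confidence": 0.787,
                 "transcript": "with a computer you got all the things "}
            ],
            "final": True,
        },
    ],
    "result_index": 0,
}
full_text, flagged = summarize(response)
print(full_text)
print("results to review:", flagged)
```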

The cURL command is a developer’s best friend for running and testing APIs. But if you want to use a more visual interface, I recommend you download and install the Postman program.

This is how Postman looks running the same request above:

Postman client

Note that I don’t have the username/password in the URL. The username and password are entered on the Authorization tab (just below the service URL). As Speech to Text uses Basic Authentication, once you enter the username and password, you can switch to the Headers tab and see the Authorization header value there.
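If you are curious how that Authorization header value is produced, Basic Authentication simply joins the username and password with a colon and Base64-encodes the result. Here is a small Python sketch of that; the `basic_auth_header` helper name and the credentials are placeholders, not real service values.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic Authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials; use the values from your own service instead.
print(basic_auth_header("my-username", "my-password"))
```

Keep in mind that Base64 is an encoding, not encryption, so anyone who sees the header can recover the password.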

These examples are fun to try, but let’s look at a more real-world example.

Voice Recording and Transcription with Nexmo

Nexmo is a Communication as a Service platform that offers services such as Voice, Messaging and Authentication to make it easy to build applications with built-in communication.

Michael Heap, Nexmo Developer Advocate, published a very nice tutorial on how to record calls with the Nexmo Voice API and then transcribe the calls with the Speech to Text API. I encourage you to read the blog post and try the Voice API.

Here is a short excerpt. Use the link below to jump to the complete blog post:

As part of our Voice API offering, Nexmo allows you to record parts (or all) of a call and fetch the audio once the call has completed. Today, we’re happy to announce a new enhancement to this functionality: split recording. Split recording makes common tasks such as call transcription even easier.

When split recording is enabled, the downloaded recording will contain participant A (let’s call her Alice) in the left channel, and participant B (let’s call him Bob) in the right channel. This allows you to work with the audio from a single participant easily.

In this post, we’re going to walk through a simple use case. Alice calls the bank to find out information about her account, and Bob is the customer support agent who answers the call.
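To make the left/right channel idea concrete, here is a rough Python sketch (my own, not taken from the Nexmo tutorial) that splits a stereo recording into two mono recordings using only the standard library. It assumes the recording has been saved as a 16-bit stereo WAV file, which you should check against your actual download; `split_stereo` is a made-up helper name.

```python
import array
import io
import wave


def split_stereo(wav_bytes):
    """Split a 16-bit stereo WAV into two mono WAVs: (left, right)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))

    def to_mono_wav(channel_samples):
        # Write one channel's samples into an in-memory mono WAV file.
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(channel_samples.tobytes())
        return buffer.getvalue()

    # Stereo frames are interleaved: left, right, left, right, ...
    return to_mono_wav(samples[0::2]), to_mono_wav(samples[1::2])
```

You would pass in the bytes of the downloaded recording and get back Alice’s audio (left) and Bob’s audio (right), each of which could then be sent to Speech to Text on its own.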

Continue reading on the Nexmo blog.

 

 
