OpenAI Launches New Audio Models To Power Voice Agents

OpenAI has launched new speech-to-text and text-to-speech audio models. The new models aim to help developers build more intelligent voice agents, with more power and more customization options.

Customization options include instructing the model to speak from a specific perspective, such as a sympathetic customer service agent.

The new models outperform existing models on key measures, including reliability and accuracy. They adapt well to different accents, noisy environments, and varying speeds of speech.

The new audio models improve the reliability of transcription and could play a part in automating jobs in call centers.

The launch is part of OpenAI's wider push into agents, following recent releases including Operator, Deep Research, Computer-Using Agents, and the Responses API.

What is Gpt-4o-Transcribe?

Gpt-4o-transcribe is OpenAI’s new speech-to-text model. It improves on previous models in both language recognition and word error rate.

The improved Word Error Rate (WER) stems directly from innovations in reinforcement learning and extensive mid-training on varied audio datasets.
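For readers unfamiliar with the metric, WER is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is an illustration only: the function name word_error_rate is hypothetical, and the lowercase, whitespace-split tokenization is an assumption. Published benchmarks typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one deletion (jumps) over 5 words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
print(f"WER: {wer:.2f}")  # prints "WER: 0.40"
```

A lower WER means fewer transcription mistakes, so an improvement on this metric corresponds to a smaller number.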

Gpt-4o-transcribe better captures nuances of speech, including accents and challenging environment sounds.

What is Gpt-4o-mini-tts?

Gpt-4o-mini-tts is OpenAI’s new text-to-speech model. It aims to provide "steerable voice output," meaning developers have more control over the style and tone of the generated speech.

This allows for more personalized voice agents and applications.

It's designed for developers looking to integrate high-quality text-to-speech functionality into their applications.
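As a rough sketch of what steering looks like in practice, the helper below passes a plain-language style instruction alongside the text to be spoken. It assumes version 1 or later of the official openai Python library and its speech endpoint. The helper name synthesize_speech, the "coral" voice, and the example file name are illustrative choices, not part of the announcement.

```python
def synthesize_speech(client, text, instructions, out_path,
                      voice="coral", model="gpt-4o-mini-tts"):
    """Generate speech for `text` in the requested style and save it to a file.

    `client` is expected to be an openai.OpenAI instance (assumption:
    openai>=1.0). `instructions` is the free-text style guidance, for
    example "Speak like a sympathetic customer service agent."
    """
    # Stream the audio straight to disk instead of holding it in memory.
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
        instructions=instructions,
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```

With the library installed and an OPENAI_API_KEY environment variable set, a call such as synthesize_speech(OpenAI(), "Your refund is on its way.", "Speak like a sympathetic customer service agent.", "reply.mp3") should write the spoken reply to reply.mp3.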

How to Use Gpt-4o-Transcribe and Gpt-4o-mini-tts?

The two new OpenAI models, gpt-4o-transcribe and gpt-4o-mini-tts, are available to all developers through the API. The steps below cover transcription with gpt-4o-transcribe in Python.

Obtain an OpenAI API key. Create an account on the OpenAI website, then generate an API key from your account settings.
Install the OpenAI Python library (if using Python). Open your terminal or command prompt and run the command: pip install openai
Prepare your audio file. Ensure it is in a supported format (e.g., mp3, wav, mp4, etc.).
Write your Python script and import the client class with: from openai import OpenAI
Create a client with your API key using client = OpenAI(api_key="YOUR_API_KEY").
Open your audio file in binary read mode ("rb").
Call client.audio.transcriptions.create(), passing model="gpt-4o-transcribe" and the open file.
Run your code to retrieve the transcription.
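
The transcription steps above can be sketched as a small helper. This is a minimal example that assumes version 1 or later of the official openai Python library. The helper name transcribe_file and the file name meeting.mp3 are hypothetical.

```python
def transcribe_file(client, audio_path, model="gpt-4o-transcribe"):
    """Send an audio file to the transcription endpoint and return the text.

    `client` is expected to be an openai.OpenAI instance (assumption:
    openai>=1.0, installed with `pip install openai`).
    """
    # The API expects the raw bytes, so open the file in binary read mode.
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
        )
    # With the default response format, the text is on the `.text` attribute.
    return result.text
```

To use it, create the client with client = OpenAI(api_key="YOUR_API_KEY"), or call OpenAI() with no arguments to read the key from the OPENAI_API_KEY environment variable, then call print(transcribe_file(client, "meeting.mp3")). Reading the key from the environment keeps it out of source code.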