OpenAI's Realtime API (powering ChatGPT Advanced Voice Mode) vs Moshi API
Introduction
In this blog, we are going to explore and compare the latest advancements in Conversational AI APIs.
But first, let's consider the importance of Conversational AI. Typing with our thumbs on a 6" screen is neither as intuitive nor as efficient as communicating vocally - something we have done effortlessly since early childhood. It therefore seems logical that, when it comes to AI applications, we would naturally pick speaking over typing as the preferred mode of interaction.
In fields such as customer support, personalized education, language training, and healthcare therapy, Conversational AI is the key to building intuitive, hands-free, emotionally rich products that the public would likely adopt.
Thus, we have decided to write this comparison blog to examine the two most promising audio AI APIs that could become the backbone of such applications: OpenAI's Realtime API and the Moshi API.
Prior to Voice-Native AI Models...
If we turn the clock back to the days before the Realtime API and the Moshi API, a typical Conversational AI application would follow a workflow we call "transcribe-reason-text2speech" (a minimal code sketch follows the list below):
1. the user provides an audio input;
2. the application transcribes the input audio into text using an ASR (Automatic Speech Recognition) model (for example, OpenAI's Whisper);
3. the application passes the transcribed text to a text-to-text model (for example, GPT-4o) for reasoning;
4. the application converts the textual output into speech using a TTS (Text-to-Speech) model (for example, OpenAI's audio models) and plays it back to the user.
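As a concrete illustration, here is a minimal sketch of this stitched-together pipeline using OpenAI's Python SDK; the model choices and the `voice_turn` helper are illustrative, not a production implementation.

```python
# A minimal sketch of the "transcribe-reason-text2speech" pipeline,
# using the OpenAI Python SDK; model names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(input_audio_path: str, output_audio_path: str) -> str:
    # 1. Transcribe the user's audio with an ASR model (Whisper).
    with open(input_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Reason over the transcribed text with a text-to-text model.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Synthesize the reply with a TTS model and save it to disk.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.write_to_file(output_audio_path)
    return reply_text
```

Each of the three network round trips adds its own latency, which is exactly the problem discussed next.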
As we can see, chaining three models in sequence results in long latency, which is a significant barrier to a smooth user experience in a conversational AI application, rendering the workflow unusable for a majority of use cases.
Next, let's examine the newest conversational AI APIs and the improvements they bring.
What is Realtime API?
First, let's look at the Realtime API from OpenAI. The API was released on October 1st, 2024 to help developers build smooth speech-to-speech applications. Built on the GPT-4o model and shipping with six preset voices, this low-latency, multimodal API lets developers build real-time, voice-activated applications.
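To give a concrete feel for how it is used, below is a minimal sketch of opening a Realtime session over WebSocket with the `websockets` Python package and requesting a single spoken response; the model name and event shapes follow OpenAI's initial beta documentation and may have changed since.

```python
# A minimal sketch of opening a Realtime API session over WebSocket.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main() -> None:
    # Note: newer versions of `websockets` name this parameter `additional_headers`.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the model to respond with both audio and text.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in one short sentence.",
            },
        }))
        # Print server event types until the response is marked as done.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```

A real application would additionally stream microphone audio into the session and play the audio deltas back as they arrive.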
What is ChatGPT's Advanced Voice Mode?
Since many users are confused about the difference between ChatGPT's Advanced Voice Mode and the Realtime API, we thought it would be good to clarify.
The former is a feature within ChatGPT (OpenAI's chatbot product, one of the most popular in the world) built on GPT-4o's multimodal ability. The feature is rolling out to Plus and Team users first, while a standard version is available to all ChatGPT users through the iOS and Android apps. The standard version adopts the traditional transcribe-reason-text2speech approach described above.
The Realtime API, on the other hand, is a multimodal, voice-native API that helps developers build their own applications; it is not a feature within a chatbot product.
Audio Input and Output in the Chat Completions API
As part of the Realtime API's release, OpenAI also mentioned that they would soon be adding audio input and output to the Chat Completions API for use cases where low latency is not a concern. The API's input could be text or audio, and its output could be text, audio, or both.
What is exciting for developers is that, previously, one would have to stitch together multiple models to achieve the "transcribe-reason-text2speech" workflow powering the conversational experience. Now, a single call to the Chat Completions API would take care of the rest.
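As a hypothetical illustration (the feature had not shipped at the time of writing), a single audio-in/audio-out call might look roughly like the sketch below; the model name and the `modalities`/`audio` parameters are assumptions based on OpenAI's announcement, not a confirmed API.

```python
# A hypothetical sketch of one Chat Completions call handling audio input
# and output; model name and audio parameters are assumptions.
import base64

from openai import OpenAI

client = OpenAI()

with open("user_question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",              # assumed model name
    modalities=["text", "audio"],              # request both text and audio back
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Please answer the question in this clip."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

# The reply would arrive as base64-encoded audio alongside a transcript.
with open("assistant_reply.wav", "wb") as out:
    out.write(base64.b64decode(completion.choices[0].message.audio.data))
```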
For the purpose of this blog, we are going to focus on comparing only the Realtime API from OpenAI with the Moshi API, deliberately leaving out Advanced Voice Mode and the audio input/output feature in the Chat Completions API, in order to have a more focused and aligned analysis.
What is Moshi API?
After providing the basic context on the Realtime API (and other products) from OpenAI, let's move on to the Moshi API. The Moshi voice-native model was first released by the French Kyutai team in July 2024. With its state-of-the-art Mimi codec processing two streams of audio (the user's input and Moshi's output) simultaneously, the model significantly improves the quality of its next-token prediction. As of October 2024, PiAPI is planning to release the Moshi API to developers soon to help them build real-time dialogue applications.
Moshi API vs. Realtime API Comparison
Now that we've provided the necessary context on voice-native models, let's compare the two APIs across the aspects that affect the development process.
Speed & Latency
Based on public forums, the Realtime API offers extremely fast token generation. Although OpenAI has not published official figures on its speed, developers report that token generation is so fast that interruption handling can feel redundant, since all the text tokens are generated before the corresponding audio finishes playing.
Moshi, on the other hand, is known for using its Mimi codec to achieve synchronized text/audio output. The GitHub repository does not explicitly state Moshi's token generation rate, but it does mention that audio is processed at "a frame size of 80ms", that "a theoretical latency of 160ms" is achieved, and that there is "a practical overall latency as low as 200ms on an L4 GPU". For real-time applications, 200ms of latency is generally considered quite good, as it sits below the threshold most humans would notice as a significant delay in conversation.
Network latency is also worth noting: given that Moshi is an open-source model, inference providers such as PiAPI can host servers geographically close to the client's servers and provide custom CDN infrastructure to further reduce latency introduced by the network.
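If you want to verify these latency figures for your own deployment, a simple approach is to time how long it takes from sending a chunk of audio to receiving the first audio chunk back. The sketch below is provider-agnostic; `stream_turn` is a hypothetical placeholder for whichever streaming client (Realtime API or Moshi) you are benchmarking.

```python
# A rough, provider-agnostic sketch for measuring time-to-first-audio.
import time
from typing import Callable, Iterable

def time_to_first_audio(stream_turn: Callable[[bytes], Iterable[bytes]],
                        input_audio: bytes) -> float:
    """Return seconds between sending audio and receiving the first audio chunk."""
    start = time.perf_counter()
    for _chunk in stream_turn(input_audio):
        return time.perf_counter() - start  # stop at the first returned chunk
    raise RuntimeError("endpoint returned no audio")
```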
Reasoning Coherency & Accuracy
In terms of model accuracy, there are several aspects we need to look at.
Firstly, the Realtime API uses GPT-4o as its underlying inference model, and GPT-4o is a much bigger and more complex model than the open-source Moshi model. Thus, it is reasonable to expect the Realtime API to have superior overall reasoning compared to the Moshi API.
However, given the open-source nature of Moshi, a development team can fine-tune the base model for specific use cases. Moreover, if we look at the likely popular use cases for conversational AI (e.g., customer support, education, healthcare), they are all areas where domain-specific or organization-specific information is crucial to achieving satisfactory output quality.
Flexibility & Customizations
As part of comparing flexibility and customization, we have to again bring up fine-tuning as a major advantage for Moshi, allowing it to be tailored to specific use cases. Furthermore, communities of fine-tuned or trained models might form and give birth to a Civitai equivalent for the Moshi ecosystem, allowing models with different emphases to be shared across users and rapidly accelerating innovation and development progress.
On this note, OpenAI could also provide fine-tuning-like functionality for the Realtime API and develop infrastructure for users to share their use-case-specific models, much like their GPTs endeavour. However, whether they choose to spend their development resources on that, and whether it would succeed, remains to be seen.
In terms of function calling and RAG (Retrieval-Augmented Generation), the Realtime API already supports function calling (see the sketch below), and the Moshi ecosystem should support it in the future as well, given the open-source nature of the model. Both models are expected to have tooling for building out the relevant RAG workflows.
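As a rough illustration of how function calling works with the Realtime API, the sketch below sends a `session.update` event over an already open WebSocket (such as `ws` from the earlier connection sketch) to register a tool; the `get_weather` tool is a made-up example, and the event shape follows OpenAI's beta documentation, which may change.

```python
# Registering a tool with a Realtime API session via a session.update event.
import json

session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_weather",                     # hypothetical tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        "tool_choice": "auto",
    },
}

async def register_tools(ws) -> None:
    # After this, the model can emit function-call events that the application
    # executes before sending the results back in a follow-up event.
    await ws.send(json.dumps(session_update))
```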
Rate Limits & Context Lengths
In terms of rate limits, the Realtime API is currently limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4. However, OpenAI has mentioned that they will increase these limits over time to support larger deployments, and that the Realtime API will also support GPT-4o mini in upcoming versions of that model.
For Moshi, rate limits would likely not be a problem, since a client's total daily usage and peak throughput can be roughly estimated and extra inference instances can be added to accommodate each specific load profile.
For context length, the Realtime API can reportedly hold up to 8k tokens. Moshi's context length, on the other hand, will likely vary depending on its configuration and fine-tuning results.
Pricing
Given the significantly larger size and complexity of the GPT-4o model, the Realtime API naturally uses more compute and memory for inference than Moshi, and this is quite evident in their respective API pricing. The Realtime API's pricing is as follows:
- Text Input tokens: $5 per 1M tokens
- Text Output tokens: $20 per 1M tokens
- Audio Input tokens: $100 per 1M tokens ($0.06 / min of audio)
- Audio Output tokens: $200 per 1M tokens ($0.24 / min of audio)
For the Moshi API, PiAPI estimates that it can achieve the following pricing:
- Audio Input & Output tokens: $0.02 / min of audio
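To put these numbers in perspective, here is a rough back-of-the-envelope estimate for a 10-minute conversation, assuming 5 minutes of user audio and 5 minutes of assistant audio, ignoring text tokens, and assuming PiAPI's estimated $0.02/min covers both audio streams combined.

```python
# Back-of-the-envelope cost comparison using the per-minute figures above.
REALTIME_IN, REALTIME_OUT = 0.06, 0.24   # $ per minute of audio in / out
MOSHI_RATE = 0.02                        # $ per minute, assumed to cover both streams

user_minutes, assistant_minutes = 5, 5

realtime_cost = user_minutes * REALTIME_IN + assistant_minutes * REALTIME_OUT
moshi_cost = (user_minutes + assistant_minutes) * MOSHI_RATE

print(f"Realtime API: ~${realtime_cost:.2f}")        # ~$1.50
print(f"Moshi (PiAPI estimate): ~${moshi_cost:.2f}")  # ~$0.20
```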
Conclusion
We hope the comparison above was informative in helping you decide which conversational AI API might suit your needs better. We here at PiAPI are indifferent to developers' choice, since we will most likely support both. However, given the open-source nature of Moshi and the impressive results we've seen so far, we are very excited to see how far the open-source ecosystem will go and where it will take all of us!
Also, if you are interested in the other AI APIs that PiAPI provides, please check them out!