Comparing ASR Solutions
Date: 2025-06-10
To create a cost-effective transcription project, I researched the common Automatic Speech Recognition (ASR) solutions available on the market. They are:
- Whisper API from OpenAI
- gpt-4o-mini-transcribe and gpt-4o-transcribe from OpenAI
- Microsoft Azure STT services
- whisper and whisper-large-v3-turbo on Cloudflare AI
- Self-hosted whisper-asr-webservice, run on a Ryzen 5 3600 and on a 4-core Ampere CPU. (I should really get an Nvidia graphics card.)
- Whisper v3 by lemonfox.ai
Price
| Service | Cost per Minute (USD) | Notes |
|---|---|---|
| Selfhost | $0.00 | Free, hardware cost not included |
| Cloudflare whisper-large-v3-turbo | $0.00051 | Free for ~4.5 hours/day (10,000 neurons) |
| lemonfox.ai | $0.00278 | |
| Azure | $0.003 | Free for 5 audio hours/month |
| gpt-4o-mini-transcribe | $0.003 | |
| gpt-4o-transcribe | $0.006 | |
| OpenAI Whisper-1 | $0.006 | |
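
To see how the per-minute prices and free tiers play out at a given volume, here is a minimal sketch. The prices and allowances are copied from the table above; the 30-day month, the idea that Cloudflare's daily allowance is fully usable every day, and the function name `monthly_cost` are my own assumptions for illustration.

```python
# Per-minute prices (USD), copied from the price table.
PRICE_PER_MINUTE = {
    "Cloudflare whisper-large-v3-turbo": 0.00051,
    "lemonfox.ai": 0.00278,
    "Azure": 0.003,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-transcribe": 0.006,
    "OpenAI Whisper-1": 0.006,
}

# Free allowances from the Notes column, converted to minutes per month.
# Assumption: 30-day month, Cloudflare's ~4.5 h/day is usable every day.
FREE_MINUTES_PER_MONTH = {
    "Cloudflare whisper-large-v3-turbo": 4.5 * 60 * 30,
    "Azure": 5 * 60,
}


def monthly_cost(service: str, audio_minutes: float) -> float:
    """Estimated monthly bill in USD after subtracting the free allowance."""
    free = FREE_MINUTES_PER_MONTH.get(service, 0.0)
    billable = max(0.0, audio_minutes - free)
    return round(billable * PRICE_PER_MINUTE[service], 2)


if __name__ == "__main__":
    minutes = 100 * 60  # hypothetical workload: 100 audio hours per month
    for service in sorted(PRICE_PER_MINUTE, key=lambda s: monthly_cost(s, minutes)):
        print(f"{service:36s} ${monthly_cost(service, minutes):.2f}")
```

Self-hosting is left out because its cost is hardware and electricity rather than a per-minute rate.
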
Quality
To compare the quality and speed of the solutions, I ran this sample audio (2:57) through every one of them in Postman, and the results are here.
I chose this sample audio because I encountered trouble with Cloudflare's Whisper-V3 on it. That model usually performs the same as or a bit worse than other V3 models.
Reasons for choosing this sample:
- It tests the ability to recognize the language (Yue and Zh).
- It tests robustness: the speech is not very clear, so a good model has to correct for it.
All transcriptions were run in Postman to keep the environment the same across all services.
| Rank | ASR Service | Quality | Comments |
|---|---|---|---|
| 0 | OpenAI - gpt-4o-mini-transcribe with prompt | Excellent | ✅ Minor errors (2) ❌ Output Cantonese |
| 1 | OpenAI - gpt-4o-transcribe | Excellent | ✅ Has punctuation ❌ Doesn't output the original language ❌ Minor errors (4) |
| 2 | OpenAI - gpt-4o-mini-transcribe | Excellent | ✅ Output Cantonese ❌ No punctuation, not even spaces ❌ Minor errors (2) |
| 3 | Azure Transcriptions | Very Good | ✅ Has punctuation ❌ Simplified Chinese only ❌ Minor errors (4) |
| 3 | LemonFox | Very Good | ❌ No punctuation ❌ Written Chinese ❌ Minor errors (4) |
| 4 | OpenAI - Whisper-1 | Good | ❌ No punctuation ❌ Minor errors (4) |
| 5 | Self-hosted, large, faster-whisper | Good | ❌ No punctuation ❌ Hallucination ❌ Similar to Whisper-1, minor errors |
| 6 | Self-hosted, base, faster-whisper | Fair | ❌ More errors from words with similar pronunciation |
| 7 | Cloudflare - whisper-large-v3-turbo | Poor | ❌ Simplified Chinese ❌ No punctuation ❌ Somewhat nonsensical |
| 8 | Cloudflare - whisper | Unusable | ❌ Nonsensical and garbled, infinite loop of "我們去飲品" ❌ 2MB limit |
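
The error counts in the table are judged by eye. One way to make such counts reproducible, assuming a hand-corrected reference transcript is available, is a character error rate (CER) based on edit distance, which suits Chinese text because it has no word boundaries. This is only a sketch of that idea and not how the table was produced; the function names are mine.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, ignoring whitespace."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up strings for illustration, not taken from the sample audio.
    print(character_error_rate("我們去飲茶", "我們去飲品"))
```

Punctuation counts as characters here, so strip it from both strings first if punctuation should be scored separately, as it is in the Comments column.
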

Run times for the same sample:

| Rank (by Speed) | ASR Service | Run Time |
|---|---|---|
| 1 | LemonFox | 6.08s |
| 2 | OpenAI - gpt-4o-transcribe | 7.76s |
| 3 | OpenAI - Whisper-1 | 7.77s |
| 4 | OpenAI - gpt-4o-mini-transcribe | 8.42s |
| 5 | Azure Transcriptions - Transcribe | 8.54s |
| 6 | Cloudflare - whisper-large-v3-turbo | 8.79s |
| 7 | Self-hosted, base, faster-whisper | 16.61s |
| 8 | Cloudflare - whisper | 17.8s |
| 9 | Self-hosted, large, faster-whisper | 5m 13.62s |
Speed
Since the run times on the previous sample are too close to separate the services, I ran a longer sample to test the run time of each ASR service. The sample I used is: Will legal challenges end the trade war? (16:15)
The results are in Sample 2.
| Rank (by Speed) | ASR Service | Run Time |
|---|---|---|
| 1 | LemonFox | 17.27s |
| 2 | Azure Transcriptions - Transcribe | 22.49s |
| 3 | OpenAI - gpt-4o-mini-transcribe | 27.41s |
| 4 | OpenAI - Whisper-1 | 36.48s |
| 5 | OpenAI - gpt-4o-transcribe | 36.57s |
| 6 | Cloudflare - whisper-large-v3-turbo | 44.44s |
| 7 | Self-hosted, base, faster-whisper | 1m 37.62s |
| X | Cloudflare - whisper | Skipped |
| X | Self-hosted, large, faster-whisper | Skipped |
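
The run times above were read off Postman. If you would rather script the measurement, a minimal harness could look like the sketch below. It assumes each service is wrapped in a function that takes an audio path and returns text; those wrappers, and the names `time_service` and `rank_by_speed`, are hypothetical and not part of any provider's SDK.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_service(transcribe: Callable[[str], str], audio_path: str, runs: int = 1) -> float:
    """Return the best wall-clock time in seconds over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)
        best = min(best, time.perf_counter() - start)
    return best


def rank_by_speed(services: Dict[str, Callable[[str], str]],
                  audio_path: str) -> List[Tuple[str, float]]:
    """Time every service once and sort from fastest to slowest."""
    timings = [(name, time_service(fn, audio_path)) for name, fn in services.items()]
    return sorted(timings, key=lambda item: item[1])


if __name__ == "__main__":
    # Stand-ins for real API clients; replace with actual request code.
    def fake_fast(path: str) -> str:
        time.sleep(0.01)
        return "transcript"

    def fake_slow(path: str) -> str:
        time.sleep(0.05)
        return "transcript"

    ranking = rank_by_speed({"slow": fake_slow, "fast": fake_fast}, "sample.mp3")
    for rank, (name, seconds) in enumerate(ranking, start=1):
        print(f"| {rank} | {name} | {seconds:.2f}s |")
```

For hosted APIs the measured time includes upload and network latency, so several runs per service give a fairer comparison than one.
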
Ease of Use / Functionality
| Service | Model | VTT/SRT output | Word timestamps | Size limit | Other |
|---|---|---|---|---|---|
| OpenAI | gpt-4o-transcribe | ✅ | ✅ | 25MB | 1. Realtime transcription 2. Auto Chunking by VAD |
| OpenAI | gpt-4o-mini-transcribe | ✅ | ✅ | 25MB | 1. Realtime transcription 2. Auto Chunking |
| OpenAI | Whisper-1 | ✅ | ✅ | 25MB | 1. Realtime transcription 2. Auto Chunking |
| Azure | N/A | ❌ | ✅ | 2hr / 250MB | speaker diarization |
| Cloudflare | whisper-large-v3-turbo | ✅ | ✅ | N/A | |
| Cloudflare | whisper | ✅ | ✅ | 2MB Chunk | |
| Selfhost | customizable | ✅ | ✅ | N/A | |
| Lemonfox | Whisper large-v3 | ✅ | ✅ | N/A | speaker diarization |
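
For a service without VTT/SRT output, such as Azure in the table above, timestamps can be turned into subtitles by hand. The sketch below assumes the timestamps have already been grouped into `(start_seconds, end_seconds, text)` tuples. That tuple shape and the function names are my own assumptions, since the actual response format differs per provider.

```python
from typing import Iterable, Tuple


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS.mmm timestamp used by WebVTT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build a WebVTT document from (start, end, text) segments."""
    lines = ["WEBVTT", ""]  # header line followed by a blank line
    for start, end, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up segments for illustration.
    print(segments_to_vtt([(0.0, 2.5, "Hello"), (2.5, 5.0, "world")]))
```

SRT differs mainly in numbering each cue and using a comma instead of a dot before the milliseconds.
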
Supported formats
| Service | Model | Supported Formats |
|---|---|---|
| OpenAI | gpt-4o-transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| OpenAI | gpt-4o-mini-transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| OpenAI | Whisper-1 | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| Azure | N/A | not mentioned |
| Cloudflare | whisper-large-v3-turbo | not mentioned |
| Cloudflare | whisper | not mentioned |
| Selfhost | customizable | mp3 / steamtables. Doesn't support m4a in test. |
| Lemonfox | Whisper large-v3 | mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm, and more. |
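
Format and size limits are easy to trip over, so a quick local check before uploading can save a failed request. The sketch below uses the OpenAI values from the tables above (format list and 25MB limit). Treating 25MB as 25 × 1024 × 1024 bytes and the helper name `upload_problems` are assumptions on my part.

```python
from pathlib import Path
from typing import List, Set

# Formats listed for the OpenAI models in the table above.
OPENAI_FORMATS: Set[str] = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
# Assumption: the 25MB limit is interpreted as 25 MiB.
OPENAI_SIZE_LIMIT = 25 * 1024 * 1024


def upload_problems(filename: str, size_bytes: int,
                    formats: Set[str] = OPENAI_FORMATS,
                    size_limit: int = OPENAI_SIZE_LIMIT) -> List[str]:
    """Return human-readable reasons the file would be rejected (empty if none)."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in formats:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > size_limit:
        problems.append(f"file is {size_bytes} bytes, limit is {size_limit}")
    return problems


if __name__ == "__main__":
    print(upload_problems("interview.m4a", 10_000_000))
    print(upload_problems("interview.aac", 40_000_000))
```

Passing a different `formats` set and `size_limit` adapts the same check to another service.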