Comparing ASR Solutions

Date: 2025-06-10

To build a cost-effective transcription project, I researched the common Automatic Speech Recognition (ASR) solutions on the market and compared them on price, quality, speed, and functionality.

Price

| Service | Cost per Minute (USD) | Notes |
|---|---|---|
| Self-host | $0.00 | Free; hardware cost not included |
| Cloudflare whisper-large-v3-turbo | $0.00051 | Free for ~4.5 hours/day (10,000 neurons) |
| lemonfox.ai | $0.00278 | |
| Azure | $0.003 | Free for 5 audio hours/month |
| gpt-4o-mini-transcribe | $0.003 | |
| gpt-4o-transcribe | $0.006 | |
| OpenAI Whisper-1 | $0.006 | |
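The per-minute prices above can be turned into a quick budget estimate. A minimal sketch (the service keys and usage figures here are illustrative, not from the original):

```python
# Per-minute prices (USD) taken from the table above.
PRICE_PER_MINUTE = {
    "selfhost": 0.0,
    "cloudflare-whisper-large-v3-turbo": 0.00051,
    "lemonfox": 0.00278,
    "azure": 0.003,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-transcribe": 0.006,
    "whisper-1": 0.006,
}

def monthly_cost(service: str, minutes_per_day: float, days: int = 30) -> float:
    """Cost in USD for transcribing `minutes_per_day` of audio every day."""
    return PRICE_PER_MINUTE[service] * minutes_per_day * days

# Example: 2 hours of audio per day for a month on Whisper-1.
print(round(monthly_cost("whisper-1", 120), 2))  # 0.006 * 120 * 30 = 21.6
```

Note that the free tiers (Cloudflare's daily neuron allowance, Azure's 5 free hours) are not modeled here; for light workloads they can bring the effective cost to zero.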

Quality

To compare the quality and speed of the solutions, I ran this sample audio (2:57) through Postman for all of them; the results are here.

I chose this sample audio because I had encountered trouble with Cloudflare's Whisper-V3, which usually performs the same as, or slightly worse than, other V3 models.

Reasons:

  1. Test the ability to recognize the languages (Yue and Zh).
  2. Test robustness: the speech is not very clear, so the model has to make corrections.

All transcriptions were done in Postman to ensure the same environment across all services.
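Each Postman call boils down to a multipart POST. As a sketch, this builds (without sending) the request for OpenAI's `/v1/audio/transcriptions` endpoint; the other services differ mainly in URL, auth header, and field names. The filename and API key are placeholders:

```python
import requests

def build_transcription_request(audio_bytes: bytes, filename: str,
                                model: str, api_key: str) -> requests.PreparedRequest:
    """Prepare (but do not send) an OpenAI transcription request."""
    return requests.Request(
        "POST",
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {api_key}"},
        # Audio goes in as a multipart file part; model name as a form field.
        files={"file": (filename, audio_bytes, "audio/mpeg")},
        data={"model": model},
    ).prepare()

prepared = build_transcription_request(b"...", "sample.mp3",
                                       "gpt-4o-mini-transcribe", "sk-...")
print(prepared.url)  # https://api.openai.com/v1/audio/transcriptions
```

Sending the prepared request with `requests.Session().send(prepared)` returns JSON whose `text` field is the transcript.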

| Rank | ASR Service | Quality | Comments |
|---|---|---|---|
| 0 | OpenAI - gpt-4o-mini-transcribe with prompt | Excellent | ✅ Minor errors (2)<br>❌ Output Cantonese |
| 2 | OpenAI - gpt-4o-mini-transcribe | Excellent | ✅ Output Cantonese<br>❌ No punctuation, not even spaces<br>❌ Minor errors (2) |
| 1 | OpenAI - gpt-4o-transcribe | Excellent | ✅ Has punctuation<br>❌ Doesn't output the original language<br>❌ Minor errors (4) |
| 3 | Azure Transcriptions | Very Good | ✅ Has punctuation<br>❌ Simplified Chinese only<br>❌ Minor errors (4) |
| 3 | LemonFox | Very Good | ❌ No punctuation<br>❌ Written Chinese<br>❌ Minor errors (4) |
| 4 | OpenAI - Whisper-1 | Good | ❌ No punctuation<br>❌ Minor errors (4) |
| 5 | Self-hosted, large, faster-whisper | Good | ❌ No punctuation<br>❌ Hallucination<br>❌ Similar to Whisper-1, minor errors |
| 6 | Self-hosted, base, faster-whisper | Fair | ❌ More errors from words with similar pronunciation |
| 7 | Cloudflare - whisper-large-v3-turbo | Poor | ❌ Simplified Chinese<br>❌ No punctuation<br>❌ Somewhat nonsensical |
| 8 | Cloudflare - whisper | Unusable | ❌ Nonsensical and garbled; infinite loop of "我們去飲品"<br>❌ 2 MB limit |
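The "minor error" counts above were tallied by hand; for Chinese transcripts they can be made objective with character error rate (CER), the standard ASR metric when word boundaries are ambiguous. A minimal sketch (the reference phrase below is a made-up example, not from the actual test transcripts):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance over characters,
    divided by the reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(r)

# Hypothetical example: one substituted character out of five.
print(cer("我們去飲茶", "我們去飲品"))  # 1 / 5 = 0.2
```

Running this over each service's output against a hand-corrected reference would turn the ranking into reproducible numbers.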
| Rank (by Speed) | ASR Service | Run Time |
|---|---|---|
| 1 | LemonFox | 6.08s |
| 2 | OpenAI - gpt-4o-transcribe | 7.76s |
| 3 | OpenAI - Whisper-1 | 7.77s |
| 4 | OpenAI - gpt-4o-mini-transcribe | 8.42s |
| 5 | Azure Transcriptions - Transcribe | 8.54s |
| 6 | Cloudflare - whisper-large-v3-turbo | 8.79s |
| 7 | Self-hosted, base, faster-whisper | 16.61s |
| 8 | Cloudflare - whisper | 17.8s |
| 9 | Self-hosted, large, faster-whisper | 5m 13.62s |

Speed

Since the run times on the previous sample were too close to compare, I ran a longer sample to test the run time of each ASR service. The sample used is: Will legal challenges end the trade war? (16:15)

The results are in Sample 2.

| Rank (by Speed) | ASR Service | Run Time |
|---|---|---|
| 1 | LemonFox | 17.27s |
| 2 | Azure Transcriptions - Transcribe | 22.49s |
| 3 | OpenAI - gpt-4o-mini-transcribe | 27.41s |
| 4 | OpenAI - Whisper-1 | 36.48s |
| 5 | OpenAI - gpt-4o-transcribe | 36.57s |
| 6 | Cloudflare - whisper-large-v3-turbo | 44.44s |
| 7 | Self-hosted, base, faster-whisper | 1m 37.62s |
| X | Cloudflare - whisper | Skipped |
| X | Self-hosted, large, faster-whisper | Skipped |
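Raw run times depend on the clip length, so a useful normalization is the real-time factor (RTF = processing time ÷ audio duration). A sketch using the 16:15 (975 s) sample and the run times from the table above:

```python
# Duration of the second sample: 16 minutes 15 seconds.
AUDIO_SECONDS = 16 * 60 + 15  # 975 s

RUN_TIMES = {  # seconds, from the Sample 2 table above
    "LemonFox": 17.27,
    "Azure Transcriptions": 22.49,
    "gpt-4o-mini-transcribe": 27.41,
    "Whisper-1": 36.48,
    "gpt-4o-transcribe": 36.57,
    "Cloudflare whisper-large-v3-turbo": 44.44,
    "Self-hosted, base, faster-whisper": 97.62,  # 1m 37.62s
}

for service, seconds in RUN_TIMES.items():
    # RTF < 1.0 means faster than real time; smaller is faster.
    print(f"{service}: RTF = {seconds / AUDIO_SECONDS:.3f}")
```

LemonFox's RTF of roughly 0.018 means it transcribes about 56× faster than real time, while even the slowest ranked service here stays an order of magnitude faster than playback.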

Ease of Use / Functionality

| Service | Model | vtt/srt | Word Timestamps | Size Limit | Other |
|---|---|---|---|---|---|
| OpenAI | gpt-4o-transcribe | | | 25 MB | 1. Realtime transcription<br>2. Auto chunking by VAD |
| OpenAI | gpt-4o-mini-transcribe | | | 25 MB | 1. Realtime transcription<br>2. Auto chunking |
| OpenAI | Whisper-1 | | | 25 MB | 1. Realtime transcription<br>2. Auto chunking |
| Azure | N/A | | | 2 hr / 250 MB | Speaker diarization |
| Cloudflare | whisper-large-v3-turbo | | | N/A | |
| Cloudflare | whisper | | | 2 MB chunk | |
| Self-host | customizable | | | N/A | |
| Lemonfox | Whisper large-v3 | | | N/A | Speaker diarization |
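The size limits above (OpenAI's 25 MB, Cloudflare whisper's 2 MB) mean longer recordings must be split before upload. A naive sketch of the size arithmetic only; a real pipeline should cut on silence/VAD boundaries (as OpenAI's auto chunking does) so no word is split mid-syllable:

```python
def chunk_bytes(data: bytes, limit: int = 2 * 1024 * 1024) -> list[bytes]:
    """Split raw bytes into consecutive chunks no larger than `limit`.
    Naive: ignores audio framing, so only suitable as a size illustration."""
    return [data[i:i + limit] for i in range(0, len(data), limit)]

audio = b"\x00" * (5 * 1024 * 1024)  # pretend 5 MiB recording
chunks = chunk_bytes(audio)
print(len(chunks), max(len(c) for c in chunks))  # 3 chunks, each <= 2 MiB
```

Each chunk would then be uploaded as a separate request and the transcripts concatenated in order.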

Supported formats

| Service | Model | Supported Formats |
|---|---|---|
| OpenAI | gpt-4o-transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| OpenAI | gpt-4o-mini-transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| OpenAI | Whisper-1 | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| Azure | N/A | Not documented |
| Cloudflare | whisper-large-v3-turbo | Not documented |
| Cloudflare | whisper | Not documented |
| Self-host | customizable | mp3 / steamtables; did not support m4a in testing |
| Lemonfox | Whisper large-v3 | mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm, and more |
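Since format support varies (the self-hosted setup rejected m4a in testing), it is worth checking the extension before uploading. A minimal sketch; the self-host format set here reflects only what the test above verified:

```python
# Format list documented for the OpenAI transcription endpoints.
OPENAI_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
# Only mp3 was confirmed working on the self-hosted setup; extend as verified.
SELFHOST_FORMATS = {"mp3"}

def is_supported(filename: str, formats: set[str]) -> bool:
    """Check a file's extension against a service's supported set."""
    return filename.rsplit(".", 1)[-1].lower() in formats

print(is_supported("interview.m4a", OPENAI_FORMATS))    # True
print(is_supported("interview.m4a", SELFHOST_FORMATS))  # False
```

Files that fail the check can be transcoded (e.g. with ffmpeg) to mp3 before upload.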

Quick Links