azure-speech-to-text-rest-py
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
- risk: unknown
- source: community
- date added: 2026-02-27
Azure Speech to Text REST API for Short Audio
Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.
Prerequisites
- Azure subscription - Create one free
- Speech resource - Create in Azure Portal
- Get credentials - After deployment, go to resource > Keys and Endpoint
Environment Variables
    # Required
    AZURE_SPEECH_KEY=<your-speech-resource-key>
    AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

    # Alternative: Use endpoint directly
    AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
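The two configuration styles above can be reconciled in one small helper. This is a minimal sketch, assuming `AZURE_SPEECH_ENDPOINT` holds only the scheme and host as shown above; the name `resolve_recognition_url` and its `env` parameter are hypothetical and introduced only for illustration.

```python
import os
from typing import Mapping, Optional

# Path used by every request example in this document.
RECOGNITION_PATH = "/speech/recognition/conversation/cognitiveservices/v1"


def resolve_recognition_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the recognition URL from AZURE_SPEECH_ENDPOINT or AZURE_SPEECH_REGION."""
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_SPEECH_ENDPOINT")
    if endpoint:
        # Assumption: the endpoint variable contains only scheme and host.
        return endpoint.rstrip("/") + RECOGNITION_PATH
    region = env.get("AZURE_SPEECH_REGION")
    if not region:
        raise RuntimeError("Set AZURE_SPEECH_ENDPOINT or AZURE_SPEECH_REGION")
    return f"https://{region}.stt.speech.microsoft.com{RECOGNITION_PATH}"
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.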
Installation
    pip install requests
    pip install aiohttp  # optional, only needed for the Async Version example
Quick Start
    import os
    import requests


    def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
        """Transcribe short audio file (max 60 seconds) using REST API."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]
        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
        }
        params = {
            "language": language,
            "format": "detailed",  # or "simple"
        }
        with open(audio_file_path, "rb") as audio_file:
            response = requests.post(url, headers=headers, params=params, data=audio_file)
        response.raise_for_status()
        return response.json()


    # Usage
    result = transcribe_audio("audio.wav", "en-US")
    # With format=detailed, the recognized text is in the NBest list (see Detailed Format)
    print(result["NBest"][0]["Display"])
Audio Requirements
| Format | Codec | Sample Rate | Notes |
|---|---|---|---|
| WAV | PCM | 16 kHz, mono | Recommended |
| OGG | OPUS | 16 kHz, mono | Smaller file size |
Limitations:
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)
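The WAV requirements and the 60-second limit above can be checked locally before uploading. The following is a sketch using the standard-library `wave` module; it only covers PCM WAV input (not OGG/OPUS), and the name `check_wav` is hypothetical.

```python
import wave
from typing import BinaryIO, List, Union


def check_wav(source: Union[str, BinaryIO], max_seconds: float = 60.0) -> List[str]:
    """Return a list of problems found in a PCM WAV file (empty list means OK)."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
        # Duration in seconds is the frame count divided by the frame rate.
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(f"audio is {duration:.1f} s, limit is {max_seconds:.0f} s")
    return problems
```

For pronunciation assessment, the same check could be called with `max_seconds=30.0`.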
Content-Type Headers
    # WAV PCM 16kHz
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

    # OGG OPUS
    "Content-Type": "audio/ogg; codecs=opus"
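When both formats are in use, the header value can be picked from the file name. This is only a sketch: it assumes the file extension reflects the actual encoding, and `content_type_for` is a hypothetical helper name.

```python
from pathlib import Path

# Content-Type values from the section above, keyed by file extension.
CONTENT_TYPES = {
    ".wav": "audio/wav; codecs=audio/pcm; samplerate=16000",
    ".ogg": "audio/ogg; codecs=opus",
}


def content_type_for(audio_file_path: str) -> str:
    """Pick the Content-Type header from the file extension (case-insensitive)."""
    suffix = Path(audio_file_path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio extension: {suffix!r}") from None
```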
Response Formats
Simple Format (default)
    params = {"language": "en-US", "format": "simple"}

    {
      "RecognitionStatus": "Success",
      "DisplayText": "Remind me to buy 5 pencils.",
      "Offset": "1236645672289",
      "Duration": "1236645672289"
    }
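`Offset` and `Duration` are expressed in 100-nanosecond units. A small conversion helper makes them easier to read; `ticks_to_seconds` is a hypothetical name, and the sketch accepts either an integer or a numeric string, since the sample above shows the values as strings.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def ticks_to_seconds(ticks) -> float:
    """Convert an Offset or Duration value (int or numeric string) to seconds."""
    return int(ticks) / TICKS_PER_SECOND
```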
Detailed Format
    params = {"language": "en-US", "format": "detailed"}

    {
      "RecognitionStatus": "Success",
      "Offset": "1236645672289",
      "Duration": "1236645672289",
      "NBest": [
        {
          "Confidence": 0.9052885,
          "Display": "What's the weather like?",
          "ITN": "what's the weather like",
          "Lexical": "what's the weather like",
          "MaskedITN": "what's the weather like"
        }
      ]
    }
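To read the recognized text out of a detailed response, one option is to pick the `NBest` entry with the highest `Confidence`. The sketch below assumes the response shape shown above; `best_hypothesis` is a hypothetical helper name.

```python
from typing import Optional


def best_hypothesis(result: dict) -> Optional[dict]:
    """Return the NBest entry with the highest Confidence, or None if there is none."""
    candidates = result.get("NBest") or []
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.get("Confidence", 0.0))
```

Returning `None` instead of raising keeps the helper usable for responses whose status is not `Success`, where `NBest` may be missing.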
Chunked Transfer (Recommended)
For lower latency, stream audio in chunks:
    import os
    import requests


    def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
        """Stream audio in chunks for lower latency."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]
        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
            "Transfer-Encoding": "chunked",
            "Expect": "100-continue",
        }
        params = {"language": language, "format": "detailed"}

        def generate_chunks(file_path: str, chunk_size: int = 1024):
            with open(file_path, "rb") as f:
                while chunk := f.read(chunk_size):
                    yield chunk

        response = requests.post(
            url,
            headers=headers,
            params=params,
            data=generate_chunks(audio_file_path),
        )
        response.raise_for_status()
        return response.json()
Authentication Options
Option 1: Subscription Key (Simple)
    headers = {
        "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
    }
Option 2: Bearer Token
    import requests
    import os


    def get_access_token() -> str:
        """Get access token from the token endpoint."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]
        token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
        response = requests.post(
            token_url,
            headers={
                "Ocp-Apim-Subscription-Key": api_key,
                "Content-Type": "application/x-www-form-urlencoded",
                "Content-Length": "0",
            },
        )
        response.raise_for_status()
        return response.text


    # Use token in requests (valid for 10 minutes)
    token = get_access_token()
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
Query Parameters
| Parameter | Required | Values | Description |
|---|---|---|---|
| language | Yes | en-US, de-DE, etc. | Language of speech |
| format | No | simple, detailed | Result format (default: simple) |
| profanity | No | masked, removed, raw | Profanity handling (default: masked) |
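Since a bad parameter value only surfaces as an HTTP 400 from the service, it can help to validate the values in the table before sending the request. A minimal sketch; `build_params` and its argument names are hypothetical.

```python
from typing import Dict

# Allowed values as listed in the table above.
ALLOWED_FORMATS = ("simple", "detailed")
ALLOWED_PROFANITY = ("masked", "removed", "raw")


def build_params(language: str, result_format: str = "simple",
                 profanity: str = "masked") -> Dict[str, str]:
    """Validate the documented query parameters and return them as a dict."""
    if not language:
        raise ValueError("language is required, e.g. 'en-US'")
    if result_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    if profanity not in ALLOWED_PROFANITY:
        raise ValueError(f"profanity must be one of {ALLOWED_PROFANITY}")
    return {"language": language, "format": result_format, "profanity": profanity}
```

The returned dict can be passed directly as the `params` argument of `requests.post`.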
Recognition Status Values
| Status | Description |
|---|---|
| Success | Recognition succeeded |
| NoMatch | Speech detected but no words matched |
| InitialSilenceTimeout | Only silence detected |
| BabbleTimeout | Only noise detected |
| Error | Internal service error |
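An HTTP 200 response can still carry a non-`Success` status, so callers need to inspect `RecognitionStatus` explicitly. One way to do that is sketched below; the `RecognitionFailed` exception and the `require_success` function are hypothetical names, and the messages are copied from the table above.

```python
# Descriptions taken from the status table above.
STATUS_MESSAGES = {
    "NoMatch": "Speech detected but no words matched",
    "InitialSilenceTimeout": "Only silence detected",
    "BabbleTimeout": "Only noise detected",
    "Error": "Internal service error",
}


class RecognitionFailed(Exception):
    """Raised when RecognitionStatus is anything other than Success."""


def require_success(result: dict) -> dict:
    """Return the result unchanged on Success, otherwise raise RecognitionFailed."""
    status = result.get("RecognitionStatus")
    if status == "Success":
        return result
    # Unknown or missing statuses fall through to a generic message.
    message = STATUS_MESSAGES.get(status, "Unrecognized status")
    raise RecognitionFailed(f"{status}: {message}")
```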
Profanity Handling
    # Mask profanity with asterisks (default)
    params = {"language": "en-US", "profanity": "masked"}

    # Remove profanity entirely
    params = {"language": "en-US", "profanity": "removed"}

    # Include profanity as-is
    params = {"language": "en-US", "profanity": "raw"}
Error Handling
    import os
    import requests


    def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
        """Transcribe with proper error handling."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]
        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
        try:
            with open(audio_path, "rb") as audio_file:
                response = requests.post(
                    url,
                    headers={
                        "Ocp-Apim-Subscription-Key": api_key,
                        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                        "Accept": "application/json",
                    },
                    params={"language": language, "format": "detailed"},
                    data=audio_file,
                )
            if response.status_code == 200:
                result = response.json()
                if result.get("RecognitionStatus") == "Success":
                    return result
                else:
                    print(f"Recognition failed: {result.get('RecognitionStatus')}")
                    return None
            elif response.status_code == 400:
                print("Bad request: Check language code or audio format")
            elif response.status_code == 401:
                print("Unauthorized: Check API key or token")
            elif response.status_code == 403:
                print("Forbidden: Missing authorization header")
            else:
                print(f"Error {response.status_code}: {response.text}")
            return None
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None
Async Version
    import os
    import aiohttp
    import asyncio


    async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
        """Async version using aiohttp."""
        region = os.environ["AZURE_SPEECH_REGION"]
        api_key = os.environ["AZURE_SPEECH_KEY"]
        url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
        headers = {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
        }
        params = {"language": language, "format": "detailed"}
        async with aiohttp.ClientSession() as session:
            # The file is read synchronously; short audio files are small
            with open(audio_file_path, "rb") as f:
                audio_data = f.read()
            async with session.post(url, headers=headers, params=params, data=audio_data) as response:
                response.raise_for_status()
                return await response.json()


    # Usage
    result = asyncio.run(transcribe_async("audio.wav", "en-US"))
    # With format=detailed, the recognized text is in the NBest list (see Detailed Format)
    print(result["NBest"][0]["Display"])
Supported Languages
Common language codes (see full list):
| Code | Language |
|---|---|
| en-US | English (US) |
| en-GB | English (UK) |
| de-DE | German |
| fr-FR | French |
| es-ES | Spanish (Spain) |
| es-MX | Spanish (Mexico) |
| zh-CN | Chinese (Mandarin) |
| ja-JP | Japanese |
| ko-KR | Korean |
| pt-BR | Portuguese (Brazil) |
Best Practices
- Use WAV PCM 16kHz mono for best compatibility
- Enable chunked transfer for lower latency
- Cache access tokens for 9 minutes (valid for 10)
- Specify the correct language for accurate recognition
- Use detailed format when you need confidence scores
- Handle all RecognitionStatus values in production code
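The token caching advice above can be sketched as a small wrapper around any function that fetches a token. `TokenCache` is a hypothetical class, not part of any Azure library. It takes the fetch function and the clock as arguments so it can be tested without network access, and it is not thread-safe as written.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Cache a bearer token and fetch a new one after ttl_seconds."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 9 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # e.g. the get_access_token function
        self._ttl = ttl_seconds      # 9 minutes, one less than the 10-minute validity
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, refreshing it if the TTL has elapsed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

With the `get_access_token` function from the Bearer Token section, usage would look like `cache = TokenCache(get_access_token)` followed by `cache.get()` wherever a token is needed.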
When NOT to Use This API
Use the Speech SDK or Batch Transcription API instead when you need:
- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files
Reference Files
| File | Contents |
|---|---|
| references/pronunciation-assessment.md | Pronunciation assessment parameters and scoring |
When to Use
Use this skill when you need to transcribe short audio files (up to 60 seconds) through the Azure Speech to Text REST API, without installing the Speech SDK.