Azure OpenAI Realtime client_secrets returns 500 when input_audio_transcription is included (Sweden Central)

James Morgan 0 Reputation points
2026-06-02T11:21:18.4166667+00:00

We are seeing a consistent server-side failure in Azure OpenAI Realtime when requesting client secrets with input_audio_transcription enabled.

Environment

  • Region: Sweden Central
  • Resource: LBBD-OpenAI-Sweden-Dev
  • Subscription: 52a9fd6d-324a-4cc7-861d-b17e4cf9c219
  • API path: /openai/v1/realtime/client_secrets
  • Auth: Managed Identity (DefaultAzureCredential)
  • Deployment tested: gpt-4o-mini-transcribe-sweden-dev-v2 (fresh deployment name)

Observed behavior

  1. Request WITH input_audio_transcription in session payload -> HTTP 500
  2. Same request WITHOUT input_audio_transcription -> HTTP 200

This is reproducible both directly against the endpoint and through our app route that mints realtime tokens.
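
For reference, the two requests being compared can be sketched roughly as below. This is a minimal illustration rather than the exact payload from our app: the realtime deployment name (my-realtime-deployment) is a hypothetical placeholder, and placing input_audio_transcription directly under session is assumed from the field name used in this thread.

```python
import copy
import json

# Hypothetical realtime deployment name; the thread does not state it.
REALTIME_DEPLOYMENT = "my-realtime-deployment"
# Transcription deployment name taken from the Environment list above.
TRANSCRIBE_DEPLOYMENT = "gpt-4o-mini-transcribe-sweden-dev-v2"


def build_payloads(realtime_deployment, transcribe_deployment):
    """Return (without_block, with_block); they differ only in the transcription block."""
    without_block = {"session": {"type": "realtime", "model": realtime_deployment}}
    with_block = copy.deepcopy(without_block)
    # Assumed placement: the block sits directly under "session".
    with_block["session"]["input_audio_transcription"] = {"model": transcribe_deployment}
    return without_block, with_block


without_block, with_block = build_payloads(REALTIME_DEPLOYMENT, TRANSCRIBE_DEPLOYMENT)
print(json.dumps(without_block, indent=2))  # variant reported as HTTP 200
print(json.dumps(with_block, indent=2))     # variant reported as HTTP 500
```

Building both bodies from one base keeps everything else identical, so any difference in the response can be attributed to the single added block.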

What we already checked

  • payload structure
  • deployment recreation with new name
  • same auth and api-version across both requests
  • retries and fallback path

Question

Is there a known region-specific issue or feature-gating requirement for input_audio_transcription in Realtime session creation on Azure OpenAI? If not, what exact prerequisites are required for this field to work?

Azure Speech in Foundry Tools

2 answers

  1. Thanmayi Godithi 10,820 Reputation points Microsoft External Staff Moderator
    2026-07-02T11:10:55.73+00:00

    Hi James Morgan,

    Thanks for the exceptionally clean repro — the paired request IDs and the "remove one block → 200" contrast make this easy to reason about. Here's where it lands.

    Root cause

    Your payload is valid, and the value you're passing for input_audio_transcription.model (gpt-4o-mini-transcribe-sweden-dev-v2) is a deployment name — which is exactly what Azure OpenAI requires. Azure deviates from OpenAI here and expects the deployment name, not a raw model ID like whisper-1. So that part is correct.

    Because the identical call returns 200 the moment the input_audio_transcription block is removed, this is a service-side (500) failure in the /client_secrets mint path when that block is present — not a validation, feature-gating, or configuration problem on your side. There's no documented region-specific prerequisite or feature flag for input_audio_transcription beyond using a supported transcription model referenced by deployment name.

    Immediate workaround (unblocks you now)

    Configure transcription after the connection instead of at token-mint time:

    1. Mint the ephemeral token without input_audio_transcription (this returns 200).
    2. Open the Realtime WebSocket/WebRTC connection using that token.
    3. Send a session.update event that enables transcription:
    {
      "type": "session.update",
      "session": {
        "type": "realtime",
        "input_audio_transcription": { "model": "gpt-4o-mini-transcribe-sweden-dev-v2" }
      }
    }
    

    The server replies session.updated with the transcription model applied, and input-transcription events start flowing. This post-connect path is confirmed working for this exact scenario, so it's a reliable way to proceed while the mint-time issue is addressed.
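
A minimal Python sketch of step 3, assuming the connection from step 2 is already open. Only the event shape comes from the example above; the helper names (build_transcription_update, is_session_updated) and the ws object mentioned in the comment are hypothetical, and no particular WebSocket or WebRTC library is implied.

```python
import json

# Transcription deployment name taken from the thread.
TRANSCRIBE_DEPLOYMENT = "gpt-4o-mini-transcribe-sweden-dev-v2"


def build_transcription_update(transcribe_deployment):
    """Build the session.update event from step 3 as a JSON string."""
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "input_audio_transcription": {"model": transcribe_deployment},
        },
    }
    return json.dumps(event)


def is_session_updated(raw_message):
    """Return True if a server message is the session.updated acknowledgement."""
    try:
        message = json.loads(raw_message)
    except (TypeError, ValueError):
        # Not JSON (or not text at all): cannot be the acknowledgement.
        return False
    return isinstance(message, dict) and message.get("type") == "session.updated"


update_event = build_transcription_update(TRANSCRIBE_DEPLOYMENT)
# With an already-open connection object (hypothetical `ws`), the event would be
# sent as a text message, e.g. ws.send(update_event), and incoming messages
# checked with is_session_updated(...) before relying on transcription events.
print(update_event)
```

Waiting for the acknowledgement before streaming audio makes it easier to tell a rejected session.update apart from transcription events that simply have not arrived yet.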

    Is this a bug?

    Effectively yes — it's a backend fault in the /client_secrets mint path, not something you can fix by changing config:

    • The payload is valid and the deployment-name usage is correct.
    • Removing one optional block flips 500 → 200.
    • The same transcription configuration is accepted cleanly on the post-connect session.update path — proving the content is supported; only the mint-time handling faults.

    There's precedent for transient service-side 500s on Sweden Central transcribe models that Microsoft mitigated on the backend, so it's also worth confirming whether this is persistent or intermittent right now.

    To get this investigated (please share)

    1. Is the 500 100% reproducible right now, or intermittent?
    2. The exact JSON of the failing /client_secrets request (with and without the block), plus 2–3 fresh failing apim-request-ids + UTC timestamps (you've already given c0d6e5c5-6e40-4039-a4e1-cd21f985cae1).
    3. Does it also fail with API-key auth (to rule out anything Managed-Identity-specific)?
    4. Does the same call succeed in another region (e.g., East US 2) with an equivalent transcription deployment?
    5. Confirm you're on the GA endpoint without an api-version query param.

    If the 500 persists with a valid payload and a supported transcription deployment, please share the details requested above via private message.
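
One way to capture the request IDs and UTC timestamps asked for in item 2 is sketched below. It assumes the request ID is returned in a response header named apim-request-id and that your HTTP client exposes headers as a dict-like object; summarize_response is a hypothetical helper, and the timestamp in the example is a placeholder, not a value from this thread.

```python
from datetime import datetime, timezone


def summarize_response(label, status_code, headers, observed_at=None):
    """Collect the diagnostic fields for one /client_secrets response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {str(name).lower(): value for name, value in headers.items()}
    observed_at = observed_at or datetime.now(timezone.utc)
    return {
        "label": label,
        "status": status_code,
        # Assumed header name; None if the response did not carry it.
        "apim_request_id": lowered.get("apim-request-id"),
        "utc_timestamp": observed_at.astimezone(timezone.utc).isoformat(),
    }


# Illustrative call: the request ID is the one quoted in item 2,
# the timestamp is a placeholder.
example = summarize_response(
    "with input_audio_transcription",
    500,
    {"Apim-Request-Id": "c0d6e5c5-6e40-4039-a4e1-cd21f985cae1"},
    observed_at=datetime(2026, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
)
print(example)
```

Recording one such entry per attempt, for both the failing and the succeeding variant, gives the paired evidence that makes a backend investigation quicker.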

    Please give the post-connect workaround a try and let me know if it unblocks your flow.

    If this helps, consider marking it as accepted so others hitting the same 500 can find it.


  2. AI answer

    2026-06-02T11:21:39.95+00:00

    There is no region-specific or feature-gating prerequisite for input_audio_transcription documented for Azure OpenAI Realtime beyond using the correct models and schema. A 500 response when adding input_audio_transcription indicates a service-side error rather than a validation or configuration issue on the client.

    From the Realtime REST reference:

    • input_audio_transcription is an optional configuration object on the Realtime session. It “defaults to off and can be set to null to turn off once on.”
    • It is not tied to a special feature flag in the public documentation.
    • The supported transcription models for this property are:
      • whisper-1
      • gpt-4o-mini-transcribe
      • gpt-4o-mini-transcribe-2025-12-15
      • gpt-4o-transcribe
      • gpt-4o-transcribe-diarize

    The only documented prerequisites for using input_audio_transcription are therefore:

    1. Use the GA Realtime API endpoints and protocol
      • Path: POST {endpoint}/openai/v1/realtime/client_secrets
      • Optional api-version query: v1 or preview (defaults to v1 if omitted).
    2. Provide a valid Realtime session payload
      • session.type must be set appropriately for a Realtime session.
      • If input_audio_transcription is included, its model value must be one of the supported transcription models listed above.
      • Optional fields like language and prompt follow the documented schema.
    3. Use a supported Realtime model deployment
      • For WebRTC/Realtime, use one of the Realtime models listed for the region (for example gpt-4o-mini-realtime-preview, gpt-4o-realtime-preview, gpt-realtime, or gpt-realtime-mini as documented for East US 2 and Sweden Central).
      • Ensure the deployment name in the session configuration matches the actual deployment.

    Given that:

    • The same request succeeds (200) when input_audio_transcription is omitted, and
    • The documentation does not list any additional region-specific requirements or feature flags for input_audio_transcription,

    this pattern is consistent with a backend issue in the Sweden Central Realtime service rather than a missing prerequisite in the request. The next steps that align with the documentation are:

    1. Verify that the input_audio_transcription.model is one of the supported transcription models listed above.
    2. Confirm that the Realtime model deployment being used is supported in Sweden Central and that the deployment name is correct.
    3. If the 500 persists with a valid payload and supported model, treat it as a service-side incident and raise it via Azure support, including:
      • Region (Sweden Central), resource name, and subscription ID.
      • The exact POST /openai/v1/realtime/client_secrets payload with and without input_audio_transcription.
      • Timestamps and apim-request-id values from the failing 500 responses.

    There is no documented feature-gating or extra configuration required beyond the above; a persistent 500 in this scenario requires service investigation.



