Slow performance of gpt-5.4-mini in PTUs

Antoine 0 Reputation points
2026-06-18T11:41:48.2566667+00:00

Setup: model gpt-5.4-mini, deployment on 15 PTUs (EU).
We measure the latency of each chat-completion call in our voice-screening flow — reasoning_effort='none', ~26K-char system prompt, streaming.

What we measured: initially the deployment ran around p50 ~1.1s but with p95 ~3s and spikes up to ~4s.

After we increased the PTUs, p95 settled to ~1.9s (p50 ~1.1s, max ~2s). But when we then dropped from 30 to 15 PTUs, the latency didn't change at all — p95 stayed ~1.9s — so adding provisioned capacity isn't improving latency.

For reference, our same-resource gpt-4.1-mini also on 15 PTUs is faster: p50 ~0.9s, p95 ~1.35s.

What the benchmark says: per Artificial Analysis (a third-party LLM benchmark tracker), gpt-5.4-mini in non-reasoning mode — which matches our config — runs at roughly 180 tokens/sec output with ~1.2s TTFT on Azure, versus ~79 tokens/sec for gpt-4.1-mini. So 5.4-mini should generate about 2.3× faster and be at least as fast as 4.1-mini end-to-end. Source: https://artificialanalysis.ai/models/comparisons/gpt-5-4-mini-vs-gpt-4-1?models=gpt-5-4[…]ing%2Cgpt-4-1-mini%2Cgpt-4-1&speed=latency-vs-output-speed

Screenshot 2026-06-18 at 13.34.25

Why do we observe worse performance of gpt-5.4-mini in Azure (even with PTUs)?

Foundry Models
Foundry Models

A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference


2 answers

Sort by: Most helpful
  1. SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator
    2026-06-22T11:19:06.9+00:00

    Hello @Antoine

    Thank you for reaching out to Microsoft Q&A.

    I understand why this behavior appears unexpected. Based on the benchmark data you referenced, GPT-5.4-mini operating with reasoning_effort='none' would appear to offer higher output throughput than GPT-4.1-mini, and increasing PTUs would typically be expected to improve performance consistency. However, the measurements you've shared suggest that the observed latency is likely influenced by factors other than available provisioned throughput alone.

    Provisioned Throughput Units (PTUs) primarily control throughput and concurrency, rather than reducing the intrinsic latency of an individual request.

    In practice:

    • PTUs help absorb concurrent traffic and reduce queuing delays.

    PTUs do not directly make a single inference request execute faster once the deployment is operating below saturation.

    If a deployment is not constrained by available capacity, increasing PTUs may have little or no effect on p50 or p95 latency.

    The pattern you observed aligns with this behavior:

    At 15 PTUs, the deployment may initially have experienced capacity pressure, resulting in p95 values around 3–4 seconds.

    Increasing to 30 PTUs likely reduced queuing and stabilized latency.

    After returning to 15 PTUs, latency remained unchanged because the workload was no longer saturating the deployment.

    This generally indicates that the current ~1.9s p95 is more likely related to model processing characteristics and workload composition rather than PTU availability.

    Benchmark results versus real-world workloads

    The Artificial Analysis benchmark is useful for directional comparison, but benchmark measurements are typically collected under highly controlled conditions and do not necessarily reflect production workloads.

    Your workload differs from benchmark testing in several significant ways:

    Streaming responses

    Voice-screening workflow

    Approximately 26,000-character system prompt

    Custom application logic

    Specific Azure region and deployment configuration

    Real-world traffic patterns and concurrency

    Metrics such as:

    Output Tokens per Second

    Time to First Token (TTFT)

    cannot always be directly translated into expected end-to-end application latency.

    Impact of the large system prompt

    One factor that stands out is the size of the system prompt.

    With a system prompt of approximately 26K characters, the model must:

    Receive the prompt.

    Tokenize the input.

    Process the entire context.

    Execute safety and orchestration checks.

    Generate the first token.

    Even with:

    reasoning_effort = none
    

    the prompt still needs to be fully processed before generation can begin.

    For many workloads, especially those with relatively short responses, latency is often dominated by prompt processing (prefill time) rather than token generation speed.

    Why GPT-5.4-mini may appear slower than GPT-4.1-mini

    Although GPT-5.4-mini may achieve higher output token generation rates in benchmark scenarios, different models have different inference characteristics.

    A model with higher steady-state generation throughput does not necessarily produce lower end-to-end latency.

    In particular:

    GPT-5.4-mini and GPT-4.1-mini use different architectures and optimizations.

    TTFT can differ significantly between models.

    Larger prompts increase prefill time.

    Streaming workloads are often TTFT-dominated rather than generation-speed dominated.

    This means GPT-4.1-mini can legitimately outperform GPT-5.4-mini for certain latency-sensitive workloads, even if GPT-5.4-mini produces tokens faster once generation begins.

    According to Azure OpenAI performance guidance, latency can also be affected by:

    Model type

    Prompt token count

    Completion token count

    Regional service load

    Request concurrency

    Traffic bursts

    Content filtering processing

    Workload mixing across deployments

    Azure OpenAI does not provide a fixed latency guarantee, and response times can vary depending on overall system conditions.

    Recommended validation steps

    To better isolate the source of latency, we recommend the following checks.

    1. Review Azure Monitor metrics

    Please compare GPT-5.4-mini and GPT-4.1-mini using Azure Monitor and split by deployment name.

    Useful metrics include:

    Time to Response (TTFT)

    Time Between Tokens

    Processed Prompt Tokens

    Generated Completion Tokens

    Request Count

    Provisioned-Managed Utilization V2

    Any throttling or rate-limit indicators

    If Utilization remains well below 100% at 15 PTUs, that would strongly indicate the workload is not capacity-bound and that additional PTUs are unlikely to improve latency.

    2. Compare prompt and completion token counts

    Please validate:

    • Actual prompt token count rather than character count
    • Average completion token count
    • Whether GPT-5.4-mini generates longer responses than GPT-4.1-mini

    3. Measure TTFT separately from total response time

    It would be helpful to determine whether the increase is occurring in:

    Time to First Token (TTFT)

    Token generation phase

    Overall request duration

    If TTFT accounts for most of the latency difference, prompt processing is likely the primary contributor.

    4. Test with a smaller system prompt

    As a controlled test, try reducing the system prompt from approximately 26K characters to a significantly smaller version and compare:

    TTFT

    End-to-end latency

    Output throughput

    This is often the most effective way to identify prompt-processing overhead.

    If minimizing latency is the primary goal, the following optimizations are likely to have the greatest impact:

    • Reduce the size of the system prompt where possible.
    • Consider prompt caching for static prompt content.
    • Continue using streaming responses.
    • Set max_tokens appropriately for the workload.
    • Avoid generating unnecessary completion tokens.
    • Separate different workload types onto dedicated deployments.
    • Review content filtering configuration if appropriate for the use case.

    Please refer this

    https://learn.microsoft.com/azure/ai-foundry/openai/how-to/latency (Performance and latency, throughput vs latency, factors affecting latency)

    https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput (PTU concepts)

    I Hope this helps. Do let me know if you have any further queries.


    If this answers your query, please do click Accept Answer and Yes for was this answer helpful.

    Thank you!

    Was this answer helpful?

    0 comments No comments

  2. Jerald Felix 15,370 Reputation points Volunteer Moderator
    2026-06-20T06:12:45.4966667+00:00

    Hello Antoine,

    Greetings! Thanks for raising this question in Q&A forum.

    What you are seeing is a common misunderstanding about what PTUs actually control. PTUs guarantee reserved capacity so your calls do not get throttled or queued behind other tenants, they do not directly make a single call's raw compute time faster. Your actual latency per call is mostly driven by two separate things, prompt processing time (which depends heavily on your 26K character system prompt) and per token generation time. Since adding or removing PTUs did not change your p95 once you were no longer saturated, this points to a fixed cost from prompt processing and model behavior rather than a capacity problem. Comparing raw token-per-second numbers from a third party benchmark is also not the full picture, since those benchmarks usually run with much shorter prompts and different conditions than your production setup.

    Here is how I would approach narrowing this down.

    Separate TTFT from per token speed In the Azure portal, go to your Azure OpenAI resource, then Monitoring, then Metrics. Add Time To First Token (AzureOpenAITimeToResponse) and Time Between Tokens (AzureOpenAINormalizedTBTInMS), split by ModelDeploymentName. This tells you whether the slowness is coming from prompt processing (TTFT) or from token generation speed (TBT). Given your long system prompt, I would expect TTFT to be the bigger factor for gpt-5.4-mini compared to gpt-4.1-mini.

    Check Processed Prompt Tokens Add the ProcessedPromptTokens metric as well. Larger prompts increase TTFT, and your 26K character system prompt is a meaningful prefill cost on every single call regardless of how many PTUs you have allocated.

    Confirm whether prompt caching is helping If your 26K character system prompt is identical across most calls, make sure it is positioned first in your message list so Azure OpenAI can reuse cached prefix tokens. Cached tokens do not consume PTU capacity and significantly reduce prefill time, this matters a lot for long static system prompts like yours.

    Watch Provisioned-managed Utilization Even though you said increasing PTUs helped once, check the Provisioned-managed Utilization V2 metric around the time of your tests. It is possible that scaling from 15 to 30 PTUs actually triggered a backend reallocation to a less congested placement, and that improvement carried over even after you scaled back down to 15, rather than the PTU count itself being the deciding factor.

    Treat the benchmark numbers as a rough signal, not a guarantee Third party benchmarks like Artificial Analysis usually test with short prompts and ideal conditions, so a 2.3x raw generation speed difference does not always translate into end to end latency improvement once a long system prompt and reasoning_effort settings are involved.

    For background on how Azure OpenAI splits up latency, this is useful: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/latency

    If after checking TTFT and confirming prompt caching is working you still see gpt-5.4-mini meaningfully slower than gpt-4.1-mini at the same PTU count, I would recommend opening a support ticket with your Request IDs and the metric screenshots, since at that point it becomes a backend or model specific behavior that needs the product team to investigate on their side.

    If this answer helps you kindly accept the answer which will help others who have similar questions

    Best Regards,

    Jerald Felix.

    Was this answer helpful?

    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.