A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference
Hello @Antoine
Thank you for reaching out to Microsoft Q&A.
I understand why this behavior appears unexpected. Based on the benchmark data you referenced, GPT-5.4-mini operating with reasoning_effort='none' would appear to offer higher output throughput than GPT-4.1-mini, and increasing PTUs would typically be expected to improve performance consistency. However, the measurements you've shared suggest that the observed latency is likely influenced by factors other than available provisioned throughput alone.
Provisioned Throughput Units (PTUs) primarily control throughput and concurrency, rather than reducing the intrinsic latency of an individual request.
In practice:
- PTUs help absorb concurrent traffic and reduce queuing delays.
PTUs do not directly make a single inference request execute faster once the deployment is operating below saturation.
If a deployment is not constrained by available capacity, increasing PTUs may have little or no effect on p50 or p95 latency.
The pattern you observed aligns with this behavior:
At 15 PTUs, the deployment may initially have experienced capacity pressure, resulting in p95 values around 3–4 seconds.
Increasing to 30 PTUs likely reduced queuing and stabilized latency.
After returning to 15 PTUs, latency remained unchanged because the workload was no longer saturating the deployment.
This generally indicates that the current ~1.9s p95 is more likely related to model processing characteristics and workload composition rather than PTU availability.
Benchmark results versus real-world workloads
The Artificial Analysis benchmark is useful for directional comparison, but benchmark measurements are typically collected under highly controlled conditions and do not necessarily reflect production workloads.
Your workload differs from benchmark testing in several significant ways:
Streaming responses
Voice-screening workflow
Approximately 26,000-character system prompt
Custom application logic
Specific Azure region and deployment configuration
Real-world traffic patterns and concurrency
Metrics such as:
Output Tokens per Second
Time to First Token (TTFT)
cannot always be directly translated into expected end-to-end application latency.
Impact of the large system prompt
One factor that stands out is the size of the system prompt.
With a system prompt of approximately 26K characters, the model must:
Receive the prompt.
Tokenize the input.
Process the entire context.
Execute safety and orchestration checks.
Generate the first token.
Even with:
reasoning_effort = none
the prompt still needs to be fully processed before generation can begin.
For many workloads, especially those with relatively short responses, latency is often dominated by prompt processing (prefill time) rather than token generation speed.
Why GPT-5.4-mini may appear slower than GPT-4.1-mini
Although GPT-5.4-mini may achieve higher output token generation rates in benchmark scenarios, different models have different inference characteristics.
A model with higher steady-state generation throughput does not necessarily produce lower end-to-end latency.
In particular:
GPT-5.4-mini and GPT-4.1-mini use different architectures and optimizations.
TTFT can differ significantly between models.
Larger prompts increase prefill time.
Streaming workloads are often TTFT-dominated rather than generation-speed dominated.
This means GPT-4.1-mini can legitimately outperform GPT-5.4-mini for certain latency-sensitive workloads, even if GPT-5.4-mini produces tokens faster once generation begins.
According to Azure OpenAI performance guidance, latency can also be affected by:
Model type
Prompt token count
Completion token count
Regional service load
Request concurrency
Traffic bursts
Content filtering processing
Workload mixing across deployments
Azure OpenAI does not provide a fixed latency guarantee, and response times can vary depending on overall system conditions.
Recommended validation steps
To better isolate the source of latency, we recommend the following checks.
1. Review Azure Monitor metrics
Please compare GPT-5.4-mini and GPT-4.1-mini using Azure Monitor and split by deployment name.
Useful metrics include:
Time to Response (TTFT)
Time Between Tokens
Processed Prompt Tokens
Generated Completion Tokens
Request Count
Provisioned-Managed Utilization V2
Any throttling or rate-limit indicators
If Utilization remains well below 100% at 15 PTUs, that would strongly indicate the workload is not capacity-bound and that additional PTUs are unlikely to improve latency.
2. Compare prompt and completion token counts
Please validate:
- Actual prompt token count rather than character count
- Average completion token count
- Whether GPT-5.4-mini generates longer responses than GPT-4.1-mini
3. Measure TTFT separately from total response time
It would be helpful to determine whether the increase is occurring in:
Time to First Token (TTFT)
Token generation phase
Overall request duration
If TTFT accounts for most of the latency difference, prompt processing is likely the primary contributor.
4. Test with a smaller system prompt
As a controlled test, try reducing the system prompt from approximately 26K characters to a significantly smaller version and compare:
TTFT
End-to-end latency
Output throughput
This is often the most effective way to identify prompt-processing overhead.
If minimizing latency is the primary goal, the following optimizations are likely to have the greatest impact:
- Reduce the size of the system prompt where possible.
- Consider prompt caching for static prompt content.
- Continue using streaming responses.
- Set
max_tokensappropriately for the workload. - Avoid generating unnecessary completion tokens.
- Separate different workload types onto dedicated deployments.
- Review content filtering configuration if appropriate for the use case.
Please refer this
https://learn.microsoft.com/azure/ai-foundry/openai/how-to/latency (Performance and latency, throughput vs latency, factors affecting latency)
https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput (PTU concepts)
I Hope this helps. Do let me know if you have any further queries.
If this answers your query, please do click Accept Answer and Yes for was this answer helpful.
Thank you!