Freshly deployed Foundry model returns 429 rate-limit errors with essentially zero usage

Allen Zhang 0 Reputation points
2026-06-09T00:04:53.72+00:00

On a lightly used Azure subscription, I created a new Foundry project and deployed the DeepSeek-V4-Flash model, then connected it to Agent mode via the Foundry kit extension in VS Code. With zero usage of this model (no other project or model has been deployed since 6/1), I am getting a 429 error from a simple "testing" Copilot agent prompt.

Unable to call the (az-proj-eus2)DeepSeek-V4-Flash inference endpoint due to 429. Your requests to DeepSeek-V4-Flash for DeepSeek-V4-Flash in eastus2 have exceeded rate limit. Please check if the input or configuration is correct.

The quota view shows the full 20 RPM with 0% rate limiting, and yet no Agent mode request is going through.

Foundry Models

A catalog of AI models in Microsoft Foundry that you can discover, compare, and deploy using Azure’s built‑in tools for evaluation, fine‑tuning, and inference

3 answers

  1. Thanmayi Godithi 10,820 Reputation points Microsoft External Staff Moderator
    2026-07-02T10:03:19.8433333+00:00

    Hey Allen Zhang! That feels weird at first (“zero calls yet I’m getting 429”), but with Azure AI Foundry/Foundry Models throttling there are a few known reasons a 429 can happen even when RPM/TPM in the UI looks like it should be fine.

    Based on the provided info, here are the most likely causes and what to try next.

    1. Check quota at the deployment level (TPM/RPM), not just the subscription level

    The docs call out that you can be “approved” at a high level but still hit 429s because quota isn’t effectively allocated to the specific deployment receiving traffic.

    What to do

    • In Azure AI Foundry, open the deployment (not just the project/subscription quota view)
    • Confirm the deployment’s effective quota/TPM allocation for that model + region
    • Look specifically for whether the deployment has any token/per-minute allocation configured
    2. Transient throttling / backend scaling (429 even when you’re “under quota”)

    Even if you’re not exceeding configured quotas, Azure can still return 429 during backend scaling/adjustments. In that scenario:

    • the error can occur even with very low actual usage
    • retrying later (honoring retry-after-ms) is expected to resolve it
    • the throttling can affect effective rate limits temporarily

    What to do

    • Implement retry with backoff for 429s (prefer SDK built-in retry)
    • If you’re seeing the HTTP headers in the response, compare the effective limits (for example, x-ratelimit-limit-tokens) against your configured TPM to confirm whether there’s a temporary adjustment
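The header comparison in the last bullet can be sketched as follows. This is a minimal illustration, not part of any SDK: the function names are hypothetical, the header name `x-ratelimit-limit-tokens` is taken from the text above, and the configured TPM of 20,000 used in the usage note is an assumed value (the question only states 20 RPM).

```python
from typing import Mapping, Optional


def effective_tpm_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return the token limit reported in x-ratelimit-limit-tokens, if present."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-ratelimit-limit-tokens")
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        # Header present but not a plain integer: treat it as unknown.
        return None


def is_temporarily_reduced(headers: Mapping[str, str], configured_tpm: int) -> bool:
    """True when the reported limit is lower than the configured TPM."""
    effective = effective_tpm_from_headers(headers)
    return effective is not None and effective < configured_tpm
```

For example, `is_temporarily_reduced({"x-ratelimit-limit-tokens": "5000"}, 20000)` returns `True`, which would point to a temporary adjustment rather than real overuse. A missing header returns `False`, so the absence of the header is not treated as evidence either way.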
    3. max_tokens (and similar request parameters) can consume rate-limit budget

    Rate-limit calculations can include the request parameters (like max_tokens), not just the eventual billed tokens. So even “small prompts” can trigger throttling if max_tokens is set high.

    What to do

    • Reduce max_tokens in the agent/tool request (if you can control it)
    • Avoid best_of (if applicable)
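To see why a high max_tokens matters, here is a rough back-of-the-envelope sketch. It assumes a simplified estimate of prompt tokens plus max_tokens multiplied by best_of; the service's real calculation is approximate and may differ. The function names and the 20,000 TPM figure in the usage note are hypothetical.

```python
def estimated_request_tokens(prompt_tokens: int, max_tokens: int, best_of: int = 1) -> int:
    """Simplified per-request estimate (an assumption, not the service's exact formula)."""
    return prompt_tokens + max_tokens * best_of


def requests_within_tpm(tpm_limit: int, prompt_tokens: int, max_tokens: int,
                        best_of: int = 1) -> int:
    """How many such requests fit into one minute's token budget under that estimate."""
    per_request = estimated_request_tokens(prompt_tokens, max_tokens, best_of)
    if per_request <= 0:
        raise ValueError("estimated tokens per request must be positive")
    return tpm_limit // per_request
```

Under these assumptions, a 50-token prompt with `max_tokens=16000` against a 20,000 TPM budget leaves room for only one request per minute (`requests_within_tpm(20000, 50, 16000)` is 1), while the same prompt with `max_tokens=500` leaves room for 36.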
    4. Wait/refresh and retry the deployment path

    There’s guidance that refreshing and retrying can resolve transient issues related to loading/handling deployments.

    What to do

    • Refresh the Foundry page (or the relevant UI)
    • Retry the call after refresh

    If you continue getting sustained 429s while you believe you’re below the effective limits, please share the requested details over private message.

    Kindly let us know if the above helps or you need further assistance on this issue.

    If the answer is helpful, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

  2. Jerald Felix 15,370 Reputation points Volunteer Moderator
    2026-06-09T01:47:02.5866667+00:00

    Hello Allen Zhang,

    Greetings! Thanks for raising this question in the Q&A forum.

    This is a common and confusing behavior with Azure AI Foundry model deployments. The portal showing 0% rate limiting and the full 20 RPM quota does not mean a 429 cannot occur. Here is why.

    Rate limiting estimates each request's maximum processed tokens at request time, including max_tokens, and RPM enforcement looks at short windows inside the minute. This means your app can be throttled even when Azure Monitor usage looks calm. In other words, the portal usage gauge reflects consumed tokens over the full minute, but enforcement is evaluated over much shorter intervals (typically 1 or 10 seconds).

    Additionally, due to the approximate nature of the rate limit token calculation, it is expected behavior that a rate limit can be triggered prior to what might be expected when compared to an exact token count measurement for each request.

    Here are the steps to diagnose and resolve this:

    Check how the VS Code Foundry Kit agent sets max_tokens: The estimated max-processed-token count is added to a running token count of all requests that resets each minute. If the TPM rate limit is reached at any point during that minute, further requests receive a 429 response code until the counter resets. Even a single test prompt can exhaust the budget in one burst if the agent sends multiple internal requests behind the scenes (e.g., tool calls, context retrieval, or planning steps), since each one counts against the 20 RPM limit and adds its own max_tokens estimate to the TPM count. Check the agent settings and reduce max_tokens to the minimum needed for testing.

    Check for burst behavior from the agent: The Copilot agent in VS Code may fire multiple rapid sub-requests in the background for a single user prompt. RPM rate limits are evaluated over a small period of time, typically 1 or 10 seconds. If the number of requests received during that period exceeds what would be expected at the set RPM limit, new requests receive a 429 response code until the next evaluation period. At 20 RPM, that is approximately 1 request every 3 seconds, a pace an agent can easily exceed during initialization.
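The arithmetic behind that point can be sketched as below. The helper names are hypothetical, and pro-rating the RPM limit evenly across the window is an assumption about how "what would be expected at the set RPM limit" is computed; the service's actual algorithm is not documented in this thread.

```python
from typing import List


def allowed_requests_in_window(rpm_limit: int, window_seconds: float) -> float:
    """Pro-rated share of the per-minute limit for one evaluation window."""
    return rpm_limit * window_seconds / 60.0


def min_spacing_seconds(rpm_limit: int) -> float:
    """Even spacing between requests that stays at the RPM limit."""
    return 60.0 / rpm_limit


def paced_send_times(n_requests: int, rpm_limit: int, start: float = 0.0) -> List[float]:
    """Send offsets (in seconds) for n requests spaced evenly at the RPM limit."""
    spacing = min_spacing_seconds(rpm_limit)
    return [start + i * spacing for i in range(n_requests)]
```

At 20 RPM this gives roughly 3.3 requests per 10-second window and about 0.33 per 1-second window, so an agent that fires four or five sub-requests within a second is well past the pro-rated share. Spacing calls as in `paced_send_times(4, 20)`, which yields offsets of 0, 3, 6 and 9 seconds, is a conservative client-side choice rather than a guarantee against 429s.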

    Inspect the response headers for retry-after-ms: When a 429 is returned, check the response headers for retry-after-ms or Retry-After. These indicate how long to wait before retrying. If your client does not honor the header and retries immediately, it will keep hitting 429 in a loop.
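A minimal sketch of turning those headers into a wait time, using only the Python standard library. The function name is hypothetical. It assumes `retry-after-ms` carries milliseconds and that `Retry-After` follows the standard HTTP form (either a number of seconds or an HTTP date); it prefers `retry-after-ms` when both are present.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Seconds to wait before retrying, or None if no usable header is present."""
    lowered = {name.lower(): value for name, value in headers.items()}

    raw_ms = lowered.get("retry-after-ms")
    if raw_ms is not None:
        try:
            return max(0.0, float(raw_ms) / 1000.0)
        except ValueError:
            pass  # fall through to Retry-After

    raw = lowered.get("retry-after")
    if raw is None:
        return None
    try:
        return max(0.0, float(raw))  # delta-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(raw)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

A client would sleep for the returned value before the next attempt and fall back to its own backoff schedule when the function returns `None`.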

    Request a quota increase for the DeepSeek-V4-Flash deployment: If 429 errors persist, you can request a higher quota for your subscription in the Azure portal. For latency-critical or high-volume workloads, consider Provisioned Throughput Units (PTU), which provide dedicated resources, guaranteed capacity, and predictable latency even at scale. For a testing scenario, even a modest quota increase above 20 RPM should give you comfortable headroom.

    Try the model directly in the Foundry portal playground (not the agent): As a quick test, open your project in the Foundry portal, navigate to the deployment, and send a test prompt directly from the playground, bypassing the VS Code agent entirely. If that succeeds, the issue is specifically with how the agent batches or sends requests, not with the deployment itself.

    If 429s persist even in the playground, open an Azure Support ticket: This would indicate a regional capacity constraint or a backend throttle applied to your subscription for East US 2. In the ticket, include your subscription ID, the deployment name, the region (eastus2), and a note that 429s occur with zero prior usage on a fresh deployment.

    If this answer helps, kindly accept it, which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

  3. AI answer

    2026-06-09T00:05:21.03+00:00

    The 429 on a freshly deployed Foundry model with apparently zero usage is most likely caused by backend rate‑limit or capacity behavior rather than actual request volume from the new deployment.

    From the quota and limits behavior for Foundry Models and Azure OpenAI:

    1. 429 can occur even when portal usage looks empty
      429 “Too Many Requests” is returned whenever the effective rate limit is exceeded or the system can’t process the request at that time. This includes:
      • Standard/global deployments sharing a regional capacity pool where system capacity throttling or temporary rate‑limit adjustment is in effect.
      • Scenarios where the effective rate limit is temporarily reduced below the configured quota for reliability.
      In these cases, the portal can still show the full configured RPM/TPM and 0% rate limiting, while the service responds with 429.
    2. Types of 429s to distinguish
      For a new deployment that has not sent traffic, the relevant 429 scenarios are:
      • System capacity throttling – backend capacity constrained; message text often indicates the service is temporarily unable to process the request. Recommended action: retry using retry-after-ms and exponential backoff; consider Provisioned Throughput (PTU) if persistent.
      • Temporary rate limit adjustment – the x-ratelimit-limit-tokens header is lower than the configured TPM; this is a temporary protective reduction for shared pools. Recommended action: respect retry-after-ms and back off; typically resolves within a few hours.
    3. Required client behavior
      Ensure the calling client (VS Code Foundry kit / Agent) is using proper retry logic:
      • Automatically retry on 429 using retry-after-ms when present, otherwise exponential backoff with jitter and a bounded number of retries.
      • Avoid rapid repeated retries without backoff, which will worsen throttling.
    4. Check deployment‑level allocation vs. subscription quota
      Even when subscription quota is available, 429 can occur if the deployment‑level allocation or the regional capacity pool is effectively constrained. Confirm that:
      • The DeepSeek‑V4‑Flash deployment in East US 2 has the expected RPM/TPM allocation.
      • No other deployments in the same region/model family are consuming the shared allocation.
    5. If the condition persists
      If 429s continue for many hours on a new deployment despite minimal traffic and correct client retry behavior, this can indicate a regional capacity issue similar to other quota/capacity anomalies. In that case, recommended actions are:
      • Try a different supported region for the same model, or another deployment type (for example, Provisioned Throughput) if available.
      • Monitor usage and rate‑limit metrics in the Quota view to confirm whether the effective limits change.
      • If the region remains unusable, open a support case so the backend allocation state and regional capacity can be investigated.

    These behaviors are expected for shared, global‑standard capacity pools and are not necessarily tied to actual token usage from the new deployment.
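The client behavior described in item 3 (retry on 429 using retry-after-ms when present, otherwise exponential backoff with jitter and a bounded number of retries) might look roughly like the sketch below. It is transport-agnostic: `send` is a hypothetical callable that performs one request and returns a status code and headers, and the base delay, cap, and retry count are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def backoff_delay(attempt: int, headers: Mapping[str, str], base: float = 1.0,
                  cap: float = 30.0, rng: Callable[[], float] = random.random) -> float:
    """Delay before the next retry: server hint if given, else jittered backoff."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_ms = lowered.get("retry-after-ms")
    if raw_ms is not None:
        try:
            return max(0.0, float(raw_ms) / 1000.0)
        except ValueError:
            pass  # unusable hint, fall back to exponential backoff
    # "Full jitter": a random delay between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(send: Callable[[], Response], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send(), retrying on 429 at most max_retries times."""
    attempt = 0
    while True:
        status, headers = send()
        if status != 429 or attempt >= max_retries:
            return status, headers
        sleep(backoff_delay(attempt, headers))
        attempt += 1
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting. After the final retry the last 429 is returned to the caller instead of looping forever, which is the "bounded" part of the guidance.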


    AI-generated content may be incorrect. Read our transparency notes for more information.
