AI using Azure Foundry with budget subscription policy

Question

Stavros Koureas 16

Using Azure API Management connected to Azure Foundry (aka AI Gateway), we can define the available APIs (completions, responses, messages, etc.) and configure policies on them.

One policy that restricts tokens (input and output combined) to a monthly quota is the following:

<llm-token-limit counter-key="@($"subscription/{context.Subscription?.Key}")" token-quota="1000" token-quota-period="Monthly" estimate-prompt-tokens="true" remaining-quota-tokens-variable-name="remainingTotalTokens" tokens-consumed-variable-name="consumedTotalTokens" />

This policy has two major gaps. First, it does not track or limit input and output tokens separately, even though the split between them has a substantial impact on cost. Second, it does not translate token usage into an actual cost based on the specific deployment or model being used.
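
To make the first gap concrete, here is a minimal sketch in Python (not APIM policy code). The per-million-token rates are placeholders assumed for illustration, not quoted prices. The point is only that two callers with the same total token count can cost very different amounts depending on their input/output mix.

```python
# Illustrative only: the rates below are assumed placeholders, not real prices.
INPUT_RATE_PER_M = 2.50    # assumed USD per 1M input tokens
OUTPUT_RATE_PER_M = 10.00  # assumed USD per 1M output tokens


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a usage mix under the assumed rates above."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


# Both callers use the same 10M tokens in total, with opposite input/output mixes.
input_heavy = cost_usd(input_tokens=9_000_000, output_tokens=1_000_000)
output_heavy = cost_usd(input_tokens=1_000_000, output_tokens=9_000_000)

print(f"input-heavy:  ${input_heavy:.2f}")
print(f"output-heavy: ${output_heavy:.2f}")
```

Under these assumed rates the same 10M-token total costs 32.50 USD for the input-heavy caller and 92.50 USD for the output-heavy one, which a single combined token quota cannot distinguish.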

So far the only way is to build a custom external app, but this requires a lot of effort, and the cost translation is risky because there are also cached tokens to account for and token prices can change.

It would also be great to have an option in the Developer portal so that each subscription can see its quota and consumption as cost.

Is there any other solution?

Rakesh Mishra 10,280 Reputation points Microsoft External Staff Moderator

2026-06-23T12:39:29.31+00:00
Hello Stavros,

Thank you for reaching out on Microsoft Q&A.

You have rightly identified a current limitation. The <llm-token-limit> policy is designed exclusively for token-based tracking and throttling. According to the official documentation, "The llm-token-limit policy prevents API usage spikes on a per key basis by limiting the consumption of LLM tokens to a specified number per minute." It does not currently support dynamic cost calculations that account for the differing prices of input/output tokens or caching behavior.

Additionally, the APIM Developer Portal does not currently have a native feature to display financial consumption (in USD/EUR) per subscription key.

To achieve cost visibility and enforcement without building a complex external application from scratch, the recommended architecture combines Azure Cost Management with automated enforcement workflows.

Recommended Architecture

Enforce Token Limits (Real-time): Continue using the <llm-token-limit> policy to establish an upper bound on token consumption per subscription to prevent sudden abuse.

Track Monetary Budgets: Utilize Azure Cost Management to monitor actual spend. As noted in the documentation: "You can create budgets to manage costs and create alerts that automatically notify stakeholders of spending anomalies and overspending risks."

Automate Cost Enforcement (Low-Code): To enforce a hard monetary limit, you do not need a custom application. Instead, configure an Azure Budget Alert to trigger an Action Group. You can route this Action Group to an Azure Logic App or Azure Function that automatically suspends the specific APIM subscription key once the budget is breached.

While this introduces a slight delay (as Cost Management evaluation is not real-time per millisecond like APIM policies), it provides a native, serverless way to enforce financial limits without managing custom caching or token-to-cost translation logic.

Let me know if you would like detailed steps on how to configure an Action Group to automatically suspend an APIM subscription. Also, if you like, you can provide feedback at https://feedback.azure.com or on GitHub, and follow the product roadmap on Azure Updates (https://azure.microsoft.com/updates/).

Stavros Koureas 16 Reputation points

2026-07-01T09:54:47.65+00:00
Hi @Rakesh Mishra, I think you are mixing two things here: Azure Foundry cost monitoring and alerts on one side, and controlling token consumption in APIM on the other. Because the llm-token-limit policy counts input and output tokens together, setting it to, for example, 10M tokens per month produces two major issues for text models:

users consuming a lot of input tokens will be throttled very soon, even though input is cheap (for example with GPT models)

users consuming a lot of output tokens will cost more than expected (for example with Claude models)

It also brings more challenges for other model types:

image, which charges per token and per image

audio, which charges per token and per second

video, which charges per token and per second

For this reason I have created custom policies that translate consumption into cost, but this increases the maintenance for each API and model, and the cost calculation still lacks details such as cached text tokens, image sizes, or video sizes.
https://github.com/koureasstavros/AzureAIGatewayCostMechanism

Answer 1

Jerald Felix 15,370 Volunteer Moderator

Hello Stavros Koureas,

Greetings! Thanks for raising this question in the Q&A forum.

You have correctly identified a real gap in the llm-token-limit policy. It enforces a raw token count quota but has no awareness of model-specific pricing, so it cannot translate consumption into actual cost or apply separate input/output token budgets. There is no native cost-budget policy in APIM today, but there is a practical way to get close using the existing policy set without building a fully external app.

Here is the approach that covers both of your gaps:

Separate input and output token tracking using the variables already exposed by llm-token-limit. The tokens-consumed-variable-name captures total tokens after the response returns, and you can read the usage.prompt_tokens and usage.completion_tokens fields from the response body in an outbound policy to split them. Add this in your outbound section:

<!-- preserveContent: true keeps the body available for later reads and for the client -->
<set-variable name="promptTokens" value="@(((IResponse)context.Response).Body.As<JObject>(preserveContent: true)["usage"]["prompt_tokens"].Value<int>())" />
<set-variable name="completionTokens" value="@(((IResponse)context.Response).Body.As<JObject>(preserveContent: true)["usage"]["completion_tokens"].Value<int>())" />

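Compute an approximate cost inline using a policy expression. Once you have prompt and completion tokens separated, multiply each by the per-token rate for your deployed model. For example, for gpt-4o at the rates assumed here (USD 2.50 per 1M input tokens and USD 10.00 per 1M output tokens; check these against the current price list):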

<set-variable name="estimatedCost" value="@{
    int prompt = (int)context.Variables["promptTokens"];
    int completion = (int)context.Variables["completionTokens"];
    double cost = (prompt / 1000000.0 * 2.50) + (completion / 1000000.0 * 10.00);
    return cost.ToString("F6");
}" />

You would update the rates in this expression whenever Microsoft changes model pricing. Yes, this requires a policy update when prices change, but it is far simpler than maintaining a full external pricing service.

Emit the cost and split token counts to Application Insights using llm-emit-token-metric with custom dimensions. The llm-emit-token-metric policy sends custom metrics to Application Insights about LLM token consumption and in preview now includes cached, reasoning, and thinking token categories in addition to prompt and completion tokens. Add it like this:

<llm-emit-token-metric namespace="CostTracking">
    <dimension name="SubscriptionId" value="@(context.Subscription?.Id)" />
    <dimension name="PromptTokens" value="@(context.Variables["promptTokens"].ToString())" />
    <dimension name="CompletionTokens" value="@(context.Variables["completionTokens"].ToString())" />
    <dimension name="EstimatedCostUSD" value="@(context.Variables["estimatedCost"].ToString())" />
</llm-emit-token-metric>

Note that Azure Monitor currently limits you to 10 dimension keys per metric and 50,000 total active time series per region in a 12-hour period, so plan your dimensions carefully.

Build a cost budget enforcement gate using a named value or external cache. Store a monthly cost budget per subscription key in APIM's named values or in an Azure Cache for Redis entry. In your inbound policy, read the accumulated cost for the current subscription and return a 403 if the budget is exceeded:

<cache-lookup-value key="@($"cost:{context.Subscription?.Key}:{DateTime.UtcNow:yyyy-MM}")" variable-name="accumulatedCost" />
<choose>
    <when condition="@((double)context.Variables.GetValueOrDefault("accumulatedCost", 0.0) >= 50.0)">
        <return-response>
            <set-status code="403" reason="Monthly cost budget exceeded" />
        </return-response>
    </when>
</choose>

Then in the outbound policy, increment and store the updated cost back to cache after each successful call.
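
The check-then-accumulate flow described above can be sketched outside APIM as follows. This is a minimal Python illustration, not policy code: the in-memory dictionary stands in for the APIM cache or Redis, and the function names and the `sub-a` identifier are hypothetical. The key format mirrors the `cost:<subscription>:<yyyy-MM>` key used in the inbound policy.

```python
from datetime import datetime, timezone

MONTHLY_BUDGET_USD = 50.0  # same threshold as the inbound policy above

# Stand-in for the APIM cache / Redis: "cost:<subscription>:<yyyy-MM>" -> USD spent.
cost_store: dict[str, float] = {}


def budget_key(subscription: str, now: datetime) -> str:
    """Month-scoped key, so a new month starts from zero without a reset job."""
    return f"cost:{subscription}:{now:%Y-%m}"


def is_over_budget(subscription: str, now: datetime) -> bool:
    """Inbound check: block once the accumulated cost has reached the budget."""
    return cost_store.get(budget_key(subscription, now), 0.0) >= MONTHLY_BUDGET_USD


def record_cost(subscription: str, cost_usd: float, now: datetime) -> float:
    """Outbound step: add this call's estimated cost to the running total."""
    key = budget_key(subscription, now)
    cost_store[key] = cost_store.get(key, 0.0) + cost_usd
    return cost_store[key]


june = datetime(2026, 6, 30, tzinfo=timezone.utc)
july = datetime(2026, 7, 1, tzinfo=timezone.utc)

record_cost("sub-a", 49.0, june)
print(is_over_budget("sub-a", june))  # total is 49.0, under the budget
record_cost("sub-a", 1.5, june)
print(is_over_budget("sub-a", june))  # total is 50.5, at or over the budget
print(is_over_budget("sub-a", july))  # new month, new key, nothing spent yet
```

Note that this sketch ignores concurrency. With a shared cache, a read followed by a write can lose updates when several calls for the same subscription finish at the same time, so the accumulated figure should be treated as approximate.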

For the Developer Portal quota visibility you mentioned, this is not natively supported today as a cost view. APIM does emit token metrics via the llm-emit-token-metric policy and you can add custom dimensions to filter the metric in Azure Monitor, so you can build an Azure Workbook or Application Insights dashboard that shows per-subscription token consumption and estimated cost and share that link with your API consumers through the Developer Portal's custom content pages.

At Build 2026, Microsoft expanded token metrics to track reasoning, cached, and audio tokens across providers, which helps FinOps teams building cost dashboards and budget alerts capture how current models actually behave.

If this answer helps, kindly accept it so that it can help others who have similar questions.

Best Regards,

Jerald Felix.

Stavros Koureas 16

Hi @Jerald Felix ,

Thank you for the very informative details.

I have created and tested a full policy according to your details that can limit input and output tokens.

One correction: according to my tests, the response stream must be read only once. A second read returns an empty body because the stream has already been consumed, which results in a 500 error.

<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <!-- 1. Read body ONCE into a variable -->
        <set-variable name="responseBody" value="@{
            return ((IResponse)context.Response).Body.As<JObject>(preserveContent: true);
        }" />
        <!-- 2. Extract token usage from the cached variable -->
        <set-variable name="actualInputTokens" value="@{
            return (int)((JObject)context.Variables["responseBody"])["usage"]["input_tokens"];
        }" />
        <set-variable name="actualOutputTokens" value="@{
            return (int)((JObject)context.Variables["responseBody"])["usage"]["output_tokens"];
        }" />
        <!-- 3. Load existing counters from cache -->
        <cache-lookup-value key="@($"tokens/{context.Subscription.Key}/input")" variable-name="storedInputTokens" />
        <cache-lookup-value key="@($"tokens/{context.Subscription.Key}/output")" variable-name="storedOutputTokens" />
        <!-- 4. Increment counters -->
        <set-variable name="newInputTotal" value="@{
            int previous = 0;
            if (context.Variables.ContainsKey("storedInputTokens")) {
                previous = (int)context.Variables["storedInputTokens"];
            }
            int used = (int)context.Variables["actualInputTokens"];
            return previous + used;
        }" />
        <set-variable name="newOutputTotal" value="@{
            int previous = 0;
            if (context.Variables.ContainsKey("storedOutputTokens")) {
                previous = (int)context.Variables["storedOutputTokens"];
            }
            int used = (int)context.Variables["actualOutputTokens"];
            return previous + used;
        }" />
        <!-- 5. Store updated counters back into cache (30 days) -->
        <cache-store-value key="@($"tokens/{context.Subscription.Key}/input")" value="@((int)context.Variables["newInputTotal"])" duration="2592000" />
        <cache-store-value key="@($"tokens/{context.Subscription.Key}/output")" value="@((int)context.Variables["newOutputTotal"])" duration="2592000" />
        <!-- 6. Enforce limits -->
        <choose>
            <!-- Input token limit -->
            <when condition="@((int)context.Variables["newInputTotal"] > 1000)">
                <return-response>
                    <set-status code="429" reason="Input Token Limit Exceeded" />
                    <set-body>Input token quota exceeded</set-body>
                </return-response>
            </when>
            <!-- Output token limit -->
            <when condition="@((int)context.Variables["newOutputTotal"] > 20000)">
                <return-response>
                    <set-status code="429" reason="Output Token Limit Exceeded" />
                    <set-body>Output token quota exceeded</set-body>
                </return-response>
            </when>
        </choose>
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

This can be enhanced further with cached input and cached output tokens. But with multiple deployments, how can I translate this into cost? Do I have to write these multiplications for every possible deployment?
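
One possible way to avoid hand-writing a multiplication per deployment is to keep the rates in a table keyed by deployment and run a single generic calculation over whatever token categories the response reports. The sketch below is plain Python rather than policy code; the deployment names, token category names, and rates are all hypothetical placeholders.

```python
# Hypothetical deployment names and placeholder rates (USD per 1M tokens).
RATES_PER_M = {
    "deployment-a": {"input": 2.50, "cached_input": 1.25, "output": 10.00},
    "deployment-b": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}


def estimate_cost(deployment: str, usage: dict[str, int]) -> float:
    """Sum tokens x rate for every category present in `usage`.

    Raises KeyError for an unknown deployment or token category, so a missing
    rate fails loudly instead of being silently billed as zero.
    """
    rates = RATES_PER_M[deployment]
    total = sum(tokens * rates[category] for category, tokens in usage.items())
    return total / 1_000_000


usage = {"input": 800_000, "cached_input": 200_000, "output": 100_000}
for name in RATES_PER_M:
    print(name, round(estimate_cost(name, usage), 4))
```

With this shape, adding a deployment or a new token category means adding a row or a column to the table rather than another expression. The table itself still has to be updated whenever prices change.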

Stavros Koureas 16 Reputation points

2026-06-28T14:57:53.3466667+00:00

Hi @Jerald Felix ,

I just noticed that even when we read the response stream only once, several applications that rely on streaming, such as GitHub Copilot in VS Code, stop working with the following line:

<set-variable name="responseBody" value="@{ return ((IResponse)context.Response).Body.As<JObject>(preserveContent: true); }" />
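
For context on why a whole-body JSON read does not fit streaming: a streamed response arrives as a sequence of server-sent events rather than one JSON document. The Python sketch below shows one way to pick the usage figures out of such a stream. It assumes OpenAI-style chat completion chunks, where a late `data:` event carries a non-null `usage` object (when the client requests it) and `data: [DONE]` ends the stream. It is not APIM policy code, and the function name and sample events are hypothetical.

```python
import json
from typing import Iterable, Optional


def usage_from_sse(lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null `usage` object seen in an SSE stream, if any."""
    usage = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Hypothetical sample stream: one content chunk, one usage chunk, then the end marker.
stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}], "usage": null}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 3}}',
    "",
    "data: [DONE]",
]
print(usage_from_sse(stream))
```
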
Stavros Koureas 16 Reputation points

2026-07-01T09:58:17.7633333+00:00

Hi @Jerald Felix ,

I managed to find a way to read the stream without consuming it for streaming APIs, and used samples of your code to create policies per API.

I have created these custom policies to translate consumption into cost, but this increases the maintenance for each API and model, and the cost calculation still lacks details such as cached text tokens, image sizes, or video sizes. I also have scripts for a consolidated cost report and code for a user cost widget.

It would be nice to have this cost dimension provided by an internal Azure mechanism in APIM for calls to backends that charge per API call. I think Microsoft could implement this, as they already do it in Azure Foundry, but for all requests.

https://github.com/koureasstavros/AzureAIGatewayCostMechanism
