Deploy a flow to online endpoint for real-time inference with CLI

Warning

Prompt flow in Microsoft Foundry and Azure Machine Learning will be retired on April 20, 2027. Prompt flow is no longer recommended for new development. Migrate existing Prompt flow applications and deployments to Microsoft Agent Framework before April 20, 2027.

Prompt flow container images are no longer receiving updates, including security and package updates. This applies to Prompt flow runtime images, including promptflow-runtime, promptflow-runtime-stable, and promptflow-python.

After April 20, 2027, Prompt flow, including the web authoring experience in Microsoft Foundry and Azure Machine Learning, the VS Code extensions, and related Prompt flow container images, will no longer be supported or available.

If your application depends on Prompt flow deployments or runtime images, plan to move those workloads to supported alternatives such as Microsoft Agent Framework before the retirement date. For migration guidance, see the Prompt flow migration guide and migration code samples.

In this article, you learn how to deploy your flow to a managed online endpoint or a Kubernetes online endpoint for use in real-time inferencing by using Azure Machine Learning v2 CLI.

Before you begin, test your flow thoroughly and confirm that it's ready to be deployed to production. To learn more about testing your flow, see test your flow. After testing your flow, you learn how to create a managed online endpoint and deployment, and how to use the endpoint for real-time inferencing.

This article covers the CLI experience.
The Python SDK isn't covered in this article. See the GitHub sample notebook instead. To use the Python SDK, you must have the Python SDK v2 for Azure Machine Learning. To learn more, see Install the Python SDK v2 for Azure Machine Learning.

Important

Items marked (preview) in this article are currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

The Azure CLI and the Azure Machine Learning extension to the Azure CLI. For more information, see Install, set up, and use the CLI (v2).
An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart: Create workspace resources article to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*". If you use studio to create and manage online endpoints and deployments, you need an additional permission, "Microsoft.Resources/deployments/write", from the resource group owner. For more information, see Manage access to an Azure Machine Learning workspace.

Note

Managed online endpoints only support managed virtual networks. If your workspace is in a custom virtual network, you can deploy to a Kubernetes online endpoint, or deploy to other platforms such as Docker.

Virtual machine quota allocation for deployment

For managed online endpoints, Azure Machine Learning reserves 20% of your compute resources for performing upgrades. Therefore, if you request a given number of instances in a deployment, you must have a quota for ceil(1.2 * number of instances requested for deployment) * number of cores for the VM SKU available to avoid getting an error. For example, if you request 10 instances of a Standard_DS3_v2 VM (which comes with four cores) in a deployment, you need a quota of 48 cores (12 instances * 4 cores) available. To view your usage and request quota increases, see View your usage and quotas in the Azure portal.
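The quota formula above can be checked with a small shell helper. This is a minimal sketch: `required_cores` is a hypothetical function name, not part of the Azure CLI, and it only reproduces the arithmetic from this section by using integer math.

```shell
# Sketch: cores of quota needed for a managed online deployment.
# required_cores is a hypothetical helper, not an Azure CLI command.
# Usage: required_cores <instance_count> <cores_per_vm>
required_cores() {
  local instances=$1 cores_per_vm=$2
  # ceil(1.2 * instances) with integer arithmetic: (12 * n + 9) / 10
  local reserved_instances=$(( (12 * instances + 9) / 10 ))
  echo $(( reserved_instances * cores_per_vm ))
}

required_cores 10 4   # prints 48 (12 instances * 4 cores)
```

Compare the result with the available quota for the VM family before you create the deployment.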

Get the flow ready for deploy

Each flow has a folder that contains the code, prompts, definition, and other artifacts of the flow. If you develop your flow by using the UI, you can download the flow folder from the flow details page. If you develop your flow by using the CLI or SDK, you already have the flow folder.

This article uses the sample flow "basic-chat" as an example to deploy to an Azure Machine Learning managed online endpoint.

Important

If you use additional_includes in your flow, first run pf flow build --source <path-to-flow> --output <output-path> --format docker to get a resolved version of the flow folder.

Set default workspace

Use the following commands to set the default workspace and resource group for the CLI.

az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

Register the flow as a model (optional)

In the online deployment, you can either refer to a registered model or specify the model path (where to upload the model files from) inline. To refer to a registered model, register the model first, and then specify the model name and version in the deployment definition by using the form model:<model_name>:<version>.

The following example shows a model definition for a chat flow.

Note

If your flow isn't a chat flow, you don't need to add these properties.

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: basic-chat-model
path: ../../../../examples/flows/chat/basic-chat
description: register basic chat flow folder as a custom model
properties:
  # In the Azure Machine Learning studio UI, the Test tab on the endpoint details page needs this property to know the endpoint comes from prompt flow
  azureml.promptflow.source_flow_id: basic-chat
  
  # Following are properties only for chat flow 
  # endpoint detail UI Test tab needs this property to know it's a chat flow
  azureml.promptflow.mode: chat
  # endpoint detail UI Test tab needs this property to know which is the input column for chat flow
  azureml.promptflow.chat_input: question
  # endpoint detail UI Test tab needs this property to know which is the output column for chat flow
  azureml.promptflow.chat_output: answer

Use az ml model create --file model.yaml to register the model to your workspace.

Define the endpoint

To define an endpoint, specify the following values:

Endpoint name: The name of the endpoint. It must be unique in the Azure region. For more information on the naming rules, see endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose between key-based authentication and Azure Machine Learning token-based authentication. A key doesn't expire, but a token does expire. For more information on authenticating, see Authenticate to an online endpoint.
Optionally, add a description and tags to your endpoint.
If you want to deploy to a Kubernetes cluster (AKS or Arc enabled cluster) that you attach to your workspace, you can deploy the flow as a Kubernetes online endpoint.

The following example shows an endpoint definition that uses system-assigned identity by default.

The first definition is for a managed online endpoint, and the second is for a Kubernetes online endpoint.

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: basic-chat-endpoint
auth_mode: key
properties:
# this property only works for system-assigned identity.
# if the deploy user has access to connection secrets, 
# the endpoint system-assigned identity will be auto-assigned connection secrets reader role as well
  enforce_access_to_default_secret_stores: enabled

$schema: https://azuremlschemas.azureedge.net/latest/kubernetesOnlineEndpoint.schema.json
name: basic-chat-endpoint
compute: azureml:<Kubernetes compute name>
auth_mode: key

Important

Key	Description
`$schema`	(Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser.
`name`	The name of the endpoint.
`auth_mode`	Use `key` for key-based authentication. Use `aml_token` for Azure Machine Learning token-based authentication. To get the most recent token, use the `az ml online-endpoint get-credentials` command.
`property: enforce_access_to_default_secret_stores` (preview)	- By default, the endpoint uses a system-assigned identity. This property only works for system-assigned identity. - If you have the connection secrets reader permission, this property means the endpoint's system-assigned identity is automatically assigned the Azure Machine Learning Workspace Connection Secrets Reader role of the workspace, so that the endpoint can access connections correctly when performing inferencing. - By default, this property is `disabled`.

If you create a Kubernetes online endpoint, you need to specify the following attributes:

Key	Description
`compute`	The Kubernetes compute target to deploy the endpoint to.

For more endpoint configuration options, see managed online endpoint schema.

Important

If your flow uses connections with Microsoft Entra ID based authentication, you must grant the endpoint's managed identity the appropriate roles on the corresponding resources so that it can make API calls to those resources. This requirement applies whether you use a system-assigned or a user-assigned identity. For example, if your Azure OpenAI connection uses Microsoft Entra ID based authentication, grant your endpoint's managed identity the Cognitive Services OpenAI User or Cognitive Services OpenAI Contributor role on the corresponding Azure OpenAI resources.

Use user-assigned identity

By default, when you create an online endpoint, the system automatically generates a system-assigned managed identity for you. You can also specify an existing user-assigned managed identity for the endpoint.

To use a user-assigned identity, specify the following attributes in the endpoint.yaml file:

identity:
  type: user_assigned
  user_assigned_identities:
    - resource_id: user_identity_ARM_id_place_holder

Also, specify the Client ID of the user-assigned identity under environment_variables in the deployment.yaml file as shown in the following example. You can find the Client ID in the Overview of the managed identity in the Azure portal.

environment_variables:
  AZURE_CLIENT_ID: <client_id_of_your_user_assigned_identity>

Important

You need to give the following permissions to the user-assigned identity before creating the endpoint so that it can access the Azure resources to perform inference. For more information, see how to grant permissions to your endpoint identity.

Scope	Role	Why it's needed
Azure Machine Learning Workspace	Azure Machine Learning Workspace Connection Secrets Reader role OR a customized role with "Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action"	Get workspace connections
Workspace container registry	ACR pull	Pull container image
Workspace default storage	Storage Blob Data Reader	Load model from storage
(Optional) Azure Machine Learning Workspace	Workspace metrics writer	After you deploy the endpoint, if you want to monitor the endpoint related metrics like CPU/GPU/Disk/Memory utilization, you need to give this permission to the identity.

Define the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing.

The following example shows a deployment definition. The model section refers to the registered flow model. You can also specify the flow model path inline.

The first definition is for a managed online endpoint, and the second is for a Kubernetes online endpoint.

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: basic-chat-endpoint
model: azureml:basic-chat-model:1
  # You can also specify model files path inline
  # path: examples/flows/chat/basic-chat
environment: 
  image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
  # inference config is used to build a serving container for online deployments
  inference_config:
    liveness_route:
      path: /health
      port: 8080
    readiness_route:
      path: /health
      port: 8080
    scoring_route:
      path: /score
      port: 8080
instance_type: Standard_E16s_v3
instance_count: 1
environment_variables:
  # for pulling connections from workspace
  PRT_CONFIG_OVERRIDE: deployment.subscription_id=<subscription_id>,deployment.resource_group=<resource_group>,deployment.workspace_name=<workspace_name>,deployment.endpoint_name=<endpoint_name>,deployment.deployment_name=<deployment_name>

  # (Optional) When there are multiple fields in the response, using this env variable will filter the fields to expose in the response.
  # For example, if there are 2 flow outputs: "answer", "context", and I only want to have "answer" in the endpoint response, I can set this env variable to '["answer"]'.
  # If you don't set this environment, by default all flow outputs will be included in the endpoint response.
  # PROMPTFLOW_RESPONSE_INCLUDED_FIELDS: '["category", "evidence"]'

$schema: https://azuremlschemas.azureedge.net/latest/kubernetesOnlineDeployment.schema.json
name: blue
type: kubernetes
endpoint_name: basic-chat-endpoint
model: azureml:basic-chat-model:1
  # You can also specify model files path inline
  # path: examples/flows/chat/basic-chat
environment: 
  image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
  # inference config is used to build a serving container for online deployments
  inference_config:
    liveness_route:
      path: /health
      port: 8080
    readiness_route:
      path: /health
      port: 8080
    scoring_route:
      path: /score
      port: 8080
instance_type: <kubernetes custom instance type>
instance_count: 1
environment_variables:

  # for pulling connections from workspace
  PRT_CONFIG_OVERRIDE: deployment.subscription_id=<subscription_id>,deployment.resource_group=<resource_group>,deployment.workspace_name=<workspace_name>,deployment.endpoint_name=<endpoint_name>,deployment.deployment_name=<deployment_name>

  # (Optional) When there are multiple fields in the response, using this env variable will filter the fields to expose in the response.
  # For example, if there are 2 flow outputs: "answer", "context", and I only want to have "answer" in the endpoint response, I can set this env variable to '["answer"]'.
  # If you don't set this environment, by default all flow outputs will be included in the endpoint response.
  # PROMPTFLOW_RESPONSE_INCLUDED_FIELDS: '["category", "evidence"]'

Attribute	Description
Name	The name of the deployment.
Endpoint name	The name of the endpoint to create the deployment under.
Model	The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.
Environment	The environment to host the model and code. It contains: - `image` - `inference_config`: used to build a serving container for online deployments, including `liveness_route`, `readiness_route`, and `scoring_route`.
Instance type	The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list.
Instance count	The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, set the value to at least `3`. The service reserves an extra 20% for performing upgrades. For more information, see limits for online endpoints.
Environment variables	Set the following environment variables for endpoints deployed from a flow: - (required) `PRT_CONFIG_OVERRIDE`: for pulling connections from the workspace - (optional) `PROMPTFLOW_RESPONSE_INCLUDED_FIELDS`: When there are multiple fields in the response, this environment variable filters the fields to expose in the response. For example, if there are two flow outputs, "answer" and "context", and you only want "answer" in the endpoint response, set this environment variable to '["answer"]'.
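
The `PRT_CONFIG_OVERRIDE` value is a single comma-separated string, so it's easy to mistype. As a minimal sketch, the following hypothetical helper (`prt_config_override` isn't part of any Azure tooling) assembles the string in the format shown in the deployment definitions in this article. The argument values in the usage line are placeholders.

```shell
# Sketch: build the PRT_CONFIG_OVERRIDE value used in the deployment YAML.
# prt_config_override is a hypothetical helper name.
# Usage: prt_config_override <subscription_id> <resource_group> <workspace_name> <endpoint_name> <deployment_name>
prt_config_override() {
  local subscription_id=$1 resource_group=$2 workspace_name=$3
  local endpoint_name=$4 deployment_name=$5
  printf 'deployment.subscription_id=%s,deployment.resource_group=%s,deployment.workspace_name=%s,deployment.endpoint_name=%s,deployment.deployment_name=%s\n' \
    "$subscription_id" "$resource_group" "$workspace_name" \
    "$endpoint_name" "$deployment_name"
}

# Placeholder values for illustration only.
prt_config_override "<subscription_id>" "<resource_group>" "<workspace_name>" basic-chat-endpoint blue
```

Paste the printed string as the value of `PRT_CONFIG_OVERRIDE` under `environment_variables` in the deployment YAML file.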

Important

If your flow folder has a requirements.txt file that contains the dependencies needed to execute the flow, follow the deploy with a custom environment steps to build the custom environment including the dependencies.

If you create a Kubernetes online deployment, specify the following attributes:

Attribute	Description
Type	The type of the deployment. Set the value to `kubernetes`.
Instance type	The instance type you created in your Kubernetes cluster to use for the deployment. It represents the request and limit compute resource of the deployment. For more detail, see Create and manage instance type.

Deploy your online endpoint to Azure

To create the endpoint in the cloud, run the following code:

az ml online-endpoint create --file endpoint.yml

To create the deployment named blue under the endpoint, run the following code:

az ml online-deployment create --file blue-deployment.yml --all-traffic

Note

This deployment might take more than 15 minutes.

Tip

If you prefer not to block your CLI console, add the flag --no-wait to the command. However, this flag stops the interactive display of the deployment status.

Important

The --all-traffic flag in the previous az ml online-deployment create command allocates 100% of the endpoint traffic to the newly created blue deployment. Though this allocation is helpful for development and testing purposes, for production, you might want to open traffic to the new deployment through an explicit command. For example, az ml online-endpoint update -n $ENDPOINT_NAME --traffic "blue=100".

Check status of the endpoint and deployment

To check the status of the endpoint, run the following code:

az ml online-endpoint show -n basic-chat-endpoint

To check the status of the deployment, review its logs by running the following code:

az ml online-deployment get-logs --name blue --endpoint basic-chat-endpoint
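
If you created the deployment without waiting for it to finish, you might want to poll its state from a script. The following is a minimal sketch under some assumptions: the helper names `deployment_state` and `wait_for_deployment` are hypothetical, and the sketch assumes that the `provisioning_state` field returned by `az ml online-deployment show` ends up as `Succeeded` or `Failed`.

```shell
# Sketch: poll the provisioning state of a deployment.
# deployment_state and wait_for_deployment are hypothetical helper names.

# Usage: deployment_state <deployment_name> <endpoint_name>
deployment_state() {
  az ml online-deployment show --name "$1" --endpoint-name "$2" \
    --query provisioning_state --output tsv
}

# Usage: wait_for_deployment <deployment_name> <endpoint_name> [tries] [delay_seconds]
wait_for_deployment() {
  local name=$1 endpoint=$2 tries=${3:-60} delay=${4:-30} state
  while [ "$tries" -gt 0 ]; do
    state=$(deployment_state "$name" "$endpoint")
    case "$state" in
      Succeeded) echo "$state"; return 0 ;;   # assumed terminal success value
      Failed)    echo "$state"; return 1 ;;   # assumed terminal failure value
    esac
    tries=$(( tries - 1 ))
    sleep "$delay"
  done
  echo "timed out"
  return 1
}

# Example (requires a signed-in Azure CLI with default workspace and group set):
# wait_for_deployment blue basic-chat-endpoint
```

If the function reports a failure, the get-logs command shown earlier is the next place to look.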

Invoke the endpoint to score data by using your model

Create a sample-request.json file:

{
  "question": "What is Azure Machine Learning?",
  "chat_history":  []
}

Then invoke the endpoint with the request file:

az ml online-endpoint invoke --name basic-chat-endpoint --request-file sample-request.json

You can also call the endpoint by using an HTTP client, such as curl:

ENDPOINT_KEY=<your-endpoint-key>
ENDPOINT_URI=<your-endpoint-uri>

curl --request POST "$ENDPOINT_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --data '{"question": "What is Azure Machine Learning?", "chat_history":  []}'

Get your endpoint key and your endpoint URI from the Azure Machine Learning workspace in Endpoints > Consume > Basic consumption info.
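
As an alternative to copying the values from the studio, you can read them with the CLI. The following sketch wraps two CLI calls in hypothetical helper functions (`get_scoring_uri` and `get_primary_key`). It assumes key-based authentication (`auth_mode: key`) and that the default workspace and resource group are already configured.

```shell
# Sketch: read the scoring URI and primary key of an endpoint with the CLI.
# get_scoring_uri and get_primary_key are hypothetical helper names.

# Usage: get_scoring_uri <endpoint_name>
get_scoring_uri() {
  az ml online-endpoint show --name "$1" --query scoring_uri --output tsv
}

# Usage: get_primary_key <endpoint_name>   (assumes auth_mode: key)
get_primary_key() {
  az ml online-endpoint get-credentials --name "$1" --query primaryKey --output tsv
}

# Example (requires a signed-in Azure CLI):
# ENDPOINT_URI=$(get_scoring_uri basic-chat-endpoint)
# ENDPOINT_KEY=$(get_primary_key basic-chat-endpoint)
```

With both variables set, the curl command shown earlier works unchanged.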

Advanced configurations

Deploy with different connections from flow development

You might want to override connections of the flow during deployment.

For example, if your flow.dag.yaml file uses a connection named my_connection, you can override it by adding environment variables to the deployment YAML file, as follows:

Option 1: override connection name

environment_variables:
  my_connection: <override_connection_name>

If you want to override a specific field of the connection, you can override by adding environment variables with naming pattern <connection_name>_<field_name>. For example, if your flow uses a connection named my_connection with a configuration key called chat_deployment_name, the serving backend attempts to retrieve chat_deployment_name from the environment variable 'MY_CONNECTION_CHAT_DEPLOYMENT_NAME' by default. If the environment variable isn't set, it uses the original value from the flow definition.
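
To illustrate the naming pattern, the following sketch derives the environment variable name from a connection name and a field name. The helper `connection_env_var` is hypothetical, and the uppercase conversion is inferred from the single example in the preceding paragraph, so treat it as an assumption rather than a documented rule.

```shell
# Sketch: derive the override variable name <connection_name>_<field_name>.
# connection_env_var is a hypothetical helper; uppercasing is inferred from
# the MY_CONNECTION_CHAT_DEPLOYMENT_NAME example.
# Usage: connection_env_var <connection_name> <field_name>
connection_env_var() {
  local connection=$1 field=$2
  printf '%s_%s\n' "$connection" "$field" | tr '[:lower:]' '[:upper:]'
}

connection_env_var my_connection chat_deployment_name   # prints MY_CONNECTION_CHAT_DEPLOYMENT_NAME
```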

Option 2: override by referring to asset

environment_variables:
  my_connection: ${{azureml://connections/<override_connection_name>}}

Note

You can only refer to a connection within the same workspace.

Deploy with a custom environment

This section shows you how to use a Docker build context to specify the environment for your deployment, assuming you have knowledge of Docker and Azure Machine Learning environments.

In your local environment, create a folder named image_build_with_reqirements that contains the following files:
```
|--image_build_with_reqirements
|  |--requirements.txt
|  |--Dockerfile
```
- The requirements.txt file, inherited from the flow folder, tracks the dependencies of the flow.
- The Dockerfile with content similar to the following example:
```
FROM mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
COPY ./requirements.txt .
RUN pip install -r requirements.txt
```

Replace the environment section in the deployment definition YAML file with the following content:

environment: 
  build:
    path: image_build_with_reqirements
    dockerfile_path: Dockerfile
  # deploy prompt flow is BYOC, so we need to specify the inference config
  inference_config:
    liveness_route:
      path: /health
      port: 8080
    readiness_route:
      path: /health
      port: 8080
    scoring_route:
      path: /score
      port: 8080

Use FastAPI serving engine (preview)

By default, prompt flow serving uses the Flask serving engine. Starting from prompt flow SDK version 1.10.0, a FastAPI-based serving engine is also supported. To use the FastAPI serving engine, set the environment variable PROMPTFLOW_SERVING_ENGINE to fastapi.

environment_variables:
  PROMPTFLOW_SERVING_ENGINE: fastapi

Configure concurrency for deployment

When you deploy your flow to online deployment, configure two environment variables for concurrency: PROMPTFLOW_WORKER_NUM and PROMPTFLOW_WORKER_THREADS. You also need to set the max_concurrent_requests_per_instance parameter.

The following example shows how to configure these settings in the deployment.yaml file.

request_settings:
  max_concurrent_requests_per_instance: 10
environment_variables:
  PROMPTFLOW_WORKER_NUM: 4
  PROMPTFLOW_WORKER_THREADS: 1

PROMPTFLOW_WORKER_NUM: This parameter sets the number of workers (processes) that start in one container. The default value equals the number of CPU cores, and the maximum value is twice the number of CPU cores.
PROMPTFLOW_WORKER_THREADS: This parameter sets the number of threads that start in one worker. The default value is 1.

Note

When you set PROMPTFLOW_WORKER_THREADS to a value greater than 1, make sure your flow code is thread-safe.

max_concurrent_requests_per_instance: The maximum number of concurrent requests per instance allowed for the deployment. The default value is 10.

The suggested value for max_concurrent_requests_per_instance depends on your request time:
- If your request time is greater than 200 ms, set max_concurrent_requests_per_instance to PROMPTFLOW_WORKER_NUM * PROMPTFLOW_WORKER_THREADS.
- If your request time is less than or equal to 200 ms, set max_concurrent_requests_per_instance to (1.5-2) * PROMPTFLOW_WORKER_NUM * PROMPTFLOW_WORKER_THREADS. This setting can improve total throughput by allowing some requests to queue on the server side.
- If you're sending cross-region requests, you can change the threshold from 200 ms to 1 s.
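
The guidance above can be expressed as a small calculation. The following sketch uses a hypothetical helper, `suggest_max_concurrency`, that prints either a single value or the 1.5x to 2x range. The optional fourth argument is the threshold in milliseconds, which defaults to 200 and can be set to 1000 for cross-region requests.

```shell
# Sketch: suggest max_concurrent_requests_per_instance from the rules above.
# suggest_max_concurrency is a hypothetical helper name.
# Usage: suggest_max_concurrency <worker_num> <worker_threads> <request_time_ms> [threshold_ms]
suggest_max_concurrency() {
  local workers=$1 threads=$2 request_ms=$3 threshold_ms=${4:-200}
  local base=$(( workers * threads ))
  if [ "$request_ms" -gt "$threshold_ms" ]; then
    # Slow requests: one concurrent request per worker thread.
    echo "$base"
  else
    # Fast requests: 1.5x to 2x, with the lower bound rounded up.
    echo "$(( (3 * base + 1) / 2 ))-$(( 2 * base ))"
  fi
}

suggest_max_concurrency 4 1 500        # prints 4
suggest_max_concurrency 4 1 100        # prints 6-8
suggest_max_concurrency 4 1 500 1000   # prints 6-8 (cross-region threshold)
```

Treat the output as a starting point, and adjust it based on the metrics you observe.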

While tuning these parameters, monitor the following metrics to ensure optimal performance and stability:

Instance CPU and memory utilization for this deployment
Non-200 responses (4xx, 5xx)
- If you receive a 429 response, this status code typically indicates that you need to either retune your concurrency settings following the preceding guide or scale your deployment.
Azure OpenAI throttle status

Monitor endpoints

Collect general metrics

You can view general metrics of online deployment (request numbers, request latency, network bytes, CPU/GPU/Disk/Memory utilization, and more).

Collect tracing data and system metrics during inference time

You can collect tracing data and prompt flow deployment-specific metrics (token consumption, flow latency, and more) during inference time and send them to the workspace-linked Application Insights by adding the property app_insights_enabled: true in the deployment YAML file. For more information, see trace and metrics of prompt flow deployment.

You can also send prompt flow-specific metrics and traces to an Application Insights resource other than the workspace-linked one. To do so, specify the following environment variable in the deployment YAML file. You can find the connection string of your Application Insights resource on its Overview page in the Azure portal.

environment_variables:
  APPLICATIONINSIGHTS_CONNECTION_STRING: <connection_string>

Note

If you only set app_insights_enabled: true but your workspace doesn't have a linked Application Insights, your deployment doesn't fail but no data is collected. If you specify both app_insights_enabled: true and the preceding environment variable at the same time, the tracing data and metrics are sent to workspace linked Application Insights. To specify a different Application Insights, keep only the environment variable.

Common errors

Upstream request timeout issue when consuming the endpoint

This error usually happens because of a timeout. By default, the request_timeout_ms value is 5,000 milliseconds. You can set it to a maximum of 5 minutes, which is 300,000 milliseconds. The following example shows how to specify the request timeout in the deployment YAML file. For more information about the deployment schema, see Managed online deployment schema.

request_settings:
  request_timeout_ms: 300000

Important

The 300,000 ms timeout only works for managed online deployments from prompt flow. The maximum timeout for a non-prompt flow managed online endpoint is 180 seconds.

To indicate that this deployment is from prompt flow, add properties for your model as follows (either inline model specification in the deployment YAML or standalone model specification YAML).

properties:
  # indicate a deployment from prompt flow
  azureml.promptflow.source_flow_id: <value>

Next steps

Learn more about managed online endpoint schema and managed online deployment schema.
Learn more about how to test the endpoint in UI and monitor the endpoint.
Learn more about how to troubleshoot managed online endpoints.
Troubleshoot prompt flow deployments.
To deploy an improved version of your flow by using a safe rollout strategy, see Safe rollout for online endpoints.
Learn more about deploying flows to other platforms, such as a local development service, a Docker container, or Azure App Service.

Last updated on 2026-07-01
