Improving the speed of speech recognition processing in Disconnected Azure Speech Containers

tomoe 110 Reputation points
2026-05-29T09:06:11.2333333+00:00

I am using a disconnected container for Azure Speech.

Please let me know if there is a way to improve the response time of the speech-to-text processing.

Currently, when performing stereo (2-channel) recognition on a single call, the system returns the final result 3 seconds after a 5-second speech segment.

However, in actual operation, multiple calls need to be processed simultaneously, and it can sometimes take 30 seconds or more from speech input to response.

Is there anything that can be improved to enable real-time text conversion?

Upgrading the server specifications and increasing the resources allocated to the containers did not improve performance.

Please let me know if there are any other settings I can adjust.

Thank you.

<server spec> vcpu=48, memory=96

<container> Note: one container is running on one server.

docker run --name "azurestt-container01" \
  -itd \
  --restart always -p 5000:5000 \
  --memory 60g \
  --cpus 42 \
  -v /home/ca-stt/STT/license:/path/to/license/directory \
  -v /home/ca-stt/STT/output:/path/to/output/directory \
  -e Speech:Concurrency=100 \
  -e DECODER_MAX_COUNT=40 \
  -e Eula=accept \
  -e Mounts:License=/path/to/license/directory \
  -e Mounts:Output=/path/to/output/directory \
  -e Logging:Disk:Format=json \
  -e Logging:Disk:LogLevel:Default=Information \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:5.2.0-amd64-ja-jp

Azure Speech in Foundry Tools

Answer accepted by question author

kagiyama yutaka 3,925 Reputation points
2026-05-29T11:41:23.7866667+00:00

I think the adjustment that reduces latency is to lower Speech:Concurrency and DECODER_MAX_COUNT and to run several STT containers in parallel. I have seen nothing to show that putting more load on one container makes it faster.
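
The suggestion above could be sketched as a small script that splits the host among several containers. It is a dry run that only prints the `docker run` commands and does not start anything. The function name `print_container_commands`, the container count (4), the even CPU and memory split, the host ports starting at 5000, and the value 4 for `Speech:Concurrency` and `DECODER_MAX_COUNT` are all illustrative assumptions, not recommended settings. The mounts are copied from the question; whether several containers can share the same license and output directories is not addressed here.

```shell
#!/bin/sh
# Dry run: print one `docker run` command per container instead of executing it.
# All numeric values are illustrative assumptions and should be tuned by measuring latency.

IMAGE="mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:5.2.0-amd64-ja-jp"

# print_container_commands COUNT TOTAL_CPUS TOTAL_MEM_GB DECODERS_PER_CONTAINER
print_container_commands() {
  pcc_count=$1
  pcc_cpus=$(( $2 / $1 ))   # integer share of CPUs per container
  pcc_mem=$(( $3 / $1 ))    # integer share of memory (GB) per container
  pcc_decoders=$4
  pcc_i=1
  while [ "$pcc_i" -le "$pcc_count" ]; do
    pcc_port=$(( 5000 + pcc_i - 1 ))   # host ports 5000, 5001, ...
    pcc_name=$(printf 'azurestt-container%02d' "$pcc_i")
    echo "docker run --name $pcc_name -itd --restart always -p $pcc_port:5000" \
         "--memory ${pcc_mem}g --cpus $pcc_cpus" \
         "-v /home/ca-stt/STT/license:/path/to/license/directory" \
         "-v /home/ca-stt/STT/output:/path/to/output/directory" \
         "-e Speech:Concurrency=$pcc_decoders -e DECODER_MAX_COUNT=$pcc_decoders" \
         "-e Eula=accept" \
         "-e Mounts:License=/path/to/license/directory" \
         "-e Mounts:Output=/path/to/output/directory" \
         "$IMAGE"
    pcc_i=$(( pcc_i + 1 ))
  done
}

# Example: 4 containers sharing the 42 CPUs and 60 GB given to the single container today.
print_container_commands 4 42 60 4
```

With these example inputs, each container gets 10 CPUs and 15 GB, and the containers listen on host ports 5000 to 5003. A load balancer or client-side round-robin would then spread calls across those ports.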


Answer accepted by question author

AI answer

2026-05-29T09:06:33.2366667+00:00

For disconnected Azure Speech to text containers, performance and concurrency are constrained by decoder capacity and CPU/core usage rather than just container CPU limits.

From the documented guidance:

  1. Each recognition is CPU‑intensive
    • For batch/file mode, one recognition typically consumes close to two CPU cores and the decoder itself can process about 3× real time.
    • For microphone/streaming mode, one recognition uses about one core at 1× real time.
    • With stereo (2‑channel) recognition and many concurrent calls, a single container can easily become CPU‑bound even if Docker limits are high.
  2. Recommended concurrency per container
    • For batch processing, guidance is no more than two concurrent recognitions per container instance for reliability and latency.
    • Higher decoder counts (DECODER_MAX_COUNT) and high Speech:Concurrency values increase throughput but also increase contention, which raises latency and reduces real‑time performance.
  3. Scaling pattern
    • To achieve near real‑time behavior with many calls, the recommended pattern is:
      • Keep concurrency per container relatively low (for example, a small number of decoders per container).
      • Run more container instances on the same host or across multiple hosts, and place them behind a load balancer.
    • As an example, 1000 hours of audio were processed within 24 hours using 4–5 VMs and 10 container instances/decoders per VM.
  4. Practical adjustments for this setup
    • Reduce DECODER_MAX_COUNT and Speech:Concurrency to limit concurrent recognitions per container (for example, start with a small number and measure latency).
    • Instead of one large container using most of the host (42 CPUs, 60 GB), run multiple containers on the same 48‑vCPU host, each with a smaller CPU/memory limit and lower concurrency, then distribute calls across them.
    • If more capacity is needed, add more hosts and containers rather than increasing concurrency in a single container.

These changes align with the documented guidance that high concurrency in a single container increases latency and harms real‑time performance, whereas multiple containers with lower per‑container concurrency improve both reliability and response time.
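
As a rough check of the figures quoted above, the following sketch estimates how many stereo calls a given CPU allocation could keep up with in real time. It uses the quoted rates of about one core per streaming recognition and about two cores per batch recognition. It additionally assumes that each stereo call runs as two independent recognitions, one per channel. That per-channel assumption and the function name `max_realtime_calls` are illustrative, not documented behavior.

```shell
#!/bin/sh
# Rough capacity estimate using integer arithmetic only.
# Assumptions: cores per recognition as quoted in the answer above,
# and one stereo call = 2 recognitions (one per channel).

# max_realtime_calls CPUS CHANNELS_PER_CALL CORES_PER_RECOGNITION
max_realtime_calls() {
  echo $(( $1 / ($2 * $3) ))
}

cpus=42                     # --cpus value from the question
configured_concurrency=100  # Speech:Concurrency from the question

streaming_calls=$(max_realtime_calls "$cpus" 2 1)   # about 1 core per recognition
batch_calls=$(max_realtime_calls "$cpus" 2 2)       # about 2 cores per recognition

echo "Stereo calls on $cpus CPUs (streaming figure): about $streaming_calls"
echo "Stereo calls on $cpus CPUs (batch figure): about $batch_calls"
echo "Configured Speech:Concurrency: $configured_concurrency"
```

With the 42 CPUs from the question, this gives about 21 stereo calls under the streaming figure and about 10 under the batch figure, both well below the configured Speech:Concurrency=100. These are back-of-envelope numbers, so measured latency under real load should decide the final settings.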



AI-generated content may be incorrect. Read our transparency notes for more information.
