Improving the speed of speech recognition processing in Disconnected Azure Speech Containers

tomoe 110

I am using a disconnected container for Azure Speech.

Please let me know if there is a way to improve the response time of speech-to-text processing.

The current system returns the final results 3 seconds after a 5-second speech segment when performing stereo recognition processing (2 channels) on a single call.

However, in actual operation, multiple calls need to be processed simultaneously, and it can sometimes take 30 seconds or more from speech input to response.

Is there anything that can be improved to enable real-time text conversion?

Upgrading the server specifications and increasing the resources allocated to the containers did not improve performance.

Please let me know if there are any other settings I can adjust.

Thank you.

<server spec> vcpu=48, memory=96

<container> Note: one container is running on one server.

docker run --name "azurestt-container01" \
  -itd \
  --restart always -p 5000:5000 \
  --memory 60g \
  --cpus 42 \
  -v /home/ca-stt/STT/license:/path/to/license/directory \
  -v /home/ca-stt/STT/output:/path/to/output/directory \
  -e Speech:Concurrency=100 \
  -e DECODER_MAX_COUNT=40 \
  -e Eula=accept \
  -e Mounts:License=/path/to/license/directory \
  -e Mounts:Output=/path/to/output/directory \
  -e Logging:Disk:Format=json \
  -e Logging:Disk:LogLevel:Default=Information \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:5.2.0-amd64-ja-jp

Answer accepted by question author

Answer 1

kagiyama yutaka 3,925

I think the adjustment that reduces latency is to lower Speech:Concurrency and DECODER_MAX_COUNT and run several STT containers in parallel. Nothing in the documentation shows that putting more load on a single container makes it faster.

tomoe 110 Reputation points

2026-06-03T04:36:44.9366667+00:00

Hello, @kagiyama yutaka.

Thank you for your reply.

So, it sounds like the better approach is to design the system to support a distributed container architecture.

My understanding is that this improvement requires a mechanism on the client that sends the requests (the VAD server using the SDK). In a configuration where a single server hosts multiple containers, the client needs to distribute requests across those containers effectively and keep each request tied to its target port during operations such as retries.

I’d like to use the AI’s response as a reference and explore improvements in various areas.
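For reference, the client-side mechanism described in this comment could be sketched roughly as below. This is only an illustration under assumptions: the class name `EndpointPool`, the call IDs, the ports 5000 to 5002, and the `ws://` scheme are all hypothetical, and the sketch does not call the Speech SDK itself. It only decides which host URL a call should use, so that a retry of the same call goes back to the same container.

```python
import itertools
import threading
from typing import Dict, List


class EndpointPool:
    """Round-robin assignment of calls to container endpoints, sticky per call."""

    def __init__(self, ports: List[int], host: str = "localhost") -> None:
        if not ports:
            raise ValueError("at least one port is required")
        # Assumption: each container is published on its own host port.
        self._endpoints = [f"ws://{host}:{port}" for port in ports]
        self._cycle = itertools.cycle(self._endpoints)
        self._assigned: Dict[str, str] = {}
        self._lock = threading.Lock()

    def endpoint_for(self, call_id: str) -> str:
        """Return the endpoint for a call; retries get the same endpoint."""
        with self._lock:
            if call_id not in self._assigned:
                self._assigned[call_id] = next(self._cycle)
            return self._assigned[call_id]

    def release(self, call_id: str) -> None:
        """Forget a finished call so the mapping does not grow without bound."""
        with self._lock:
            self._assigned.pop(call_id, None)


pool = EndpointPool(ports=[5000, 5001, 5002])  # hypothetical ports
first = pool.endpoint_for("call-A")
retry = pool.endpoint_for("call-A")  # same endpoint as `first`
other = pool.endpoint_for("call-B")  # next endpoint in the rotation
```

Plain round-robin ignores how busy each container is. If calls differ a lot in length, choosing the container with the fewest active calls may balance the load better, at the cost of tracking active calls per endpoint.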

Answer 2

For disconnected Azure Speech to text containers, performance and concurrency are constrained by decoder capacity and CPU/core usage rather than just container CPU limits.

From the documented guidance:

Each recognition is CPU-intensive
- For batch/file mode, one recognition typically consumes close to two CPU cores and the decoder itself can process about 3× real time.
- For microphone/streaming mode, one recognition uses about one core at 1× real time.
- With stereo (2-channel) recognition and many concurrent calls, a single container can easily become CPU-bound even if Docker limits are high.

Recommended concurrency per container
- For batch processing, guidance is no more than two concurrent recognitions per container instance for reliability and latency.
- Higher decoder counts (DECODER_MAX_COUNT) and high Speech:Concurrency values increase throughput but also increase contention, which raises latency and reduces real-time performance.

Scaling pattern
- To achieve near real-time behavior with many calls, the recommended pattern is:
  - Keep concurrency per container relatively low (for example, a small number of decoders per container).
  - Run more container instances on the same host or across multiple hosts, and place them behind a load balancer.
- As an example, 1000 hours of audio were processed within 24 hours using 4–5 VMs and 10 container instances/decoders per VM.

Practical adjustments for this setup
- Reduce DECODER_MAX_COUNT and Speech:Concurrency to limit concurrent recognitions per container (for example, start with a small number and measure latency).
- Instead of one large container using most of the host (42 CPUs, 60 GB), run multiple containers on the same 48-vCPU host, each with a smaller CPU/memory limit and lower concurrency, then distribute calls across them.
- If more capacity is needed, add more hosts and containers rather than increasing concurrency in a single container.
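
As a rough sanity check, the per-recognition figures above can be turned into a back-of-the-envelope estimate of how many stereo calls fit on a given number of cores. This is a sketch under assumptions, not a measured result: it assumes each channel of a stereo call counts as one recognition, and the function name `max_realtime_calls` is only illustrative.

```python
def max_realtime_calls(cpu_cores: int,
                       channels_per_call: int = 2,
                       cores_per_recognition: float = 1.0) -> int:
    """Estimate how many calls fit on the given cores without oversubscription."""
    if cpu_cores <= 0 or channels_per_call <= 0 or cores_per_recognition <= 0:
        raise ValueError("all arguments must be positive")
    # Assumption: every channel of a call is an independent recognition.
    cores_per_call = channels_per_call * cores_per_recognition
    return int(cpu_cores // cores_per_call)


# Streaming mode: about one core per recognition (figure quoted above).
streaming_calls = max_realtime_calls(42)
# Batch/file mode: close to two cores per recognition (figure quoted above).
batch_calls = max_realtime_calls(42, cores_per_recognition=2.0)
print(streaming_calls, batch_calls)  # prints "21 10"
```

Under these assumptions, the 42 CPUs given to the single container cover roughly 21 simultaneous stereo calls in streaming mode, which is far below Speech:Concurrency=100.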

These changes align with the documented guidance that high concurrency in a single container increases latency and harms real‑time performance, whereas multiple containers with lower per‑container concurrency improve both reliability and response time.
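
To make the multi-container layout concrete, the sketch below generates one `docker run` command per container from the options used in the question. The container count, the per-container CPU and memory split, the decoder count, the extra host ports, and the per-container output directories are all assumptions chosen for illustration. Whether several disconnected containers may share one license mount is not stated in this thread and should be checked before use. The logging options from the original command are omitted for brevity.

```python
from typing import List

IMAGE = ("mcr.microsoft.com/azure-cognitive-services/speechservices/"
         "speech-to-text:5.2.0-amd64-ja-jp")


def docker_commands(count: int, total_cpus: int, total_memory_gb: int,
                    decoders: int, base_port: int = 5000) -> List[str]:
    """Build one `docker run` command string per container (nothing is executed)."""
    if count <= 0:
        raise ValueError("count must be positive")
    cpus = total_cpus // count          # even split of the CPU budget
    memory = total_memory_gb // count   # even split of the memory budget
    commands = []
    for i in range(count):
        name = f"azurestt-container{i + 1:02d}"
        commands.append(" ".join([
            "docker run", f'--name "{name}"', "-itd", "--restart always",
            f"-p {base_port + i}:5000",   # hypothetical host port per container
            f"--memory {memory}g", f"--cpus {cpus}",
            "-v /home/ca-stt/STT/license:/path/to/license/directory",
            # Hypothetical: a separate output directory per container.
            f"-v /home/ca-stt/STT/output/{name}:/path/to/output/directory",
            f"-e Speech:Concurrency={decoders}",  # assumption: match decoder count
            f"-e DECODER_MAX_COUNT={decoders}",
            "-e Eula=accept",
            "-e Mounts:License=/path/to/license/directory",
            "-e Mounts:Output=/path/to/output/directory",
            IMAGE,
        ]))
    return commands


commands = docker_commands(count=4, total_cpus=40, total_memory_gb=60, decoders=5)
for command in commands:
    print(command)
```

The values passed here (4 containers, 10 CPUs and 15 GB each, 5 decoders) are only a starting point. The guidance above is to start small and measure latency before raising any of them.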
