Hello @Ming-Li Lin
Thank you for the detailed repro and for testing multiple variations.
You’re observing that:
- Paralinguistic tags like [laughter], [coughing], etc. produce no effect
- Styles (e.g., gentle, angry, fearful) result in little to no audible difference
- This is happening with zh-CN voices such as zh-CN-Xiaochen:DragonHDLatestNeural in Azure AI Speech
Expected behavior
1. Paralinguistic tags ([laughter], [coughing], etc.)
- These are not supported SSML constructs in Azure Text-to-Speech
- When included in the input, they are treated as plain text, not as instructions
As a result, no paralinguistic audio (e.g., laughter, cough) will be generated.
Additionally, paralinguistic effects are currently:
- Available only for select voices (primarily English) that explicitly support them
- Not exposed for zh-CN voices, including DragonHD variants
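Because the bracketed tags are treated as plain text, one option is to remove them before sending the text for synthesis. Below is a minimal sketch of that preprocessing step. The tag list comes only from the examples in this thread, and the function name is hypothetical, so adjust both to your input.

```python
import re

# Assumption: tag names taken from the examples in this thread; extend as needed.
PARALINGUISTIC_TAGS = ("laughter", "coughing")

_TAG_RE = re.compile(
    r"\[(?:%s)\]" % "|".join(re.escape(t) for t in PARALINGUISTIC_TAGS),
    re.IGNORECASE,
)


def strip_paralinguistic_tags(text: str) -> str:
    """Remove known bracketed tags and collapse the spacing they leave behind."""
    without_tags = _TAG_RE.sub("", text)
    # Collapse runs of spaces/tabs created by removing a tag between two words.
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    print(strip_paralinguistic_tags("Hello [laughter] world"))
```

Brackets that are not in the tag list are left untouched, so ordinary bracketed text in your content is not affected.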
2. Styles (mstts:express-as)
Styles are supported, but:
- Only those listed in the voice’s styleList are applicable
- The effect is language-, voice-, and content-dependent
For zh-CN voices:
- Style variations may be subtle
- Strong differences are not guaranteed
- The model applies style based on input semantics and context
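To confirm which styles a given voice actually lists, you can query the voices list endpoint and read the style list for that voice. This is a rough sketch, not an official sample: the field names `ShortName` and `StyleList` are as I recall them from the voices list response, so please verify them against what your resource returns. The region and key are placeholders you need to supply.

```python
import json
import urllib.request


def fetch_voices(region: str, key: str) -> list:
    """Download the voice list for a Speech resource (needs a real region and key)."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def supported_styles(voices: list, short_name: str) -> list:
    """Return the styles listed for one voice, or [] if the voice or field is absent."""
    for voice in voices:
        if voice.get("ShortName") == short_name:
            # Assumption: voices without styles simply omit the StyleList field.
            return voice.get("StyleList", [])
    return []
```

For example, `supported_styles(fetch_voices("<your-region>", "<your-key>"), "zh-CN-XiaomoNeural")` should return the styles you can safely pass to mstts:express-as for that voice. An empty list means the voice was not found or lists no styles.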
Why you’re seeing no noticeable difference
- Bracketed tags are not interpreted as instructions
- The input text is very short, so there is limited context for style application
- The content is neutral and not strongly aligned with emotional styles
- DragonHDLatestNeural voices are optimized for naturalness and stability rather than exaggerated expressiveness
Recommendations
Use valid SSML for styles
Ensure styles are applied using <mstts:express-as> with supported values:
<speak version="1.0" xml:lang="zh-CN"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts">
<voice name="zh-CN-XiaomoNeural">
<mstts:express-as style="angry">
你為什麼要這樣做?我真的很生氣!
</mstts:express-as>
</voice>
</speak>
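If you generate the SSML from code, it helps to escape the text and check that the document is well-formed before sending it. Here is a minimal sketch using only the Python standard library. The function name is hypothetical, and the namespace URIs are the ones from the SSML example above plus the standard SSML namespace.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_style_ssml(voice: str, style: str, text: str, lang: str = "zh-CN") -> str:
    """Build an SSML document that wraps `text` in mstts:express-as."""
    ssml = (
        f'<speak version="1.0" xml:lang={quoteattr(lang)} '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, < and > so the text cannot break the markup
        "</mstts:express-as>"
        "</voice>"
        "</speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the result is not well-formed XML
    return ssml


if __name__ == "__main__":
    print(build_style_ssml("zh-CN-XiaomoNeural", "angry", "你為什麼要這樣做?我真的很生氣!"))
```

The returned string can then be passed to whatever you use for synthesis, for example the Speech SDK's `speak_ssml_async` method. The sketch only checks that the XML is well-formed; it does not check that the style is valid for the voice.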
Use longer, expressive content
Provide text with clear emotional context to observe differences.
For paralinguistic sounds
Since zh-CN voices don’t support them directly:
- Use the <audio> tag to insert sound clips
- Or use a voice that explicitly supports these events
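If you go the <audio> route, the bracketed markers in your text can be converted into <audio> elements before synthesis. The sketch below is one possible way to do that. The clip URLs are hypothetical placeholders (as far as I know the clips need to be hosted on an HTTPS endpoint you control), and the function name is made up for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical clip URLs: replace with HTTPS-hosted files you control.
CLIPS = {
    "laughter": "https://example.com/clips/laughter.wav",
    "coughing": "https://example.com/clips/coughing.wav",
}

_MARKER_RE = re.compile(r"\[(\w+)\]")


def tags_to_audio(text: str, clips: dict = CLIPS) -> str:
    """Return SSML body text where known [tag] markers become <audio> elements."""
    parts = []
    pos = 0
    for match in _MARKER_RE.finditer(text):
        src = clips.get(match.group(1).lower())
        if src is None:
            continue  # unknown marker: leave it in the text untouched
        parts.append(escape(text[pos:match.start()]))
        parts.append(f"<audio src={quoteattr(src)}/>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

The returned fragment is meant to be placed inside the <voice> element of your SSML document. Markers without a matching clip stay in the text, so you may still want to remove those separately.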
Please refer to the following documentation:
Improve synthesis with SSML: https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp
Language & voice support (styles & roles): https://docs.microsoft.com/azure/ai-services/speech-service/language-support?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#voice-styles-and-roles
I hope this helps. Do let me know if you have any further queries.
Thank you!