zh-CN voices: mstts:express-as styles and paralinguistic tags produce identical output regardless of value

Ming-Li Lin 20 Reputation points
2026-05-05T19:20:38.75+00:00
curl --location --request POST "https://${SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1" \
--header "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '
<speak version="1.0" xml:lang="en-US" xmlns:mstts="http://www.w3.org/2001/mstts">
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[laughter] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[coughing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[throat_clearing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[breathing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[sighing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[yawning] 拜託別說出去
</voice>
</speak>
' > styleParalingZh.mp3

The paralinguistic tags are not working at all.

According to the wiki page, it should work: Styles are available on all English content for all voices. Style results are strongly relevant to the input content: the model adapts style application based on the semantic meaning of the text. Paralinguistics are available on all voices with all languages.

I tried to fetch voice details, and use the style listed in the styleList. Still no difference.
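For reference, a minimal sketch of that lookup. It assumes the regional `voices/list` endpoint and that each entry in the response carries a `ShortName` and an optional `StyleList` field; `fetch_voices` and `styles_for` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request


def fetch_voices(region, key):
    """Download the voice catalogue for one Speech region (assumed endpoint)."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def styles_for(voices, short_name):
    """Return the StyleList of the entry whose ShortName matches, or []."""
    for voice in voices:
        if voice.get("ShortName") == short_name:
            return voice.get("StyleList", [])
    return []


# Example call (needs a real region and key, so it is left commented out):
# voices = fetch_voices("<region>", "<key>")
# print(styles_for(voices, "zh-CN-Xiaochen:DragonHDLatestNeural"))
```

An empty result means either that the voice was not found under that exact name or that its entry lists no styles, so it is worth printing the matching entry as well when the list comes back empty.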

<speak version="1.0" xml:lang="zh-CN" xmlns:mstts="

makes no difference for all 3 sentences.

Azure Speech in Foundry Tools

Answer accepted by question author

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator
2026-05-06T13:11:49.9366667+00:00

Hello @Ming-Li Lin,

Thank you for the detailed repro and for testing multiple variations.

You’re observing that:

  • Paralinguistic tags like [laughter], [coughing], etc. produce no effect
  • Styles (e.g., gentle, angry, fearful) result in little to no audible difference
  • This happens with zh-CN voices such as zh-CN-Xiaochen:DragonHDLatestNeural in Azure AI Speech

Expected behavior

1. Paralinguistic tags ([laughter], [coughing], etc.)

  • These are not supported SSML constructs in Azure Text-to-Speech
  • When included in the input, they are treated as plain text, not as instructions

As a result, no paralinguistic audio (e.g., laughter, cough) will be generated.

Additionally, paralinguistic effects are currently:

  • Available only for select voices (primarily English) that explicitly support them
  • Not exposed for zh-CN voices, including DragonHD variants

2. Styles (mstts:express-as)

  • Styles are supported, but:
    • Only those listed in the voice’s styleList are applicable
    • The effect is language-, voice-, and content-dependent

For zh-CN voices:

  • Style variations may be subtle
  • Strong differences are not guaranteed
  • The model applies style based on input semantics and context

Why you’re seeing no noticeable difference

  • Bracketed tags are ignored
  • The input text is very short, so there is limited context for style application
  • The content is neutral, not strongly aligned with emotional styles
  • DragonHDLatestNeural voices are optimized for naturalness and stability rather than exaggerated expressiveness

Recommendations

Use valid SSML for styles

Ensure styles are applied using <mstts:express-as> with supported values:

<speak version="1.0" xml:lang="zh-CN"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts">
  <voice name="zh-CN-XiaomoNeural">
    <mstts:express-as style="angry">
      你為什麼要這樣做?我真的很生氣!
    </mstts:express-as>
  </voice>
</speak>

Use longer, expressive content

Provide text with clear emotional context to observe differences.
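To compare styles fairly, it can help to generate one SSML document per style from the same text, so that the style attribute is the only thing that changes between requests. The sketch below is one way to do that; `build_style_ssml` and `style_variants` are hypothetical helpers, and the default `xmlns` value is an assumption taken from the SSML specification rather than from this thread.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # assumed default SSML namespace
MSTTS_NS = "http://www.w3.org/2001/mstts"        # same value as used in this thread


def build_style_ssml(voice, style, text, lang="zh-CN"):
    """Wrap text in mstts:express-as for one voice and one style."""
    ssml = (
        f'<speak version="1.0" xml:lang={quoteattr(lang)} '
        f'xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"
        "</mstts:express-as></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


def style_variants(voice, styles, text):
    """Map each style name to an SSML document that differs only in style."""
    return {style: build_style_ssml(voice, style, text) for style in styles}
```

Each document can then be posted with the same headers as the curl command in the question and saved to its own file, which makes it easier to listen to the results side by side.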

For paralinguistic sounds

Since zh-CN voices don’t support them directly:

  • Use <audio> tag to insert sound clips
  • Or use a voice that explicitly supports these events
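A rough sketch of the <audio> approach follows. It assumes the clip is hosted at an HTTPS URL the service can reach and that the chosen voice accepts the <audio> element; the URL is a placeholder, `build_audio_ssml` is a hypothetical helper, and the SSML documentation should be checked for the supported file formats and clip lengths.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_audio_ssml(voice, clip_url, text, lang="zh-CN"):
    """Play a hosted clip, then speak the text with the given voice."""
    ssml = (
        f'<speak version="1.0" xml:lang={quoteattr(lang)} '
        'xmlns="http://www.w3.org/2001/10/synthesis">'
        f"<voice name={quoteattr(voice)}>"
        f"<audio src={quoteattr(clip_url)}/>"
        f"{escape(text)}"
        "</voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


# Placeholder clip URL; replace it with a file you host yourself.
ssml = build_audio_ssml(
    "zh-CN-Xiaochen:DragonHDLatestNeural",
    "https://example.com/sounds/laughter.wav",
    "拜託別說出去",
)
```

The resulting string can be sent as the request body in place of the bracketed-tag version from the question.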

Please refer to the following:

Improve synthesis with SSML: https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp

Language & voice support (styles & roles): https://docs.microsoft.com/azure/ai-services/speech-service/language-support?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#voice-styles-and-roles

I hope this helps. Do let me know if you have any further queries.

Thank you!


1 additional answer

  1. AI answer

    2026-05-05T19:20:51.16+00:00

    Paralinguistic tags and speaking styles are only honored by voices that explicitly support them. For Chinese (zh-CN), support is voice- and style-specific, and not all Dragon HD / Dragon HD Flash voices implement paralinguistics or the same style set.

    From the language support tables:

    • zh-CN-Xiaochen:DragonHDFlashLatestNeural lists supported styles (cheerful, debating, empathetic, live-commercial, poetry-reading, sad, sorry) and explicitly notes Roles: Not supported.
    • Other zh-CN voices list different style sets, and only some support roles (for example, zh-CN-XiaomoNeural and zh-CN-YunyeNeural support roles like Boy, Girl, YoungAdultFemale, etc.).
    • The documentation for SSML styles and roles shows that paralinguistics and roles are controlled via mstts:express-as and role attributes, not by inserting bracketed tokens like [laughter] in the plain text.

    In the provided SSML, the text uses literal tokens like [laughter] instead of SSML elements. For Azure Speech, paralinguistic effects are not triggered by bracketed words; they require supported SSML constructs (for example, mstts:express-as with a supported style value, or role where the voice supports roles). Since zh-CN-Xiaochen:DragonHDLatestNeural is not listed with paralinguistic support in the tables and the text does not use SSML paralinguistic elements, the synthesized audio will be identical regardless of the bracketed markers.

    To get audible differences:

    1. Use a zh-CN voice that lists the desired styles or roles in the language support table (for example, zh-CN-XiaomoNeural, zh-CN-YunyeNeural, or other zh-CN voices with rich style sets).
    2. Apply styles via mstts:express-as with supported style values (such as cheerful, sad, serious, etc.) and, where available, role attributes as documented in the SSML customization page.
    3. Do not rely on bracketed tokens like [laughter] in the text; instead, use the documented SSML mechanisms for styles, paralinguistics, and roles.

    If a specific Dragon HD voice does not list paralinguistic or role support in the language support tables, style and paralinguistic tags will not change the output for that voice.


    AI-generated content may be incorrect. Read our transparency notes for more information.
