zh-CN voices: mstts:express-as styles and paralinguistic tags produce identical output regardless of value

Question

Ming-Li Lin 20

curl --location --request POST "https://${SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1" \
--header "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '
<speak version="1.0" xml:lang="en-US" xmlns:mstts="http://www.w3.org/2001/mstts">
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[laughter] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[coughing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[throat_clearing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[breathing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[sighing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[yawning] 拜託別說出去
</voice>
</speak>
' > styleParalingZh.mp3

The paralinguistic tags are not working at all.

According to the wiki page, it should work: "Styles are available on all English content for all voices. Style results are strongly relevant to the input content: the model adapts style application based on the semantic meaning of the text. Paralinguistics are available on all voices with all languages."

I tried to fetch voice details, and use the style listed in the styleList. Still no difference.

<speak version="1.0" xml:lang="zh-CN" xmlns:mstts="

makes no difference for all 3 sentences.

Ming-Li Lin 20 Reputation points

2026-05-05T19:39:56.25+00:00
curl --location --request POST "

gentle, angry and fearful are in zh-CN-XiaomoNeural's style list. Still no difference.

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator

2026-05-07T10:56:30.2833333+00:00

Hi @Ming-Li Lin

Did you get a chance to review the above response? Do let me know if you have any further queries.

Thank you!

Ming-Li Lin 20 Reputation points

2026-05-07T18:05:33.9833333+00:00

Hi @SRILAKSHMI C ,

Thanks for the reply. Microsoft needs to update the webpages.

I think Azure TTS has a lot of potential, but I find outdated documents all the time. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#use-speaking-styles-paralinguistics-and-roles specifically says:
"Styles are available on all English content for all voices. Style results are strongly relevant to the input content: the model adapts style application based on the semantic meaning of the text. Paralinguistics are available on all voices with all languages."

Apparently paralinguistics are not supported everywhere, contrary to what the document says.

Also, I found there is some issue with the previous posts.

I also tried this:

<voice name="zh-CN-XiaomoNeural">
  <mstts:express-as style="gentle"> 我找不到女兒的水壺 </mstts:express-as>
</voice>
<voice name="zh-CN-XiaomoNeural">
  <mstts:express-as style="angry"> 我找不到女兒的水壺 </mstts:express-as>
</voice>
<voice name="zh-CN-XiaomoNeural">
  <mstts:express-as style="fearful"> 我找不到女兒的水壺 </mstts:express-as>
</voice>

Those are style tags supported by zh-CN-XiaomoNeural documented in the voice file, and there is still no difference at all when I produce the sound.

It's very frustrating when the documents mention so many features but only 20% of them are working.

-Ming-Li

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator

2026-05-08T12:30:01.3933333+00:00

Hi @Ming-Li Lin

I understand your frustration; I’ll check internally with the engineering team and will get back to you as soon as I have more information.

Thank you!

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator

2026-05-18T11:11:00.31+00:00

Hi @Ming-Li Lin,

Thank you for your patience while we checked this internally with the engineering team.

We received confirmation from the product team that the current paralinguistic and expressive style rendering support is primarily available for en-US voices at this time.

Although the current documentation mentions broader support across voices and languages, the engineering team confirmed that features such as:

- mstts:express-as emotional styles
- paralinguistic tags like [laughter], [coughing], [breathing], etc.

are not fully supported for the tested zh-CN voices currently, including scenarios such as:

- zh-CN-Xiaochen:DragonHDLatestNeural
- zh-CN-XiaomoNeural

Because of this limitation, the generated audio may sound identical even when different styles or paralinguistic tags are provided in the SSML.

At the moment, this is considered a documentation gap rather than a service-side outage. The product team has confirmed they will work on updating the documentation to better clarify the current language support and limitations.

Based on the latest guidance from engineering:

- Full expressive/paralinguistic behavior is currently expected mainly for supported en-US voices.
- zh-CN voices may expose style metadata, but audible differentiation is not guaranteed at this time.
- This is not related to your SSML syntax or API usage.
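
To check this on your own subscription, one option is to synthesize the same sentence once per style and compare the returned audio. The following is only a minimal sketch, assuming the audio bytes have already been downloaded (for example with the curl call from the question); `fingerprint` and `distinct_outputs` are hypothetical helper names, not part of any Azure SDK. Byte-identical output strongly suggests the style was ignored; differing bytes do not by themselves prove an audible difference.

```python
import hashlib


def fingerprint(audio: bytes) -> str:
    """Return a SHA-256 hex digest of one synthesized audio payload."""
    return hashlib.sha256(audio).hexdigest()


def distinct_outputs(audio_by_style: dict[str, bytes]) -> dict[str, list[str]]:
    """Group style names by the fingerprint of the audio they produced."""
    groups: dict[str, list[str]] = {}
    for style, audio in audio_by_style.items():
        groups.setdefault(fingerprint(audio), []).append(style)
    return groups


if __name__ == "__main__":
    # Placeholder bytes stand in for real MP3 responses (assumption).
    samples = {"gentle": b"same", "angry": b"same", "fearful": b"same"}
    groups = distinct_outputs(samples)
    if len(groups) == 1:
        print("All styles produced byte-identical audio.")
    else:
        for digest, styles in groups.items():
            print(digest[:12], styles)
```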

We understand the confusion and frustration caused by the mismatch between the documentation and the observed behavior, and we appreciate you bringing this to attention.

Thank you!

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator

2026-05-19T10:50:53.1133333+00:00

Hi @Ming-Li Lin,

Did you get a chance to review the above response? Do let me know if you have any further queries.

Thank you!

Ming-Li Lin 20 Reputation points

2026-05-20T20:54:20.0533333+00:00

Hi @SRILAKSHMI C ,

Oh well... In the old days, anything that behaved differently from the spec would be called a bug. I guess Azure has different definitions now. External developers building on Azure TTS will need to spend a lot of time testing and scratching their heads, wondering why the functions specified on the website are not working. I really hope you can at least fix the webpages (if not the behavior).

I tested some en-US voices and styles seem to work there. However, if styles don't work for languages other than English, that restricts the market a lot.

The answers were not really great but thanks again for the communication between me and the dev teams. I really appreciate it.

-Ming-Li

SRILAKSHMI C 19,550 Reputation points Microsoft External Staff Moderator

2026-05-27T10:51:41.79+00:00

Hi @Ming-Li Lin

Thank you for the feedback and for taking the time to validate the behavior across multiple voices and languages.

I completely understand your concern regarding the mismatch between the documented capabilities and the actual behavior observed in production. Your feedback is valid, especially from a developer experience perspective, where accurate documentation is critical for implementation and troubleshooting.

I have shared your concerns internally with the product/documentation team, including the impact this can have for developers working with non-English TTS scenarios. The engineering team has acknowledged that the current documentation does not clearly reflect the present language limitations for expressive styles and paralinguistic rendering, particularly for zh-CN voices.

We also appreciate the additional confirmation you provided regarding en-US voices behaving differently, as that aligns with the current backend support scope communicated by the product team.

Thank you again for your patience and for helping highlight this documentation gap. Feedback like yours is valuable for improving both the product experience and the documentation clarity for other developers.

If everything discussed is clear from your side, could you please accept the response so we can proceed with closing the thread? If you have any additional questions in the future, please feel free to reach out.

Thank you!

Answer accepted by question author

SRILAKSHMI C 19,550 Microsoft External Staff Moderator

Hello @Ming-Li Lin

Thank you for the detailed repro and for testing multiple variations.

You’re observing that:

- Paralinguistic tags like [laughter], [coughing], etc. produce no effect
- Styles (e.g., gentle, angry, fearful) result in little to no audible difference
- This is happening with zh-CN voices such as zh-CN-Xiaochen:DragonHDLatestNeural in Azure AI Speech

Expected behavior

1. Paralinguistic tags ([laughter], [coughing], etc.)

- For the zh-CN voices tested, these tags are not interpreted as instructions
- When included in the input, they are treated as plain text
- As a result, no paralinguistic audio (e.g., laughter, cough) is generated

Additionally, paralinguistic effects are currently:

- Available only for select voices (primarily English) that explicitly support them
- Not exposed for zh-CN voices, including DragonHD variants

2. Styles (mstts:express-as)

Styles are supported, but:
- Only those listed in the voice’s styleList are applicable
- The effect is language-, voice-, and content-dependent

For zh-CN voices:
- Style variations may be subtle
- Strong differences are not guaranteed
- The model applies style based on input semantics and context

Why you’re seeing no noticeable difference

- Bracketed tags are ignored
- Input text is very short → limited context for style application
- Content is neutral, not strongly aligned with emotional styles
- DragonHDLatestNeural voices are optimized for naturalness and stability, not exaggerated expressiveness

Recommendations

Use valid SSML for styles

Ensure styles are applied using <mstts:express-as> with supported values:

<speak version="1.0" xml:lang="zh-CN"
       xmlns:mstts="http://www.w3.org/2001/mstts">
  <voice name="zh-CN-XiaomoNeural">
    <mstts:express-as style="angry">
      你為什麼要這樣做？我真的很生氣！
    </mstts:express-as>
  </voice>
</speak>

Use longer, expressive content

Provide text with clear emotional context to observe differences.
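
To try several styles against the same longer sentence, the SSML can be generated rather than hand-written. The sketch below is only an illustration under stated assumptions: `styled_ssml` is a hypothetical helper, the mstts namespace URI is copied from the examples in this thread, and the default `xmlns` is the standard W3C SSML namespace, which the snippets above omit. It builds the document and checks that it is well-formed XML; it does not call the service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace
MSTTS_NS = "http://www.w3.org/2001/mstts"  # as written in this thread


def styled_ssml(voice: str, style: str, text: str, lang: str = "zh-CN") -> str:
    """Wrap text in a single <mstts:express-as> element for one voice."""
    return (
        f'<speak version="1.0" xml:lang={quoteattr(lang)} '
        f'xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"
        "</mstts:express-as></voice></speak>"
    )


if __name__ == "__main__":
    text = "你為什麼要這樣做？我真的很生氣！"
    for style in ("gentle", "angry", "fearful"):
        doc = styled_ssml("zh-CN-XiaomoNeural", style, text)
        ET.fromstring(doc)  # raises ParseError if the SSML is malformed
        print(doc)
```

Each generated document can then be posted as the `--data-raw` body of the curl request shown in the question.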

For paralinguistic sounds

Since zh-CN voices don’t support them directly:

- Use <audio> tag to insert sound clips
- Or use a voice that explicitly supports these events
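
The <audio> workaround could look roughly like the sketch below. It is a hedged illustration only: `voice_with_clip` is a hypothetical helper, the clip URL is a placeholder, and the HTTPS check reflects my understanding that the service expects an HTTPS-hosted clip. Check the SSML documentation for the supported audio formats and length limits before relying on it.

```python
from xml.sax.saxutils import escape, quoteattr


def voice_with_clip(voice: str, clip_url: str, text: str) -> str:
    """Return a <voice> fragment that plays a clip, then speaks the text."""
    if not clip_url.startswith("https://"):
        raise ValueError("clip_url must be an HTTPS URL")
    return (
        f"<voice name={quoteattr(voice)}>"
        f"<audio src={quoteattr(clip_url)}></audio>"
        f"{escape(text)}"
        "</voice>"
    )


if __name__ == "__main__":
    # Placeholder URL (assumption): host your own pre-recorded laughter clip.
    print(
        voice_with_clip(
            "zh-CN-XiaomoNeural",
            "https://example.com/clips/laughter.mp3",
            "拜託別說出去",
        )
    )
```

The fragment goes inside a <speak> element in place of the bracketed [laughter] tag.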

Please refer to these resources:

Improve synthesis with SSML: https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp

Language & voice support (styles & roles): https://docs.microsoft.com/azure/ai-services/speech-service/language-support?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#voice-styles-and-roles

I hope this helps. Do let me know if you have any further queries.

Thank you!

Answer 1

Paralinguistic tags and speaking styles are only honored by voices that explicitly support them. For Chinese (zh-CN), support is voice- and style-specific, and not all Dragon HD / Dragon HD Flash voices implement paralinguistics or the same style set.

From the language support tables:

- zh-CN-Xiaochen:DragonHDFlashLatestNeural lists supported styles (cheerful, debating, empathetic, live-commercial, poetry-reading, sad, sorry) and explicitly notes Roles: Not supported.
- Other zh-CN voices list different style sets, and only some support roles (for example, zh-CN-XiaomoNeural and zh-CN-YunyeNeural support roles like Boy, Girl, YoungAdultFemale, etc.).
- The documentation for SSML styles and roles shows that paralinguistics and roles are controlled via mstts:express-as and role attributes, not by inserting bracketed tokens like [laughter] in the plain text.

In the provided SSML, the text uses literal tokens like [laughter] instead of SSML elements. For Azure Speech, paralinguistic effects are not triggered by bracketed words; they require supported SSML constructs (for example, mstts:express-as with a supported style value, or role where the voice supports roles). Since zh-CN-Xiaochen:DragonHDLatestNeural is not listed with paralinguistic support in the tables and the text does not use SSML paralinguistic elements, the synthesized audio will be identical regardless of the bracketed markers.

To get audible differences:

- Use a zh-CN voice that lists the desired styles or roles in the language support table (for example, zh-CN-XiaomoNeural, zh-CN-YunyeNeural, or other zh-CN voices with rich style sets).
- Apply styles via mstts:express-as with supported style values (such as cheerful, sad, serious, etc.) and, where available, role attributes as documented in the SSML customization page.
- Do not rely on bracketed tokens like [laughter] in the text; instead, use the documented SSML mechanisms for styles, paralinguistics, and roles.

If a specific Dragon HD voice does not list paralinguistic or role support in the language support tables, style and paralinguistic tags will not change the output for that voice.
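
Whether a voice lists a given style can also be checked programmatically against the voices list endpoint, instead of reading the tables by hand. This is a minimal sketch, assuming the response is a JSON array whose entries carry `ShortName` and `StyleList` fields (verify the field names against an actual response); `fetch_voices` and `styles_for` are hypothetical helper names. As noted earlier in the thread, a style appearing in the list does not guarantee an audible difference for zh-CN voices.

```python
import json
import urllib.request


def fetch_voices(region: str, key: str) -> list[dict]:
    """Download the voice catalogue for one Speech region."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    req = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def styles_for(voices: list[dict], short_name: str) -> list[str]:
    """Return the styles listed for a voice, or [] if none are listed."""
    for voice in voices:
        if voice.get("ShortName") == short_name:
            return voice.get("StyleList", [])
    return []


if __name__ == "__main__":
    # Inline sample data (assumption) so the sketch runs without credentials.
    sample = [
        {"ShortName": "zh-CN-XiaomoNeural", "StyleList": ["gentle", "angry", "fearful"]},
        {"ShortName": "zh-CN-ExampleNeural"},  # hypothetical voice with no styles
    ]
    print(styles_for(sample, "zh-CN-XiaomoNeural"))
    print(styles_for(sample, "zh-CN-ExampleNeural"))
```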
