Analysis of Approaches to Paralinguistic Feature Annotation Automation in Russian Speech
DOI: https://doi.org/10.17072/1993-0550-2025-2-101-122

Keywords: automatic annotation, audio annotation, text annotation, paralinguistic characteristics, speech generation

Abstract
The development of speech synthesis systems that allow speech characteristics to be controlled with natural language is of practical interest, since it provides an intuitive way to influence the result of generation. At the same time, for Russian-language data there is a shortage of both such systems and the labeled datasets required to create them. Manual labeling of large datasets is a resource-intensive process that requires not only expert knowledge but also inter-annotator consistency. This makes it relevant to automate the annotation of paralinguistic characteristics of Russian-language speech, which would both unify the labeling already present in available datasets and accelerate its scaling to unlabeled ones.

This article considers the main approaches to annotating such paralinguistic characteristics as pauses, stress, and the pitch and timbre of the voice. Particular attention is paid to reviewing the available software implementations of the methods described.

The key conclusion of the analysis is that a sufficient number of methods already exist for annotating the "basic" characteristics of Russian-language speech. Pauses and fundamental frequency can be extracted by methods that use no linguistic information, while stress can be annotated by neural-network methods that take the context of the utterance into account to resolve stress placement in homographs, reaching accuracy scores as high as 98%. In contrast, the automatic annotation of more complex characteristics, such as timbre and expressed emotions, remains poorly studied. These results indicate the need for further research on methods for the automatic annotation of paralinguistic features in Russian-language speech corpora.
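As a concrete illustration of the "basic" annotations above, the following minimal sketch extracts pauses and fundamental frequency with purely acoustic methods, i.e. without any linguistic information. It uses the librosa library; the input path, the 30 dB silence threshold, the 0.2 s minimum pause length, and the 65-400 Hz pitch range are illustrative assumptions, not values taken from the article.

```python
# Minimal sketch (not the authors' pipeline): acoustic pause and F0 annotation.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Pauses: energy-based segmentation. librosa.effects.split returns sample
# intervals of non-silent audio; the gaps between consecutive intervals
# are candidate pauses.
speech = librosa.effects.split(y, top_db=30)
pauses = [
    (seg_end / sr, next_start / sr)
    for (_, seg_end), (next_start, _) in zip(speech[:-1], speech[1:])
    if (next_start - seg_end) / sr >= 0.2  # assumed minimum pause length, s
]

# Fundamental frequency: the probabilistic YIN (pYIN) estimator;
# 65-400 Hz roughly covers adult speaking pitch.
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)

print("pauses (s):", [(round(a, 2), round(b, 2)) for a, b in pauses])
print("mean F0 (Hz):", float(np.nanmean(f0[voiced])))
```

Dedicated voice activity detectors (e.g. silero-vad or py-webrtcvad) can replace the simple energy threshold when recordings are noisy.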
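For stress annotation, the neural, context-aware tools mentioned above can typically be called in a few lines. The hedged sketch below uses the russtress package; the Accent class and put_stress method follow the project's README and should be treated as an assumption to verify against the installed release.

```python
# Hedged sketch: neural word-stress annotation with russtress.
# The Accent/put_stress API is assumed from the project README.
from russtress import Accent

accent = Accent()
print(accent.put_stress("на дворе трава, на траве дрова"))
# Expected: the stressed vowel of each word is marked, e.g. "на дворе' ..."
```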
Copyright (c) 2025 Евгений Николаевич Радченко, Екатерина Владимировна Исаева

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).