HISTORICAL DOCUMENTS CLASSIFICATION USING BERT: LLM AND HISTORICAL DOMAIN

Authors

  • I. N. Galushko, Moscow State University; National Research University Higher School of Economics

DOI:

https://doi.org/10.17072/2219-3111-2025-2-147-158

Keywords:

text classification, political history, artificial intelligence, attention mechanism analysis, machine learning, BERT, NLP

Abstract

At the present stage of studying Russian history, discussions about processing large collections of historical documents are becoming especially relevant. Digitization of archival collections is actively underway, but in most cases the resulting corpus is simply posted online and remains unused for years. The reason is that processing an entire collection is often difficult: when we gain access to the fonds of a large social institution, the digitized holdings can contain hundreds of thousands of pages of documentation. Limited time makes it impossible to cover all the available documents, even with a quick reading. This problem could be at least partially solved by using LLMs for annotation or for optimizing text search. However, at the current stage of archival development, specialists are only beginning to work with natural language processing methods. The main request of the professional community is to study how artificial intelligence and machine learning models behave on texts from the historical domain. This article is a preliminary study of how modern LLMs interact with historical texts. For the analysis, we chose one of the most popular models, BERT, and one of the most common NLP tasks, classification.
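To make the setting concrete, below is a minimal sketch of the kind of pipeline the study examines: classifying Russian historical documents with a BERT model via the Hugging Face Transformers library (Wolf et al., 2020) and the Russian checkpoint described by Kuratov and Arkhipov (2019). The checkpoint name DeepPavlov/rubert-base-cased, the two document labels, and the example texts are illustrative assumptions, not the article's exact experimental setup.

```python
# Minimal sketch of BERT-based classification of historical documents,
# using Hugging Face Transformers and a Russian BERT checkpoint.
# The checkpoint name, labels, and example texts are illustrative
# assumptions, not the article's exact experimental setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "DeepPavlov/rubert-base-cased"   # assumed Russian BERT checkpoint
LABELS = ["minutes", "resolution"]       # hypothetical document classes

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))       # classification head is randomly
                                         # initialized; needs fine-tuning

texts = [
    "Журнал заседания Особого совещания при Главнокомандующем...",
    "Собрание уполномоченных фабрик и заводов постановило...",
]
# BERT accepts at most 512 subword tokens, so long archival pages
# must be truncated or split into chunks before classification.
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, num_labels)
probs = logits.softmax(dim=-1)
for text, p in zip(texts, probs):
    print(text[:40], {l: round(float(x), 3) for l, x in zip(LABELS, p)})
```

After fine-tuning on labeled archival pages, the same forward pass yields class probabilities per document; attention weights (via output_attentions=True) can then be inspected to study how the model attends to historical-domain vocabulary.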


References

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 4171–4186). https://doi.org/10.18653/v1/N19-1423

Elektronnaya biblioteka istoricheskikh dokumentov [Electronic Library of Historical Documents] (RIO project). http://docs.historyrussia.org/ru/nodes/1-glavnaya

Kuratov, Y., & Arkhipov, M. (2019). Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv. https://doi.org/10.48550/arXiv.1905.07213

Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

Mensheviki v bol’shevistskoy Rossii. 1918–1924. Mensheviki v 1918 godu [Mensheviks in Bolshevik Russia. 1918–1924: The Mensheviks in 1918]. (1999). ROSSPEN.

Ob"edinennoe dvoryanstvo. S"ezdy upolnomochennykh gubernskikh dvoryanskikh obshchestv. 1906–1916 gg. V 3 t. T. 2. 1909–1912 gg. Kn. 2. 1911–1912 gg. [United Nobility: Congresses of Authorized Representatives of Provincial Noble Societies. 1906–1916. In 3 vols. Vol. 2: 1909–1912. Book 2: 1911–1912]. (2001). ROSSPEN.

Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language models as knowledge bases? arXiv. https://doi.org/10.48550/arXiv.1909.01066

Rabochee oppozitsionnoe dvizhenie v bol'shevistskoy Rossii. 1918 g. Sobraniya upolnomochennykh fabrik i zavodov. Dokumenty i materialy [The Workers’ Opposition Movement in Bolshevik Russia, 1918: Meetings of Factory and Plant Delegates. Documents and Materials]. (2006). ROSSPEN.

Volodin, A. (2020). Digital transformation of history? Data, standards, approaches. ISTORIYA, 11(3). https://doi.org/10.18254/S207987840009746-9

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Yang, T.-I., Torget, A., & Mihalcea, R. (2011). Topic modeling on historical newspapers. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 96–104). Association for Computational Linguistics.

Yumasheva, Y. Y. (2022). Historical science, archives, libraries, museums, and artificial intelligence: A year later. Dokument. Arkhiv. Istoriya. Sovremennost’, 22, 217–241.

Zhurnaly zasedaniy Osobogo soveshchaniya pri Glavnokomanduyushchem Vooruzhennymi Silami na Yuge Rossii A. I. Denikine [Minutes of the Meetings of the Special Council under the Commander-in-Chief of the Armed Forces of South Russia, A. I. Denikin]. (2008). ROSSPEN.

Published

2025-06-30

How to Cite

Galushko, I. N. (2025). HISTORICAL DOCUMENTS CLASSIFICATION USING BERT: LLM AND HISTORICAL DOMAIN. PERM UNIVERSITY HERALD. History, (2(69)), 147–158. https://doi.org/10.17072/2219-3111-2025-2-147-158