The Russian Corpus of Linguistic Acceptability (RuCoLA) is a dataset consisting of Russian language sentences with their binary acceptability judgements. It includes expert-written sentences from linguistic publications and machine-generated examples.
The corpus covers a variety of language phenomena, ranging from syntax and semantics to generative model hallucinations. We release RuCoLA to facilitate the development of methods for identifying errors in natural language and create a public leaderboard to track the progress made on this problem.
In recent years, natural language processing systems have rapidly improved on a wide range of tasks, many of which involve concepts as difficult as common sense or even general world knowledge. This trend was enabled by the emergence of large-scale self-supervised pretraining methods, which form the backbone of mainstream language models such as BERT and GPT-3. Such models have surpassed human performance on canonical NLU benchmarks and proved capable of generating texts that are hardly distinguishable from those written by humans.
- Shortcomings of language models
Despite these impressive results, modern language models are still far from perfect, particularly for the Russian language. Although passages from generative models may seem human-like at first glance, they tend to be rife with hallucinated facts or contradictory information. Furthermore, a growing number of studies have reported that even the largest language models do not properly capture various linguistic phenomena and have limited ability to make fine-grained judgments about the correct use of language.
- Why we created RuCoLA
With that in mind, we designed RuCoLA as a benchmark for evaluating the linguistic competence of Russian language models. RuCoLA follows the general concept of linguistic acceptability: unlike grammatical correctness, which concerns the structure of language, acceptability denotes whether a native speaker would consider an utterance natural. Thus, a grammatical sentence can be unacceptable (e.g., "Colorless green ideas sleep furiously"), but an acceptable sentence has to be grammatical. Like GLUE-style and probing benchmarks (e.g., GLUE, Russian SuperGLUE, and RuSentEval), RuCoLA can be used to compare the general language understanding capabilities of neural networks or to analyze and improve the fluency and consistency of text generation models.
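Acceptability classification of this kind is a binary task over (sentence, label) pairs, and CoLA-style benchmarks are commonly scored with the Matthews correlation coefficient (MCC), which stays informative when the two classes are imbalanced. As a minimal sketch (the example sentences and labels below are illustrative, not drawn from the corpus; 1 marks an acceptable sentence), MCC can be computed as follows:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels.

    Returns 0.0 when any confusion-matrix margin is empty
    (the coefficient is undefined in that degenerate case).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy gold labels and model predictions (hypothetical values).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(round(matthews_corrcoef(gold, pred), 3))  # → 0.167
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it a stricter summary than plain accuracy on skewed acceptable/unacceptable splits.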
- Vladislav Mikhailov (SberDevices, Sberbank)
- Tatiana Shamardina (ABBYY)
- Max Ryabinin (Yandex, HSE University)
- Alena Pestova (HSE University)
- Ivan Smurov (ABBYY, MIPT)
- Ekaterina Artemova (HSE University, Huawei Noah's Ark Lab)