The University of Sheffield

EAMT2022 EN-PL Grammatical Agreement Dataset and Models

online resource
posted on 2022-06-23, 10:29 authored by Sebastian Vincent, Carolina Scarton, Loic Barrault

The dataset and model checkpoints are needed to reproduce the results of the EAMT 2022 paper "Controlling Extra-Textual Information About Dialogue Participants: A Case Study of English-to-Polish Neural Machine Translation", Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 121–130.

This data originally comes from the OpenSubtitles18 corpus and the Europarl corpus.


P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

The corpus can be found at the OPUS website. The data was originally sourced from [](


Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Conference Proceedings: The Tenth Machine Translation Summit, 79–86.

Data originally sourced from [](

Direct links:



- English XML files:

- Polish XML files:

- English-to-Polish alignment files:

The models were trained in PyTorch:

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32(NeurIPS).

Full documentation on how to use the resources is included in the GitHub repository, which contains a link to this ORDA page:


UKRI Centre for Doctoral Training in Speech and Language Technologies and their Applications

Engineering and Physical Sciences Research Council




  • There is no personal data or any that requires ethical approval


  • The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

  • The data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • Headings and units are explained in the files


    Department of Computer Science