EAMT2022 EN-PL Grammatical Agreement Dataset and Models
The dataset and model checkpoints are needed to reproduce the results of the EAMT 2022 paper "Controlling Extra-Textual Information About Dialogue Participants: A Case Study of English-to-Polish Neural Machine Translation", in Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 121–130, https://aclanthology.org/2022.eamt-1.15.
This data (data.zip) originally comes from the OpenSubtitles18 corpus and the Europarl corpus.
[P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)](https://aclanthology.org/L16-1147/)
The corpus can be found at the [OPUS website](https://opus.nlpl.eu/OpenSubtitles-v2018.php). The data was originally sourced from [OpenSubtitles.org](http://www.opensubtitles.org/).
[Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Conference Proceedings: The Tenth Machine Translation Summit, 79–86.](https://aclanthology.org/2005.mtsummit-papers.11/)
The data was originally sourced from [statmt.org](https://www.statmt.org/europarl/).
- English XML files:
- Polish XML files:
- English-to-Polish alignment files:
The models (checkpoints.zip) were trained in PyTorch:
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32(NeurIPS).
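Before working through the repository documentation, it can help to check what the two archives actually contain. The following is a minimal sketch using only the Python standard library. It assumes that data.zip and checkpoints.zip have been downloaded into the current working directory; the helper name `list_archive` is hypothetical, and no assumption is made about the internal layout of either archive.

```python
import zipfile
from pathlib import Path


def list_archive(path):
    """Return (name, size_in_bytes) for every file in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]


if __name__ == "__main__":
    # Assumed local file names, taken from the archive names on this page.
    for name in ("data.zip", "checkpoints.zip"):
        if not Path(name).exists():
            print(f"{name}: not found in the current directory")
            continue
        for member, size in list_archive(name):
            print(f"{name}: {member} ({size} bytes)")
```

Listing the members first avoids extracting large checkpoint files just to see how the archive is organised.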
Full documentation on how to use the resources is included in the GitHub repository, which contains a link to this ORDA page:
UKRI Centre for Doctoral Training in Speech and Language Technologies and their Applications
Engineering and Physical Sciences Research Council
Ethics: There is no personal data or any that requires ethical approval
Policy: The data complies with the institution and funders' policies on access and sharing
Sharing and access restrictions: The data can be shared openly
- The file formats are open or commonly used
Methodology, headings and units
- Headings and units are explained in the files