EAMT2022 EN-PL Grammatical Agreement Dataset and Models

online resource
posted on 23.06.2022, 10:29 by Sebastian Vincent, Carolina Scarton, Loic Barrault

The dataset and model checkpoints are needed to reproduce the results of the EAMT 2022 paper Controlling Extra-Textual Information About Dialogue Participants: A Case Study of English-to-Polish Neural Machine Translation, Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 121–130, https://aclanthology.org/2022.eamt-1.15.


The data (data.zip) originally comes from the OpenSubtitles18 and Europarl corpora.


OpenSubtitles18:


[P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)](https://aclanthology.org/L16-1147/)

The corpus can be found at the [OPUS website](https://opus.nlpl.eu/OpenSubtitles-v2018.php). The data was originally sourced from [OpenSubtitles.org](http://www.opensubtitles.org/).



Europarl:


[Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Conference Proceedings: The Tenth Machine Translation Summit, 79–86.](https://aclanthology.org/2005.mtsummit-papers.11/)


Data originally sourced from [statmt.org](https://www.statmt.org/europarl/)


Direct links:

Europarl: https://www.statmt.org/europarl/v7/pl-en.tgz

OpenSubtitles: 

- English XML files:

http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en.zip

- Polish XML files:

http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/pl.zip

- English-to-Polish alignment files:

http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en-pl.xml.gz
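For readers who want to fetch the raw corpora programmatically, the following is a minimal sketch using only the Python standard library. It is not part of the authors' repository: the helper names `local_name` and `download_all` and the `raw_corpora` directory are hypothetical, and the sketch assumes the links above are still live and serve the files directly.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse
from urllib.request import urlretrieve

# Direct links exactly as listed above.
URLS = [
    "https://www.statmt.org/europarl/v7/pl-en.tgz",
    "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en.zip",
    "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/pl.zip",
    "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en-pl.xml.gz",
]


def local_name(url):
    """Derive a local file name from a link (hypothetical helper).

    The OPUS links carry the file path in the `f` query parameter;
    the Europarl link carries it in the URL path itself.
    """
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    target = query["f"][0] if "f" in query else parsed.path
    return Path(target).name


def download_all(dest="raw_corpora"):
    """Download every file into `dest`, skipping files already present."""
    out = Path(dest)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in URLS:
        path = out / local_name(url)
        if not path.exists():
            urlretrieve(url, str(path))
        paths.append(path)
    return paths


# download_all()  # uncomment to fetch; the archives are large
```

The download call is left commented out so that the archives are only fetched deliberately.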


The models (checkpoints.zip) were trained in PyTorch:

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G.,  Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf,  A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S.,  Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An imperative  style, high-performance deep learning library. Advances in Neural  Information Processing Systems, 32(NeurIPS).


Full documentation on how to use the resources is included in the GitHub repository, which contains a link to this ORDA page:

https://github.com/st-vincent1/grammatical_agreement_eamt

Funding

UKRI Centre for Doctoral Training in Speech and Language Technologies and their Applications

Engineering and Physical Sciences Research Council

Ethics

There is no personal data or any data that requires ethical approval

Policy

The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

The data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • Headings and units are explained in the files