Armenian language dataset from CC-100: Monolingual Datasets from Web Crawl Data

Armenian language dataset extracted from the CC-100 research dataset.

Description from the website: This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Commoncrawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
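The file layout described above (one paragraph per line, a blank line between documents) can be read as a stream. The following is a minimal Python sketch, not part of the CC-100 tooling: the function names `iter_documents` and `iter_documents_from_xz` are hypothetical, and the local file name `hy.txt.xz` used in the usage note is an assumption about how the downloaded file is saved.

```python
import lzma
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of paragraphs.

    Assumes the layout described in the corpus description:
    one paragraph per line, and a blank line between documents.
    """
    paragraphs: List[str] = []
    for line in lines:
        text = line.rstrip("\n")
        if text:
            # A non-empty line is one paragraph of the current document.
            paragraphs.append(text)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    # Emit the last document if the input does not end with a blank line.
    if paragraphs:
        yield paragraphs


def iter_documents_from_xz(path: str) -> Iterator[List[str]]:
    """Stream documents from an XZ-compressed UTF-8 text file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        yield from iter_documents(handle)
```

For example, `for doc in iter_documents_from_xz("hy.txt.xz"): ...` would iterate over the documents one at a time, assuming the Armenian file has been downloaded under that name. Streaming avoids decompressing the whole file into memory.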

Data and Resources

Web page with dataset description and data file URLs (HTML)
Includes data files for multiple languages
Armenian language dataset (TXT)
XZ-compressed text file of Armenian language data from CC-100

Additional Info

Field	Value
Source	https://data.statmt.org/cc-100/
Last Updated	April 6, 2023, 14:41 (UTC)
Created	April 6, 2023, 14:39 (UTC)