The Arabic E-Book Corpus

Data citation information

Språkbanken Text (2025). The Arabic E-Book Corpus (updated: 2025-09-12). [Data set]. Språkbanken Text. https://doi.org/10.23695/xwz6-jv19
A collection of 1,745 books in Arabic.

The Arabic E-Book Corpus is a freely available collection of 1,745 books (81.5 million words) published by the Hindawi foundation between 2008 and 2024. The books span various genres, including non-fiction, novels, children's literature, poetry, and plays. The corpus is provided in two versions: HTML and unformatted plain text. The latter is appropriate for most purposes.

For additional detail, see Hallberg, A. (2025). An 81-million-word multi-genre corpus of Arabic books. Data in Brief, 60, 111456.

The corpus is also available for download in HTML format or unformatted plain text.

Accessible through

Access Platform Licence
CC-BY-4.0

Download

File: arabic-ebooks.xml.bz2, corpus information (XML, scrambled)
Size: 142.88 MB
Modified: 2025-09-12
Licence: CC-BY-4.0
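
The download is a bz2-compressed XML file, so it can be read as a stream without unpacking it to disk first. The following is a minimal sketch in Python, using only the standard library. The XML schema is not described on this page, so the sketch makes no assumption about element names and simply tallies whichever tags it encounters. The function name count_elements is hypothetical and chosen for illustration.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Stream a bz2-compressed XML file and count elements by tag name."""
    counts = Counter()
    with bz2.open(path, "rb") as stream:
        # iterparse yields each element once its closing tag has been read,
        # so the whole document never has to be parsed in one go.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            counts[elem.tag] += 1
            # Drop the element's text, attributes and children to limit memory use.
            elem.clear()
    return counts
```

Calling count_elements("arabic-ebooks.xml.bz2") on the downloaded file should give a first overview of the structure, which may then be compared with the token and sentence counts listed under Size.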

Type

  • Corpus

Language

Arabic

Size

Tokens: 76,486,597
Sentences: 3,629,107

Created

2025-09-12

Updated

2025-09-12

Contact

sb-info@svenska.gu.se