The Arabic E-Book Corpus

Data citation

Språkbanken Text (2025). The Arabic E-Book Corpus (updated: 2025-09-12). [Data set]. Språkbanken Text. https://doi.org/10.23695/xwz6-jv19
A collection of 1,745 books in Arabic.

The Arabic E-Book Corpus is a freely available collection of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books span various genres, including non-fiction, novels, children's literature, poetry, and plays. The corpus is provided in two versions: HTML and unformatted plain text. The plain-text version is appropriate for most purposes.

For additional detail, see Hallberg, A. (2025). An 81-million-word multi-genre corpus of Arabic books. Data in Brief, 60, 111456.

The corpus is also available for download in HTML format or unformatted plain text.

Accessible through

Access Platform Licence
CC BY 4.0

Download

File: arabic-ebooks.xml.bz2 (this file contains a scrambled version of the corpus)
Size: 142.88 MB
Modified: 2025-09-12
Licence: CC BY 4.0
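
The download is a bzip2-compressed XML file, so it can be read as a stream without first unpacking it to disk. The sketch below is one possible way to do that with the Python standard library. It assumes the file has already been downloaded locally, and it makes no assumption about the XML schema, which this page does not describe: it only counts how often each element name occurs. The function name `count_elements` and its `limit` parameter are hypothetical, chosen for illustration.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path, limit=None):
    """Stream a bz2-compressed XML file and count element names.

    Hypothetical helper: it does not rely on any particular schema.
    `limit` optionally stops after that many closed elements, which is
    handy for a quick look at a large file.
    """
    counts = Counter()
    with bz2.open(path, "rb") as handle:
        # iterparse reads incrementally; "end" fires when an element closes.
        events = ET.iterparse(handle, events=("end",))
        for n, (_, elem) in enumerate(events, 1):
            counts[elem.tag] += 1
            # Drop the element's text and children to limit memory use.
            elem.clear()
            if limit is not None and n >= limit:
                break
    return counts


if __name__ == "__main__":
    # Assumes the file sits in the current working directory.
    for tag, n in count_elements("arabic-ebooks.xml.bz2", limit=100_000).most_common():
        print(tag, n)
```

Inspecting the element names on a limited sample first is a cautious way to learn the file's structure before writing schema-specific extraction code.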

Type

  • Corpus

Language

Arabic

Size

Sentences: 3,629,107
Tokens: 76,486,597

Created

2025-09-12

Updated

2025-09-12

Contact

sb-info@svenska.gu.se