Mission

To support the growth of relational machine learning.

How to cite

BibTeX:

@misc{motl2024ctupraguerelationallearning,
    title={The CTU Prague Relational Learning Repository},
    author={Jan Motl and Oliver Schulte},
    year={2024},
    eprint={1511.03086},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/1511.03086},
}

Cite this article.

FAQ

Why are the datasets not stored in CSV files?

Because CSV files do not store information about data types, primary keys, foreign keys, and other constraints.
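As a small illustration of the point above, the following sketch writes one row to CSV and reads it back using only the Python standard library. The column names and values are made up for the example and do not come from any dataset in the repository.

```python
import csv
import io

# Hypothetical row: an integer key, a float attribute, and a missing value.
rows = [{"molecule_id": 1, "logp": 4.23, "label": None}]

# Write the row to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["molecule_id", "logp", "label"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string, and the missing value
# has become an empty string that cannot be told apart from "".
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0])
```

Nothing in the file records that molecule_id was an integer or a key, so a reader has to guess the types and constraints again.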

Why a MariaDB database?

Because, in combination with ClowdFlows, you can process the datasets online.
Just open one of the public workflows (like Cross-validation), change the credentials in the "MySQL Connect" operator to the credentials from the repository, and you are ready to go!

Why am I not able to connect to the database?

If you are connecting to the database over a corporate network, the corporate firewall could be the culprit (it may block port 3306).
Try to access the database through a different internet connection (e.g. your cellular provider).
Also, keep in mind that database names are case sensitive. Database "mutagenesis" is not the same database as "Mutagenesis".
If the problems persist, contact us and provide us with the following information:

Your database client and its version (e.g. MySQL Workbench 6.3.10).
The database name you tried to connect to (e.g. mutagenesis).
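Before contacting us, it may help to check whether port 3306 is reachable from your network at all. The following is a minimal sketch using only the Python standard library; the function name can_reach_port is made up for this example, and you need to pass the server address listed by the repository yourself.

```python
import socket


def can_reach_port(host: str, port: int = 3306, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

If can_reach_port returns False on the corporate network but True on another connection, a firewall blocking port 3306 is a likely cause. A True result only shows that the port is open; it does not verify the credentials or the database name.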

Why does MySQL Workbench complain about an incompatible/nonstandard server version?

We are using MariaDB, an open-source fork of MySQL, hence the warning. For all purposes that the public account permits, it is safe to ignore the message.

MySQL Workbench cannot acquire lock for a table when trying to dump a database

Make sure you disable the "lock-tables" option in the advanced options, as the "guest" user does not have the privilege to lock tables.

Why can't mysqldump find COLUMN_STATISTICS in information_schema?

MariaDB keeps these statistics in the mysql.column_stats table instead. Use one of the workarounds.
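For command-line dumps, one commonly used approach is to pass --skip-lock-tables (the "guest" user cannot lock tables) and, with the mysqldump client shipped with MySQL 8.0, --column-statistics=0. The sketch below only assembles such a command; the helper name build_mysqldump_command and the host in the usage line are hypothetical, and this is not necessarily the workaround referred to above. MariaDB's own mysqldump does not recognize --column-statistics, so switch that flag off there.

```python
import shlex


def build_mysqldump_command(host, user, database,
                            skip_lock_tables=True,
                            disable_column_statistics=True):
    """Build a mysqldump argument list suitable for a read-only account."""
    # A bare --password makes mysqldump prompt for the password interactively.
    cmd = ["mysqldump", f"--host={host}", f"--user={user}", "--password"]
    if skip_lock_tables:
        # The account lacks the LOCK TABLES privilege.
        cmd.append("--skip-lock-tables")
    if disable_column_statistics:
        # Assumption: the client is mysqldump from MySQL 8.0.
        cmd.append("--column-statistics=0")
    cmd.append(database)
    return cmd


# Hypothetical host; use the server address listed by the repository.
print(shlex.join(build_mysqldump_command("db.example.org", "guest", "mutagenesis")))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run directly.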

Why do the datasets contain missing values/composite keys/strange data types/any other ugly thing you may think of?

Because they are also present in real-world datasets.

What is the point of including artificial datasets?

While datasets like Adventure Works may not contain any pattern that could be found during modeling, they still increase the diversity of the repository. For example, Adventure Works has the highest table count in the whole repository.
If your algorithm can process all the tables present in Adventure Works, it may be able to process real-world datasets.

Tools that use our repository

dm: Relational Data Models, a package for working with relational data in R.
Data Xtractor, a visual SQL query builder for Windows.
getML, a propositionalization library in Python.