Top datasets
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans, along with their information and transactions. The standard task is to predict the loan outcome for finished loans (A vs B in loan.status) at the time of the loan start (defined by loan.dat…
The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …
An anonymized dump of all user-contributed content on the Stats Stack Exchange network.
The IMDb database: a moderately large, real-world database of movies.
The East-West challenge (1980) database describes eastbound and westbound trains.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wo…
The PKDD'02 Hepatitis dataset describes 206 cases of Hepatitis B, contrasted with 484 cases of Hepatitis C.
KDD Cup 2001 prediction of gene/protein function and localization.
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules labeled according to their carcinogenicity in male and female mice and rats.
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
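Several of the entries above state class counts, which is enough to work out a majority-class baseline before any model is trained. The following is a minimal sketch in Python, using only the counts quoted in the descriptions. The dictionary keys and label names are illustrative assumptions, not identifiers taken from the databases themselves.

```python
# Class counts copied from the dataset descriptions above.
# Keys and label names are illustrative, not actual schema identifiers.
CLASS_COUNTS = {
    "PKDD'99 Financial": {"successful": 606, "unsuccessful": 76},
    "PKDD'02 Hepatitis": {"hepatitis_b": 206, "hepatitis_c": 484},
    "Carcinogenicity tests": {"positive": 182, "negative": 148},
}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


if __name__ == "__main__":
    for name, counts in CLASS_COUNTS.items():
        label, accuracy = majority_baseline(counts)
        print(f"{name}: always predict {label!r} -> accuracy {accuracy:.3f}")
```

On the Financial counts, for instance, always predicting a successful loan is correct for 606 of 682 loans, so plain accuracy on that task is only informative above roughly 0.89.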