CiteSeer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
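
The 0/1-valued word vector described above can be illustrated with a minimal sketch. The five-word dictionary and the paper's word list here are made up for illustration (the real dictionary has 3703 entries), and `binary_word_vector` is a hypothetical helper name, not part of the dataset.

```python
def binary_word_vector(words_in_paper, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in the paper, else 0."""
    present = set(words_in_paper)
    return [1 if word in present else 0 for word in dictionary]


# Toy dictionary (assumption); the real one has 3703 unique words.
dictionary = ["agent", "database", "learning", "network", "retrieval"]

# Words found in one hypothetical paper; repeats do not change the vector,
# because only presence or absence is recorded.
paper_words = ["learning", "network", "learning"]

vector = binary_word_vector(paper_words, dictionary)
print(vector)  # [0, 0, 1, 1, 0]
```

Because the vector records only presence, a word that occurs many times in a paper is encoded the same way as a word that occurs once.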

Original source: linqs.soe.ucsc.edu

Versions

CiteSeer (by Jan Motl)
- Note that some papers appear in the cite table without a corresponding entry in the content table.
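
One way to find the papers mentioned in the note above is an anti-join between the cite and content tables. The sketch below runs against a tiny in-memory SQLite database with made-up rows. The column names (`citing_paper_id`, `cited_paper_id`, `paper_id`, `word_cited_id`) are assumptions and should be checked against the actual schema before reusing the query.

```python
import sqlite3

# In-memory toy database; table names follow the note above,
# column names and rows are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cite (citing_paper_id TEXT, cited_paper_id TEXT);
    CREATE TABLE content (paper_id TEXT, word_cited_id TEXT);
    INSERT INTO cite VALUES ('p1', 'p2'), ('p3', 'p1'), ('p2', 'p4');
    INSERT INTO content VALUES ('p1', 'word1'), ('p2', 'word7');
""")

# Collect every paper id that occurs on either side of a citation,
# then keep only those with no row in content.
query = """
    SELECT paper_id FROM (
        SELECT citing_paper_id AS paper_id FROM cite
        UNION
        SELECT cited_paper_id AS paper_id FROM cite
    ) AS cited_or_citing
    WHERE paper_id NOT IN (SELECT paper_id FROM content)
    ORDER BY paper_id
"""
missing = [row[0] for row in conn.execute(query)]
conn.close()

print(missing)  # ['p3', 'p4']
```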

Dataset details

Associated task:

Classification

Domain:

Education

Data types:

String

Size:

5.9 MB

Count of tables:

Count of rows:

113,760

Count of columns:

Missing values:

Compound keys:

Loops:

Yes

Type: Real
Instance count: 3,312
Target table: paper
Target column: class_label
Target ID: paper_id
Target timestamp: ?

References

Algorithms

Dataset version	Target	Algorithm	Author text	Measure	Value
CiteSeer		CBCC	Case-Based Collective Classification	Accuracy	0.669
CiteSeer		MLN	Investigating Markov Logic Networks for Collective Classification	Accuracy	0.742

How to download the dataset

The datasets are publicly available directly from a MariaDB database.

Open your favourite MariaDB client (MySQL Workbench works, but see the FAQ).
Use the following credentials:
- hostname: relational.fel.cvut.cz
- port: 3306
- username: guest
- password: ctu-relational
Export the "CiteSeer" database (or another version of the dataset, if available) in your favourite format (e.g. CSV or SQL dump).
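
As an alternative to a graphical client, the export step can be scripted. The following is a minimal sketch, assuming the third-party PyMySQL package is installed and the server is reachable with the credentials listed above. `rows_to_csv_text` and `export_table` are hypothetical helper names introduced here for illustration.

```python
import csv
import io

# Credentials as listed above; the database name is the one quoted in the export step.
CONNECTION_SETTINGS = {
    "host": "relational.fel.cvut.cz",
    "port": 3306,
    "user": "guest",
    "password": "ctu-relational",
    "database": "CiteSeer",
}


def rows_to_csv_text(header, rows):
    """Serialize a header and an iterable of row tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def export_table(table, path, settings=CONNECTION_SETTINGS):
    """Dump one table to a CSV file (requires network access and PyMySQL)."""
    import pymysql  # third-party; imported lazily so the helpers above work without it

    # The table name is interpolated into the SQL text, so restrict it
    # to a plain identifier before building the query.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")

    connection = pymysql.connect(**settings)
    try:
        with connection.cursor() as cursor:
            cursor.execute(f"SELECT * FROM `{table}`")
            header = [column[0] for column in cursor.description]
            text = rows_to_csv_text(header, cursor.fetchall())
    finally:
        connection.close()

    # newline="" keeps the line endings exactly as produced by the csv writer.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(text)
```

Calling `export_table("paper", "paper.csv")` would write the target table to a local file. The same call can be repeated for the other tables, such as cite and content.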