6 Ramaswamy dataset

The Ramaswamy dataset is a benchmark gene expression dataset, introduced in 2001, for studying multi-class cancer classification. It was assembled by S. Ramaswamy and colleagues at the Whitehead Institute/MIT Center for Genome Research (the precursor of today's Broad Institute) and provided an early demonstration that molecular profiles can differentiate among diverse tumor types.

The dataset originated from the paper "Multiclass cancer diagnosis using tumor gene expression signatures" (Ramaswamy et al., 2001). The study aimed to show that a single molecular classifier could distinguish tumors of multiple origins. RNA expression in each sample was profiled on microarrays, which made the collection one of the earliest large-scale, multi-tissue cancer expression resources.

6.0.1 Key facts

Introduced: 2001 (published in Proceedings of the National Academy of Sciences)
Samples: 218 tumor samples and 90 normal tissue samples profiled in the original study
Classes: 14 common tumor types, plus normal tissue counterparts
Platform: Affymetrix oligonucleotide microarrays
Feature count: ~16,000 genes after preprocessing

6.0.2 Data composition

Samples span multiple cancer types (such as lung, breast, kidney, ovary, and leukemia) plus normal tissues. The dataset provides preprocessed, log-transformed expression intensities, often filtered for variance and normalized across arrays. It serves as a test bed for algorithms in supervised learning, feature selection, and dimensionality reduction on biological data.
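The preprocessing steps named above (log transformation, variance filtering, normalization across arrays) can be sketched in a few lines of Python. This is a minimal illustration on made-up intensities, not the pipeline used by the original authors: the floor value, the number of genes kept, and the use of mean-centering as the across-array normalization are all assumptions, and the function names are hypothetical.

```python
import math
from statistics import pvariance

def log_transform(matrix, floor=1.0):
    """Clip each intensity at `floor` (an assumed threshold), then take log2."""
    return [[math.log2(max(v, floor)) for v in row] for row in matrix]

def filter_by_variance(matrix, keep):
    """Keep the `keep` most variable genes (rows), preserving their original order."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: pvariance(matrix[i]), reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [matrix[i] for i in kept]

def center_arrays(matrix):
    """Subtract each array's (column's) mean so that all arrays share a common center."""
    n_genes = len(matrix)
    n_arrays = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_genes for j in range(n_arrays)]
    return [[row[j] - means[j] for j in range(n_arrays)] for row in matrix]

# Toy genes x samples matrix of raw intensities (made-up values).
raw = [
    [100.0, 120.0, 110.0, 105.0],  # nearly flat gene
    [20.0, 800.0, 25.0, 900.0],    # highly variable gene
    [0.5, 300.0, 2.0, 250.0],      # variable gene with a value below the floor
]

logged = log_transform(raw)
kept_idx, filtered = filter_by_variance(logged, keep=2)
normalized = center_arrays(filtered)
print("genes kept:", kept_idx)
```

On real data the matrix would hold thousands of genes, so the number of genes to keep and the normalization method would need to be chosen for the analysis at hand.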

6.0.3 Research impact

The Ramaswamy dataset became widely used in bioinformatics for benchmarking multi-class classifiers, including support vector machines and neural networks. It highlights challenges in high-dimensional, low-sample-size genomics and continues to appear in comparative studies evaluating classification robustness.
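To make the high-dimensional, low-sample-size setting concrete, the sketch below evaluates a multi-class classifier with leave-one-out cross-validation on synthetic data that has far more features than samples. It deliberately uses a nearest-centroid classifier as a simple stand-in, not the support vector machines or neural networks mentioned above, so that the example needs only the standard library. The data generator, class count, and all parameter values are assumptions chosen for illustration.

```python
import random

def make_toy_data(n_classes=3, per_class=6, n_genes=200, shift=5.0, seed=0):
    """Synthetic samples: Gaussian noise plus a class-specific shift on 5 genes."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(5 * c, 5 * c + 5):
                row[g] += shift
            X.append(row)
            y.append(c)
    return X, y

def centroids(X, y):
    """Mean expression profile of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(cents, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(cents, key=lambda label: sq_dist(cents[label]))

def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the prediction."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += predict(centroids(train_X, train_y), X[i]) == y[i]
    return hits / len(X)

X, y = make_toy_data()
accuracy = leave_one_out_accuracy(X, y)
print(f"{len(X)} samples x {len(X[0])} genes, LOO accuracy = {accuracy:.2f}")
```

Leave-one-out is used here because with so few samples per class every held-out sample matters; any feature selection would have to be repeated inside each fold to avoid an optimistic estimate.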

6.0.4 Access and usage

The dataset is publicly available through repositories like the Broad Institute and the Gene Expression Omnibus (accession GSE2109 and related entries). Researchers typically download it in processed matrix form for cross-study machine learning evaluation.

6.0.5 Comparison mappings

A normal tissue vs. tumor comparison map was drawn up from the samples available in the dataset. The sixteen tissue comparisons were then grouped into four tumor categories: (1) carcinomas, (2) blastomas, (3) lymphomas, and (4) leukemias, as listed in the tables below:

Carcinoma mappings.
Normal Tissue	Tumor	Comparison Abbreviation
Bladder	Bladder transitional cell carcinoma	BLAD/TCC
Breast	Breast adenocarcinoma	BR/BRAD
Colon	Colorectal adenocarcinoma	COL/COADREAD
Kidney	Renal cell carcinoma	KID/RCC
Lung	Lung adenocarcinoma	LU/LUAD
Ovary	Ovarian adenocarcinoma	OV/OVAD
Pancreas	Pancreatic adenocarcinoma	PA/PAAD
Prostate	Prostate adenocarcinoma	PR/PRAD
Uterus	Uterine adenocarcinoma	UT/EAC

Blastoma mappings.
Normal Tissue	Tumor	Comparison Abbreviation
Brain	Glioblastoma	Brain/GBM
Brain	Medulloblastoma	Brain/MB

Lymphoma mappings.
Normal Tissue	Tumor	Comparison Abbreviation
Lymphoid	Follicular lymphoma	GC/FL
Lymphoid	Large B-cell lymphoma	GC/LBCL

Leukemia mappings.
Normal Tissue	Tumor	Comparison Abbreviation
Peripheral Blood	Acute myeloid leukemia	PB/AML
Peripheral Blood	(Bone marrow) B-cell ALL	PB/B-ALL
Peripheral Blood	(Bone marrow) T-cell ALL	PB/T-ALL
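
For programmatic use, the mappings in the tables above can be held in a small lookup structure. The sketch below is one possible encoding in Python; the rows are copied from the tables, while the variable and function names (`COMPARISONS`, `category_of`, `comparisons_for_tissue`) are hypothetical and not part of the dataset.

```python
# Each entry: (normal tissue, tumor, comparison abbreviation), as listed in the tables.
COMPARISONS = {
    "carcinoma": [
        ("Bladder", "Bladder transitional cell carcinoma", "BLAD/TCC"),
        ("Breast", "Breast adenocarcinoma", "BR/BRAD"),
        ("Colon", "Colorectal adenocarcinoma", "COL/COADREAD"),
        ("Kidney", "Renal cell carcinoma", "KID/RCC"),
        ("Lung", "Lung adenocarcinoma", "LU/LUAD"),
        ("Ovary", "Ovarian adenocarcinoma", "OV/OVAD"),
        ("Pancreas", "Pancreatic adenocarcinoma", "PA/PAAD"),
        ("Prostate", "Prostate adenocarcinoma", "PR/PRAD"),
        ("Uterus", "Uterine adenocarcinoma", "UT/EAC"),
    ],
    "blastoma": [
        ("Brain", "Glioblastoma", "Brain/GBM"),
        ("Brain", "Medulloblastoma", "Brain/MB"),
    ],
    "lymphoma": [
        ("Lymphoid", "Follicular lymphoma", "GC/FL"),
        ("Lymphoid", "Large B-cell lymphoma", "GC/LBCL"),
    ],
    "leukemia": [
        ("Peripheral Blood", "Acute myeloid leukemia", "PB/AML"),
        ("Peripheral Blood", "(Bone marrow) B-cell ALL", "PB/B-ALL"),
        ("Peripheral Blood", "(Bone marrow) T-cell ALL", "PB/T-ALL"),
    ],
}

def category_of(abbreviation):
    """Return the tumor category for a comparison abbreviation, or None if unknown."""
    for category, rows in COMPARISONS.items():
        if any(abbr == abbreviation for _normal, _tumor, abbr in rows):
            return category
    return None

def comparisons_for_tissue(tissue):
    """List the comparison abbreviations that share a given normal tissue."""
    return [abbr
            for rows in COMPARISONS.values()
            for normal, _tumor, abbr in rows
            if normal == tissue]

print(category_of("PB/AML"))
print(comparisons_for_tissue("Brain"))
```

Keeping the category as the dictionary key mirrors the four tables, and grouping by normal tissue makes it easy to see which tumor types share a reference tissue.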