6 Ramaswamy dataset
The Ramaswamy dataset is a benchmark gene expression dataset introduced in 2001 for studying multi-class cancer classification. It was developed by S. Ramaswamy and colleagues at the Whitehead Institute/MIT Center for Genome Research (later part of the Broad Institute), and provided an early demonstration that molecular profiles could differentiate among diverse tumor types.
The dataset originated from the paper "Multiclass cancer diagnosis using tumor gene expression signatures" (Ramaswamy et al., 2001). The study aimed to show that a single molecular classifier could distinguish multiple tumor origins. Each sample's RNA expression was profiled using microarray technology, making it one of the earliest large-scale, multi-tissue cancer expression resources.
6.0.1 Key facts
Introduced: 2001 (published in Proceedings of the National Academy of Sciences)
Samples: 218 tumor samples and 90 normal tissue samples in the original study; a 190-sample tumor subset is common in benchmark studies
Classes: 14 common tumor types, plus normal tissue samples
Platform: Affymetrix oligonucleotide microarrays
Feature count: 16,063 genes and expressed sequence tags (ESTs)
6.0.2 Data composition
Samples span multiple cancer types—such as lung, breast, kidney, ovary, and leukemia—plus normal tissues. The dataset provides preprocessed log-transformed expression intensities, often filtered for variance and normalized across arrays. It serves as a model for testing algorithms in supervised learning, feature selection, and dimensionality reduction for biological data.
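The preprocessing steps mentioned above (log transformation and variance filtering) can be sketched in a few lines of plain Python. This is a minimal illustration on made-up intensities, not the pipeline used by the original authors: the floor value of 1.0, the gene names, and the helper names `log2_transform` and `top_variance_genes` are all assumptions introduced for the example.

```python
import math
from statistics import pvariance


def log2_transform(matrix, floor=1.0):
    """Clamp each intensity to `floor`, then take log2.

    The floor of 1.0 is an assumed choice that keeps the logarithm defined
    for zero or negative intensities.
    """
    return [[math.log2(max(value, floor)) for value in row] for row in matrix]


def top_variance_genes(matrix, gene_ids, keep):
    """Return the `keep` gene IDs with the largest variance across samples."""
    scored = [(pvariance(row), gene) for row, gene in zip(matrix, gene_ids)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored[:keep]]


# Toy genes-by-samples matrix; values and gene names are hypothetical.
gene_ids = ["geneA", "geneB", "geneC"]
raw = [
    [100.0, 110.0, 105.0, 98.0],   # nearly flat across samples
    [20.0, 640.0, 25.0, 700.0],    # strongly variable
    [300.0, 150.0, 310.0, 160.0],  # moderately variable
]

logged = log2_transform(raw)
selected = top_variance_genes(logged, gene_ids, keep=2)
```

Filtering on variance after the log transform, rather than before, keeps highly expressed genes from dominating the ranking purely because of their scale.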
6.0.3 Research impact
The Ramaswamy dataset became widely used in bioinformatics for benchmarking multi-class classifiers, including support vector machines and neural networks. It highlights challenges in high-dimensional, low-sample-size genomics and continues to appear in comparative studies evaluating classification robustness.
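As a rough illustration of how a multi-class classifier might be benchmarked when features far outnumber samples, the sketch below runs leave-one-out cross-validation with a nearest-centroid classifier. The classifier is a deliberately simple stand-in rather than the method of the original study, and the data are synthetic: the class names, sample counts, and signal strength are assumptions chosen only to make the example self-contained.

```python
import random


def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train_x, train_y, sample):
    """Assign `sample` to the class whose training centroid is closest."""
    by_class = {}
    for features, label in zip(train_x, train_y):
        by_class.setdefault(label, []).append(features)
    return min(
        by_class,
        key=lambda label: squared_distance(centroid(by_class[label]), sample),
    )


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and score the prediction."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Synthetic stand-in data: 3 classes x 5 samples, 200 features per sample,
# so features outnumber samples by more than 10 to 1.
rng = random.Random(0)
shifts = {"classA": 0.0, "classB": 5.0, "classC": 10.0}
xs, ys = [], []
for label, shift in shifts.items():
    for _ in range(5):
        # Only the first 20 features carry class signal; the rest are noise.
        xs.append([rng.gauss(shift if j < 20 else 0.0, 1.0) for j in range(200)])
        ys.append(label)

accuracy = leave_one_out_accuracy(xs, ys)
```

Leave-one-out evaluation is a common choice when each class has only a handful of samples, since it uses nearly all of the data for training in every fold.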
6.0.4 Access and usage
The dataset is publicly available through repositories like the Broad Institute and the Gene Expression Omnibus (accession GSE2109 and related entries). Researchers typically download it in processed matrix form for cross-study machine learning evaluation.
6.0.5 Comparison Mappings
A normal tissue vs. tumor comparison map was drawn up from the samples available in the dataset. The sixteen tissue comparisons were then grouped into four tumor categories: (1) carcinomas, (2) blastomas, (3) lymphomas, and (4) leukemias, as described in the tables below:
Carcinomas:

| Normal Tissue | Tumor | Comparison Abbreviation | No. of Normal Samples | No. of Tumor Samples |
|---|---|---|---|---|
| Bladder | Bladder transitional cell carcinoma | BLAD/TCC | | |
| Breast | Breast adenocarcinoma | BR/BRAD | | |
| Colon | Colorectal adenocarcinoma | COL/COADREAD | | |
| Kidney | Renal cell carcinoma | KID/RCC | | |
| Lung | Lung adenocarcinoma | LU/LUAD | | |
| Ovary | Ovarian adenocarcinoma | OV/OVAD | | |
| Pancreas | Pancreatic adenocarcinoma | PA/PAAD | | |
| Prostate | Prostate adenocarcinoma | PR/PRAD | | |
| Uterus | Uterine adenocarcinoma | UT/EAC | | |

Blastomas:

| Normal Tissue | Tumor | Comparison Abbreviation | No. of Normal Samples | No. of Tumor Samples |
|---|---|---|---|---|
| Brain | Glioblastoma | Brain/GBM | | |
| Brain | Medulloblastoma | Brain/MB | | |

Lymphomas:

| Normal Tissue | Tumor | Comparison Abbreviation | No. of Normal Samples | No. of Tumor Samples |
|---|---|---|---|---|
| Lymphoid | Follicular lymphoma | GC/FL | | |
| Lymphoid | Large B-cell lymphoma | GC/LBCL | | |

Leukemias:

| Normal Tissue | Tumor | Comparison Abbreviation | No. of Normal Samples | No. of Tumor Samples |
|---|---|---|---|---|
| Peripheral Blood | Acute myeloid leukemia | PB/AML | | |
| Peripheral Blood | (Bone marrow) B-cell ALL | PB/B-ALL | | |
| Peripheral Blood | (Bone marrow) T-cell ALL | PB/T-ALL | | |
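For downstream analysis, the comparison mapping in the tables can be encoded as a small lookup structure. The sketch below copies the comparison abbreviations from the tables; the dictionary layout and the helper names `split_comparison` and `category_of` are illustrative assumptions, not part of the dataset itself.

```python
# Comparison abbreviations grouped by tumor category, copied from the tables.
COMPARISONS = {
    "carcinomas": [
        "BLAD/TCC", "BR/BRAD", "COL/COADREAD", "KID/RCC", "LU/LUAD",
        "OV/OVAD", "PA/PAAD", "PR/PRAD", "UT/EAC",
    ],
    "blastomas": ["Brain/GBM", "Brain/MB"],
    "lymphomas": ["GC/FL", "GC/LBCL"],
    "leukemias": ["PB/AML", "PB/B-ALL", "PB/T-ALL"],
}


def split_comparison(abbreviation):
    """Split 'NORMAL/TUMOR' into its normal-tissue and tumor codes."""
    normal, tumor = abbreviation.split("/", 1)
    return normal, tumor


def category_of(abbreviation):
    """Return the tumor category that contains the given comparison."""
    for category, members in COMPARISONS.items():
        if abbreviation in members:
            return category
    raise KeyError(abbreviation)


# Number of comparisons per category, e.g. for sanity-checking an analysis loop.
counts = {category: len(members) for category, members in COMPARISONS.items()}
```

A call such as `category_of("PB/AML")` returns `"leukemias"`, and `split_comparison("KID/RCC")` returns `("KID", "RCC")`, which makes it straightforward to iterate over comparisons one category at a time.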