8 Preprocessing Workflow for Global Cancer Expression Data
This document describes the standalone preprocessing pipeline used to
construct cleaned, chip-specific ExpressionSet objects from raw
Affymetrix CEL files for the global cancer dataset
GSE68928.
Overview
Driver script: R/run_preprocessing_pipeline.R
Estimated runtime:
- ~15 minutes when CEL files and GEO metadata are already available locally
- ~75 minutes if FTP download is enabled or remote resources must be retrieved
Primary outputs:
- output/global_cancer/RData/global_cancer_eset_list.RData
- output/global_cancer/RData/annotations/full_chip_annotations.rds
Primary log file:
- output/global_cancer/logs/preprocess/preprocess_pipeline_log.txt
The goal of this preprocessing pipeline is to transform raw public
microarray data into normalized, metadata-enriched ExpressionSet
objects suitable for downstream analysis of cancer expression structure
and complexity.
Historical and Methodological Context
The original version of this research (2004) was implemented using a text-based data pipeline. Microarray data were processed as flat files, stored in a MySQL database, and queried using Perl scripts to generate analysis-ready subsets.
In the current reconstruction, this workflow has been translated into a Bioconductor-based framework centered on ExpressionSet objects. This introduces a separation between:
- raw expression data (derived directly from CEL files via RMA), and
- phenotype metadata (retrieved from GEO and attached post hoc)
Publicly available GEO metadata is not perfectly aligned with the CEL-derived sample structure and contains occasional spelling and labeling inconsistencies, so a dedicated metadata cleaning step is required.
This preprocessing pipeline therefore serves as a bridge between the original data-engineering architecture and a modern object-oriented bioinformatics workflow.
Workflow Steps
1. Download .CEL Files (Optional)
Purpose: Retrieve raw Affymetrix CEL files if not already available locally.
- Controlled by: download_enabled
- Script: R/preprocessing/download_cel_files.R
- Source: EBI FTP (E-GEOD-68928)
- Destination: data/global_cancer/CEL/
Notes:
In the current workflow, downloads are typically disabled because CEL
files are already present locally. GEO-derived phenotype metadata is
attached in a later step.
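The optional download can be guarded so that nothing touches the network unless the flag is set, and so that only missing files are requested. The following is a minimal sketch, assuming download_enabled is a logical flag. The helper name cel_files_to_fetch, the expected_files vector, and base_url are hypothetical and are not taken from download_cel_files.R.

```r
# Hypothetical helper: return the CEL files that still need to be fetched.
cel_files_to_fetch <- function(expected_files, dest_dir, download_enabled = FALSE) {
  if (!isTRUE(download_enabled)) {
    return(character(0))  # downloads disabled: nothing to fetch
  }
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  present <- file.exists(file.path(dest_dir, expected_files))
  expected_files[!present]
}

# The actual transfer is left to the caller, for example with
# utils::download.file() (base_url is a placeholder, not a real address):
# for (f in cel_files_to_fetch(expected_files, "data/global_cancer/CEL", TRUE)) {
#   utils::download.file(paste0(base_url, "/", f),
#                        file.path("data/global_cancer/CEL", f), mode = "wb")
# }
```

Returning an empty vector when the flag is off keeps the typical local run free of remote calls.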
2. Build ExpressionSet Objects
Purpose: Detect chip type, validate CEL files, group by chip, and
normalize using RMA.
- Controlled by: build_esets
- Script: R/preprocessing/build_expression_sets.R
Chip types:
- GPL98 → hu35ksuba
- GPL80 → hu6800
Behavior:
- Detect chip type via read.affybatch()
- Skip unreadable or invalid CEL files (logged)
- Group valid CELs by chip
- Normalize each chip-specific group using RMA
- Produce one ExpressionSet per chip
Output:
- Named list of chip-specific ExpressionSet objects
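The detect, skip, and group behavior can be sketched as follows. This is an illustration under assumptions, not the contents of build_expression_sets.R: the function names are hypothetical, and the detector is passed in as an argument so the grouping logic can be exercised without Bioconductor installed. The affy-based detector and the RMA call need the affy and Biobase packages when actually run.

```r
# Hypothetical detector: read one CEL file and return its chip name.
# Requires the Bioconductor packages affy and Biobase when called.
detect_chip_affy <- function(cel_file) {
  batch <- affy::read.affybatch(filenames = cel_file)
  Biobase::annotation(batch)
}

# Group CEL files by detected chip; files whose detection fails are skipped.
group_cels_by_chip <- function(cel_files, detect_chip = detect_chip_affy) {
  chip <- vapply(cel_files, function(f) {
    tryCatch(as.character(detect_chip(f))[1], error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
  ok <- !is.na(chip)
  list(groups = split(cel_files[ok], chip[ok]), skipped = cel_files[!ok])
}

# Each group could then be normalized with RMA, one ExpressionSet per chip:
# grouped   <- group_cels_by_chip(cel_files)
# eset_list <- lapply(grouped$groups,
#                     function(f) affy::rma(affy::ReadAffy(filenames = f)))
```

The skipped element holds the files that would be written to the skipped-file logs.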
3. Attach GEO Metadata
Purpose: Attach phenotype metadata from GEO to each ExpressionSet.
- Controlled by: process_metadata
- Script: R/preprocessing/attach_geo_metadata.R
- GEO accession: GSE68928
Behavior:
- Load GEO platform objects
- Map GEO platforms to internal chip names
- Match samples using GSM identifiers derived from CEL filenames
- Write matched phenotype data into pData(eset)
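The GSM-based matching step can be illustrated in base R. The sketch below assumes that each CEL filename contains its GSM accession and that the GEO phenotype table has a geo_accession column, as tables retrieved with GEOquery do. The function names and the example filename in the comment are hypothetical.

```r
# Extract the GSM accession from a CEL filename,
# e.g. "GSM0000001_x.CEL.gz" -> "GSM0000001"; NA when none is present.
gsm_from_cel <- function(cel_files) {
  base <- basename(cel_files)
  gsm <- sub("^.*?(GSM[0-9]+).*$", "\\1", base, perl = TRUE)
  gsm[!grepl("GSM[0-9]+", base)] <- NA_character_
  gsm
}

# Align GEO phenotype rows to the order of the CEL files.
# Unmatched files get a row of NA values and are also reported separately.
match_geo_metadata <- function(cel_files, geo_pdata) {
  idx <- match(gsm_from_cel(cel_files), geo_pdata$geo_accession)
  matched <- geo_pdata[idx, , drop = FALSE]
  rownames(matched) <- basename(cel_files)
  list(pdata = matched, unmatched = cel_files[is.na(idx)])
}
```

If the row names of the aligned table equal sampleNames(eset), it could then be assigned with Biobase::pData(eset) <- result$pdata.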
4. Clean Metadata and Standardize Labels
Purpose: Correct inconsistencies in GEO metadata and create
analysis-ready labels.
- Script: R/helpers/clean_and_label_metadata.R
Metadata fields used:
- characteristics_ch1 → disease state
- characteristics_ch1.1 → tissue source
Derived variables:
- condition: "normal" or "cancer"
- tissue_label: combined tissue + disease label (e.g., lung_adenocarcinoma, prostate_normal)
Behavior:
- Apply known corrections to spelling and formatting
- Normalize free-text metadata into structured categories
- Add derived fields to pData(eset)
Notes:
This step encodes domain-specific knowledge about the dataset and is
critical for downstream analysis.
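A sketch of how the derived variables might be computed is shown below. It rests on several assumptions that are not confirmed by clean_and_label_metadata.R: that the GEO fields use a "key: value" layout, that a disease value of "normal" marks normal tissue, and that corrections are held in a named lookup vector. The function names and the correction entry used in the usage example are invented for illustration.

```r
# Strip a leading "key: " prefix, trim whitespace, and lower-case the value.
strip_geo_field <- function(x) {
  tolower(trimws(sub("^[^:]*:[[:space:]]*", "", x)))
}

# Add `condition` and `tissue_label` to a phenotype data frame.
# `corrections` is a named vector: names are misspellings, values are fixes.
clean_pdata <- function(pd, corrections = character(0)) {
  fix <- function(v) {
    hit <- v %in% names(corrections)
    v[hit] <- unname(corrections[v[hit]])
    v
  }
  disease <- fix(strip_geo_field(pd$characteristics_ch1))
  tissue  <- fix(strip_geo_field(pd$characteristics_ch1.1))
  pd$condition <- ifelse(disease == "normal", "normal", "cancer")
  # Join tissue and disease, collapsing anything non-alphanumeric to "_".
  pd$tissue_label <- gsub("[^a-z0-9]+", "_", paste(tissue, disease))
  pd
}
```

For example, clean_pdata(pd, c(adenocarcinom = "adenocarcinoma")) would map a misspelled disease value to its canonical form before the labels are built.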
5. Annotate Full-Chip Probe Sets
Purpose: Attach biological annotations to all probes on each chip.
- Controlled by: run_annotation
- Scripts:
  - R/wrappers/run_annotate_chip_probes.R
  - R/annotate/annotate_chip_probes.R
Annotation dimensions:
- Gene symbols (SYMBOL)
- Gene Ontology terms (MF and BP)
- KEGG pathways
- MSigDB Hallmark gene sets
Output:
- output/global_cancer/RData/annotations/full_chip_annotations.rds
Notes:
Annotation is performed on the full chip once during preprocessing.
Downstream analysis subsets from this complete annotation layer rather
than recomputing annotations.
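The annotate-once, subset-later pattern might look like the sketch below. The internal structure of full_chip_annotations.rds is not described here, so the example assumes a per-chip data frame with a PROBEID column. Both function names are hypothetical. The symbol lookup needs AnnotationDbi and the chip's annotation package (for example hu6800.db) when it is actually called.

```r
# Hypothetical lookup of gene symbols for a set of probe IDs.
# `chip_db` is a chip annotation object such as hu6800.db::hu6800.db.
annotate_symbols <- function(probe_ids, chip_db) {
  AnnotationDbi::select(chip_db, keys = probe_ids,
                        columns = "SYMBOL", keytype = "PROBEID")
}

# Downstream: take rows from the stored full-chip table instead of
# querying the annotation packages again.
subset_annotation <- function(full_annot, probe_ids) {
  full_annot[full_annot$PROBEID %in% probe_ids, , drop = FALSE]
}
```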
6. Save ExpressionSets and Annotations
Purpose: Persist preprocessed objects for downstream analysis.
Outputs:
- global_cancer_eset_list.RData (named list of chip-specific ExpressionSet objects)
- full_chip_annotations.rds (full annotation results for each chip)
These outputs serve as a stable preprocessing checkpoint and are reused
by the analysis pipeline.
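Writing and reloading the checkpoint can be sketched with base R serialization. The file names follow the outputs listed above. The function names and the object name eset_list stored inside the .RData file are assumptions.

```r
# Write both checkpoint files under `rdata_dir`.
save_checkpoint <- function(eset_list, annotations, rdata_dir) {
  annot_dir <- file.path(rdata_dir, "annotations")
  dir.create(annot_dir, recursive = TRUE, showWarnings = FALSE)
  eset_path  <- file.path(rdata_dir, "global_cancer_eset_list.RData")
  annot_path <- file.path(annot_dir, "full_chip_annotations.rds")
  save(eset_list, file = eset_path)  # restored later under the name `eset_list`
  saveRDS(annotations, annot_path)
  invisible(c(esets = eset_path, annotations = annot_path))
}

# Reload both files; load() goes into a private environment so that it
# cannot overwrite objects in the caller's workspace.
load_checkpoint <- function(rdata_dir) {
  env <- new.env()
  load(file.path(rdata_dir, "global_cancer_eset_list.RData"), envir = env)
  list(
    esets = env$eset_list,
    annotations = readRDS(file.path(rdata_dir, "annotations",
                                    "full_chip_annotations.rds"))
  )
}
```

An .RData file restores objects by name while an .rds file returns a single object, which is why the two files are read differently.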
Outputs
| File | Description |
|---|---|
| global_cancer_eset_list.RData | List of normalized ExpressionSet objects |
| full_chip_annotations.rds | Full-chip annotation results |
| preprocess_pipeline_log.txt | Main preprocessing log |
| skipped_chiptype.txt | CEL files with no detectable chip type |
| skipped_hu35ksuba_cel.txt | Invalid CEL files for hu35ksuba |
| skipped_hu6800_cel.txt | Invalid CEL files for hu6800 |
Customizing for Other Datasets
This pipeline can be adapted to other Affymetrix or GEO datasets by:
- Updating GEO accession and FTP source
- Adjusting chip-to-platform mappings
- Modifying metadata parsing logic in clean_and_label_metadata.R
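One way to keep these dataset-specific settings together is a single configuration list. This is a sketch only: the driver script may organize its settings differently, and the names pipeline_config and chip_for_platform are hypothetical. The accession, platform mapping, and flag names are the ones used in this document.

```r
# Settings that would change when adapting the pipeline to another dataset.
pipeline_config <- list(
  geo_accession    = "GSE68928",
  chip_map         = c(GPL98 = "hu35ksuba", GPL80 = "hu6800"),
  download_enabled = FALSE,
  build_esets      = TRUE,
  process_metadata = TRUE,
  run_annotation   = TRUE
)

# Translate GEO platform IDs to internal chip names, failing loudly
# when a platform has no mapping.
chip_for_platform <- function(config, gpl) {
  chip <- unname(config$chip_map[gpl])
  if (anyNA(chip)) {
    stop("No chip mapping for platform(s): ",
         paste(gpl[is.na(chip)], collapse = ", "))
  }
  chip
}
```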
Testing and Validation
- CEL file counts and chip distributions are logged
- Metadata matching success is reported
- Skipped files are recorded for inspection
- Phenotype data can be inspected via:
  Biobase::pData(eset)
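A quick summary of the phenotype table can make these checks easier to read. The sketch below works on the data frame returned by Biobase::pData(eset) and assumes the derived columns condition and tissue_label are present. The function name summarize_pdata is hypothetical.

```r
# Summarize sample counts and labels for one ExpressionSet's phenotype table.
summarize_pdata <- function(pd) {
  list(
    n_samples    = nrow(pd),
    n_unlabeled  = sum(is.na(pd$condition)),  # samples without a condition
    condition    = table(pd$condition, useNA = "ifany"),
    tissue_label = sort(table(pd$tissue_label), decreasing = TRUE)
  )
}

# Usage with a real object (requires Biobase):
# summarize_pdata(Biobase::pData(eset))
```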
Directory Layout
/R/preprocessing/download_cel_files.R
/R/preprocessing/build_expression_sets.R
/R/preprocessing/attach_geo_metadata.R
/R/helpers/clean_and_label_metadata.R
/R/wrappers/run_annotate_chip_probes.R
/R/annotate/annotate_chip_probes.R
/output/global_cancer/logs/preprocess/
/output/global_cancer/RData/
/output/global_cancer/RData/annotations/
8.1 References
- Source paper: Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures. Proc Natl Acad Sci USA 2001;98(26):15149–15154. DOI:10.1073/pnas.211566398
- GEO accession: GSE68928