8  Preprocessing Workflow for Global Cancer Expression Data

This document describes the standalone preprocessing pipeline used to
construct cleaned, chip-specific ExpressionSet objects from raw
Affymetrix CEL files for the global cancer dataset
GSE68928.


Overview

Driver script: R/run_preprocessing_pipeline.R
Estimated runtime:
- ~15 minutes when CEL files and GEO metadata are already available locally
- ~75 minutes if FTP download is enabled or remote resources must be retrieved

Primary outputs:
- output/global_cancer/RData/global_cancer_eset_list.RData
- output/global_cancer/RData/annotations/full_chip_annotations.rds

Primary log file:
- output/global_cancer/logs/preprocess/preprocess_pipeline_log.txt

The goal of this preprocessing pipeline is to transform raw public
microarray data into normalized, metadata-enriched ExpressionSet
objects suitable for downstream analysis of cancer expression structure
and complexity.


Historical and Methodological Context

The original version of this research (2004) was implemented using a text-based data pipeline. Microarray data were processed as flat files, stored in a MySQL database, and queried using Perl scripts to generate analysis-ready subsets.

In the current reconstruction, this workflow has been translated into a Bioconductor-based framework centered on ExpressionSet objects. This introduces a separation between:

  • raw expression data (derived directly from CEL files via RMA), and
  • phenotype metadata (retrieved from GEO and attached post hoc)

Because publicly available GEO metadata is not perfectly aligned with the CEL-derived sample structure—and contains occasional spelling and labeling inconsistencies—a dedicated metadata cleaning step is required.

This preprocessing pipeline therefore serves as a bridge between the original data-engineering architecture and a modern object-oriented bioinformatics workflow.


Workflow Steps

1. Download .CEL Files (Optional)

Purpose: Retrieve raw Affymetrix CEL files if not already available locally.

  • Controlled by: download_enabled
  • Script: R/preprocessing/download_cel_files.R
  • Source: EBI FTP (E-GEOD-68928)
  • Destination: data/global_cancer/CEL/

Notes:
In the current workflow, downloads are typically disabled because CEL
files are already present locally. GEO-derived phenotype metadata is
attached in a later step.
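
The gating described above could be sketched as follows. This is a minimal illustration, not the contents of download_cel_files.R: the helper name needs_download and the rule "download only when enabled and no CEL files are found locally" are assumptions.

```r
# Hypothetical helper: decide whether the download step needs to run.
# `download_enabled` mirrors the pipeline flag; `cel_dir` is the CEL destination.
needs_download <- function(download_enabled, cel_dir) {
  if (!isTRUE(download_enabled)) {
    return(FALSE)  # step switched off: rely on whatever is present locally
  }
  # Match .CEL and .CEL.gz, case-insensitively
  local_cels <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                           ignore.case = TRUE)
  length(local_cels) == 0
}
```

Under this sketch, needs_download(TRUE, "data/global_cancer/CEL/") would return FALSE as soon as at least one CEL file exists in the destination directory.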


2. Build ExpressionSet Objects

Purpose: Detect chip type, validate CEL files, group by chip, and
normalize using RMA.

  • Controlled by: build_esets
  • Script: R/preprocessing/build_expression_sets.R

Chip types:
- GPL98 (hu35ksuba)
- GPL80 (hu6800)

Behavior:
- Detect chip type via read.affybatch()
- Skip unreadable or invalid CEL files (logged)
- Group valid CELs by chip
- Normalize each chip-specific group using RMA
- Produce one ExpressionSet per chip

Output:
- Named list of chip-specific ExpressionSet objects
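
The skip-and-group logic can be illustrated in base R. The sketch below is not the code in build_expression_sets.R: the function name group_cels_by_chip is hypothetical, and chip detection is passed in as a caller-supplied function (detect_chip) so the example runs without Bioconductor or real CEL files.

```r
# Group CEL files by chip type, skipping files whose chip cannot be detected.
# `detect_chip` is a caller-supplied function (hypothetical here) that returns
# a chip name for one file and signals an error for unreadable files.
group_cels_by_chip <- function(cel_files, detect_chip) {
  chip <- vapply(cel_files, function(f) {
    tryCatch(as.character(detect_chip(f)), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)

  skipped <- cel_files[is.na(chip)]
  if (length(skipped) > 0) {
    message("Skipping ", length(skipped), " unreadable CEL file(s)")
  }
  keep <- !is.na(chip)
  # One character vector of file paths per detected chip type
  list(groups = split(cel_files[keep], chip[keep]), skipped = skipped)
}
```

Each element of groups would then be read and normalized with RMA separately, yielding one ExpressionSet per chip; the skipped vector corresponds to the files that the pipeline records in its skip logs.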


3. Attach GEO Metadata

Purpose: Attach phenotype metadata from GEO to each ExpressionSet.

  • Controlled by: process_metadata
  • Script: R/preprocessing/attach_geo_metadata.R
  • GEO accession:
    GSE68928

Behavior:
- Load GEO platform objects
- Map GEO platforms to internal chip names
- Match samples using GSM identifiers derived from CEL filenames
- Write matched phenotype data into pData(eset)
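
The GSM-based matching can be sketched in base R. The function names extract_gsm and match_pheno are hypothetical, and the sketch assumes CEL filenames begin with the GSM accession (for example GSM123_sample.CEL.gz) and that the phenotype table has GSM identifiers as row names.

```r
# Extract the GSM accession from CEL filenames such as "GSM123_sample.CEL.gz".
# Returns NA for filenames that do not start with a GSM identifier.
extract_gsm <- function(cel_files) {
  base <- basename(cel_files)
  gsm <- sub("^(GSM[0-9]+).*$", "\\1", base)
  gsm[!grepl("^GSM[0-9]+", base)] <- NA_character_
  gsm
}

# Align a GEO phenotype table (rows named by GSM id) to the CEL sample order.
# Unmatched samples are kept as all-NA rows so row order still follows the CELs.
match_pheno <- function(cel_files, pheno) {
  gsm <- extract_gsm(cel_files)
  idx <- match(gsm, rownames(pheno))
  message(sum(!is.na(idx)), " of ", length(gsm),
          " samples matched to GEO metadata")
  matched <- pheno[idx, , drop = FALSE]
  rownames(matched) <- basename(cel_files)
  matched
}
```

Keeping the result in CEL order, with rows named after the CEL files, means it lines up with the sample names of the expression matrix before it is written into pData(eset).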


4. Clean Metadata and Standardize Labels

Purpose: Correct inconsistencies in GEO metadata and create
analysis-ready labels.

  • Script: R/helpers/clean_and_label_metadata.R

Metadata fields used:
- characteristics_ch1 → disease state
- characteristics_ch1.1 → tissue source

Derived variables:
- condition: "normal" or "cancer"
- tissue_label: combined tissue + disease label (e.g., lung_adenocarcinoma, prostate_normal)

Behavior:
- Apply known corrections to spelling and formatting
- Normalize free-text metadata into structured categories
- Add derived fields to pData(eset)

Notes:
This step encodes domain-specific knowledge about the dataset and is
critical for downstream analysis.
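
A minimal sketch of how the derived variables might be computed is shown below. It is not the code in clean_and_label_metadata.R: the helper names, the "key: value" format assumed for the characteristics fields, and the single entry in the correction table are all illustrative assumptions.

```r
# Strip the "key: " prefix assumed in characteristics fields and normalize case.
strip_key <- function(x) tolower(trimws(sub("^[^:]*:[[:space:]]*", "", x)))

# Hypothetical correction table: misspelled value -> corrected value.
corrections <- c("adenocarcinma" = "adenocarcinoma")

clean_labels <- function(disease_raw, tissue_raw) {
  disease <- strip_key(disease_raw)
  tissue  <- strip_key(tissue_raw)

  # Apply known spelling corrections
  fix <- disease %in% names(corrections)
  disease[fix] <- unname(corrections[disease[fix]])

  condition <- ifelse(grepl("normal", disease), "normal", "cancer")

  # Collapse runs of non-alphanumerics into single underscores
  slug <- function(x) gsub("^_|_$", "", gsub("[^a-z0-9]+", "_", x))
  tissue_label <- ifelse(condition == "normal",
                         paste0(slug(tissue), "_normal"),
                         paste0(slug(tissue), "_", slug(disease)))

  data.frame(condition = condition, tissue_label = tissue_label,
             stringsAsFactors = FALSE)
}
```

With these assumptions, a sample with disease state "Adenocarcinma" and tissue "Lung" would receive condition "cancer" and tissue_label "lung_adenocarcinoma".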


5. Annotate Full-Chip Probe Sets

Purpose: Attach biological annotations to all probes on each chip.

  • Controlled by: run_annotation
  • Scripts:
    • R/wrappers/run_annotate_chip_probes.R
    • R/annotate/annotate_chip_probes.R

Annotation dimensions:
- Gene symbols (SYMBOL)
- Gene Ontology terms (MF and BP)
- KEGG pathways
- MSigDB Hallmark gene sets

Output:
- output/global_cancer/RData/annotations/full_chip_annotations.rds

Notes:
Annotation is performed on the full chip once during preprocessing.
Downstream analysis subsets from this complete annotation layer rather
than recomputing annotations.
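
Subsetting from a precomputed annotation layer might look like the sketch below. The internal layout of full_chip_annotations.rds is not described here, so the table structure (one row per probe set, keyed by a probe_id column), the placeholder probe IDs, and the function name subset_annotation are assumptions.

```r
# Hypothetical layout: one row per probe set, keyed by `probe_id`.
full_annot <- data.frame(
  probe_id = c("p1_at", "p2_at", "p3_at"),
  SYMBOL   = c("GENE1", "GENE2", NA),
  stringsAsFactors = FALSE
)

# Return annotation rows in the same order as `probe_ids`;
# probes absent from the table yield NA rows rather than being dropped.
subset_annotation <- function(annot, probe_ids) {
  out <- annot[match(probe_ids, annot$probe_id), , drop = FALSE]
  out$probe_id <- probe_ids
  rownames(out) <- NULL
  out
}
```

A downstream analysis that has filtered the expression matrix to a set of probes can call subset_annotation() with those probe IDs instead of querying the annotation sources again.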


6. Save ExpressionSets and Annotations

Purpose: Persist preprocessed objects for downstream analysis.

Outputs:
- global_cancer_eset_list.RData (named list of chip-specific ExpressionSet objects)
- full_chip_annotations.rds (full annotation results for each chip)

These outputs serve as a stable preprocessing checkpoint and are reused
by the analysis pipeline.
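
The two file formats behave differently on reload, which the following sketch illustrates with placeholder objects in a temporary directory. The object name stored inside the .RData file and the list structure of the annotations are assumptions, not confirmed details of the pipeline.

```r
# Stand-in for output/global_cancer/RData
out_dir <- file.path(tempdir(), "RData")
dir.create(file.path(out_dir, "annotations"),
           recursive = TRUE, showWarnings = FALSE)

# Placeholders standing in for the real ExpressionSet list and annotations
global_cancer_eset_list <- list(hu35ksuba = "eset A", hu6800 = "eset B")
full_chip_annotations   <- list(hu35ksuba = data.frame(), hu6800 = data.frame())

eset_path  <- file.path(out_dir, "global_cancer_eset_list.RData")
annot_path <- file.path(out_dir, "annotations", "full_chip_annotations.rds")

save(global_cancer_eset_list, file = eset_path)  # .RData keeps the object name
saveRDS(full_chip_annotations, annot_path)       # .rds stores one unnamed object

# Downstream: load() restores objects by name into an environment,
# while readRDS() returns the object for assignment to any name.
env <- new.env()
loaded_names <- load(eset_path, envir = env)
annots <- readRDS(annot_path)
```

Loading into a dedicated environment avoids silently overwriting objects that share a name in the analysis session.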


Outputs

File                            Description
global_cancer_eset_list.RData   List of normalized ExpressionSet objects
full_chip_annotations.rds       Full-chip annotation results
preprocess_pipeline_log.txt     Main preprocessing log
skipped_chiptype.txt            CEL files with no detectable chip type
skipped_hu35ksuba_cel.txt       Invalid CEL files for hu35ksuba
skipped_hu6800_cel.txt          Invalid CEL files for hu6800

Customizing for Other Datasets

This pipeline can be adapted to other Affymetrix or GEO datasets by:

  • Updating GEO accession and FTP source
  • Adjusting chip-to-platform mappings
  • Modifying metadata parsing logic in
    clean_and_label_metadata.R

Testing and Validation

  • CEL file counts and chip distributions are logged

  • Metadata matching success is reported

  • Skipped files are recorded for inspection

  • Phenotype data can be inspected via:

    Biobase::pData(eset)

Directory Layout

/R/preprocessing/download_cel_files.R
/R/preprocessing/build_expression_sets.R
/R/preprocessing/attach_geo_metadata.R
/R/helpers/clean_and_label_metadata.R
/R/wrappers/run_annotate_chip_probes.R
/R/annotate/annotate_chip_probes.R

/output/global_cancer/logs/preprocess/
/output/global_cancer/RData/
/output/global_cancer/RData/annotations/


8.1 References

  • Source paper: Ramaswamy S, Tamayo P, Rifkin R, et al.
    Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures.
    Proc Natl Acad Sci USA 2001;98(26):15149–15154
    DOI:10.1073/pnas.211566398
  • GEO accession:
    GSE68928