9  Analysis Pipeline Overview

This pipeline processes microarray CEL files from public repositories: it generates cleaned ExpressionSet objects, identifies differentially expressed probe sets, annotates them, and runs pairwise gene set comparisons. Each stage can be toggled independently through its own flag in the config file.
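
To make the toggling concrete, here is a minimal sketch of a config with one logical flag per stage. The flag names are the ones given under "Controlled by" in the stage breakdown. The list layout, the values, and the should_run() helper are assumptions for illustration, not the pipeline's actual config format. The force_* flags presumably force a rebuild of cached output rather than simply enabling a stage; that distinction is not modeled here.

```r
# Hypothetical config: one logical flag per stage. Flag names come from this
# section; the values and the list structure are illustrative assumptions.
config <- list(
  force_rebuild_limma_filter = FALSE,
  force_run_annotation       = FALSE,
  run_visualization          = TRUE,
  force_rebuild_matrix_maps  = FALSE,
  run_pairwise               = TRUE,
  run_aggregator             = TRUE,
  run_reporter               = TRUE
)

# Hypothetical helper: TRUE only if the flag exists and is set to TRUE.
should_run <- function(config, flag) {
  isTRUE(config[[flag]])
}

# Names of all flags that are currently switched on.
is_on   <- vapply(names(config), function(f) should_run(config, f), logical(1))
enabled <- names(config)[is_on]
print(enabled)
```

Using isTRUE() means a missing or misspelled flag counts as "off" instead of raising an error, which may or may not be the behavior wanted in the real config.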


9.1 Stage-by-Stage Breakdown


9.1.1 Stage 1: Limma Filtering

Purpose: Identify differentially expressed probe sets per chip.

  • Controlled by: force_rebuild_limma_filter
  • Wrapper: run_chip_limma_filter.R
  • Labels: assigned using assign_labels_to_all.R
  • Parameters:
    • logFC > 0.33
    • p-value < 0.05
  • Comparison basis: <<e.g., cancer vs normal, batch comparisons>>
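
The two thresholds above can be sketched as a plain data-frame filter. This is not the code in run_chip_limma_filter.R. The toy table only borrows the logFC and P.Value column names that limma::topTable() produces, and filter_probes() is a hypothetical helper. Whether the real filter uses signed or absolute logFC, and raw or adjusted p-values, is not stated here, so both readings of the logFC threshold are shown.

```r
# Toy table in the shape of limma::topTable() output (values are made up).
results <- data.frame(
  probe_id = c("p1", "p2", "p3", "p4"),
  logFC    = c(0.50, -0.80, 0.20, 0.40),
  P.Value  = c(0.010, 0.001, 0.020, 0.200),
  stringsAsFactors = FALSE
)

# Hypothetical helper applying the two documented thresholds.
# absolute = FALSE reads "logFC > 0.33" literally (up-regulated only);
# absolute = TRUE also keeps strongly down-regulated probe sets.
filter_probes <- function(tab, lfc = 0.33, p = 0.05, absolute = FALSE) {
  effect <- if (absolute) abs(tab$logFC) else tab$logFC
  tab[effect > lfc & tab$P.Value < p, , drop = FALSE]
}

filter_probes(results)$probe_id                   # "p1"
filter_probes(results, absolute = TRUE)$probe_id  # "p1" "p2"
```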

9.1.2 Stage 2: Annotation of Probes

Purpose: Annotate filtered probe sets with gene-level metadata.

  • Controlled by: force_run_annotation
  • Script: run_annotate_all_filtered_probes.R
  • Output:
    • .RData with annotation object
    • Excel summary for human inspection
  • Annotation sources: <<e.g., hgu133plus2.db, Ensembl>>
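
As a rough illustration of the annotation step, the sketch below joins filtered probe sets to a gene-level lookup table and saves the result as .RData. The probe IDs, gene symbols, column names, and output path are all placeholders. How run_annotate_all_filtered_probes.R actually queries its annotation sources is not shown here, and the Excel export is left out because it needs an additional package.

```r
# Toy inputs (all IDs and symbols are placeholders, not real annotation data).
filtered <- data.frame(
  probe_id = c("p1", "p2", "p3"),
  logFC    = c(0.5, 0.9, 0.4),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  probe_id = c("p1", "p2"),
  symbol   = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Left join: keep every filtered probe set, with NA where no annotation exists.
annotated <- merge(filtered, annotation, by = "probe_id", all.x = TRUE)

# Save the annotation object as .RData (temporary placeholder path).
out_file <- file.path(tempdir(), "annotated_probes.RData")
save(annotated, file = out_file)
```

Keeping unannotated probe sets (all.x = TRUE) is a design choice made for this sketch so that nothing is silently dropped before inspection.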

9.1.3 Stage 3: Visualization

Purpose: Generate summary plots and diagnostic visuals.

  • Controlled by: run_visualization
  • Report source: filtered_probes_report.Rmd
  • Output: HTML report in output/reports/
  • Includes: <<e.g., volcano plots, PCA, heatmaps>>

9.1.4 Stage 4: Matrix + Comparison Map Construction

Purpose: Create chip- and tissue-specific expression matrices and predefined comparisons.

  • Controlled by: force_rebuild_matrix_maps
  • Config script: global_cancer_matrix_config_v2.R
  • Functions:
    • build_matrix_lists_by_tissue()
    • define_predefined_comparisons()
  • Output: globalCancer_matrix_config.RData
  • Notes: <<add logic used in defining comparisons>>
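
One way to picture this stage is to split a samples-by-tissue expression matrix into per-tissue sub-matrices and then enumerate tissue pairs. The sketch below does that on toy data. split_by_tissue() is a hypothetical helper, not build_matrix_lists_by_tissue(), and "every unordered pair of tissues" is only an assumed stand-in for whatever define_predefined_comparisons() really encodes.

```r
# Toy expression matrix: 3 probe sets x 4 samples (values are made up).
expr <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("p1", "p2", "p3"), c("s1", "s2", "s3", "s4"))
)
tissue <- c(s1 = "lung", s2 = "lung", s3 = "colon", s4 = "colon")

# Hypothetical helper: one sub-matrix per tissue, keeping all probe sets.
split_by_tissue <- function(expr, tissue) {
  cols <- split(colnames(expr), tissue[colnames(expr)])
  lapply(cols, function(ids) expr[, ids, drop = FALSE])
}

matrices <- split_by_tissue(expr, tissue)

# Assumed comparison map: every unordered pair of tissues, one pair per row.
comparisons <- t(combn(names(matrices), 2))
```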

9.1.5 Stage 5: Pairwise Comparisons

Purpose: Apply gene set metrics across all comparison pairs.

  • Controlled by: run_pairwise
  • Script: run_all_pairwise_comparisons_v3.R
  • Parameters:
    • Chips: hu35ksuba, hu6800
    • Engines: complexity, entropy
    • Gene sets: KEGG_Q75
    • Minimum probes: 5
  • Interpretation: <<how pairwise scores are used>>
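
The parameters above define a small run grid. The sketch below enumerates it and applies the minimum-probe cutoff to toy gene sets. The gene set names and members are placeholders, and reading "Minimum probes: 5" as "at least 5 probes" is an assumption. The engines themselves are not implemented here.

```r
# Parameter values as listed above.
chips      <- c("hu35ksuba", "hu6800")
engines    <- c("complexity", "entropy")
gene_sets  <- "KEGG_Q75"
min_probes <- 5

# One row per chip x engine x gene-set collection.
runs <- expand.grid(
  chip = chips, engine = engines, gene_set = gene_sets,
  stringsAsFactors = FALSE
)

# Toy gene sets mapped to probe IDs (names and members are placeholders).
sets <- list(set_a = paste0("p", 1:8), set_b = paste0("p", 1:3))

# Keep only gene sets with at least min_probes probes (assumed reading).
usable <- sets[lengths(sets) >= min_probes]
```

With two chips, two engines, and one gene set collection, the grid has four rows, each of which would then be evaluated over the predefined comparison pairs.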

9.1.6 Stage 6: Aggregation + Gene Set Annotation

Purpose: Aggregate engine results and annotate gene sets.

  • Controlled by: run_aggregator
  • Scripts:
    • aggregate_engine_results_by_engine.R
    • helpers/gene_set_tools.R
  • Output:
    • .csv and .rds files for each engine
  • Annotation: attach_gene_set_names()
  • Notes: <<GO terms? KEGG? How annotation improves interpretability?>>
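
The per-engine output can be sketched as a split-and-write loop. This is a minimal illustration on made-up scores, not the logic of aggregate_engine_results_by_engine.R. The column names, file names, and the temporary output directory are assumptions.

```r
# Toy combined results (scores are made up).
results <- data.frame(
  engine   = c("complexity", "complexity", "entropy"),
  gene_set = c("set_a", "set_b", "set_a"),
  score    = c(0.12, 0.30, 0.45),
  stringsAsFactors = FALSE
)

# Temporary placeholder directory for the sketch.
out_dir <- file.path(tempdir(), "aggregated")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write one .csv and one .rds per engine.
by_engine <- split(results, results$engine)
for (eng in names(by_engine)) {
  write.csv(by_engine[[eng]], file.path(out_dir, paste0(eng, ".csv")),
            row.names = FALSE)
  saveRDS(by_engine[[eng]], file.path(out_dir, paste0(eng, ".rds")))
}
```

The .rds copy preserves R column types exactly, while the .csv copy is the one suited to inspection outside R.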

9.1.7 Stage 7: Summary Reports

Purpose: Generate text and CSV summaries for downstream reporting.

  • Controlled by: run_reporter
  • Scripts:
    • summarize_pairwise_results_v2.R
    • clean_aggregated_results.R
  • Output:
    • Cleaned results: output/cleaned_results/
    • Text summaries: output/reports/
  • Example contents: <<Top pathways? Most variable gene sets?>>