9 Analysis Pipeline Overview

This pipeline processes microarray CEL files from public repositories to generate cleaned ExpressionSet objects, identify differentially expressed probe sets, annotate them, and run pairwise gene set comparisons. Each stage can be toggled independently through its own flag in the config file; the flag is listed under "Controlled by" in each stage below.
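
As a rough illustration of how the stage toggles might be wired, the sketch below assumes the config is an R list of logical flags. The flag names are the ones listed under "Controlled by" for each stage; the `config` list, the `stage_flags` mapping, and the `enabled_stages()` helper are hypothetical and not taken from the pipeline.

```r
# Hypothetical config: one logical flag per stage (flag names from this section).
config <- list(
  force_rebuild_limma_filter = TRUE,
  force_run_annotation       = TRUE,
  run_visualization          = FALSE,
  force_rebuild_matrix_maps  = TRUE,
  run_pairwise               = TRUE,
  run_aggregator             = TRUE,
  run_reporter               = FALSE
)

# Assumed mapping from stage label to its controlling flag.
stage_flags <- c(
  "Stage 1: Limma Filtering"                      = "force_rebuild_limma_filter",
  "Stage 2: Annotation of Probes"                 = "force_run_annotation",
  "Stage 3: Visualization"                        = "run_visualization",
  "Stage 4: Matrix + Comparison Map Construction" = "force_rebuild_matrix_maps",
  "Stage 5: Pairwise Comparisons"                 = "run_pairwise",
  "Stage 6: Aggregation + Gene Set Annotation"    = "run_aggregator",
  "Stage 7: Summary Reports"                      = "run_reporter"
)

# Return the labels of the stages whose flag is TRUE.
enabled_stages <- function(config, stage_flags) {
  is_on <- vapply(stage_flags, function(flag) isTRUE(config[[flag]]), logical(1))
  names(stage_flags)[is_on]
}

print(enabled_stages(config, stage_flags))
```

Using `isTRUE()` means a missing or misspelled flag is treated as "off" rather than raising an error, which may or may not match how the real config is read.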

9.1 Stage-by-Stage Breakdown

9.1.1 Stage 1: Limma Filtering

Purpose: Identify differentially expressed probe sets per chip.

Controlled by: force_rebuild_limma_filter
Wrapper: run_chip_limma_filter.R
Labels: assigned using assign_labels_to_all.R
Parameters:
- logFC > 0.33
- p-value < 0.05
Comparison basis: <<e.g., cancer vs normal, batch comparisons>>
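
The two thresholds above can be sketched as a simple row filter. This is a minimal base R example, not the contents of run_chip_limma_filter.R: it assumes the results table uses the `logFC` and `P.Value` column names that `limma::topTable()` produces, and the `top_table` values and the `filter_probes()` helper are invented for illustration.

```r
# Toy results table with limma::topTable()-style column names; values are made up.
top_table <- data.frame(
  probe_id = c("p1", "p2", "p3", "p4"),
  logFC    = c(0.80, 0.10, 0.50, -0.90),
  P.Value  = c(0.001, 0.010, 0.200, 0.003),
  stringsAsFactors = FALSE
)

# Keep probe sets that pass both thresholds from the parameter list.
filter_probes <- function(tt, lfc_cutoff = 0.33, p_cutoff = 0.05) {
  keep <- tt$logFC > lfc_cutoff & tt$P.Value < p_cutoff
  tt[keep, , drop = FALSE]
}

filtered <- filter_probes(top_table)
print(filtered$probe_id)
```

Read literally, `logFC > 0.33` keeps only up-regulated probe sets, so `p4` is dropped here despite its small p-value. If down-regulated probe sets are also of interest, the test would be on `abs(tt$logFC)`; whether the pipeline does this is not stated in this section.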

9.1.2 Stage 2: Annotation of Probes

Purpose: Annotate filtered probe sets with gene-level metadata.

Controlled by: force_run_annotation
Script: run_annotate_all_filtered_probes.R
Output:
- .RData with annotation object
- Excel summary for human inspection
Annotation sources: <<e.g., hgu133plus2.db, Ensembl>>
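
A minimal sketch of the annotation step, assuming it amounts to joining the filtered probe sets to a probe-to-gene lookup table. The `annotation` table, its column names, and the output file name are hypothetical; the real script may query an annotation package instead. The Excel summary is not shown because it would need an extra package.

```r
# Filtered probe sets (toy values) and a hypothetical probe-to-gene lookup table.
filtered <- data.frame(
  probe_id = c("p1", "p2", "p3"),
  logFC    = c(0.80, 0.45, 0.60),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  probe_id = c("p1", "p2"),
  symbol   = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Left join: keep every filtered probe set, with NA where no annotation exists.
annotated <- merge(filtered, annotation, by = "probe_id", all.x = TRUE)

# Save the annotation object as .RData (written to a temp dir in this sketch).
out_file <- file.path(tempdir(), "annotated_probes.RData")
save(annotated, file = out_file)

print(annotated)
```

Using `all.x = TRUE` keeps unannotated probe sets visible in the output rather than silently dropping them.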

9.1.3 Stage 3: Visualization

Purpose: Generate summary plots and diagnostic visuals.

Controlled by: run_visualization
Report: filtered_probes_report.Rmd
Output: HTML report in output/reports/
Includes: <<e.g., volcano plots, PCA, heatmaps>>
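
One way the report could be rendered is with `rmarkdown::render()`. The sketch below assumes the .Rmd file sits in the working directory, which this section does not confirm, and the `render_report()` wrapper is hypothetical. It skips rendering when the rmarkdown package or the file is not available.

```r
# Render the report to the output directory, or skip quietly if it cannot run.
render_report <- function(rmd = "filtered_probes_report.Rmd",
                          out_dir = "output/reports") {
  if (!requireNamespace("rmarkdown", quietly = TRUE) || !file.exists(rmd)) {
    message("Skipping render: rmarkdown or ", rmd, " is not available")
    return(invisible(NULL))
  }
  # The output format is left to the YAML header of the .Rmd file.
  rmarkdown::render(rmd, output_dir = out_dir)
}

render_report()
```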

9.1.4 Stage 4: Matrix + Comparison Map Construction

Purpose: Build chip- and tissue-specific expression matrices and the map of predefined comparisons between them.

Controlled by: force_rebuild_matrix_maps
Config script: global_cancer_matrix_config_v2.R
Functions:
- build_matrix_lists_by_tissue()
- define_predefined_comparisons()
Output: globalCancer_matrix_config.RData
Notes: <<add logic used in defining comparisons>>
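
To make the idea concrete, the sketch below splits a toy expression matrix by tissue and enumerates comparison pairs. It is not the code behind build_matrix_lists_by_tissue() or define_predefined_comparisons(), whose signatures are not given here. The helpers `split_by_tissue()` and `all_pairs()`, the tissue labels, and the all-pairs rule are assumptions; the real pipeline uses a predefined list of comparisons.

```r
# Toy expression matrix: probes in rows, samples in columns (values are arbitrary).
expr <- matrix(
  seq_len(24), nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("s", 1:6))
)
# Hypothetical per-sample tissue labels, in column order.
tissue <- c("lung", "lung", "colon", "colon", "breast", "breast")

# Split the columns of a matrix into one sub-matrix per tissue.
split_by_tissue <- function(expr, tissue) {
  stopifnot(ncol(expr) == length(tissue))
  lapply(split(seq_along(tissue), tissue),
         function(idx) expr[, idx, drop = FALSE])
}

# Enumerate all unordered tissue pairs as a two-column comparison map.
all_pairs <- function(groups) {
  pairs <- t(combn(sort(unique(groups)), 2))
  data.frame(group_a = pairs[, 1], group_b = pairs[, 2],
             stringsAsFactors = FALSE)
}

matrices    <- split_by_tissue(expr, tissue)
comparisons <- all_pairs(tissue)

print(names(matrices))
print(comparisons)
```

Both objects could then be stored together with `save(matrices, comparisons, file = ...)`, although the object names inside globalCancer_matrix_config.RData are not specified in this section.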

9.1.5 Stage 5: Pairwise Comparisons

Purpose: Apply gene set metrics across all comparison pairs.

Controlled by: run_pairwise
Script: run_all_pairwise_comparisons_v3.R
Parameters:
- Chips: hu35ksuba, hu6800
- Engines: complexity, entropy
- Gene sets: KEGG_Q75
- Minimum probes: 5
Interpretation: <<how pairwise scores are used>>
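
The parameters above imply a small grid of runs. The sketch below expands that grid and applies the minimum-probe threshold. The `params` list structure and the example `gene_sets` are hypothetical, and it is an assumption that "Minimum probes: 5" means gene sets with fewer than 5 probes are skipped.

```r
# Parameter values as listed above; the list structure itself is an assumption.
params <- list(
  chips      = c("hu35ksuba", "hu6800"),
  engines    = c("complexity", "entropy"),
  gene_sets  = "KEGG_Q75",
  min_probes = 5
)

# One row per chip x engine x gene set collection.
run_grid <- expand.grid(
  chip     = params$chips,
  engine   = params$engines,
  gene_set = params$gene_sets,
  stringsAsFactors = FALSE
)

# Hypothetical gene sets (names and probe IDs are invented for illustration).
gene_sets <- list(
  set_a = paste0("p", 1:8),
  set_b = paste0("p", 1:3),
  set_c = paste0("p", 4:9)
)

# Drop gene sets that have fewer than min_probes probes.
usable_sets <- gene_sets[lengths(gene_sets) >= params$min_probes]

print(run_grid)
print(names(usable_sets))
```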

9.1.6 Stage 6: Aggregation + Gene Set Annotation

Purpose: Aggregate engine results and annotate gene sets.

Controlled by: run_aggregator
Scripts:
- aggregate_engine_results_by_engine.R
- helpers/gene_set_tools.R
Output:
- .csv and .rds files for each engine
Annotation: attach_gene_set_names()
Notes: <<GO terms? KEGG? How annotation improves interpretability?>>
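
A minimal sketch of per-engine aggregation output, assuming the combined results are one data frame with an `engine` column. The column names, the `set_names` lookup, the `write_by_engine()` helper, and the file names are hypothetical; the name lookup is only a stand-in for attach_gene_set_names(), whose signature is not given here.

```r
# Toy pairwise results; column names and scores are invented for illustration.
results <- data.frame(
  engine      = c("complexity", "complexity", "entropy", "entropy"),
  gene_set_id = c("gs1", "gs2", "gs1", "gs2"),
  score       = c(0.42, 0.17, 1.30, 0.95),
  stringsAsFactors = FALSE
)
# Hypothetical ID-to-name lookup.
set_names <- c(gs1 = "Pathway one", gs2 = "Pathway two")

# Add a readable name column by ID lookup (stand-in for the annotation step).
results$gene_set_name <- unname(set_names[results$gene_set_id])

# Write one .csv and one .rds per engine; returns the .csv paths.
write_by_engine <- function(results, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(split(results, results$engine), function(df) {
    stem <- file.path(out_dir, paste0("aggregated_", df$engine[1]))
    write.csv(df, paste0(stem, ".csv"), row.names = FALSE)
    saveRDS(df, paste0(stem, ".rds"))
    paste0(stem, ".csv")
  }, character(1))
}

csv_paths <- write_by_engine(results, file.path(tempdir(), "aggregated"))
print(basename(csv_paths))
```

Writing both formats keeps a human-readable copy (.csv) alongside one that preserves R column types (.rds).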

9.1.7 Stage 7: Summary Reports

Purpose: Generate text and CSV summaries for downstream reporting.

Controlled by: run_reporter
Scripts:
- summarize_pairwise_results_v2.R
- clean_aggregated_results.R
Output:
- Cleaned results: output/cleaned_results/
- Text summaries: output/reports/
Example contents: <<Top pathways? Most variable gene sets?>>