7  Filtering of Microarrays

Gene expression studies often involve measuring tens of thousands of probes (e.g., microarray features or RNA-seq transcripts) across a relatively small number of biological samples. This high-dimensional, low-sample-size setting poses significant statistical and biological challenges.

Firstly, the large number of probes compared to the number of samples increases the risk of false positives due to multiple hypothesis testing. Without filtering, many probes that show random fluctuations due to noise rather than true biological differences may appear significant, leading to misleading results.

Secondly, gene expression is inherently stochastic. Expression levels can vary due to technical variability, cell-to-cell differences, and transient transcriptional activity. Many probes may exhibit low or inconsistent expression across samples, making them unreliable for downstream analysis.

Filtering helps mitigate these issues by removing probes with low overall expression, low variability, or poor detection across samples. This reduces noise and focuses the analysis on more robust, biologically relevant signals, improving statistical power and the interpretability of results. Overall, filtering is a crucial step for enhancing data quality and ensuring meaningful biological insights.

7.1 Background to limma filtering

Limma filtering refers to the use of the limma (Linear Models for Microarray Data) package in R, which provides a comprehensive framework for analyzing gene expression data, together with the filtering steps that usually precede the analysis. Limma filtering typically involves removing probes that are not reliably expressed before statistical modeling. For RNA-seq counts this is often done with filterByExpr(), a function from the companion edgeR package that is commonly used in limma workflows; for microarrays it is done by applying thresholds on expression intensity, variability, or detection p-values.

7.1.1 Key aspects of limma filtering:

  • It is data-driven and informed by the design matrix, ensuring that retained probes have sufficient expression in at least one group or condition of interest.

  • It works well with both microarray and RNA-seq data.

  • It integrates seamlessly with downstream linear modeling and differential expression analysis.
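The design-aware idea in the first point above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the algorithm used by limma or filterByExpr(). The probe names, expression values, group labels, and the cutoff of 5.0 are all hypothetical. The assumed rule is: keep a probe if it is expressed above the cutoff in at least as many samples as the smallest group contains.

```python
from collections import Counter
from statistics import mean

# Hypothetical group label for each of six samples.
groups = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

# Hypothetical log2 expression values: probe ID -> one value per sample.
expr = {
    "probe_a": [2.0, 2.1, 1.9, 7.5, 7.8, 7.6],  # expressed only in tumor
    "probe_b": [2.0, 2.2, 2.1, 2.0, 6.9, 2.1],  # one sporadic high value
    "probe_c": [8.0, 8.2, 7.9, 8.1, 8.0, 8.3],  # expressed in every sample
}


def keep_expressed(expr, groups, min_expr):
    """Keep probes expressed at or above min_expr in at least as many
    samples as the smallest group contains (an assumed, simplified rule)."""
    min_group_size = min(Counter(groups).values())
    return {
        probe
        for probe, values in expr.items()
        if sum(v >= min_expr for v in values) >= min_group_size
    }


# Design-aware rule: probe_a survives because it is expressed in a whole group.
design_kept = keep_expressed(expr, groups, min_expr=5.0)

# Global mean cutoff for comparison: probe_a is lost (its mean is below 5.0).
mean_kept = {probe for probe, values in expr.items() if mean(values) >= 5.0}
```

In this toy example the design-aware rule retains probe_a, which is expressed in one condition only, whereas the global mean cutoff discards it.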

7.1.2 Comparison with other filtering approaches:

  • Simple expression cutoffs: Some methods filter probes based on arbitrary thresholds (e.g., removing probes with mean expression below a fixed value). While straightforward, these cutoffs may not account for experimental design or variability across conditions.

  • Variance-based filtering: Filters based on variability across samples (e.g., retaining top X% most variable genes). This improves focus on dynamic genes but can miss low-variance but biologically significant ones.

  • Detection p-values or Present/Absent calls (microarrays): Filters out probes not reliably detected, but these methods may be overly conservative and platform-dependent.
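The first two approaches above can be sketched in a few lines. This is an illustrative Python example under stated assumptions, not code from the analysis: the probe names, expression values, the mean cutoff of 5.0, and the 50% retention fraction are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical log2 expression values: probe ID -> one value per sample.
expr = {
    "probe_a": [2.1, 2.0, 2.2, 2.1],   # low expression, low variance
    "probe_b": [8.0, 8.1, 7.9, 8.0],   # high expression, low variance
    "probe_c": [6.0, 9.5, 5.5, 10.0],  # high expression, high variance
    "probe_d": [3.0, 6.0, 2.5, 6.5],   # moderate expression, high variance
}


def filter_by_mean(expr, min_mean):
    """Simple expression cutoff: keep probes with mean >= min_mean."""
    return {probe for probe, values in expr.items() if mean(values) >= min_mean}


def filter_by_variance(expr, top_fraction):
    """Variance-based filter: keep the top_fraction most variable probes."""
    ranked = sorted(expr, key=lambda probe: pvariance(expr[probe]), reverse=True)
    n_keep = max(1, round(len(ranked) * top_fraction))
    return set(ranked[:n_keep])


kept_by_mean = filter_by_mean(expr, min_mean=5.0)
kept_by_variance = filter_by_variance(expr, top_fraction=0.5)
```

Here the variance filter drops probe_b, a stably and highly expressed probe, which illustrates how a variance-only rule can discard low-variance probes regardless of their biological relevance.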

7.1.3 Advantages of limma filtering:

  • More principled and reproducible than arbitrary thresholds.

  • Balances sensitivity and specificity by considering both expression and design context.

  • Enhances downstream statistical modeling by removing uninformative probes early.

In summary, limma filtering offers a statistically sound and flexible approach that improves the reliability and power of gene expression analyses compared to more naive or rigid filtering methods.

7.2 Dynamic Filtering of GO Molecular Function Terms and KEGG Pathways

The next level of comparison involved GO Molecular Function terms and KEGG pathways for each comparison on each chip. The tables below summarize the number of probes per term on each chip.

7.2.0.1 KEGG Pathways

Statistic      hu35ksuba   hu6800
Median         18          41.5
3rd Quartile   36          77.5
Mean           28.4        55.9
Max            427         676

Interpretation: KEGG pathways typically contain tens of probes, with medians of 18 (hu35ksuba) and 41.5 (hu6800) and third quartiles of 36 and 77.5. Very sparse pathways exist but are uncommon.

7.2.0.2 GO Molecular Function

Statistic      hu35ksuba   hu6800
Median         2           2
3rd Quartile   6           6
Mean           ~12         ~12
Max            ~6400       ~5300

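Interpretation: Most GO terms are tiny, with a median of just 2 probes on both chips, while a few very large terms pull the mean up to about 12. A hard minimum-size threshold is therefore needed so that terms too small to be informative do not add noise.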
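The size summaries above, and the way a minimum-size threshold acts on them, can be sketched as follows. This is a minimal Python illustration with hypothetical term names and probe memberships. How a quantile-based threshold and a hard floor were actually combined in the analysis is not stated, so the rule used here (the larger of a hard floor of 5 and the median term size) is an assumption.

```python
from statistics import mean, median, quantiles

# Hypothetical annotation: term ID -> set of probes annotated to that term.
term_probes = {
    "term_1": {"p1"},
    "term_2": {"p1", "p2"},
    "term_3": {"p2", "p3"},
    "term_4": {"p1", "p2", "p3", "p4", "p5", "p6"},
    "term_5": {f"p{i}" for i in range(1, 41)},  # one very large term
}

# Number of probes per term, as summarized in the tables above.
sizes = sorted(len(probes) for probes in term_probes.values())

summary = {
    "median": median(sizes),
    "q3": quantiles(sizes, n=4, method="inclusive")[2],
    "mean": mean(sizes),
    "max": max(sizes),
}

# Assumed rule: the data-driven threshold (median size) never drops
# below a fixed hard floor.
HARD_FLOOR = 5
min_size = max(HARD_FLOOR, summary["median"])

kept_terms = {term for term, probes in term_probes.items() if len(probes) >= min_size}
```

With these toy sizes the median is 2, so a purely quantile-based threshold would keep nearly every term; the hard floor is what removes the one- and two-probe terms.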

7.3 Filtered Probe Summary

Across two Affymetrix platforms (hu35ksuba and hu6800), dynamic filtering thresholds were applied using quantile metrics on GO and KEGG pathway memberships. Probe retention varied substantially by cancer type and chip.

7.4 Application of limma filtering to data set

Limma filtering was applied to the fifteen normal/tumor comparisons using a log fold-change (logFC) threshold of 1 and a p-value threshold of 0.05 (p-value <= 0.05). The numbers of probes returned per normal/tumor comparison after limma filtering are listed in the tables below, grouped by tumor category:
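The thresholding step described above can be sketched as follows. This Python example is illustrative only and does not reproduce the limma call used in the analysis. The probe names and values are hypothetical, and two points are assumptions: that the logFC threshold applies to the absolute value (so that both up- and down-regulated probes are retained), and that the p-value column is used as given, since the text does not say whether raw or adjusted p-values were thresholded.

```python
# Hypothetical per-probe results for one normal/tumor comparison.
# Field names are illustrative, not the column names of any limma output.
results = [
    {"probe": "probe_a", "logFC": 2.3, "p_value": 0.001},    # passes both
    {"probe": "probe_b", "logFC": -1.4, "p_value": 0.030},   # passes both (down)
    {"probe": "probe_c", "logFC": 0.4, "p_value": 0.0005},   # fold change too small
    {"probe": "probe_d", "logFC": 1.8, "p_value": 0.200},    # not significant
]


def select_probes(results, min_abs_logfc=1.0, max_p=0.05):
    """Return probes meeting both the fold-change and p-value thresholds."""
    return [
        row["probe"]
        for row in results
        if abs(row["logFC"]) >= min_abs_logfc and row["p_value"] <= max_p
    ]


selected = select_probes(results)
n_selected = len(selected)  # the per-comparison count reported in the tables
```

Requiring both criteria means that a probe with a very small p-value but a modest fold change (probe_c) is excluded, as is a probe with a large but unreliable fold change (probe_d).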

7.5 Filtered Probes

7.5.1 Carcinomas

⚠️ Table not available.

7.5.2 Blastomas

⚠️ Table not available.

7.5.3 Leukemias

⚠️ Table not available.

7.5.4 Lymphomas

⚠️ Table not available.

7.5.4.1 Interpretation of Filtered Probe Counts Across Comparisons

A high number of probes retained after limma filtering in a normal vs. tumor comparison suggests a substantial degree of differential gene expression, implying that many genes meet both the fold-change and statistical significance criteria. This typically indicates a pronounced molecular divergence between normal and tumor tissues, potentially due to disrupted regulatory programs, altered signaling pathways, or extensive reprogramming of the transcriptome in tumor cells.

In contrast, comparisons yielding fewer filtered probes may reflect one or more of the following:

  • Biological similarity between normal and tumor tissues (e.g., early-stage tumors or tissues with inherently low transcriptional diversity),

  • Increased inter-sample variability, which can dilute statistical power,

  • Platform sensitivity differences (as seen in some GPL80 results missing certain comparisons),

  • Or more complex expression patterns not well captured by standard differential expression filters (e.g., widespread but modest shifts rather than sharp, localized changes).

It is also notable that certain comparisons—such as PB/B-ALL and PB/T-ALL—retain an exceptionally high number of probes across both platforms. This may reflect the distinct nature of hematologic malignancies, where malignant transformation often results in global transcriptional shifts tied to lineage commitment, proliferation, or immune regulation.

Conversely, comparisons like LU/LUAD or PR/PRAD retain very few probes, potentially due to transcriptomic similarity between normal and tumor tissues in those contexts, or perhaps due to lower-quality input data, technical limitations, or biological heterogeneity.

Lastly, missing values (e.g., in lymphomas on the hu6800 platform) should not be interpreted as an absence of differential expression, but rather as platform coverage limitations—emphasizing the importance of integrating platform metadata (e.g., probe content and annotation depth) in interpretation.

7.6 Shared probes within a tumor category

To assess how closely the tumors within a cancer category resemble each other, the number of filtered probes shared across all comparisons in that category was counted. The results are presented in the table below:

Number of Filtered Probes Shared Across a Tumor Category

Tumor Category   Number of Shared Probes (hu35ksuba)   Number of Shared Probes (hu6800)
Carcinomas       0                                     0
Blastomas        51                                    47
Leukemias        354                                   328
Lymphomas        181                                   NA
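Counting shared probes amounts to intersecting the filtered probe sets of every comparison in a category. The Python sketch below illustrates this with hypothetical comparison names and probe IDs; it is not the code used to produce the table above.

```python
# Hypothetical filtered probe sets for the comparisons in one tumor category.
filtered = {
    "comparison_1": {"p1", "p2", "p3", "p4"},
    "comparison_2": {"p2", "p3", "p4", "p5"},
    "comparison_3": {"p3", "p4", "p6"},
}


def shared_probes(probe_sets):
    """Return the probes present in every set (empty if there are no sets)."""
    sets = list(probe_sets)
    if not sets:
        return set()
    return set.intersection(*sets)


shared = shared_probes(filtered.values())
n_shared = len(shared)

# The shared count can never exceed the size of the smallest filtered set.
smallest = min(len(probes) for probes in filtered.values())
```

Because an intersection is bounded by its smallest member, a single comparison that retains very few probes caps the shared count for the whole category, which is worth keeping in mind when reading low or zero values.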

7.6.1 Interpretation of Shared Probe Counts

The number of shared probes across comparisons within a tumor category reflects the consistency and overlap of differentially expressed genes detected across normal–tumor comparisons of related tissue types. High shared probe counts suggest that similar molecular changes recur across comparisons, possibly due to shared oncogenic mechanisms, common cell-of-origin features, or conserved dysregulation of specific pathways.

In this analysis, leukemias and lymphomas exhibit substantial numbers of shared probes (354 on hu35ksuba and 328 on hu6800 for leukemias; 181 for lymphomas on hu35ksuba, GPL98). This pattern may reflect the relative homogeneity of hematologic malignancies, where transformation processes frequently affect conserved transcriptional programs related to hematopoiesis, immune signaling, and proliferation.

By contrast, carcinomas exhibit no shared probes across comparisons, suggesting one or more of the following:

  • Distinct molecular landscapes between carcinoma subtypes (e.g., breast vs. colon),

  • Greater biological heterogeneity among epithelial tumors,

  • Or divergence in sample origin, histology, or progression stage, which limits the overlap of detected differentially expressed genes.

Blastomas show modest but consistent probe sharing, suggesting some convergence in dysregulated developmental pathways typical of embryonal tumors, though less striking than in leukemias.

Notably, missing values (e.g., “NA” for lymphomas on the GPL80/hu6800 platform) do not indicate an absence of shared signal. They reflect incomplete platform coverage or the absence of corresponding comparisons. Probe overlap should therefore be interpreted in light of both biological context and technical design constraints, such as probe composition and annotation depth across microarray platforms.