7 Filtering of Microarrays
Gene expression studies often involve measuring tens of thousands of probes (e.g., microarray features or RNA-seq transcripts) across a relatively small number of biological samples. This high-dimensional, low-sample-size setting poses significant statistical and biological challenges.
Firstly, the large number of probes compared to the number of samples increases the risk of false positives due to multiple hypothesis testing. Without filtering, many probes that show random fluctuations due to noise rather than true biological differences may appear significant, leading to misleading results.
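To make the scale of the multiple-testing problem concrete, the sketch below computes the number of false positives expected by chance and applies the Benjamini-Hochberg adjustment to a list of p-values. The function names and the example numbers are illustrative assumptions, not part of the analysis described here.

```python
def expected_false_positives(n_probes: int, alpha: float) -> float:
    """Expected number of probes with p <= alpha if no probe is truly changed."""
    return n_probes * alpha


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical chip with 10,000 probes tested at alpha = 0.05:
    # roughly 500 probes would pass by chance alone.
    print(expected_false_positives(10_000, 0.05))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Removing uninformative probes before testing shrinks the number of hypotheses, so an adjustment of this kind penalizes the remaining probes less.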
Secondly, gene expression is inherently stochastic. Expression levels can vary due to technical variability, cell-to-cell differences, and transient transcriptional activity. Many probes may exhibit low or inconsistent expression across samples, making them unreliable for downstream analysis.
Filtering helps mitigate these issues by removing probes with low overall expression, low variability, or poor detection across samples. This reduces noise and focuses the analysis on more robust, biologically relevant signals. Because fewer hypotheses are tested, the multiple-testing correction is also less severe, which improves statistical power and the interpretability of results.
7.1 Background to limma filtering
Limma filtering refers to probe filtering performed within a workflow built on limma (Linear Models for Microarray Data), an R/Bioconductor package that provides a comprehensive framework for analyzing gene expression data. It typically involves removing probes that are not reliably expressed before statistical modeling. This is often done with filterByExpr() from the companion edgeR package (for RNA-seq counts) or by applying thresholds on expression intensity, variability, or detection p-values (for microarrays).
7.1.1 Key aspects of limma filtering:
- It is data-driven and informed by the design matrix, ensuring that retained probes have sufficient expression in at least one group or condition of interest.
- It works well with both microarray and RNA-seq data.
- It integrates seamlessly with downstream linear modeling and differential expression analysis.
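The design-aware idea in the first point can be illustrated with a simplified rule: keep a probe if it is expressed in at least as many samples as the smallest experimental group contains. This is a minimal sketch of the principle only, not the algorithm used by filterByExpr(); the function name, the threshold, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def design_aware_filter(expr, groups, min_expr):
    """Keep probes expressed in at least `n` samples, where `n` is the
    size of the smallest group in the design.

    expr     -- dict mapping probe ID to a list of expression values,
                one value per sample
    groups   -- list of group labels, one per sample, in the same order
    min_expr -- expression value at or above which a probe counts as expressed
    """
    min_group_size = min(Counter(groups).values())
    kept = []
    for probe, values in expr.items():
        n_expressed = sum(value >= min_expr for value in values)
        if n_expressed >= min_group_size:
            kept.append(probe)
    return kept


if __name__ == "__main__":
    # Toy data: two normal and two tumor samples (hypothetical values).
    groups = ["normal", "normal", "tumor", "tumor"]
    expr = {
        "probe_a": [8.0, 9.0, 2.0, 1.0],  # expressed in normals only
        "probe_b": [1.0, 2.0, 1.0, 2.0],  # never expressed
        "probe_c": [1.0, 1.0, 7.0, 8.0],  # expressed in tumors only
    }
    print(design_aware_filter(expr, groups, min_expr=5.0))
```

A probe that is switched on in only one condition survives this rule, whereas a filter on the overall mean could discard it.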
7.1.2 Comparison with other filtering approaches:
- Simple expression cutoffs: Some methods filter probes based on arbitrary thresholds (e.g., removing probes with mean expression below a fixed value). While straightforward, these cutoffs may not account for experimental design or variability across conditions.
- Variance-based filtering: Filters based on variability across samples (e.g., retaining top X% most variable genes). This improves focus on dynamic genes but can miss low-variance but biologically significant ones.
- Detection p-values or Present/Absent calls (microarrays): Filters out probes not reliably detected, but these methods may be overly conservative and platform-dependent.
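For comparison, the first two alternatives above can be sketched in a few lines. The cutoff value, the retained fraction, and the toy data are hypothetical and chosen only to show how each rule behaves.

```python
import math
from statistics import mean, variance


def filter_by_mean(expr, min_mean):
    """Simple cutoff: keep probes whose mean expression is at least min_mean."""
    return [probe for probe, values in expr.items() if mean(values) >= min_mean]


def top_variable(expr, fraction):
    """Variance filter: keep the given fraction of probes with the highest
    sample variance (at least one probe is always kept)."""
    ranked = sorted(expr, key=lambda probe: variance(expr[probe]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    expr = {
        "probe_a": [1.0, 1.0, 1.0, 1.0],  # low and flat
        "probe_b": [1.0, 9.0, 1.0, 9.0],  # highly variable
        "probe_c": [5.0, 6.0, 5.0, 6.0],  # high but stable
        "probe_d": [2.0, 2.0, 3.0, 2.0],  # low, slightly variable
    }
    print(filter_by_mean(expr, min_mean=2.0))
    print(top_variable(expr, fraction=0.5))
```

Neither rule looks at the group labels, which is the limitation noted above for fixed cutoffs.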
7.1.3 Advantages of limma filtering:
- More principled and reproducible than arbitrary thresholds.
- Balances sensitivity and specificity by considering both expression and design context.
- Enhances downstream statistical modeling by removing uninformative probes early.
In summary, limma filtering offers a statistically sound and flexible approach that improves the reliability and power of gene expression analyses compared to more naive or rigid filtering methods.
7.2 Dynamic Filtering of GO Molecular Function Terms and KEGG Pathways
The next level of comparisons involved GO Molecular Function Terms and KEGG Pathways for each comparison per chip.
7.2.0.1 KEGG Pathways
| Statistic | hu35ksuba | hu6800 |
|---|---|---|
| Median | 18 | 41.5 |
| 3rd Quartile | 36 | 77.5 |
| Mean | 28.4 | 55.9 |
| Max | 427 | 676 |
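Interpretation: KEGG pathways typically contain tens of probes. The median is 18 on hu35ksuba and 41.5 on hu6800, and three quarters of pathways have at most 36 and 77.5 probes, respectively. Very sparse pathways exist but are uncommon.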
7.2.0.2 GO Molecular Function
| Statistic | hu35ksuba | hu6800 |
|---|---|---|
| Median | 2 | 2 |
| 3rd Quartile | 6 | 6 |
| Mean | ~12 | ~12 |
| Max | ~6400 | ~5300 |
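Interpretation: Most GO Molecular Function terms are very small. On both chips the median is just 2 probes and the third quartile is 6, while the mean (~12) is pulled up by a few very large terms. A hard minimum-size threshold is therefore needed to keep terms with only one or two probes from adding noise.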
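One way to turn these size distributions into a filter is to derive the cutoff from a quantile of the term sizes and combine it with a hard floor. The sketch below is an assumption about how such a rule could look, not the procedure used in this analysis; the quantile level `q`, the `floor` value, and the term names are all hypothetical.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def filter_terms(term_to_probes, q=0.5, floor=5):
    """Keep terms whose probe count reaches max(floor, q-th quantile of sizes).

    Returns the threshold that was applied and a dict of kept term sizes.
    """
    sizes = {term: len(set(probes)) for term, probes in term_to_probes.items()}
    threshold = max(floor, quantile(list(sizes.values()), q))
    kept = {term: n for term, n in sizes.items() if n >= threshold}
    return threshold, kept


if __name__ == "__main__":
    # Made-up term memberships, for illustration only.
    terms = {
        "term_A": ["p1", "p2"],
        "term_B": ["p1"],
        "term_C": ["p1", "p2", "p3", "p4", "p5", "p6"],
        "term_D": [f"p{i}" for i in range(1, 13)],
    }
    threshold, kept = filter_terms(terms, q=0.5, floor=5)
    print(threshold, kept)
```

The floor matters for GO Molecular Function, where a median of 2 probes would otherwise let very small terms through; for KEGG the quantile itself is usually the larger of the two values.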
7.3 Filtered Probe Summary
Across two Affymetrix platforms (hu35ksuba and hu6800), dynamic filtering thresholds were applied using quantile metrics on GO and KEGG pathway memberships. Probe retention varied considerably by cancer type and chip.
7.4 Application of limma filtering to data set
Limma filtering was applied to the fifteen normal/tumor comparisons using a logFC threshold of 1 and a p-value cutoff of 0.05. The number of probes returned per normal/tumor comparison after limma filtering is listed in the tables that follow, grouped by cancer type.
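The selection rule itself is simple to express. The sketch below assumes that the logFC threshold is applied to the absolute value, so that both up- and down-regulated probes are kept; the source does not state this, nor whether the p-value is raw or adjusted. The field names and the example rows are hypothetical.

```python
def select_probes(results, lfc=1.0, p_cutoff=0.05):
    """Return IDs of probes with |logFC| >= lfc and p-value <= p_cutoff.

    results -- list of dicts with keys "probe", "logFC" and "p_value"
               (an assumed layout, loosely modeled on a results table)
    """
    return [
        row["probe"]
        for row in results
        if abs(row["logFC"]) >= lfc and row["p_value"] <= p_cutoff
    ]


if __name__ == "__main__":
    # Invented rows for one normal/tumor comparison.
    results = [
        {"probe": "probe_1", "logFC": 2.3, "p_value": 0.001},   # kept: up
        {"probe": "probe_2", "logFC": -1.4, "p_value": 0.020},  # kept: down
        {"probe": "probe_3", "logFC": 0.4, "p_value": 0.001},   # small change
        {"probe": "probe_4", "logFC": 3.0, "p_value": 0.200},   # not significant
    ]
    selected = select_probes(results)
    print(len(selected), selected)
```

Counting the selected probes for each comparison gives the kind of per-comparison total reported in the tables.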
7.5 Filtered Probes
7.5.1 Carcinomas
⚠️ Table not available.
7.5.2 Blastomas
⚠️ Table not available.
7.5.3 Leukemias
⚠️ Table not available.
7.5.4 Lymphomas
⚠️ Table not available.
7.5.4.1 Interpretation of Filtered Probe Counts Across Comparisons
A large number of probes retained after limma filtering in a normal vs. tumor comparison suggests a substantial degree of differential gene expression, implying that many genes meet both the fold-change and statistical significance criteria. This typically indicates a pronounced molecular divergence between normal and tumor tissues, potentially due to disrupted regulatory programs, altered signaling pathways, or extensive reprogramming of the transcriptome in tumor cells.
In contrast, comparisons yielding fewer filtered probes may reflect one or more of the following:
- Biological similarity between normal and tumor tissues (e.g., early-stage tumors or tissues with inherently low transcriptional diversity).
- Increased inter-sample variability, which can dilute statistical power.
- Platform sensitivity differences (as seen in some GPL80, i.e. hu6800, results missing certain comparisons).
- More complex expression patterns not well captured by standard differential expression filters (e.g., widespread but modest shifts rather than sharp, localized changes).
It is also notable that certain comparisons—such as PB/B-ALL and PB/T-ALL—retain an exceptionally high number of probes across both platforms. This may reflect the distinct nature of hematologic malignancies, where malignant transformation often results in global transcriptional shifts tied to lineage commitment, proliferation, or immune regulation.
Conversely, comparisons like LU/LUAD or PR/PRAD retain very few probes, potentially due to transcriptomic similarity between normal and tumor tissues in those contexts, or perhaps due to lower-quality input data, technical limitations, or biological heterogeneity.
Lastly, missing values (e.g., in lymphomas on the hu6800 platform) should not be interpreted as an absence of differential expression, but rather as platform coverage limitations—emphasizing the importance of integrating platform metadata (e.g., probe content and annotation depth) in interpretation.