25 Latent-Space Representation and Geometric Analysis
26 Purpose
This chapter defines the construction of latent-space representations and the geometric measures used to quantify cancer state structure.
It serves as the methodological foundation for the latent-space results presented in the subsequent chapter (Notebook 5–derived analyses). No biological interpretation is performed here; all quantities defined below are used later for comparison and interpretation.
27 Relationship to Prior Framework
The classical analysis pipeline operates in gene-expression space and evaluates complexity using entropy- and rank-based measures.
The latent-space framework introduces a learned representation \( Z \subset \mathbb{R}^k \), derived via a variational autoencoder (VAE), in which geometric properties of sample distributions can be analyzed.
This chapter defines:
- how \( Z \) is constructed
- how samples are grouped (normal vs tumor)
- how geometric properties of each group are quantified
These definitions are used directly in the latent complexity results.
28 Latent Representation Pipeline
28.1 Data flow
The latent representation is produced through the following stages:
- Preprocessing and normalization (R pipeline; Affymetrix RMA)
- Feature selection (high-variance probe filtering)
- Model training (VAE; Python / PyTorch)
- Embedding extraction → latent.npy
- Metadata alignment → metadata_aligned.csv
28.2 Latent data structure
After alignment:
- Each sample \( i \) is represented as a vector \( z_i \in \mathbb{R}^k \)
- Samples are labeled by:
- condition (normal vs tumor)
- tissue / cancer type
The working dataset is:
\[ \mathcal{D} = \{ (z_i, y_i, c_i) \} \]
where:
- \( y_i \in \{\text{normal}, \text{tumor}\} \)
- \( c_i \) denotes cancer type
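The working dataset described above can be sketched as a small in-memory bundle. This is a minimal illustration, assuming the latent array has shape (n, k) and that the metadata rows are already in the same order as the latent rows. The helper name `build_dataset` and the label strings `"normal"` and `"tumor"` are assumptions for illustration, not names taken from the pipeline.

```python
import numpy as np


def build_dataset(latent, condition, cancer_type):
    """Bundle latent vectors with labels after checking row alignment.

    latent      : (n, k) array, one latent vector z_i per row
    condition   : length-n sequence of "normal" / "tumor" labels (y_i)
    cancer_type : length-n sequence of cancer-type labels (c_i)
    """
    Z = np.asarray(latent, dtype=float)
    y = np.asarray(condition)
    c = np.asarray(cancer_type)
    if Z.ndim != 2:
        raise ValueError("latent must be a 2-D array of shape (n, k)")
    if not (len(Z) == len(y) == len(c)):
        raise ValueError("latent rows and metadata rows are not aligned")
    # Assumed label vocabulary; adjust if the metadata uses other strings.
    unknown = set(np.unique(y).tolist()) - {"normal", "tumor"}
    if unknown:
        raise ValueError(f"unexpected condition labels: {sorted(unknown)}")
    return {"Z": Z, "y": y, "c": c}
```

In practice the inputs would likely come from `np.load("latent.npy")` and the columns of `metadata_aligned.csv`; the column names depend on the actual metadata file.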
29 Diagnostic Procedures
These steps validate the learned representation before geometric analysis.
29.1 PCA sanity check
Principal component analysis is applied to the latent vectors to assess:
- variance concentration
- potential degeneracy
- global structure
This is a diagnostic step only and does not enter downstream metrics.
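As a rough illustration of this check, the explained-variance profile of the latent vectors can be computed directly from a singular value decomposition. The function name `pca_explained_variance` is hypothetical, and the sketch assumes `Z` is an (n, k) NumPy array with more than one sample.

```python
import numpy as np


def pca_explained_variance(Z):
    """Return the fraction of variance carried by each principal component."""
    Zc = Z - Z.mean(axis=0)  # center each latent dimension
    # Singular values of the centered matrix give the component variances.
    s = np.linalg.svd(Zc, compute_uv=False)
    var = s**2 / (len(Z) - 1)
    return var / var.sum()
```

A profile in which the first component carries nearly all of the variance would suggest a degenerate embedding, while a flat profile indicates that variance is spread across latent dimensions.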
29.2 VAE training diagnostics
Model performance is evaluated via:
- total loss
- reconstruction loss
- KL divergence
These curves are monitored to confirm stable convergence and a usable latent embedding.
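The three quantities can be illustrated with a short sketch. The exact loss used in training is not specified in this chapter, so the form below is an assumption: a squared-error reconstruction term, a diagonal Gaussian posterior compared against a standard normal prior, and a hypothetical weighting parameter `beta`. It is written in NumPy for clarity rather than as the PyTorch training code.

```python
import numpy as np


def vae_loss_terms(x, x_hat, mu, logvar, beta=1.0):
    """Batch-mean reconstruction, KL, and total loss terms.

    Assumes squared-error reconstruction and a diagonal Gaussian
    posterior N(mu, exp(logvar)) against a standard normal prior.
    """
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Closed-form KL divergence for diagonal Gaussians, summed over latent dims.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return {"recon": float(recon), "kl": float(kl), "total": float(recon + beta * kl)}
```

Tracking the three values per epoch makes it easy to spot a KL term collapsing toward zero or a reconstruction term that stops improving.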
29.3 Latent visualization with centroids
For each class \( c \), the centroid is defined as:
\[ \mu_c = \frac{1}{|c|} \sum_{i \in c} z_i \]
Centroid plots provide qualitative visualization of class separation but are not used directly as quantitative measures.
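The centroid of each class is simply the mean of its latent vectors. A minimal sketch, with the hypothetical helper name `class_centroids` and assuming `Z` is an (n, k) array with one label per row:

```python
import numpy as np


def class_centroids(Z, labels):
    """Map each class label to the mean latent vector of its samples."""
    labels = np.asarray(labels)
    return {c: Z[labels == c].mean(axis=0) for c in np.unique(labels).tolist()}
```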
30 Latent Geometry Measures
Let \( X_c \subset \mathbb{R}^k \) denote the set of latent vectors for class \( c \).
30.1 Covariance structure
\[ \Sigma_c = \operatorname{Cov}(X_c) \]
Let eigenvalues be:
\[ \lambda_1, \lambda_2, \dots, \lambda_k \]
30.2 Participation Ratio (Effective Dimensionality)
\[ \mathrm{PR}_c = \frac{\left( \sum_i \lambda_i \right)^2}{\sum_i \lambda_i^2} \]
Interpretation:
- higher PR → variance distributed across more dimensions
- lower PR → variance concentrated in fewer directions
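A short sketch of the computation, assuming the standard participation-ratio definition (squared sum of the covariance eigenvalues divided by the sum of their squares). The function names are hypothetical, and `X` is assumed to be an array of at least two samples with at least two latent dimensions.

```python
import numpy as np


def covariance_eigenvalues(X):
    """Eigenvalues of the sample covariance of X (rows = samples), descending."""
    cov = np.cov(X, rowvar=False)
    eig = np.linalg.eigvalsh(cov)  # symmetric input, ascending order
    return np.clip(eig, 0.0, None)[::-1]  # clip tiny negative round-off


def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues; ranges from 1 to k."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(eigvals.sum() ** 2 / np.sum(eigvals**2))
```

With equal eigenvalues the ratio equals the number of dimensions, and with a single non-zero eigenvalue it equals 1, which matches the interpretation given above.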
30.3 Eigenvalue Entropy
Define normalized eigenvalues:
\[ p_i = \frac{\lambda_i}{\sum_j \lambda_j} \]
Then:
\[ H_c = -\sum_i p_i \log p_i \]
Interpretation:
- higher entropy → more uniform variance distribution
- lower entropy → dominance of a few axes
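The entropy of the normalized spectrum can be sketched as follows. The natural logarithm is an assumption here, since the base is not stated; zero eigenvalues are dropped so that 0 · log 0 is treated as 0. The function name is hypothetical.

```python
import numpy as np


def eigenvalue_entropy(eigvals):
    """Shannon entropy (natural log) of the normalized eigenvalue spectrum."""
    lam = np.asarray(eigvals, dtype=float)
    p = lam / lam.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))
```

For k equal eigenvalues this returns log k, the maximum; for a spectrum with one non-zero eigenvalue it returns 0.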
30.4 Anisotropy
[ A_c = ]
Interpretation:
- high anisotropy → strong directional dominance
- low anisotropy → isotropic dispersion
30.5 Class Radius (Dispersion)
\[ R_c = \frac{1}{|c|} \sum_{i \in c} \| z_i - \mu_c \| \]
Interpretation:
- measures spread around centroid
- sensitive to global dispersion
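A minimal sketch of a dispersion measure of this kind, assuming the radius is taken as the mean Euclidean distance of a class's samples from their centroid. Other conventions (root-mean-square or maximum distance) are possible, so the choice of the mean here is an assumption, as is the function name.

```python
import numpy as np


def class_radius(X):
    """Mean Euclidean distance of the rows of X from their centroid."""
    centroid = X.mean(axis=0)
    return float(np.linalg.norm(X - centroid, axis=1).mean())
```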
30.6 Centroid Distance
For two classes \( c_1, c_2 \):
\[ D(c_1, c_2) = \| \mu_{c_1} - \mu_{c_2} \| \]
Interpretation:
- global separation between classes
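The centroid distance can be sketched in one line, assuming the Euclidean norm and that each class is passed as an array with one sample per row. The function name is hypothetical.

```python
import numpy as np


def centroid_distance(X1, X2):
    """Euclidean distance between the centroids of two sample sets."""
    return float(np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0)))
```

Because only the two means enter, this quantity says nothing about overlap between the classes; two widely dispersed classes can have distant centroids and still mix.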
30.7 Local Neighborhood Structure
Using k-nearest neighbors:
- same-class neighbor fraction
- neighborhood mixing
These capture local structure not visible in covariance measures.
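One way to compute the same-class neighbor fraction is sketched below with a brute-force distance matrix, which is adequate for a few thousand samples. The function name and the default `k=5` are assumptions, and the sketch requires `k` to be smaller than the number of samples. Neighborhood mixing could then be taken as one minus this fraction, which is also an assumption about the intended definition.

```python
import numpy as np


def same_class_neighbor_fraction(Z, labels, k=5):
    """For each sample, the fraction of its k nearest neighbors sharing its label."""
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    diff = Z[:, None, :] - Z[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    return (labels[nn] == labels[:, None]).mean(axis=1)
```

Averaging the returned values within a class gives a single purity score per class; values near 1 indicate locally homogeneous neighborhoods.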
31 Normal vs Tumor Comparisons
For each cancer type \( t \), two groups are defined:
- \( X_{t}^{(N)} \): normal samples
- \( X_{t}^{(T)} \): tumor samples
For each metric \( M \), we compute:
\[ \Delta M_t = M_t^{(T)} - M_t^{(N)} \]
These deltas form the primary inputs to the results chapter.
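The tumor-minus-normal comparison can be sketched as a loop over cancer types. The function name `metric_deltas`, the label strings `"normal"` and `"tumor"`, and the rule of skipping types with fewer than two samples in either group are assumptions for illustration; `metric` is any function that maps an array of latent vectors to a number.

```python
import numpy as np


def metric_deltas(Z, condition, cancer_type, metric):
    """Tumor-minus-normal difference of `metric` for each cancer type."""
    condition = np.asarray(condition)
    cancer_type = np.asarray(cancer_type)
    deltas = {}
    for t in np.unique(cancer_type).tolist():
        in_type = cancer_type == t
        normal = Z[in_type & (condition == "normal")]
        tumor = Z[in_type & (condition == "tumor")]
        if len(normal) < 2 or len(tumor) < 2:
            continue  # skip types lacking a usable normal or tumor group
        deltas[t] = metric(tumor) - metric(normal)
    return deltas
```

A positive delta means the metric is larger in tumor samples than in the matched normal samples for that cancer type.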
32 Output Objects for Results Layer
The following quantities are exported for downstream analysis:
- \( \mathrm{PR} \) (participation ratio)
- \( H \) (entropy)
- \( A \) (anisotropy)
- \( R \) (radius)
- centroid distances
- neighborhood statistics
These outputs are consumed directly by the latent complexity results.
33 Link to Results Chapter
The next chapter applies these definitions to each cancer type and evaluates:
- within-class geometric changes (Notebook 5)
- between-class structure (Notebook 6)
- correspondence with classical complexity measures
All interpretation, biological reasoning, and regime classification are deferred to that chapter.
34 Critical Limitations
34.1 Representation dependence
All measures depend on the learned embedding and are not direct observables.
34.2 Feature-selection mismatch
Latent-space analysis may use a different subset of genes than classical analysis.
34.3 Scale separation
Latent geometry, probe-level expression, and pathway-level results capture different aspects of system structure.
35 Summary
This chapter defines the latent-space representation and the geometric measures used to quantify cancer state structure.
These definitions provide the formal basis for the results presented in the following chapter.