25  Latent-Space Representation and Geometric Analysis

26 Purpose

This chapter defines the construction of latent-space representations and the geometric measures used to quantify cancer state structure.

It serves as the methodological foundation for the latent-space results presented in the subsequent chapter (Notebook 5–derived analyses). No biological interpretation is performed here; all quantities defined below are used later for comparison and interpretation.


27 Relationship to Prior Framework

The classical analysis pipeline operates in gene-expression space and evaluates complexity using entropy- and rank-based measures.

The latent-space framework introduces a learned representation \( Z \subset \mathbb{R}^k \), derived via a variational autoencoder (VAE), in which geometric properties of sample distributions can be analyzed.

This chapter defines:

  • how ( Z ) is constructed
  • how samples are grouped (normal vs tumor)
  • how geometric properties of each group are quantified

These definitions are used directly in the latent complexity results.


28 Latent Representation Pipeline

28.1 Data flow

The latent representation is produced through the following stages:

  1. Preprocessing and normalization (R pipeline; Affymetrix RMA)
  2. Feature selection (high-variance probe filtering)
  3. Model training (VAE; Python / PyTorch)
  4. Embedding extraction → latent.npy
  5. Metadata alignment → metadata_aligned.csv
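The final two stages can be sketched as follows. This is a minimal illustration with synthetic arrays standing in for latent.npy and metadata_aligned.csv; the variable names and grouping step are assumptions, not the pipeline's actual code.

```python
import numpy as np

# Synthetic stand-ins for the exported artifacts (latent.npy, metadata_aligned.csv):
# Z holds one k-dimensional latent vector per sample; metadata has one row per sample.
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))                          # latent.npy: (n_samples, k)
condition = np.array(["normal", "tumor", "tumor",
                      "normal", "tumor", "normal"])  # column of metadata_aligned.csv
cancer_type = np.array(["lung"] * 3 + ["colon"] * 3)

# Alignment check: every latent vector must have exactly one metadata row.
assert Z.shape[0] == condition.shape[0] == cancer_type.shape[0]

# Group latent vectors by condition, as used throughout the geometry measures.
Z_normal = Z[condition == "normal"]
Z_tumor = Z[condition == "tumor"]
print(Z_normal.shape, Z_tumor.shape)
```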

28.2 Latent data structure

After alignment:

  • Each sample ( i ) is represented as a vector \( z_i \in \mathbb{R}^k \)
  • Samples are labeled by:
    • condition (normal vs tumor)
    • tissue / cancer type

The working dataset is:

\[ \mathcal{D} = \{ (z_i, y_i, c_i) \} \]

where:

  • \( y_i \in \{\text{normal}, \text{tumor}\} \) denotes condition
  • \( c_i \) denotes cancer type


29 Diagnostic Procedures

These steps validate the learned representation before geometric analysis.

29.1 PCA sanity check

Principal component analysis is applied to the latent vectors to assess:

  • variance concentration
  • potential degeneracy
  • global structure

This is a diagnostic step only and does not enter downstream metrics.
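The PCA sanity check can be sketched as below, using a synthetic latent matrix; the eigendecomposition route and the near-zero threshold for flagging degenerate (collapsed) directions are illustrative choices, not the notebook's actual code.

```python
import numpy as np

# PCA diagnostic on a latent matrix Z: eigendecompose the sample covariance
# and inspect how variance concentrates across components.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 8))                 # synthetic latent vectors

Zc = Z - Z.mean(axis=0)                       # center the data
cov = Zc.T @ Zc / (Zc.shape[0] - 1)           # sample covariance (k, k)
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
explained = evals / evals.sum()               # variance fraction per component

# Degeneracy check: near-zero eigenvalues signal collapsed latent directions.
n_dead = int(np.sum(evals < 1e-8))
print(explained[:3], n_dead)
```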

29.2 VAE training diagnostics

Model performance is evaluated via:

  • total loss
  • reconstruction loss
  • KL divergence

These ensure stable convergence and a usable latent embedding.
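The three diagnostics can be sketched numerically as below. This assumes a diagonal-Gaussian encoder with a standard-normal prior and an MSE reconstruction term; the concrete PyTorch model may use different loss terms or weightings.

```python
import numpy as np

# VAE training diagnostics, assuming q(z|x) = N(mu, diag(exp(log_var)))
# and prior N(0, I); all tensors here are synthetic stand-ins for a batch.
rng = np.random.default_rng(2)
x = rng.normal(size=(5, 10))                   # input batch
x_hat = x + 0.1 * rng.normal(size=x.shape)     # decoder output (stand-in)
mu = rng.normal(size=(5, 3))                   # encoder means
log_var = rng.normal(size=(5, 3))              # encoder log-variances

# Reconstruction loss: per-sample squared error, averaged over the batch.
recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
# Analytic KL divergence of a diagonal Gaussian from N(0, I).
kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1))
total = recon + kl                             # total loss
print(recon, kl, total)
```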

29.3 Latent visualization with centroids

For each class ( c ), the centroid is defined as:

\[ \mu_c = \frac{1}{n_c} \sum_{i \in c} z_i \]

where ( n_c ) is the number of samples in class ( c ).

Centroid plots provide qualitative visualization of class separation but are not used directly as quantitative measures.
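The per-class centroid computation can be sketched as follows, with synthetic latent vectors and labels standing in for the real data:

```python
import numpy as np

# Per-class centroid mu_c: mean of the class's latent vectors.
rng = np.random.default_rng(3)
Z = rng.normal(size=(8, 4))                    # synthetic latent vectors
labels = np.array(["normal"] * 4 + ["tumor"] * 4)

centroids = {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}
print(centroids["normal"].shape)
```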


30 Latent Geometry Measures

Let \( X_c \subset \mathbb{R}^k \) denote the set of latent vectors for class ( c ).

30.1 Covariance structure

\[ \Sigma_c = \operatorname{Cov}(X_c) \]

Let its eigenvalues, in decreasing order, be:

\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \]
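The covariance and its eigenvalue spectrum, the shared ingredient of the spectral measures below, can be sketched as (synthetic class data):

```python
import numpy as np

# Covariance of a class's latent vectors X_c and its eigenvalue spectrum.
rng = np.random.default_rng(4)
X_c = rng.normal(size=(50, 5))                      # synthetic class samples

Sigma_c = np.cov(X_c, rowvar=False)                 # (k, k) covariance matrix
lam = np.sort(np.linalg.eigvalsh(Sigma_c))[::-1]    # lambda_1 >= ... >= lambda_k
print(lam.shape)
```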


30.2 Participation Ratio (Effective Dimensionality)

\[ \mathrm{PR}_c = \frac{\left( \sum_i \lambda_i \right)^2}{\sum_i \lambda_i^2} \]

Interpretation:

  • higher PR → variance distributed across more dimensions
  • lower PR → variance concentrated in fewer directions


30.3 Eigenvalue Entropy

Define normalized eigenvalues:

\[ p_i = \frac{\lambda_i}{\sum_j \lambda_j} \]

Then:

\[ H_c = -\sum_i p_i \log p_i \]

Interpretation:

  • higher entropy → more uniform variance distribution
  • lower entropy → dominance of a few axes
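The eigenvalue entropy follows directly from the normalized spectrum; a uniform spectrum attains the maximum \( \log k \) (function name is illustrative):

```python
import numpy as np

# Eigenvalue entropy H = -sum p_i log p_i over normalized eigenvalues p_i.
def eigenvalue_entropy(lam):
    lam = np.asarray(lam, dtype=float)
    p = lam / lam.sum()
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

print(eigenvalue_entropy([1, 1, 1, 1]))  # uniform spectrum: log(4)
```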


30.4 Anisotropy

\[ A_c = \frac{\lambda_1}{\sum_i \lambda_i} \]

Interpretation:

  • high anisotropy → strong directional dominance
  • low anisotropy → isotropic dispersion
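A sketch of the anisotropy measure, taken here as the leading eigenvalue's share of total variance; this is one common convention, and the chapter's exact normalization is an assumption:

```python
import numpy as np

# Anisotropy as the fraction of total variance along the leading axis
# (ranges from 1/k for isotropic spectra up to 1 for rank-1 spectra).
def anisotropy(lam):
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]
    return lam[0] / lam.sum()

print(anisotropy([3, 1, 1, 1]))  # 0.5: moderate directional dominance
```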


30.5 Class Radius (Dispersion)

\[ R_c = \frac{1}{n_c} \sum_{i \in c} \lVert z_i - \mu_c \rVert \]

Interpretation:

  • measures spread around centroid
  • sensitive to global dispersion
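The class radius can be sketched as below; the mean aggregation mirrors the definition above, though a max would instead give a worst-case radius (function name is illustrative):

```python
import numpy as np

# Class radius: mean Euclidean distance from each latent vector
# to the class centroid.
def class_radius(X_c):
    mu = X_c.mean(axis=0)
    return np.mean(np.linalg.norm(X_c - mu, axis=1))

# Unit-check on a square: every corner is sqrt(2) from the centroid (1, 1).
square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(class_radius(square))
```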


30.6 Centroid Distance

For two classes ( c_1, c_2 ):

\[ D(c_1, c_2) = \lVert \mu_{c_1} - \mu_{c_2} \rVert \]

Interpretation:

  • global separation between classes
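The centroid distance reduces to a norm of a difference of means (function name is illustrative):

```python
import numpy as np

# Centroid distance D(c1, c2) = || mu_{c1} - mu_{c2} || between two classes.
def centroid_distance(X1, X2):
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

A = np.array([[0.0, 0.0], [2.0, 0.0]])   # centroid (1, 0)
B = np.array([[4.0, 0.0], [6.0, 0.0]])   # centroid (5, 0)
print(centroid_distance(A, B))           # 4.0
```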


30.7 Local Neighborhood Structure

Using k-nearest neighbors:

  • same-class neighbor fraction
  • neighborhood mixing

These capture local structure not visible in covariance measures.
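The same-class neighbor fraction can be sketched with a brute-force k-nearest-neighbors search; the function name, the value of k, and the pairwise-distance implementation are illustrative choices, not the notebook's actual code:

```python
import numpy as np

# For each sample, the share of its k nearest neighbors carrying the same
# label; 1.0 means perfectly separated classes, ~0.5 means heavy mixing.
def same_class_fraction(Z, labels, k=2):
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude the sample itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbors
    same = labels[nn] == labels[:, None]
    return same.mean()

# Two well-separated 1D clusters: neighborhoods are perfectly pure.
Z = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array(["N", "N", "N", "T", "T", "T"])
print(same_class_fraction(Z, labels, k=2))  # 1.0
```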


31 Normal vs Tumor Comparisons

For each cancer type ( t ), two groups are defined:

  • ( X_{t}^{(N)} ): normal samples
  • ( X_{t}^{(T)} ): tumor samples

For each metric ( M ), we compute:

\[ \Delta M_t = M_t^{(T)} - M_t^{(N)} \]

These deltas form the primary inputs to the results chapter.
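The delta construction can be sketched end-to-end with the participation ratio as the example metric ( M ); the synthetic groups below are stand-ins, with the tumor group made deliberately anisotropic:

```python
import numpy as np

# Delta metric per cancer type: M(tumor) - M(normal), with M = participation
# ratio of the group's covariance spectrum.
def participation_ratio_of(X):
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return lam.sum() ** 2 / np.sum(lam ** 2)

rng = np.random.default_rng(5)
X_normal = rng.normal(size=(200, 4))            # ~isotropic normal group
X_tumor = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 1.0, 1.0])  # anisotropic

delta_pr = participation_ratio_of(X_tumor) - participation_ratio_of(X_normal)
print(delta_pr)  # negative: tumor variance concentrates in fewer directions
```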


32 Output Objects for Results Layer

The following quantities are exported for downstream analysis:

  • \( \mathrm{PR} \) (participation ratio)
  • ( H ) (entropy)
  • ( A ) (anisotropy)
  • ( R ) (radius)
  • centroid distances
  • neighborhood statistics

These outputs are consumed directly by the latent complexity results.


34 Critical Limitations

34.1 Representation dependence

All measures depend on the learned embedding and are not direct observables.

34.2 Feature-selection mismatch

Latent-space analysis may use a different subset of genes than classical analysis.

34.3 Scale separation

Latent geometry, probe-level expression, and pathway-level results capture different aspects of system structure.


35 Summary

This chapter defines the latent-space representation and the geometric measures used to quantify cancer state structure.

These definitions provide the formal basis for the results presented in the following chapter.