Uncertainty-Aware Generative Oversampling: Introducing LEO-CVAE

Published on arXiv | arXiv:2509.25334v3

I'm pleased to share my latest research on addressing class imbalance in clinical genomics through uncertainty-aware generative modeling. Our paper introduces LEO-CVAE (Local Entropy-Guided Oversampling with a Conditional Variational Autoencoder), bridging information theory and deep learning to tackle a fundamental challenge in biomedical machine learning.

The Problem

Class imbalance is pervasive in clinical datasets, where minority classes—patients with rare conditions or adverse outcomes—are often the most critical to identify. In clinical genomics, this challenge is amplified by two characteristics: high-dimensional, nonlinear data manifolds and ambiguous, overlapping class boundaries due to biological heterogeneity.

Traditional oversampling methods like SMOTE rely on linear interpolation, which breaks down in complex genomic feature spaces, generating biologically implausible synthetic samples. While deep generative models like CVAEs can capture nonlinear distributions, they treat all minority samples equally, missing a crucial insight: samples near class boundaries are more valuable for learning than those deep within a class's feature space.
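To make the interpolation problem concrete, here is a minimal sketch of the SMOTE-style synthesis step and why it fails on a nonlinear manifold. The function name and the circle example are illustrative, not from the paper:

```python
import numpy as np

def smote_interpolate(x_i, x_nn, rng=None):
    """One synthetic sample via linear interpolation between a minority
    sample and one of its minority-class nearest neighbors (the core
    SMOTE step). Illustrative sketch; helper name is ours."""
    rng = rng or np.random.default_rng(0)
    lam = rng.uniform(0.0, 1.0)        # random position along the segment
    return x_i + lam * (x_nn - x_i)    # always on the straight line

# On a curved manifold the straight line leaves the data distribution.
# Two valid samples on the unit circle:
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
synthetic = smote_interpolate(a, b)
print(np.linalg.norm(synthetic))  # < 1: the synthetic point falls off the circle
```

The same failure mode appears in genomic feature spaces: interpolating two plausible expression profiles can yield a profile that no biological sample would produce.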

The Solution

LEO-CVAE addresses this gap by treating local entropy as an uncertainty signal, applied in three connected ways:

1. Quantifying uncertainty locally: the entropy of class labels in each sample's neighborhood scores how ambiguous that region of feature space is, so boundary samples score high.
2. Entropy-guided training: the CVAE's training objective emphasizes high-entropy samples, pushing the model to represent ambiguous regions faithfully.
3. Entropy-guided generation: synthetic oversampling concentrates on high-entropy regions, where additional minority samples help the downstream classifier most.

Results

Evaluated on TCGA lung cancer (binary classification, 817 patients) and ADNI Alzheimer's (multiclass classification, 744 participants), LEO-CVAE consistently outperformed both traditional methods (SMOTE, Borderline-SMOTE, ADASYN) and generative baselines (Standard CVAE, CVAE with Focal Loss).

Significance

This work is the first to integrate local entropy as a direct signal guiding both training and generation in a deep generative model for imbalanced learning. The approach is particularly suited for domains with complex, nonlinear data structures and ambiguous class boundaries—characteristics common in biomedical applications.
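To illustrate how an entropy signal can guide generation, here is a hypothetical sketch of entropy-weighted seed selection: minority samples with higher local entropy are more likely to condition the generator, so synthesis concentrates near ambiguous boundaries. The weighting scheme (exponential with a temperature) is our assumption; the paper's exact mechanism may differ:

```python
import numpy as np

def entropy_weighted_seeds(entropy, n_samples, temperature=1.0, rng=None):
    """Sample indices of minority examples to condition generation on,
    with probability increasing in local entropy. Hypothetical sketch;
    the exponential/temperature weighting is our choice, not the paper's."""
    rng = rng or np.random.default_rng(0)
    w = np.exp(np.asarray(entropy, dtype=float) / temperature)
    p = w / w.sum()                                  # normalize to a distribution
    return rng.choice(len(p), size=n_samples, p=p)   # boundary samples dominate
```

With entropies (0, 0, 2) bits and temperature 1, the third sample receives roughly 79% of the draws, so most synthetic data is generated around the ambiguous region it represents.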