Uncertainty-Aware Generative Oversampling: Introducing LEO-CVAE
Published on arXiv | arXiv:2509.25334v3
I'm pleased to share my latest research on addressing class imbalance in clinical genomics through uncertainty-aware generative modeling. Our paper introduces LEO-CVAE (Local Entropy-Guided Oversampling with a Conditional Variational Autoencoder), bridging information theory and deep learning to tackle a fundamental challenge in biomedical machine learning.
The Problem
Class imbalance is pervasive in clinical datasets, where minority classes—patients with rare conditions or adverse outcomes—are often the most critical to identify. In clinical genomics, this challenge is amplified by two characteristics: high-dimensional, nonlinear data manifolds and ambiguous, overlapping class boundaries due to biological heterogeneity.
Traditional oversampling methods like SMOTE rely on linear interpolation, which breaks down in complex genomic feature spaces, generating biologically implausible synthetic samples. While deep generative models like CVAEs can capture nonlinear distributions, they treat all minority samples equally, missing a crucial insight: samples near class boundaries are more valuable for learning than those deep within a class's feature space.
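SMOTE's core step, the linear interpolation that the paragraph above refers to, can be sketched in a few lines (a minimal illustration for intuition, not the imbalanced-learn implementation; the function name is ours):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Generate one synthetic sample on the line segment between a
    minority sample and one of its k-nearest minority-class neighbors."""
    lam = rng.uniform(0.0, 1.0)       # random interpolation factor in [0, 1]
    return x + lam * (neighbor - x)   # convex combination of the pair

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_interpolate(x, neighbor, rng)
# synthetic lies on the straight line between x and neighbor
```

Because every synthetic point falls on a straight line between two real samples, the method implicitly assumes the minority class occupies a locally linear region, which is exactly the assumption that fails on curved, high-dimensional genomic manifolds.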
The Solution
LEO-CVAE addresses this gap through three innovations:
- Local Entropy Score (LES): We quantify sample-level uncertainty as the Shannon entropy of the class-label distribution within each sample's k-nearest neighborhood, formally identifying "hard-to-learn" regions where classes overlap.
- Local Entropy-Weighted Loss (LEWL): We modify the CVAE training objective to prioritize learning in high-entropy regions, compelling the model to focus on contested decision boundaries.
- Entropy-Guided Sampling: During generation, we preferentially select high-entropy samples as seeds, concentrating synthetic data creation in the most informative regions.
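The Local Entropy Score that drives all three components can be sketched as follows, a minimal illustration assuming Euclidean k-NN and the standard Shannon-entropy formula; the function name and neighborhood handling are our assumptions, not the released code:

```python
import numpy as np

def local_entropy_scores(X, y, k=5):
    """Shannon entropy of the class labels in each sample's
    k-nearest neighborhood (the sample itself is excluded)."""
    n = X.shape[0]
    classes = np.unique(y)
    scores = np.zeros(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the sample itself
        nn = np.argsort(d)[:k]               # indices of k nearest neighbors
        labels = y[nn]
        p = np.array([(labels == c).mean() for c in classes])
        p = p[p > 0]                         # drop zero-probability classes
        scores[i] = -(p * np.log(p)).sum()   # Shannon entropy (nats)
    return scores
```

A score of zero means the neighborhood is class-pure; it is maximized where classes mix. In the LEO-CVAE pipeline, these scores would up-weight the training loss for high-entropy samples and bias seed selection toward them during generation.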
Results
Evaluated on TCGA lung cancer (binary classification, 817 patients) and ADNI Alzheimer's (multiclass classification, 744 participants), LEO-CVAE consistently outperformed both traditional methods (SMOTE, Borderline-SMOTE, ADASYN) and generative baselines (Standard CVAE, CVAE with Focal Loss).
- On TCGA: the highest AUC-ROC (0.661 ± 0.030) and AUPRC (0.889 ± 0.021)
- On ADNI: the highest macro-averaged metrics, demonstrating balanced performance across all classes
Significance
This work is the first to integrate local entropy as a direct signal guiding both training and generation in a deep generative model for imbalanced learning. The approach is particularly suited for domains with complex, nonlinear data structures and ambiguous class boundaries—characteristics common in biomedical applications.
Access the Work
Read the full paper: arXiv:2509.25334v3
Access the code: github.com/Amirhossein-Zare/LEO-CVAE
Zare, A., Zare, A., Pezeshki, P.S., Rahimi, H., Ebrahimi, A., Vazquez-García, I., & Celi, L.A. (2025). Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder. arXiv preprint arXiv:2509.25334v3.