Authors:
Aparna Shrivatsava, P. Raghu Vamsi
DOI NO:
https://doi.org/10.26782/jmcms.2026.06.00006
Keywords:
Classification, Imbalanced dataset, Logistic Regression, Machine Learning, Mahalanobis distance, Multivariate datasets
Abstract:
Imbalanced datasets pose significant challenges for standard classification algorithms and often lead to biased models that perform poorly on minority classes. To address this, this study proposes a framework that combines adaptive data transformation, Multi-class Mahalanobis Distance (MMD) metric learning, and logistic regression (LR). The MMD transformation reshapes the feature space using a shared covariance matrix, pulling similar data points closer together and increasing class separability. On various benchmark imbalanced datasets, the proposed MMD(LR) method significantly outperformed existing techniques such as Local Mahalanobis Distance Learning (LMDL), with average gains of 6.70% in precision, 7.16% in F1-score, and 14.10% in Area Under the Curve (AUC). Notably, the model achieved 100% AUC on the Wine and Iris datasets and 99.69% AUC on the Breast Cancer dataset. These results indicate that the method is robust and adaptable when classifying complex, imbalanced data.
References:
I. Aeberhard, S., and M. Forina. Wine Dataset. UCI Machine Learning Repository, 1992. 10.24432/C5PC7J.
II. Altalhan, M., A. Algarni, and M. Turki-Hadj Alouane. “Imbalanced Data Problem in Machine Learning: A Review.” IEEE Access, 2025. 10.1109/access.2025.3531662.
III. Shrivastava, A., and P. R. Vamsi. “Improving Anomaly Classification using Combined Data Transformation and Machine Learning Methods.” International Journal of Performability Engineering, vol. 20, 2024, p. 68. 10.23940/ijpe.24.02.p2.6880.
IV. Ashoorzadeh, A., A. T. Eshlaghy, and M. A. Kazemi. “A Novel Classification Method: A Hybrid Approach Based on Large Margin Nearest Neighbor Classifier.” Journal of Computational Robotics, vol. 17, 2024, pp. 17–34.
V. Blum, L., M. Elgendi, and C. Menon. “Impact of Box-Cox Transformation on Machine-Learning Algorithms.” Frontiers in Artificial Intelligence, vol. 5, 2022. 10.3389/frai.2022.877569.
VI. Boudt, K., P. J. Rousseeuw, S. Vanduffel, and T. Verdonck. “The Minimum Regularized Covariance Determinant Estimator.” Statistics and Computing, vol. 30, 2019, pp. 113–128. 10.1007/s11222-019-09869-x.
VII. Chen, Y., A. Wiesel, Y. C. Eldar, and A. O. Hero. “Shrinkage Algorithms for MMSE Covariance Estimation.” IEEE Transactions on Signal Processing, vol. 58, 2010, pp. 5016–5029. 10.1109/TSP.2010.2053029.
VIII. Dai, D., J. Pan, and Y. Liang. “Regularized Estimation of the Mahalanobis Distance Based on Modified Cholesky Decomposition.” Communications in Statistics: Case Studies, Data Analysis and Applications, vol. 8, 2022, pp. 559–573. 10.1080/23737484.2022.2107961.
IX. de Amorim, L. B., G. D. Cavalcanti, and R. M. Cruz. “The Choice of Scaling Technique Matters for Classification Performance.” Applied Soft Computing, vol. 133, 2023, article 109924.
X. de la Cruz Huayanay, A., J. L. Bazán, and C. M. Russo. “Performance of Evaluation Metrics for Classification in Imbalanced Data.” Computational Statistics, vol. 40, 2025, pp. 1447–1473. 10.1007/s00180-024-01539-5.
XI. de Vazelhes, W., C. J. Carey, Y. Tang, N. Vauquier, and A. Bellet. “metric-learn: Metric Learning Algorithms in Python.” Journal of Machine Learning Research, vol. 21, 2020, pp. 1–6.
XII. Elmi, J., M. Eftekhari, A. Mehrpooya, and M. R. Ravari. “A Novel Framework Based on the Multi Label Classification for Dynamic Selection of Classifiers.” International Journal of Machine Learning and Cybernetics, vol. 14, 2023, pp. 2137–2154. 10.1007/s13042-022-01751-z.
XIII. Fisher, R. A. Iris Dataset. UCI Machine Learning Repository, 1936. 10.24432/C56C76.
XIV. Gao, X., D. Xie, Y. Zhang, et al. “A Comprehensive Survey on Imbalanced Data Learning.” arXiv, 2025, arXiv:2502.08960.
XV. German, B. Glass Identification Dataset. UCI Machine Learning Repository, 1987. 10.24432/C5WW2P.
XVI. Kahn, M. Diabetes Dataset. UCI Machine Learning Repository, 1990. 10.24432/C5T59G.
XVII. Kamoi, R., and K. Kobayashi. “Why is the Mahalanobis Distance Effective for Anomaly Detection?” arXiv, 2020.
XVIII. Nakai, K. Yeast Dataset. UCI Machine Learning Repository, 1991. 10.24432/C5KG68.
XIX. Nakai, K. Ecoli Dataset. UCI Machine Learning Repository, 1996. 10.24432/C5388M.
XX. Qiao, S., et al. “LMNNB: Two-in-One Imbalanced Classification Approach by Combining Metric Learning and Ensemble Learning.” 10.1007/s10489-021-02901-6.
XXI. Safi, S. K., and S. Gul. “An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance.” Mathematics, vol. 12, 2024, article 3243. 10.3390/math12203243.
XXII. Saran, N. A., and F. Nar. “Fast Binary Logistic Regression.” PeerJ Computer Science, vol. 11, 2025, e2579. 10.7717/peerj-cs.2579.
XXIII. Siddappa, N. G., and T. Kampalappa. “Imbalance Data Classification Using Local Mahalanobis Distance Learning Based on Nearest Neighbor.” SN Computer Science, vol. 1, 2020, article 76. 10.1007/s42979-020-0085-x.
XXIV. Suárez, J. L., S. García, and F. Herrera. “A Tutorial on Distance Metric Learning: Mathematical Foundations, Algorithms, Experimental Analysis, Prospects and Challenges.” Neurocomputing, vol. 425, 2021, pp. 300–322. 10.1016/j.neucom.2020.08.017.
XXV. Sun, Y., A. K. Wong, and M. S. Kamel. “Classification of Imbalanced Data: A Review.” International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, 2009, pp. 687–719. 10.1142/S0218001409007326.
XXVI. Thakkar, B. “Continuous Variable Analyses: Student’s t-test, Mann–Whitney U Test, Wilcoxon Signed-Rank Test.” Translational Cardiology, 2025, pp. 165–167. 10.1016/B978-0-323-91790-2.00052-6.
XXVII. Wang, L., M. Han, X. Li, N. Zhang, and H. Cheng. “Review of Classification Methods on Unbalanced Data Sets.” IEEE Access, vol. 9, 2021, pp. 62719–62744. 10.1109/ACCESS.2021.3074243.
XXVIII. Wilson, E. B. “The Distribution of Chi-Square.” Proceedings of the National Academy of Sciences, vol. 17, no. 12, 1931, pp. 684–688.
XXIX. Wolberg, W. Breast Cancer Wisconsin (Original) Dataset. UCI Machine Learning Repository, 1990. 10.24432/C5HP4Z.
XXX. Zhu, B., X. Jing, L. Qiu, and R. Li. “An Imbalanced Data Classification Method Based on Hybrid Resampling and Fine Cost Sensitive Support Vector Machine.” Computers, Materials & Continua, vol. 79, 2024, pp. 3979–3997. 10.32604/cmc.2024.048062.
XXXI. Hu, Lianyu, et al. “Interpretable Clustering: A Survey.” ACM Computing Surveys, vol. 58, no. 8, 2026, pp. 1–21. 10.1145/3789495.
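
Illustrative sketch:
The core idea in the abstract, a Mahalanobis transformation built from a shared covariance matrix followed by logistic regression, can be sketched in a few lines of plain Python. This is not the authors' implementation. The abstract does not say how the shared covariance is estimated, so the pooled within-class estimate, the small ridge term, the Cholesky-based whitening, and the gradient-descent settings below are all assumptions. The adaptive data transformation step is omitted, the example is binary rather than multi-class, and the toy data is made up for illustration.

```python
import math

def pooled_covariance(X, y, ridge=1e-6):
    """Shared (pooled within-class) covariance; ridge keeps it positive definite."""
    d, classes = len(X[0]), sorted(set(y))
    S = [[0.0] * d for _ in range(d)]
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        for x in rows:
            diff = [xi - mi for xi, mi in zip(x, mean)]
            for i in range(d):
                for j in range(d):
                    S[i][j] += diff[i] * diff[j]
    dof = len(X) - len(classes)
    return [[S[i][j] / dof + (ridge if i == j else 0.0) for j in range(d)]
            for i in range(d)]

def cholesky(A):
    """Lower-triangular L such that L L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def whiten(x, L):
    """Solve L z = x, so Euclidean distance between z's is Mahalanobis distance."""
    z = []
    for i in range(len(x)):
        z.append((x[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def sigmoid(s):
    # Clamp the logit to avoid overflow in math.exp.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, s))))

def fit_logistic(Z, y, lr=0.03, epochs=3000):
    """Binary logistic regression trained by plain batch gradient descent."""
    d, n = len(Z[0]), len(Z)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            gw = [g + err * zi for g, zi in zip(gw, z)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(Z, w, b):
    return [1 if sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) >= 0.5 else 0
            for z in Z]

# Hypothetical imbalanced toy data: 8 majority points (class 0), 3 minority (class 1).
# The two features are strongly correlated and on different scales.
X = [[1.0, 10.0], [2.0, 21.0], [3.0, 29.0], [4.0, 41.0],
     [5.0, 50.0], [6.0, 61.0], [7.0, 69.0], [8.0, 80.0],
     [2.0, 40.0], [3.0, 52.0], [4.0, 59.0]]
y = [0] * 8 + [1] * 3

L = cholesky(pooled_covariance(X, y))
Z = [whiten(x, L) for x in X]      # Mahalanobis-transformed features
w, b = fit_logistic(Z, y)
preds = predict(Z, w, b)
print(preds)
```

In the transformed space the pooled within-class covariance is approximately the identity matrix, so the direction that separates the two classes is no longer dominated by the shared correlation between the features. A production version would more likely rely on an established covariance estimator and logistic regression implementation than on these hand-written routines.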

