APPLICATION OF MACHINE LEARNING ALGORITHMS FOR HEART DISEASE PREDICTION

Anna I. Pavlova

doi:10.12731/2658-6649-2023-15-3-475-496

Anna I. Pavlova, Novosibirsk State University of Economics and Management https://orcid.org/0000-0001-6159-1439


Keywords: cardiovascular diseases, machine learning algorithms, machine learning model, prediction

Abstract

This work addresses the application of machine learning algorithms to the prediction of cardiovascular diseases (CVD). According to the World Health Organization, CVD is the leading cause of death worldwide. One of the necessary preventive measures for reducing CVD mortality is the timely prediction of disease in people at high risk.

Background. Purpose-built risk scales and machine learning algorithms are currently used for the timely prediction of CVD. Algorithms often applied to heart disease prediction include the naive Bayes classifier (NBC), k-nearest neighbors (KNN), and decision trees (DT). Russian-language publications also describe CVD prediction with a deep neural network trained using the Adam gradient-based optimizer. One necessary condition for improving the predictive ability of a machine learning model (MLM) is the optimal selection of hyperparameters. In practice, MLM hyperparameters are often chosen on the basis of empirical experience.

Purpose. To study the specifics of applying machine learning algorithms to heart disease prediction.

Materials and methods. Scientific novelty. The study analyzes machine learning algorithms for predicting the risk of CVD using automatic search for MLM hyperparameters. The following algorithms were used to build the models: NBC, KNN, DT, logistic regression, support vector machine (SVM), random forest (RF), the complement naive Bayes classifier (CNBC, an adaptation of the multinomial naive Bayes classifier), linear discriminant analysis (LDA), and gradient boosting (XGBoost).
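The abstract does not say which search procedure was used, so the following is only a minimal sketch of one common form of automatic hyperparameter search: an exhaustive grid search scored by k-fold cross-validation. It tunes a single hyperparameter, the number of neighbors k in KNN. The function names (`knn_predict`, `cv_accuracy`, `grid_search`), the candidate grid, and the synthetic two-feature data are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    # Squared Euclidean distance is sufficient for ranking neighbours.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of k-NN over n_folds interleaved cross-validation folds."""
    fold_scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / n_folds


def grid_search(X, y, k_grid, n_folds=5):
    """Return (best_k, best_score, all_scores) over the candidate grid."""
    scores = {k: cv_accuracy(X, y, k, n_folds) for k in k_grid}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k], scores


# Synthetic data standing in for patient records (NOT the paper's dataset).
rng = random.Random(0)
X, y = [], []
for label, centre in ((0, 0.0), (1, 1.5)):
    for _ in range(60):
        X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        y.append(label)

best_k, best_score, scores = grid_search(X, y, k_grid=[1, 3, 5, 7, 9, 11])
print("best k:", best_k, "cross-validated accuracy:", round(best_score, 3))
```

The same loop extends to several hyperparameters by iterating over the Cartesian product of their candidate values; library implementations such as scikit-learn's `GridSearchCV` follow this pattern.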

The following metrics were used to assess model accuracy: mean absolute error (MAE), precision, recall, F-measure, false positive rate (FPR), and false negative rate (FNR). In addition, the results were analyzed by visual inspection of the receiver operating characteristic (ROC) curve and by the area under the ROC curve (AUC). The AUC value makes it possible to assess the predictive capability of an MLM.
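As an illustration of how the listed metrics relate to each other, the sketch below computes them from the confusion-matrix counts of a binary classifier and computes AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the helper names are hypothetical. The code assumes both classes occur and at least one example is predicted positive, so no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means 'disease'."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),   # healthy cases flagged as diseased
        "fnr": fn / (fn + tp),   # diseased cases that were missed
        # For 0/1 labels MAE equals the misclassification rate.
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true),
    }


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)
print("AUC:", auc)
```

Note that FNR is the complement of recall (FNR = 1 - recall), and that AUC does not depend on the chosen threshold, which is why it is useful for comparing models.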

Results. Training results showed that the RF and XGBoost algorithms achieve higher accuracy than the other algorithms. With optimally selected model parameters, the overall classification accuracy was 0.88 and 0.94, respectively.

Conclusion. Machine learning algorithms make it possible to build predictive models with high accuracy. The ensemble algorithms RF and XGBoost are more accurate than decision trees, Bayesian classification methods, logistic regression, and linear discriminant analysis.
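One textbook intuition for why voting ensembles such as random forests can outperform a single tree is sketched below. It is not an analysis from the paper. It assumes an odd number of base classifiers that err independently, each correct with the same probability `p`; real trees in a forest are correlated, so the gain in practice is smaller. The function name and the value p = 0.7 are hypothetical, and the argument does not describe boosting methods such as XGBoost, which combine models sequentially rather than by majority vote.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct,
    when each voter is correct with probability p (n assumed odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Accuracy of the vote grows with ensemble size when p > 0.5.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```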


Author Biography

Anna I. Pavlova, Novosibirsk State University of Economics and Management

Candidate of Technical Sciences, Associate Professor


