A Catalogue of Machine Learning Algorithms for Healthcare Risk Predictions
Extracting useful knowledge through proper data analysis is a very
challenging task for efficient and timely decision-making. To achieve this, a
plethora of machine learning (ML)
algorithms exists, while, especially in healthcare, this
complexity increases due to the domain’s requirements
for analytics-based risk predictions. This manuscript
proposes a data analysis mechanism, experimented with in diverse
healthcare scenarios, towards constructing a catalogue of the
most efficient ML algorithms to be used depending on the healthcare scenario’s requirements and datasets, for efficiently predicting the onset of a disease. In this context, seven (7) different ML algorithms (Naïve Bayes, K-Nearest Neighbors, Decision Tree, Logistic Regression, Random Forest, Neural Networks, Stochastic Gradient Descent) have been executed on top of diverse healthcare scenarios (stroke, COVID-19, diabetes, breast cancer, kidney disease, heart failure). Based on a variety of performance metrics (accuracy, recall, precision,
F1-score, specificity, confusion matrix), it has been
identified that a subset of ML algorithms is more efficient
for timely predictions under specific healthcare scenarios,
and that is why the envisioned ML catalogue prioritizes the ML
algorithms to be used, depending on the scenarios’ nature and needed
metrics. Further evaluation must be performed considering additional
scenarios, involving state-of-the-art techniques (e.g., cloud deployment,
federated ML) for improving the mechanism’s efficiency.
Argyro Mavrogiorgou, Athanasios Kiourtis, Spyridon Kleftakis, Konstantinos Mavrogiorgos, Nikolaos Zafeiropoulos, Dimosthenis Kyriazis
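A minimal sketch, not the paper's actual experiment, of how the listed algorithms can be benchmarked on one of the mentioned scenarios (breast cancer) using scikit-learn; the dataset, split, and hyperparameters here are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# One of the scenarios named in the abstract (breast cancer)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
    "Stochastic Gradient Descent": SGDClassifier(random_state=42),
}

# Score every model on several of the metrics the catalogue uses
scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

# Rank the algorithms for this scenario, e.g. by F1-score
ranking = sorted(scores, key=lambda m: scores[m]["f1"], reverse=True)
```

Repeating the same loop over each scenario's dataset, and ranking by whichever metric the scenario prioritizes, would yield one catalogue entry per scenario.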
A Comparative Study of Monolithic and Microservices Architectures in Machine Learning Scenarios
Choosing the most suitable architecture for applications is not an easy decision. While the software giants
have almost all put the microservices architecture in place, for
smaller platforms such a decision is not so obvious. In the healthcare
domain and specifically when accomplishing Machine Learning (ML) tasks in this domain,
considering its special characteristics, the decision should be made based on specific
metrics. In the context of the beHEALTHIER platform, a platform that is able to handle
heterogeneous healthcare data towards their successful management and analysis by applying
various ML tasks, this research gap was investigated. An experiment was conducted
by installing the platform in three (3) different architectural ways, referring to the monolithic
architecture, the clustered microservices architecture exploiting Docker Compose, and the microservices
architecture exploiting a Kubernetes cluster. For these three (3) environments, time-based measurements were
made for each Application Programming Interface (API) of the diverse platform’s functionalities (i.e., components)
and useful conclusions were drawn towards the adoption of the most suitable software architecture.
Spyridon Kleftakis, Argyro Mavrogiorgou, Nikolaos Zafeiropoulos, Konstantinos Mavrogiorgos, Athanasios Kiourtis, Dimosthenis Kyriazis
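A minimal sketch of the kind of time-based measurement described, not the paper's harness: the same API call is timed repeatedly per deployment. The handler below is a stand-in for a real HTTP request:

```python
import statistics
import time

def call_api(payload):
    # Stand-in for an HTTP request to one platform component's API;
    # in the real experiment this would hit the monolith, the Docker
    # Compose cluster, or the Kubernetes cluster in turn.
    return {"status": 200, "echo": payload}

def measure(endpoint, payload, repetitions=50):
    # Time repeated calls and summarize, as a per-API measurement
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        endpoint(payload)
        timings.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(timings),
            "stdev_s": statistics.stdev(timings)}

result = measure(call_api, {"patient_id": 42})
```

Running `measure` against every component's API in each of the three environments yields the comparable per-architecture timings the study draws its conclusions from.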
Automated Rule-Based Data Cleaning Using NLP
Data Cleaning is a subfield of Data Mining that has been thriving in recent years. Ensuring the reliability
of data, either when generated or received,
is of vital importance to provide the best services possible to users.
Accomplishing the aforementioned task is easier said than done, since data are complex,
generated at an extremely high rate and are of enormous size.
A variety of techniques and methods from other subfields of
Computer Science have been employed to make Data Cleaning as efficient
and effective as possible. Those subfields include, among others, Natural Language Processing (NLP),
which in essence refers to the interaction among computers and human language, seeking to find a
way to program computers to be able to process and analyze huge volumes of human language data. NLP
is a long-standing concept, but, as time goes by, it is increasingly applied
to a variety of concepts that are not solely NLP-related. In this paper, a rule-based data cleaning
mechanism is proposed, which utilizes NLP to ensure data reliability. Making use of NLP enabled the
mechanism not only to be extremely effective but also to be a lot more efficient compared to other
corresponding mechanisms that do not utilize NLP. The mechanism was evaluated on diverse healthcare
datasets; it is not, however, limited to the healthcare domain, but supports a generalized data cleaning concept.
Konstantinos Mavrogiorgos, Argyro Mavrogiorgou, Athanasios Kiourtis, Nikolaos Zafeiropoulos, Spyridon Kleftakis, Dimosthenis Kyriazis
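An illustrative rule-based cleaner, not the paper's mechanism: each rule pairs a regular-expression check with a repair action, in the spirit of applying language-aware rules to incoming records. The rules and the record are assumptions for the sketch:

```python
import re

RULES = [
    # (description, pattern, replacement)
    ("collapse repeated whitespace", re.compile(r"\s+"), " "),
    ("strip non-printable characters", re.compile(r"[^\x20-\x7E]"), ""),
    ("normalize decimal commas in numbers", re.compile(r"(?<=\d),(?=\d)"), "."),
]

def clean_value(value: str) -> str:
    # Apply every rule in order, then trim edges
    for _desc, pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return value.strip()

record = {"age": "42", "weight": "80,5\u00a0", "note": "stable   condition"}
cleaned = {k: clean_value(v) for k, v in record.items()}
# cleaned["weight"] == "80.5", cleaned["note"] == "stable condition"
```

A production mechanism would of course carry a far richer rule set and language models rather than three regexes; the point is the rule-per-defect structure.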
A Comparative Study of Collaborative Filtering in Product Recommendation
Product recommendation is considered a well-known technique for bringing customers and products together.
With applications in music, electronic shops, and almost any platform users
deal with daily, a
recommendation system’s sole aim is to help customers, and attract new ones, to
discover new products. Through product recommendation, transaction costs can also
be decreased, improving overall decision-making and quality. To perform recommendations,
a recommendation system
must utilize customer feedback, such as habits, interests, prior transactions
as well as information used in customer profiling, and finally deliver suggestions.
Hence, data is the key factor in choosing the appropriate recommendation method and
drawing specific suggestions. This research investigates the data challenges of
recommendation systems, specifying collaborative-based, content-based, and hybrid-based
recommendations. In this context, collaborative filtering is being explored, with the
Surprise library and LightFM embeddings being analysed and compared on top of foodservice
transactional data. The involved algorithms’ metrics are being identified and parameterized,
while hyperparameters are being tuned properly on top of this transactional data, concluding
that LightFM provides more efficient recommendation results following the evaluation’s
precision and recall outcomes. Nevertheless, even though it is outperformed,
the Surprise library should be used when constructing user-friendly models,
requiring low code and low effort.
Agori Argyro Patoulia, Athanasios Kiourtis, Argyro Mavrogiorgou, Dimosthenis Kyriazis
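A minimal user-based collaborative-filtering sketch in plain Python (neither Surprise nor LightFM, which the study actually compares): cosine similarity between users' purchase vectors, with unseen items scored by similarity-weighted ratings. The transaction data is illustrative:

```python
import math

# user -> {item: implicit rating derived from transactions}
transactions = {
    "u1": {"espresso": 3, "croissant": 1},
    "u2": {"espresso": 2, "bagel": 4},
    "u3": {"croissant": 2, "bagel": 1},
}

def cosine(a, b):
    # Cosine similarity between two sparse rating vectors
    shared = set(a) & set(b)
    num = sum(a[i] * b[i] for i in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def recommend(user, k=1):
    # Score items the user has not bought, weighted by user similarity
    seen = set(transactions[user])
    scores = {}
    for other, basket in transactions.items():
        if other == user:
            continue
        sim = cosine(transactions[user], basket)
        for item, rating in basket.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))  # → ['bagel']
```

Surprise and LightFM replace this hand-rolled neighborhood scheme with matrix-factorization and embedding models, but the input shape (user-item interactions) and the output (a ranked list of unseen items) are the same.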
Interpretable Stroke Risk Prediction Using Machine Learning Algorithms
Stroke is the second most common cause of
death globally according to the World Health
Organization (WHO). Information Technology (IT), and
especially Machine Learning (ML), may be beneficial and useful in
many aspects of stroke management. However, the majority of
the existing studies focus on the development of ML models for
confronting such cases without checking the degree of confidence
and reliability of the constructed models. To strengthen models’
performance, diverse metric functions have to be estimated, while also identifying
the most important features of the underlying datasets. Thus, this paper studies
whether the results from diverse ML models are true and realistic or not, based on diverse
metric functions to verify that they extract efficient and reliable results. With this in mind,
a plethora of models are built to predict the likelihood of stroke, referring to Support Vector Classifier,
K-Nearest Neighbors, Logistic Regression, Random Forest, XGB Classifier, and LGBM Classifier.
All the captured results are compared based on the chosen metric functions, concluding on the
most suitable and accurate model for stroke prediction.
Nikolaos Zafeiropoulos, Argyro Mavrogiorgou, Spyridon Kleftakis, Konstantinos Mavrogiorgos, Athanasios Kiourtis, Dimosthenis Kyriazis
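A minimal sketch of the metric functions such reliability checks rest on, computed from a binary confusion matrix; the counts below are illustrative, not results from the study:

```python
def metrics(tp, fp, fn, tn):
    # Derive the standard binary-classification metrics from the
    # four confusion-matrix cells, guarding against division by zero
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

m = metrics(tp=40, fp=10, fn=5, tn=45)
# e.g. m["accuracy"] == 0.85 and m["precision"] == 0.8
```

Comparing models on several of these at once, rather than accuracy alone, is what exposes a model that looks strong but, for instance, rarely detects the positive (stroke) class.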
A Comparative Study of ML Algorithms for Scenario-agnostic Predictions in Healthcare
The extraction of useful knowledge from collected data has
always been the holy grail for enterprises and researchers, supporting
efficient decision making,
optimization, and profit maximization. However, this
task is easier said than done, since it presupposes
the application of complex mathematical models/algorithms.
Data Analysis has prospered due to the continuous demand to
simplify and optimize the knowledge extraction process. Several
mechanisms in different domains have been developed, consisting
of various techniques to analyze specific data. The need for such
mechanisms is even greater in healthcare, since there exist data of
different complexity that may provide highly valuable knowledge, if
properly analyzed. Considering these challenges, this paper proposes a
mechanism for performing Data Analysis on healthcare data from diverse
scenarios to extract valuable insights. The mechanism can collect data and apply
several Machine Learning algorithms to ensure the best possible prediction of
certain features of the provided data.
Argyro Mavrogiorgou, Spyridon Kleftakis, Nikolaos Zafeiropoulos, Konstantinos Mavrogiorgos, Athanasios Kiourtis, Dimosthenis Kyriazis
Digital Twin in Healthcare Through the Eyes of the Vitruvian Man
In recent years, worldwide, with the development of technology, a huge amount of data is collected in Electronic Health Records (EHRs). Although vast progress has been made with the use of
artificial intelligence in various areas of the health domain and for specific problems, it is a fact that to date there is no holistic approach to a patient’s state of health using these technologies. Digital Twin refers to a complete physical and functional description of an item, product, or system, which includes virtually all the information that could be useful in all current and next life cycle phases. This paper presents a platform that uses state-of-the-art technologies such as Microservice Architecture (MSA), containerization (Docker), orchestration (Kubernetes) and Machine Learning Operations (MLOps), and is inspired by Leonardo da Vinci’s Vitruvian Man, building the Digital Twin of Patient platform. To achieve that, the platform’s architecture is designed with multiple clusters of Docker containers and Kubernetes orchestration. Specific parts or organs of the human body are represented by clusters called “digital_twin_components” (DTCs). The set of those DTCs structures the “patient_digital_twin” cluster, in which appropriate pipelines define and monitor in real time the “best” possible construction of the patient’s digital twin.
Spyridon Kleftakis, Argyro Mavrogiorgou, Konstantinos Mavrogiorgos, Athanasios Kiourtis, Dimosthenis Kyriazis
A Multi-layer Approach for Data Cleaning in the Healthcare Domain
It is an undeniable fact that nowadays there exists a plethora of sources that can generate data of a complex and, most of the time, error-prone nature, as well as of multiple origins.
Those sources may be of different complexity, but most of them share a common
characteristic: the lack of performing quality checks on the collected data. The aforementioned implies that, in every platform that utilizes data
originating from those sources, there should be a mechanism that is responsible for ensuring the reliability of the collected data, thus providing
to the rest of the platform's mechanisms (e.g., risk analysis and prediction mechanisms) data of high quality that could lead to the best knowledge
extraction possible for decision making.
Konstantinos Mavrogiorgos, Athanasios Kiourtis, Argyro Mavrogiorgou, Spyridon Kleftakis, Dimosthenis Kyriazis
A Comparative Study of MongoDB, ArangoDB and CouchDB for Big Data Storage
A distinctive aspect of the current era is the staggering amount of data that is generated and processed on a daily basis. It is no wonder that this epoch is generally characterized as
the “Era of Big Data”. Thus, many enterprises and research initiatives strive to find a way to effectively and efficiently collect, store and analyze Big Data in order to improve their services and make efficient decisions. Those approaches refer to several domains such as healthcare, transportation, governance, or insurance. Towards this direction, in this paper we contribute to the selection of the most appropriate database for efficiently storing and retrieving Big Data. More specifically, taking into account the nature of Big Data and the main categories of databases that currently exist, three (3) NoSQL document-based databases were considered for this comparative study, namely ArangoDB, MongoDB and CouchDB. The performance of these databases was measured based on specific metrics and criteria, including the total execution time for the same CRUD operations and their corresponding resource demands, concluding on the most suitable database for storing Big Data.
Konstantinos Mavrogiorgos, Athanasios Kiourtis, Argyro Mavrogiorgou, Dimosthenis Kyriazis
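A minimal sketch of the comparison methodology, not the study's benchmark: time the same CRUD sequence against each candidate store. An in-memory dict stands in here for the MongoDB, ArangoDB, and CouchDB clients so the sketch is self-contained:

```python
import time

class InMemoryStore:
    # Stand-in for a document-database client exposing CRUD operations
    def __init__(self):
        self.docs = {}
    def create(self, doc_id, doc):
        self.docs[doc_id] = doc
    def read(self, doc_id):
        return self.docs.get(doc_id)
    def update(self, doc_id, fields):
        self.docs[doc_id].update(fields)
    def delete(self, doc_id):
        del self.docs[doc_id]

def time_crud(store, n=1000):
    # Total execution time for n identical create/read/update/delete cycles
    start = time.perf_counter()
    for i in range(n):
        store.create(i, {"value": i})
        store.read(i)
        store.update(i, {"value": i + 1})
        store.delete(i)
    return time.perf_counter() - start

elapsed = time_crud(InMemoryStore())
```

Pointing `time_crud` at real driver wrappers with the same four methods, and recording resource usage alongside the elapsed time, reproduces the shape of the comparison the paper describes.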
beHEALTHIER: A Microservices Platform for Analyzing and Exploiting Healthcare Data
The era of big data is surrounded by plenty of challenges, concerning aspects related to data quality, data management, and data analysis. Plenty of
these challenges are met in several domains, such as the healthcare domain,
where the corresponding healthcare platforms not only have to deal with managing and/or analyzing a tremendous
quantity of health data, but also have to accomplish these actions in the most efficient and secure way possible.
Towards this direction, medical institutions are paying attention to the replacement of traditional approaches
such as the Monolithic and Service Oriented Architecture (SOA), which face many difficulties in handling
the increasing amount of healthcare data. This paper presents a platform for overcoming these issues,
by adopting the Microservice Architecture (MSA), being able to efficiently manage and analyze these vast
amounts of data. More specifically, the proposed platform, namely beHEALTHIER, offers the ability to
construct health policies out of collective knowledge data, by utilizing a newly proposed kind of
electronic health records (i.e., eXtended Health Records (XHRs)) and their corresponding networks,
through the efficient analysis and management of ingested healthcare data. In order to achieve that,
beHEALTHIER is architected based upon four (4) discrete and interacting pillars, namely the Data, the
Information, the Knowledge and the Actions pillars. Since the proposed platform is based on MSA, it fully
utilizes MSA's benefits, achieving fast response times and efficient mechanisms for healthcare data collection,
processing, and analysis.
Argyro Mavrogiorgou, Spyridon Kleftakis, Konstantinos Mavrogiorgos, Nikolaos Zafeiropoulos
Andreas Menychtas, Athanasios Kiourtis, Ilias Maglogiannis, Dimosthenis Kyriazis
Analyzing Collective Knowledge Towards Public Health Policy Making
Nowadays there exists a plethora of diverse
data sources producing tons of healthcare data, augmenting the size of the data that
is finally stored both in Electronic Health Records (EHRs) and in Personal Health Records (PHRs). Thus, the great
challenge that emerges is not only to gather all this data in an efficient and effective manner,
but also to extract knowledge out of it. The latter is the key factor that enables healthcare
professionals to make serious clinical decisions both on an individual and on a collective level,
finally forming representative public health policies. Towards this direction, the current
paper proposes a system that supports a new paradigm of EHRs, the eXtended Health Records
(XHRs), which include the majority of the health determinants. XHRs are then transformed
into XHRs Networks that capture the clinical, social and human context of diverse population
segmentations, producing the corresponding collective knowledge. By exploiting this knowledge,
the proposed system is finally able to create multi-modal policies, addressing various facts
and evolving risks that arise from diverse population segmentations.
Spyridon Kleftakis, Konstantinos Mavrogiorgos,
Nikolaos Zafeiropoulos, Argyro Mavrogiorgou,
Athanasios Kiourtis, Ilias Maglogiannis, Dimosthenis Kyriazis
An Optimized KDD Process for Collecting and Processing Ingested and Streaming Healthcare Data
Nowadays organizations are surrounded by enormous amounts of data, losing all the important information that resides in it.
Knowledge Discovery in Databases (KDD)
can aid organizations in transforming this data into valuable knowledge
by extracting complex patterns and relationships from it. To achieve that,
various KDD techniques and tools have been proposed, resulting in impressive
outcomes in various domains, especially in healthcare. Due to the huge amount of
data available within the healthcare systems, data mining is extremely important for
the healthcare sector. However, what is of major importance as well, is the way through which the
data is collected, preprocessed and integrated with each other, considering its heterogeneous and
diverse nature and format. To address all these challenges, this paper proposes a generalized KDD
approach, which in essence constitutes a supplement to all the existing approaches that study and
analyse the data mining part of the KDD process. This approach primarily concentrates on the phases
of the selection, the preprocessing, as well as the transformation of the collected healthcare data,
which are considered to be of great importance for its successful mining, analysis, and interpretation.
The prototype of the proposed approach provides an example of the developed mechanism, explaining its phases in detail,
verifying its possible wide applicability and adoption in various healthcare scenarios.
Argyro Mavrogiorgou, Athanasios Kiourtis, George Manias, Dimosthenis Kyriazis
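A minimal sketch of the three phases the approach concentrates on: selection of relevant attributes, preprocessing (dropping incomplete records), and transformation (min-max normalization). The records and field names are illustrative, not data from the paper:

```python
raw = [
    {"id": 1, "hr": 72, "spo2": 98, "device": "watch-a"},
    {"id": 2, "hr": None, "spo2": 97, "device": "watch-b"},
    {"id": 3, "hr": 110, "spo2": 91, "device": "watch-a"},
]

# Selection: keep only the attributes relevant to the analysis
selected = [{"hr": r["hr"], "spo2": r["spo2"]} for r in raw]

# Preprocessing: drop records with missing values
clean = [r for r in selected if all(v is not None for v in r.values())]

# Transformation: min-max normalize each attribute to [0, 1]
def normalize(records, key):
    values = [r[key] for r in records]
    low, high = min(values), max(values)
    for r in records:
        r[key] = (r[key] - low) / (high - low) if high > low else 0.0

for key in ("hr", "spo2"):
    normalize(clean, key)
# clean == [{"hr": 0.0, "spo2": 1.0}, {"hr": 1.0, "spo2": 0.0}]
```

Only after these three phases does the data reach the mining step that most existing KDD approaches concentrate on, which is why the paper treats them as the decisive part of the process.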