01/10/2022
Abstract
Data Cleaning is a subfield of Data Mining that has been thriving in recent years. Ensuring the reliability of data, whether
generated or received, is vital to providing the best possible services to users.
Accomplishing this task is easier said than done, since data are complex,
generated at an extremely high rate, and enormous in size.
A variety of techniques and methods from other subfields of Computer Science
have been employed to make Data Cleaning as efficient
and effective as possible. These subfields include, among others, Natural Language Processing (NLP),
which in essence concerns the interaction between computers and human language and seeks
ways to program computers to process and analyze huge volumes of human language data. NLP
has existed for a long time, but it has increasingly been proposed that it can be applied
to a variety of problems that are not solely NLP-related. In this paper, a rule-based data cleaning
mechanism is proposed, which utilizes NLP to ensure data reliability. Making use of NLP enabled the
mechanism to be not only extremely effective but also considerably more efficient than
corresponding mechanisms that do not utilize NLP. The mechanism was evaluated on diverse healthcare
datasets; it is not, however, limited to the healthcare domain, but supports a generalized data cleaning concept.
Authors
Konstantinos Mavrogiorgos, Argyro Mavrogiorgou, Athanasios Kiourtis, Nikolaos Zafeiropoulos, Spyridon Kleftakis, Dimosthenis Kyriazis