Identify and Remove Duplicated Records Using Q-gram and Statistical Techniques from the Data Warehouse

 

 

 

Sura Mahroos1,*, Rihab Hazim1, Yaqeen Saad1, Nadia Mohammed2

 

1University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq

 

2University of Anbar, College of Islamic Sciences, Anbar, Ramadi, 31001, Iraq

 

Emails: surasms917@uoanbar.edu.iq; rehz1991@uoanbar.edu.iq; yaqeen.cs91@uoanbar.edu.iq; nadia.fahad@uoanbar.edu.iq

 

 

 

 

 

Abstract

 

Duplicate detection, also known as record linkage, has many real-world uses. It arises in a broad range of tasks, including recognizing similar data, linking online documents across the web, and detecting plagiarism, and it supports applications that depend on clean data to make sound decisions. Routing is likewise crucial for improving the financial return and applicability of logistics projects. The problem addressed in this study is as follows: duplicate records for the same customer share the same data restrictions and limitations and differ only by minor changes, so they are hard to distinguish from other edited records that have been reassembled for that customer. The purpose of this study is to find the best method for detecting and removing duplicate records using the Q-gram and statistical techniques. To that end, we set two goals: (1) reduce the size of the data warehouse (DW) by providing a DW free of duplicates, and (2) decrease the time spent searching the DW and thereby improve the decision support system (DSS). The approach is divided into two stages: first, identify similar records based on Q-gram similarity; second, determine whether the classification of records can be improved by statistical methods. A similarity threshold of 0.68 was determined. If the key similarity ratio exceeds this threshold, the record goes through a statistical process that decides whether it is a duplicate. The accuracy of the proposed work is 79%.

 

Keywords: Duplicate Elimination; Data Cleaning; Similarity Score; Q-Gram Similarity; Statistical Tools
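
The first stage summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the q-gram length or the similarity formula, so the choice of q = 2, the `#` padding, and the Dice coefficient over q-gram multisets are all assumptions, and the function names (`qgrams`, `qgram_similarity`, `candidate_duplicates`) are hypothetical. Only the 0.68 threshold comes from the text.

```python
from collections import Counter

THRESHOLD = 0.68  # similarity threshold reported in the abstract


def qgrams(text, q=2):
    """Return the multiset of q-grams of a padded, lowercased string.

    Assumption: q = 2 and '#' padding; the abstract does not specify either.
    """
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets, in the range [0, 1].

    Assumption: the Dice coefficient stands in for the paper's formula.
    """
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0  # two strings with no q-grams are treated as identical
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * shared / total


def candidate_duplicates(keys, threshold=THRESHOLD, q=2):
    """Stage 1 only: index pairs whose key similarity exceeds the threshold."""
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if qgram_similarity(keys[i], keys[j], q) > threshold:
                pairs.append((i, j))
    return pairs
```

For example, `candidate_duplicates(["John Smith", "Jon Smith", "Mary Jones"])` flags only the first two keys as a candidate pair. In the approach described above, such pairs would then go to the second, statistical stage; the abstract gives no detail on that stage, so it is not sketched here. The pairwise loop is quadratic in the number of records and is meant only to show the comparison, not to scale to a full data warehouse.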