Journal of Cybersecurity and Information Management

Journal DOI: https://doi.org/10.54216/JCIM


ISSN (Online): 2690-6775 | ISSN (Print): 2769-7851

Volume 17, Issue 1, pp. 01-09, 2026 | Full Length Article

Identify and Remove Duplicated Records Using Q-gram and Statistical Techniques from the Data Warehouse

Sura Mahroos 1*, Rihab Hazim 2, Yaqeen Saad 3, Nadia Mohammed 4

  • 1 University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq - (surasms917@uoanbar.edu.iq)
  • 2 University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq - (rehz1991@uoanbar.edu.iq)
  • 3 University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq - (yaqeen.cs91@uoanbar.edu.iq)
  • 4 University of Anbar, College of Islamic Sciences, Anbar, Ramadi, 31001, Iraq - (nadia.fahad@uoanbar.edu.iq)
  • DOI: https://doi.org/10.54216/JCIM.170101

    Received: February 25, 2025 Revised: May 24, 2025 Accepted: July 04, 2025
    Abstract

    Duplicate detection, or record linkage, has many real-world uses: it helps systems make sound judgments by recognizing similar data, linking documents on the web, detecting plagiarism, and serving as a component of many other applications. The problem addressed in this study is that duplicate records are ambiguous: two records belonging to the same customer may differ only in minor ways, so a duplicate can easily be mistaken for a distinct record. The purpose of this study is to combine Q-gram similarity with statistical techniques to find an effective method for detecting and removing duplicate records. To achieve that purpose, we pursue the following goals: reduce the size of the data warehouse (DW) by producing a DW free of duplicates, decrease the time spent searching the DW, and improve the decision support system (DSS). The approach is divided into two stages: first, candidate records are identified using Q-gram similarity; second, statistical methods determine whether those candidates should be classified as duplicates. A similarity threshold of 0.68 was determined; when a record pair's similarity exceeds this threshold, a statistical procedure decides whether the record is a duplicate. The accuracy of the proposed method is 79%.

    Keywords:

    Duplicate Elimination, Data Cleaning, Similarity Score, Q-Gram Similarity, Statistical Tools

    Cite This Article As:
    Mahroos, S., Hazim, R., Saad, Y., & Mohammed, N. (2026). Identify and Remove Duplicated Records Using Q-gram and Statistical Techniques from the Data Warehouse. Journal of Cybersecurity and Information Management, 17(1), 01-09. DOI: https://doi.org/10.54216/JCIM.170101