Full Length Article
Fusion: Practice and Applications
Volume 6, Issue 1, PP: 17-25, 2021


Design of Effective Lossless Data Compression Technique for Multiple Genomic DNA Sequences

Authors Names :  Mahmud Alosta 1, Alireza Souri 2

1  Affiliation :  Software Engineering and IT Department, Ecole de technologie superieure, Montreal (Qc), Canada

    Email :  mahmud.alosta.1@ens.etsmtl.ca

2  Affiliation :  Department of Computer Engineering, Halic University, Beyoglu, Istanbul, Turkey

    Email :  alirezasouri@halic.edu.tr

Doi   :   https://doi.org/10.54216/FPA.060103

Received: April 08, 2021 Accepted: August 12, 2021

Abstract :

In recent years, massive volumes of genomic DNA sequences have been generated, which has driven the development of new storage and archiving methods. Processing, storing, and transmitting this volume of DNA sequence data remains a major challenge. Data compression (DC) techniques reduce the number of bits needed to store and transmit data. DC has recently grown in popularity, and a large number of techniques have been proposed with applications in several domains. In this paper, a lossless compression technique, arithmetic coding, is employed to compress DNA sequences. To validate the performance of the proposed model, experiments were performed on an artificial genome dataset, and the results are examined in terms of different evaluation parameters. The compression performance of arithmetic coding is compared with that of the Huffman coding, LZW coding, and LZMA techniques. The simulation results indicate that arithmetic coding achieves significantly better compression, with a compression ratio of 0.261 at a bit rate of 2.16 bpc.
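
As a rough illustration of the arithmetic coding idea applied to a DNA string, the following Python sketch encodes a short sequence with exact rational arithmetic and a static order-0 symbol model. It is not the authors' implementation: the function names, the toy sequence, and the metric definitions (bits per character as coded bits divided by sequence length, compression ratio as coded bits divided by 8 bits per original character) are assumptions made for this example, and the cost of storing the model and the sequence length is ignored.

```python
from collections import Counter
from fractions import Fraction
import math


def build_model(seq):
    """Static order-0 model: map each symbol to its cumulative interval [low, high)."""
    counts = Counter(seq)
    total = len(seq)
    model, low = {}, Fraction(0)
    for sym in sorted(counts):
        p = Fraction(counts[sym], total)
        model[sym] = (low, low + p)
        low += p
    return model


def encode(seq, model):
    """Narrow [0, 1) once per symbol, then return (k, bits) with k / 2**bits inside
    the final interval, using as few bits as possible."""
    low, width = Fraction(0), Fraction(1)
    for sym in seq:
        s_low, s_high = model[sym]
        low += width * s_low
        width *= s_high - s_low
    bits = 0
    while True:
        scale = 1 << bits
        k = math.ceil(low * scale)  # smallest multiple of 2**-bits that is >= low
        if Fraction(k, scale) < low + width:
            return k, bits
        bits += 1


def decode(k, bits, n, model):
    """Invert encode(); the decoder is assumed to know the sequence length n."""
    value = Fraction(k, 1 << bits)
    out = []
    for _ in range(n):
        for sym, (s_low, s_high) in model.items():
            if s_low <= value < s_high:
                out.append(sym)
                # Rescale the value back into [0, 1) for the next symbol.
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)


# Toy sequence, hypothetical and not taken from the paper's dataset.
seq = "ACGTTTGACCAATGCAAGGT"
model = build_model(seq)
code, bits = encode(seq, model)
assert decode(code, bits, len(seq), model) == seq  # lossless round trip

bpc = bits / len(seq)          # assumed definition: coded bits per character
ratio = bits / (8 * len(seq))  # assumed definition: coded bits / 8-bit original
print(f"{bits} bits, {bpc:.2f} bpc, compression ratio {ratio:.3f}")
```

Exact fractions keep the sketch short and make the round trip easy to verify, but their size grows with sequence length; practical arithmetic coders use fixed-precision integer ranges with renormalization instead.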

Keywords :

Arithmetic coding; Dataset; Data compression; DNA sequences; Lossless Compression

Cite this Article as :
Mahmud Alosta, Alireza Souri, Design of Effective Lossless Data Compression Technique for Multiple Genomic DNA Sequences, Fusion: Practice and Applications, Vol. 6, No. 1, (2021): 17-25 (Doi: https://doi.org/10.54216/FPA.060103)