Abstract

Review Article

A Critical Review on Some Recent Developments in Comparison of Biological Sequences

DK Bhattacharya*

Published: 25 April, 2024 | Volume 7 - Issue 1 | Pages: 008-014

The present review highlights some important contributions to alignment-free methods of comparing biological sequences, which may be genome sequences of nucleotides, protein sequences of amino acids, or sequences of protein secondary structures. The discussion centers on specific methods applicable to each of these three types of sequences. Methods for comparing genome sequences are based on three pairs of biological groups of nucleotides. Methods for protein sequences are based either on physio-chemical property values of amino acids or on classified groups of amino acids of different cardinalities derived from those properties. Methods for sequences of protein secondary structures are based on their sequential expressions in structure elements of cardinality three and four. Comparison is made in both the time domain and the frequency domain. Different taxa of known phylogeny are considered for comparison, and the review tries to identify the specific method of comparison that reproduces the known phylogeny of those taxa. When a new sequence appears in a database, it becomes essential to know its phylogeny. For this purpose, a phylogenetic tree is constructed from the sequences of the known taxa together with the new sequence, using the best available method. If the species carrying the new sequence falls within the known taxa, no further study is needed. Otherwise, that species has to be studied separately. This is the general reason for constructing a phylogenetic tree in any form of biological sequence comparison.
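As a rough illustration of the kind of alignment-free, frequency-domain comparison the abstract describes for genome sequences, the following minimal Python sketch may help. It is not taken from the review. It assumes the three pairs of nucleotide groups are the commonly used purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond pairs, that sequences contain only A, C, G, and T, and that shorter sequences are zero-padded to a common length. The names `GROUP_PAIRS`, `indicator`, `power_spectrum`, `distance`, and `distance_matrix` are hypothetical.

```python
import cmath
import math

# Assumed grouping: the three standard pairs of nucleotide classes.
# Only the first group of each pair is listed; the rest form the second group.
GROUP_PAIRS = {
    "purine/pyrimidine": set("AG"),  # purines A, G vs pyrimidines C, T
    "amino/keto": set("AC"),         # amino A, C vs keto G, T
    "weak/strong": set("AT"),        # weak H-bond A, T vs strong C, G
}


def indicator(seq, first_group):
    """Time-domain signal: +1 for bases in the first group, -1 otherwise."""
    return [1.0 if base in first_group else -1.0 for base in seq.upper()]


def power_spectrum(signal, n):
    """Power spectrum of the signal, zero-padded to length n (naive DFT)."""
    padded = signal + [0.0] * (n - len(signal))
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(padded)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def distance(seq_a, seq_b):
    """Euclidean distance between power spectra, summed over the three pairs."""
    n = max(len(seq_a), len(seq_b))
    total = 0.0
    for first_group in GROUP_PAIRS.values():
        pa = power_spectrum(indicator(seq_a, first_group), n)
        pb = power_spectrum(indicator(seq_b, first_group), n)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb))
    return math.sqrt(total)


def distance_matrix(seqs):
    """Pairwise distances for a dict mapping taxon name to sequence."""
    names = list(seqs)
    return {(a, b): distance(seqs[a], seqs[b]) for a in names for b in names}
```

A distance matrix of this kind is the usual input to a tree-building method such as UPGMA or neighbor joining. Whether the resulting tree reproduces the known phylogeny is the criterion the review applies when judging a method.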

DOI: 10.29328/journal.jgmgt.1001010

Keywords:

Biological groups of nucleotides; Physio-chemical properties of amino acids; Classified groups of amino acids; Standard secondary structures of protein

