
Novel Tiny Textural Motif Pattern-Based RNA Virus Protein Sequence Classification Model


Date

2024

Publisher

Pergamon-Elsevier Science Ltd

Abstract

Background: RNA viruses, including severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), are important human pathogens. Sequencing of the proteins produced by RNA viruses is essential for understanding disease pathogenesis and may have diagnostic and therapeutic implications. We aimed to develop an accurate and computationally efficient handcrafted feature engineering model for classifying the protein sequences of six pathogenic RNA viruses: SARS-CoV-2, influenza A, influenza B, influenza C, human respirovirus 3, and human immunodeficiency virus (HIV)-1. The first five cause primary respiratory infections; the last has some functional similarity with SARS-CoV-2, justifying the need for diagnostic differentiation.

Materials and method: We downloaded 14,787 protein sequences belonging to the six categories in FASTA format from the open-source National Center for Biotechnology Information database and transformed the sequences into numeric arrays. First, each numeric signal was divided into overlapping blocks of three amino acids. Tiny textural motif pattern, a new histogram-based feature extractor, was then applied to extract textural features using simple signum, lower ternary, and upper ternary functions. A total of 512 features were extracted for each protein sequence and fed to an iterative neighborhood component analysis function, which selected the 34 most discriminative features (the optimal number for this study dataset) for downstream classification using a shallow k-nearest neighbor classifier with 10-fold cross-validation.

Novelties: The proposed model has linear time complexity, making it an efficient and robust classification approach, especially for complex datasets. It extends beyond the traditional binary classification focus and successfully distinguishes up to six distinct classes. It also introduces a novel handcrafted feature extraction method that enhances data analysis and yields more precise results.

Results: The model attained 99.71% overall 6-class classification accuracy in a data subset and 99.85% for binary classification of SARS-CoV-2 vs. HIV-1, outperforming a similar published model.

Conclusions: Our simple model accurately classified the protein sequences of six pathogenic RNA viruses and can potentially be implemented in diagnostic applications to improve RNA virus disease screening.
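The feature extraction step described in the abstract (numeric encoding, overlapping three-residue blocks, signum and ternary comparisons, histograms) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the alphabet-position encoding in `CODE`, the pairwise comparison order, the `threshold` value, and the resulting 24-bin layout. The sketch does not reproduce the paper's 512-feature vector, nor the subsequent feature selection and k-nearest neighbor stages.

```python
from itertools import combinations

# Assumption: residues are encoded by their position in this alphabet.
# The abstract only says sequences were "transformed into numeric arrays".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein string to a list of integers (unknown residues -> 0)."""
    return [CODE.get(aa, 0) for aa in sequence.upper()]


def signum(x, y):
    """Binary signum comparison: 1 if x >= y, otherwise 0."""
    return 1 if x - y >= 0 else 0


def ternary(x, y, threshold):
    """Return (upper_bit, lower_bit) for a thresholded ternary comparison."""
    diff = x - y
    upper = 1 if diff > threshold else 0
    lower = 1 if diff < -threshold else 0
    return upper, lower


def block_histograms(sequence, threshold=2):
    """Build three 8-bin histograms from overlapping blocks of three residues.

    Each block yields three pairwise comparisons, so each function produces a
    3-bit pattern (0..7). The hypothetical output layout is
    [signum bins | upper ternary bins | lower ternary bins], 24 values total.
    """
    signal = encode(sequence)
    hist_sign = [0] * 8
    hist_upper = [0] * 8
    hist_lower = [0] * 8
    # Stride of 1 gives overlapping blocks; one pass, so linear in length.
    for i in range(len(signal) - 2):
        block = signal[i:i + 3]
        s = u = l = 0
        for x, y in combinations(block, 2):
            up, low = ternary(x, y, threshold)
            s = (s << 1) | signum(x, y)
            u = (u << 1) | up
            l = (l << 1) | low
        hist_sign[s] += 1
        hist_upper[u] += 1
        hist_lower[l] += 1
    return hist_sign + hist_upper + hist_lower


if __name__ == "__main__":
    features = block_histograms("MKVLAAGIVGL")
    print(len(features), features)
```

Because every block is visited once and each visit does a fixed amount of work, the cost grows linearly with sequence length, which is consistent with the linear time complexity the abstract claims for the model.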

Description

Erten, Mehmet/0000-0002-6664-4568; Aydemir, Emrah/0000-0002-8380-7891; Hafeez-Baig, Abdul/0000-0003-3848-8008; Dogan, Sengul/0000-0001-9677-5684

Keywords

Protein Sequence Classification, SARS-CoV-2, Bioinformatics

WoS Q

Q1

Scopus Q

N/A

Source

Expert Systems with Applications

Volume

242
