SARS detection in chest CT scan images using the bootstrapped ViT-B/16 model

Justice Kwame Appati, Bless Ziamah, Herbert Ansah Akrofi, Albert Ankomah Dodoo

Research output: Contribution to journal › Article › peer-review

Abstract

This study investigates the application of vision transformer (ViT) models for the automated detection of COVID-19 from chest CT scan images. While convolutional neural networks (CNNs) have been widely used for this task, they face limitations in capturing long-range dependencies and global context. To address these challenges, we explore the potential of ViT models, which leverage self-attention mechanisms to analyze images as sequences of patches. We develop and evaluate two ViT-based approaches: a custom ViT model built from scratch and a fine-tuned pre-trained ViT-B/16 model. Using a dataset of 2482 chest CT scan images (1252 COVID-19 positive and 1230 negative), we compare the performance of these models against state-of-the-art CNN-based methods. Our results demonstrate the superiority of the ViT-based approach, with the fine-tuned ViT-B/16 model achieving an accuracy of 98.83%, precision of 99.29%, recall of 98.23%, and F1-score of 98.76%. This performance surpasses that of existing CNN-based models, including DenseNet201 and VGG19. The study highlights the effectiveness of transfer learning in adapting pre-trained ViT models for COVID-19 detection and demonstrates the potential of the ViT architecture to capture subtle patterns and global context in medical images. These findings contribute to advancing AI-assisted COVID-19 diagnosis and pave the way for further exploration of transformer-based architectures in medical image analysis.
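For illustration, the sketch below shows one way the fine-tuning approach described above could be set up: a pre-trained ViT-B/16 from torchvision with its classification head replaced by a two-class output (COVID-19 positive vs. negative) and trained on CT scan images. The dataset path, hyperparameters, and training loop are illustrative assumptions, not the authors' actual configuration.

# Minimal sketch of fine-tuning a pre-trained ViT-B/16 for binary CT classification.
# Dataset layout ("ct_scans/train"), batch size, learning rate, and the single-epoch
# loop are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision.models import vit_b_16, ViT_B_16_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load ViT-B/16 with ImageNet-pretrained weights and replace the classification
# head with a two-class output layer for transfer learning.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)
model = model.to(device)

# ViT-B/16 expects 224x224 inputs split into 16x16 patches; reusing the weights'
# preset transforms keeps normalization consistent with pre-training.
preprocess = weights.transforms()
train_set = datasets.ImageFolder("ct_scans/train", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Minimal fine-tuning loop (one epoch shown for brevity).
model.train()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()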

Original language: English
Journal: Iran Journal of Computer Science
Publication status: Accepted/In press - 2025

Keywords

  • Convolutional neural network
  • Deep learning
  • Fine-tuning
  • Multi-head attention
  • Patch embeddings
  • Pre-trained
  • Vision transformer
