BUILDING PLANT PROTEIN SEQUENCE DATABASES FROM RNA-SEQ EXPERIMENTS

University
03 March 2026

The workshop, held on 18 February 2026, aimed to strengthen capacity in bioinformatics and proteomics among researchers and postgraduate students. Professor Tabb, whose career in proteome informatics spans nearly three decades, led participants through the theoretical and practical foundations of transcriptome-based protein database construction.

Professor Tabb has contributed to research initiatives across the United States, South Africa, France and the Netherlands. His research interests include non-model organisms, quality assessment, bioinformatics tool evaluation and meta-analysis.

Contextualising his presentation, Professor Tabb reflected on a research challenge first encountered in 2017 at the University of the Western Cape. Researchers had generated proteomics data for Salvia hispanica, a plant species lacking a fully annotated genome.

UMP Dean of the Faculty of Agriculture and Natural Sciences, Professor Ndiko Ludidi, facilitated the seminar.

“In many plant research projects, especially those involving non-model organisms, we do not have the luxury of a complete genome annotation,” Professor Tabb explained. “Yet, for proteomics analysis to be meaningful, we require a reliable database of protein sequences. The question then becomes: how do we build that database?”

He demonstrated how RNA sequencing offers a powerful and practical solution. By leveraging RNA-Seq data, researchers can assemble transcriptomes and subsequently translate these transcripts into putative protein sequences suitable for proteomic analysis. Guiding participants step-by-step, he moved from raw sequencing reads through transcript assembly and quality control to protein translation. Central to this process is transcript assembly using specialised software such as Trinity.

Professor Tabb noted that transcriptome assembly is computationally intensive and requires high-memory computing infrastructure.

“These assemblies are not lightweight processes,” he said. “They often require machines with substantial RAM and extended run times. Planning and institutional support are essential.”

Beyond computational considerations, he emphasised that quality control is foundational to producing biologically reliable datasets. Mapping original RNA-Seq reads back to assembled transcripts enables researchers to assess the accuracy and completeness of their assemblies.

“Quality assessment is not optional,” he said. “If we do not evaluate our assemblies critically, we risk building protein databases on unstable foundations.”
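One simple number that comes out of mapping reads back to an assembly is the fraction of reads that align. The sketch below only illustrates that arithmetic; the function name and the read counts are hypothetical and are not taken from the workshop.

```python
def mapping_rate(total_reads, mapped_reads):
    """Return the percentage of reads that align back to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must lie between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical counts, for illustration only.
rate = mapping_rate(total_reads=40_000_000, mapped_reads=36_400_000)
print(f"{rate:.1f}% of reads map back to the assembly")
```

A low percentage would suggest that a substantial share of the sequenced material is not represented in the assembled transcripts.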

From Assembly to Annotation

The workshop also addressed one of the persistent challenges in transcriptome-based proteomics: the presence of non-coding RNAs within assembled datasets.

Professor Tabb introduced tools for identifying and removing abundant non-messenger RNAs, including ribosomal and transfer RNAs, which can inflate datasets and complicate downstream analysis. He further discussed evaluating transcriptome completeness through assessment of single-copy orthologs, which serve as biological benchmarks for determining how comprehensively a species’ gene content has been captured.

“These completeness scores provide confidence,” he explained. “They help us understand whether our assembly represents the majority of expected genes or whether critical components may still be missing.”
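The article does not name the tool used for this step. BUSCO is one widely used example, and it sorts each expected single-copy ortholog into complete (single-copy or duplicated), fragmented or missing. The sketch below shows how such counts might be turned into percentages; the function name and the counts are hypothetical.

```python
def completeness_summary(single_copy, duplicated, fragmented, missing):
    """Convert ortholog category counts into percentages of the total."""
    total = single_copy + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one ortholog must be counted")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "complete": pct(single_copy + duplicated),
        "single_copy": pct(single_copy),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Hypothetical counts for an assembly scored against 1000 expected orthologs.
summary = completeness_summary(single_copy=700, duplicated=180,
                               fragmented=50, missing=70)
print(summary)
```

In this made-up case 88.0% of the expected orthologs are complete and 7.0% are missing, the kind of figure that indicates whether most expected genes have been captured.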

A key highlight of the workshop was the explanation of six-frame translation and the role of TransDecoder in identifying genuine protein-coding sequences. Because a transcript can be read in three frames on each strand, six in total, determining the correct reading frame is essential for accurate protein prediction. Professor Tabb emphasised that robust protein identification requires integrating multiple lines of evidence.
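To make the idea of six reading frames concrete, here is a minimal sketch in plain Python using the standard genetic code. It is not TransDecoder's method, which goes further by selecting and scoring candidate open reading frames; the function names and the example transcript are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(transcript):
    """Translate a transcript in three frames on each strand."""
    seq = transcript.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Hypothetical toy transcript: ATG GCC TAA reads as M, A, stop in frame +1.
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Only frame +1 of this toy sequence begins with a methionine and ends at a stop codon; the other five frames yield fragments, which is why frame selection matters.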

“No single criterion is sufficient,” he said. “Length alone is not enough. Homology alone is not enough. Domain matches alone are not enough. When we combine sequence length, similarity to known proteins and conserved domain information, we dramatically improve the reliability of our predictions.”
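One way to picture combining evidence is a simple rule that keeps a candidate only when at least two independent criteria support it. The sketch below is an illustration under that assumption, not the procedure taught in the workshop or used by any particular tool; the class, the length threshold and the two-of-three rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OrfCandidate:
    name: str
    length_aa: int          # length of the predicted protein
    has_homology_hit: bool  # similar to a known protein
    has_domain_hit: bool    # contains a conserved domain


MIN_LENGTH_AA = 100  # hypothetical length threshold
MIN_EVIDENCE = 2     # hypothetical rule: two of three criteria must agree


def evidence_count(orf):
    """Count how many of the three criteria support this candidate."""
    return sum([
        orf.length_aa >= MIN_LENGTH_AA,
        orf.has_homology_hit,
        orf.has_domain_hit,
    ])


def select_candidates(orfs):
    """Keep candidates supported by at least MIN_EVIDENCE criteria."""
    return [orf for orf in orfs if evidence_count(orf) >= MIN_EVIDENCE]


# Hypothetical candidates, for illustration only.
candidates = [
    OrfCandidate("long_supported", 350, True, True),
    OrfCandidate("short_supported", 60, True, True),
    OrfCandidate("long_only", 400, False, False),
    OrfCandidate("short_domain_only", 45, False, True),
]
print([orf.name for orf in select_candidates(candidates)])
```

Under this rule a long sequence with no other support is dropped, while a short sequence backed by both homology and a domain match is retained, so length alone neither qualifies nor disqualifies a candidate.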

He underscored the importance of linking newly generated protein sequences to well-annotated reference species such as rice or Arabidopsis. Through orthology analysis and domain scanning tools such as InterPro and Pfam, researchers can assign functional descriptions and gene ontology terms that transform raw sequence data into biologically meaningful insights.

“A descriptive protein name carries scientific value,” he noted. “It allows researchers to interpret results in context. Accession numbers alone do not tell the biological story.”

Dr Ali Elnaeim Elbasheir Ali, Senior Lecturer at UMP.

Professor Tabb also cautioned against inadvertently excluding short protein sequences during filtering, noting that even established databases have historically overlooked important short proteins. Maintaining a critical and analytical perspective remains essential in computational biology.

Throughout the workshop, participants engaged actively with the presented workflows, raising questions about implementation strategies, infrastructure requirements and potential applications within their own research contexts. The discussions reflected growing interest in applying advanced bioinformatics methodologies to plant science, agriculture and biodiversity research in South Africa.

The workshop strengthened technical understanding of RNA-Seq-based proteomics workflows and highlighted the value of interdisciplinary collaboration. By integrating molecular biology, computational analysis and functional annotation, researchers are better equipped to address complex biological questions, particularly in understudied or non-model organisms.

Hosting international experts such as Professor Tabb aligns with UMP’s strategic vision of building research capacity and fostering global academic partnerships. The workshop demonstrated how access to cutting-edge methodologies can empower researchers and postgraduate students to contribute meaningfully to scientific innovation at both national and international levels.

Professor Tabb’s visit provided a valuable platform for technical skills development and collaborative engagement, further supporting the growth of scholarly networks that will sustain future partnerships.
