Modeling coding sequence design for virus-based expression in tobacco
Transient expression in tobacco is a popular way to produce recombinant proteins in plants. Expression vectors delivered into the plant by Agrobacterium have enabled high production levels for some proteins. To further enhance expression, researchers often adapt the coding sequence of heterologous genes to the host, but this strategy has produced mixed results in tobacco.
To study the effects of different sequence features on protein yield, we compile a dataset of protein yields and coding sequences for more than 200 coding sequences from previously published expression studies.
We evaluate various established gene expression models on a subset of the expression studies. We find that use of tobacco codons is only moderately predictive of protein yield, because informative sequence features likely extend over multiple codons. Additionally, we show that the codon usage of organisms that express their own proteins in tobacco in a way similar to the synthetic system, such as viruses and agrobacteria, can be used to predict heterologous expression. Other predictive features are related to tRNA supply and demand, the inclusion of a translational ramp (codons with lower adaptation to the tRNA pool at the beginning of the coding region), and the amino acid composition of the recombinant protein. A model based on all the features achieves a correlation of 0.57 with protein yield.
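To make the codon-usage and ramp features concrete, the following is a minimal sketch, not the models evaluated in the study. It assumes a codon adaptation index in the classic style (geometric mean of relative codon weights) computed against a reference set of coding sequences, which could be tobacco genes or genes of a virus that infects tobacco. The function names, the pseudocount, and the ramp definition (mean log-weight of the first `n_ramp` codons minus that of the rest) are illustrative assumptions.

```python
import math

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def split_codons(seq):
    """Split a coding sequence into complete, recognized codons."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return [c for c in codons if c in CODE]


def relative_adaptiveness(reference_seqs, pseudocount=1.0):
    """Weight of each codon relative to the most frequent synonymous codon.

    The pseudocount (an assumption of this sketch) avoids zero weights.
    """
    counts = {codon: pseudocount for codon in CODE}
    for seq in reference_seqs:
        for codon in split_codons(seq):
            counts[codon] += 1
    best = {}
    for codon, aa in CODE.items():
        best[aa] = max(best.get(aa, 0.0), counts[codon])
    return {codon: counts[codon] / best[CODE[codon]] for codon in CODE}


def log_weights(seq, w):
    """Log-weights of informative codons (stops, Met, and Trp are skipped)."""
    return [math.log(w[c]) for c in split_codons(seq) if CODE[c] not in "*MW"]


def cai(seq, w):
    """Geometric mean of codon weights over the sequence."""
    logs = log_weights(seq, w)
    if not logs:
        raise ValueError("sequence has no informative codons")
    return math.exp(sum(logs) / len(logs))


def ramp_score(seq, w, n_ramp=30):
    """Hypothetical ramp feature: negative if the first codons are less adapted."""
    logs = log_weights(seq, w)
    head, tail = logs[:n_ramp], logs[n_ramp:]
    if not head or not tail:
        raise ValueError("sequence too short for the chosen ramp length")
    return sum(head) / len(head) - sum(tail) / len(tail)
```

Swapping the reference set changes the host model: passing viral coding sequences to `relative_adaptiveness` instead of tobacco genes gives a score in the spirit of the viral codon-usage feature described above.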
We believe that our study provides a practical guideline for coding sequence design for efficient expression in tobacco.