Comparative Sequence Analysis and Statistical Genetics

 

Evolutionary Patterns of ESTs Among Conifer Species

We recently analyzed the spruce EST collection, which consists of 26,336 contigs and 52,316 singlets and includes 6,464 full-length genes. Together with the large loblolly pine EST collection, it allows a number of new inferences about nucleotide variation and evolution in conifers, particularly for genes involved in insect defense, which tend to occur in gene families. Our analyses of EST sequences in spruce and pine indicate a dramatic slowdown in nucleotide substitution rates, perhaps as much as tenfold relative to angiosperms, in line with other studies that find reduced nucleotide diversity. This suggests the hypothesis that genome information (polymorphisms, QTL locations, gene function, etc.) may be transferable between species of the pine family. Individual genes under strong positive selection were identified, but it is often more useful to detect positive selection for a class of genes (e.g., defense, transport, cell wall formation, and signalling), which in turn informs the likelihood of selection on individual genes in that class. Indeed, certain classes of genes, such as those involved in defense, do show elevated dN/dS ratios. Placing this analysis in a phylogenetic context allows detection of lineage-specific trends: by adding Arabidopsis, we identified genes that differentiate angiosperms from conifers. The expansion of gene families, the extinction of gene lineages, and co-evolution among members of a gene family, as revealed by resequencing, were also examined. These investigations also serve to identify candidate genes for association genetic studies.

Analysis of ESTs Derived from Single-individual cDNA Libraries

To determine the extent of paralogy in the spruce genome, approximately 150,000 ESTs from a Sitka spruce cultivar (FB3-425) and 50,000 ESTs from a white spruce cultivar (PG 29) were analyzed separately for haplotype structure using a pipeline developed in-house. Some contigs in both species contained more than two haplotypes, which is more than a single locus in one diploid individual can carry, suggesting a paralogy rate of ca. 5%. We expect a similar proportion of SNPs to be paralogous in our wet-lab work.
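
The haplotype-count criterion can be illustrated with a minimal sketch. This is not the in-house pipeline; it assumes, for illustration only, that each read in a contig has already been reduced to a string of its alleles at the contig's polymorphic sites, and that a haplotype must be supported by at least `min_reads` reads to be counted (a hypothetical guard against sequencing errors). The function names are likewise hypothetical.

```python
from collections import Counter


def supported_haplotypes(read_haplotypes, min_reads=2):
    """Return haplotypes seen in at least `min_reads` reads, with their counts.

    `read_haplotypes` is a list of strings, one per read, giving the alleles
    of that read at the contig's polymorphic sites (an assumed input format).
    """
    counts = Counter(read_haplotypes)
    return {hap: n for hap, n in counts.items() if n >= min_reads}


def is_putative_paralog(read_haplotypes, min_reads=2, ploidy=2):
    """Flag a contig whose reads support more haplotypes than one locus allows.

    ESTs from a single diploid individual can carry at most `ploidy`
    haplotypes per locus, so a larger number suggests collapsed paralogs.
    """
    return len(supported_haplotypes(read_haplotypes, min_reads)) > ploidy


# Toy data: two haplotypes (single locus) versus three (possible paralogy).
single_locus = ["AC", "AC", "GT", "GT", "GT"]
mixed_loci = ["AC", "AC", "GT", "GT", "AT", "AT"]
print(is_putative_paralog(single_locus), is_putative_paralog(mixed_loci))
```

The support threshold matters in practice: without it, a single miscalled base would create a spurious third haplotype.
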
      In these same two EST collections, we are documenting the extent of genomic imprinting by looking for deviations from the 1:1 ratio expected for the two haplotypes of a single gene locus. A significant deviation is usually due to preferential transcription of one of the two alleles present at a diploid locus.
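
One way to test a contig for deviation from the 1:1 haplotype ratio is an exact two-sided binomial test on the read counts of the two haplotypes. The sketch below is an assumption about how such a test could be set up, not a description of the method actually used in the project; the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    k: number of ESTs carrying haplotype 1
    n: total number of ESTs carrying either haplotype
    p: expected proportion of haplotype 1 (0.5 for equal expression)

    The p-value is the total probability of all outcomes that are no more
    likely than the observed one under the null hypothesis.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that ties survive floating-point rounding.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)


# Toy data: 18 of 20 ESTs from one haplotype versus a balanced 10 of 20.
print(binomial_two_sided_p(18, 20))
print(binomial_two_sided_p(10, 20))
```

Because many contigs are tested at once, the resulting p-values would normally need a multiple-testing correction before any contig is called imprinted.
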
      
Identification of ORFs

Open reading frames (ORFs) are difficult to predict in conifers because of their large evolutionary distance from angiosperms. Accurate ORF prediction is critical for correct prediction of important SNPs. To improve ORF prediction, the consensus sequences of the assembly in one species were compared against those of the other species by reciprocal BLAST. A GenBank BLAST search was then performed for each of the two contigs in a pair, and the 5-10 best-hit sequences were extracted. This strategy improves the alignment of the two contigs and allows gene prediction without a pre-specified translation matrix. ORF prediction was done with BYPASSR, a program developed within our own project. The program infers site-specific substitution rates using a DNA substitution model with continuous variation of substitution rates across sites and a Bayesian Markov chain Monte Carlo formulation. ORFs are predicted from the higher evolutionary rates at third codon positions. Once ORFs are identified, substitution rates are estimated at the codon level and used to identify important SNPs.
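
The idea of locating the reading frame from elevated third-position rates can be sketched as follows. This is not BYPASSR and involves no Bayesian inference: it assumes per-site substitution rates have already been estimated by some other means, considers only the three forward frames, and simply picks the frame in which the putative third codon positions evolve fastest relative to the other two positions. All function names are hypothetical.

```python
def third_position_contrast(rates, offset):
    """Mean rate at putative third codon positions minus mean rate elsewhere.

    rates:  per-site substitution rates along an alignment (assumed input)
    offset: index of the first base of the first complete codon (0, 1 or 2)
    """
    third, other = [], []
    for i, rate in enumerate(rates):
        if i < offset:
            continue  # sites before the first complete codon are ignored
        if (i - offset) % 3 == 2:
            third.append(rate)
        else:
            other.append(rate)
    if not third or not other:
        raise ValueError("alignment too short to evaluate this frame")
    return sum(third) / len(third) - sum(other) / len(other)


def best_frame(rates):
    """Return the frame offset (0, 1 or 2) with the largest contrast."""
    return max(range(3), key=lambda off: third_position_contrast(rates, off))


# Toy data: slow, slow, fast repeated, i.e. a coding region in frame 0.
rates = [0.1, 0.1, 1.0] * 10
print(best_frame(rates))
```

A real analysis would also have to decide where the ORF starts and ends and handle the reverse strand; the sketch only shows the frame-selection signal.
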

Combined QTL Mapping and Association Analysis Using MoF Pedigree Material

We have adopted a modified Transmission-Disequilibrium Test (TDT) to detect associations of SNPs with weevil resistance. The parents are the 450 genotypes in the MoFR breeding populations, consisting of 140-170 trees for each of three breeding zones (Prince George, Prince Rupert, East Kootenay). These trees have been clonally replicated at a number of localities, including the Kalamalka forestry centre, allowing excellent evaluation of genetic values (environmental effects are reduced by clonal replication). Open-pollinated (half-sib) progeny trials of these trees were also established at several localities in 1970, and these progenies have been evaluated for volume growth and weevil resistance at 20 and 30 years. Sampling these progenies will enable the TDT. In June 2006 we sampled 6 progeny from each of 164 families representing the Prince George zone. Three of the progeny were the highest-volume trees among all weevil-free trees within the family, and the other three progeny were the lowest-volume trees among all weevil-infested trees. This sampling of the tails of the distribution of weevil susceptibility is our modification of the TDT and will increase its power. Besides association analysis, we can also test for QTLs linked to markers by estimating the within-family covariance of markers and weevil resistance (the TDT uses the among-family association).
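
For reference, the classical TDT statistic can be computed as in the sketch below. This is the standard, unmodified test, not the tail-sampling version described above: it assumes that, among progeny of one phenotype class (for example, weevil-free trees), we have counted how often heterozygous parents transmitted each of the two SNP alleles. The function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def tdt(transmitted_a, transmitted_b):
    """Classical transmission-disequilibrium test for one biallelic SNP.

    transmitted_a: times heterozygous parents transmitted allele A
    transmitted_b: times heterozygous parents transmitted allele B

    Under no association the two alleles are transmitted equally often, and
    (a - b)^2 / (a + b) is approximately chi-square with 1 degree of freedom.
    Returns (chi_square, p_value).
    """
    total = transmitted_a + transmitted_b
    if total == 0:
        raise ValueError("no informative transmissions")
    chi_square = (transmitted_a - transmitted_b) ** 2 / total
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi_square / 2))
    return chi_square, p_value


# Toy data: allele A transmitted 60 times and allele B 40 times.
chi_square, p_value = tdt(60, 40)
print(chi_square, p_value)
```

Only transmissions from heterozygous parents are informative, so homozygous parents contribute nothing to either count.
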