Thursday, June 30, 2005

Things that are not exons

I have thought for many years that the genomics community needs a term other than 'exon' for coding segments. This post points out how lacking such a name has led to misuse of the word 'exon'. I also suggest that the word 'croe' be used instead, but my primary purpose is to call attention to the need for new names. I would be happy to have other names used properly.

This was presented at the Alternative Splicing SIG at ISMB. My presentation in PowerPoint form is available here and is posted on my web site as Posting 3. My hope is that the term be introduced into the Sequence Ontology, but I'll leave it up to my friends there to get it right.

An exon is defined as a segment of a gene that is present in the mature mRNA product of that gene. Genes for noncoding RNAs that are spliced are divided into exons and introns (examples include tRNAs and rRNAs, as well as a variety of noncoding RNA polymerase II transcripts) and every spliced mRNA has at least two exons that are partly noncoding, containing the 5' UTR and the 3' UTR. However, the need to refer to isolated coding segments that are often complete exons but are sometimes only a part of an exon has led many people to use the term 'exon' inappropriately, and this has created confusion. In one extreme case, a published paper presents an "exon size distribution" which includes many coding segments that are only part of an exon. There are many other examples.

Some people are careful to get it right, and many of them use the term CDS to refer to these coding segments. For example, Michael Zhang, in his excellent 2002 review of computational genefinding (PubMed) writes "To discriminate CDS from intervening sequence, the best content measures are the so-called frame-specific hexamer frequencies" and "... hexamer frequencies alone can detect most [long] CDS regions." However, CDS has shortcomings as a word. Foremost among them is its ambiguous meaning. The same exact term is used to refer to the entire coding region of a gene. This is analogous to using the same word for exon and mRNA.

I am grateful to Myles Axton (Nature Genetics 37:15 (01 Jan 2005), "Touching Base") for introducing the readers of Nature Genetics to his term for coding segments that are less than an entire exon, which is CROE (coding region of an exon, pronounced as in "crow"). Because the term 'exon' never communicates anything about where coding information lies, it is important that the term 'croe' apply as well to coding regions that are coincident with an exon. People should be able to say "the croes of this gene" when they refer to the units that together make up a full CDS.
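To make the distinction concrete, here is a minimal Python sketch of how croes could be derived from an annotation. It assumes exons are given as half-open intervals on one strand and that the full coding region is described by a single start and end coordinate. The function name `croes` and the example coordinates are made up for illustration and do not come from any existing tool or real gene.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), hypothetical coordinates


def croes(exons: List[Interval], cds_start: int, cds_end: int) -> List[Interval]:
    """Return the coding region of each exon: its overlap with the CDS span.

    Exons that lie entirely in a UTR contribute nothing, partly coding
    exons contribute only their coding part, and fully coding exons are
    returned unchanged (the croe coincides with the exon).
    """
    result = []
    for start, end in sorted(exons):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:  # keep only non-empty overlaps
            result.append((lo, hi))
    return result


# Three exons; the first and last are partly noncoding (5' and 3' UTR).
example_exons = [(100, 200), (300, 450), (600, 800)]
example_croes = croes(example_exons, cds_start=150, cds_end=652)
print(example_croes)  # [(150, 200), (300, 450), (600, 652)]

# The croes together make up the full CDS (here 252 nt, i.e. 84 codons).
print(sum(end - start for start, end in example_croes))  # 252
```

In this toy gene only the middle exon is also a croe. A "size distribution" built from the first and last intervals would describe croes, not exons.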

Alternatively spliced segments. I have a related concern that there be a term for segments that appear as indels when two alternatively spliced mRNAs (or cDNAs) are compared. Such a segment can be a complete exon, part of an exon (occurring between two alternative splice sites) or an intron, and need not be coding. Kondrashov and Koonin refer to these various mechanisms as generating LDAS (length difference alternative splicing; 2003 PubMed | Trends in Genetics 19:115-9) but do not suggest a name for the segments themselves (other than "alternative segment" or "inserted alternative segment," terms they use repeatedly). One idea is 'asproe,' for alternatively spliced region of an exon, which has the advantage of being paired with croe (but the disadvantage that a single insertion may consist of two or more croes, making it an alternatively spliced region of exons in the plural, and will often be less than an entire croe). It is a useful concept. If one has in hand cDNA or EST sequences that differ by an insertion, the mode of alternative splicing is unknown, but the alternatively spliced region is clear, even when genomic sequence is not available. Finally, there could be two terms here: one to refer to the alternative segment at the nucleotide level and another to refer to the alternative segment at the protein level. These need not correspond; an interesting case is where the length of the segment is not a multiple of 3 nucleotides, so that the coding of downstream regions is affected. A classical case, found in the first complete eukaryotic genome sequence (SV40), comes from the small t antigen, in which overlapping reading frames are created by alternative splicing.
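The point that the alternative segment can be identified from cDNA sequences alone can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two sequences are error-free and differ by exactly one contiguous insertion. The function name `alternative_segment` and the toy sequences are hypothetical. When the inserted bases repeat at the boundary, the position reported is the rightmost equivalent placement, but the length of the segment is unambiguous.

```python
from typing import Tuple


def alternative_segment(short: str, long: str) -> Tuple[int, str]:
    """Find the single contiguous insertion that turns `short` into `long`.

    Returns (offset in `short`, inserted sequence). Raises ValueError if
    the two sequences do not differ by exactly one contiguous insertion.
    """
    extra = len(long) - len(short)
    if extra <= 0:
        raise ValueError("`long` must be longer than `short`")
    # Walk along the shared prefix.
    prefix = 0
    while prefix < len(short) and short[prefix] == long[prefix]:
        prefix += 1
    # Whatever remains of `short` must match the tail of `long`.
    if short[prefix:] != long[prefix + extra:]:
        raise ValueError("sequences differ by more than one insertion")
    return prefix, long[prefix:prefix + extra]


def shifts_frame(segment: str) -> bool:
    """True if including the segment changes the downstream reading frame."""
    return len(segment) % 3 != 0


# Toy cDNAs: the longer form carries a 5-nt alternative segment.
offset, segment = alternative_segment("ATGGCCAAATGA", "ATGGCCGTTTCAAATGA")
print(offset, segment, shifts_frame(segment))  # 6 GTTTC True
```

Nothing here says whether the segment is an exon, part of an exon, or a retained intron. That is the sense in which the alternatively spliced region is clear while the mode of splicing is not.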

Wednesday, June 29, 2005

ISMB 2005

I have been in Detroit for a week. The two-day Alternative Splicing meeting preceding ISMB was outstanding, and really crystallized a community of people who are working on genome scale analysis of alternative splicing.

Two things that really struck me at this meeting were:

1) the importance of ontologies (and, more generally, the formal description of scientific knowledge). There were 51 posters in the section on ontologies and NLP. One title that caught my eye was "Transforming Full-Text Literature to Formalized Facts." I was trained to believe that scientific publication was the formalization of facts! I see that it's not good enough anymore. Ewan Birney articulated this clearly in his Keynote address this morning when he said that databases are Biology, "the starting point and the end point of our understanding." I heard calls at this meeting for the formal annotation of data on function analogous to the submission of sequence data. This is clearly coming. Experimental scientists who want their results to be included in emerging system-wide descriptions will have to participate, and informaticians will have to find a way to collect formal descriptions of functional data (Janet Thornton, in her keynote, referred to this as data harvesting and showed a cartoon).

2) The idea that very few people can speak "both languages" (Biology and Computing) is outdated. Being at the alternative splicing workshop really brought this home. It reminds me of being in Miami, where virtually everyone speaks both English and Spanish perfectly. It's still true that the majority of Biologists are inadequately familiar with databases and computers, and that the majority of computer scientists don't "get" biological questions, but virtually everyone here (a large meeting with well over 1,000 people) is completely bilingual. This is a change from just five years ago and it means that we can stop worrying about translation and get on with the research.

Another very interesting point was in the keynote by Jill Mesirov on the use of Gene Sets. By using predefined sets of genes (her "knowledge base") she was able to apply rank statistics to find significant differences between microarray data sets for which no single gene shows a significant difference. She has published on these methods (e.g. Brunet et al. 2004) but it was new to me.

The hotel (Renaissance Marriott) was nice in many ways, but had its problems. When I arrived, they could not make keys; I had to be let into my room by a valet and come back later. Once in my room, I discovered that the phone didn't work. The internet was constantly going down (which caused problems for two of the three presentations I saw that used it). Twice (2/7 days), housecleaning did not replace the coffee packets. Access to the hotel itself, and navigation among the first three floors, was absurdly indirect. This design feature is apparently related to ideas of security more evocative of the Middle Ages (embattled castles protected by moats) than the Renaissance (intellectual excitement derived from an open exchange of people and ideas). The architecture reflects a philosophy which ignores the fact that inaccessibility leads to marginalization. This center houses the General Motors corporate headquarters and I was led to an image of GM executives cowering like Quasimodo in his tower, in this case the Detroit Dark Ages Center, while life goes on below them (and without them).

Saturday, June 11, 2005

Cultural Transmission of Fitness

It was more or less by chance that I read the recent article by Heyer, Sibert and Austerlitz in the April issue of Trends in Genetics about what they call cultural transmission of fitness as carefully as I did. I had it with me on a plane today, and the seats on Northwest Airlink were just too close together for me to get out my laptop. CTF is the nongenetic transmission of fitness, and they make an intuitively compelling case (PubMed) that CTF can have a huge effect on effective population size and coalescence times. Their model appears applicable not only to the transmission of true culture in human populations, but also to epigenetic changes and artificial selection. It's not every day that a new idea in population genetics is articulated, and I found this fairly exciting. However, the idea is more a formulation of ideas that I've been vaguely aware of for a long time than an entirely new idea. This does not take anything away from them; a formal statement of a phenomenon and its consequences is what constitutes progress in population genetics (the real work is presented in Sibert, Austerlitz and Heyer, Theoretical Population Biology 2002; PubMed). Furthermore, their citations suggest that the idea has been around for a while (although it's new to me). In fact, the applicability of this model to my previous post has apparently already been tested and rejected ("CTF was [not detected] in Ashkenazi Jews")!

Saturday, June 04, 2005

Selection vs. differential allele flow

The media is reporting (NYTimes; Economist) that there is a paper in press in The Journal of Biosocial Science that attributes the pattern of inherited diseases among Ashkenazi Jews to selection for intelligence. This hypothesis breaks not one but several taboos by talking about race, selection and intelligence, so I'm reluctant to say anything at all about it. However, I think that they missed something (I won't be sure until I see the paper, which is not out yet). Selection, "red in tooth and claw," need not be invoked. Differential migration out of the population could have a powerful effect and seems to have been overlooked. In a minority population with asymmetric gene flow (in other words, whenever the rate of assimilation into greater society exceeds the rate of acquisition of new converts) any genetic variation that disfavors assimilation will increase in frequency in the minority population. It is plausible that intelligence could be enhanced by this (for example, if intelligence improved one's ability to learn Torah or become a rabbi and those things made assimilation less likely). It is also plausible that alleles causing non-lethal genetic diseases could actually be favored within a minority population by reducing the probability that affected individuals would leave, which seems likely if the community provided care not available outside and not needed by healthier relatives who were therefore more likely to leave.
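The asymmetric gene flow argument can be written down as a very simple deterministic recursion. The Python sketch below is a toy, not a model taken from the paper or from the population genetics literature. It assumes a haploid-style allele frequency, no fitness differences at all, a per-generation probability of leaving the population that differs between carriers and non-carriers, and an optional inflow of newcomers with a fixed outside allele frequency. All names and parameter values are hypothetical.

```python
from typing import List


def allele_trajectory(
    p0: float,
    leave_carrier: float,
    leave_noncarrier: float,
    convert_rate: float,
    p_outside: float,
    generations: int,
) -> List[float]:
    """Allele frequency in a minority population under differential emigration.

    Each generation, carriers leave with probability `leave_carrier` and
    non-carriers with probability `leave_noncarrier`. A fraction
    `convert_rate` of the population is then replaced by newcomers whose
    allele frequency is `p_outside`. No selection is applied.
    """
    p = p0
    trajectory = [p]
    for _ in range(generations):
        stay_carrier = p * (1.0 - leave_carrier)
        stay_noncarrier = (1.0 - p) * (1.0 - leave_noncarrier)
        p = stay_carrier / (stay_carrier + stay_noncarrier)  # after emigration
        p = (1.0 - convert_rate) * p + convert_rate * p_outside  # after inflow
        trajectory.append(p)
    return trajectory


# Hypothetical numbers: non-carriers are 2% more likely to leave per generation.
traj = allele_trajectory(
    p0=0.01, leave_carrier=0.0, leave_noncarrier=0.02,
    convert_rate=0.0, p_outside=0.01, generations=100,
)
print(round(traj[0], 4), round(traj[-1], 4))
```

With these made-up values the allele rises from 1% to roughly 7% in 100 generations without any difference in survival or reproduction. Setting the two leaving probabilities equal leaves the frequency unchanged.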