Stats: More on discovering gene information (October 12, 2005)

I was reading an interesting microarray article:

A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, Ewen ME. Cell 2003: 114(3); 323-34.

and was curious what information I could find about cyclin D1. The article mentions the gene symbol (CCND1) but provides no other obvious clues (at least clues that were obvious to me).

The easy way to start, of course, is to use Google or another Internet search engine. The first site listed in the results is OMIM (Online Mendelian Inheritance in Man), which has its main page at

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

According to this page, OMIM

is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI, the National Center for Biotechnology Information. The database contains textual information and references. It also contains copious links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere.

The OMIM database uses a six-digit code (the MIM number) to index its entries.

Each OMIM entry is given a unique six-digit number whose first digit indicates the mode of inheritance of the gene involved:
1----- (100000- ) Autosomal loci or phenotypes (entries created before May 15, 1994)
2----- (200000- ) Autosomal loci or phenotypes (entries created before May 15, 1994)
3----- (300000- ) X-linked loci or phenotypes
4----- (400000- ) Y-linked loci or phenotypes
5----- (500000- ) Mitochondrial loci or phenotypes
6----- (600000- ) Autosomal loci or phenotypes (entries created after May 15, 1994).

An allelic variant is designated by the MIM number of its parent entry, followed by a decimal point and a unique 4-digit variant number. For example, allelic variants (mutations) at the factor IX (hemophilia B) locus are numbered 306900.0001 to 306900.0101. The beta-globin locus (HBB) is numbered 141900; sickle hemoglobin is numbered 141900.0243.
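The numbering rules above can be sketched in a few lines of Python. This is only an illustration built from the quoted table and the two examples. The names parse_mim, MIM_CATEGORIES, and MIM_PATTERN are my own and do not come from OMIM, and the sketch assumes a MIM number is written as exactly six digits, optionally followed by a decimal point and four digits.

```python
import re

# Categories copied from the OMIM numbering table quoted above,
# keyed by the first digit of the six-digit MIM number.
MIM_CATEGORIES = {
    "1": "Autosomal loci or phenotypes (entries created before May 15, 1994)",
    "2": "Autosomal loci or phenotypes (entries created before May 15, 1994)",
    "3": "X-linked loci or phenotypes",
    "4": "Y-linked loci or phenotypes",
    "5": "Mitochondrial loci or phenotypes",
    "6": "Autosomal loci or phenotypes (entries created after May 15, 1994)",
}

# Six digits, optionally followed by ".dddd" for an allelic variant.
MIM_PATTERN = re.compile(r"([0-9]{6})(?:\.([0-9]{4}))?")


def parse_mim(text):
    """Split a MIM number into (entry, variant, category).

    variant is None when the text names an entry rather than an
    allelic variant.
    """
    match = MIM_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a MIM number: {text!r}")
    entry, variant = match.groups()
    category = MIM_CATEGORIES.get(entry[0])
    if category is None:
        raise ValueError(f"first digit not in the OMIM table: {text!r}")
    return entry, variant, category


# Sickle hemoglobin, the allelic variant mentioned above.
print(parse_mim("141900.0243"))
# The factor IX (hemophilia B) locus.
print(parse_mim("306900"))
```

For the factor IX locus the first digit is 3, so the sketch reports an X-linked entry, which agrees with the table.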

Symbols that precede the MIM number give additional information about the entry.

An asterisk (*) before an entry number indicates a gene of known sequence.

A number symbol (#) before an entry number indicates that it is a descriptive entry, usually of a phenotype, and does not represent a unique locus. The reason for the use of the #-sign is given in the first paragraph of the entry. Discussion of any gene(s) related to the phenotype resides in another entry(ies) as described in the first paragraph.

A plus sign (+) before an entry number indicates that the entry contains the description of a gene of known sequence and a phenotype.

A percent sign (%) before an entry number indicates that the entry describes a confirmed mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known.

No symbol before an entry number generally indicates a description of a phenotype for which the mendelian basis, although suspected, has not been clearly established or that the separateness of this phenotype from that in another entry is unclear.

A caret symbol (^) before an entry number means the entry no longer exists because it was removed from the database or moved to another entry as indicated.
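These prefix symbols amount to a small lookup table. Here is a rough Python sketch of how one might decode a labeled MIM number. The function name split_prefix and the shortened descriptions are mine, and the label in the example is made up for illustration rather than copied from an OMIM page.

```python
# Shortened versions of the OMIM prefix descriptions quoted above.
MIM_PREFIXES = {
    "*": "gene of known sequence",
    "#": "descriptive entry, usually of a phenotype; not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed mendelian phenotype or locus; molecular basis not known",
    "^": "entry removed from the database or moved to another entry",
}

# Meaning of an entry number with no symbol in front of it.
NO_PREFIX = "phenotype whose mendelian basis is suspected but not established"


def split_prefix(label):
    """Return (meaning of the prefix, MIM number) for a label like '%123456'."""
    label = label.strip()
    if label and label[0] in MIM_PREFIXES:
        return MIM_PREFIXES[label[0]], label[1:].strip()
    return NO_PREFIX, label


# '%123456' is a hypothetical label, not a real OMIM entry.
meaning, number = split_prefix("%123456")
print(number, "->", meaning)
```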

The description of cyclin D1 in OMIM is

www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=168461

The second Google link was to Entrez Gene. The main page for Entrez Gene explains its role:

Entrez Gene is a searchable database of genes, from RefSeq genomes, and defined by sequence and/or located in the NCBI Map Viewer. www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

and this page identifies a variety of ways you can search for a gene:

free text searching,
species,
chromosome,
sequence accession number,
gene name (symbol),
Gene Ontology (GO) terms or identifiers,
Enzyme Commission (EC) numbers,

or any combination of the above.

The FAQ page for Entrez Gene mentions the integration of information between it and LocusLink, a database that is being discontinued in favor of Entrez Gene.

The actual page at Entrez Gene for cyclin D1 is

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=595
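The address above follows a simple pattern: the database name, a retrieve command, a display option, and the numeric gene identifier (595 for cyclin D1). A minimal Python sketch that rebuilds it is shown here. The function name entrez_gene_url is my own, the parameter names are simply read off the address as it appears on this page, and I am assuming (without having checked) that the same pattern holds for other gene identifiers and that the address still resolves.

```python
from urllib.parse import urlencode

# Base address copied from the Entrez Gene link above (no scheme, as written there).
ENTREZ_QUERY = "www.ncbi.nlm.nih.gov/entrez/query.fcgi"


def entrez_gene_url(gene_id):
    """Build an Entrez Gene address in the form shown on this page."""
    params = {
        "db": "gene",
        "cmd": "Retrieve",
        "dopt": "Graphics",
        "list_uids": gene_id,
    }
    # urlencode keeps the dictionary order and converts the id to text.
    return ENTREZ_QUERY + "?" + urlencode(params)


print(entrez_gene_url(595))
# www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=595
```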

A link on this page points to another interesting resource, the HUGO Gene Nomenclature Committee (HGNC):

www.gene.ucl.ac.uk/nomenclature/

The HGNC assigns a unique gene symbol to each gene in the human genome. According to their FAQ,

The purpose of approved nomenclature is to enable scientists to access all data pertaining to a specific gene of interest, across species. Ideally, this will be possible by searching databases using a unique gene symbol. At present, some genes are published under more than one name/symbol and also one symbol/name is sometimes used for several unrelated genes.

Rather than use a numerical identifier, HGNC approves a short-form abbreviation known as a gene symbol, and also a longer and more descriptive name. Each symbol is unique and the committee ensures that each gene is only given one approved gene symbol. It is necessary to provide a unique symbol for each gene so that we can talk about them, and to facilitate electronic data retrieval from publications. In preference, each symbol maintains parallel construction in different members of a gene family and can also be used in other species, especially the mouse. www.gene.ucl.ac.uk/nomenclature/information/FAQs.html

The gene symbol is defined by HGNC as

a unique series of Latin (upper case in human) letters and Arabic numbers which should preferably be no longer than six characters in length.
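That definition translates naturally into a pattern check. The Python sketch below tests only the two things the definition states: the characters allowed (upper case Latin letters and Arabic numbers) and the preferred maximum length of six. The name check_symbol is mine, and the sketch does not capture HGNC's full rules, which this page does not spell out.

```python
import re

# Upper case Latin letters and Arabic numbers only, per the HGNC definition above.
SYMBOL_PATTERN = re.compile(r"[A-Z0-9]+")

# "Preferably no longer than six characters" is a preference, not a hard limit.
PREFERRED_MAX_LENGTH = 6


def check_symbol(symbol):
    """Return (uses_allowed_characters, within_preferred_length)."""
    allowed = SYMBOL_PATTERN.fullmatch(symbol) is not None
    short_enough = len(symbol) <= PREFERRED_MAX_LENGTH
    return allowed, short_enough


print(check_symbol("CCND1"))   # (True, True)
print(check_symbol("ccnd1"))   # (False, True): lower case is not a human symbol
```

The two results are returned separately because a symbol longer than six characters is discouraged rather than invalid.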

Some gene names have a common stem, such as

CYP#: cytochrome P450;
HOX#: homeo box;
DUSP#: dual specificity phosphatase;
SCN2A#: sodium channel, voltage-gated, type II, alpha 2 polypeptide and
SH3GL#: SH3-domain GRB2-like.
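One use for these stems is to guess the gene family from a symbol. Here is a naive Python sketch that does so by prefix matching against the five stems listed above. The function name stem_for is my own, and treating the "#" as "one or more trailing characters" is my assumption about what the placeholder means. The symbols in the example are used only for illustration.

```python
# Stems and descriptions copied from the list above, with the "#" placeholder dropped.
GENE_STEMS = {
    "CYP": "cytochrome P450",
    "HOX": "homeo box",
    "DUSP": "dual specificity phosphatase",
    "SCN2A": "sodium channel, voltage-gated, type II, alpha 2 polypeptide",
    "SH3GL": "SH3-domain GRB2-like",
}


def stem_for(symbol):
    """Return (stem, description) for the longest listed stem that begins
    the symbol, or None if no listed stem matches."""
    matches = [
        stem for stem in GENE_STEMS
        if symbol.startswith(stem) and len(symbol) > len(stem)
    ]
    if not matches:
        return None
    stem = max(matches, key=len)
    return stem, GENE_STEMS[stem]


for symbol in ["DUSP1", "CYP2D6", "CCND1"]:
    print(symbol, "->", stem_for(symbol))
```

Prefix matching is crude: an unrelated symbol that happens to start with the same letters would be matched too, so a result here is a hint to check rather than an answer.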

This page lists similar nomenclature homepages for other species:

Mouse: www.informatics.jax.org/
Rat: ratmap.gen.gu.se/ratmap/WWWNomen/Nomen.html
Chicken: www.ri.bbsrc.ac.uk/chickmap/ChickMapHomePage.html
Fly: flybase.bio.indiana.edu/docs/nomenclature/lk/nomenclature.html
Yeast: genome-www.stanford.edu/Saccharomyces/gene_guidelines.html

The HGNC pages also refer to a "CD nomenclature," which I need to learn more about. Possible explanatory pages are at

www.hlda8.org/CD1toCD247.htm
www.hlda8.org/HLDAtoHCDM.htm

Apparently CD stands for cluster of differentiation, and this group is very interested in antibodies to various cell differentiation molecules.

Looking further down the Google list, a description of cyclin D1 also appears in the Human Protein Reference Database. According to the main page of this site,

The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD has been created using an object oriented database in Zope, an open source web application server, that provides versatility in query functions and allows data to be displayed dynamically. www.hprd.org

Eventually, I want to integrate information from this search with my weblog entries:

Naming conventions for genes, proteins, etc. (September 8, 2005)
Finding more information about a gene (September 6, 2005)

and some of the material on data management of a microarray experiment:

Stats: Importing data from a microarray experiment

in particular, the section titled "Selecting a subset of genes from the prenatal liver study."