We are combining the power of network thinking with cross-disease analysis to generate a clearer genotype-to-phenotype map for autism.
Diseases and gene lists
We used the Medical Subject Headings (MeSH) from the National Library of Medicine (NLM) to derive a list of diseases with possible genetic causes. This list comprises the MeSH headings from the following subtrees in the 2011 version of MeSH:
- Neoplasms [C04]
- Musculoskeletal Diseases [C05]
- Digestive System Diseases [C06]
- Stomatognathic Diseases [C07]
- Respiratory Tract Diseases [C08]
- Otorhinolaryngologic Diseases [C09]
- Nervous System Diseases [C10]
- Eye Diseases [C11]
- Male Urogenital Diseases [C12]
- Female Urogenital Diseases and Pregnancy Complications [C13]
- Cardiovascular Diseases [C14]
- Hemic and Lymphatic Diseases [C15]
- Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16]
- Skin and Connective Tissue Diseases [C17]
- Nutritional and Metabolic Diseases [C18]
- Endocrine System Diseases [C19]
- Immune System Diseases [C20]
- Mental Disorders [F03]
For each disease, we generated a gene list by combining the genes returned from 11 external data repositories, including HuGE Navigator, OMIM and GeneCards. The results from these databases were combined across every MeSH entry term associated with the disease.
We developed this approach to gene-disease mapping into a separate tool called Genotator. Using this tool, we keep Autworks up to date with the most recent genes associated with each disease.
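The merging step described above can be sketched as a simple set union. This is only an illustration, not Genotator's actual code: the function name, the nested-dictionary input layout, and the placeholder gene symbols are all assumptions made for the example.

```python
def combine_gene_lists(results_by_source):
    """Merge gene symbols across repositories and MeSH entry terms.

    `results_by_source` is a hypothetical layout:
    {repository name: {MeSH entry term: [gene symbols returned]}}.
    """
    combined = set()
    for genes_by_term in results_by_source.values():
        for genes in genes_by_term.values():
            # Strip stray whitespace and skip empty strings so that the same
            # symbol reported by two repositories is only counted once.
            combined.update(g.strip() for g in genes if g.strip())
    return combined


# Placeholder symbols (GENE1, GENE2, ...) stand in for real query results.
example = {
    "OMIM": {"Autistic Disorder": ["GENE1", "GENE2"], "Autism": ["GENE2 "]},
    "GeneCards": {"Autistic Disorder": ["GENE2", "GENE3"], "Autism": []},
}
merged = combine_gene_lists(example)
print(sorted(merged))
```

A set is used so that a gene reported by several repositories, or under several entry terms, appears only once in the final list.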
Disease relationship network
For each pair of disease-related gene sets, we determined the genes shared by the two sets. We then calculated the hypergeometric cumulative probability distribution using the statistical tools provided in the GNU Scientific Library. A p-value was computed for each pair, with the overlapping genes between the two sets representing the positive set, and the remaining protein-coding genes in the human genome, based on the set provided by the HUGO Gene Nomenclature Committee, representing the negative set. This method enables us to build networks of related diseases.
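The pairwise overlap test described above can be sketched in a few lines. Autworks itself uses the GNU Scientific Library; the version below is a minimal pure-Python stand-in that assumes the p-value is the upper tail of the hypergeometric distribution, i.e. the probability of seeing at least the observed overlap by chance. The function name and the toy gene sets are hypothetical.

```python
from math import comb


def overlap_p_value(genes_a, genes_b, background_size):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Assumes both sets are drawn from a background of `background_size`
    genes (for example, all protein-coding genes).
    """
    genes_a, genes_b = set(genes_a), set(genes_b)
    n_a, n_b = len(genes_a), len(genes_b)
    if max(n_a, n_b) > background_size:
        raise ValueError("a gene set cannot be larger than the background")
    overlap = len(genes_a & genes_b)
    # Number of ways to draw n_b genes from the whole background.
    total = comb(background_size, n_b)
    # Ways to draw n_b genes that hit set A at least `overlap` times.
    tail = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(overlap, min(n_a, n_b) + 1)
    )
    return tail / total


# Toy example: two 3-gene sets sharing 2 genes, in a 10-gene background.
p = overlap_p_value({"A", "B", "C"}, {"B", "C", "D"}, background_size=10)
print(f"{p:.4f}")  # 0.1833
```

Exact integer arithmetic with `math.comb` avoids rounding problems for small sets; for genome-scale backgrounds a library routine such as the GSL one named above would be the more practical choice.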
We have combined the disease-associated gene sets with the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) v8.2 to build an interactome for each disease, consisting of edges from six separate lines of evidence: Neighborhoods, Co-occurrence, Co-expression, Experiments, Databases, and Text-mining. Briefly, these lines of evidence consist of the following:
- Neighborhoods: synteny derived from SwissProt and Ensembl
- Co-occurrence: phylogenetic profiles derived from the COG database
- Co-expression: co-regulation of genes measured using microarrays imported from ArrayProspector
- Experiments: protein-protein interactions inferred or confirmed by experiments
- Databases: validated small-scale interactions, protein complexes, and annotated pathways from BIND, KEGG and MIPS
- Text-mining: co-mention of gene names in PubMed abstracts
The networks were assembled to include links only among the original set of genes, i.e., no additional nodes were added to increase the connectedness of the networks. An edge between two genes was defined when the experimental or database confidence score was above 'medium' (400, as defined by STRING), or when the overall confidence score was higher than 900.
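The edge rule above can be sketched as a small filter. The input layout (tuples of two genes plus a dictionary of per-channel scores) and the key names `experiments`, `databases` and `combined` are assumptions for this example, not STRING's actual file format; the thresholds are the ones stated in the text.

```python
def filter_edges(scored_pairs, gene_set, evidence_cutoff=400, combined_cutoff=900):
    """Keep edges between genes of the disease set that pass the score rule."""
    kept = []
    for gene_a, gene_b, scores in scored_pairs:
        # Only link genes from the original set; never add new nodes.
        if gene_a not in gene_set or gene_b not in gene_set:
            continue
        strong_evidence = (
            scores.get("experiments", 0) > evidence_cutoff
            or scores.get("databases", 0) > evidence_cutoff
        )
        if strong_evidence or scores.get("combined", 0) > combined_cutoff:
            kept.append((gene_a, gene_b))
    return kept


# Placeholder gene names and scores, for illustration only.
disease_genes = {"G1", "G2", "G3"}
pairs = [
    ("G1", "G2", {"experiments": 450, "combined": 500}),  # kept: experiments > 400
    ("G1", "G3", {"textmining": 800, "combined": 950}),   # kept: combined > 900
    ("G2", "G3", {"textmining": 700, "combined": 700}),   # dropped: below both cutoffs
    ("G1", "G9", {"experiments": 999, "combined": 999}),  # dropped: G9 not in the set
]
edges = filter_edges(pairs, disease_genes)
print(edges)
```

Strict inequalities are used because the text says "above" and "higher than"; whether a score of exactly 400 or 900 should pass is not specified in the source.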
STRING's edge confidence score is calculated by using KEGG as a benchmark. Any predicted association for which both proteins are assigned to the same 'KEGG pathway' is counted as a true positive.
User Created Sets and Hypothesis Testing
Using the same procedures described above, users can upload and compare their own gene sets in Autworks. P-values are generated for user-uploaded sets using the hypergeometric test described above. The user can, however, specify a custom background for this calculation rather than use the set of all human genes.
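To illustrate why the choice of background matters, the sketch below computes the same upper-tail hypergeometric p-value against two different backgrounds. It is an assumption of this example, not a documented Autworks behavior, that genes falling outside the chosen background are discarded before the test; the function name and gene labels are hypothetical.

```python
from math import comb


def p_value_with_background(user_genes, disease_genes, background):
    """Upper-tail hypergeometric p-value restricted to a chosen background."""
    background = set(background)
    # Assumption: genes outside the background cannot contribute to the test.
    set_a = set(user_genes) & background
    set_b = set(disease_genes) & background
    n, n_a, n_b = len(background), len(set_a), len(set_b)
    overlap = len(set_a & set_b)
    tail = sum(
        comb(n_a, i) * comb(n - n_a, n_b - i)
        for i in range(overlap, min(n_a, n_b) + 1)
    )
    return tail / comb(n, n_b)


user_set = {"g1", "g2", "g7"}
disease_set = {"g2", "g3"}
wide = {f"g{i}" for i in range(1, 11)}   # 10-gene toy background
narrow = {f"g{i}" for i in range(1, 7)}  # 6-gene custom background (drops g7)

p_wide = p_value_with_background(user_set, disease_set, wide)
p_narrow = p_value_with_background(user_set, disease_set, narrow)
print(f"{p_wide:.4f} {p_narrow:.4f}")
```

The same overlap yields a different p-value under each background, which is why a background matched to the experiment (for example, only the genes that were assayed) can be more appropriate than the full gene set.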
- OMIM - Online Mendelian Inheritance in Man
- GeneCards - The Human Genome Compendium
- Genotator -- a Disease-Gene Meta Database
- von Mering, C., et al., "STRING: known and predicted protein-protein associations, integrated and transferred across organisms", Nucleic Acids Res, 2005, 33(Database issue):D433-7.
- Bader, G.D., et al., "BIND--The Biomolecular Interaction Network Database", Nucleic Acids Res, 2001, 29(1):242-5.
- Kanehisa, M., et al., "The KEGG resource for deciphering the genome", Nucleic Acids Res, 2004, 32(Database issue):D277-80.
- MIPS - Munich Information Center for Protein Sequences