ROUGE: Description of the Gene/Protein Characteristic Table
Features of the cloned DNA sequence
This section describes features of the nucleotide sequences of
cDNA clones actually characterized. Although the actual clones
contained an attB2-dT adapter primer sequence and an attB1 adapter
sequence at their 3'- and 5'-extremities, respectively
(Ohara O, Nagase T, Mitsui G, Kohga H, Kikuno R, Hiraoka S, Takahashi Y, Kitajima S, Saga Y, Koseki H.
"Characterization of size-fractionated cDNA libraries generated by the in vitro recombination-assisted method." DNA Res. 2002; 9: 47-57),
the nucleotide sequences of these adapters are not shown here.
This section is
intended to provide clone users with detailed information about the clones
that is not available from the public databases.
(1) Physical map
- The physical maps were constructed on the basis of the
sequence data of the cDNA clones. The horizontal scale
represents the cDNA length in kb. The longest ORF predicted by
GeneMark and untranslated regions
are shown by solid and open boxes, respectively. The positions of
the first ATG codons are indicated by triangles: solid triangles mark
those that conform to Kozak's rule, and open triangles mark those that
do not. RepeatMasker, which is a program
that screens DNA sequences for interspersed repeats known to
exist in mammalian genomes, was applied to detect repeat sequences
in cDNA sequences (Smit, A. F. A. and Green, P., RepeatMasker at
http://ftp.genome.washington.edu/RM/RepeatMasker.html). SINE sequences and other
repetitive sequences detected in this way are displayed
by dotted and hatched boxes, respectively.
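The source does not state which formulation of Kozak's rule was applied. As a minimal sketch, the check below assumes one common formulation (a purine at position -3 and a G at position +4, counting the A of ATG as +1). The function names are hypothetical and are not part of the ROUGE pipeline.

```python
def first_atg(seq):
    """Return the 0-based index of the first ATG in seq, or -1 if absent."""
    return seq.upper().find("ATG")


def satisfies_kozak(seq, atg_pos):
    """Assumed rule: purine (A/G) at -3 and G at +4, with the A of ATG as +1."""
    seq = seq.upper()
    # Both flanking positions must exist in the sequence.
    if atg_pos < 3 or atg_pos + 3 >= len(seq):
        return False
    return seq[atg_pos - 3] in "AG" and seq[atg_pos + 3] == "G"
```

Under this assumption, an ATG that passes the check would be drawn as a solid triangle and one that fails as an open triangle.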
(2) Restriction map
- Commercially available restriction enzymes
(REBASE; Roberts, R. J., Macelis, D.
"REBASE - restriction enzymes and methylases"
Nucleic Acids Res. 1998; 26: 338-350)
are sorted according
to the number of the restriction sites present in the cDNA insert.
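The sorting described above can be illustrated with a small sketch. It uses only three well-known enzymes rather than the full REBASE list, and it assumes ascending order by site count, which the source does not specify. The function names are hypothetical.

```python
import re

# Illustrative subset only; all three recognition sites are palindromic,
# so scanning one strand is sufficient.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def count_sites(seq, site):
    """Count occurrences of a recognition site, including overlapping ones."""
    return len(re.findall(f"(?={site})", seq.upper()))


def sort_by_site_count(seq, enzymes=ENZYMES):
    """Return (enzyme, count) pairs sorted by count, then by enzyme name."""
    counts = {name: count_sites(seq, site) for name, site in enzymes.items()}
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
```

Enzymes with zero sites sort first in this sketch, which makes non-cutters of the insert easy to spot.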
(3) Prediction of the protein coding region (GeneMark analysis)
- The graphic outputs of the GeneMark-RC analysis are displayed. Vertical
lines given in the graphs indicate the positions of termination codons.
If you would like to know more about the GeneMark-RC analysis,
please read the paper by Hirosawa et al.
(Hirosawa, M., Isono, K., Hayes, W., Borodovsky, M.
"Gene identification and classification in the Synechocystis genomic
sequence by recursive gene mark analysis" DNA Seq. 1997;
8(1-2): 17-29).
The GeneMark analysis gives the following warnings:
(a) Warning for N-terminal truncation of the coding region;
(b) Warning for spurious interruption of the coding region.
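The vertical lines in the GeneMark graphs mark termination codons. As a rough illustration only (this is not GeneMark itself, and it covers just the three forward reading frames), their positions could be located as follows. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(seq):
    """Map each forward frame (0, 1, 2) to the 0-based starts of its stop codons."""
    seq = seq.upper()
    result = {0: [], 1: [], 2: []}
    for frame in range(3):
        # Step through complete codons in this frame.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                result[frame].append(i)
    return result
```

A long stretch without stop codons in one frame corresponds to a candidate open reading frame in that frame.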
(4) Prediction of the genomic structure of the cDNA
- The cDNA sequence was subjected to a BLAST search
(Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z.,
Miller, W., and Lipman, D.J.
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs." 1997; Nucleic Acids Res 25: 3389-3402)
against the mouse genome draft sequences at EBI.
When a genomic fragment was found to be highly similar to the cDNA
sequence (E-value = 0.0 and sequence identity of 90% or greater), the genomic
structure of the cDNA was assigned by
SIM4
(Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W.
"A computer program for aligning a cDNA sequence with a genomic DNA sequence."
1998; Genome Res. 8: 967-974)
on the genomic fragment.
GENSCAN
(Burge, C. and Karlin, S. 1997;
"Prediction of complete gene structures
in human genomic DNA." J. Mol. Biol. 268: 78-94)
was also applied to detect a plausible gene structure on the genomic
fragment. A comparison of the gene structure deduced from
the cDNA with that predicted by GENSCAN is displayed graphically.
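The similarity criterion stated above (E-value = 0.0 and sequence identity of 90% or greater) can be expressed as a simple filter. The `Hit` record and its field names are assumptions made for illustration and do not reflect the actual BLAST output format used by the authors.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str      # hypothetical identifier of the genomic fragment
    evalue: float        # BLAST expectation value
    identity_pct: float  # percent sequence identity


def select_genomic_hits(hits, min_identity=90.0):
    """Keep hits with E-value exactly 0.0 and identity >= min_identity."""
    return [h for h in hits if h.evalue == 0.0 and h.identity_pct >= min_identity]
```

Only fragments that pass this filter would then be handed to a cDNA-to-genome aligner such as SIM4.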
(5) Comparison of structure with the corresponding human KIAA cDNA
- Each mKIAA cDNA in the ROUGE database is a mouse homologue of a human
KIAA cDNA. Three types of alignment were prepared:
i) DNA sequence based, ii) AA sequence based and iii) Physical map.
i) DNA sequence-based alignment was made by comparing the DNA sequences
of the mKIAA and KIAA cDNAs using the GAP program in the GCG package. The longest
coding region predicted by GeneMark was taken as the CDS for each of the
mouse and human KIAA cDNAs in this alignment. However, when the CDS positions
were not identical between the mouse and human KIAA cDNAs on the
aligned DNA sequences, the mouse cDNA sequence was translated
based on the human CDS information.
Thus, the mouse amino acid sequence shown here may not be
identical to the protein sequence used for the protein sequence
analysis below (Features of the predicted protein sequence).
ii) To construct the AA sequence-based alignment, we first translated the CDSs
of the mKIAA and KIAA cDNA sequences predicted by the GeneMark analysis.
Every predicted CDS longer than 150 bp was translated.
Then the amino acid sequences were compared and aligned between mouse
and human. When the sequence identity exceeded 50%, we aligned the cDNA
sequences of the corresponding regions between mouse and human based on
the amino acid sequence alignment.
iii) In the comparative physical map, the corresponding CDSs between the mouse
and human KIAA cDNAs obtained in ii) were connected by thin lines.
The longest and other CDSs were colored in dark
and light blue, respectively.
The conservation of the polyA signal obtained from i) was also indicated.
When the positions of the polyA signals were conserved between the mouse and human
KIAA cDNAs, they were colored in red; otherwise,
they were colored in green.
If another polyA signal was found upstream of the -35-bp region in either
the human or the mouse KIAA cDNA, and its position was conserved between the two
species, it was colored in orange as an
indication of a possible alternative polyA signal.
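The two thresholds used in step ii) above (CDS longer than 150 bp, amino acid identity above 50%) can be sketched as follows. The source does not say how identity was computed; this sketch assumes it is the fraction of matching residues over aligned columns that contain no gap. Function names are hypothetical.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over ungapped columns of two aligned sequences.

    Assumption: gaps are written as '-' and gapped columns are ignored.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def passes_thresholds(cds_len_bp, identity_pct):
    """CDS must be longer than 150 bp and identity must exceed 50%."""
    return cds_len_bp > 150 and identity_pct > 50.0
```

Other identity definitions (for example, dividing by the full alignment length) would give lower values for gapped alignments, so the choice of denominator matters near the 50% cutoff.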
Features of the predicted protein sequence
This section describes the features of the predicted protein sequence.
(1) FASTA homology searches against the nr database and ROUGE database
The top 5 entries with an expectation value smaller than 0.001 in the nr
database and the ROUGE database are shown.
"nr" stands for the non-redundant amino acid sequence database
that has been constructed by NCBI.
The numbers on the left and right
sides of a black line in the graphical overview indicate the lengths
(in amino acid residues) of the non-homologous N-terminal and
C-terminal portions flanking the homologous region (indicated by the
black line), respectively. The FASTA output and the multiple alignment
of these entries can be obtained by clicking.
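The selection rule (top 5 entries with an expectation value below 0.001) and the flanking lengths shown in the graphical overview can be sketched as below. The dictionary keys and the 1-based inclusive coordinate convention are assumptions made for illustration.

```python
def top_hits(hits, max_evalue=1e-3, limit=5):
    """Return up to `limit` hits with E-value < max_evalue, best first."""
    kept = [h for h in hits if h["evalue"] < max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])[:limit]


def flanking_lengths(query_len, hom_start, hom_end):
    """Lengths of the non-homologous N- and C-terminal portions.

    Assumes 1-based, inclusive coordinates for the homologous region.
    """
    return hom_start - 1, query_len - hom_end
```

For example, a 500-residue protein with a homologous region spanning residues 101 to 450 would show 100 on the left and 50 on the right of the black line.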
(2) Analysis of Motifs, Domains, and Membrane-spanning regions
The predicted protein sequences were examined for motifs present
in the InterPro database.
Because weakly defined sequence motifs appear too many times in
the ROUGE database and are, thus, unlikely to be informative,
the following motifs were excluded from the
analysis: amidation site; N-glycosylation site; cAMP- and
cGMP-dependent protein kinase phosphorylation site; casein kinase II
phosphorylation site; N-myristoylation site; protein kinase C
phosphorylation site; and tyrosine kinase phosphorylation site.
Motifs/Domains in the InterPro database were searched for by InterProScan (Zdobnov, EM, and Apweiler, R. "InterProScan--an integration platform for the signature-recognition methods in InterPro" Bioinformatics 2001; 17:847-848).
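The exclusion of weakly defined motifs described above amounts to filtering the search results against a fixed list. In the sketch below, the motif names are taken from the text; how they are actually labeled in the InterProScan output is not stated in the source, so the matching by lowercase name is an assumption.

```python
# Names as listed in the text; real output would likely use database accessions.
EXCLUDED_MOTIFS = {
    "amidation site",
    "n-glycosylation site",
    "camp- and cgmp-dependent protein kinase phosphorylation site",
    "casein kinase ii phosphorylation site",
    "n-myristoylation site",
    "protein kinase c phosphorylation site",
    "tyrosine kinase phosphorylation site",
}


def informative_motifs(motif_names):
    """Drop motifs on the exclusion list, comparing names case-insensitively."""
    return [m for m in motif_names if m.strip().lower() not in EXCLUDED_MOTIFS]
```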
Membrane-spanning regions were predicted by
SOSUI
(Hirokawa, T., Boon-Chieng, S., Mitaku, S.
"SOSUI: classification and
secondary structure prediction system for membrane proteins"
Bioinformatics 1998; 14:378-379).