
Motivation

Column format files are text-based files for biological sequences. They are designed to be easy to work with rather than compact. Another important feature is that many different kinds of information can be kept in the same file, and converting the format into anything else should be simple.


General description

The data is organized in columns. Each entry (a sequence), together with its various assignments, is arranged in columns. An entry consists of a header that contains information about the column organization as well as miscellaneous information about the sequence. The entire file (database) contains a main header that can be used to describe overall features. In column format files, everything except the sequence positions is on lines that begin with a semicolon.

The entry information has some compulsory fields. The first line of the entry header must state what type of molecule is in the entry (the "TYPE" field). The next lines state what information is in the different columns of the entry. After this comes the entry name, which is followed by additional information.

A column format file starts with a header containing information about the file. This header is followed by a line with a semicolon and at least 10 equals signs (=). A header could look like this:

; This file contains a database of globin genes
;
; It is located at http://www.xyz.xyz
;
; ========================================================================
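As a minimal sketch of how the file header could be separated from the entries, the Python snippet below reads lines until it meets a semicolon followed by at least 10 equals signs. The function name split_file_header is hypothetical and not part of any of the programs mentioned on this page.

```python
import re

# A semicolon followed by at least 10 '=' signs ends the file header.
HEADER_END = re.compile(r"^;\s*={10,}\s*$")

def split_file_header(lines):
    """Return (header_comments, remaining_lines) for a column format file.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    header = []
    for i, line in enumerate(lines):
        if HEADER_END.match(line):
            return header, lines[i + 1:]
        # Drop the leading semicolon and the surrounding whitespace.
        header.append(line.lstrip(";").strip())
    raise ValueError("no '; ==========' separator found")

example = [
    "; This file contains a database of globin genes",
    ";",
    "; It is located at http://www.xyz.xyz",
    ";",
    "; " + "=" * 72,
    "; TYPE                 RNA",
]
header, rest = split_file_header(example)
```

Here header holds the four comment lines without their semicolons, and rest holds the lines that belong to the entries.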

The sequence entry headers begin with a "TYPE" field indicating whether the entry is an RNA sequence or a special entry describing the base pairings of the alignment. That is, the first line in an entry should be of the form

; TYPE                 <type>
where <type> is one of "RNA", "DNA", "DNA_blast", or "PROTEIN". There is no distinction between upper and lower case letters. Lines of the form
; COL <number>         <word>
indicate that column <number> holds the information named <word>. Each entry has an "ENTRY" field of the form
; ENTRY                <one_word>
Other lines in the header describe miscellaneous features and have the form
; <ONE_TAG>            <string>
Header and columns are separated by a line of the type
; ---------
with at least 10 dashes. The column lines have the form
<word(COL 1)>  <word(COL 2)>  . . .  <word(COL N)>
for N columns. Entries are ended by a line of the type
; **********
with at least 10 *'s.
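The rules above can be turned into a small reader. The following Python sketch is one possible way to do it, not the parser used by any of the programs on this page. The function name parse_entries and the dictionary layout are assumptions made for illustration. It expects the lines that follow the file header, treats tag names case-insensitively, splits tags at the first whitespace (so a tag containing a space is cut short), and ignores header lines that do not begin with a semicolon.

```python
import re

ENTRY_SEP = re.compile(r"^;\s*-{10,}\s*$")   # separates header and columns
ENTRY_END = re.compile(r"^;\s*\*{10,}\s*$")  # ends an entry

def parse_entries(lines):
    """Yield one dict per entry (hypothetical layout, for illustration)."""
    entry, in_rows = None, False
    for line in lines:
        if ENTRY_END.match(line):
            if entry is not None:
                yield entry
            entry, in_rows = None, False
        elif ENTRY_SEP.match(line):
            in_rows = True
        elif line.startswith(";"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue  # a bare ';' line
            if entry is None:
                entry = {"type": None, "entry": None,
                         "columns": {}, "tags": {}, "rows": []}
            tag = parts[0].upper()
            value = parts[1].strip() if len(parts) > 1 else ""
            if tag == "COL":
                number, name = value.split(None, 1)
                entry["columns"][int(number)] = name.strip()
            elif tag == "TYPE":
                entry["type"] = value
            elif tag == "ENTRY":
                entry["entry"] = value
            else:
                entry["tags"][tag] = value
        elif in_rows and entry is not None and line.strip():
            # One word per column, separated by whitespace.
            entry["rows"].append(line.split())
    if entry is not None:
        yield entry  # tolerate a missing final '; **********' line

sample = """\
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; ENTRY             mtu
; LENGTH            168
; ----------
N     C     1
N     U     2
; **********
""".splitlines()
entries = list(parse_entries(sample))
```

Each row is kept as a list of words, so row[n - 1] is the word belonging to "COL n".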

A description of the column types can be found on the colusage page, along with a listing of which programs from the rnadbtool page use them. Examples of how to use those programs can be found here.

Database example

An example from the tmRNA database alignment of RNA sequences.
;  The tmRNA Database version 043 (January 2001):
;  ----------------------------------------------
;
;
;  Availability:
;  -------------
;
; . . . .
; . . . . 
;
; ========================================================================
; TYPE              pairingmask
; COL 1             label
; COL 2             residue
; COL 3             alignpos
; ENTRY             pairingmask
; ----------
M     a     1
M     a     2
M     a     3
M     a     4
M     a     5
.     .     .
.     .     .

M     -   675
M     -   676
; **********
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             alignpos
; COL 5             align_bp
; ENTRY             AQU.AEO.
; ORGANISM          Aquifex aeolicus
; ACCESSION         AE000657 + AE000749
; WWW-ACCESS        http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&
                    db=Nucleotide&list_uids=6626248&dopt=GenBank + http://www.ncbi.nlm.nih.gov:80/
                    entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=02983975&dopt=GenBank
; LINEAGE (NCBI)    CELLULAR ORGANISMS; BACTERIA; AQUIFICALES; AQUIFICACEAE; AQUIFEX
; ----------
N     G     1     1   672
N     G     2     2   671
N     G     3     3   670
N     G     4     4   669
N     G     5     5   668
N     C     6     6   667
N     G     7     7   666
N     g     8     8     .
N     a     9     9     .
G     -     .    10     .
N     a    10    11     .
N     a    11    12     .
N     g    12    13     .
N     g    13    14     .
.     .     .     .     .
.     .     .     .     .
The first entry describes the pairing mask (stem helix mapping) of the structural alignment (and is obtained using txt2col with the "-m" option). In the second entry, the first column indicates whether the sequence contains a gap or a nucleotide at a particular alignment position. When TYPE is RNA or DNA, label takes the values "N" (nucleotide) and "G" (gap). "residue" refers to the individual nucleotides.
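To illustrate how the label and residue columns work together, the Python sketch below rebuilds both the aligned and the ungapped sequence from a few rows of the example above. It assumes the rows have already been split into words and that the "COL" lines are available as a mapping from column number to name. The function names are hypothetical.

```python
def column_index(columns):
    """Map each column name to its 0-based position in a row."""
    return {name: number - 1 for number, name in columns.items()}

def aligned_sequence(columns, rows):
    """Join the residue column of every row, gaps included."""
    residue = column_index(columns)["residue"]
    return "".join(row[residue] for row in rows)

def ungapped_sequence(columns, rows):
    """Join the residue column of the rows labelled 'N' only."""
    index = column_index(columns)
    label, residue = index["label"], index["residue"]
    return "".join(row[residue] for row in rows if row[label].upper() == "N")

# Column layout and rows taken from the second entry shown above.
columns = {1: "label", 2: "residue", 3: "seqpos", 4: "alignpos", 5: "align_bp"}
rows = [
    "N     g     8     8     .".split(),
    "N     a     9     9     .".split(),
    "G     -     .    10     .".split(),
    "N     a    10    11     .".split(),
]
aligned = aligned_sequence(columns, rows)    # "ga-a"
ungapped = ungapped_sequence(columns, rows)  # "gaa"
```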

Examples of program output in col format

Shown here are examples of column file output from different programs. Only the beginning of each file is shown.

Output from txt2col:

; Generated by txt2col
; ========================================================================
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             alignpos
; COL 5             align_bp
; ENTRY             Aqu.aeo.
; ----------
N     G     1     1   665
N     G     2     2   664
N     G     3     3   663
N     G     4     4   662
N     G     5     5   661
N     C     6     6   660
N     G     7     7   659
N     g     8     8     .
N     a     9     9     .
G     -     .    10     .
N     a    10    11     .

Output from ct2col:

; Generated by ct2col
; ========================================================================
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             align_bp
; ENTRY             mtu
; LENGTH            168
; ----------
N     C     1     .
N     U     2     .
N     U     3    17
N     C     4    16
N     G     5    15
N     C     6    14
N     A     7     .
N     U     8     .
N     C     9     .
N     A    10     .
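In the ct2col listing above, the values in the align_bp column read like the position of the pairing partner (3 with 17, 4 with 16, and so on). Under that assumption, a structure in dot-bracket notation can be derived as in the Python sketch below. The rows used here are a hypothetical toy entry, not data from any database, the function names are invented for illustration, and pseudoknots are not handled.

```python
def base_pairs(rows, pos_col, bp_col):
    """Collect (i, j) pairs with i < j from 0-based column indices.

    Assumes bp_col holds the partner position, or '.' when unpaired.
    """
    pairs = set()
    for row in rows:
        if row[pos_col] == "." or row[bp_col] == ".":
            continue
        i, j = int(row[pos_col]), int(row[bp_col])
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def dot_bracket(length, pairs):
    """Render pairs as a dot-bracket string (no pseudoknot support)."""
    structure = ["."] * length
    for i, j in pairs:
        structure[i - 1] = "("
        structure[j - 1] = ")"
    return "".join(structure)

# Hypothetical toy hairpin: label, residue, seqpos, align_bp.
toy_rows = [line.split() for line in """\
N     G     1     8
N     C     2     7
N     A     3     .
N     A     4     .
N     A     5     .
N     A     6     .
N     G     7     2
N     C     8     1
""".splitlines()]

pairs = base_pairs(toy_rows, pos_col=2, bp_col=3)
structure = dot_bracket(len(toy_rows), pairs)  # "((....))"
```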

Output from gb2col:

; File generated by gb2col
; ========================================================================
; TYPE              DNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; ENTRY             MTU88049
; LENGTH            2805
; ACCESSION         U88049
; ----------
N     t     1
N     t     2
N     g     3
N     g     4
N     g     5
N     c     6
N     c     7
N     g     8
N     c     9
N     c    10

Output from blast2col:

; Generated by blast2col
; ========================================================================
; TYPE              DNA_blast
; COL 1             label
; COL 2             query_residue
; COL 3             match
; COL 4             subject_residue
; COL 5             query_seqpos
; COL 6             subject_seqpos
; ENTRY             MTU88049_vs_U88049
; BLAST_VERSION     BLASTN 2.0.11 [Jan-20-2000]
; QUERY             MTU88049
; QUERY_LENGTH      200
; SUBJECT           U88049
; SUBJECT_COMMENT   MTU88049 2805 bp DNA BCT U88049 .
; SUBJECT_STRAND    Plus
; SUBJECT_LENGTH    2805
; ALIGNMENT_LENGTH  200
; SCORE             396
; EXPECT            1e-109
; IDENTITIES        200
; ----------
N T  -  T       1     1
N T  -  T       2     2
N G  -  G       3     3
N G  -  G       4     4
N G  -  G       5     5
N C  -  C       6     6
N C  -  C       7     7
N G  -  G       8     8
N C  -  C       9     9
N C  -  C      10    10
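Because every row carries both the query and the subject residue, simple statistics can be computed directly from the columns. The Python sketch below counts the rows in which the two residues agree, ignoring case. The function name is hypothetical, and whether this count always equals the IDENTITIES value written by blast2col is an assumption that has not been checked here.

```python
def count_identities(columns, rows):
    """Count rows whose query and subject residues agree (case-insensitive)."""
    index = {name: number - 1 for number, name in columns.items()}
    query, subject = index["query_residue"], index["subject_residue"]
    return sum(
        1
        for row in rows
        if row[query] != "-" and row[query].upper() == row[subject].upper()
    )

# Column layout and the first three rows of the blast2col listing above.
blast_columns = {
    1: "label", 2: "query_residue", 3: "match",
    4: "subject_residue", 5: "query_seqpos", 6: "subject_seqpos",
}
blast_rows = [
    "N T  -  T       1     1".split(),
    "N T  -  T       2     2".split(),
    "N G  -  G       3     3".split(),
]
identical = count_identities(blast_columns, blast_rows)  # 3
```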


Comments, questions, etc., email gorodkin@genome.ku.dk.


Last updated March 26th, 2007 by Jan Gorodkin