RNA db Tools

About these examples

These examples introduce the programs, their output, and the column (col) format. More detailed descriptions can be found in the man pages.

The basics

Generating a col file

To generate a col file, we can start from a number of different sequence file formats. One possibility is to use fasta format and the fasta2col program. Another possibility is to use the following txt format:

   pairing_mask  --3---311111111----222222-----111111111--2222222-----
   orga          -gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--
   orgb          agAccgUcg-UCGAUGauu-AAcGuaca-CAUCGA-cgug-aCaUUuagua--
   orgc          cgAcacUcuguCCAUgauu-AAcGuac--aAUCGaacggu-aCaUUuaguacc
   orgd          -gAcchUgcguCGAUgauu-AAcGuac-uaAUCGaacgug-aCaUUuagua--

In this format, the structure is given by a pairing mask as the first sequence. The bases that form pairs are written as capital letters in the RNA sequences. Pairs are formed between bases that have the same symbol in the pairing mask. From this format, a col file can be made by writing:

txt2col -m example.txt > example.col

Now example.col contains the col file. The start of the resulting col file looks like this:

   ; Generated by txt2col
   ; ========================================================================
   ; TYPE              pairingmask
   ; COL 1             label
   ; COL 2             residue
   ; COL 3             seqpos
   ; ENTRY             pairing_mask
   ; ----------
   M     -     1
   M     -     2
   M     3     3
   M     -     4
   M     -     5
   M     -     6
   M     3     7
   M     1     8
   M     1     9
   M     1    10
   M     1    11
   M     1    12

The file starts with a header that describes where the file came from; this one was generated by txt2col. The header can also contain information written by the person who made the file. In a database, this could be a reference to the article describing the data, the version of the file, and so on. The header ends with the line of equals signs. Notice that all lines starting with a semicolon are comments.

The first entry of the file has the type pairingmask, which describes RNA structure. It was the -m option to txt2col that made the program include the pairing mask as an entry. This entry does not specify the structure of the RNA sequences that follow; that information is kept in each sequence entry, as shown below. The pairing mask entry is therefore not necessary and is kept only as a reference. The first column in this entry is a label that describes what is in each position; here it is M for pairingmask in every row. The second column, called residue, contains the symbols of the sequence. The third column contains the position numbers in the sequence.
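
As an illustration of the layout just described, here is a minimal Python sketch that reads one entry of a col file. It is not part of RNA db Tools; the function name parse_col_entry is hypothetical, and the sketch assumes that data columns are separated by whitespace and that only the TYPE, ENTRY and COL comment lines carry structured information, as in the excerpt above.

```python
def parse_col_entry(lines):
    """Parse one col entry into (metadata, rows).

    Assumption: columns are whitespace separated and named by '; COL n name' lines.
    """
    meta = {}
    columns = {}
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            # Comment line: pick out TYPE, ENTRY and COL declarations, ignore the rest.
            fields = line[1:].split()
            if len(fields) >= 3 and fields[0] == "COL":
                columns[int(fields[1])] = fields[2]
            elif len(fields) >= 2 and fields[0] in ("TYPE", "ENTRY"):
                meta[fields[0]] = fields[1]
            continue
        # Data line: map each value to its declared column name.
        names = [columns[i] for i in sorted(columns)]
        rows.append(dict(zip(names, line.split())))
    return meta, rows


example = """\
; Generated by txt2col
; ========================================================================
; TYPE              pairingmask
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; ENTRY             pairing_mask
; ----------
M     -     1
M     -     2
M     3     3
"""

meta, rows = parse_col_entry(example.splitlines())
print(meta, rows[2])
```

All values are kept as strings, since a column such as seqpos can hold a dot instead of a number in other entry types.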

The next entry in the col file is the first real sequence, that starts like this:

   ; TYPE              RNA
   ; COL 1             label
   ; COL 2             residue
   ; COL 3             seqpos
   ; COL 4             alignpos
   ; COL 5             align_bp
   ; ENTRY             orga
   ; ----------
   G     -     .     1     .
   N     g     1     2     .
   N     A     2     3     7
   N     c     3     4     .
   N     u     4     5     .
   N     g     5     6     .
   N     U     6     7     3
   N     c     7     8     .
   N     u     8     9     .
   N     g     9    10     .
   N     U    10    11    35
   N     C    11    12    34

This is of type RNA. The first column is again a label, here all the nucleotides have N in this column, while gaps have G's. Column two contains the sequence symbols and column three contains the sequence positions. The fourth column has positions relative to the alignment. The fifth column is called align_bp, for align basepair. This has the secondary structure of the RNA, specified as pairs relative to the alignpos column. A dot in the column means that the nucleotide is unpaired.
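
The align_bp values shown above can be reproduced from the txt format with a short Python sketch. This is only a guess at the rule, not the txt2col source: it assumes that, among the capital letters sharing one mask symbol, the first pairs with the last, the second with the second to last, and so on, and that a capital letter under a '-' in the mask pairs with itself. The function name pairs_from_mask is hypothetical.

```python
def pairs_from_mask(mask, seq):
    """Return {alignpos: partner alignpos} (1-based) for one aligned sequence."""
    groups = {}
    for pos, (symbol, base) in enumerate(zip(mask, seq), start=1):
        if base.isupper():  # only capital letters take part in pairs
            groups.setdefault(symbol, []).append(pos)

    pairs = {}
    for symbol, positions in groups.items():
        if symbol == "-":
            # Capital letter without a mask symbol: pairs with itself.
            for pos in positions:
                pairs[pos] = pos
            continue
        left, right = 0, len(positions) - 1
        while left < right:  # pair the outermost positions first
            a, b = positions[left], positions[right]
            pairs[a], pairs[b] = b, a
            left += 1
            right -= 1
        # Assumption: with an odd count, the middle position stays unpaired.
    return pairs


mask = "--3---311111111----222222-----111111111--2222222-----"
orga = "-gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--"
orgb = "agAccgUcg-UCGAUGauu-AAcGuaca-CAUCGA-cgug-aCaUUuagua--"

pairs_a = pairs_from_mask(mask, orga)
pairs_b = pairs_from_mask(mask, orgb)
print(pairs_a[3], pairs_a[11], pairs_a[12])
```

For orga this gives the partners 7, 35 and 34 for alignment positions 3, 11 and 12, matching the align_bp column above.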

The entire col file can be found here.

Making postscript alignments

To make a nice looking figure of an alignment, the program col2psalign is useful:

col2psalign --figure example.col > example_1_1.ps

This makes a postscript file that looks like this:

example_1_1.ps

The figure can also be made to look like this:

example_1_2.ps

This is done by the command:

col2psalign --figure --space --textwidth=32 example.col > example_1_2.ps

The --space option inserts a space after every ten positions in the alignment. The --textwidth option specifies how wide the alignment should be. The value of textwidth is the number of characters after the sequence name, including the spaces inserted by the --space option.

Another figure:

example_1_3.ps

This was made with the command:

col2psalign --figure --range=10-20,30-40 example.col > example_1_3.ps

The --range option selects the parts of the alignment to show, which is useful for illustrating interesting parts of the alignment.
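
To make the range syntax concrete, here is a small Python sketch that cuts the same two regions out of an aligned sequence. It assumes that ranges are 1-based and inclusive, which is what the greppos output later on this page suggests; the helper names parse_ranges and select_columns are hypothetical and not part of the tools.

```python
def parse_ranges(spec):
    """Turn a string like '10-20,30-40' into [(10, 20), (30, 40)]."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def select_columns(seq, ranges):
    """Return the slice of seq for each 1-based, inclusive range."""
    return [seq[start - 1:end] for start, end in ranges]


orga = "-gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--"
ranges = parse_ranges("10-20,30-40")
print(select_columns(orga, ranges))
```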

Checking RNA structures

Strange nucleotides

A col file can be checked for non-standard nucleotides. The program unknown can do this:

unknown example.col | col2psalign --figure > example_2_1.ps

Here, the col file is altered by unknown and then sent to col2psalign to make a figure that looks like this:

example_2_1.ps

This shows that organism orgd contains an h in its sequence. This could be an error. Let us change the h to a gap and call the file example2.txt. From this, a col file is generated:

txt2col -m example2.txt > example2.col
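
A check of this kind can be sketched in a few lines of Python. The sketch assumes that the standard nucleotides are A, C, G and U in either case and that '-' is the gap symbol; which symbols the unknown program actually accepts is described in its man page. The function name find_unknown is hypothetical.

```python
STANDARD = set("ACGU")  # assumption: the standard RNA alphabet


def find_unknown(seq, standard=STANDARD, gap="-"):
    """Return (alignpos, symbol) for every symbol that is neither standard nor a gap."""
    return [
        (pos, symbol)
        for pos, symbol in enumerate(seq, start=1)
        if symbol != gap and symbol.upper() not in standard
    ]


orga = "-gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--"
orgd = "-gAcchUgcguCGAUgauu-AAcGuac-uaAUCGaacgug-aCaUUuagua--"
print(find_unknown(orgd))
```

For orgd from example.txt, this reports the h at alignment position 6.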

Strange base pairs

A col file with RNA sequences can be checked for non-standard base pairs. This is done with the program stdpair:

stdpair --color example2.col | col2psalign --figure > example_2_2.ps

It looks like this:

example_2_2.ps

This has highlighted a C-C pair in the orgc sequence. It was the --color option that made the program color the nucleotides. Without this option, the strange pair would have been removed (notice that the change happens only in the align_bp column; the nucleotides are not changed to lowercase letters). The sequence orgb has highlighted nucleotides as well. This base pair looks fine, but no pairing mask symbol tells txt2col which columns should pair in these positions. txt2col sets such bases to pair with themselves, to show that the pairing mask is missing in these positions.

To see this in the col file, the programs grepcol and greppos are useful:

grepcol --range=orgb example2.col | greppos --range=14-32 > example_2_1.col

This gives the following result:

   ; Generated by txt2col
   ;
   ; 'grepcol --range=orgb' was run on this file
   ;
   ; 'greppos --range=14-32' was run on this file
   ; ========================================================================
   ; TYPE              RNA
   ; COL 1             label
   ; COL 2             residue
   ; COL 3             seqpos
   ; COL 4             alignpos
   ; COL 5             align_bp
   ; ENTRY             orgb
   ; ----------
   N     A    13    14    32
   N     U    14    15    31
   N     G    15    16    16
   N     a    16    17     .
   N     u    17    18     .
   N     u    18    19     .
   G     -     .    20     .
   N     A    19    21    46
   N     A    20    22    45
   N     c    21    23     .
   N     G    22    24    43
   N     u    23    25     .
   N     a    24    26     .
   N     c    25    27     .
   N     a    26    28     .
   G     -     .    29     .
   N     C    27    30    30
   N     A    28    31    15
   N     U    29    32    14
   ; **********

Notice that alignpos 16 and 30 have themselves as pairs.

These could be errors in the txt file and could be corrected to give a new file, example3.txt:

   pairing_mask  --3---3111111111---222222----1111111111--2222222-----
   orga          -gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--
   orgb          agAccgUcg-UCGAUGauu-AAcGuaca-CAUCGA-cgug-aCaUUuagua--
   orgc          cgAcacUcuguCcAUgauu-AAcGuac--aAUcGaacggu-aCaUUuaguacc
   orgd          -gAcc-UgcguCGAUgauu-AAcGuac-uaAUCGaacgug-aCaUUuagua--

From this, a col file is generated:

txt2col -m example3.txt > example3.col
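
The check that flagged orgc and orgb in example2.col can be sketched in Python as follows. This is an illustration under assumptions, not the stdpair implementation: A-U and G-C are treated as standard, G-U is accepted only when asked for, and a base that pairs with itself is always reported. The name nonstandard_pairs and the allow_gu parameter are hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def nonstandard_pairs(seq, pairs, allow_gu=False):
    """Return (i, j, 'X-Y') for pairs that are self-pairs or not in the allowed set.

    pairs maps 1-based alignment positions to their partner positions.
    """
    allowed = (WATSON_CRICK | WOBBLE) if allow_gu else WATSON_CRICK
    flagged = []
    for i, j in sorted(pairs.items()):
        if i > j:
            continue  # each pair is reported once, from its left position
        a, b = seq[i - 1].upper(), seq[j - 1].upper()
        if i == j or (a, b) not in allowed:
            flagged.append((i, j, a + "-" + b))
    return flagged


orgc = "cgAcacUcuguCCAUgauu-AAcGuac--aAUCGaacggu-aCaUUuaguacc"
orgb = "agAccgUcg-UCGAUGauu-AAcGuaca-CAUCGA-cgug-aCaUUuagua--"

# Pairs of the first stem of orgc, and the two self-pairs of orgb.
orgc_pairs = {12: 34, 34: 12, 13: 33, 33: 13, 14: 32, 32: 14, 15: 31, 31: 15}
orgb_self = {16: 16, 30: 30}

print(nonstandard_pairs(orgc, orgc_pairs))
print(nonstandard_pairs(orgb, orgb_self))
```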

Stems that can be extended

To illustrate which stems could be extended, the extendstem program is useful:

extendstem --gupair --color example3.col | col2psalign --figure > example_2_3.ps

This command colors the positions that could be paired to extend existing stems. The --gupair option makes the program treat G-U pairs like the standard A-U and G-C pairs.

example_2_3.ps

This could lead to the following text file, example4.txt:

   pairing_mask  -33---3311111111---22222222--1111111111222222222-----
   orga          -GAcugUCuGUCGAUGauu-AAcGUac--CAUCGACcgug-ACaUUuagua--
   orgb          agAccgUc-GUCGAUGauu-AAcGUaca-CAUCGAC-gug-ACaUUuagua--
   orgc          cGAcacUCugUCcAUgauu-AAcGUAC--aAUcGAacg-GUACaUUuaguacc
   orgd          -gAcc-UgcgUCGAUgauu-AAcGUac-uaAUCGAacgug-ACaUUuagua--

From this, yet another col file is generated:

txt2col -m example4.txt > example4.col
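
The idea behind stem extension can be sketched in Python. This is a simplified guess and not the extendstem algorithm: for every existing pair it looks one alignment column outward and one inward, and reports the neighbouring positions if both are unpaired and can form a pair. The min_loop parameter (the minimum number of positions left between the two bases) and the function names are hypothetical.

```python
def can_pair(a, b, gupair=False):
    """True if bases a and b form A-U or G-C (and G-U when gupair is set)."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    if gupair:
        allowed |= {("G", "U"), ("U", "G")}
    return (a.upper(), b.upper()) in allowed


def stem_extensions(seq, pairs, gupair=False, min_loop=3):
    """Return sorted (i, j) positions that would extend an existing stem by one pair.

    pairs is a list of (i, j) tuples with 1-based alignment positions, i < j.
    """
    paired = set()
    for i, j in pairs:
        paired.update((i, j))

    found = set()
    for i, j in pairs:
        # Look one column outward and one column inward from each pair.
        for a, b in ((i - 1, j + 1), (i + 1, j - 1)):
            if a < 1 or b > len(seq):
                continue
            if b - a - 1 < min_loop:
                continue  # assumption: keep a few positions between the bases
            if a in paired or b in paired:
                continue
            if can_pair(seq[a - 1], seq[b - 1], gupair):  # gaps never pair
                found.add((a, b))
    return sorted(found)


# orga as in example3.txt, with the pairs given by its pairing mask.
orga = "-gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--"
orga_pairs = [(3, 7), (11, 35), (12, 34), (13, 33), (14, 32), (15, 31),
              (21, 46), (22, 45), (24, 43)]

print(stem_extensions(orga, orga_pairs, gupair=True))
```

For orga, the candidates found this way are the positions that were changed to capital letters in example4.txt. Other sequences may need more care, for example when a gap sits next to a stem.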

To give an impression of the result of the changes made to example.txt, the following commands can be run:

txt2col -m example.txt | unknown | stdpair --color | extendstem --color | col2psalign --figure > example_2_4.ps

txt2col -m example4.txt | unknown | stdpair --color | extendstem --color | col2psalign --figure > example_2_5.ps

This shows that many commands can be combined. The results are shown here:

Before:

example_2_4.ps

After:

example_2_5.ps

The nucleotide that is colored cyan can be part of two stem extensions. This is the reason that its structure was not changed.

Other programs

Making fasta files

Making fasta files from col files:

col2fasta example3.col > example.fasta

This gives:

   >pairing_mask
   --3---3111111111---222222----1111111111--2222222-----
   >orga
   -GACUGUCUGUCGAUGAUU-AACGUAC--CAUCGACCGUG-ACAUUUAGUA--
   >orgb
   AGACCGUCG-UCGAUGAUU-AACGUACA-CAUCGA-CGUG-ACAUUUAGUA--
   >orgc
   CGACACUCUGUCCAUGAUU-AACGUAC--AAUCGAACGGU-ACAUUUAGUACC
   >orgd
   -GACC-UGCGUCGAUGAUU-AACGUAC-UAAUCGAACGUG-ACAUUUAGUA--

If the gaps are not wanted, use:

nogap example3.col | col2fasta > example.nogap.fasta

To give:

   >pairing_mask
   --3---3111111111---222222----1111111111--2222222-----
   >orga
   GACUGUCUGUCGAUGAUUAACGUACCAUCGACCGUGACAUUUAGUA
   >orgb
   AGACCGUCGUCGAUGAUUAACGUACACAUCGACGUGACAUUUAGUA
   >orgc
   CGACACUCUGUCCAUGAUUAACGUACAAUCGAACGGUACAUUUAGUACC
   >orgd
   GACCUGCGUCGAUGAUUAACGUACUAAUCGAACGUGACAUUUAGUA

The pairing mask is probably not wanted in this case, and txt2col should be used without the -m option:

txt2col example3.txt | nogap | col2fasta > example2.nogap.fasta

To give:

   >orga
   GACUGUCUGUCGAUGAUUAACGUACCAUCGACCGUGACAUUUAGUA
   >orgb
   AGACCGUCGUCGAUGAUUAACGUACACAUCGACGUGACAUUUAGUA
   >orgc
   CGACACUCUGUCCAUGAUUAACGUACAAUCGAACGGUACAUUUAGUACC
   >orgd
   GACCUGCGUCGAUGAUUAACGUACUAAUCGAACGUGACAUUUAGUA
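
The effect of these two conversions on a single sequence can be sketched in Python. Judging from the output above, the fasta records hold the sequence in capital letters, and removing gaps simply drops the '-' characters. The function name to_fasta_record is hypothetical, and the sketch works on a plain aligned string rather than on a col file.

```python
def to_fasta_record(name, aligned_seq, keep_gaps=True):
    """Format one aligned sequence as a two-line fasta record in capital letters."""
    seq = aligned_seq.upper()
    if not keep_gaps:
        seq = seq.replace("-", "")  # drop alignment gaps
    return ">" + name + "\n" + seq


orga = "-gAcugUcugUCGAUgauu-AAcGuac--cAUCGAccgug-aCaUUuagua--"
print(to_fasta_record("orga", orga))
print(to_fasta_record("orga", orga, keep_gaps=False))
```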

Showing structure

The secondary structure of RNA can be shown using the addparen program. This adds a sequence of parentheses after each RNA sequence:

addparen example.col | col2psalign --figure > example_3_1.ps

This makes the alignment look like this:

example_3_1.ps

The matching parentheses show which nucleotides form pairs. Notice that two types of parentheses are used, because these structures have pseudoknots. The positions that pair with themselves (see above) are indicated with an `x'.
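
Why a pseudoknot needs a second type of parenthesis can be seen in a small Python sketch. It is not the addparen implementation: each pair is placed on the first bracket level where it does not cross any pair already there, and self-pairs are marked with x. The choice of [] as the second bracket type, the use of a dot for unpaired positions, and the function names are assumptions.

```python
BRACKETS = ["()", "[]", "{}", "<>"]  # assumption: one bracket type per level


def crosses(p, q):
    """True if pairs p and q interleave, which is what makes a pseudoknot."""
    (i, j), (a, b) = p, q
    return i < a < j < b or a < i < b < j


def to_brackets(length, pairs, self_pairs=()):
    """Render (i, j) pairs (1-based, i < j) as a bracket string of the given length."""
    levels = []
    for pair in sorted(pairs):
        for level in levels:
            if not any(crosses(pair, other) for other in level):
                level.append(pair)
                break
        else:
            levels.append([pair])  # crosses every existing level: open a new one
    if len(levels) > len(BRACKETS):
        raise ValueError("too many crossing stems for the available brackets")

    out = ["."] * length
    for (open_char, close_char), level in zip(BRACKETS, levels):
        for i, j in level:
            out[i - 1], out[j - 1] = open_char, close_char
    for pos in self_pairs:
        out[pos - 1] = "x"
    return "".join(out)


# Pairs of orga in example.col: the second stem crosses the first one.
orga_pairs = [(3, 7), (11, 35), (12, 34), (13, 33), (14, 32), (15, 31),
              (21, 46), (22, 45), (24, 43)]
print(to_brackets(53, orga_pairs))
```

The pairs at positions 21 to 24 open inside the stem that spans 11 to 35 but close outside it, so they cannot share the same kind of parenthesis.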

If a program like stdpair is run without the --color option, the structure is changed instead of colored. When pairs are removed, the letters are not changed to lowercase, since the case of the letters was only used by txt2col to find the structure. From that point on, the structure is given in the align_bp column of the col file:

stdpair example.col | addparen | col2psalign --figure > example_3_2.ps

This makes the alignment look like this:

example_3_2.ps

Text alignments

Sometimes, it can be useful to make text alignments that are easy to look at. This can be done with the col2txtalign program:

col2txtalign example4.col

The output from this looks like this:

              1                                                  53
   pairing_ma -33---3311111111---22222222--1111111111222222222-----
   orga       -GAcugUCuGUCGAUGauu-AAcGUac--CAUCGACcgug-ACaUUuagua--
   orgb       agAccgUc-GUCGAUGauu-AAcGUaca-CAUCGAC-gug-ACaUUuagua--
   orgc       cGAcacUCugUCcAUgauu-AAcGUAC--aAUcGAacg-GUACaUUuaguacc
   orgd       -gAcc-UgcgUCGAUgauu-AAcGUac-uaAUCGAacgug-ACaUUuagua--

The relevant options from col2psalign can also be used with col2txtalign:

col2txtalign --space --textwidth=33 --namewidth=15 example4.col

This gives the following output:

                   1                             30
   pairing_mask    -33---3311 111111---2 2222222--1
   orga            -GAcugUCuG UCGAUGauu- AAcGUac--C
   orgb            agAccgUc-G UCGAUGauu- AAcGUaca-C
   orgc            cGAcacUCug UCcAUgauu- AAcGUAC--a
   orgd            -gAcc-Ugcg UCGAUgauu- AAcGUac-ua
   
                   31                     53
   pairing_mask    1111111112 22222222-- ---
   orga            AUCGACcgug -ACaUUuagu a--
   orgb            AUCGAC-gug -ACaUUuagu a--
   orgc            AUcGAacg-G UACaUUuagu acc
   orgd            AUCGAacgug -ACaUUuagu a--
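
The blocked layout above can be reproduced for a single row with a short Python sketch. It assumes that a line holds as many positions as fit within textwidth characters once a space is inserted after every ten positions, and that the blocks restart on each new line. This is an interpretation of the output shown here, not the col2txtalign source, and the function names are hypothetical.

```python
def positions_per_line(textwidth, block=10):
    """Largest number of positions whose blocked width fits in textwidth."""
    n = 0
    # n + 1 positions need n // block separating spaces.
    while (n + 1) + n // block <= textwidth:
        n += 1
    return n


def format_row(seq, textwidth, block=10):
    """Split seq into lines of space-separated blocks of `block` positions."""
    if textwidth < 1:
        raise ValueError("textwidth must be at least 1")
    per_line = positions_per_line(textwidth, block)
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append(" ".join(blocks))
    return lines


mask4 = "-33---3311111111---22222222--1111111111222222222-----"
for line in format_row(mask4, textwidth=33):
    print(line)
```

With textwidth=33, thirty positions and two spaces fit on a line, while a thirty-first position would need 34 characters. This agrees with the break after position 30 in the output above.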

Comments, questions, etc., email gorodkin@rth.dk.

Last updated March 26th, 2007 by Jan Gorodkin