Chapter 5 Assignments

5.1 Chromosome alignment with Mummer

Using wget <url>, download the following files:

https://mummer4.github.io/tutorial/exampleFiles/2.1/in/H_pylori26695_Eslice.fasta.
https://mummer4.github.io/tutorial/exampleFiles/2.1/in/H_pyloriJ99_Eslice.fasta.

We will use mummer to align them and view the dotplot.

$ pixi init
$ pixi project channel add bioconda
$ pixi add mummer4
$ pixi run mummer -help
Usage: /home/sibbe/Documents/career/colaborations/unza-workshops/public/worked-assignments/comparison/.pixi/envs/default/bin/mummer [options] <reference-file> <query-files>

Find and output (to stdout) the positions and length of all
sufficiently long maximal matches of a substring in
<query-file> and <reference-file>

Options:
-mum           compute maximal matches that are unique in both sequences
-mumcand       same as -mumreference
-mumreference  compute maximal matches that are unique in the reference-
               sequence but not necessarily in the query-sequence (default)
-maxmatch      compute all maximal matches regardless of their uniqueness
-n             match only the characters a, c, g, or t
               they can be in upper or in lower case
-l             set the minimum length of a match
               if not set, the default value is 20
-b             compute forward and reverse complement matches
-r             only compute reverse complement matches
-s             show the matching substrings
-c             report the query-position of a reverse complement match
               relative to the original query sequence
-F             force 4 column output format regardless of the number of
               reference sequence inputs
-L             show the length of the query sequences on the header line
-h             show possible options
-help          show possible options

You should put them in the data folder like this (you can use wget <argument> -O data):

ls data
data/H_pylori26695_Eslice.fasta  data/H_pyloriJ99_Eslice.fasta

Now, use mummer to make the alignment

 mummer -mum -b -c <fasta1> <fasta2 > mummer.mums

Then we can make the alignment:

mummerplot -x "[0,275287]" -y "[0,265111]" -png <nums-file>

Which regions are inverted?

5.2 Finding the longest protein in a fasta file.

Make a bash script that reports the identifier of the longest protein in a fasta file. You can use the tools from the emboss suite. Work in a different directory.

$ mkdir protein
$ pixi init
$ pixi project channel add bioconda
$ pixi add emboss
$ pixi run mummer -help

You can use the following commands

cat
sizeseq
nthseq
pepstats
grep
cut

For example,

$ echo abcdef | cut -c 2-3
echo ab cd ef | cut -d ' ' -f 2

Then the emboss programs (such as sizeseq, nthseq, pepstats) can be used with the -filter argument to accept input from stdin:

cat fasta.fna | sizeseq -filter > fasta-sorted.fna

You can write this script using test data from uniprot, for example:

https://www.uniprot.org/proteomes/UP000464024.