Chapter 5 Assignments
5.1 Chromosome alignment with MUMmer
Using wget <url>, download the following files:
https://mummer4.github.io/tutorial/exampleFiles/2.1/in/H_pylori26695_Eslice.fasta
https://mummer4.github.io/tutorial/exampleFiles/2.1/in/H_pyloriJ99_Eslice.fasta
We will use mummer to align them and view the dotplot.
$ pixi init
$ pixi project channel add bioconda
$ pixi add mummer4
$ pixi run mummer -help
Usage: /home/sibbe/Documents/career/colaborations/unza-workshops/public/worked-assignments/comparison/.pixi/envs/default/bin/mummer [options] <reference-file> <query-files>
Find and output (to stdout) the positions and length of all
sufficiently long maximal matches of a substring in
<query-file> and <reference-file>
Options:
-mum           compute maximal matches that are unique in both sequences
-mumcand       same as -mumreference
-mumreference  compute maximal matches that are unique in the reference-
               sequence but not necessarily in the query-sequence (default)
-maxmatch      compute all maximal matches regardless of their uniqueness
-n             match only the characters a, c, g, or t
               they can be in upper or in lower case
-l             set the minimum length of a match
               if not set, the default value is 20
-b             compute forward and reverse complement matches
-r             only compute reverse complement matches
-s             show the matching substrings
-c             report the query-position of a reverse complement match
               relative to the original query sequence
-F             force 4 column output format regardless of the number of
               reference sequence inputs
-L             show the length of the query sequences on the header line
-h             show possible options
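-help          show possible options
Put the downloaded files in a data folder. With wget, the -P option sets the target directory (wget -P data <url>); note that -O data would instead write the download to a single file named data.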
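A minimal sketch of that download step, assuming wget is installed and the commands are run from the project directory. The variable BASE_URL and the function name fetch_inputs are made up for this example; the download is wrapped in a function so that nothing is fetched until you call it.

```shell
# Directory part shared by both example files (taken from the URLs above)
BASE_URL="https://mummer4.github.io/tutorial/exampleFiles/2.1/in"

# fetch_inputs is a hypothetical helper name, not part of mummer or pixi
fetch_inputs() {
    mkdir -p data
    for name in H_pylori26695_Eslice.fasta H_pyloriJ99_Eslice.fasta; do
        # -P sets the directory prefix, so each file keeps its name inside data/
        wget -P data "$BASE_URL/$name" || return 1
    done
}
```

Calling fetch_inputs once should leave both FASTA files in data/.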
Now use mummer to compute the matches between the two sequences, then plot them as a dotplot.
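One way this step might look, as a sketch rather than the definitive solution. It assumes the two FASTA files are in data/, and that mummerplot and gnuplot were installed together with the mummer4 package (this is an assumption, check with pixi run mummerplot -h). The function name align_and_plot and the output prefix mummer are chosen only for this example.

```shell
# align_and_plot is a hypothetical helper name for this sketch
align_and_plot() {
    # -mum: matches unique in both sequences
    # -b:   forward and reverse-complement matches
    # -c:   report reverse-complement positions relative to the original query
    pixi run mummer -mum -b -c \
        data/H_pylori26695_Eslice.fasta \
        data/H_pyloriJ99_Eslice.fasta > mummer.mums || return 1

    # Assumption: mummerplot (and gnuplot) are available in the pixi environment
    # -p sets the prefix of the output files; --png asks for a PNG image
    pixi run mummerplot --png -p mummer mummer.mums
}
```

If this works as intended, the match list ends up in mummer.mums and the plot in a file starting with the prefix mummer. In the dotplot, reverse-complement matches run with the opposite slope to forward matches, which is the pattern to look for in the question that follows.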
Which regions are inverted?
5.2 Finding the longest protein in a FASTA file
Write a bash script that reports the identifier of the longest protein in a FASTA file. You can use the tools from the EMBOSS suite. Work in a separate directory from the mummer project.
$ mkdir protein
$ cd protein
$ pixi init
$ pixi project channel add bioconda
$ pixi add emboss
$ pixi run sizeseq -help
You can use the following commands:
cat, sizeseq, nthseq, pepstats, grep, cut
For example,
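One possible illustration (not necessarily the example originally intended here): cat, grep and cut can be chained to list the identifiers in a FASTA file. The file demo.fasta and its two toy sequences are made up for this sketch.

```shell
# Tiny made-up FASTA file, only for illustration
cat > demo.fasta <<'EOF'
>protA first toy protein
MKV
>protB second toy protein
MKVLAAGG
EOF

# Keep the header lines, drop the leading '>', keep the first word (the identifier)
ids=$(cat demo.fasta | grep '>' | cut -c 2- | cut -d ' ' -f 1)
echo "$ids"
```

Each command does one small job, and the pipe passes the result to the next one; the same pattern carries over to real protein files.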
Then the emboss programs (such as sizeseq, nthseq, pepstats) can be used with the -filter argument to accept input from stdin:
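A sketch of how such a pipeline might be put together, assuming sizeseq and nthseq from EMBOSS are on the PATH (for example when the script is started through pixi run). The function name longest_id is hypothetical. The options used are -filter (read stdin, write stdout), -descending for sizeseq (longest sequence first) and -number for nthseq (which sequence to keep); check them against each program's -help output before relying on them.

```shell
# longest_id is a hypothetical helper: print the identifier of the longest
# sequence in the FASTA file given as the first argument.
longest_id() {
    cat "$1" |
        sizeseq -filter -descending |  # sort the sequences, longest first
        nthseq -filter -number 1 |     # keep only the first (longest) one
        grep '>' |                     # header line of that sequence
        cut -c 2- |                    # drop the leading '>'
        cut -d ' ' -f 1                # first word = identifier
}
```

EMBOSS may rewrite the header line when it parses database-style identifiers, so compare the reported identifier with the headers in the input file.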
You can write this script using test data from UniProt, for example:
https://www.uniprot.org/proteomes/UP000464024