Bioinformatics – Neill White

Most bioinformatics work that I have done is intellectual property belonging to my employers. This includes the work I did at Incyte, which involved a lot of primary sequence analysis and annotation of human chromosomes using Perl and BioPerl. At Ibis, the core task was more computational in nature and I primarily used Matlab for its signal processing capabilities. For those tasks that required more string processing and manipulation, I used Python and BioPython. In all cases, I determined if the needed analysis tools needed to be written or if they already existed.

Rosalind.info is a platform for solving bioinformatics problems. I have solved over 70 problems on this site and am currently ranked in the top 100 in the United States. This is a small selection of different problems and my solutions. My profile: http://rosalind.info/users/nwhite/

Base Quality Distribution
Base Filtration by Quality
Matching Random Motifs
Speeding Up Motif Finding
Inferring Protein from Spectrum
Comparing Spectra with the Spectral Convolution
Motzkin Numbers and RNA Secondary Structures

Base Quality Distribution

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python 

from Bio import SeqIO
import numpy as np
import sys

input_file = sys.argv[1]

f = open( input_file )
# first line is quality threshold
header = f.readline().rstrip()
q_thresh = int(header)

fq = SeqIO.parse(f, 'fastq')
m = np.array([i.letter_annotations['phred_quality'] for i in fq])

print(sum(m.mean(0) < q_thresh)

Base Filtration By Quality

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python

import sys
import numpy as np
from Bio import SeqIO

input_file = sys.argv[1]

f = open( input_file )
# first line is quality threshold
header = f.readline().rstrip()
q_thresh = int(header)

trimmedRecords = []

for record in SeqIO.parse( f, "fastq" ):
    seq_qual = np.array( record.letter_annotations["phred_quality"] )
    # find first letter with quality over threshold
    overThresh = seq_qual >= q_thresh
    idx = np.array( range( len( overThresh ) ) )
    indicesOverThresh = idx[ overThresh ]
    sliceStart = min( indicesOverThresh )
    sliceEnd = max( indicesOverThresh ) + 1
    print( "{}-{}".format( sliceStart, sliceEnd ) )
    trimmedRecord = record[sliceStart:sliceEnd]
    trimmedRecords.append( trimmedRecord )

SeqIO.write( trimmedRecords, "rosalind_submission.fastq", "fastq" );

Base Filtration By Quality

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python

import sys
import numpy as np
from Bio import SeqIO

input_file = sys.argv[1]

f = open( input_file )
# first line is quality threshold
header = f.readline().rstrip()
q_thresh = int(header)

trimmedRecords = []

for record in SeqIO.parse( f, "fastq" ):
    seq_qual = np.array( record.letter_annotations["phred_quality"] )
    # find first letter with quality over threshold
    overThresh = seq_qual >= q_thresh
    idx = np.array( range( len( overThresh ) ) )
    indicesOverThresh = idx[ overThresh ]
    sliceStart = min( indicesOverThresh )
    sliceEnd = max( indicesOverThresh ) + 1
    print( "{}-{}".format( sliceStart, sliceEnd ) )
    trimmedRecord = record[sliceStart:sliceEnd]
    trimmedRecords.append( trimmedRecord )

SeqIO.write( trimmedRecords, "rosalind_submission.fastq", "fastq" );

Matching Random Motifs

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python

import sys

input_file = sys.argv[1]

with open( input_file ) as f:
    tokens = f.readline().strip().split()
    num = int(tokens[0])
    gc_content = float(tokens[1])
    seqString = f.readline().strip()

prob = 1
for letter in seqString:
    if letter == "G" or letter == "C":
        prob *= gc_content/2
    else:
        prob *= (1-gc_content)/2

print( round( 1 - (1-prob)**num, 3 ) )

Inferring Protein from Spectrum

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python
import sys
import numpy as np
from operator import itemgetter

input_file = sys.argv[1]

mono_mass_lookup = {
    'A': 71.03711,
    'C': 103.00919,
    'D': 115.02694,
    'E': 129.04259,
    'F': 147.06841,
    'G': 57.02146,
    'H': 137.05891,
    'I': 113.08406,
    'K': 128.09496,
    'L': 113.08406,
    'M': 131.04049,
    'N': 114.04293,
    'P': 97.05276,
    'Q': 128.05858,
    'R': 156.10111,
    'S': 87.03203,
    'T': 101.04768,
    'V': 99.06841,
    'W': 186.07931,
    'Y': 163.06333
}

with open( input_file ) as f:
    masses = np.array([float(x) for x in f.read().strip().split('\n')])
mass_deltas = [masses[i+1] - masses[i] for i in range(len(masses)-1)]
weight_to_res = {weight: res for res, weight in mono_mass_lookup.viewitems()}
protein = []
for mass_delta in mass_deltas:
    res_mass_diffs = [(abs(mass_delta - weight), res) for weight, res in weight_to_res.iteritems()]
    protein.append(min(res_mass_diffs, key=itemgetter(0))[1])
print("".join(protein))

Comparing Spectra with the Spectral Convolution

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python 

import sys
import numpy as np

input_file = sys.argv[1]

with open( input_file ) as f:
    s1 = np.array( [float(n) for n in f.readline().split()] )
    s2 = np.array( [float(n) for n in f.readline().split()] )

spectral_conv = []
for m1 in s1:
    for m2 in s2:
        # this won't work without rounding!
        spectral_conv.append( round( m1-m2, 9 ) )

x = max(set(spectral_conv), key=spectral_conv.count)
count = spectral_conv.count(x)
print( "{}\n{}".format( count, x ) )

Motzkin Numbers and RNA Secondary Structures

You need a Frames Capable browser to view this content.

Solution

#!/usr/bin/python

import sys
import numpy as np
from Bio import SeqIO

input_file = sys.argv[1]

goodPairs = [ 'AU', 'CG', 'GC', 'UA' ]

seqDict = {}
def getMotzNum(seqString):
    seqLen = len( seqString );
    if seqLen <= 1:
        return 1;

    # see if we already have this num
    if seqString in seqDict:
        return seqDict[ seqString ]

    seqDict[ seqString ] = getMotzNum(seqString[1:])
    for i in range(1,len(seqString)):
        if seqString[0] + seqString[i] in goodPairs:
            seqDict[seqString] += getMotzNum(seqString[1:i]) * getMotzNum(seqString[i+1:])
    seqDict[seqString] %= 1e6
    return seqDict[seqString]


for record in SeqIO.parse( input_file, "fasta" ):
    print( int( getMotzNum( record.seq ) ) )