Round 2.3 --- manipulating sequence data with python

Jan 16, 2013 • Karin Lagesen

A while ago a colleague of mine asked me for a bit of help. He knew I often work with sequencing data, and he hoped I could assist him with sorting out his sequencing sets. What he wanted help with was a phenomenon known as tag switching. Sequencing is quite expensive, so it is quite common to attach a sequence tag to both ends of sequences coming from a certain sample. These sequence tags can then be used in postprocessing to sort sequences from different samples into different pools. However, sometimes sequences get intertwined, and they can end up exchanging ends. This can be detected by finding sequences where the sequence tag in both ends are not the same. They wanted to quantify and filter out these sequences, since they did not want to use these further. Additionally, to ensure that all of the tags had been sequenced, they also add two Gs in front of one of the sequence tags. Sometimes these Gs would end up in their sequences, sometimes not. They needed the sequences without them.

Since my go-to language for any coding is python, I assembled a script that would enable me to

  1. read in the tag sequences,
  2. read in the fasta and qual sequences,
  3. examine each sequence to find the marker sequence,
  4. cut both fasta and qual files at that point, and
  5. output that to file.

Shown below is where I iterate through all of the sequences (i.e. part 2). The doRecord function examines each of the records, looking for the tag (tags are called mids here, hence the middict) within the region of the sequence where it should be (i.e. part 3).

    output_handleQ = open(otag + ".qual", "w")
    for record in PairedFastaQualIterator(open(fastafile), open(qualfile)):
        reportFMid, reportRMid, editedRecord = doRecord(record, middict, search_for, maxsize, varsize)
        if editedRecord == None:
            continue
        SeqIO.write(editedRecord, output_handleF, "fasta")
        SeqIO.write(editedRecord, output_handleQ, "qual")
        noseqs += 1

    output_handleQ.close()
    output_handleF.close()

One method of going through fasta and qual files would have been to iterate over the fasta file and having the qual file as a dictionary, looking up the relevant qual entry as you go. However, I used the PairedFastaQualIterator fasta/qual parser instead, which ensures that the variable (record) that I subsequently end up working on is an object wich contains both the fasta sequence and the quality scores belonging to that sequence.

The next thing I had to handle was to how to best find the tags in the sequences. One way of doing it would have be loop over the tags, looking for each of them in each sequence. However, this would have required another for loop within the already existing for loop (always a big no-no), would also require that I search with each of them twice, once in the normal direction and once for the reverse complement, and last but not least, I would most likely have ended up using the in operator (as in “tag in sequence”) in this case would make things even slower.

The solution was in this case to be found in using regular expressions. Regular expressions are often used for flexible pattern matching and for finding multiple patterns. In this case, we had multiple tags which could be interpreted as a pattern, so I figured out that it should be possible to find a way to search for all of them at the same time. I found that I could make a pattern that I could search for, like this:
mids = middict.keys()
search_for = re.compile(“|”.join(mids))

A further explanation of the can be found in the python documentation:
'|' A|B, where A and B can be arbitrary REs, creates a regular expression that will
    match either A or B. An arbitrary number of REs can be separated by the '|' in 
    this way. This can be used inside groups (see below) as well. As the target string 
    is scanned, REs separated by '|' are tried from left to right. When one pattern 
    completely matches, that branch is accepted. This means that once A matches, B will
    not be tested further, even if it would produce a longer overall match. In other 
    words, the '|' operator is never greedy. 

OK, now for how I used it. I chose to create a method, called doRecord, which would handle all of the sequence processing itself. I did it this way to avoid having parts of code that are 1. too long and 2. too nested. Neither are in my opinion directly wrong, but it does give unwieldy code that is difficult to read — that is, easy to get bugs in.

So, per record, I searched first for the pattern in the string as is, then I reverse complemented it and searched again. Note the maxsize variable. Since we allowed a variable number of letters in front of the tag (in this case Gs) I set maxsize to be the sum of the longest tag plus the number of possible letters in front of the tag. Due to this I could narrow my search space to only that portion of the sequences. I then, later on, check that the tag sequence begins at a position which is less or equal than the allowed number of letters before the tags, and also that the tags match. If that is the case, I also cut the sequence where the match begins (hereby doing part 4 of my list above).

# These two are for reporting purposes, and will be returned with the record
    reportFMid = "NoF"
    reportRMid = "NoR"
   
    # This is where the matching happens. Note: I only search through the part
    # of the string that the mid should be in.
    match_objForward = search_for.search(str(record.seq[:maxsize]))
    match_objReverse = search_for.search(str(record.reverse_complement().seq[:maxsize]))

    # Here, we have found a forward mid, and we get what it is (using group), how long 
    # it is (forwardMidSize), and where it is (using start). I also get what the 
    # "real name" of the mid is, by looking it up in the dictionary
    whichmidForward = match_objForward.group()
    forwardMidSize = len(whichmidForward)
    indexForward = match_objForward.start()        
    reportFMid = middict[whichmidForward] 

    # Here, we have a reverse mid, and we use those
    whichmidReverse = match_objReverse.group()
    reverseMidSize = len(whichmidReverse)
    indexReverse = match_objReverse.start()
    reportRMid = middict[whichmidReverse]

  # Here we test first, if the mids are the same, second, that they
  # are where we expect them to be.  
  if indexForward <= varsize and indexReverse  varsize:
            reportFMid = "LongF"
        if indesReverse > varsize:
            reportRMid = "LongR"
            
    return reportFMid, reportRMid, editedRecord

The program (containing some more code than this for dealing with input, output, etc) was tested, and since it worked nicely, sent to those who needed it.

*Also posted to codingbiology.wordpress.com</li>