
File Parsing and Data Extraction in Python

Because Python is a great tool for parsing plain text files and manipulating strings, it is often my go-to resource for extracting data, regardless of the file’s format. In this post I will go over some simple examples of how I do this in the field of bioinformatics.

File Parsing

>Sequence1 this is sequence number 1
aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt
>Sequence2 this is sequence number 2
cgtcatcacccaggtccatcgacaacaaccgcctggtcgcatcggccttcactgctaccg
>Sequence3 this is sequence number 3
cacctcccccgtcatggcttccagcacccgctggaatcgttcgacggctttccgcacccg

Using the above FastA file, sequences.fa, let’s perform the following procedures:

  1. Open the file.
  2. Read the file line by line.
  3. Print each line to the console.
with open('sequences.fa') as f:  # 1. open the file
    for line in f:               # 2. read the file line by line
        print(line)              # 3. print each line to the console
>Sequence1 this is sequence number 1

aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt

>Sequence2 this is sequence number 2

cgtcatcacccaggtccatcgacaacaaccgcctggtcgcatcggccttcactgctaccg

>Sequence3 this is sequence number 3

cacctcccccgtcatggcttccagcacccgctggaatcgttcgacggctttccgcacccg

String Manipulation

Oops! It looks like we have an unintended consequence here. Although no information is missing, there is now a blank line after every line of output. This is because each line in the file is terminated by a \n (newline) character. The following representation of a text file:

line1
line2
line3

is actually encoded as: line1\nline2\nline3\n. As we parse the file and extract each line, we also pick up its newline character. Then Python’s built-in print() function appends another newline character to the string by default. So the blank lines in the output are the effect of two consecutive newline characters, i.e. line1\n\nline2\n\nline3\n\n. This can be handled using two different approaches:

  1. Remove the \n character from line using the built-in string method rstrip() before we print it.
  2. Instruct print() to use an empty string ('') in place of \n.
# 1. Remove \n character from line
with open('sequences.fa') as f:
    for line in f:
        line = line.rstrip('\n')  # Remove \n from line
        print(line)

# 2.  Print each line with empty string in place of \n
with open('sequences.fa') as f:
    for line in f:
        print(line, end='')  # Set end char as empty string

Both method #1 and #2 produce the following result:

>Sequence1 this is sequence number 1
aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt
>Sequence2 this is sequence number 2
cgtcatcacccaggtccatcgacaacaaccgcctggtcgcatcggccttcactgctaccg
>Sequence3 this is sequence number 3
cacctcccccgtcatggcttccagcacccgctggaatcgttcgacggctttccgcacccg
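
If you want to see the trailing newline for yourself, the built-in repr() function shows escape characters explicitly. The following is a minimal sketch: it assumes an in-memory copy of the first record (the name fasta_text is my own, wrapped in io.StringIO as a stand-in for the open file) so that it runs without sequences.fa on disk.

```python
import io

# Stand-in for the first record of sequences.fa (an assumption made so
# the sketch is self-contained; with the real file, use open() instead).
fasta_text = (
    ">Sequence1 this is sequence number 1\n"
    "aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt\n"
)

raw_lines = []
for line in io.StringIO(fasta_text):
    raw_lines.append(line)
    print(repr(line))               # repr() shows the trailing '\n' explicitly
    print(repr(line.rstrip('\n')))  # the same line with the newline removed
```

Each line is printed twice: once with the \n visible at the end of the string, and once without it.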

Counting

Great! But not very useful… A common question that comes up regarding FastA files is: “How many sequences does it contain?” To answer this question, we are essentially asking “How many lines start with the > character?” So the approach is pretty simple:

  1. Create a variable called n to keep count.
  2. Read the file line by line.
  3. If the line starts with > increase n by 1.
  4. Print results.
n = 0
with open('sequences.fa') as f:
    for line in f:
        if line.startswith('>'):   # str.startswith()
            n += 1
print(f'There are {n} sequences')  # f'string'
There are 3 sequences

Perfect! In this example I introduced a few new tools: the built-in string method startswith(), as well as Literal String Interpolation (f-strings). F-strings are a convenient and clean way to include variables like {n} in a string. I will not be giving an in-depth explanation of every built-in function that I use; however, I will continue to provide links to the Python documentation. It is important that you learn to RTFM, or “read the manual”, when you encounter issues or want to discover new and exciting implementations!
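
As an aside, the same count can be written more compactly by passing a generator expression to the built-in sum(). This is only a sketch of an alternative, not a replacement for the loop above. The helper name count_sequences and the in-memory fasta_text copy of the file are my own assumptions, used so the example runs on its own.

```python
import io

# In-memory copy of sequences.fa (an assumption so the sketch is
# self-contained; with the real file, pass an open file object instead).
fasta_text = (
    ">Sequence1 this is sequence number 1\n"
    "aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt\n"
    ">Sequence2 this is sequence number 2\n"
    "cgtcatcacccaggtccatcgacaacaaccgcctggtcgcatcggccttcactgctaccg\n"
    ">Sequence3 this is sequence number 3\n"
    "cacctcccccgtcatggcttccagcacccgctggaatcgttcgacggctttccgcacccg\n"
)


def count_sequences(handle):
    """Count FastA records by counting lines that start with '>'."""
    # The generator yields 1 for every header line; sum() adds them up.
    return sum(1 for line in handle if line.startswith('>'))


n = count_sequences(io.StringIO(fasta_text))
print(f'There are {n} sequences')
```

With the real file, the call would look like count_sequences(f) inside a with open('sequences.fa') as f: block.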

Data Extraction

Let’s say we are asked to create a list of the sequence IDs for the three sequences in our file. Note that a FastA header is formatted as:

>SequenceID description is everything else

So, we will need to:

  1. Split the header into a list of strings using the default whitespace separator.
  2. Get the first string in the list (index 0).
  3. Keep ONLY the characters after > (from index 1 to the end).
  4. Add the string to a list of IDs.
  5. Print the list of IDs.
id_list = []                          # create empty list
with open('sequences.fa') as f:
    for line in f:
        if line.startswith('>'):      # We've got a header!
            seq_id = line.split()[0]  # Split the header and get string 0
            seq_id = seq_id[1:]       # Keep chars from index 1 to end
            id_list.append(seq_id)    # Add seq_id to id_list (the name avoids shadowing the built-in id())
print(id_list)
['Sequence1', 'Sequence2', 'Sequence3']
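
Since the header format is “ID, then description”, the same idea extends to keeping the description as well. The sketch below uses str.split() with maxsplit=1, so that only the first run of whitespace separates the two parts. The function name parse_headers and the in-memory fasta_text are hypothetical, introduced here only to keep the example self-contained.

```python
import io

# In-memory copy of sequences.fa (an assumption so the sketch runs
# without the file on disk).
fasta_text = (
    ">Sequence1 this is sequence number 1\n"
    "aaggcgcggaccgctccaaggccgctcgatttccccactcttcccactcagcgcgttcgt\n"
    ">Sequence2 this is sequence number 2\n"
    "cgtcatcacccaggtccatcgacaacaaccgcctggtcgcatcggccttcactgctaccg\n"
    ">Sequence3 this is sequence number 3\n"
    "cacctcccccgtcatggcttccagcacccgctggaatcgttcgacggctttccgcacccg\n"
)


def parse_headers(handle):
    """Return a list of (sequence_id, description) tuples, one per header."""
    headers = []
    for line in handle:
        if not line.startswith('>'):
            continue                      # skip sequence lines
        # Drop the leading '>' and split once on whitespace:
        # the first piece is the ID, the rest (if any) is the description.
        parts = line[1:].split(maxsplit=1)
        if not parts:
            continue                      # a bare '>' carries no ID
        seq_id = parts[0]
        description = parts[1].rstrip() if len(parts) > 1 else ''
        headers.append((seq_id, description))
    return headers


headers = parse_headers(io.StringIO(fasta_text))
for seq_id, description in headers:
    print(f'{seq_id}: {description}')
```

Headers that have no description end up with an empty string in the second position, so the tuples always have the same shape.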
