IDR Part 1: Parsing a GFF File

Parsing a GFF file in Python using classes.

IDR_Data_Club

Part 1: Parsing the genome file

First, download the Drosophila genome file. This is the file we'll be working on.

Make sure you login to your berkeley email and download the GFF file here: https://drive.google.com/open?id=1WZ8Zxhjbxj8kYuzganRJmb6KPKgqT73W

About GFF files

This is an annotated genome file from FlyBase that contains gene names and other features. The begining of file is a bunch of "junky" scaffolds, then there are columns that describe various areas of the of the genome. The last column is where a bunch of "other" classifications for the GFF file. Also in the last column are the nested hierarchical information that relates to other rows in the dataframe. This last column is what makes parsing the GFF file hard.

Goal: Get the protein names so we can get the amino acid sequences. Create a data structure where for each gene, list all the proteins products for each gene.

Why use Classes?

One of the advantages of using classes is that you can reuse a class with many different GFFS files.

  • [ ] Read into The logic of object oriented programming. In object oriented programming everything is done around classes.

The Code

In [2]:
# Use class to define the gff object
# It seperates each column into a particular attribute of the gff class

class gff:
    
    def __init__(self):
        
        self.seq = ""
        self.source = ""
        self.feature = ""
        self.start = ""
        self.end = ""
        self.score = ""
        self.strand = ""
        self.frame = ""
        self.attribute = {}

## This is a function that reads in each row of the gff, and pases them into the proper 
## Attribute class

    @classmethod
    def from_fields(cls,fields):
        gff = cls()
        if len(fields) == 9:
            gff.seq = fields[0]
            gff.source = fields[1]
            gff.feature = fields[2]
            gff.start = int(fields[3])
            gff.end = int(fields[4])
            if fields[5] != ".":
                gff.score = float(fields[5])
            gff.strand = fields[6]
            gff.frame = fields[7]
            
            atts = fields[8].split(';')
            for att in atts:
                (k,v) = att.split('=')
                gff.attribute[k] = v.split(',')
        return gff
In [3]:
gff_file = "dmel-all-no-analysis-r6.25.gff"

gffs = {}

## Only store the ones in these category.

features = [
              'CDS',
              'enhancer',
              'exon',
              'five_prime_UTR',
              'gene',
              'golden_path_region',
              'insulator',
              'intergenic',
              'intron',
              'mRNA',
              'mature_peptide',
              'miRNA',
              'ncRNA',
              'none',
              'polyA_site',
              'pre_miRNA',
              'protein',
              'pseudogene',
              'rRNA',
              'regulatory_region',
              'repeat_region',
              'silencer',
              'snRNA',
              'snoRNA',
              'tRNA',
              'tandem_repeat',
              'three_prime_UTR',
              'transposable_element',
              ]

for feature in features:
    gffs[feature] = []

gffo = open(gff_file,"r")


for line in gffo:
    if line[0:2] != "##":
        line = line.strip()
        line = line.split("\t")
        if len(line) == 9: 
            if line[2] in features: ## makes sure the row is of a feature we are interested in   
                gffs[line[2]].append(gff.from_fields(line))
                         

The values below is a dictionary with and then the location of the objects. Is there a better way to view this data structure?

In [4]:
gffs
Out[4]:
{'CDS': [<__main__.gff at 0x108ba8c10>,
  <__main__.gff at 0x108bac110>,
  <__main__.gff at 0x108c58e50>,
  <__main__.gff at 0x108c5a190>,
  <__main__.gff at 0x108c5a750>,
  <__main__.gff at 0x108ded9d0>,
  <__main__.gff at 0x108def090>,
  <__main__.gff at 0x108def510>,
  <__main__.gff at 0x108defb10>,
  <__main__.gff at 0x108e6b690>,
  <__main__.gff at 0x108e6b850>,
  <__main__.gff at 0x108e6bf90>,
  <__main__.gff at 0x108e6d150>,
  <__main__.gff at 0x108e6d5d0>,
  <__main__.gff at 0x108e6dd50>,
  <__main__.gff at 0x108e7d0d0>,
  <__main__.gff at 0x108e7d250>,

 
  ...]}

Now it is subsets only the proteins and pulls out the parsed attribute field and stores in a better data structure that can be useful later.

In [5]:
class protein:
    
    def __init__(self):
        
        self.id = ""
        self.name = ""
        self.parent = ""
        self.iep = ""
        self.mw = ""
        self.xref = {}
        self.seq = ""
        self.iupred = 0.0
        self.gff = ""

    @classmethod
    def from_atts(cls,atts):
        protein = cls()
        txt_fields = [('ID')]
        if 'ID' in atts:
            protein.id = atts['ID'][0]
        if 'Name' in atts:
            protein.name = atts['Name'][0]
        if 'Derives_from' in atts:
            protein.parent = atts['Derives_from'][0]
        if 'derived_isoelectric_point' in atts:
            protein.iep = float(atts['derived_isoelectric_point'][0])
        if 'derived_molecular_weight' in atts:
            protein.mw = float(atts['derived_molecular_weight'][0])
        if 'Dbxref' in atts:
            for xref in atts['Dbxref']:
                xrefs = xref.split(':')
                protein.xref[xrefs[0]] = xrefs[1]
    
        return protein
    
class mRNA:
    
    def __init__(self):
        
        self.id = ""
        self.name = ""
        self.parent = ""
        self.children = []
        self.score_text = ""
        self.xref = {}
        self.gff = ""
        self.CDS = []
        
    @classmethod
    def from_atts(cls,atts):
        mRNA = cls()
        if 'ID' in atts:
            mRNA.id = atts['ID'][0]
        if 'Name' in atts:
            mRNA.name = atts['Name'][0]
        if 'Parent' in atts:
            mRNA.parent = atts['Parent'][0]
        if 'score_text' in atts:
            mRNA.score_text = atts['score_text'][0]
        if 'Dbxref' in atts:
            for xref in atts['Dbxref']:
                xrefs = xref.split(':')
                mRNA.xref[xref[0]] = xref[1]
    
        return mRNA

class gene:
    
    def __init__(self):
        self.id = ""
        self.name = ""
        self.fullname = ""
        self.aliases = []
        self.ontologies = []
        self.children = []
        self.score_text = ""
        self.xref = {}
        self.matzyg = ""
        self.stage2 = 0
        self.tf = 0
        self.proteins = []
        self.iupred = 0.0
        self.gff = ""
        
    @classmethod
    def from_atts(cls,atts):
        gene = cls()
        if 'ID' in atts:
            gene.id = atts['ID'][0]
        if 'Name' in atts:
            gene.name = atts['Name'][0]
        if 'score_text' in atts:
            gene.iep = float(atts['score_text'][0])
        if 'Dbxref' in atts:
            for xref in atts['Dbxref']:
                xrefs = xref.split('=')
                gene.xref[xref[0]] = xref[1]
        if 'Alias' in atts:
            gene.aliases = atts['Alias']
        if 'Ontology_term' in atts:
            gene.ontologies = atts['Ontology_term']
    
        return gene

Process gff file for proteins, mRNAs, and genes

For every protein this code goe. for each protein in P and pull out the parent of that protein and look up the mRNA and find the object of the parent. Then created a children object for each class and appends the protein name and links the parent to the child. Everytime an object has a parent you give it the child.

In [6]:
proteins = {}

for protein_gff in gffs['protein']:
    p = protein.from_atts(protein_gff.attribute)
    p.gff = protein_gff
    proteins[p.id] = p

mRNAs = {}

for mrna_gff in gffs['mRNA']:
    p = mRNA.from_atts(mrna_gff.attribute)
    p.gff = mrna_gff
    mRNAs[p.id] = p
    
    
genes = {}

for gene_gff in gffs['gene']:
    p = gene.from_atts(gene_gff.attribute)
    p.gff = gene_gff
    genes[p.id] = p
    
In [7]:
# set parents/children

for p in proteins:
    mRNAs[proteins[p].parent].children.append(proteins[p].id)
    
for t in mRNAs:
    genes[mRNAs[t].parent].children.append(mRNAs[t].id)
    
for g in genes.values():
    for m in g.children:
        for p in mRNAs[m].children:
            g.proteins.append(p)
In [8]:
type(proteins)
Out[8]:
dict