IDR Part 1: Parsing a GFF File
Parsing a GFF file in Python using classes.
Part 1: Parsing the genome file¶
First, download the Drosophila genome file. This is the file we'll be working on.
Make sure you login to your berkeley email and download the GFF file here: https://drive.google.com/open?id=1WZ8Zxhjbxj8kYuzganRJmb6KPKgqT73W
About GFF files¶
This is an annotated genome file from FlyBase that contains gene names and other features. The begining of file is a bunch of "junky" scaffolds, then there are columns that describe various areas of the of the genome. The last column is where a bunch of "other" classifications for the GFF file. Also in the last column are the nested hierarchical information that relates to other rows in the dataframe. This last column is what makes parsing the GFF file hard.
Goal: Get the protein names so we can get the amino acid sequences. Create a data structure where for each gene, list all the proteins products for each gene.
Why use Classes?¶
One of the advantages of using classes is that you can reuse a class with many different GFFS files.
- [ ] Read into The logic of object oriented programming. In object oriented programming everything is done around classes.
The Code¶
# Use class to define the gff object
# It seperates each column into a particular attribute of the gff class
class gff:
def __init__(self):
self.seq = ""
self.source = ""
self.feature = ""
self.start = ""
self.end = ""
self.score = ""
self.strand = ""
self.frame = ""
self.attribute = {}
## This is a function that reads in each row of the gff, and pases them into the proper
## Attribute class
@classmethod
def from_fields(cls,fields):
gff = cls()
if len(fields) == 9:
gff.seq = fields[0]
gff.source = fields[1]
gff.feature = fields[2]
gff.start = int(fields[3])
gff.end = int(fields[4])
if fields[5] != ".":
gff.score = float(fields[5])
gff.strand = fields[6]
gff.frame = fields[7]
atts = fields[8].split(';')
for att in atts:
(k,v) = att.split('=')
gff.attribute[k] = v.split(',')
return gff
gff_file = "dmel-all-no-analysis-r6.25.gff"
gffs = {}
## Only store the ones in these category.
features = [
'CDS',
'enhancer',
'exon',
'five_prime_UTR',
'gene',
'golden_path_region',
'insulator',
'intergenic',
'intron',
'mRNA',
'mature_peptide',
'miRNA',
'ncRNA',
'none',
'polyA_site',
'pre_miRNA',
'protein',
'pseudogene',
'rRNA',
'regulatory_region',
'repeat_region',
'silencer',
'snRNA',
'snoRNA',
'tRNA',
'tandem_repeat',
'three_prime_UTR',
'transposable_element',
]
for feature in features:
gffs[feature] = []
gffo = open(gff_file,"r")
for line in gffo:
if line[0:2] != "##":
line = line.strip()
line = line.split("\t")
if len(line) == 9:
if line[2] in features: ## makes sure the row is of a feature we are interested in
gffs[line[2]].append(gff.from_fields(line))
The values below is a dictionary with and then the location of the objects. Is there a better way to view this data structure?
gffs
Now it is subsets only the proteins and pulls out the parsed attribute field and stores in a better data structure that can be useful later.
class protein:
def __init__(self):
self.id = ""
self.name = ""
self.parent = ""
self.iep = ""
self.mw = ""
self.xref = {}
self.seq = ""
self.iupred = 0.0
self.gff = ""
@classmethod
def from_atts(cls,atts):
protein = cls()
txt_fields = [('ID')]
if 'ID' in atts:
protein.id = atts['ID'][0]
if 'Name' in atts:
protein.name = atts['Name'][0]
if 'Derives_from' in atts:
protein.parent = atts['Derives_from'][0]
if 'derived_isoelectric_point' in atts:
protein.iep = float(atts['derived_isoelectric_point'][0])
if 'derived_molecular_weight' in atts:
protein.mw = float(atts['derived_molecular_weight'][0])
if 'Dbxref' in atts:
for xref in atts['Dbxref']:
xrefs = xref.split(':')
protein.xref[xrefs[0]] = xrefs[1]
return protein
class mRNA:
def __init__(self):
self.id = ""
self.name = ""
self.parent = ""
self.children = []
self.score_text = ""
self.xref = {}
self.gff = ""
self.CDS = []
@classmethod
def from_atts(cls,atts):
mRNA = cls()
if 'ID' in atts:
mRNA.id = atts['ID'][0]
if 'Name' in atts:
mRNA.name = atts['Name'][0]
if 'Parent' in atts:
mRNA.parent = atts['Parent'][0]
if 'score_text' in atts:
mRNA.score_text = atts['score_text'][0]
if 'Dbxref' in atts:
for xref in atts['Dbxref']:
xrefs = xref.split(':')
mRNA.xref[xref[0]] = xref[1]
return mRNA
class gene:
def __init__(self):
self.id = ""
self.name = ""
self.fullname = ""
self.aliases = []
self.ontologies = []
self.children = []
self.score_text = ""
self.xref = {}
self.matzyg = ""
self.stage2 = 0
self.tf = 0
self.proteins = []
self.iupred = 0.0
self.gff = ""
@classmethod
def from_atts(cls,atts):
gene = cls()
if 'ID' in atts:
gene.id = atts['ID'][0]
if 'Name' in atts:
gene.name = atts['Name'][0]
if 'score_text' in atts:
gene.iep = float(atts['score_text'][0])
if 'Dbxref' in atts:
for xref in atts['Dbxref']:
xrefs = xref.split('=')
gene.xref[xref[0]] = xref[1]
if 'Alias' in atts:
gene.aliases = atts['Alias']
if 'Ontology_term' in atts:
gene.ontologies = atts['Ontology_term']
return gene
Process gff file for proteins, mRNAs, and genes¶
For every protein this code goe. for each protein in P and pull out the parent of that protein and look up the mRNA and find the object of the parent. Then created a children object for each class and appends the protein name and links the parent to the child. Everytime an object has a parent you give it the child.
proteins = {}
for protein_gff in gffs['protein']:
p = protein.from_atts(protein_gff.attribute)
p.gff = protein_gff
proteins[p.id] = p
mRNAs = {}
for mrna_gff in gffs['mRNA']:
p = mRNA.from_atts(mrna_gff.attribute)
p.gff = mrna_gff
mRNAs[p.id] = p
genes = {}
for gene_gff in gffs['gene']:
p = gene.from_atts(gene_gff.attribute)
p.gff = gene_gff
genes[p.id] = p
# set parents/children
for p in proteins:
mRNAs[proteins[p].parent].children.append(proteins[p].id)
for t in mRNAs:
genes[mRNAs[t].parent].children.append(mRNAs[t].id)
for g in genes.values():
for m in g.children:
for p in mRNAs[m].children:
g.proteins.append(p)
type(proteins)