This semester, in SDS 348, I delved deeper into the statistical techniques available in RStudio and Python and learned how to apply them to large sets of biological data. Previously, I had only minimal RStudio experience, so getting to learn and explore a new coding language, Python, has been really exciting. I have thoroughly enjoyed working with Python this semester, and here are some examples of how I applied it in class:
# Find all 3-mers in these two sequences
my_seq1 = "ATCATCATG"
my_seq2 = "CAGCCCAATCAGGCTCTACTGCCACTAAACTTACGCAGGATATATTTACGCCGACGTACT"

def count(string):
    # Dictionary mapping each 3-mer to the number of times it occurs
    Dict = {}
    # Slide a window of length 3 along the sequence
    for i in range(len(string) - 2):
        mer = string[i:i+3]
        if mer in Dict:
            Dict[mer] = Dict[mer] + 1
        else:
            Dict[mer] = 1
    print(Dict)
count(my_seq1)
## {'ATG': 1, 'CAT': 2, 'ATC': 2, 'TCA': 2}
count(my_seq2)
## {'CTT': 1, 'AAA': 1, 'ATC': 1, 'AAC': 1, 'ATA': 2, 'AGG': 2, 'CTC': 1, 'AGC': 1, 'AAT': 1, 'ATT': 1, 'CTG': 1, 'CTA': 2, 'ACT': 4, 'CAC': 1, 'ACG': 3, 'CAA': 1, 'CCA': 2, 'CCG': 1, 'CCC': 1, 'TAT': 2, 'CGA': 1, 'CAG': 3, 'TCT': 1, 'GAT': 1, 'TTT': 1, 'TGC': 1, 'GGA': 1, 'TAA': 1, 'GGC': 1, 'TAC': 4, 'TTA': 2, 'GAC': 1, 'CGT': 1, 'TCA': 1, 'GCA': 1, 'GTA': 1, 'GCC': 3, 'GCT': 1, 'CGC': 2}
In this example, I used Python for bioinformatics, a field I am interested in, to build and print a dictionary that counts every subsequence of length three (3-mer) occurring in each of the two DNA sequences specified above.
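The same sliding-window count can also be sketched with `collections.Counter` from the standard library. This is only an alternative illustration, not the version used in class; the function name `count_kmers` and the parameter `k` are my own assumptions, and the function returns the dictionary instead of printing it so the result can be reused.

```python
from collections import Counter


def count_kmers(seq, k=3):
    """Return a dict mapping each k-mer in seq to how often it occurs.

    Hypothetical helper: the name and the k parameter are illustrative.
    """
    # Slide a window of length k along the sequence and tally each slice
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return dict(Counter(windows))


kmers = count_kmers("ATCATCATG")
print(kmers)
```

Returning the dictionary makes it possible to look up a single 3-mer afterwards, for example `kmers["ATC"]`.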
import re
string7="ATGGCAATAACCCCCCGTTTCTACTTCTAGAGGAGAAAAGTATTGACATGAGCGCTCCCGGCACAAGGGCCAAAGAAGTCTCCAATTTCTTATTTCCGAATGACATGCGTCTCCTTGCGGGTAAATCACCGACCGCAATTCATAGAAGCCTGGGGGAACAGATAGGTCTAATTAGCTTAAGAGAGTAAATCCTGGGATCATTCAGTAGTAACCATAAACTTACGCTGGGGCTTCTTCGGCGGATTTTTACAGTTACCAACCAGGAGATTTGAAGTAAATCAGTTGAGGATTTAGCCGCGCTATCCGGTAATCTCCAAATTAAAACATACCGTTCCATGAAGGCTAGAATTACTTACCGGCCTTTTCCATGCCTGCGCTATACCCCCCCACTCTCCCGCTTATCCGTCCGAGCGGAGGCAGTGCGATCCTCCGTTAAGATATTCTTACGTGTGACGTAGCTATGTATTTTGCAGAGCTGGCGAACGCGTTGAACACTTCACAGATGGTAGGGATTCGGGTAAAGGGCGTATAATTGGGGACTAACATAGGCGTAGACTACGATGGCGCCAACTCAATCGCAGCTCGAGCGCCCTGAATAACGTACTCATCTCAACTCATTCTCGGCAATCTACCGAGCGACTCGATTATCAACGGCTGTCTAGCAGTTCTAATCTTTTGCCAGCATCGTAATAGCCTCCAAGAGATTGATGATAGCTATCGGCACAGAACTGAGACGGCGCCGATGGATAGCGGACTTTCGGTCAACCACAATTCCCCACGGGACAGGTCCTGCGGTGCGCATCACTCTGAATGTACAAGCAACCCAAGTGGGCCGAGCCTGGACTCAGCTGGTTCCTGCGTGAGCTCGAGACTCGGGATGACAGCTCTTTAAACATAGAGCGGGGGCGTCGAACGGTCGAGAAAGTCATAGTACCTCGGGTACCAACTTACTCAGGTTATTGCTTGAAGCTGTACTATTTTAGGGGGGGAGCGCTGAAGGTCTCTTCTTCTCATGACTGAACTCGCGAGGGTCGTGAAGTCGGTTCCTTCAATGGTTAAAAAACAAAGGCTTACTGTGCGCAGAGGAACGCCCATCTAGCGGCTGGCGTCTTGAATGCTCGGTCCCCTTTGTCATTCCGGATTAATCCATTTCCCTCATTCACGAGCTTGCGAAGTCTACATTGGTATATGAATGCGACCTAGAAGAGGGCGCTTAAAATTGGCAGTGGTTGATGCTCTAAACTCCATTTGGTTTACTCGTGCATCACCGCGATAGGCTGACAAAGGTTTAACATTGAATAGCAAGGCACTTCCGGTCTCAATGAACGGCCGGGAAAGGTACGCGCGCGGTATGGGAGGATCAAGGGGCCAATAGAGAGGCTCCTCTCTCACTCGCTAGGAGGCAAATGTAAAACAATGGTTACTGCATCGATACATAAAACATGTCCATCGGTTGCCCAAAGTGTTAAGTGTCTATCACCCCTAGGGCCGTTTCCCGCATATAAACGCCAGGTTGTATCCGCATTTGATGCTACCGTGGATGAGTCTGCGTCGAGCGCGCCGCACGAATGTTGCAATGTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCGTCGCTGTACTAATAGTTGTCGACAGACCGTCGAGATTAGAAAATGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATTGCGGTGTCTTCACTGTGCTGCGGCTACCCATCGCCTGAAATCCAGCTGGTGTCAAGCCATCCCCTCTCCGGGACGCCGCATGTAGTGAAACATATACGTTGCACGGGTTCACCGCGGTCCGTTCTGAGTCGACCAAGGACACAATCGAGCTCCGATCCGTACCCTCGACAAACTTGTACCCGACCCCCGGAGCTTGCCAGCTCCTCGGGTATCATGGAGCCTGTGGTTCATCGCGTCCGATATCAAACTTCGTCATGATAAAGTCCCCCCCTCGGGAGTACCAGAGAAG
ATGACTACTGAGTTGTGCGAT"
(re.findall("A.TAAT|GC[AG][AT]TG", string7))
## ['GCGTTG', 'ATTAAT', 'GCAATG', 'ACTAAT']
len(re.findall("A.TAAT|GC[AG][AT]TG", string7))
## 4
In this example, I used Python's regular expression (re) module, specifically findall, to return a list of the matches to the restriction enzyme binding sites ANTAAT and GCRWTG, and to count how many cuts are made in total when the sequence is digested with both of these restriction enzymes. With 4 cut sites, a linear sequence would be expected to yield 5 fragments.
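To make the step from matches to fragments explicit, here is a minimal sketch that uses `re.finditer` to record where each site starts and then derives the fragment count. It assumes the DNA is linear and that each matched site is cut exactly once; the helper names `find_sites` and `fragment_count` and the short demo sequence are made up for illustration.

```python
import re

# ANTAAT or GCRWTG, written as a regular expression (R = A/G, W = A/T)
SITE_PATTERN = re.compile(r"A.TAAT|GC[AG][AT]TG")


def find_sites(seq):
    """Return (start_index, matched_text) for each non-overlapping site."""
    return [(m.start(), m.group()) for m in SITE_PATTERN.finditer(seq)]


def fragment_count(n_sites):
    """Fragments from a linear sequence, assuming one cut per site."""
    return n_sites + 1


# Short made-up sequence, used only to demonstrate the helpers
demo_seq = "TTACTAATGGGCAATGCC"
sites = find_sites(demo_seq)
print(sites)
print(fragment_count(len(sites)))
```

Like `findall`, `finditer` reports only non-overlapping matches, so overlapping binding sites would be counted once.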