answer = 42
precise_answer = 42.0
name = "Deep Thought"
Jupyter Notebooks
What is a Jupyter Notebook?
When people talk about processing scientific data, they rarely mean hitting a button on an automated system and waiting for a finished result. As you work through your hypothesis, you may often find that your question changes based on how the data come together. This is true of scientific coding as well. In practice, you will constantly be designing functions and will need to evaluate how well they work - or troubleshoot why they do not run at all.
The majority of the tutorials we will be sharing will be written in Python - but that doesn’t have to mean that we need to write a Python script from scratch. What you are now reading is written in a Jupyter Notebook - a dynamic environment capable of both processing code and displaying results in a step-wise fashion. A Jupyter notebook is segmented into cells, allowing you to write a specific set of instructions in each cell and execute it to see the results without needing to re-run the entire script. From a practical perspective, you can think of a Jupyter notebook as a rapid prototyping sandbox. Once you have code that works as you expect, you can design larger applications that use the scripts you construct to automate a large portion of your processing tasks.
Lesson Objectives:
In this notebook, we will cover:
- The basics of coding in Python
- How to read in your data
- Python packages (and why you should use them)
- Designing your own functions (and why you’ll need to)
Lesson Case Study:
We will use the NPAtlas API and a custom function to retrieve information about a set of compounds.
Why Python:
Python is an incredibly flexible programming language that allows you to design solutions to problems very quickly. Unlike some more complex programming languages, you can create variables “on the fly” as you need them without declaring them at the very beginning of the script. This makes it easy to pass data through these variables to other parts of your code in a readable way. You may hear that Python is “slow”, but speed is rarely the main focus when designing a new tool in Python. Because of its simple structure and easy readability, it is often the go-to language for scientific programming.
Python Data Structures:
Python has some in-built structures for data that we will be using to store, sort, and manipulate our data.
Strings, Integers, and Floats
The most basic data types we interact with in Python are strings, integers, and floats.
- Strings can be thought of as text and are surrounded by quotation marks so that Python knows this is the kind of data we mean to provide.
- Integers are whole numbers, and can be declared directly.
- Floats are numbers with decimals.
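As a quick check of the three data types above, Python's built-in type() function reports which type a value has. The variable names and values below are only illustrative.

```python
# Illustrative values only; type() reports the data type of each one.
text_value = "Deep Thought"   # a string: text inside quotation marks
whole_number = 42             # an integer: a whole number
decimal_number = 42.0         # a float: a number with a decimal point

print(type(text_value))
print(type(whole_number))
print(type(decimal_number))
```

Note that 42 and 42.0 are stored as different types even though they represent the same quantity.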
Variables
Variables can be thought of in coding much as they are in math, physics, or chemistry. Here, we can assign a value to a variable and call upon it later. Run the cell below by pressing Shift+Enter to execute the code in the next block.
In the cell above, we declared three new variables using each of the three data types we discussed above. Now, we can recall them at any time by referring to them by name. We can manipulate the data each of these variables contains, or use combinations of these variables to arrive at an answer to a question.
In-built Python Functions:
Python contains a number of helpful functions that allow you to perform all kinds of tasks and view the results. Basic math operations can be done directly, and you can view the answer by using the built-in print() function. You can also combine variables and text in a single call to print().
print(answer)
print(precise_answer)
print(name, "says the answer is", answer)
42
42.0
Deep Thought says the answer is 42
Manipulating variables
The print function is one of Python’s simplest functions and allows you to view any variable you’ve declared. In a Jupyter notebook, you can also type a variable’s name on the last line of a cell to see what it contains - but in a regular Python script this produces no output, so you would need print().
answer
42
If we want to change the answer, we can perform an operation on it:
answer - 24
18
Be warned! We didn’t tell the system to store the new answer as a variable - so it won’t remember what the new answer actually is unless we tell it to remember:
print("The answer is still:",answer)
new_answer = answer - 24
print("But the new answer is",new_answer)
The answer is still: 42
But the new answer is 18
If you want to see if one value is equal to another, use two equals signs (==) to tell Python to evaluate a comparison rather than assign a variable. In the example below, the result would be True if the two values were equal - however, we are expecting it to say False.
answer == new_answer
False
This is very handy for checking to see if something satisfies some conditions we have for data filtration. We can also ask python if the values are NOT equal by using the following:
answer != new_answer
True
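To illustrate how such a comparison can drive a decision, here is a minimal sketch that places the != check inside an if/else statement. The variable status and the messages are made up for this example.

```python
answer = 42
new_answer = answer - 24

# The comparison evaluates to True or False, and "if" acts on that result.
if answer != new_answer:
    status = "The values differ."
else:
    status = "The values match."

print(status)
```

The same pattern is what makes comparisons useful for filtering data: only values that satisfy the condition trigger the indented code.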
Lists and Dictionaries:
Lists are collections of values or variables that you can use to store information. They can contain strings, integers, floats, and even complex objects - such as other lists. We declare lists with square brackets [] and separate elements of a list with a comma.
bacteria_genera = ["Escherichia", "Salmonella", "Bacillus", "Staphylococcus", "Streptococcus","Bhurkholderia"]
Lists can be manipulated directly - you can add or remove items by adding .append() or .remove() to the name of the list. You can also fetch specific values in a list by referencing their location. Python starts counting index positions at zero, so to fetch the second value in a list, we need to tell it to fetch position 1, not 2.
bacteria_genera.append("Clostridium")
print(bacteria_genera)
bacteria_genera.remove('Streptococcus')
print(bacteria_genera)
print('The second entry in our list is:',bacteria_genera[1])
['Escherichia', 'Salmonella', 'Bacillus', 'Staphylococcus', 'Streptococcus', 'Bhurkholderia', 'Clostridium']
['Escherichia', 'Salmonella', 'Bacillus', 'Staphylococcus', 'Bhurkholderia', 'Clostridium']
The second entry in our list is: Salmonella
If you want to interact with a certain element of the list, you can do that by referring to its numerical position in the list. A negative index counts from the end, so -1 refers to the last item.
print("The first item in the bacteria_genera list is:",bacteria_genera[0])
print("The last item in the bacteria_genera list is:",bacteria_genera[-1])
The first item in the bacteria_genera list is: Escherichia
The last item in the bacteria_genera list is: Clostridium
If you wanted to go through every item in the list, you can create a “for loop” to do that. This becomes very handy if you want to go through lots of data and do the same thing, and is not limited to lists. In the example below, we use a placeholder of ‘genus’ to hold the information we are getting each time we go through the loop - so it gets overwritten every time it goes through the next item. “For loops” are handy, but they can be inefficient in the long run - we’ll handle advanced ways to go through lists in the future.
for genus in bacteria_genera:
    print(genus)
Escherichia
Salmonella
Bacillus
Staphylococcus
Bhurkholderia
Clostridium
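A loop becomes more useful when it is combined with a condition. The sketch below, using a shortened example list and a made-up list name genera_starting_with_s, keeps only the genera whose names begin with "S".

```python
# A shortened example list, for illustration only.
example_genera = ["Escherichia", "Salmonella", "Bacillus", "Staphylococcus"]

genera_starting_with_s = []
for genus in example_genera:
    # str.startswith() returns True when the text begins with the given letters
    if genus.startswith("S"):
        genera_starting_with_s.append(genus)

print(genera_starting_with_s)
```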
Similar to lists, dictionaries store the values you pass to them - but they are indexed by key rather than by position. This means that you can give a dictionary a key to remember a value by and quickly retrieve that value by using the key at any time. When creating a dictionary, we provide the keys and values in one step, using curly brackets {} to tell Python that we are storing these data in a dictionary. We use dictionaries to store data for organization and speed of data retrieval. Like using an index in the back of a book - rather than scanning every page - we can quickly find where the data we are looking for is and retrieve it.
isolation_locations = {"Escherichia":"intestine", "Salmonella":"intestine", "Bacillus":"soil", "Staphylococcus":"skin", "Streptococcus":"throat","Bhurkholderia":"soil"}
isolation_locations["Escherichia"]
'intestine'
If you want to add a new value to the dictionary, you can do that after it is created by providing a new key and value pair:
isolation_locations["Clostridium"] = "soil"
isolation_locations
{'Escherichia': 'intestine',
'Salmonella': 'intestine',
'Bacillus': 'soil',
'Staphylococcus': 'skin',
'Streptococcus': 'throat',
'Bhurkholderia': 'soil',
'Clostridium': 'soil'}
Dictionaries and lists can be nested as well, so you can have a list of lists, or a dictionary of dictionaries. One format we will work with by the end of this lesson, JSON, can be manipulated like a dictionary of dictionaries!
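As a minimal sketch of nesting, the dictionary below maps each genus to its own small dictionary. The name genus_info and the keys inside it are made up for this example; values are reached by chaining one key after another.

```python
# Hypothetical nested dictionary: each genus maps to its own dictionary.
genus_info = {
    "Bacillus": {"isolation_location": "soil", "gram_stain": "positive"},
    "Escherichia": {"isolation_location": "intestine", "gram_stain": "negative"},
}

# The first key selects the inner dictionary, the second key selects the value.
print(genus_info["Bacillus"]["isolation_location"])

# New entries can be added to an inner dictionary in the same way.
genus_info["Bacillus"]["spore_forming"] = True
print(genus_info["Bacillus"])
```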
Exercise 1:
Let’s say we isolated a new genus from our experiments and wanted to update both the list and dictionary. Add a new genus to the list and update the dictionary with its isolation location:
### Exercise 1 Workspace:
Reading in Data Files
Constructing a dictionary or list one element at a time can be useful - but very tedious. Often, we have Excel spreadsheets or .csv/.tsv data files that contain the kinds of information we want to interact with. If you haven’t worked with a .tsv file before - it is very similar to a .csv except that each value is separated by a tab instead of a comma. Although Excel files are very human readable - they’re very cumbersome for computation. TSV and CSV files are very efficient, and we can parse them using Python’s built-in csv module. However, Python does not load every module by default; to stay lightweight, it keeps the functions we need tucked away until we ask for them. We tell Python to import the csv module so that we can read the data.
We will use a data file from the NP Atlas to construct a new list of bacteria that are relevant for natural product drug discovery. For that, let’s focus only on the genera reported for each compound’s initial discovery:
import csv
with open('NPAtlas_download.tsv', 'r') as file:
    line_reader = csv.reader(file, delimiter='\t')
    headers = next(line_reader)
    print(headers)
['npaid', 'compound_id', 'compound_name', 'compound_molecular_formula', 'compound_molecular_weight', 'compound_accurate_mass', 'compound_m_plus_h', 'compound_m_plus_na', 'compound_inchi', 'compound_inchikey', 'compound_smiles', 'compound_cluster_id', 'compound_node_id', 'origin_type', 'genus', 'origin_species', 'original_reference_author_list', 'original_reference_year', 'original_reference_issue', 'original_reference_volume', 'original_reference_pages', 'original_reference_doi', 'original_reference_pmid', 'original_reference_title', 'original_reference_type', 'original_journal_title', 'synonyms_dois', 'reassignment_dois', 'synthesis_dois', 'mibig_ids', 'gnps_ids', 'cmmc_ids', 'npmrd_id', 'npatlas_url']
You may notice that none of the headers contain spaces. This is because certain types of functions and files do not behave very well with spaces and it is usually best practice to use an underscore instead. For our task, two headers are going to be very important: “origin_type” and “genus”
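Rather than counting columns by hand, one option is to ask the header list where a column sits by using the list method .index(). The sketch below runs on a tiny in-memory stand-in for a TSV file; the rows and compound names are hypothetical and only the header names are taken from the real file.

```python
import csv
import io

# A tiny in-memory stand-in for a TSV file (hypothetical rows, for illustration only).
example_tsv = (
    "npaid\tcompound_name\tgenus\n"
    "NPA000001\tcompound A\tCurvularia\n"
    "NPA000002\tcompound B\tStreptomyces\n"
)

line_reader = csv.reader(io.StringIO(example_tsv), delimiter="\t")
headers = next(line_reader)              # the first row holds the column names
genus_column = headers.index("genus")    # position of the 'genus' column

example_genera = []
for each_row in line_reader:
    example_genera.append(each_row[genus_column])

print(genus_column, example_genera)
```

Looking up the position by name keeps the code working even if the column order of a file changes.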
It is possible to use the csv package to parse all these data and populate a new list of genera that we can focus on. The csv file reader works on a line-by-line basis. It reads the lines we tell it (typically, every line) one at a time, fetches the data, and we can use it. Let’s construct a list of all of the genera contained in the NPAtlas and print the first 10 entries in the list:
np_atlas_genera = []

with open('NPAtlas_download.tsv', 'r') as file:
    line_reader = csv.reader(file, delimiter='\t')
    for each_row in line_reader:
        np_atlas_genera.append(each_row[14]) #this adds the 15th column of each record to the list, which is the genus column - the very first column is 0
print(np_atlas_genera[0:10])
['genus', 'Curvularia', 'Diaporthe', 'Streptomyces', 'Vibrio', 'Fusarium', 'Microbispora', 'Chaetomium', 'Myxococcus', 'Penicillium']
You will notice that the first entry in the list is our header - ‘genus’. Since we don’t care about this, we can filter it out as we construct the list. Also, because a list doesn’t care about repeated values, we have MANY duplicates in our list. Let’s filter out duplicates at the end by converting our list to a set(). A set behaves much like a list, but it is unordered and each value can appear in it only once.
np_atlas_genera = []

with open('NPAtlas_download.tsv', 'r') as file:
    line_reader = csv.reader(file, delimiter='\t')
    headers = next(line_reader) # read in the column headers so that we know we can skip adding it to the list
    for each_row in line_reader:
        if each_row[14] not in headers: # exclude the header value (hint: you can also use )
            np_atlas_genera.append(each_row[14]) #this adds the 15th column of each record to the list, which is the genus column - Remember: the very first column is 0

print('The NPAtlas database contains',len(np_atlas_genera),'genera from bacteria, fungi, and archaea.')
np_atlas_set = set(np_atlas_genera)

print('After removing duplicate values, there are',len(np_atlas_set),'unique genera in the NPAtlas dataset.')
The NPAtlas database contains 36454 genera from bacteria, fungi, and archaea.
After removing duplicate values, there are 1246 unique genera in the NPAtlas dataset.
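The effect of set() is easier to see on a small example. The short list below is made up for illustration; because a set has no fixed order, sorted() is used to get a predictable list back.

```python
# A short made-up list with repeated values, for illustration only.
genera_with_repeats = ["Streptomyces", "Vibrio", "Streptomyces", "Fusarium", "Vibrio"]

unique_genera = set(genera_with_repeats)   # duplicates are dropped
print(len(genera_with_repeats), "entries,", len(unique_genera), "unique")

# Sets are unordered, so sort them if you need a stable order.
print(sorted(unique_genera))
```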
Exercise 2:
If we only wanted the genera of bacteria to be added to our list of bacteria_genera - how could you alter the code above?
## Exercise 2 Workspace:
To see the solution to this exercise, remove the # at the beginning of the next line and run the cell.
# %load ./exercise_solutions/exercise_2.py
Python Packages
Python has quite a few in-built functions - but these are rarely all you need. Specific packages are created to tackle one problem or manipulate data faster than we could code ourselves.
There are packages for all kinds of scientific programming and data including:
* Mass Spectrometry Data
* NMR Data
* Statistics and Bioinformatics
* Figure Generation
* Interaction with APIs
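Some packages, such as csv, ship with Python’s standard library, but they are not available in your code until you import them. Others are third-party packages that are included in many scientific Python environments but may otherwise need to be installed first. Requests is one of these third-party packages, and we’ll use it to retrieve data from a website.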
To import a package, you call it by name using an import statement.
import requests
Sometimes, you don’t want to type out the entire package name - and we’ll see why later. For now, let’s import requests under the alias “r”.
import requests as r
Most websites have documentation on how to interact with their APIs, which we can use in conjunction with requests to find information VERY quickly.
The NPAtlas documentation can be found here
For now, we are going to focus on simply searching for a compound by its NPAID (Natural Products Atlas ID) and using a “GET” request to get the information.
When we construct the url, we can simply add in the variable we want and add the strings together, as is shown below. Alternatively, we could also use a concept called f-string construction (which we will not cover, but is shown below). f-strings can be very helpful if you have something change in the middle of a URL, or many variables - but for now, they are not required.
npaid = "NPA024652"
response = r.get("https://www.npatlas.org/api/v1/compound/"+npaid)
# response = r.get(f"https://www.npatlas.org/api/v1/compound/{npaid}")
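To compare the two ways of building the URL without contacting the server, the sketch below only constructs the strings. The variable names base_url, url_by_addition, and url_by_fstring are made up for this example.

```python
base_url = "https://www.npatlas.org/api/v1/compound/"
npaid = "NPA024652"

# Option 1: add the two strings together.
url_by_addition = base_url + npaid

# Option 2: an f-string, where {name} is replaced by the variable's value.
url_by_fstring = f"{base_url}{npaid}"

print(url_by_addition)
print(url_by_addition == url_by_fstring)
```

Both approaches give the same text; f-strings mainly pay off when the changing part sits in the middle of the URL.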
We can view the data in several different ways, including as a text string or parsed from JSON for quick and easy sorting of any values it returns. Usually, the documentation will tell you whether to expect a quick and simple value - or a laundry list of properties, often stored as JSON. Since the Atlas contains a wealth of information, it’s easy to see the advantages - try flipping between the two by removing the #:
response.text
# response.json()
'{"id":24652,"npaid":"NPA024652","original_name":"Streptomycin","mol_formula":"C21H39N7O12","mol_weight":"581.5800","exact_mass":"581.2657","inchikey":"UCSJYZPVAKXKNQ-HZYVHMACSA-N","smiles":"C[C@H]1[C@@]([C@H]([C@@H](O1)O[C@@H]2[C@H]([C@@H]([C@H]([C@@H]([C@H]2O)O)N=C(N)N)O)N=C(N)N)O[C@H]3[C@H]([C@@H]([C@H]([C@@H](O3)CO)O)O)NC)(C=O)O","cluster_id":592,"node_id":528,"has_exclusions":false,"synonyms":[],"inchi":"InChI=1S/C21H39N7O12/c1-5-21(36,4-30)16(40-17-9(26-2)13(34)10(31)6(3-29)38-17)18(37-5)39-15-8(28-20(24)25)11(32)7(27-19(22)23)12(33)14(15)35/h4-18,26,29,31-36H,3H2,1-2H3,(H4,22,23,27)(H4,24,25,28)/t5-,6-,7+,8-,9-,10-,11+,12-,13-,14+,15+,16-,17-,18-,21+/m0/s1","m_plus_h":"582.2730","m_plus_na":"604.2549","origin_reference":{"doi":"10.1021/ja01187a006","pmid":18875100,"authors":"Kuehl, FA; Peck, RL; Hoffhine Jr, CE;.Folkers, K","title":"Streptomyces antibiotics; structure of streptomycin.","journal":"Journal of the American Chemical Society","year":1948,"volume":"70","issue":"7","pages":"2325-2330"},"origin_organism":{"id":670,"type":"Bacterium","genus":"Streptomyces","species":"griseus","taxon":{"id":283,"name":"Streptomyces","rank":"genus","taxon_db":"lpsn","external_id":"517119","ncbi_id":1883,"ancestors":[{"id":1,"name":"Bacteria","rank":"domain","taxon_db":"lpsn","external_id":"0","ncbi_id":2},{"id":203,"name":"Actinobacteria","rank":"phylum","taxon_db":"lpsn","external_id":"0","ncbi_id":201174},{"id":204,"name":"Actinobacteria","rank":"class","taxon_db":"lpsn","external_id":"0","ncbi_id":null},{"id":275,"name":"Streptomycetales","rank":"order","taxon_db":"lpsn","external_id":"0","ncbi_id":85011},{"id":276,"name":"Streptomycetaceae","rank":"family","taxon_db":"lpsn","external_id":"0","ncbi_id":2062}]}},"syntheses":["10.7164/antibiotics.27.997"],"reassignments":[],"mol_structures":[{"current_structure":true,"reference_doi":"10.1021/ja01187a006","structure_smiles":"C[C@H]1[C@@]([C@H]([C@@H](O1)O[C@@H]2[C@H]([C@@H]([C@H]([C@@H]([C@H]2O)O)N=C(N)N)O)N=C(N)N)O[C@H
]3[C@H]([C@@H]([C@H]([C@@H](O3)CO)O)O)NC)(C=O)O","is_reassignment":false,"version":1}],"exclusions":[],"external_ids":[{"external_db_name":"mibig","external_db_code":"BGC0000717"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00012112970%Suspect related to Massbank: Streptomycin (predicted molecular formula SIRIUS: C22H43N7O13 / BUDDY: C33H43NO10) with delta m/z 32.026 (putative explanation: unspecified; atomic difference: 1C,4H,1O) [M+H]+%4"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00000075309%Streptomycin%3!CCMSLIB00000075310%Streptomycin%3!CCMSLIB00000206377%Massbank: Streptomycin%3!CCMSLIB00000206378%Massbank: Streptomycin%3!CCMSLIB00000206379%Massbank: Streptomycin%3!CCMSLIB00000206380%Massbank: Streptomycin%3!CCMSLIB00000206381%Massbank: Streptomycin%3!CCMSLIB00000220513%Massbank:KO003997 Streptomycin%3!CCMSLIB00000220515%Massbank:KO003998 Streptomycin%3!CCMSLIB00000220517%Massbank:KO003999 Streptomycin%3!CCMSLIB00000220519%Massbank:KO004000 Streptomycin%3!CCMSLIB00000220521%Massbank:KO004001 Streptomycin%3!CCMSLIB00000220524%Massbank:KO004002 Streptomycin%3!CCMSLIB00000220526%Massbank:KO004003 Streptomycin%3!CCMSLIB00000220528%Massbank:KO004004 Streptomycin%3!CCMSLIB00000220530%Massbank:KO004005 Streptomycin%3!CCMSLIB00000220532%Massbank:KO004006 Streptomycin%3!CCMSLIB00000570650%MoNA:2366441 Streptomycin (TN)%3!CCMSLIB00000571996%MoNA:2303472 Streptomycin%3!CCMSLIB00000571998%MoNA:2303213 Streptomycin%3!CCMSLIB00000572125%MoNA:2312045 Streptomycin%3!CCMSLIB00000572135%MoNA:2354240 Streptomycin%3!CCMSLIB00000574208%MoNA:2366441 Streptomycin (TN)%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005723215%Streptomycin_20eV%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005723216%Streptomycin_40eV%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005723217%Streptomycin_50eV%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00009952766%Suspect related to Massbank: Streptomycin (predicted molecular 
formula: C20H39N11O9) with delta m/z 18.01 (putative explanation: Proline oxidation to 5-hydroxy-2-aminovaleric acid|water; atomic difference: 2H,1O|2H,1O)%4"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00009952767%Suspect related to Massbank: Streptomycin (predicted molecular formula: C22H43N7O13) with delta m/z 32.026 (putative explanation: unspecified; atomic difference: 1C,4H,1O)%4"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005728509%Massbank:KO001831 Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005728729%Massbank:KO001828 Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005729154%Massbank:KO001827 Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005729246%Massbank:KO001830 Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005729422%Massbank:KO001829 Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005771303%Massbank: Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005771106%Massbank: Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005771137%Massbank: Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005771154%Massbank: Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00005771334%Massbank: Streptomycin%3"},{"external_db_name":"gnps","external_db_code":"CCMSLIB00012112969%Suspect related to Massbank: Streptomycin (predicted molecular formula SIRIUS: C20H39N11O9 / BUDDY: C32H41NO10) with delta m/z 18.01 (putative explanation: Proline oxidation to 5-hydroxy-2-aminovaleric acid|water; atomic difference: 2H,1O|2H,1O) [M+H]+%4"},{"external_db_name":"npmrd","external_db_code":"NP0008060"}]}'
As you can see - there’s quite a lot here from one simple request. APIs can offer quite a lot of information if you give them the right data to search. There are GET requests for quick inquiries, POST requests for specifying different types and levels of information (think of it as an ‘advanced search’ function), and PUT requests for updating databases or adding new information. Typically, PUT requests are locked down, but with the right credentials you can add new information for others to use.
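Returning to the difference between text and JSON: the sketch below works offline on a shortened excerpt of the Streptomycin response text shown earlier (most fields are omitted). It uses json.loads() from the standard library, which turns JSON text into nested dictionaries much as response.json() does for a live response.

```python
import json

# A shortened excerpt of the response text shown earlier (most fields omitted).
response_text = (
    '{"npaid":"NPA024652","original_name":"Streptomycin",'
    '"origin_organism":{"type":"Bacterium","genus":"Streptomyces","species":"griseus"}}'
)

compound = json.loads(response_text)   # JSON text -> nested dictionaries

# The text is one long string; the parsed version can be searched by key.
print(type(response_text), type(compound))
print(compound["original_name"])
print(compound["origin_organism"]["genus"], compound["origin_organism"]["species"])
```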
Functions
Sometimes, we are running through an analysis and just want bits and pieces of information from specific inputs. Luckily, we can design functions to take a number of inputs and give us results so we do not have to do things one variable at a time.
To start out, we can define a function in a script and then re-use it later. In advanced applications, you can import functions from other places and use them directly. This is handy if you re-use functions all the time, but don’t want to waste time rewriting them every time you make a new script. Take a look at the function below:
def get_compound_data(npaid):
    response = r.get("https://www.npatlas.org/api/v1/compound/"+npaid)
    return response.json()
Here, we’re constructing a function to take an atlas ID and return the information we want about that compound. If we had a list of compounds, we can fetch information on each one, parse it, and add in the relevant information to a list outside of the function. See below for a quick example:
npaid_list = ["NPA024602","NPA015585","NPA020595"]
for npaid in npaid_list:
    compound_data = get_compound_data(npaid)
    print(compound_data['original_name'],"is produced by",compound_data['origin_organism']['genus'],"and has a molecular weight of",compound_data['mol_weight'],"Da.")
Lincomycin is produced by Streptomyces and has a molecular weight of 406.5450 Da.
Erythromycin B is produced by Streptomyces and has a molecular weight of 717.9380 Da.
Collismycin A is produced by Streptomyces and has a molecular weight of 275.3330 Da.
In the above example, we are combining a number of things we’ve already learned - list construction, for-loops, dictionary manipulation, and retrieving information from a JSON response as if it were a dictionary filled with other dictionaries. As you can see, putting these elements together means you can find all kinds of information systematically in just a few lines of code.
In these examples, we use the NPAID - the number associated with a compound - to look at information. But how can we construct an API inquiry to search for compounds from a list of names?
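One way to store results in a list outside of a function is sketched below. The function summarize_compound and the stand-in record are hypothetical; the record only mimics the three fields used in the loop above (with the Lincomycin values from its output) so that the example runs without contacting the API.

```python
def summarize_compound(compound_data):
    """Build a one-line summary from a compound record (a nested dictionary)."""
    name = compound_data["original_name"]
    genus = compound_data["origin_organism"]["genus"]
    weight = compound_data["mol_weight"]
    return f"{name} is produced by {genus} and has a molecular weight of {weight} Da."

# Stand-in for a parsed API response; only the fields used above are included.
example_record = {
    "original_name": "Lincomycin",
    "origin_organism": {"genus": "Streptomyces"},
    "mol_weight": "406.5450",
}

summaries = []                      # the list lives outside the function
for record in [example_record]:
    summaries.append(summarize_compound(record))

print(summaries[0])
```

Returning a value instead of printing it inside the function leaves you free to store, sort, or filter the results later.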
HINT: use the NPAtlas API Documentation to see how to construct the URL
## Exercise 3 Workspace:
# %load ./exercise_solutions/exercise_3.py