## Proteomic background in Synechocystis


Posted on 26.02.2018 by Andrew Landels


This dataset/code forms part of Andrew Landels' thesis: "Improving proteomic methods and investigating H2 production in Synechocystis sp. PCC6803" http://etheses.whiterose.ac.uk/id/eprint/19034

The code for the methodology described below was written in Wolfram Mathematica (version 10.1); the notebook file is "iTRAQ\_TMT-complexity\_emPAI.nb".

An in-depth proteomic dataset, comprising two 8-plex iTRAQ experiments investigating a mutant against wild-type (WT) Synechocystis under two different conditions, was generated on a Q-Exactive HF mass spectrometer (data not included in this repository due to size constraints). To calculate the emPAI scores, the 'observable' peptide values were calculated as follows. The complete proteome for Synechocystis PCC6803 (Kazusa strain) was downloaded as a FASTA file from UniProt (taxonomy:1111708; accessed August 2015; 3517 protein entries); this file is available in this repository.
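As an illustration of the input step, a FASTA file can be read into a header-to-sequence mapping with a few lines of Python. This is a minimal sketch and not the author's Mathematica notebook; the function name `parse_fasta`, the file name in the comment, and the example records are hypothetical.

```python
def parse_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up records for illustration only (not real UniProt entries).
example = [
    ">example_protein_1 hypothetical",
    "MKTAYIAK",
    "QRQISFVK",
    ">example_protein_2 hypothetical",
    "MSLLR",
]
proteins = parse_fasta(example)

# With a real file (hypothetical file name):
# with open("proteome.fasta") as handle:
#     proteins = parse_fasta(handle)
```

Accepting any iterable of lines keeps the parser usable with both files and in-memory test data.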

This was then merged with the spike-in proteins to make a single database for analysing the data; this ensured that effects on statistical methods such as false discovery were equal between all analyses. The FASTA file was processed in Wolfram Mathematica (version 10.1) to generate an in-silico digest of each protein, excluding any peptides that fell outside a 1000–7500 Da window, to replicate the 2+ or 3+ ions observable in the 500–2500 m/z window used during the mass spectrometry experimental scan. The emPAI scores for all identified proteins were calculated using the following formula.

\[
\mathrm{emPAI} = 10^{\left(\frac{N_{observed}}{N_{observable}}\right)} - 1
\]

Where $N_{observed}$ is the number of unique peptides observed for a given protein, and $N_{observable}$ is the total number of unique peptides that could be observed for a given protein.
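The digest, mass filter, and emPAI formula can be sketched in Python as follows. This is an illustrative sketch, not the author's Mathematica code. It assumes a tryptic cleavage rule (cut after K or R unless the next residue is P), no missed cleavages, and unmodified monoisotopic masses; none of these details are stated in the description above. All function names are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def digest(sequence):
    """Cleave after K or R unless followed by P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass in Da, or None if a residue is not in the table."""
    total = WATER
    for residue in peptide:
        mass = RESIDUE_MASS.get(residue)
        if mass is None:
            return None  # e.g. ambiguous codes such as X
        total += mass
    return total


def observable_peptides(sequence, low=1000.0, high=7500.0):
    """Unique digest peptides whose mass lies inside the [low, high] Da window."""
    kept = set()
    for peptide in digest(sequence):
        mass = peptide_mass(peptide)
        if mass is not None and low <= mass <= high:
            kept.add(peptide)
    return kept


def empai(n_observed, n_observable):
    """emPAI = 10^(N_observed / N_observable) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1
```

For example, a protein with 3 of its 6 observable peptides identified gives `empai(3, 6)`, which is 10^0.5 - 1, roughly 2.16. Proteins with no peptide inside the mass window have no defined emPAI under this sketch, so the function raises an error rather than dividing by zero.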

This data was then graphed as a histogram to identify the protein concentration distribution and dynamic range. Dynamic range was calculated by taking the exponential of the difference between the maximal and minimal emPAI values.
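The dynamic range calculation can be sketched as below. The description does not state which base the exponential uses, so the base is left as a parameter; the default of e is an assumption based on a literal reading of "exponential", and the function name is hypothetical.

```python
import math


def dynamic_range(empai_values, base=math.e):
    """Return base ** (max - min) over the given emPAI values.

    The base is an assumption: the source only says "the exponential
    of the difference between the maximal and minimal emPAI values".
    """
    values = list(empai_values)
    if not values:
        raise ValueError("at least one emPAI value is required")
    return base ** (max(values) - min(values))
```

Passing `base=10` instead gives the base-10 reading of the same sentence, which may be the more natural choice given that emPAI itself is defined with a power of 10.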
