## Cluster Analysis - Using GO terms

dataset

posted on 26.02.2018 by Andrew Landels, Caroline Evans

This dataset/code forms part of Andrew Landels' thesis: "Improving proteomic methods and investigating H2 production in Synechocystis sp. PCC6803" http://etheses.whiterose.ac.uk/id/eprint/19034

This code attempts to cluster proteins by relative intensity under different conditions, then assigns frequencies of Gene Ontology (GO) terms to each cluster. The theory behind the code is described in detail in chapter 4.7 of the thesis. The scripts are written in R and Mathematica, and the UniProt website is needed to generate the GO terms.

The first part of this pipeline is written in R. The input is tag-based proteomic data, where the first column lists UniProt IDs for the identified proteins and the subsequent columns contain protein quantifications (these can be absolute or relative). First, a list of unique proteins is output from the data. This list of UniProt IDs is then uploaded into a UniProt search. The UniProt results table is updated to include all GO terms by clicking the 'columns' button, selecting the GO Terms drop-down, and checking each of the boxes. These settings are applied by clicking 'save' in the top right-hand corner. The table is then downloaded for use further along in the pipeline.
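
The first step, producing a list of unique protein IDs from the first column of the input table, can be sketched as follows. This is a minimal Python illustration and not the R script from the dataset; the function name `unique_protein_ids`, the column headers, and the accessions are placeholders, and a comma-separated input with one header row is assumed.

```python
import csv
import io

def unique_protein_ids(table_text):
    """Return the unique IDs from the first column, in order of first appearance."""
    reader = csv.reader(io.StringIO(table_text))
    next(reader, None)  # skip the header row (assumed to be present)
    seen = {}
    for row in reader:
        if row and row[0].strip():
            # dict keys keep insertion order, so duplicates are dropped in place
            seen.setdefault(row[0].strip(), None)
    return list(seen)

# Placeholder input: an ID column followed by quantification columns.
example = """protein,cond_A,cond_B
P00001,1.00,1.42
P00002,0.85,0.60
P00001,1.10,1.38
"""

ids = unique_protein_ids(example)
print("\n".join(ids))  # one ID per line, ready to paste into a UniProt search
```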

The R script clusters the proteomic data and uses a K-means analysis to select a critical cut-off for the number of clusters. This value is chosen manually from a "within-groups sum of squares" graph. To build the graph, the script calculates the sum of squared distances from all points to a single central mean, then fits two means, creating two clusters, and calculates the sum of squares again. This is repeated until 20 means have been fitted to the data. On the resulting plot, the analyst looks for the smallest number of clusters that still gives a low sum of squares. In the worked example provided, 8 clusters were selected (highlighted by a vertical line on the plot).
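
The within-groups sum of squares curve can be sketched as below. This is a plain-Python illustration using a basic Lloyd-style K-means with a few random restarts, and it is an assumption that this matches the behaviour of the R script; the helper names (`kmeans`, `wss_curve`) and the toy data are hypothetical. Choosing the cut-off from the curve remains a manual step, as described above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, n_iter=50, seed=0):
    """Basic Lloyd iteration started from k randomly sampled data points."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        # Move each centre to the mean of its members (keep it if it has none).
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centres.append(mean_point(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

def within_ss(points, labels, centres):
    """Sum of squared distances from each point to its assigned centre."""
    return sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))

def wss_curve(points, max_k, n_starts=5):
    """Best within-groups sum of squares for k = 1 .. max_k."""
    curve = []
    for k in range(1, max_k + 1):
        best = min(within_ss(points, *kmeans(points, k, seed=s)) for s in range(n_starts))
        curve.append(best)
    return curve

# Toy data: three tight groups in two dimensions.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
          [10.0, 0.0], [10.1, 0.0], [10.0, 0.1]]

curve = wss_curve(points, 5)
for k, wss in enumerate(curve, start=1):
    print(k, round(wss, 3))
```

For the dataset described here the curve would run to 20 means rather than 5, and the analyst would read the cut-off from a plot of these values.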

The proteins were grouped into 8 clusters and each cluster was assigned a side colour. The cluster assignments were exported to a CSV file for later use. Finally, a heatmap was generated from the data using the gplots package. The heatmap has two dendrograms, one showing the relatedness of the labels and the other the relatedness of the proteins, with the selected clusters highlighted by the side colours.
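
As a rough illustration of the export step, the sketch below pairs each protein with its cluster number and a side colour and writes the result as CSV text. It is written in Python rather than R, and the palette, the column names, and the function names are assumptions made for the example, not taken from the dataset. The heatmap itself is not reproduced here.

```python
import csv
import io

# Hypothetical palette: one colour per cluster, eight in total.
PALETTE = ["red", "blue", "green", "orange", "purple", "brown", "pink", "grey"]

def cluster_table(protein_ids, labels, palette=PALETTE):
    """Build one row per protein from zero-based cluster labels."""
    rows = []
    for pid, lab in zip(protein_ids, labels):
        rows.append({
            "protein": pid,
            "cluster": lab + 1,  # report clusters as 1..8
            "side_colour": palette[lab % len(palette)],
        })
    return rows

def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["protein", "cluster", "side_colour"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = cluster_table(["P00001", "P00002", "P00003"], [0, 7, 0])
csv_text = to_csv(rows)
print(csv_text)
```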

The next part of the analysis was performed in Mathematica. The GO terms downloaded from UniProt were linked to each protein on the list. To de-clutter the data and simplify the analysis, only GO terms with 20 or more unique references in the dataset were kept; the remaining terms were discarded. The remaining GO terms within each cluster were tallied, producing a matrix of GO terms with a count for each cluster. The values for each cluster were divided by the sum of all observations, producing values between 0 and 1 for each term in each cluster.
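
The filtering, tallying, and scaling steps can be sketched as below. This is a Python restatement for illustration, not the Mathematica notebook from the dataset. Two assumptions are made: "unique references" is read as the number of distinct proteins carrying a term, and "the sum of all observations" is read as the total count of retained terms within a cluster. The function name, the dictionary inputs, and the toy identifiers are hypothetical, and parsing the UniProt download into these dictionaries is omitted.

```python
from collections import Counter, defaultdict

def go_frequencies(protein_go, protein_cluster, min_proteins=20):
    """Per-cluster GO term frequencies, keeping only well-supported terms."""
    # Count how many distinct clustered proteins carry each GO term.
    support = Counter()
    for pid, terms in protein_go.items():
        if pid in protein_cluster:
            support.update(set(terms))
    kept = {term for term, n in support.items() if n >= min_proteins}

    # Tally the retained terms within each cluster.
    counts = defaultdict(Counter)
    for pid, cluster in protein_cluster.items():
        for term in set(protein_go.get(pid, ())):
            if term in kept:
                counts[cluster][term] += 1

    # Scale each cluster by its own total (assumed reading of the text).
    freqs = {}
    for cluster, tally in counts.items():
        total = sum(tally.values())
        freqs[cluster] = {term: tally[term] / total for term in sorted(kept)}
    return freqs

# Toy data with a lowered threshold so that two terms survive the filter.
protein_go = {
    "P1": ["GO:x", "GO:y"],
    "P2": ["GO:x"],
    "P3": ["GO:y", "GO:z"],
    "P4": ["GO:x", "GO:y"],
}
protein_cluster = {"P1": 1, "P2": 1, "P3": 2, "P4": 2}

freqs = go_frequencies(protein_go, protein_cluster, min_proteins=2)
for cluster in sorted(freqs):
    print(cluster, freqs[cluster])
```

With the default `min_proteins=20` the filter matches the threshold described above; the toy call lowers it only because the example has four proteins.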

This list was then plotted to show the GO distribution across each of the selected clusters, enabling analysis of GO term concentration within clusters of the dataset.
