
Data Analysis

Our original hypothesis concerned the differences and similarities between three hemagglutinin proteins: H1 and H3, which are currently circulating in the population, and H5, the last circulating strain. We expected H1 and H3 to show more areas of homology, and more similar amino acid sequences, with each other than either would with the H5 protein.

 

We first compared the amino acid frequencies of the three proteins. The results disagreed with our original hypothesis. Rather than H1 and H3 sharing similar amino acid frequencies, H5 and H1 were the more similar pair, while H3 differed in a few key percentages, notably those of glutamic acid, isoleucine, and glutamine. For H1 and H5, the percentage of each amino acid did not differ by more than 1%.
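The frequency comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used for the analysis: the function names `aa_frequencies` and `max_difference` are hypothetical, and the two short sequences are invented placeholders rather than real hemagglutinin data.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(seq):
    """Return the percentage of each standard amino acid in a protein sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def max_difference(freq_a, freq_b):
    """Largest absolute difference, in percentage points, over all amino acids."""
    return max(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)


# Invented placeholder sequences (assumption: NOT real hemagglutinin data).
toy_a = "MKAILVVLLY"
toy_b = "MKAILVVLLQ"

freq_a = aa_frequencies(toy_a)
freq_b = aa_frequencies(toy_b)
print("largest difference:", max_difference(freq_a, freq_b))
```

With full-length sequences, a small `max_difference` between two proteins would correspond to the kind of close agreement reported for H1 and H5.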

 

Next, we ran a multiple sequence alignment of H1, H3, H5, and an H1 sequence from the 1980s. Again, H1 and H5 were more similar to one another than either was to H3. The older H1 sequence also fit this pattern and did not differ drastically from the more recent H1 sequence, indicating that while mutation has occurred, it has not significantly altered the protein.
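One common way to summarize an alignment like this is pairwise percent identity. The sketch below assumes the sequences have already been aligned by a separate tool (none is specified here) and are padded with "-" for gaps. The function name `percent_identity` is hypothetical, and the toy alignment is invented for illustration. It does not reproduce our data.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of compared alignment columns where both sequences carry the same residue.

    Columns where both sequences have a gap are skipped; a gap opposite a
    residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented toy alignment (assumption: placeholder rows, NOT real sequences).
alignment = {
    "toy_A": "MKAIL-VLLY",
    "toy_B": "MKAIL-VLLF",
    "toy_C": "MKTILAVLLY",
    "toy_D": "MQTI-ALSYI",
}

identities = {
    (n1, n2): percent_identity(alignment[n1], alignment[n2])
    for n1, n2 in combinations(alignment, 2)
}
for pair, value in sorted(identities.items(), key=lambda item: -item[1]):
    print(pair, round(value, 1))
```

Sorting the pairs by identity puts the most similar sequences at the top of the list, which is the comparison made in the paragraph above.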

 

Then, we ran the protein sequences through a k-means clustering program. This sorted the sequences into groups, in our case two, based on their amino acid frequencies. Proteins assigned to the same group share more similarities with one another than with members of other groups. The clustering again showed that H1 and H5, which were grouped together, are more similar to one another than to H3, which was grouped separately.
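To show what k-means does with frequency data, here is a minimal pure-Python sketch with two groups. It is not the clustering program used for the analysis. The deterministic farthest-point seeding is our own simplifying assumption (many implementations seed randomly), the function names are hypothetical, and the three-value frequency vectors are invented placeholders.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k=2, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point seeding (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from every centroid chosen so far.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


# Invented frequency vectors (assumption: placeholder percentages, NOT our data).
names = ["toy_A", "toy_B", "toy_C", "toy_D"]
vectors = [
    [6.0, 6.5, 3.0],
    [6.2, 6.4, 3.1],
    [6.5, 6.0, 3.4],
    [4.0, 8.5, 5.0],
]

labels, centroids = kmeans(vectors, k=2)
for name, label in zip(names, labels):
    print(name, "-> group", label)
```

The group numbers themselves carry no meaning; only which sequences share a number matters.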

 

Finally, hierarchical clustering was performed on the four hemagglutinin proteins. This clustering relates the conserved regions of one protein to those of another, and our results were consistent with the k-means clustering. The visual representation of the data, which marks zero regions in red and positive regions in turquoise, shows almost uniform lines through the H5 and two H1 sequences, but different lines through the H3 sequence. This is further evidence of the close relationship between H5 and H1 and of the separateness of H3.
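The general idea of hierarchical clustering can be sketched as bottom-up (agglomerative) merging. The version below assumes Euclidean distance and average linkage, neither of which is specified for the program we used, and it does not produce the red and turquoise plot described above. The function names and the input vectors are hypothetical placeholders.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        euclidean(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        distance = average_linkage(a, b, vectors)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, distance))
    return merges


# Invented feature vectors (assumption: placeholder values, NOT our data).
toy_vectors = {
    "toy_A": [6.0, 6.5, 3.0],
    "toy_B": [6.2, 6.4, 3.1],
    "toy_C": [6.5, 6.0, 3.4],
    "toy_D": [4.0, 8.5, 5.0],
}

merges = agglomerate(toy_vectors)
for a, b, distance in merges:
    print(a, "+", b, "at distance", round(distance, 3))
```

Sequences that merge early, at small distances, are the most alike; a sequence that joins only in the final merge is the outlier, which is the role H3 played in our results.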

 

To confirm that these similarities were not limited to one geographic region and did, indeed, depend on each protein's unique amino acid sequence, we performed the same analysis on H1, H3, and H5 protein sequences found in Thailand. Again, the H1 and H5 proteins were more similar to one another than either was to the H3 protein.

 

Based on our results, we predict that, because of the great similarity between the H1 and H5 proteins, the two share an evolutionary relationship and may have a common ancestor. H3, on the other hand, possibly came from another viral precursor or is more distantly related to the other two proteins.

 

 
