Analyzing Large P Small N Data – Examples from Microbiome: In Partnership with Domino Data Lab
BioRankings wrote a guest blog for the Domino Data Lab Technical Blog on the curse of dimensionality in big data, with a specific example in microbiome data. Excerpt below.
Using automatic algorithms for data reduction has problems. Consider a dataset with 500 variables where PCA is used to reduce the dimensions to a handful for analysis. On close examination, PCA produces linear combinations of all 500 variables, which are uninterpretable: understanding the meaning and interactions of 500 coefficients is not possible. Because each PCA projection is uninterpretable, no learning is possible from it.
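A minimal sketch of the point above, using simulated data (the matrix dimensions and variable count are assumptions chosen to match the 500-variable example; scikit-learn's PCA stands in for whatever reduction tool is used):

```python
# Sketch (simulated data): PCA on a Large P Small N matrix --
# 50 subjects (rows) by 500 variables (columns).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))  # hypothetical high-throughput measurements

pca = PCA(n_components=5)
scores = pca.fit_transform(X)   # 50 x 5 projections used for analysis

# Each retained component is a linear combination of ALL 500 variables,
# so interpreting one component means reasoning about 500 coefficients.
print(pca.components_.shape)    # (5, 500): 500 loadings per component
print(scores.shape)             # (50, 5)
```

The dimensions are reduced, but every component still carries a full vector of 500 loadings, which is the interpretability problem described above.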
As we are still left with a Large P Small N problem, direct hypothesis testing is not possible. Instead, the goal at this stage of the analysis is interpretation and learning, and it is important to treat the analysis as exploratory. This exploratory analysis is not conducted by analyzing all the data at once to see what stands out. Instead of using algorithms to reduce the data, pick subsets of variables that are biologically meaningful and can be interpreted. Anything discovered in these subset analyses should be viewed as hypothesis generating and potentially testable in follow-up designed experiments.
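The subset-selection step above can be sketched as follows (all column names, counts, and the 12-cytokine list are hypothetical placeholders, not the study's actual variables):

```python
# Sketch (hypothetical names): instead of algorithmic reduction, select a
# biologically meaningful subset of variables from the wide data table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical study table: ~100 subjects, 500 assay columns.
data = pd.DataFrame(rng.normal(size=(100, 500)),
                    columns=[f"var_{i}" for i in range(500)])

# A curated subset chosen for a specific biological question --
# e.g. 12 cytokines (placeholder names).
cytokines = [f"var_{i}" for i in range(12)]
subset = data[cytokines]
print(subset.shape)  # (100, 12): small enough to interpret
```

The key design choice is that the subset is picked by biological reasoning before looking at the data, so any finding is hypothesis generating rather than a confirmatory test.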
Two analyses done by BioRankings are presented here as examples of this approach. Both involve high-throughput screening data and hundreds or thousands of variables from a hundred or so subjects in a pre-diabetes study conducted as part of the Integrative Human Microbiome Project. In the first analysis, 12 cytokines were selected to test whether changes in cytokines are associated with changes in the composition of the gut microbiome. In the second analysis, a subset of the genes from gut microbial taxa was selected to test whether gene copy number was associated with conversion from insulin sensitivity (good outcome) to insulin resistance (bad outcome).