Curse of Dimensionality: In partnership with Domino Data Lab
BioRankings wrote a guest blog for the Domino Data Lab Technical Blog on the curse of dimensionality in big data. Excerpt below.
Danger of Big Data
Big data is all the rage. It can mean many rows (samples) and few columns (variables), as in credit card transaction data, or many columns (variables) and few rows (samples), as in genomic sequencing in life sciences research. The Curse of Dimensionality, also called the Large P, Small N (P >> N) problem, applies to the latter case: many variables measured on relatively few samples.
Each variable in a data set is a dimension, and the set of variables defines the space in which the samples fall. Consider a two-dimensional space defined by the height and weight of grade school students. Each student is represented as a point on a plot, with the X axis (dimension) being height and the Y axis (dimension) being weight. In general, older students are taller and heavier, so their points are more likely to fall in the upper right region of the space. Statistical methods for analyzing such two-dimensional data exist. MANOVA, for example, can test whether heights and weights differ between boys and girls. This statistical test is valid because the data are (presumably) bivariate normal.
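To make the height and weight comparison concrete, here is a minimal sketch in Python. With only two groups, MANOVA reduces to Hotelling's two-sample T-squared test, so that is what the code computes. Everything specific here is an assumption for illustration: the data are simulated, the means and spreads are made up, and the function names `simulate_students` and `hotelling_t2_two_sample` are hypothetical. The closed-form p-value used at the end only holds when there are exactly two variables.

```python
import random


def simulate_students(n, mean_height, mean_weight, rng):
    """Simulate (height, weight) pairs; all numbers are illustrative only."""
    students = []
    for _ in range(n):
        height = rng.gauss(mean_height, 7.0)
        # Weight rises with height, plus independent noise.
        weight = mean_weight + 0.5 * (height - mean_height) + rng.gauss(0.0, 4.0)
        students.append((height, weight))
    return students


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def hotelling_t2_two_sample(a, b):
    """Two-sample Hotelling T-squared test for two variables.

    Returns (t2, f_stat, p_value). Assumes both groups share one
    covariance matrix and are bivariate normal.
    """
    n1, n2 = len(a), len(b)
    m1, m2 = mean_vector(a), mean_vector(b)
    sa, sb = scatter_matrix(a, m1), scatter_matrix(b, m2)
    dof = n1 + n2 - 2
    pooled = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]

    # Invert the 2 x 2 pooled covariance matrix by hand.
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [
        [pooled[1][1] / det, -pooled[0][1] / det],
        [-pooled[1][0] / det, pooled[0][0] / det],
    ]

    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    t2 = (n1 * n2) / (n1 + n2) * quad

    p = 2  # number of variables (height, weight)
    df2 = n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * dof)  # follows F(p, df2) under the null

    # For numerator df = 2 the F survival function has a closed form.
    p_value = (1.0 + 2.0 * f_stat / df2) ** (-df2 / 2.0)
    return t2, f_stat, p_value


rng = random.Random(42)
boys = simulate_students(40, 142.0, 37.0, rng)
girls = simulate_students(40, 140.0, 35.0, rng)
t2, f_stat, p_value = hotelling_t2_two_sample(boys, girls)
print(f"T2 = {t2:.2f}, F = {f_stat:.2f}, p = {p_value:.4f}")
```

With two variables and dozens of students per group, the pooled covariance matrix is well estimated and easy to invert. In practice a statistics package would be used instead of hand-rolled matrix algebra; the point of the sketch is only to show how little is needed when P is small relative to N.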
When there are many variables, the Curse of Dimensionality changes the behavior of the data, and standard statistical methods give the wrong answers. This increases costs, because incorrect results get followed up with expensive and time-consuming experiments, and it slows down product development pipelines. In this blog we show how the behavior of data changes in high dimensions. In our next blog we discuss how we try to avoid these problems in applied analysis of high-dimensional data.
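As a small illustration of data behaving differently in high dimensions, the sketch below shows one widely documented effect: distances between random points become nearly indistinguishable as the number of variables grows. This is offered as an example of the general idea, not as a summary of the full post. The uniform random data, the sample size of 200, and the function name `relative_contrast` are all assumptions made for the demonstration.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point.

    Points are drawn uniformly from the unit hypercube. A large value means
    the nearest neighbor is much closer than the farthest point; a value
    near zero means all points are roughly equally far away.
    """
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


rng = random.Random(0)
for n_dims in (2, 10, 100, 1000):
    contrast = relative_contrast(200, n_dims, rng)
    print(f"{n_dims:>5} dimensions: relative contrast = {contrast:.3f}")
```

The contrast should drop sharply as dimensions are added, which suggests why methods that rely on some samples being "close" and others "far" can lose their footing when P is large.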
Read more at: https://blog.dominodatalab.com/the-curse-of-dimensionality/