Random projection

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are powerful, known for their simplicity and for introducing less error than many alternative methods. Experimental results suggest that random projection preserves distances well, although systematic empirical evidence for this remains sparse.[1]

Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration, using mathematical methods from statistics and machine learning. It is often employed to ease the management and manipulation of large data sets. Dimensionality reduction techniques generally use linear transformations to determine the intrinsic dimensionality of the data manifold and to extract its principal directions. Related techniques include principal component analysis, linear discriminant analysis, canonical correlation analysis, the discrete cosine transform, and random projection.

Random projection is a simple and computationally efficient way to reduce the dimensionality of data by trading a controlled amount of error for faster processing times and smaller model sizes. The dimensions and distribution of random projection matrices are controlled so as to approximately preserve the pairwise distances between any two samples of the dataset.

The core idea behind random projection is given in the Johnson-Lindenstrauss lemma,[2] which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves the distances between the points.
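In a standard quantitative form (paraphrased here rather than quoted from [2]): for any 0 < \varepsilon < 1 and any set of n points in \mathbb{R}^d, there exists a linear map f : \mathbb{R}^d \to \mathbb{R}^k with k = O(\varepsilon^{-2} \log n) such that, for every pair of points u and v in the set,

(1 - \varepsilon)\,\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \varepsilon)\,\|u - v\|^2.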

In random projection, the original d-dimensional data is projected to a k-dimensional subspace (k \ll d) using a random k \times d matrix R whose rows have unit length. In matrix notation: if X_{d \times N} is the original set of N d-dimensional observations, then X_{k \times N}^{RP} = R_{k \times d} X_{d \times N} is the projection of the data onto the lower k-dimensional subspace. Random projection is computationally simple: forming the random matrix R and projecting the d \times N data matrix X onto the k dimensions takes time of order O(dkN). If the data matrix X is sparse with about c nonzero entries per column, then the complexity of this operation is of order O(ckN).[3]
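
As a concrete illustration, here is a minimal sketch of this projection step in NumPy (the variable names, sizes, and random seed are illustrative assumptions, not taken from the cited sources):

    import numpy as np

    rng = np.random.default_rng(0)
    d, N, k = 1000, 500, 50                  # original dim, sample count, reduced dim

    X = rng.standard_normal((d, N))          # original d x N data matrix
    R = rng.standard_normal((k, d))          # random k x d matrix
    R /= np.linalg.norm(R, axis=1, keepdims=True)   # normalize rows to unit length

    X_rp = R @ X                             # projected k x N data; costs O(dkN)

    # With unit-length rows, projected distances shrink by roughly sqrt(k/d);
    # rescaling by sqrt(d/k) (a common convention, not stated in the text above)
    # restores the original scale.
    orig = np.linalg.norm(X[:, 0] - X[:, 1])
    proj = np.sqrt(d / k) * np.linalg.norm(X_rp[:, 0] - X_rp[:, 1])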

The random matrix R can be generated using a Gaussian distribution.[4] The first row is a random unit vector uniformly chosen from S^{d-1}. The second row is a random unit vector from the space orthogonal to the first row, the third row is a random unit vector from the space orthogonal to the first two rows, and so on. In this way of choosing R, the following properties are satisfied:

  - Spherical symmetry: for any orthogonal matrix A \in O(d), RA and R have the same distribution.
  - Orthogonality: the rows of R are orthogonal to each other.
  - Normality: the rows of R are unit-length vectors.
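
A minimal sketch of this construction, assuming NumPy (the QR decomposition of a Gaussian matrix carries out the successive orthogonalization described above; names and seed are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    d, k = 1000, 50                   # original and reduced dimensions

    G = rng.standard_normal((d, k))   # d x k matrix of i.i.d. Gaussian entries
    Q, _ = np.linalg.qr(G)            # columns of Q are orthonormal
    R = Q.T                           # k x d projection matrix with orthonormal rows

    # Sanity check: orthonormal rows imply R R^T = I_k
    assert np.allclose(R @ R.T, np.eye(k))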

Achlioptas[5] has shown that the Gaussian distribution can be replaced by a much simpler distribution such as

R_{i,j} = \sqrt{3} \begin{cases}
+1 & \text{with probability }\frac{1}{6}\\
0 & \text{with probability }\frac{2}{3}\\
-1 & \text{with probability }\frac{1}{6} \end{cases}

This is efficient for database applications because the computations can be performed using integer arithmetic.
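
A minimal sketch under these assumptions (illustrative names and data; the 1/\sqrt{k} factor is a standard convention not stated above, which makes squared distances unbiased in expectation):

    import numpy as np

    rng = np.random.default_rng(2)
    d, N, k = 1000, 500, 50

    X = rng.integers(-100, 100, size=(d, N))                    # integer-valued data
    R = rng.choice([1, 0, -1], size=(k, d), p=[1/6, 2/3, 1/6])  # sparse sign matrix

    # The matrix product R @ X uses only integer arithmetic; the
    # sqrt(3)/sqrt(k) scaling is applied once afterwards.
    X_rp = np.sqrt(3.0 / k) * (R @ X)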

References

  1. Bingham, Ella; Mannila, Heikki (2001). "Random projection in dimensionality reduction: Applications to image and text data". KDD-2001: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery. pp. 245–250. doi:10.1145/502512.502546. Retrieved 22 Dec 2015.
  2. Johnson, William B.; Lindenstrauss, Joram (1984). "Extensions of Lipschitz mappings into a Hilbert space". Conference in Modern Analysis and Probability (New Haven, Conn., 1982). Contemporary Mathematics 26. Providence, RI: American Mathematical Society. pp. 189–206. doi:10.1090/conm/026/737400. MR 737400.
  3. Bingham, Ella; Mannila, Heikki (May 6, 2014). "Random projection in dimensionality reduction: Applications to image and text data" (PDF).
  4. Ailon, N.; Chazelle, B. (2009). "The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors". SIAM Journal on Computing 39: 302. doi:10.1137/060673096.
  5. Achlioptas, Dimitris (2001). "Database-friendly random projections". Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '01. p. 274. doi:10.1145/375551.375608. ISBN 1581133618.