RandomRDDs
- class pyspark.mllib.random.RandomRDDs
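- Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
- New in version 1.1.0.
- Methods
  - exponentialRDD(sc, mean, size[, ...]): Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
  - exponentialVectorRDD(sc, mean, numRows, numCols): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
  - gammaRDD(sc, shape, scale, size[, ...]): Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
  - gammaVectorRDD(sc, shape, scale, numRows, ...): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
  - logNormalRDD(sc, mean, std, size[, ...]): Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
  - logNormalVectorRDD(sc, mean, std, numRows, ...): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
  - normalRDD(sc, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
  - normalVectorRDD(sc, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
  - poissonRDD(sc, mean, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
  - poissonVectorRDD(sc, mean, numRows, numCols): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
  - uniformRDD(sc, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
  - uniformVectorRDD(sc, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
- Methods Documentation
- static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)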
- Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean, or 1 / lambda, for the Exponential distribution.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Exp(mean).
- Examples
    >>> mean = 2.0
    >>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> abs(stats.mean() - mean) < 0.5
    True
    >>> # The standard deviation of Exp(mean) equals mean.
    >>> bool(abs(stats.stdev() - mean) < 0.5)
    True
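The `mean` parameter is the reciprocal of the rate lambda, so a rate-based specification has to be converted before calling `exponentialRDD`. The sketch below is a plain-Python illustration and not part of the PySpark API: `rate_to_mean` is a hypothetical helper, and the standard library's `random.expovariate` stands in for the Spark generator only to show that samples drawn with rate lambda have a mean and a standard deviation both close to 1 / lambda.

```python
import random
from statistics import fmean, pstdev


def rate_to_mean(lam: float) -> float:
    """Convert an Exponential rate (lambda) to the mean expected by exponentialRDD."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return 1.0 / lam


# Local stand-in for the Spark generator (assumption: used for illustration only).
rng = random.Random(2)
lam = 0.5
samples = [rng.expovariate(lam) for _ in range(10_000)]

mean = rate_to_mean(lam)        # value to pass as `mean`
sample_mean = fmean(samples)    # close to 1 / lam
sample_std = pstdev(samples)    # also close to 1 / lam for an Exponential
```

With a live SparkContext, the converted value would be passed as the second argument, for example `RandomRDDs.exponentialRDD(sc, rate_to_mean(0.5), 1000)`.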
- static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean, or 1 / lambda, for the Exponential distribution.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
- Examples
    >>> import numpy as np
    >>> mean = 0.5
    >>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
    >>> mat = np.asmatrix(rdd.collect())
    >>> mat.shape
    (100, 100)
    >>> bool(abs(mat.mean() - mean) < 0.5)
    True
    >>> # The standard deviation of Exp(mean) equals mean.
    >>> bool(abs(mat.std() - mean) < 0.5)
    True
- static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)
- Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - shape (float): Shape (> 0) parameter for the Gamma distribution.
  - scale (float): Scale (> 0) parameter for the Gamma distribution.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
- Examples
    >>> from math import sqrt
    >>> shape = 1.0
    >>> scale = 2.0
    >>> expMean = shape * scale
    >>> expStd = sqrt(shape * scale * scale)
    >>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> bool(abs(stats.mean() - expMean) < 0.5)
    True
    >>> bool(abs(stats.stdev() - expStd) < 0.5)
    True
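The expected mean and standard deviation in the example follow from the moments of Gamma(shape, scale): the mean is shape * scale and the variance is shape * scale^2. A minimal sketch of that calculation, using a hypothetical helper `gamma_moments` that is not part of PySpark:

```python
from math import sqrt
from typing import Tuple


def gamma_moments(shape: float, scale: float) -> Tuple[float, float]:
    """Return (mean, standard deviation) of Gamma(shape, scale)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must both be positive")
    mean = shape * scale
    std = sqrt(shape) * scale  # same value as sqrt(shape * scale * scale)
    return mean, std


# Same parameters as the example: shape = 1.0, scale = 2.0.
exp_mean, exp_std = gamma_moments(1.0, 2.0)
```

Such a helper can supply the reference values when checking `stats.mean()` and `stats.stdev()` for other shape and scale choices.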
- static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - shape (float): Shape (> 0) of the Gamma distribution.
  - scale (float): Scale (> 0) of the Gamma distribution.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
- Examples
    >>> import numpy as np
    >>> from math import sqrt
    >>> shape = 1.0
    >>> scale = 2.0
    >>> expMean = shape * scale
    >>> expStd = sqrt(shape * scale * scale)
    >>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
    >>> mat.shape
    (100, 100)
    >>> bool(abs(mat.mean() - expMean) < 0.1)
    True
    >>> bool(abs(mat.std() - expStd) < 0.1)
    True
- static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)
- Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean parameter of the log normal distribution, i.e. the mean of the logarithm of the samples.
  - std (float): Standard deviation parameter of the log normal distribution, i.e. the standard deviation of the logarithm of the samples.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ log N(mean, std).
- Examples
    >>> from math import sqrt, exp
    >>> mean = 0.0
    >>> std = 1.0
    >>> expMean = exp(mean + 0.5 * std * std)
    >>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
    >>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> bool(abs(stats.mean() - expMean) < 0.5)
    True
    >>> bool(abs(stats.stdev() - expStd) < 0.5)
    True
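The example compares the sample statistics against exp(mean + std^2 / 2) and sqrt((exp(std^2) - 1) * exp(2 * mean + std^2)), so the `mean` and `std` arguments act on the log scale. The sketch below wraps those two formulas and their inverse; `lognormal_moments` and `lognormal_params` are hypothetical helpers written for illustration, not PySpark functions.

```python
from math import exp, log, sqrt
from typing import Tuple


def lognormal_moments(mean: float, std: float) -> Tuple[float, float]:
    """Sample (mean, std) implied by the log-scale parameters mean and std."""
    variance = std * std
    exp_mean = exp(mean + 0.5 * variance)
    exp_std = sqrt((exp(variance) - 1.0) * exp(2.0 * mean + variance))
    return exp_mean, exp_std


def lognormal_params(target_mean: float, target_std: float) -> Tuple[float, float]:
    """Inverse mapping: log-scale (mean, std) that produce the requested sample moments."""
    if target_mean <= 0 or target_std <= 0:
        raise ValueError("target_mean and target_std must be positive")
    variance = log(1.0 + (target_std / target_mean) ** 2)
    mean = log(target_mean) - 0.5 * variance
    return mean, sqrt(variance)


# Round trip with the example's parameters (mean = 0.0, std = 1.0).
m, s = lognormal_moments(0.0, 1.0)
mu, sigma = lognormal_params(m, s)
```

The inverse is useful when the desired mean and standard deviation of the samples are known and the arguments for `logNormalRDD` have to be derived from them.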
- static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
- New in version 1.3.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean parameter of the log normal distribution, i.e. the mean of the logarithm of the samples.
  - std (float): Standard deviation parameter of the log normal distribution, i.e. the standard deviation of the logarithm of the samples.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
- Examples
    >>> import numpy as np
    >>> from math import sqrt, exp
    >>> mean = 0.0
    >>> std = 1.0
    >>> expMean = exp(mean + 0.5 * std * std)
    >>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
    >>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
    >>> mat = np.matrix(m)
    >>> mat.shape
    (100, 100)
    >>> bool(abs(mat.mean() - expMean) < 0.1)
    True
    >>> bool(abs(mat.std() - expStd) < 0.1)
    True
- static normalRDD(sc, size, numPartitions=None, seed=None)
- Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
- To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v).
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
- Examples
    >>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> bool(abs(stats.mean() - 0.0) < 0.1)
    True
    >>> bool(abs(stats.stdev() - 1.0) < 0.1)
    True
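The shift-and-scale transformation described for `normalRDD` can be packaged as a reusable function. The sketch below is plain Python and runs without Spark; `to_normal` is a hypothetical helper name, and the fixed input values are chosen only to make the arithmetic easy to follow.

```python
from typing import Callable


def to_normal(mean: float, sigma: float) -> Callable[[float], float]:
    """Return a function mapping a standard normal draw to N(mean, sigma^2)."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return lambda v: mean + sigma * v


# Plain-Python check on fixed inputs. With a live SparkContext the same
# function would be passed to RandomRDDs.normalRDD(sc, n).map(...).
shift_scale = to_normal(10.0, 2.0)
transformed = [shift_scale(v) for v in (-1.0, 0.0, 1.5)]
```

Keeping the transformation in a named function, rather than an inline lambda, makes it straightforward to unit test separately from the RDD pipeline.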
- static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
- Examples
    >>> import numpy as np
    >>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
    >>> mat.shape
    (100, 100)
    >>> bool(abs(mat.mean() - 0.0) < 0.1)
    True
    >>> bool(abs(mat.std() - 1.0) < 0.1)
    True
- static poissonRDD(sc, mean, size, numPartitions=None, seed=None)
- Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean, or lambda, for the Poisson distribution.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Pois(mean).
- Examples
    >>> mean = 100.0
    >>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> abs(stats.mean() - mean) < 0.5
    True
    >>> from math import sqrt
    >>> bool(abs(stats.stdev() - sqrt(mean)) < 0.5)
    True
- static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - mean (float): Mean, or lambda, for the Poisson distribution.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
- Examples
    >>> import numpy as np
    >>> mean = 100.0
    >>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
    >>> mat = np.asmatrix(rdd.collect())
    >>> mat.shape
    (100, 100)
    >>> bool(abs(mat.mean() - mean) < 0.5)
    True
    >>> from math import sqrt
    >>> bool(abs(mat.std() - sqrt(mean)) < 0.5)
    True
- static uniformRDD(sc, size, numPartitions=None, seed=None)
- Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
- To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v).
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - size (int): Size of the RDD.
  - numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed (int, optional): Random seed (default: a random long integer).
- Returns
  - pyspark.RDD: RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
- Examples
    >>> x = RandomRDDs.uniformRDD(sc, 100).collect()
    >>> len(x)
    100
    >>> max(x) <= 1.0 and min(x) >= 0.0
    True
    >>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
    4
    >>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
    >>> parts == sc.defaultParallelism
    True
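The rescaling from U(0.0, 1.0) to U(a, b) is a linear map that sends 0.0 to a and 1.0 to b. A minimal sketch in plain Python, assuming a hypothetical helper `to_uniform` (not a PySpark function) and checked here on the endpoints and the midpoint rather than on random draws:

```python
from typing import Callable


def to_uniform(a: float, b: float) -> Callable[[float], float]:
    """Return a function mapping a U(0.0, 1.0) draw to U(a, b)."""
    if b < a:
        raise ValueError("b must not be smaller than a")
    return lambda v: a + (b - a) * v


# 0.0 maps to a, 0.5 to the midpoint, 1.0 to b.
rescale = to_uniform(-5.0, 5.0)
endpoints = [rescale(0.0), rescale(0.5), rescale(1.0)]
```

With a live SparkContext, `rescale` would be applied as `RandomRDDs.uniformRDD(sc, n).map(rescale)`.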
- static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
- Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
- New in version 1.1.0.
- Parameters
  - sc (pyspark.SparkContext): SparkContext used to create the RDD.
  - numRows (int): Number of Vectors in the RDD.
  - numCols (int): Number of elements in each Vector.
  - numPartitions (int, optional): Number of partitions in the RDD.
  - seed (int, optional): Seed for the RNG that generates the seed for the generator in each partition.
- Returns
  - pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
- Examples
    >>> import numpy as np
    >>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
    >>> mat.shape
    (10, 10)
    >>> bool(mat.max() <= 1.0 and mat.min() >= 0.0)
    True
    >>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
    4
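The `seed` description says that the seed feeds an RNG which in turn produces the seed for each partition's generator. The sketch below illustrates that two-level scheme in plain Python. It is an assumption-laden illustration of the idea, not Spark's actual implementation: `partition_seeds` and `uniform_partition` are hypothetical names, and Python's `random.Random` stands in for Spark's generators.

```python
import random
from typing import List


def partition_seeds(seed: int, num_partitions: int) -> List[int]:
    """Derive one generator seed per partition from a single top-level seed."""
    master = random.Random(seed)
    return [master.getrandbits(64) for _ in range(num_partitions)]


def uniform_partition(partition_seed: int, size: int) -> List[float]:
    """Generate one partition's worth of U(0.0, 1.0) samples."""
    rng = random.Random(partition_seed)
    return [rng.random() for _ in range(size)]


# Four partitions of 25 samples each, all derived from the single seed 4.
seeds = partition_seeds(seed=4, num_partitions=4)
partitions = [uniform_partition(s, 25) for s in seeds]
```

Under this scheme the same top-level seed reproduces the same per-partition seeds, while each partition still draws from its own independent stream.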