random-sample - Clojure Standard Library

For a quick intro to this series of blog posts check out Clojure Standard Library - Intro. It includes a lot of useful info, including notes about presentation of examples and more.

random-sample

Quick Overview

Description

clojure.core/random-sample is a function that can be used to take a random sample of a collection. It samples without replacement and considers each element within the collection independently.

Examples

(random-sample 0.5 [0 1 2 3 4 5 6 7 8 9 10])
; => (0 4 6)
(random-sample 0.5 (range 11))
; => (2 4 7 8 9)
(random-sample 1/10 (range 30))
; => (2 11 16 21)


Note: With most of the examples, if you run them yourself, your output will likely be different. random-sample is nondeterministic.

How To Use

Parameters and Return Values

random-sample is a multiple-arity function, which means that it uses arity overloading in order to provide different behavior based on the number of arguments provided. random-sample accepts one or two arguments.

• One Argument: (random-sample prob)

Providing only prob (probability), returns a transducer; this is not the same as partial application or currying. prob can be any type of number (e.g. integer, float, ratio, etc.), or anything else that < can compare to a float. Note that only numbers between 0 and 1 make sense; providing 0 (or smaller) results in no items being selected and 1 (or greater) results in all items being selected.

random-sample considers each item in the collection individually, so prob is the probability that any given item will be returned.

Ex:

;; -- Transducer Examples --

;; Creat a transducer that selects items 50% of the time
;; and bind it to sample-about-half.
(def sample-about-half (random-sample 0.5))

;; Use sample-about-half to sample items from a collection
;; of 20 integers.
(into [] sample-about-half (range 20))
; => [1 9 10 11 12 13 15 19] ;; Notice that only 8 items
;; are selected from the
;; population of 20 items.

;; Calculating the sums of random samples from a collection
;; of numbers 0 through 19.
(transduce sample-about-half + 0 (range 20))
; => 108
(transduce sample-about-half + 0 (range 20))
; => 67

• Two Arguments: (random-sample prob coll)

Providing both prob and a collection (coll), returns a lazy sequence of items randomly selected from coll without replacement. Each item in coll is ‘considered’ independently from the other items, to better understand this check out the break down of the source code in the Expanded Description section below.

Ex:

;; Randomly Selecting items from a collection of numbers 0
;; through 19, with prob = 0.5.
(random-sample 0.5 (range 20))
; => (0 4 6 7 8 11 15 17 18 19)
(random-sample 0.5 (range 20))
; => (0 6 8 9 11 12 14 17)  ;; Notice how both the items
;; selected and the size of the
;; sample are different even
;; though the arguments are the
;; same.

;; This is the same as supplying 0.5 for prob.
(random-sample 1/2 (range 20))
; => (0 3 5 8 10 11 14 16 17 18 19)

;; Out of 100,000 random samples, with prob set to 0.01,
;; from populations of 20 items about how many samples will
;; be non-empty. Refer to the footnote below for more info
;; on the functions used.
(count (filter not-empty
(repeatedly 100000
#(random-sample 0.01 (range 20)))))
; => 18223
(count (filter not-empty
(repeatedly 100000
#(random-sample 0.01 (range 20)))))
; => 18241
;; About 18.2% of the samples will be non-empty.

;; Ex: prob >= 1
(random-sample 1 (range 20))
; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
(random-sample 20 (range 20))
; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)

;; Ex: prob <= 0
(random-sample 0 (range 20))
; => ()  ;; Returns an empty collection.
(random-sample -200 (range 20))
; => ()
(random-sample -0.5 (range 20))
; => ()


1

Expanded Description

Examining Simplified Source Code

Below is a simplified version of the source code for random-sample. This simplified version is called smpl-random-sample. The main difference between the original and smpl-random-sample, is that I have changed it from being a multiple-arity function to a two-arity funcion. This means that it no longer accepts a variable number of arguments. It requires both prob and coll and it behaves identically to random-sample when both prob and coll are provided. I made this change in order reduce the number of concepts involved and to make it easier to understand. 2

(defn smpl-random-sample                   ; Name
"Simplified version of random-sample."   ; Doc String
[prob coll]                              ; Parameters
(filter (fn [_] (< (rand) prob)) coll))  ; Function Body


Breaking It Down:

There are Two main parts to the body of this function, the Filter Expression and the Predicate function.

(filter                     ; Filter Expression
(fn [_] (< (rand) prob))  ; Predicate Function
coll)                     ; End of Filter Expression

Predicate Function:(fn [_] (< (rand) prob))

The predicate function used is an anonymous function – it’s like any other function except that it has not been bound to a name.

Looking at the body of the predicate function you see the mechanism that is the heart of the random-sample function. (< (rand) prob), this expression returns true when the random number, produced by rand, is less than prob – remember prob is the probability supplied to random-sample that represents the chance that any individual item will be selected. Because rand returns a number between 0 (inclusive) and 1 (exclusive), any number supplied for prob that is less than 0 is effectively 0 and any number greater than 1 is effectively 1.

Here is an example that should help to provide a better understanding of how the (< (rand) prob) expression works.:

(def r (rand))           ; Bind a random float to the name "r".
; => #'user/r
r                        ; Check the value bound to "r".
; => 0.6482491474337703
(def prob 0.5)           ; Bind the value 0.5 to the name "prob".
; => #'user/prob
(< r prob)               ; Test if "r" is smaller than "prob".
; => false


For more information about rand and how it affects which items are selected check out the section “More About rand And Randomness” below.

Filter Expression:(filter predicate-function coll)

The filter expression uses the function filter which iterates over the collection coll, calling the predicate function on each item. When the predicate function returns ‘true’ the item is included in the output, if it returns ‘false’ the item is not included – it is ‘filtered’ out.

More About rand And Randomness

The pseudorandom number generator used by rand approximates a Uniform Distribution (UD) and rand only produces numbers between 0 and 1. 3 With rand following a UD, each number between 0 and 1 has the same probability of being returned. As a result there is a 50% chance that a number produced by rand will be less than 0.5, because 50% of numbers that could be chosen are less than 0.5 and they are all equiprobable. The same is true for other values, e.g. 20% less than 0.2, 60% less than 0.6, etc..

How does this effect the behavior of random-sample? Each item is considered independently so, this process can not make any guarantees as to the number of items that will ultimately be sampled form col. The only guarantee is that each item will have a probability of being chosen equal to prob.

Because the random numbers used are sampled from a UD, as the size of the population increases the distribution of the sample sizes (as a percentage of the population) produced will become increasingly normally distributed with a mean equal to prob.4 This behavior is clearly shown in the plots below.

The plots below show the distributions of samples sizes (as a percentage of the population) produced by random-sample with prob set to 0.1. All of the plots were produced using 100,000 samples but, each one shows samples taken from a different population size. The top plot shows samples taken from a population of 100, the middle a population of 500 and the bottom a population of 1,000. 5

In the plots, the vertical bars show a histogram of the samples, the blue curve shows the probability density function estimated using Kernel Density Estimation and the black curve is a normal distribution used for comparison.

What all of this ends up meaning is that, the larger the size of col (the population) the more likely it is that the resulting sample will be a percentage of col that is equal to prob. So the larger the collection is the more accurately you will be able to target a specific sample size. Note that there are better methods for sampling large populations (e.g. Reservoir Sampling) but, random-sample can still prove to be useful when the sample size does not need to be exact.

Example Use Case

A possible use case might be to use random-sample to create a random sample of an incoming stream of data of unknown size. This can be especially useful if you want to perform some kind of analysis of a live data stream and you can not handle all of the data at the rate it is steaming in – you want to turn the fire hose into a plain garden hose.

Ex:

;; Create a stream of 1 million integers.
(def big-stream (range 1000000))
; => #'user/big-stream

;; Create a sample of big-stream; each item has a
;; probability of being chosen equal to 0.1.
(def sample (random-sample 0.1 big-stream))
; => #'user/sample

;; Filter the items in sample using some expensive
;; predicate, to find out which items fit some category.
(def fit-category (filter expensive-calculation sample))

;; Find out the percentage of items that fit in some
;; category.
(def percent-that-fit-category (/ (count fit-category)
(count sample)))

;; The output of this can be used to estimate the percentage
;; of the population that would fit in some category.


Copyright (C) 2017 Steven Cutting - License

1. Check out posts on: count, filter, not-empty, range and repeatedly ↩︎

2. For the most up to date version of the actual source code click on the source code link in this page for random-sample ↩︎

3. For even more information about rand check out my other post that covers it in more detail: rand - Clojure Standard Library↩︎

4. This process is described by the Central Limit Theorem↩︎

5. The data for the plots was produced using this code: code ↩︎