randomsample  Clojure Standard Library
For a quick intro to this series of blog posts check out Clojure Standard Library  Intro. It includes a lot of useful info, including notes about presentation of examples and more.
This posts function:
randomsample
Quick Overview
Description
clojure.core/randomsample
is a function that can be used to take a random sample of a collection. It samples without replacement and considers each element within the collection independently.
Examples
(randomsample 0.5 [0 1 2 3 4 5 6 7 8 9 10])
; => (0 4 6)
(randomsample 0.5 (range 11))
; => (2 4 7 8 9)
(randomsample 1/10 (range 30))
; => (2 11 16 21)
Note: With most of the examples, if you run them yourself, your output will likely be different. randomsample
is nondeterministic.
How To Use
Parameters and Return Values
randomsample
is a multiplearity function, which means that it uses arity overloading in order to provide different behavior based on the number of arguments provided. randomsample
accepts one or two arguments.

One Argument:
(randomsample prob)
Providing only
prob
(probability), returns a transducer; this is not the same as partial application or currying.prob
can be any type of number (e.g. integer, float, ratio, etc.), or anything else that<
can compare to a float. Note that only numbers between 0 and 1 make sense; providing 0 (or smaller) results in no items being selected and 1 (or greater) results in all items being selected.randomsample
considers each item in the collection individually, soprob
is the probability that any given item will be returned.Ex:
;;  Transducer Examples 
;; Creat a transducer that selects items 50% of the time
;; and bind it to sampleabouthalf.
(def sampleabouthalf (randomsample 0.5))
; => #'user/sampleabouthalf
;; Use sampleabouthalf to sample items from a collection
;; of 20 integers.
(into [] sampleabouthalf (range 20))
; => [1 9 10 11 12 13 15 19] ;; Notice that only 8 items
;; are selected from the
;; population of 20 items.
;; Calculating the sums of random samples from a collection
;; of numbers 0 through 19.
(transduce sampleabouthalf + 0 (range 20))
; => 108
(transduce sampleabouthalf + 0 (range 20))
; => 67

Two Arguments:
(randomsample prob coll)
Providing both
prob
and a collection (coll
), returns a lazy sequence of items randomly selected fromcoll
without replacement. Each item incoll
is ‘considered’ independently from the other items, to better understand this check out the break down of the source code in the Expanded Description section below.Ex:
;; Randomly Selecting items from a collection of numbers 0
;; through 19, with prob = 0.5.
(randomsample 0.5 (range 20))
; => (0 4 6 7 8 11 15 17 18 19)
(randomsample 0.5 (range 20))
; => (0 6 8 9 11 12 14 17) ;; Notice how both the items
;; selected and the size of the
;; sample are different even
;; though the arguments are the
;; same.
;; This is the same as supplying 0.5 for prob.
(randomsample 1/2 (range 20))
; => (0 3 5 8 10 11 14 16 17 18 19)
;; Out of 100,000 random samples, with prob set to 0.01,
;; from populations of 20 items about how many samples will
;; be nonempty. Refer to the footnote below for more info
;; on the functions used.
(count (filter notempty
(repeatedly 100000
#(randomsample 0.01 (range 20)))))
; => 18223
(count (filter notempty
(repeatedly 100000
#(randomsample 0.01 (range 20)))))
; => 18241
;; About 18.2% of the samples will be nonempty.
;; Ex: prob >= 1
(randomsample 1 (range 20))
; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
(randomsample 20 (range 20))
; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
;; Ex: prob <= 0
(randomsample 0 (range 20))
; => () ;; Returns an empty collection.
(randomsample 200 (range 20))
; => ()
(randomsample 0.5 (range 20))
; => ()
^{1}
Expanded Description
Examining Simplified Source Code
Below is a simplified version of the source code for randomsample
. This simplified version is called smplrandomsample
. The main difference between the original and smplrandomsample
, is that I have changed it from being a multiplearity function to a twoarity funcion. This means that it no longer accepts a variable number of arguments. It requires both prob
and coll
and it behaves identically to randomsample
when both prob
and coll
are provided. I made this change in order reduce the number of concepts involved and to make it easier to understand. ^{2}
(defn smplrandomsample ; Name
"Simplified version of randomsample." ; Doc String
[prob coll] ; Parameters
(filter (fn [_] (< (rand) prob)) coll)) ; Function Body
Breaking It Down:
There are Two main parts to the body of this function, the Filter Expression and the Predicate function.
(filter ; Filter Expression
(fn [_] (< (rand) prob)) ; Predicate Function
coll) ; End of Filter Expression
Predicate Function: (fn [_] (< (rand) prob))
The predicate function used is an anonymous function – it’s like any other function except that it has not been bound to a name.
Looking at the body of the predicate function you see the mechanism that is the heart of the randomsample
function. (< (rand) prob)
, this expression returns true when the random number, produced by rand
, is less than prob
– remember prob
is the probability supplied to randomsample that represents the chance that any individual item will be selected. Because rand
returns a number between 0 (inclusive) and 1 (exclusive), any number supplied for prob
that is less than 0 is effectively 0 and any number greater than 1 is effectively 1.
Here is an example that should help to provide a better understanding of how the (< (rand) prob)
expression works.:
(def r (rand)) ; Bind a random float to the name "r".
; => #'user/r
r ; Check the value bound to "r".
; => 0.6482491474337703
(def prob 0.5) ; Bind the value 0.5 to the name "prob".
; => #'user/prob
(< r prob) ; Test if "r" is smaller than "prob".
; => false
For more information about rand
and how it affects which items are selected check out the section “More About rand
And Randomness” below.
Filter Expression: (filter predicatefunction coll)
The filter expression uses the function filter
which iterates over the collection coll
, calling the predicate function on each item. When the predicate function returns ‘true’ the item is included in the output, if it returns ‘false’ the item is not included – it is ‘filtered’ out.
More About rand
And Randomness
The pseudorandom number generator used by rand
approximates a Uniform Distribution (UD) and rand
only produces numbers between 0 and 1. ^{3} With rand
following a UD, each number between 0 and 1 has the same probability of being returned. As a result there is a 50% chance that a number produced by rand
will be less than 0.5, because 50% of numbers that could be chosen are less than 0.5 and they are all equiprobable. The same is true for other values, e.g. 20% less than 0.2, 60% less than 0.6, etc..
How does this effect the behavior of randomsample
? Each item is considered independently so, this process can not make any guarantees as to the number of items that will ultimately be sampled form col
. The only guarantee is that each item will have a probability of being chosen equal to prob
.
Because the random numbers used are sampled from a UD, as the size of the population increases the distribution of the sample sizes (as a percentage of the population) produced will become increasingly normally distributed with a mean equal to prob
.^{4} This behavior is clearly shown in the plots below.
The plots below show the distributions of samples sizes (as a percentage of the population) produced by randomsample
with prob
set to 0.1. All of the plots were produced using 100,000 samples but, each one shows samples taken from a different population size. The top plot shows samples taken from a population of 100, the middle a population of 500 and the bottom a population of 1,000. ^{5}
In the plots, the vertical bars show a histogram of the samples, the blue curve shows the probability density function estimated using Kernel Density Estimation and the black curve is a normal distribution used for comparison.
What all of this ends up meaning is that, the larger the size of col
(the population) the more likely it is that the resulting sample will be a percentage of col
that is equal to prob
. So the larger the collection is the more accurately you will be able to target a specific sample size. Note that there are better methods for sampling large populations (e.g. Reservoir Sampling) but, randomsample
can still prove to be useful when the sample size does not need to be exact.
Example Use Case
A possible use case might be to use randomsample
to create a random sample of an incoming stream of data of unknown size. This can be especially useful if you want to perform some kind of analysis of a live data stream and you can not handle all of the data at the rate it is steaming in – you want to turn the fire hose into a plain garden hose.
Ex:
;; Create a stream of 1 million integers.
(def bigstream (range 1000000))
; => #'user/bigstream
;; Create a sample of bigstream; each item has a
;; probability of being chosen equal to 0.1.
(def sample (randomsample 0.1 bigstream))
; => #'user/sample
;; Filter the items in sample using some expensive
;; predicate, to find out which items fit some category.
(def fitcategory (filter expensivecalculation sample))
;; Find out the percentage of items that fit in some
;; category.
(def percentthatfitcategory (/ (count fitcategory)
(count sample)))
;; The output of this can be used to estimate the percentage
;; of the population that would fit in some category.
Copyright (C) 2017 Steven Cutting  License

Check out posts on:
count
,filter
,notempty
,range
andrepeatedly
↩︎ 
For the most up to date version of the actual source code click on the source code link in this page for randomsample ↩︎

For even more information about
rand
check out my other post that covers it in more detail: rand  Clojure Standard Library. ↩︎ 
This process is described by the Central Limit Theorem. ↩︎

The data for the plots was produced using this code: code ↩︎