Random String in PostgreSQL

Mon, May 30, 2011

by Szymon Lipiński

tags: database linkedin postgresql

Last time I wrote about common problem with the random function. PostgreSQL has only the random(). There is no function like random(from, to) or random(length), which returns a random text of the given length.

This time I just needed a function which returns a random text of the given length. Then it can be used like this: SELECT random_text(42). The random text will be used as a slug in a web application, so I need only numbers and letters and let’s also have only uppercase.

Let’s create a PL/pgSQL procedure for returning a random text with the given length.

A Naive Solution

There is a very common solution to that problem: take an array of characters and choose randomly from them. Something like this:

CREATE OR REPLACE FUNCTION random_text_simple(length INTEGER)
    RETURNS TEXT
    LANGUAGE PLPGSQL
    AS $$
    DECLARE
        possible_chars TEXT := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
        output TEXT := '';
        i INT4;
        pos INT4;
    BEGIN

        FOR i IN 1..length LOOP
            pos := random_range(1, length(possible_chars));
            output := output || substr(possible_chars, pos, 1);
        END LOOP;

        RETURN output;
    END;
    $$;

Looks great. The only thing missing there is the function random_range.

The function I use looks like this:

CREATE OR REPLACE FUNCTION random_range(INTEGER, INTEGER)
    RETURNS INTEGER
    LANGUAGE SQL
    AS $$
        SELECT ($1 + FLOOR(($2 - $1 + 1) * random() ))::INTEGER;
    $$;

The idea behind this function is described in one of my older posts: common problem with random min max.

Let’s check the efficiency of the function. How much time does it take to run it 1k times with different parameters:

  param                              time [s]
  ---------------------------------- -------------------------------------
       1                                0.013
       2                                0.013
      10                                0.052
      50                                0.115
     100                                0.223
   1,000                                2.224
  10,000                               39.885
  30,000                              217.746

One question. Does that look OK? Maybe not OK? If not OK, then why is that bad? Times are bad… but compared to what?

A One Step Better Solution

How to improve the function? My first idea was to get rid of the variable pos and inline it into the next line. After such a change the function is:

CREATE OR REPLACE FUNCTION random_text_simple_1(length INTEGER)
    RETURNS TEXT
    LANGUAGE PLPGSQL
    AS $$
    DECLARE
        possible_chars TEXT := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
        output TEXT := '';
        i INT4;
    BEGIN

        FOR i IN 1..length LOOP
            output := output || substr(possible_chars, random_range(1, length(possible_chars)), 1);
        END LOOP;

        RETURN output;
    END;
    $$;

What about the timings?

  param                time [s]               previous [s]
  -------------------- ---------------------- ----------------------------
       1                0.010                  0.013
       2                0.011                  0.013
      10                0.024                  0.052
      50                0.087                  0.115
     100                0.168                  0.223
   1,000                1.677                  2.224
  10,000               34.677                 39.885
  30,000              199.434                217.746

Looks a little bit better. Not bad, but could we do some more about that?

A One Step Better Solution

My second try was to calculate the length of the string before the loop. Maybe PL/pgSQL can optimize that, maybe not, I will check.

Here is the function and timings:

CREATE OR REPLACE FUNCTION random_text_simple_2(length INTEGER) RETURNS TEXT LANGUAGE PLPGSQL
AS $$ 
DECLARE 
  possible_chars TEXT := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'; 
  output TEXT := '';
  i INT4;
  chars_size INTEGER := length(possible_chars);
BEGIN 
  FOR i IN 1..length LOOP
    output := output || substr(possible_chars,
                               random_range(1, chars_size), 1); 
  END LOOP;
  RETURN output;
END; $$;

  param                time [s]               previous [s]
  -------------------- ---------------------- ----------------------------
       1                 0.011                  0.010
       2                 0.012                  0.011
      10                 0.023                  0.024
      50                 0.081                  0.087
     100                 0.153                  0.168
   1,000                 1.496                  1.677
  10,000                32.353                 34.677
  30,000               193.639                199.434

This doesn’t seem to be a big gain compared to the second one.

A One Step Better Solution

Another idea was to inline the random_range function.

CREATE OR REPLACE FUNCTION random_text_simple_3(length INTEGER)
RETURNS TEXT
LANGUAGE PLPGSQL
AS $$ 
DECLARE 
  possible_chars TEXT := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'; 
  output TEXT := '';
  i INT4;
  chars_size INTEGER; 
BEGIN
  chars_size := length(possible_chars);
  FOR i IN 1..length LOOP 
    output := output || substr(
      possible_chars, 
      (1 + FLOOR((chars_size - $1 + 1) * random() ))::INTEGER, 1);
  END LOOP;
  RETURN output;
END; $$;

  param                time [s]               previous [s]
  -------------------- ---------------------- ----------------------------
       1                0.011                   0.011
       2                0.012                   0.012
      10                0.023                   0.023
      50                0.069                   0.081
     100                0.127                   0.153
   1,000                1.172                   1.496
  10,000               11.672                  32.353
  30,000               35.028                 193.639

Wow. I’ve never thought that inlining such a simple function would give me so huge difference.

The Best Solution… So Far

While I was thinking about the cons of PL/pgSQL, I thought about writing the code in C and just use a C library.

Then I thought about using something strange like md5(random()::TEXT). The whole idea behind this solution is to use two functions written in C. There will be some PL/pgSQL glue code, but the whole logic would be written in C.

The algorithm is:

Calculate the number of MD5 sums to get at least length chars.

Call md5(random()) code and concatenate the results.

Use substring() and upper() for the final result.

The function for the above algorithm:

CREATE OR REPLACE FUNCTION random_text_md5(length INTEGER)
RETURNS TEXT
LANGUAGE PLPGSQL
AS $$ 
DECLARE 
  -- how many md5's we need to have at least length chars
  loop_count INTEGER := CEIL(length / 32.); 
  output TEXT := ''; -- the result text 
  i INT4; -- loop counter 
BEGIN
  FOR i IN 1..loop_count LOOP
    output := output || md5(random()::TEXT);
  END LOOP; 
  -- get the substring for the exact number of characters
  -- and upper them up 
  RETURN upper(substring(output, length)); 
END; $$;

Nothing special, it just works. What about the timing? The timing will be a little bit later.

The SQL Version

Pavel wrote a very interesting comment about using a pure SQL function. I’ve changed his function a little bit and the new function looks like this:

CREATE OR REPLACE FUNCTION random_string_pavel(int)
RETURNS text
AS $$ 
  SELECT array_to_string(
    ARRAY (
      SELECT substring(
        '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' 
        FROM (random() *36)::int FOR 1)
      FROM generate_series(1, $1) ), '' ) 
$$ LANGUAGE sql;

Additional SQL Functions

The timings for the new function were added to the below table. The chart has been updated.

Pavel posted three other versions of that function. The code is in the comment below, so I won’t place it here. I’ve run the tests once again for all the functions. The below data and charts have been updated.

My md5 version is still the best.

Combined md5 And SQL

Pavel also suggested combining the md5 version and the buildstr03. I’ve created the below version:

CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$ 
  select upper(
    substring(
      (SELECT string_agg(md5(random()::TEXT), '')
       FROM generate_series(
           1,
           CEIL($1 / 32.)::integer) 
       ), 1, $1) );
$$;

This is the best solution so far, thanks Pavel. It is twice faster than the old md5 version.

Final Results

length of text	simple	simple_1	simple_2	simple_3	md5	md5_v2
1	0.013	0.010	0.011	0.011	0.023	0.039
2	0.013	0.011	0.012	0.012	0.018	0.033
10	0.030	0.024	0.023	0.023	0.017	0.034
50	0.115	0.087	0.081	0.069	0.020	0.036
100	0.223	0.168	0.153	0.127	0.025	0.041
1,000	2.244	1.677	1.496	1.172	0.106	0.098
10,000	39.885	34.677	32.353	11.672	1.462	0.672
30,000	217.746	199.434	193.639	35.028	7.448	1.962

And the chart for this.

Final Notes

PL/pgSQL is not very efficient, but can be easily improved

functions are not inlined and calls are expensive (but inlining by copying is not the best idea)

the fastest way is to use some functions written in C

I think the fastest way would be to use some function written in C

Interesting thing is that the last two functions scale linearly when the text size increases.

Pavel’s suggestions with the SQL function has the same timings as my random_test_simple_3

Another Pavel’s suggestion resulted in the best version random_text_md5_v2

The best solution random_text_md5_v2 is 12 times faster than the best PL/pgSQL solution.