Code snippets to calculate percentiles in databases

As a Datavis Engineer at Shutterstock, I dive into a lot of data everyday and routinely answer questions regarding customer behaviors, site security, and latency issues. I keep a list of SQL snippets to copy and paste into my work, and I found that keeping the list of examples is easier than memorizing a bunch of similar-but-different SQL dialects.

Calculating percentiles comes up regularly in my work, and it’s also a popular area of confusion. So, I’m going to be breaking down the appliation of calculating percentiles for you. I find a business context helps me understand the tech behind the topic, so I organized queries into four case studies. I’ll state the situation, present the technology, give the query, then wrap it up.

Note: I am very specifically not comparing technologies. The internet is filthy with those posts. I am offering copiable code without the hype.

Also note: all data is fake. I show numbers to give a sense of the output to expect, not a scale for comparison.

CASE STUDY, USING HP VERTICA: HOW LONG DOES A CONTRIBUTOR WAIT FOR THEIR PHOTO TO GO PUBLIC ON THE SITE?

SITUATION:

Shutterstock is a two-sided marketplace. We license crowd-sourced media. Contributors do not receive money for their images until it is published on our website. Tens of thousands of images are submitted daily, and they all need to be inspected carefully. Hence, an entire team is dedicated to inspecting the images, and they aim to maintain a thorough and speedy process.

TECHNOLOGY:

We use Vertica for analytic duties. Like most columnar database technologies, Vertica’s row lookups are slow, but the aggregates are blazingly fast. It is the only pay-for-use database in this post.

QUERY:

Vertica’s flavor of SQL is fairly similar to MySQL’s, with analytic functions similar to other OLAP databases. Vertica has many built in analytical functions. I use PERCENTILE_DISC() for this case study.

SELECT
  DISTINCT
  added_date,
  PERCENTILE_DISC(.9)
    WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
    OVER (PARTITION BY added_date)
    AS '90th',
  PERCENTILE_DISC(.5)
    WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
    OVER (PARTITION BY added_date)
    AS 'median'
FROM
  photo_approval_table
WHERE
  added_date >= current_date() - interval '4 day'

RESULTS (as a reminder, there are 1440 minutes in a day):

added_date median 90th
2014-01-01 2880 5000
2014-01-02 1440 6000
2014-01-03 2000 4800
2014-01-04 3000 5500

Half of the photos uploaded on January 1 took two days to show up on our website.  There is a big gap between the median and 90th percentile approval times on January 2. If this data was real, I would investigate why January 2 is different from other days.

 

CASE STUDY, USING Apache HIVE: HOW MANY TIMES IN A DAY DOES A SPECIFIC ELEMENT GET CLICKED?

SITUATION:

We track the efficacy of our designs. Knowing how often an element is clicked gives us insight into what our customers actually see on a page. If customers click an element that is not a link, then we can make changes to our HTML.

TECHNOLOGY:

We store raw click data in HDFS. Hive is a familiar SQL interface to data stored in HDFS. It has a built in PERCENTILE() UDF.

QUERY:

I begin by running an inner query to get customer click counts on the specific element per day. I wrap that inner query in the daily percentile main query to find percentiles of customer behavior. I need to build the inner query because PERCENTILE() does not accept COUNT() as its first argument.

SELECT
  day,
  percentile(count, 0.25),
  percentile(count, 0.50),
  percentile(count, 0.75),
  percentile(count, 0.99)
from
  (
  -- inner query
    select
      day,
      visit_id,
      count(*) as count
    from
      click_tracking_table
    where
      element = 'header_div'
      and page = '/search_results.html'
      and year = 2014 and month = 4
    group by
      visit_id,
      day
  ) c
group by
  day

RESULTS:

Sample data result of inner query:

   visitor_id | day | count
       1      |  1  |  5
       2      |  1  |  7
       2      |  2  |  9

All results:

day _c1 _c2 _c3 _c4
01 1.0 3.0 15.0 52.0
02 1.0 3.0 15.0 64.0
03 1.0 3.0 14.0 68.0

Judging by median click counts, _c2, customers click on a non-link element about three times in a session. Some click as many as fifteen times. Wow. The header_div element should be made clickable.

 

CASE STUDY, USING MySQL: ARE CUSTOMERS SIGNING UP FOR ACCOUNTS FASTER NOW THAN THEY WERE A YEAR AGO?

SITUATION:

Shutterstock’s B.I. team does an excellent job of analyzing marketing spends and conversions. Sometimes it is easier for me to get the data than pull an analyst away from their work.

TECHNOLOGY:

MySQL is a widely used transactional database. It’s fast and full-featured, but does not have built-in support for calculating percentiles.

QUERY:

I need to compare a rank, meaning a position in an ordered list, against a total count of rows. Position over total count gives me a percentile.

Complex queries are built inside out. This query starts with inner query t, which counts the total number of accounts per language. I join the per language counts to the accounts table, and pull out a customer’s signup date.

There are plenty of resources for calculating row level ranks in MySQL around the web. The techniques boil down to setting variables in an ordered result set. Here, I order the inner query r by language and keep track of the current language in @lang. When language changes, @rank resets to 0 in _resetRank column. Neat!

The outer query compares signup dates to milestone dates.

Given that we’re only looking at sign ups that occurred in the past year, we would expect twenty-five percent of signups to happen exactly nine months ago under consistent signups. 50% six months ago. If 25% of signups happens before the nine month milestone, the first quarter of the year saw “fast” signup. This query returns “gut check” data; it’s not vigorously tested or verified.

select
  r.language,
  datediff(
    date(max(case when percentile <= 25 then signup_datetime end)),
    current_date - interval 9 month
  ) AS '25th',
  datediff(
    date(min(case when percentile >= 50 then signup_datetime end)),
    current_date - interval 6 month
  ) as 'median',
  datediff(
    date(min(case when percentile >= 75 then signup_datetime end)),
    current_date - interval 3 month
  ) as '75th'
from
  (
    /* build ranks and percentiles */
    select
      a.signup_datetime,
      @rank := if( @lang = a.language, @rank, 0 ) as _resetRank,
      @lang := a.language as language,
      @rank := @rank+1 as rank,
      round( 100 * @rank / t.cnt, 3 ) as percentile
    from
      accounts a
    /* t counts rows per language */
    join
      (
         select
           language,
           count(*) as cnt
         from
           accounts
         where
           signup_datetime > current_date - interval 1 year
         group by
           language
       ) t on t.language = a.language
    where
      a.signup_datetime > current_date - interval 1 year
    order by
      a.language,
      a.signup_datetime
  ) r
group by
  r.language
order by
  min(case when percentile >= 25 then signup_datetime end),
  min(case when percentile >= 50 then signup_datetime end);

RESULTS:

+----------+------+--------+------+
| language | 25th | median | 75th |
+----------+------+--------+------+
| de       |  -18 |     -9 |    2 |
| en       |    0 |      0 |    0 |
| vu       |   82 |     54 |   39 |
+----------+------+--------+------+

German language sign ups hit first the 25th percentile 18 days early and median nine days early, but third quartile was not reached until two days after expected. German language signups are slowing down. Vulcan, which was a slow trickle at the beginning of the year, hit a boom in the recent 3 months; guess that booth at the convention worked out.

 

CASE STUDY, USING MONGO: HOW MANY TAGS DO ANNOTATIONS HAVE?

SITUATION:

As events happen on our site, domain experts post comments to our internal annotation service. Comments are tagged with multiple keywords, and those tags form the structure for our knowledge base. It is one way we can link homepage latency with conversion rates. In such a system, keyword breadth is highly important, so I want to know how many annotations keywords link together.

TECHNOLOGY:

MongoDB is a document store, NoSQL technology. It does not have out-of-the-box percentile functionality, but it does have a well documented MapReduce framework. Up until now, all the percentile solutions, even MySQL, have been a DSL; MapReduce is full-on programming.

QUERY:

“tags” is an array field for a AnnotationCollection document. I emit() each tag, summing them up in the reducer. Basic MapReduce word counting. I inline the output of the MapRedue job, ‘{ inline : 1 }‘, to capture the results in an array of key, value tuples. I then sort the tuple array, biggest first. Finally, use the percentile times the total number of records to get the index of the tuple.

mapper = function() {
  if ( this.tags ) {
    for ( var i = 0; i < this.tags.length; i++ ) {
      emit( this.tags[i],1 )
    };
  }
}

reducer = function(pre,curr) {
  var count = 0;
  for (var i in curr) {
    count += curr[i]
  };
  return count;
}

out = db.runCommand({
   mapreduce  : 'AnnotationCollection',
   map        : mapper,
   reduce     : reducer,
   out        : { inline : 1 }
});

/* sort them */
out.results.sort( function(a,b ) { return a.value - b.value } )

/* these percentiles */
[ 50, 80, 90, 99 ].forEach(function(v) {
   print( v + 'th percentile of annotations linked by keyword tags is ', out.results[Math.floor(v * out.counts.output / 100) ].value );
})

RESULTS:

Sample results from out:

...
		{
			"_id" : "german seo",
			"value" : 98
		},
		{
			"_id" : "glusterfs",
			"value" : 145
		},
		{
			"_id" : "googlebot",
			"value" : 2123
		},
...
	"counts" : {
		"input" : 711475,
		"emit" : 1543510,
		"reduce" : 711475,
		"output" : 738
	},

All results:

50th percentile of annotations linked by keyword tags is  4
80th percentile of annotations linked by keyword tags is  8
90th percentile of annotations linked by keyword tags is  27
99th percentile of annotations linked by keyword tags is  853

Keyword tags link, on average, 4 annotations. Decent coverage, but could be better. But the top 1% of our (fake-data) keywords link 800+ annotations? “deployment” is the top keyword; automated processes are good knowledge sharers.

This post is a reference for calculating percentiles in different data stores. I keep a list of working queries to copy from at work. And now I’m pleased to share some of this list with you. Try them out, and let me know what you think.