Here at Shutterstock we love digging into data. We collect large amounts of it, and want a simple, fast way to access it. One of the tools we use to do this is Apache Solr.
Most users of Solr will know it for its power as a full-text search engine. Its text analyzers, sorting, filtering, and faceting components provide an ample toolset for many search applications. A single instance can scale to hundreds of millions of documents (depending on your hardware), and it can scale even further through sharding. Modern web search applications also need to be fast, and Solr can deliver in this area as well.
The needs of a data analytics platform aren’t much different. It too requires a platform that can scale to support large volumes of data. It requires speed, and depends heavily on a system that can scale horizontally through sharding as well. And some of the main operations of data analytics—counting, slicing, and grouping—can be implemented using Solr’s filtering and faceting options.
One example of how we’ve used Solr this way is for search analytics. Instead of indexing things like website content or image keywords, the index for this system consists of search events. Let’s say we want to analyze our search data based on the language of the query and the country where the user is located. A single document contains fields for the search term, language, country, and city of the user, and the timestamp of the search (plus an auto-generated uid to identify each unique search event).
<fields>
<field name="uid" type="string"/>
<field name="search_term" type="string"/>
<field name="country" type="string"/>
<field name="city" type="string"/>
<field name="language" type="string"/>
<field name="search_date" type="date"/>
</fields>
Our Solr schema for this example
{
"search_term": "navidad",
"country": "Spain",
"city": "Madrid",
"language": "Spanish",
"uid": "123412341234",
"search_date": "2012-12-04T10:30:45Z"
}
A document representing the search event that we’re storing in Solr
If we run a query on this data and facet on search_term, Solr gives us an ordered list of the most frequent search terms and their counts. We can then take a slice of this data by filtering on country:Spain, which gives us the top searches from users in Spain.
Taking this further, we can filter by a date range and look at, say, searches that occurred in December. Now, not surprisingly, we see the search term navidad percolate to the top.
http://localhost:8983/solr/select?q=*:*&fq=country:Spain&
facet=true&facet.field=search_term&facet.mincount=1&
facet.limit=100
This Solr query will get us the top search terms used in Spain
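If we want to zero in on December, as described above, one way to do it is to add a date-range filter on search_date to the same query (the range below simply matches our example’s 2012 data):
http://localhost:8983/solr/select?q=*:*&fq=country:Spain&
fq=search_date:[2012-12-01T00:00:00Z%20TO%202013-01-01T00:00:00Z]&
facet=true&facet.field=search_term&facet.mincount=1&
facet.limit=100
The same query, narrowed to searches made in December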
We can take our analysis yet further by using Solr’s powerful date range faceting, which lets us group our results by some unit of time. Let’s say we set our interval to a week (setting facet.range.gap to “+7DAYS”). Also, instead of faceting on search_term, let’s filter by search_term:navidad. Now our results give us the number of times this query was used each week. Send these numbers to a graph and we can generate a trendline telling us when our Spanish users started getting interested in Christmas last year.
http://localhost:8983/solr/select?q=*:*&fq=country:Spain&
fq=search_term:navidad&facet=true&facet.range=search_date&
facet.range.gap=%2B7DAYS&facet.range.end=2013-01-01T00:00:00Z&
facet.range.start=2012-01-01T00:00:00Z
This query will tell us how many times per week navidad was searched for in Spain.
In essence, what we’ve built is a simple OLAP cube, except on commodity hardware, using open-source software. And although cubes and the MDX query language provide a much richer set of features, some of the core pieces can be replicated in a Solr-based architecture:
Hierarchies: Tiers of attributes, such as year > month > date, or Continent > Country > City, can be represented as multiple fields that are populated at index time. Solr won’t know the relationship between the tiers, so you may want to index the values more verbosely: your “country” value would be “Europe.Spain” and your “city” value would be “Europe.Spain.Madrid”, so there’s no mixing of Madrid, Alabama into your results when filtering or faceting by city (see the example document after this list).
Measures: the unit of measurement for whatever you’re counting. In our example, the only thing we measured was the number of searches. More complex measures might be the 95th-percentile response time of a given search, or the total number of dollars spent after a user performed a given search. These types of calculations are, for now, beyond the realm of Solr, so we’ll stick to the basic counting we can get through faceting.
Rows and Columns: In our first example we simply faceted on search term. That gave us a single column of data where one dimension was the search term and the other was essentially time, spanning the age of the entire data set. If we want to retrieve multidimensional datasets in a single query, the place to look is pivot facets (see the sample query after this list).
If the performance of pivot facets is too much of an obstacle, you can also build your dataset by running multiple queries while filtering on each value in one of your axes.
Filtering: Solr may have OLAP beat in this one area at least. Whether you use Solr’s standard filters, range filters, fuzzy matches, or wildcards, you should have all the tools you need to grab a decent slice of your data.
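To illustrate the Hierarchies point above, a document indexed with the more verbose tier values might look something like this (the same fields as our earlier example, with hypothetical hierarchy-encoded values):
{
"search_term": "navidad",
"country": "Europe.Spain",
"city": "Europe.Spain.Madrid",
"language": "Spanish",
"uid": "123412341234",
"search_date": "2012-12-04T10:30:45Z"
}
A search event with hierarchy-encoded country and city values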
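And to illustrate Rows and Columns, a pivot facet query over our example fields might look like this (facet.pivot is available in Solr 4.0 and later; the field list here is just our country and search_term):
http://localhost:8983/solr/select?q=*:*&facet=true&
facet.pivot=country,search_term&facet.limit=10
A pivot facet query that returns the top search terms within each country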
If you’re looking to get your feet wet in data analytics, Solr may be a good tool to start with. It’s easy to get going, and it’s totally free. And once you’ve invested your time in it, its strong community and its track record of scaling to high volumes of search traffic and data make it a tool you can grow with.