Solr HyperLogLog

Solr 5.2 introduces HyperLogLog, the probabilistic approach for counting distinct values.

Solr already had provision to count distinct values using unique facet function or countDistinct LocalParam in stats component. But this approach doesn’t scale well, as it needs to fit everything into memory and memory required grows linearly with the cardinality. If Solr gets many requests for counting distinct values, system may even throw OOM Exception soon.

HyperLogLog model for estimating the distinct value was introduced in paper HyperLogLog, which is an extension of LogLog algorithm. This algorithm uses Hash function based approach using limited memory but the count is an estimation of the distinct value instead of exact count.

Solr 5.2 introduces new facet function hll() and new LocalParam cardinality in stats component, to estimate the distinct values. Below is the syntax.

hll(<fieldname>) // facet function

{!cardinality=true}<fieldname> //stats component

Another tradeoff of HyperLogLog can be performance which can be bit slower than unique and countDistinct but uses significantly less memory, which can make it an ideal choice for scenrio such as finding unique visitors to site etc.

Write a comment
Cancel Reply