Solr HyperLogLog


By: Dikshant Shahi | June 12, 2015

Solr 5.2 introduces HyperLogLog, the probabilistic approach for counting distinct values.

Solr already had provision to count distinct values using unique facet function or countDistinct LocalParam in stats component. But this approach doesn’t scale well, as it needs to fit everything into memory and memory required grows linearly with the cardinality. If Solr gets many requests for counting distinct values, system may even throw OOM Exception soon.

HyperLogLog model for estimating the distinct value was introduced in paper HyperLogLog, which is an extension of LogLog algorithm. This algorithm uses Hash function based approach using limited memory but the count is an estimation of the distinct value instead of exact count.

Solr 5.2 introduces new facet function hll() and new LocalParam cardinality in stats component, to estimate the distinct values. Below is the syntax.

hll(<fieldname>) // facet function

{!cardinality=true}<fieldname> //stats component

Another tradeoff of HyperLogLog can be performance which can be bit slower than unique and countDistinct but uses significantly less memory, which can make it an ideal choice for scenrio such as finding unique visitors to site etc.

Dikshant Shahi (8 Posts)

Dikshant works as Solution Architect at The Digital Group. He takes interest in Semantic Search, Information Retrieval, Natural Language Processing and Machine Learning. He is the author of book "Apache Solr: A Practical Approach to Enterprise Search".


This post has been viewed 3,004 times

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>