How do Reddit, Instagram, and Google calculate unique view counts so efficiently?
Let us see how their impression counting works internally.
If you have ever posted on Reddit or LinkedIn, or run ads on Google, there is a critical metric you have probably watched: the number of users who viewed your ad or post.
LinkedIn shows this as impressions, while Reddit shows it as views.
Some similar examples of impression counting are:
Google Search: Count impressions on a search result.
Instagram: Count views on a reel.
These are engagement metrics that are calculated by the platform and shown to the user as analytics.
In this blog, we will look at how to design such a system. I will cover a very basic implementation, along with an implementation that I think companies with huge amounts of traffic might be using.
Problem
For a very long time, social media websites like Reddit and Instagram only showed engagement metrics in the form of likes and comments. But not everyone engages with content by liking or commenting. Over time, these companies built systems that capture this activity by counting the number of views a post has received. This number is then shown to content creators to give them better insight into the activity on specific posts.
Requirements
Some of the major requirements that we want to fulfil while designing a system that can determine the views a post or ad gets are:
Counts must be in real time or near-real time.
The user should be able to query, “Select the number of unique users who saw the post within a given time range.”
The system should be able to run at production scale and process events efficiently at minimum cost.
The major requirement of this system is that within any given range, the user should be able to check the number of unique users that interacted with our system.
Basic Solution
In this solution, we will have a mapping table where, for a particular post ID and timestamp, we store the userIds of the users who viewed that post. For example, the post ID might be 1, and the timestamp for a particular minute might be 1722823867.
Our table would look something like this: the primary key is a combination of postId and timestamp, and against it we store the set of users who viewed or interacted with the post.
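For illustration, a couple of rows might look like this (the IDs here are made up for the example):

postId_timestamp -> set of userIds
1_1722823867 -> {u101, u205, u342}
1_1722823927 -> {u205, u877, u912}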
Pros
For each minute, we know how many users interacted with the post.
When a request comes from the customer to get the number of unique users within a particular time range, we just need to fetch the sets for that range and merge them, as sketched below.
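A minimal sketch of that query path in Python, assuming a hypothetical fetch_user_sets(post_id, start_ts, end_ts) helper that loads the per-minute sets from the table:

```python
def count_unique_viewers(post_id, start_ts, end_ts):
    """Naive unique-viewer count: union the per-minute sets and count them."""
    unique_users = set()
    # fetch_user_sets is a hypothetical accessor over the mapping table;
    # it yields one set of userIds per (postId, timestamp) row in the range.
    for user_set in fetch_user_sets(post_id, start_ts, end_ts):
        unique_users |= user_set  # set union deduplicates repeat viewers
    return len(unique_users)
```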
This solution works well for a simple application that does not receive much traffic. However, as traffic starts to increase, it becomes a bottleneck. Let us see how:
If an ad or post goes viral and gets thousands or millions of user impressions, merging the sets becomes really expensive. We would have to load all the sets into memory and then perform the set operations, which is extremely taxing on the CPU. This only gets worse as more and more posts start attracting unique viewers.
Putting this in terms of numbers: say one post was viewed by over 1 million unique users. If we had to store 1 million unique user IDs, and each user ID is 8 bytes long, we would need 8 megabytes of memory just to count the unique users for a single post!
To address this specific issue, we will dive deep into a new data structure.
HyperLogLog
HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. As a probabilistic data structure, HyperLogLog trades perfect accuracy for efficient space utilisation.
The magic of this algorithm is that you no longer need an amount of memory proportional to the number of items counted; instead, you use a constant amount of memory: 12 KB in the worst case, or a lot less if your HyperLogLog (we will just call them HLLs from now on) has seen very few elements.
We can use the HyperLogLog implementation from Redis, which uses up to 12 KB and provides a standard error of 0.81%. Redis gives us three basic commands for HyperLogLog that we can use for our use case.
PFADD: adds an item to a HyperLogLog.
PFCOUNT: returns an estimate of the number of items in the set.
PFMERGE: combines two or more HyperLogLogs into one.
When we store the data in Redis, we can use postId_timestamp as our key and, as the value, a HyperLogLog of the users who viewed the post at that timestamp. So when a user views a post at a particular time, we add the user to the HyperLogLog using the PFADD command. Whenever there is a request for the number of unique users for a post over a particular time range, we fetch the HLLs for that range, do a PFMERGE to combine the two or more HLLs, and then, using the PFCOUNT command on the merged HLL, return the number of users.
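A minimal sketch of this flow using the redis-py client (the key format follows the postId_timestamp convention above; the record_view and unique_viewers function names are my own, for illustration only):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def record_view(post_id: int, minute_ts: int, user_id: str) -> None:
    """Add a viewer to the per-minute HyperLogLog for this post."""
    r.pfadd(f"{post_id}_{minute_ts}", user_id)

def unique_viewers(post_id: int, minute_timestamps: list[int]) -> int:
    """Estimate unique viewers across a range of per-minute HLLs."""
    keys = [f"{post_id}_{ts}" for ts in minute_timestamps]
    merged_key = f"{post_id}_merged"
    # PFMERGE the per-minute HLLs into one HLL, then PFCOUNT the result.
    r.pfmerge(merged_key, *keys)
    return r.pfcount(merged_key)

# Example: user u42 views post 1 at minute 1722823867
record_view(1, 1722823867, "u42")
print(unique_viewers(1, [1722823867, 1722823927]))
```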
This application answers these questions:
How many unique visits has this page had on this day?
How many unique users have played this song?
How many unique users have viewed this video?
So our system would look something like this. Events related to user views are sent to Amazon Kinesis. Those events are then picked up by the counting service, which interacts with a rule service to check whether the view should be counted, based on certain rules. One reason we may not count an event is if it is the result of repeat views from the same user over a short period of time.
If the event is marked for counting, the counting service first checks whether an HLL counter already exists in Redis for the post corresponding to the event. If the counter is already in Redis, the counting service makes a PFADD request to Redis for that post. If the counter is not already in Redis, the counting service makes a request to persist the HLL counters and the raw count numbers, and makes a SET request to Redis to add the filter.
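A rough sketch of that decision logic (the rules and persistence objects and their method names are placeholders, not the actual service interfaces):

```python
def handle_view_event(event, r, rules, persistence):
    """Process one view event pulled from the stream (e.g. Kinesis)."""
    if not rules.should_count(event):
        return  # e.g. a repeat view from the same user within a short window

    key = f"{event.post_id}_{event.minute_ts}"
    if r.exists(key):
        # The HLL counter already lives in Redis: just add this user to it.
        r.pfadd(key, event.user_id)
    else:
        # No counter yet: persist a new HLL plus the raw count durably,
        # then SET the freshly created counter bytes into Redis.
        hll_bytes = persistence.create_counter(event.post_id, event.minute_ts, event.user_id)
        r.set(key, hll_bytes)
```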
This data is then fetched by the analytics service to show the information about unique impressions to the user.
Problems with this system?
The current system looks fine, but there are a few gaps in it. Please take a minute to think about what those gaps might be.
If you thought about Redis storing the data in memory, and the possibility of data loss when Redis goes down, you are partly right. But if Redis goes down, there are mechanisms by which the data can be loaded back into memory.
The real issue with Redis in this system is cost. Since Redis stores everything in memory, we would spend a lot of money just keeping the data in Redis.
To address this, we need a data store where we can also keep our older data at a much lower cost.
With the addition of DynamoDB, our system looks like this: the counting service checks for HLLs in Redis, and if the HLL exists in Redis, it updates or reads the data there. If the HLL is not present in Redis, the service checks for it in DynamoDB, loads it into Redis, and then returns or updates the value. If the HLL does not exist in either Redis or DynamoDB, a new HLL is created and stored in Redis. Writes to DynamoDB are batched in n-second groups per post in order to avoid overloading the cluster.
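A simplified read-through sketch of that tiering, with hypothetical dynamo_get and dynamo_batch_write helpers standing in for the DynamoDB calls:

```python
def get_or_create_hll(key, r, dynamo_get):
    """Locate the HLL for a postId_timestamp key, preferring Redis over DynamoDB."""
    if r.exists(key):
        return key  # hot path: counter is already in Redis

    stored = dynamo_get(key)  # hypothetical DynamoDB lookup by key
    if stored is not None:
        r.set(key, stored)  # warm Redis with the persisted HLL bytes
    else:
        r.pfadd(key)  # PFADD with no elements creates an empty HLL
    return key

def flush_to_dynamo(r, keys, dynamo_batch_write):
    """Periodically persist hot counters; in practice these writes are
    batched per post in n-second groups to avoid overloading the cluster."""
    items = {k: r.get(k) for k in keys}  # raw HLL bytes stored as Redis strings
    dynamo_batch_write(items)
```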
Conclusion
In this blog, I have tried to cover how we can build a system that answers the following questions in a cost-effective way.
How many unique visits has this page had on this day?
How many unique users have played this song?
How many unique users have viewed this video?
Using HLL this way was a new experience for me as well; this kind of system is extensively used by Reddit and other platforms.
References
https://www.redditinc.com/blog/view-counting-at-reddit/