Storing and processing vast amounts of data is a big challenge. Whether you are tracking user activity, monitoring system health across thousands of devices, or analyzing high-frequency financial transactions, you are dealing with millions of records, all of which must be processed efficiently. A traditional row-based storage system could buckle under the strain and incur heavy costs.
Bitmaps offer a powerful solution to this challenge by allowing you to compress data into minimal memory and enabling fast and efficient analysis.
In this blog post, we’ll explore the advantages of using bitmap operations for large-scale data analytics and demonstrate how they can significantly reduce memory usage and compute time. We’ll also discuss how this methodology can be integrated with Databricks and show how complex queries can be easily handled with minimal overhead. By the end, you'll see how bitmaps can be a game-changer in various scenarios that require large-scale data processing.
Real-World Scenario: Campaign Tracking
Let's assume you have a user base of 3 million users. You run four campaigns every month, and for each campaign, you have to track the following:
- Opens (Did the user open the message?)
- Clicks (Did the user click a specific link?)
- Video Started (Did the user start watching a video?)
- Video Completed (Did the user watch the entire video?)
You want to track these events for each user, which means:
4 campaigns per month × 3 million users × "X" data points per campaign, where X represents the number of tracked items (like opens, clicks, video completions, etc.).
Storing this data using traditional methods would result in massive tables with rows upon rows of data, with each user’s interaction stored separately. It would mean millions of rows per campaign, which is costly both in terms of storage and processing time.
Bitmap to the Rescue!
Bitmaps provides an incredibly efficient alternative. Instead of storing 3 million rows (one per user) for each interaction, you can store a single bitmap per event, where each bit in the bitmap represents a user.
- Message1_Click would have a bitmap, where each bit indicates whether the respective user clicked the message or not.
- Video1_Completed would have a bitmap, where each bit indicates if the user completed the video.
This way, you're not storing individual records for each user but simply flipping bits in a bitmap to represent user actions.
Why Bitmaps Are Efficient
- Memory Efficiency: A bitmap uses just 1 bit to track each user’s interaction with a specific item. So, for 3 million users, a bitmap would only require ~375KB of memory per campaign event, which is significantly less than storing rows of data.
- Fast Querying: With bitmap operations, you can quickly calculate aggregates and run queries on combinations of events like: Message1_Click_AND_Video1_Completed (users who clicked on Message1 and completed watching the video).
- Scalability: Bitmaps can scale seamlessly with large user bases and multiple campaigns, reducing database I/O overhead.
Item-Centric Approach Vs User-Centric Approach
There are two main approaches for tracking events like email opens across millions of users with bitmaps: item-centric and user-centric.
Item-Centric Approach
Here, the focus is on individual items such as an email or message. A single bitmask is maintained for each item, where each bit corresponds to a unique user ID. This approach is more space-efficient when you are dealing with millions of users. If a user interacts with the item (for example, opens an email), the bit corresponding to that user ID is set to 1. You can quickly compute the total number of unique interactions by counting the number of set bits in the bitmask. Since bitmask is stored per item, it’s easy to aggregate or analyze interaction data at the item level (for example, total opens for Email 1).
User-Centric Approach
As the name suggests, the user-centric approach focuses on each user. A bitmask is maintained for each user, where each bit represents an interaction with a particular item. If a user opens Email 1, the corresponding bit for that email is set in the user’s bitmask. This method is more suitable when the analysis is centered around users and their overall behavior across multiple items. It is, however, less efficient at storing large datasets.
Item-Centric Bitmap Management with Redis
Let's explore how Redis bitmap operations work with an item-centric approach, where each user's unique ID is represented by a bit in the bitmap. Using specific user IDs (8, 10, 15, 18, and 20), we'll demonstrate how to set, check, and combine user actions. We’ll also discuss the BITCOUNT and complex bitwise operations for analyzing user behavior across multiple actions.
1. Setting a Bit for a User Interaction
To track if a user performed an action (such as clicking on a message), we set the corresponding bit in the bitmap. In Redis, each bit position in the bitmap represents a user by their unique ID.
Example: Users who clicked on Message1
We have 5 users with the following IDs:
- User ID 8
- User ID 10
- User ID 15
- User ID 18
- User ID 20
To set these bits in Redis, where a 1 indicates that the user clicked the message, you would use the following SETBIT command:
SETBIT Message1_Click 8 1 # User ID 8 clicked
SETBIT Message1_Click 10 1 # User ID 10 clicked
SETBIT Message1_Click 15 1 # User ID 15 clicked
SETBIT Message1_Click 18 1 # User ID 18 clicked
SETBIT Message1_Click 20 1 # User ID 20 clicked
This command sets the bit at the positions corresponding to each user ID to 1.
- Message1_Click: The key in Redis for the bitmap tracking clicks on Message1.
- 8, 10, 15, 18, 20: These positions in the bitmap represent the respective user IDs.
- 1: This indicates that the user performed the action (clicked on the message).
The resulting bitmap would look like this from positions 0 to 20:
000000001010000100101
2. Checking if a User Interacted with an Item
To check whether a user interacted with an item, you can query the bitmap using the GETBIT command. For example, to check if User ID 10 clicked on Message1:
GETBIT Message1_Click 10
- If the result is 1, it means the user clicked on the message.
- If the result is 0, it means the user did not click.
You can perform this check for any user ID, for example:
GETBIT Message1_Click 15 # Check if User ID 15 clicked on the message
3. Combining User Actions Across Multiple Items
You can combine bitmaps to identify users who performed multiple actions. For example, if you're also tracking users who completed watching a video in Message2, you can use the BITOP command to perform a bitwise AND between the two bitmaps to find users who performed both actions.
Example: Users who clicked on Message1 and completed watching a video in Message2
Assume we have another bitmap called Video2_Completed that tracks video completions for the same 5 users. Here are the positions for video completions:
- User ID 8: watched the video → Set the bit at position 8 to 1.
- User ID 18: watched the video → Set the bit at position 18 to 1.
SETBIT Video2_Completed 8 1 # User ID 8 completed the video
SETBIT Video2_Completed 18 1 # User ID 18 completed the video
Now, we can use the BITOP command to perform a bitwise AND between the two bitmaps (Message1_Click and Video2_Completed):
BITOP AND result Message1_Click Video2_Completed
This creates a new bitmap called result where only the bits representing users who clicked on Message1 and completed watching the video in Message2 are set to 1. Here’s what happens at each position:
- Position 8: Both bitmaps have 1 → The result bitmap will have 1.
- Position 10: Only the click bitmap has 1 → The result bitmap will have 0.
- Position 15: Only the click bitmap has 1 → The result bitmap will have 0.
- Position 18: Both bitmaps have 1 → The result bitmap will have 1.
- Position 20: Only the click bitmap has 1 → The result bitmap will have 0.
The resulting bitmap will be:
000000001000000100000
This bitmap shows that Users 8 and 18 clicked on Message1 and completed watching the video in Message2.
4. Using BITCOUNT for Analysis
You can use the BITCOUNT command to count how many users performed an action or to count the results of a combination of actions.
Example: Counting users who clicked on Message1
To count how many users clicked on Message1, you would run:
BITCOUNT Message1_Click
For the current example, the result will be 5, since five users (with IDs 8, 10, 15, 18, and 20) clicked on Message1.
Example: Counting users who clicked on Message1 and completed watching the video in Message2
After performing the BITOP AND operation, you can count how many users performed both actions by running:
BITCOUNT result
In this case, the result will be 2, as only Users 8 and 18 performed both actions.
5. Complex Queries with Redis Bitmap Operations
Bitmap operations in Redis make it easy to run complex queries on user behavior. For example, you might want to find users who clicked on Message1 but did not complete the video in Message2. You can achieve this with a combination of the AND and NOT operations.
Example: Users who clicked on Message1 but did not complete the video in Message2
First, you would invert the Video2_Completed bitmap:
BITOP NOT Video2_NotCompleted Video2_Completed
This creates a new bitmap (Video2_NotCompleted) where:
- Bits are 1 for users who did not complete the video.
- Bits are 0 for users who completed the video.
Next, perform a bitwise AND between the Message1_Click bitmap and the Video2_NotCompleted bitmap to find users who clicked on the message but did not complete the video:
BITOP AND result Message1_Click Video2_NotCompleted
This result bitmap will contain 1 for users who clicked on Message1 but did not complete the video in Message2.
In our scenario, this query would yield:
000000001010000000100
This means Users 10, 15, and 20 clicked on Message1 but did not complete the video in Message2. Using BITCOUNT on the "result" will return the number of users who meet the specified criteria.
BITCOUNT result # Result will be 3
IMPORTANT: Bitmaps can be easily divided across multiple keys, which is particularly useful for sharding large datasets and avoiding the performance issues that come with handling massive keys. A simple strategy for splitting a bitmap involves storing M bits per key. You can determine the key name by calculating bit-number/M and the position of the specific bit within that key using bit-number MOD M. This allows for efficient distribution and management of bitmaps while keeping each individual key size manageable.
Working with Bitmaps in Databricks
In our implementation, we chose to replicate Redis bitmap functionality directly in Databricks as all our analytics logic was already centralized there. This decision helped us avoid the maintenance overhead of managing analytics across two separate systems. The implementation was relatively straightforward, applying core bitmap principles. One challenge we addressed was handling user IDs as UUIDs instead of integers, which is a common scenario.
Below are the key steps we followed to achieve Redis bitmap-like functionality in Databricks:
- Created a persistent table, UserIdMapping, to store the mapping between anonymized UUIDs and integer-based User IDs.
- Implemented custom logic to populate the UserIdMapping table using daily log files containing millions of rows.
- Added a bitmask based on the integer User ID as a new column to all DataFrames used in the analysis.
- Stored the results of bitwise operations as BinaryType, ensuring both memory and compute efficiency.
Conclusion
Bitmap operations are a powerful and efficient way to track and analyze user interactions across vast datasets. Operations such as setting, getting, and performing bitwise operations (for example: AND, OR) enable lightning-fast queries that would otherwise require far more memory and compute resources. Whether used in Redis or another platform, bitmap-based techniques can quickly deliver valuable insights, making them ideal for scenarios with millions of records and complex behavior analysis, all while keeping computational costs low.