Deep Packet Inspection (DPI) is a technique used by firewalls, intrusion detection systems (IDS), and web filters to examine the contents of network packets in real time. A critical challenge in DPI is speed: a firewall must decide in microseconds whether to allow or block a connection.
A Bloom filter is a probabilistic data structure that answers membership queries ("is this URL in the blocklist?") in constant time O(1) and with very low memory, making it ideal as a fast first-pass filter in a DPI pipeline.
A Bloom filter may produce false positives (flagging a safe URL as malicious) but never false negatives (it will never miss a truly malicious URL). This asymmetry is critical for security applications.
| Symbol | Meaning | Trade-off |
|---|---|---|
m | Bit array size | Larger → lower false positive rate, more memory |
k | Number of hash functions | Optimal k = (m/n) ln 2 |
n | Number of inserted elements | More elements → higher false positive rate |
p | False positive probability | p ≈ (1 − e^(−kn/m))^k |
Implement a BloomFilter class in Python with the following interface:
class BloomFilter:
def __init__(self, m: int, k: int):
"""
Initialise the filter.
m: size of the bit array
k: number of hash functions
"""
...
def _hash(self, item: str, seed: int) -> int:
"""
Return an index in [0, m) for the given item and seed.
Use a deterministic hash (e.g., hashlib or mmh3).
"""
...
def add(self, item: str) -> None:
"""Set k bits for item."""
...
def __contains__(self, item: str) -> bool:
"""
Return True if all k bits are set (possible match),
False if any bit is 0 (definite non-member).
"""
...
def false_positive_rate(self, n: int) -> float:
"""
Calculate the theoretical FP rate given n items inserted.
Formula: (1 - e^(-k*n/m))^k
"""
...
Hint Use mmh3 (MurmurHash3) or hashlib.sha256 with different seeds. Different seeds must produce independent-looking distributions.
Download a real-world blocklist (we recommend the URLhaus dataset from abuse.ch, which is free and updated daily).
import csv, urllib.request
URL = "https://urlhaus.abuse.ch/downloads/csv_recent/"
def load_malicious_urls(path_or_url: str) -> list[str]:
"""
Parse the URLhaus CSV format.
Return a list of URL strings (skip comment lines starting with #).
"""
...
def optimal_params(n: int, target_fp: float = 0.01) -> tuple[int, int]:
"""
Given n items and a desired false positive rate,
compute optimal m and k.
m = -n * ln(p) / (ln 2)^2
k = (m/n) * ln 2
"""
...
Your script must print:
m and k for 1% FP rateset would require (estimate)Write a test suite covering the following:
__contains__. Test with all loaded URLs.This interactive demo lets you experience a Bloom filter with m=32 bits and k=3 hash functions. Pre-loaded with a sample blocklist. Use it to build intuition before coding.