Assignment: Deep Packet Inspection with Bloom Filters

Overview
Tasks
Live Demo

Background

Deep Packet Inspection (DPI) is a technique used by firewalls, intrusion detection systems (IDS), and web filters to examine the contents of network packets in real time. A critical challenge in DPI is speed: a firewall must decide in microseconds whether to allow or block a connection.

A Bloom filter is a probabilistic data structure that answers membership queries ("is this URL in the blocklist?") in constant time O(1) and with very low memory, making it ideal as a fast first-pass filter in a DPI pipeline.

A Bloom filter may produce false positives (flagging a safe URL as malicious) but never false negatives (it will never miss a truly malicious URL). This asymmetry is critical for security applications.

How a Bloom filter works

  1. Allocate a bit array of size m, all initialised to 0.
  2. Choose k independent hash functions, each mapping a URL to an index in [0, m).
  3. Insert: Run the URL through all k hash functions; set each resulting bit to 1.
  4. Query: Run the URL through all k hash functions; if all bits are 1 → "possibly malicious"; if any bit is 0 → "definitely safe".

Key parameters

SymbolMeaningTrade-off
mBit array sizeLarger → lower false positive rate, more memory
kNumber of hash functionsOptimal k = (m/n) ln 2
nNumber of inserted elementsMore elements → higher false positive rate
pFalse positive probabilityp ≈ (1 − e^(−kn/m))^k

Part 1 — Implement the Bloom Filter 

Implement a BloomFilter class in Python with the following interface:

class BloomFilter:
    def __init__(self, m: int, k: int):
        """
        Initialise the filter.
        m: size of the bit array
        k: number of hash functions
        """
        ...

    def _hash(self, item: str, seed: int) -> int:
        """
        Return an index in [0, m) for the given item and seed.
        Use a deterministic hash (e.g., hashlib or mmh3).
        """
        ...

    def add(self, item: str) -> None:
        """Set k bits for item."""
        ...

    def __contains__(self, item: str) -> bool:
        """
        Return True if all k bits are set (possible match),
        False if any bit is 0 (definite non-member).
        """
        ...

    def false_positive_rate(self, n: int) -> float:
        """
        Calculate the theoretical FP rate given n items inserted.
        Formula: (1 - e^(-k*n/m))^k
        """
        ...

Hint Use mmh3 (MurmurHash3) or hashlib.sha256 with different seeds. Different seeds must produce independent-looking distributions.

Part 2 — Load a Malicious URL Database 

Download a real-world blocklist (we recommend the URLhaus dataset from abuse.ch, which is free and updated daily).

import csv, urllib.request

URL = "https://urlhaus.abuse.ch/downloads/csv_recent/"

def load_malicious_urls(path_or_url: str) -> list[str]:
    """
    Parse the URLhaus CSV format.
    Return a list of URL strings (skip comment lines starting with #).
    """
    ...

def optimal_params(n: int, target_fp: float = 0.01) -> tuple[int, int]:
    """
    Given n items and a desired false positive rate,
    compute optimal m and k.
    m = -n * ln(p) / (ln 2)^2
    k = (m/n) * ln 2
    """
    ...

Your script must print:

Part 3 — Test and Evaluate

Write a test suite covering the following:

  1. No false negatives: every URL that was inserted must be found by __contains__. Test with all loaded URLs.
  2. Measured FP rate: generate 10,000 random URLs that are not in the database; count how many are incorrectly flagged. Compare measured rate to theoretical prediction.

This interactive demo lets you experience a Bloom filter with m=32 bits and k=3 hash functions. Pre-loaded with a sample blocklist. Use it to build intuition before coding.

Bit array (m=32)

1
1
1
0
1
0
0
0
0
0
1
1
0
0
1
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
1
1
5
URLs loaded
12
Bits set
5.24%
Theor. FP rate

Query a URL

Add to blocklist

Sample blocklist (pre-loaded)