Unlike iOS (really just Safari), Android has no content blocking API. Tracking protection is available in some browsers, e.g. Firefox in combination with addons (and also in Firefox’s private browsing which includes tracking protection enabled by default). For fun, we decided to look into whether it’s possible to provide Tracking Protection when using Android’s default WebView implementation. This blog post describes how that was done, and explores some of the implementation details of our URL matching algorithm.
It turns out that Firefox Focus on iOS also had to build their own URL matching implementation: iOS content blocking is currently only available in Safari, and not in the iOS WebView equivalent. That implementation was influenced by the design of iOS’s content blocking APIs and file formats, but when you’re not subject to that restriction it’s possible to build a faster approach. My ignorance of that version therefore wasn’t necessarily a bad thing, as I’ll describe later in this post.
Why would you want to do this? One reason is that browser engines are large – and we wanted to see whether it’s possible to build a privacy focused browser whose size measures in megabytes instead of tens of megabytes – which would require reusing whatever engine the platform provides (in the case of iOS you actually have no choice in the matter; fortunately Android is a little more free). There are actually some drawbacks to using platform-provided browser engines – which will be the topic of a future post – but it’s certainly possible to implement tracking protection on top of Android’s WebView.
Tracking Protection Lists
Firefox and Focus use the Disconnect tracking protection lists: these are lists of domains hosting trackers that should be blocked, categorised by tracker type, e.g. Social trackers, Analytics Trackers, Advertising Trackers, etc. Further to this there’s an override “entity” list, which unblocks domains that are owned by a given company whenever you are browsing a site owned by that company. (E.g. if FooBar Tracker Corp owns both foo.com and bar.com, we would allow loading of resources from bar.com while browsing foo.com, even though we’d block all other sites from loading resources from bar.com.) You can read more about these lists at the repo where the Mozilla copies of these lists are maintained.
As such, tracking protection is fairly simple: every time a given webpage requests a resource, we match the resource URL’s host against the blocklist. If it’s blocked, we check the entitylist to verify whether there’s an override in place for the current site. Android’s WebView provides a callback that is called every time it wants to load a resource, allowing you to override resource loading.
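The decision flow described above can be sketched in plain Java. This is only an illustration, not the code Focus ships: the class name `BlockDecision`, the use of plain sets and maps, and exact-host matching are all assumptions made to keep the example short. On Android, a check like this would presumably be called from `WebViewClient.shouldInterceptRequest`.

```java
import java.net.URI;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: decides whether a resource request should be blocked. */
final class BlockDecision {
    // Assumption: exact host names only, no subdomain handling.
    private final Set<String> blockedHosts;
    // Assumption: page host -> hosts that pages on that host may always load.
    private final Map<String, Set<String>> entityOverrides;

    BlockDecision(Set<String> blockedHosts, Map<String, Set<String>> entityOverrides) {
        this.blockedHosts = blockedHosts;
        this.entityOverrides = entityOverrides;
    }

    boolean shouldBlock(String pageUrl, String resourceUrl) {
        String resourceHost = URI.create(resourceUrl).getHost();
        String pageHost = URI.create(pageUrl).getHost();
        // Step 1: anything that is not on the blocklist loads normally.
        if (resourceHost == null || !blockedHosts.contains(resourceHost)) {
            return false;
        }
        // Step 2: a blocklisted host is still allowed if the entity list
        // contains an override for the site currently being browsed.
        Set<String> allowed = pageHost == null ? null : entityOverrides.get(pageHost);
        return allowed == null || !allowed.contains(resourceHost);
    }
}
```

With `bar.com` on the blocklist and an override for pages on `foo.com`, `shouldBlock("https://foo.com/", "https://bar.com/t.js")` returns false, while the same resource requested from any other site is blocked.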
The iOS content blocking API actually allows for regex-based matching on the entire resource URL, which is more complex than what we needed for basic tracking protection. The Disconnect lists only work using domains/hosts, which simplifies the implementation somewhat. Focus on iOS originally only supported the content blocking API, and added the browser later – the browser implementation therefore simply reused the same bundled list format. The content blocking lists aren’t used for iOS’s WebView equivalent, although that is apparently changing.
Implementing URL matching
The simple (but not particularly efficient) method would be to iterate over the list of hosts every time a resource is fetched. In fact, we could just iterate over the regexes in the iOS content blocking lists, and check those directly to avoid implementing our own matching.
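For reference, a minimal sketch of that simple approach might look like the following. It uses a plain suffix comparison rather than the regexes from the iOS lists, and the class name `LinearMatcher` is made up for this example.

```java
import java.util.List;

/** Hypothetical baseline: linear scan over every blocked host. */
final class LinearMatcher {
    private final List<String> blockedHosts;

    LinearMatcher(List<String> blockedHosts) {
        this.blockedHosts = blockedHosts;
    }

    /** Compares the host against every list entry on each call. */
    boolean isBlocked(String host) {
        for (String blocked : blockedHosts) {
            // Match the blocked host itself, or any subdomain of it.
            if (host.equals(blocked) || host.endsWith("." + blocked)) {
                return true;
            }
        }
        return false;
    }
}
```

Every lookup touches every entry in the list, which is what makes this approach expensive once the list grows large.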
The original Android implementation was actually a rushed afternoon (or two) hacky proof of concept from our December All Hands – it turned out to be robust and fast enough, so it was kept beyond that time. It might be possible to build an even faster implementation, but this one hasn’t provoked any user complaints yet.
As mentioned, iterating over the list of blocked hosts is expensive: O(nh), where n is the number of blocked hosts (very large) and h is the host length (small). Fortunately at some point or another I had learned about Tries (contrary to what some might assume, an Information and Computer Engineering degree at my alma mater doesn’t actually involve any Data Structures and Algorithms – but that’s nothing a little independent study can’t quickly fix).
Those offer much smaller memory consumption (not that memory consumption is particularly significant compared to what a web engine will need), and much faster lookup [O(h)]:
(In reality, the Trie possibly consumes more memory because of the overhead of each node being an object. More efficient representations are available in order to avoid one node per character, but that didn’t seem worthwhile given that this implementation is already performant enough.)
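To make the idea concrete, here is a minimal sketch of a trie keyed on reversed hostnames, with one node per character as described above. It is not the Focus implementation: the class name `HostTrie` is hypothetical, the search is iterative rather than recursive, and it already includes the check that a match must end at a full stop, which comes up again later in this post.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reversed-hostname trie; one node per character. */
final class HostTrie {
    private final Map<Character, HostTrie> children = new HashMap<>();
    private boolean terminal; // true if a blocked host ends at this node

    /** Inserts a host, walking it from the last character to the first. */
    void insert(String host) {
        HostTrie node = this;
        for (int i = host.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(host.charAt(i), c -> new HostTrie());
        }
        node.terminal = true;
    }

    /** Returns true if host, or a parent domain of host, was inserted. */
    boolean matches(String host) {
        HostTrie node = this;
        for (int i = host.length() - 1; i >= 0; i--) {
            node = node.children.get(host.charAt(i));
            if (node == null) {
                return false; // no stored host shares this suffix
            }
            // A stored host only counts if we have consumed the whole
            // string or stopped at a '.' separator.
            if (node.terminal && (i == 0 || host.charAt(i - 1) == '.')) {
                return true;
            }
        }
        return false;
    }
}
```

A lookup visits at most one node per character of the host being checked, so its cost depends on h and not on how many hosts were inserted.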
There’s still a bunch of overhead in various places: we’re using the Android/Java URL classes to extract the hostname from the resource URL, which could well be more costly than the actual act of searching the tree. I haven’t measured in detail yet.
(Building this completed the bi-yearly cycle of proper Data Structures and Algorithms construction – I’d last been able to build some trees for a bookmarks folder UI the preceding summer.)
As mentioned above, there’s also the entitylist: this consists of sets of hosts (A), for which another set of hosts (B) is whitelisted (usually those sets would be the same, but that isn’t guaranteed or necessary). This is simply an extension of the same tree: the set of whitelisted domains (B) is another Trie. That Trie is then attached to every node representing one of the whitelisted domains (A) – we simply extend the default Node to have a WhitelistNode, which has a reference to the whitelisted-domains Trie.
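A rough sketch of that structure follows. For brevity it folds the WhitelistNode idea into a nullable field on a single node class instead of using a subclass, and all names (`EntityTrie`, `addEntity`, `isWhitelisted`) are assumptions for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical entity-list trie: set-A nodes point at a trie of set-B hosts. */
final class EntityTrie {
    private final Map<Character, EntityTrie> children = new HashMap<>();
    private boolean terminal;
    private EntityTrie whitelist; // non-null only where a set-A host ends

    /** Inserts a reversed host and returns the node where it ends. */
    private EntityTrie put(String host) {
        EntityTrie node = this;
        for (int i = host.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(host.charAt(i), c -> new EntityTrie());
        }
        node.terminal = true;
        return node;
    }

    /** Finds the node of a stored host equal to, or a parent domain of, host. */
    private EntityTrie find(String host) {
        EntityTrie node = this;
        for (int i = host.length() - 1; i >= 0; i--) {
            node = node.children.get(host.charAt(i));
            if (node == null) {
                return null;
            }
            if (node.terminal && (i == 0 || host.charAt(i - 1) == '.')) {
                return node;
            }
        }
        return null;
    }

    /** Pages on any of pageHosts (A) may load any of allowedHosts (B). */
    void addEntity(String[] pageHosts, String[] allowedHosts) {
        EntityTrie allowed = new EntityTrie();
        for (String host : allowedHosts) {
            allowed.put(host);
        }
        for (String host : pageHosts) {
            put(host).whitelist = allowed; // one shared trie for all of set A
        }
    }

    boolean isWhitelisted(String pageHost, String resourceHost) {
        EntityTrie page = find(pageHost);
        return page != null && page.whitelist != null
                && page.whitelist.find(resourceHost) != null;
    }
}
```

The whitelist trie is built once per entity and shared by every set-A node, so the override check is two trie walks: one for the current page’s host, one for the resource’s host.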
Every real project needs its own String implementation
Searching and inserting into our hostname tries involves walking strings backwards. That would require either some annoying index arithmetic, or reversing the String before insertion/search (i.e. creating a copy of the String). Neither of those sounded like fun, so I decided to add a String wrapper. This is arguably completely unnecessary, but made things a little simpler (and perhaps more efficient). The String wrapper also meant that the Trie implementation didn’t need to have much knowledge about subdomains either: we can just start at the start of our reversed String. (Because we need to correctly match subdomains, but not other domains, the Trie still needs to be aware of the full stop being used for domain separation, so it isn’t completely domain agnostic.)
We only need to access the String character by character, which is why we can avoid a complete string copy/reversal – if this weren’t the case, there would be little value in a wrapper.
The wrapper takes care of index arithmetic for reversed strings – and implements support for getChar(int) and substring(int). That’s pretty much all there was to FocusString. (I no longer need to miss the amazing days of many C++ string classes…)
Somewhat naively, I’d assumed that our Java implementation doesn’t create a copy when calling String.substring() – in other words that it would just adjust internal indexes while reusing the same String buffer and/or equivalent behaviour. Without that assumption, there would be little point in avoiding a String copy on reversal, since – thanks to our recursive Trie traversal – we’d be creating copies when traversing that Trie.
It turns out that assumption was wrong: it was true for Java 6, and also for earlier versions of Java 7 – before changing in Java 7u6. I don’t really know where Android’s implementation originates, but it also creates copies. Thus, FocusString was expanded to include offsets, and FocusString.substring() merely fiddles those offsets.
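The offset-based approach can be sketched as below. This is an illustration of the technique rather than the real FocusString: the class name `ReversedString` and its method names are assumptions, chosen to mirror `java.lang.String`.

```java
/** Hypothetical reversed view over a String; never copies the characters. */
final class ReversedString {
    private final String source;
    private final int offset; // reversed characters already skipped

    ReversedString(String source) {
        this(source, 0);
    }

    private ReversedString(String source, int offset) {
        this.source = source;
        this.offset = offset;
    }

    int length() {
        return source.length() - offset;
    }

    /** Character at the given position of the reversed view. */
    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        // Index 0 of the view is the last character of the source.
        return source.charAt(source.length() - 1 - offset - index);
    }

    /** View starting at beginIndex; shares the same backing String. */
    ReversedString substring(int beginIndex) {
        if (beginIndex < 0 || beginIndex > length()) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        return new ReversedString(source, offset + beginIndex);
    }
}
```

For `new ReversedString("bar.com")`, `charAt(0)` is `'m'`, and `substring(4)` is a three-character view over `"rab"` that allocates only the small wrapper object, not a new character buffer.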
It was hard to predict what the impact of this change might be in advance, since I didn’t have much experience in this area, but it turned out to be a noticeable improvement: on my fairly modern Nexus 6P, average URL matching time dropped by about 20% – from approximately 1.2ms to 1.0ms (these numbers are for debug builds with code coverage enabled – for coverage-free debug builds it dropped from 0.42ms to 0.26ms, which is even more significant). We already had tests in place which helped verify that things wouldn’t break, so this was a fairly low risk change (I did use this as an opportunity to extend those tests though).
As mentioned above, the iOS equivalent implementation is a lot simpler. It iterates over the lists of hosts, and does regex matching for each host. I decided to port that implementation to Android, primarily to check for consistency of results. Fortunately the Trie based implementation was mostly correct, except for our subdomain matching. Both bar.com and foo.bar.com should be blocked if bar.com is in the blocklist. My Trie based implementation also blocked foobar.com. Oops. That was a quick fix, albeit one which required making the Trie search implementation hostname aware. Other than that, results have been the same in our testing.
These parallel implementations allowed for performance comparisons. (Note: the underlying regex and other library implementations on each platform might be different, so the results could be very different if both algorithms were running on an iPhone.) On my N6P, the Trie based implementation took an average of 0.3ms per resource URL check, while the ported iterative/regex approach took 42ms. Some pages like to load a lot of resources – so that’s a difference you’d notice quickly. It’s possible that my ported implementation was suboptimal, but it’s certainly clear that the Trie based approach was worth it from a performance perspective.
To be fair, this implementation did take more work – and you have to remember that the iOS implementation was influenced by the blocklist file format that iOS uses for its tracking protection API, whereas the Android version was a clean-sheet design.
Trie Diagram corrected on 10th May 2017, thank you to Gervase Markham for spotting the mistake.