Wednesday 26 February 2014

There are only two hard things in Computer Science: cache invalidation and naming things.

It is the first of these which I have recently been attempting and I think Phil Karlton might have a good point.

What are we talking about?

Web browsers are pretty complex beasties, but the basic concept is easy enough to understand. They fetch a bunch of files that make up a web page and render those source files into something suitable for human consumption.

It is also intuitive that fetching all the files that make up a web page, every time you browse to a new page, might be wasteful, especially if most of the files have not changed from the previous page.

To address this, browsers hang onto copies of these files in case your browsing needs them again; this is known as a source file cache. Caches are a widely used technique throughout many aspects of computer technology, but the basic idea is that they trade one resource for another.

In this case we are trading local storage space (memory or disc) for access time. This type of trade may give large benefits due to the large differences in access times (sometimes known as latency) between local and network resources.

For example, if the source files that make up a web page total a Megabyte in size, accessed with a 2013 era PC with a 10Mbit connection to the Internet:
  • Accessing the data from main memory will take approximately 20µs (millionths of a second).
  • The same data retrieved from a hard disc drive will take 4,000µs to arrive, around 200 times longer.
  • If we were to retrieve the data from a fast web server, and assuming the network connection is in perfect working order, the data will roll in 1,200,000µs later, or around 300 times more slowly than disc and an unfathomable 60,000 times slower than memory.
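
As a rough back-of-envelope check on that last figure (my own arithmetic rather than a measurement): a Megabyte is about eight million bits, so the raw transfer alone accounts for most of that time, with connection setup, request round trips and protocol overhead making up the rest.

    8,000,000 bits ÷ 10,000,000 bits/s = 800,000µs of raw transfer time
    + connection setup, request round trips and protocol overhead
    ≈ 1,200,000µs in total
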
From this example it becomes startlingly obvious why a source cache is so desirable. I have greatly simplified the browser's operation here, as browsers usually implement many layers of additional caching in other areas to make the browsing experience subjectively smoother.

And before the observant reader suggests that we could just make the network faster, it ought to be mentioned that, due to fundamental physical constraints like the speed of light, the average practical network reply is unlikely ever to drop much below 150,000µs [1], or some 40 times slower than disc. Caching is still a worthwhile gain.

Knowing the cost of everything and the value of nothing

Having shown the benefits, it might be nice to think there are no downsides to caching. This is not the case: as I mentioned, we are trading storage for time, and as in any trade one must expect there to be overheads.

In the case of a browser we face several obstacles. Firstly, not every file that goes to make up a web page can be cached. The HTTP specification contains an entire section on caching with numerous rules and operations to ensure that only objects that should be cached are, and to determine how long they can be kept before going "stale".

A great deal of this complexity comes from just how desirable caching is to network providers. Many ISPs run web proxy servers to cache requests for all their users, to reduce bandwidth usage or, increasingly, to implement content filtering according to local government censorship rules. The browser's cache must know how to interact with these proxies without getting erroneous results.

Then comes the problem alluded to in the section title. A browser's cache cannot grow indefinitely; eventually a system would run out of memory and disc space if a browser kept every file. To deal with this the cache is limited, sometimes by the number of files it holds, but more commonly by size, both in memory and on disc.

The caching structure used by a web browser is hierarchical in that it will first evict (move) files from memory to a backing store (disc), and when the disc cache size limit is exceeded files are evicted to the next tier, which in this case involves deleting the local copy (invalidating it) and performing a fetch from the network if the file is needed again.
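
As a purely conceptual sketch (the tier names and function below are mine, not NetSurf's), each object can be pictured as occupying one of three tiers, with pressure at one tier pushing objects down to the next:

    /* Conceptual sketch of the eviction hierarchy; illustrative only. */
    enum cache_tier {
        TIER_MEMORY,   /* source data held in RAM */
        TIER_DISC,     /* source data held in the backing store */
        TIER_NETWORK,  /* local copy discarded; must be fetched again */
    };

    /* Demote an object one tier when the tier above exceeds its limit. */
    static enum cache_tier demote(enum cache_tier current)
    {
        switch (current) {
        case TIER_MEMORY:
            /* write the data out to disc and free the RAM copy */
            return TIER_DISC;
        case TIER_DISC:
        default:
            /* delete (invalidate) the local copy entirely */
            return TIER_NETWORK;
        }
    }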

[Image: Tardis at the Beeb by Sarah G, from Flickr]

When the cache size is exceeded, the decision about what to remove from the cache is made using a cache strategy or algorithm. The aim of this decision is to discard the data that will not be required again for the longest time.

If we had a blue box with a madman inside, whom we could persuade to bring our browsing history back from the future, implementing Bélády's algorithm might be possible. Failing a perfect solution we are reduced to making a judgement based on the information available to us.

This involves computing a cost for replacing each entry in the cache and evicting those with the lowest values. The selected algorithm has a large impact on the effectiveness of the cache and often needs fine tuning for the task at hand.
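
As a minimal sketch of that selection step, assume for simplicity that an entry's value is just how recently it was used (effectively a least recently used policy); real strategies weigh rather more information, as discussed below. The structure and names here are illustrative, not NetSurf's.

    #include <stddef.h>
    #include <time.h>

    struct entry {
        time_t last_used;  /* time of this entry's last access */
    };

    /* Return the index of the entry with the lowest value (here simply
     * the least recently used), or -1 when the cache is empty. */
    static int choose_victim(const struct entry *entries, size_t count)
    {
        int victim = -1;

        for (size_t i = 0; i < count; i++) {
            if (victim == -1 ||
                entries[i].last_used < entries[victim].last_used)
                victim = (int)i;
        }
        return victim;
    }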

Never has so little been measured so much

Having established that caching is useful and that we need to make decisions on what to keep it follows that those decisions must be based on information. This information consists of two main groups:
  • A series of metrics about the cache state such as the number of objects and their cumulative size currently being stored in memory.
  • Information about each object, known as metadata; this includes things like the object's individual size, how long until it becomes stale and when it was last used.
Maintaining all of this information is a significant part of the overheads involved with caching and a compromise must be struck between having the necessary information to make good caching choices and the cost of maintaining that data.
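
To make the first group concrete, the cache-wide state might be gathered into something like the following; the field names are mine, not NetSurf's, and the hit and miss counters anticipate the next section.

    #include <stddef.h>
    #include <stdint.h>

    /* Cache-wide metrics consulted when deciding whether to evict. */
    struct cache_metrics {
        size_t object_count;   /* number of objects currently held */
        size_t memory_used;    /* cumulative size held in memory (bytes) */
        size_t memory_limit;   /* configured memory budget (bytes) */
        uint64_t hits;         /* requests answered from the cache */
        uint64_t misses;       /* requests that needed a network fetch */
    };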

Cache metrics

Most cache implementations maintain at least some basic metrics, at the very least how much storage is being used, so that a decision can be made on whether evicting entries from the cache to reclaim space is necessary.

It is also common to keep a record of how well the cache is performing. The cache efficiency (the hit rate) is usually measured as the ratio of successful accesses provided from the cache (a cache hit) to the total number of requests to the cache.

A perfect cache would have all hits and no misses (a hit rate of 1) while using as little storage as possible (but as already noted that requires a special madman). The hit rate can be used to automatically change the behaviour of the eviction algorithm, perhaps using more storage if the hit rate drops too low or changing the eviction frequency.
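
A small sketch of how the hit rate might be computed and used to adjust behaviour; the 0.5 threshold and the ten percent growth are arbitrary choices of mine rather than anything a real browser does.

    #include <stddef.h>
    #include <stdint.h>

    static double hit_rate(uint64_t hits, uint64_t misses)
    {
        uint64_t total = hits + misses;
        return (total == 0) ? 1.0 : (double)hits / (double)total;
    }

    /* Grow the storage budget a little if the cache misses too often. */
    static size_t adjust_limit(size_t limit, uint64_t hits, uint64_t misses)
    {
        if (hit_rate(hits, misses) < 0.5)
            return limit + limit / 10;  /* allow ten percent more storage */
        return limit;
    }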

Metadata

Each object must carry with it the data necessary to regain its state when retrieved from the cache. This information is essential for the correct operation of the cache but in and of itself provides no direct value.

In the case of browser source data this includes all the headers sent in the original network fetch. These headers are themselves just metadata sent along with the object by the web server which must be preserved.

Two particularly useful values for making a good eviction decision are how long it has been since the object was last used and how many times it has been used. This is because an object that was cached a long time ago and then never used again is much less valuable than one that has recently been repeatedly used.
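
Pulling this section together, the per-object metadata and one possible way of weighing those two values might look like the sketch below; the field names and the formula are purely illustrative.

    #include <stddef.h>
    #include <time.h>

    struct object_meta {
        char *headers;       /* headers from the original network fetch */
        size_t size;         /* size of the source data in bytes */
        time_t stale_at;     /* when the object stops being fresh */
        time_t last_used;    /* when the object was last requested */
        unsigned use_count;  /* how many times it has been requested */
    };

    /* Higher score means more worth keeping: frequent use counts in an
     * object's favour and the score decays as the idle time grows. */
    static double keep_score(const struct object_meta *m, time_t now)
    {
        double idle = difftime(now, m->last_used) + 1.0;
        return (double)m->use_count / idle;
    }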

An Implementation

It is all very well talking about the general solution but a full understanding can only be gained by examining a real implementation. The NetSurf cache was chosen as it is a suitably small amount of code but still demonstrates all the major design features.

Interface

The cache deals with all source data and provides the browser with a single unified object request interface regardless of whether the source data can be cached or not. This is useful to hide the implementation details, allowing for changes without affecting the rest of the browser's code.

The implementation effectively consists of two lists of objects: those that can be cached and those that cannot. Each object on these lists is reference counted against one or more users, and while there is at least one user the source data is maintained in memory.

Objects determined to be non-cacheable are always directly fetched from their source and are immediately released once they have no users remaining.
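
A minimal sketch of the reference counting just described, with names and fields of my own choosing rather than those used in NetSurf:

    #include <stdbool.h>
    #include <stdlib.h>

    struct cached_object {
        unsigned users;     /* number of current users of this object */
        bool cacheable;     /* false: release as soon as it is unused */
        void *source_data;  /* the fetched source data */
    };

    static void object_ref(struct cached_object *obj)
    {
        obj->users++;
    }

    static void object_unref(struct cached_object *obj)
    {
        if (--obj->users == 0 && !obj->cacheable) {
            /* non-cacheable objects are released immediately */
            free(obj->source_data);
            obj->source_data = NULL;
        }
        /* cacheable objects stay put for the writeout and cleaning
         * tasks described below to deal with */
    }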

For cacheable objects an attempt is made to fulfil the request in descending order:
  1. from objects already in memory.
  2. from the backing store.
  3. from a network fetch.
Irrespective of the source of the data the remaining operations on the object (determining freshness etc.) are exactly as if the request had been fulfilled from the network. This approach means the use of the cache is transparent to the object users.
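
That lookup order might be sketched as follows; the three fetch functions are placeholders standing in for the real mechanisms rather than actual NetSurf calls.

    #include <stddef.h>

    struct object;  /* opaque for the purposes of this sketch */

    static struct object *lookup_in_memory(const char *url) { (void)url; return NULL; }
    static struct object *lookup_in_backing_store(const char *url) { (void)url; return NULL; }
    static struct object *fetch_from_network(const char *url) { (void)url; return NULL; }

    /* Fulfil a request for a cacheable object, trying the cheapest
     * source first; whichever source provides the data, the caller
     * treats the object exactly as if it had come from the network. */
    static struct object *retrieve(const char *url)
    {
        struct object *obj;

        if ((obj = lookup_in_memory(url)) != NULL)
            return obj;
        if ((obj = lookup_in_backing_store(url)) != NULL)
            return obj;
        return fetch_from_network(url);
    }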

Writeout

Objects are placed in the backing store asynchronously to any other operations. When an object has been fetched from the network a background task is scheduled for when the browser is otherwise idle.

This writeout task constructs a list of all cacheable objects not yet in the backing store, associates a cost value with each object and then proceeds to write those objects to the backing store in highest to lowest value order.
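
A simplified sketch of such a writeout pass; the structures and the value field are illustrative rather than taken from the NetSurf code, and store_write() stands in for whatever actually commits an object to the backing store.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    struct candidate {
        const char *url;   /* key under which the object will be stored */
        const void *data;  /* the source data to write out */
        size_t length;
        double value;      /* higher value: more worth writing out */
        bool in_store;     /* already committed to the backing store */
    };

    static int by_value_desc(const void *a, const void *b)
    {
        const struct candidate *ca = a, *cb = b;
        if (ca->value < cb->value) return 1;
        if (ca->value > cb->value) return -1;
        return 0;
    }

    /* Placeholder for the operation that writes one object out. */
    static void store_write(const struct candidate *c) { (void)c; }

    static void writeout_task(struct candidate *objects, size_t count)
    {
        /* commit the most valuable objects first */
        qsort(objects, count, sizeof(*objects), by_value_desc);

        for (size_t i = 0; i < count; i++) {
            if (!objects[i].in_store)
                store_write(&objects[i]);
        }
    }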

Cleaning

A periodically scheduled task deals with ensuring the cache is cleaned. This consists of destroying stale objects and arranging that fresh objects do not use more memory than the configuration permits.

Memory usage is reduced within the cleaning task by discarding the source data for objects already held in the backing store (where the data can be retrieved relatively cheaply).

Additional memory may be recovered by simply discarding unused objects held only in memory. These objects are usually those which have only recently become unused otherwise the writeout task would have committed them to the backing store.

The most recently used objects are statistically likely to be used again soon, so there is a high risk of a cache miss associated with discarding these objects. In order to mitigate this undesirable effect the cleaning heuristic favours using more memory than configured for short periods rather than discarding from memory.
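
A sketch of a cleaning pass following those steps; the structures, and the decision to leave unused memory-only objects alone until the situation becomes desperate, are illustrative rather than lifted from the NetSurf source.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <time.h>

    struct cache_entry {
        void *source_data;  /* NULL once the in-memory data is discarded */
        size_t size;
        time_t stale_at;    /* time at which the object goes stale */
        bool in_store;      /* a copy exists in the backing store */
        unsigned users;     /* current users; zero means unused */
    };

    static void discard(struct cache_entry *e, size_t *memory_used)
    {
        free(e->source_data);
        e->source_data = NULL;
        *memory_used -= e->size;
    }

    static void clean(struct cache_entry *entries, size_t count,
                      size_t *memory_used, size_t memory_limit)
    {
        time_t now = time(NULL);

        /* first pass: stale objects are destroyed outright */
        for (size_t i = 0; i < count; i++) {
            struct cache_entry *e = &entries[i];
            if (e->source_data != NULL && now >= e->stale_at)
                discard(e, memory_used);
        }

        /* second pass: while over budget, drop in-memory copies of
         * unused objects which are safely held in the backing store */
        for (size_t i = 0; i < count && *memory_used > memory_limit; i++) {
            struct cache_entry *e = &entries[i];
            if (e->source_data != NULL && e->users == 0 && e->in_store)
                discard(e, memory_used);
        }

        /* unused objects held only in memory could also be discarded,
         * but as they are likely to be needed again soon the heuristic
         * prefers to exceed the limit briefly instead */
    }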

Backing store

The backing store is a simple key-value database. Each source data object and its metadata (its headers etc.) is associated with a unique key (the URL used to fetch it). The backing store is simply required to store and retrieve the object data associated with the given URL.

The backing store implementation in NetSurf is pluggable; this flexibility is required for dealing with the greatly varying capabilities and limits of the various systems the browser is required to execute on.

A novel aspect of this backing store interface is that if an implementation returns the wrong object it is not considered an error and is simply treated as a cache miss. This is possible because the metadata contains the URL of the stored object enabling object verification.

This behaviour is permitted because techniques which significantly improve key-value store performance (principally key hashing) become available if they are not required to always give the correct answer.

The backing store is also required to manage its size within the configured limits and deal with any filesystem behaviour details.
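
One way to picture such a pluggable interface is as a small table of operations supplied by each implementation; the operation names here are of my own invention, not NetSurf's actual API.

    #include <stdbool.h>
    #include <stddef.h>

    struct store_ops {
        /* store an object and its metadata under the given URL */
        bool (*store)(const char *url,
                      const void *data, size_t data_len,
                      const void *meta, size_t meta_len);

        /* fetch whatever is held under the URL; the cache verifies the
         * URL recorded in the metadata and treats a mismatch simply as
         * a cache miss rather than an error */
        bool (*fetch)(const char *url,
                      void **data, size_t *data_len,
                      void **meta, size_t *meta_len);

        /* remove an object, for instance to stay within the size limit */
        bool (*invalidate)(const char *url);
    };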

The reference backing store is trivial in that it performs a hash operation on the input URL, encodes the result with base64url encoding and uses that as the object filename. The length of the hash can be configured allowing use of the reference implementation in any situation where using the filesystem as a database is acceptable.

The reference backing store default hash is SHA-1 yielding almost unique [2] 160 bit key values stored in much the same way as the git DVCS. Note we are not using this hash for its cryptographic properties and at worst a collision will result in the cache being a tiny amount less efficient.
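
A much-simplified illustration of turning a URL into a filename in this style; for brevity it uses a 64 bit FNV-1a hash where the reference implementation uses SHA-1, but the base64url step is the same idea.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* 64 bit FNV-1a hash, standing in for SHA-1 to keep the sketch short. */
    static uint64_t fnv1a64(const char *s)
    {
        uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
        while (*s != '\0') {
            h ^= (unsigned char)*s++;
            h *= 0x100000001b3ULL;           /* FNV prime */
        }
        return h;
    }

    /* base64url: like base64 but using '-' and '_' and no padding,
     * which keeps the result safe for use as a filename. */
    static void base64url(const unsigned char *in, size_t len, char *out)
    {
        static const char tbl[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        size_t o = 0;

        for (size_t i = 0; i < len; i += 3) {
            uint32_t v = (uint32_t)in[i] << 16;
            if (i + 1 < len) v |= (uint32_t)in[i + 1] << 8;
            if (i + 2 < len) v |= in[i + 2];

            out[o++] = tbl[(v >> 18) & 0x3f];
            out[o++] = tbl[(v >> 12) & 0x3f];
            if (i + 1 < len) out[o++] = tbl[(v >> 6) & 0x3f];
            if (i + 2 < len) out[o++] = tbl[v & 0x3f];
        }
        out[o] = '\0';
    }

    int main(void)
    {
        const char *url = "http://www.netsurf-browser.org/";
        uint64_t hash = fnv1a64(url);
        unsigned char bytes[sizeof(hash)];
        char name[16];  /* 8 bytes encode to 11 characters plus the NUL */

        memcpy(bytes, &hash, sizeof(bytes));
        base64url(bytes, sizeof(bytes), name);
        printf("%s -> %s\n", url, name);
        return 0;
    }

Running this prints a short, filesystem-safe name derived from the URL, which is then used as the object's filename in the manner described above.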

Conclusion

Hopefully this has shown what a browser source file cache is, why it is useful and given a basic understanding of how one is implemented.

I will admit I have glossed over some of the more challenging aspects especially in relation to actually implementing the cache strategy but I hope that the reader will forgive those omissions of detail in a quest for a little more clarity of the general principles.

I would like to thank John-Mark Bell for implementing the NetSurf cache and Melodie Parry for proof reading and providing feedback.

[1] The speed of light is 3x10^8m/s and the Earth's circumference is 4x10^7m; dividing circumference by speed gives 133,000µs, which is the longest round trip time from anywhere on the surface of the Earth to the point furthest from it and back. So, assuming a whole megabyte can be transferred in one round trip and allowing for some overhead, the lower bound estimate of 150,000µs looks reasonable.

[2] Mathematically speaking this hash would allow us to say that, with the current estimated 1x10^11 URLs on the internet, the likelihood of a collision is still vanishingly small (less than 1x10^-15).