Introduction to Varnish caching

Over the last week I’ve made my first attempt at setting up a Varnish environment, something I’ve wanted to try for a long time, and I thought it would be worth sharing what I’ve learnt along the way. I was quite surprised at how simple it was to get going, and at the performance it provided. This is just a brief introduction that will probably be expanded on in future Varnish-related posts.

What is Varnish caching?

First off, for those that do not know, Varnish is a web application accelerator that sits in front of any web server, such as Apache, Nginx or IIS, caches its content and serves it up much faster. Because cached responses are served without hitting the backend, far less strain is placed on the web and database servers, providing a more responsive and stable server environment with higher capacity. Put simply, Varnish can speed up a website quite significantly.

Rather than executing PHP scripts, running MySQL queries or performing other work on the server for every page view received, you can have Varnish hold the content cached and ready to serve out over the Internet.

Varnish runs on various flavours of Linux and does not require its own server. You could easily set up Varnish on the same server the web server runs on; however, I was keen to set up a test environment that can be expanded in the future if needed by adding either more Varnish cache servers or more web servers.

The test environment

For the test environment I created three virtual machines running Debian 6. One has Varnish installed and acts as the front-end server, handling the external requests coming in from the Internet on port 80, while the other two run Apache and act as traditional web servers. Installing Varnish was as easy as an “apt-get install varnish”; you can also add the Varnish project’s own repository in order to stay more up to date.

So I have three virtual machines, each assigned 1 GB of RAM, 1 CPU core and about 40 GB of disk, all running on my local network.

192.168.1.1 – Varnish
192.168.1.2 – Apache1
192.168.1.3 – Apache2

With these servers the original goal was to have the Varnish server cache content from both web servers, which serve the same content, in a round-robin style. This means Varnish would cache the content from Apache1, then once the cache expires Apache2 would be cached next, and then back to Apache1 after that. When Varnish returns a result from cache without going to the backend servers it is known as a “hit”, while a request that is not in the cache and is instead passed on to the backend servers is called a “miss”.

Configuring Varnish

After first installing Varnish, I edited the /etc/default/varnish file, which contains the basic configuration. In this file I only defined that the Varnish server will listen on port 80 for incoming requests and use 256 MB of in-memory cache, under DAEMON_OPTS like this:

DAEMON_OPTS="-a :80 \
             -T :6082 \
             -f /etc/varnish/default.vcl \
             -S /etc/varnish/secret \
             -s malloc,256m"

You’ll notice that the VCL file is specified as /etc/varnish/default.vcl by default; this is the file all of the Varnish configuration goes in. VCL stands for Varnish Configuration Language and it is used to define how Varnish handles requests. The VCL code is translated into C and is then compiled.

When you first open the file it will contain some default VCL code that is commented out. It is recommended to leave this there as it is used as a default if your VCL has any problems.

My VCL file

Now I’ll go through the contents of my /etc/varnish/default.vcl and what they do.

Initially I defined the backend Apache1 and Apache2 servers; these are what Varnish will communicate with and cache from:

backend apache1 {
  .host = "192.168.1.2";
  .port = "80";
}
backend apache2 {
  .host = "192.168.1.3";
  .port = "80";
}

This is pretty straightforward: it makes both of my Apache servers known to Varnish and tells it that it can connect to them on port 80, which is what I have Apache listening on by default. You can easily change the port Apache listens on over on the web servers, and then change the port in the VCL file, if you want the communication between Varnish and the web servers to happen on some other port.
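For example, if Apache were reconfigured to listen on port 8080 (a hypothetical choice, not part of my setup), the backend definition would simply change to match:

backend apache1 {
  .host = "192.168.1.2";
  .port = "8080"; # must match the Listen port in that server's Apache configuration
}

The port here only affects traffic between Varnish and the backend; clients still reach Varnish itself on port 80 as set in DAEMON_OPTS.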

Next, both of these backend servers are grouped together into one object known as a director, which I’ve called “apachegroup”. The director is also set to round-robin between the two backend Apache servers.

director apachegroup round-robin {
  {
    .backend = apache1;
  }
  {
    .backend = apache2;
  }
}

Other useful subroutines in the VCL file include vcl_recv, which runs on incoming requests and can be used to decide what to do with them, and vcl_fetch, which runs once a response has come back from the backend.

For example, the below VCL inspects a request, and if the URL contains /test/ it will only ever send the request to apache1; otherwise the director group is used.

sub vcl_recv {
  if (req.url ~ "/test/") {
    set req.backend = apache1;
  } else {
    set req.backend = apachegroup;
  }
}

An example of vcl_fetch is shown in the below VCL: if we want to cache a PHP script for a specific amount of time, we can set that here. Here “long.php” is cached for 300 seconds, while “short.php” is cached for 150 seconds.

sub vcl_fetch {
  if (req.url ~ "/long.php") {
    set beresp.ttl = 300s;
  } elsif (req.url ~ "/short.php") {
    set beresp.ttl = 150s;
  }
}

For cases you don’t specify, the commented-out VCL at the bottom of the .vcl file will be used, so if the file name is neither long.php nor short.php it will default to being cached for 120 seconds.
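If you would rather not rely on the commented-out defaults, the fallback TTL can be made explicit in the same vcl_fetch block. This is just a sketch using the same Varnish 3 syntax as the rest of this post; the 120-second value mirrors Varnish’s own default:

sub vcl_fetch {
  if (req.url ~ "/long.php") {
    set beresp.ttl = 300s;
  } elsif (req.url ~ "/short.php") {
    set beresp.ttl = 150s;
  } else {
    set beresp.ttl = 120s; # explicit fallback matching the 120 second default
  }
}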

Those are the basics of this test, when visiting the Varnish server in a browser at http://192.168.1.1 content will now be loaded round-robin style from both Apache servers.

You can add some more VCL to test this: it will output information into the HTTP response headers, which you can view in a tool such as Firebug, a useful add-on for the Firefox browser. Printing into the headers is a good way to troubleshoot what is happening behind the scenes, as you can determine which backend server provided the content. This is done in vcl_deliver, which runs just before the response is sent out to the web browser client from the Varnish server.

sub vcl_deliver {
  if (obj.hits > 0) {
    set resp.http.X-Cache = "HIT";
    set resp.http.X-Cache-Hits = obj.hits;
  } else {
    set resp.http.X-Cache = "MISS";
  }
  set resp.http.X-Varnish-Backend = req.backend;
}

This basically says that if the object has been hit more than 0 times (i.e. the page is already cached), the string “HIT” will be printed along with the number of hits and the name of the backend server. If there are no hits then the content was not loaded from cache, and it is marked as “MISS”. You will also see the cached content’s “Age” in the headers, which increases over time. If you refresh the page until the cache expires (2 minutes by default), you will see X-Varnish-Backend change from “apache1” to “apache2” the next time the content is cached as the result of a miss.

Other functionality

Of course there is much more that you can define and get Varnish to do. You can set how long you want the cache to be considered fresh before Varnish will request a new copy from the web servers, and you can select particular pages and file types to either be cached or to be ignored and retrieved straight from the real server.
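As a sketch of how that page and file-type selection might look, assuming the same Varnish 3 syntax, a vcl_recv block can pass certain requests straight through while sending others to the cache lookup. The /admin/ path and the file extensions here are just illustrative examples, not part of my test setup:

sub vcl_recv {
  if (req.url ~ "^/admin/") {
    return (pass);   # never cache this area, always go to the backend
  }
  if (req.url ~ "\.(png|jpg|css|js)$") {
    return (lookup); # always try the cache for static files
  }
}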

Something interesting I came across is that you can set a weight on the backend servers and have one accessed more than the other. This is useful when one backend server has fewer resources than another: you can tell Varnish to access the more powerful server twice for every access of the weaker one.

To set weighting on the backend servers, I had to change the director from round-robin to random.

director apachegroup1 random {
  {
    .backend = apache1;
    .weight = 1;
  }
  {
    .backend = apache2;
    .weight = 2;
  }
}

Measuring success

Varnish comes with plenty of built-in tools that can be used to measure its performance and see exactly what is happening behind the scenes, allowing you to make changes in order to get the most out of it.

Here are some of the useful commands you can run:

varnishlog: Shows the Varnish logs. There are a lot of them, and Varnish does not log to disk by default because of the sheer volume. You can view the logs in real time, and you will quickly see why they aren’t written to disk – while the command is running, refresh your cached page.

varnishstat: Displays statistics and information on a running Varnish instance.

varnishhist: Displays a live histogram of hits vs misses.

varnishtop: Shows a continuously updated list of the most common log entries.

Summary

Varnish is easy to install and get going, and with the right configuration it can be used to speed up web sites and applications significantly. I plan on benchmarking some test pages and comparing the results against various web servers so that I can see the difference in performance that Varnish makes.
