WpW: Crawl Your Site; Make it Fly!
Welcome to another installment of WordPress Wednesday!
Today we’re going to talk about the concept of “crawling” and why it has the potential to make your WordPress site even faster.
What does a crawler do?
We recently released an update to our LiteSpeed Cache for WordPress plugin. Among other things, the new version includes crawler functionality. What does this mean?
LSCache’s crawler travels through the backend, refreshing pages that have expired in the cache. The purpose is to keep the cache as fresh as possible while minimizing visitor exposure to uncached content.
NOTE: Crawler functionality is enabled or disabled at the server level, and its availability is controlled by your hosting provider. We’ll talk about that more below, or you can see our wiki page on how to enable the crawler.
Why would we want to do this?
Well, first let’s look at how pages are cached without a crawler. The whole process is initiated by user request. The cache is empty until non-logged-in users start sending requests to the backend. The first time a page is visited, the request hits the backend, WordPress’s PHP code is invoked to generate the page, the page is served to the user, and then stored in cache for next time.
That’s a fairly time-consuming (and bandwidth-consuming!) process for the server.
Now, let’s look at what happens when the cache is built by a crawler. When the crawler requests a page, the request hits the backend and WordPress’s PHP code is invoked to generate the page. But because a special header tells LSWS that a crawler initiated the request, the full page doesn’t need to be served. It is simply stored in cache.
This saves significant bandwidth.
Additionally, with the crawler refreshing expired pages at regular intervals, the chance that a user will encounter an uncached page is significantly diminished. This makes for a faster site.
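If it helps to see the two paths side by side, here is a toy model in Python. It is only a sketch of the idea, not LiteSpeed's implementation: the `cache` dictionary, `render()`, and the `from_crawler` flag are all made-up stand-ins (the flag plays the role of the special header).

```python
# Toy model of the two cache-fill paths described above.
# Every name here is hypothetical; this is not LiteSpeed's code.

cache = {}          # url -> rendered page
bytes_served = 0    # bytes sent back over the network


def render(url):
    """Stand-in for WordPress's PHP code generating a page."""
    return f"<html>content of {url}</html>"


def handle_request(url, from_crawler=False):
    """Use the cache when possible; otherwise render and store the page."""
    global bytes_served
    if url in cache:
        page = cache[url]
        status = "hit"
    else:
        page = render(url)
        cache[url] = page
        status = "miss"
    if from_crawler:
        # Crawler request: the page is stored, but no body is sent back.
        return status, ""
    bytes_served += len(page)
    return status, page


# The crawler warms the cache first...
handle_request("/about", from_crawler=True)
# ...so the first real visitor is served straight from cache.
status, body = handle_request("/about")
```

In this model the only bytes that ever go over the wire are the ones sent to the real visitor, and that visitor never waits for the page to be generated.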
Let’s look at some settings
You want the crawler to be effective in its mission, but not at the expense of your system’s performance. It’s helpful to know what some of the settings mean, so you can control just how many resources you want to give to the crawling process.
Navigate to WP-Admin > LiteSpeed Cache > Settings > Crawler and take a look at the following settings:
Delay
The crawler sends requests to the backend, one page after another, as it traverses your site. This can put a heavy load on your server if there is no pause between requests. Set the Delay to tell LSCache how long to wait before sending the next request to the server. The default is 10,000 microseconds (or 0.01 seconds). You can increase this amount to lessen the load on the server, but be aware that doing so will make the entire crawling process take longer.
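To get a feel for the trade-off, here is a rough back-of-the-envelope calculation in Python. It assumes a single thread and one pause per request, and it counts only the time spent pausing, not the time the server needs to generate each page. The function name and the page counts are invented for illustration.

```python
def delay_overhead_seconds(page_count, delay_us=10_000):
    """Time spent just waiting between requests, in seconds.

    Assumes one pause per request on a single thread. Page generation
    time comes on top of this.
    """
    return page_count * delay_us / 1_000_000


print(delay_overhead_seconds(5000))            # 50.0
print(delay_overhead_seconds(5000, 100_000))   # 500.0
```

So on a hypothetical 5,000-page site, raising the Delay tenfold adds several minutes of pure waiting to a full crawl.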
Run Duration
In order to keep your server from getting bogged down with behind-the-scenes crawling, you can put limits on the crawling duration. For example, if we set Run Duration to 60 seconds, then the crawler will run for 1 minute before taking a break. After the break is over, the crawler will start back up exactly where it left off and run for another 60 seconds. This will continue until the entire site has been crawled.
Interval Between Runs
This setting determines the length of the break mentioned above. In that same example, if we set Interval Between Runs to 600 seconds, then the crawler would pause for 10 minutes after every 1-minute run.
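Run Duration and Interval Between Runs together determine how long a full crawl takes on the clock. The sketch below works through that arithmetic in Python. It assumes the pause starts when a run ends and that no pause is needed after the final run. The function name and the 300 seconds of total crawling work are hypothetical.

```python
import math


def crawl_schedule(work_seconds, run_duration, interval_between_runs):
    """Return (number_of_runs, total_wall_clock_seconds) for one full crawl."""
    runs = math.ceil(work_seconds / run_duration)
    breaks = max(runs - 1, 0)   # no break after the final run
    wall_clock = work_seconds + breaks * interval_between_runs
    return runs, wall_clock


# 5 minutes of actual crawling, 60-second runs, 600-second breaks
print(crawl_schedule(300, 60, 600))   # (5, 2700)
```

In other words, with the example values above, 5 minutes of crawling is spread over 45 minutes of wall-clock time.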
Crawl Interval
How often do we want to re-initiate the crawling process? This depends on how long it takes to crawl your site. The best way to figure this out is to run the crawler a couple of times and keep track of the elapsed time. Once you’ve got that amount, set the interval to slightly more than that. For example, if your crawler routinely takes 4 hours to complete a run, you could set the interval to 5 hours (or
Threads
When Threads is set to 3, there are 3 separate crawling processes happening concurrently. The higher the number, the faster your site is crawled, but also the more load that is put on your server.
Server Load Limit
This setting keeps the crawler from monopolizing system resources. Once the server load reaches this limit, the crawler is terminated rather than being allowed to compromise server performance. This setting is based on Linux server load. (A completely idle computer has a load average of 0. Each running process either using or waiting for CPU resources adds 1 to the load average.)
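If you are curious what your own server's load looks like, Python can read the same kind of load average on Unix-like systems. The check below is only an illustration: the function name is made up, and comparing the 1-minute average with "at or above the limit" is an assumption, not a description of how the plugin reads the load.

```python
import os


def crawler_may_run(load_limit, current_load=None):
    """Return True while the load is below the limit.

    If no load is passed in, read the 1-minute load average
    (os.getloadavg() is available on Unix-like systems only).
    """
    if current_load is None:
        current_load = os.getloadavg()[0]
    return current_load < load_limit


print(crawler_may_run(5, current_load=2.0))   # True
print(crawler_may_run(5, current_load=5.0))   # False
```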
What to Include
The rest of these settings control which taxonomies are crawled and which are not. By default, all of the traditional types (Posts, Pages, Categories, and Tags) are crawled, as are any custom taxonomies (for example, Products, Product Categories, and Product Tags, if you’re using WooCommerce). If you want to exclude the custom types, they must be specifically added to the Exclude Custom Post Types textbox.
Protecting your server from overload
All of the settings above were designed to work together to protect your server from overload. The first four parameters (Delay, Run Duration, Interval Between Runs, and Crawl Interval) control how often and for how long the crawler is allowed to run. You can set these values to give your crawler as many or as few resources as your system can afford.
Threads and Server Load Limit are two settings that work together to automatically terminate the crawler if it tries to get too greedy with system resources. Here’s an example:
Let’s say we have the following values set before the crawler starts:
Server Load Limit =
The current server load is at 2, and the crawler begins. It’s crawling 4 URLs at a time (due to the Threads setting), but this has caused the server load to jump above our limit of 5. In response, the crawler drops the number of threads to 3 and keeps going.
If the server load hits our limit of 5 again, the crawler will drop the number of threads to 2 and go on. This process is repeated until we are down to a single thread.
If the server load is still too high with only one crawler thread, the crawling process will be terminated.
On the other hand, if the server is doing just fine with one thread, the crawler will increase the number of threads one at a time, until it has either reached the Thread Limit we set (4 in this case) or our Server Load Limit.
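The back-and-forth described above can be sketched in a few lines of Python. This is our simplified reading of the behavior, not the plugin's actual code: the function names are invented, the load readings are made up, and treating "at or above the limit" as too high is an assumption.

```python
def next_thread_count(current, load, load_limit, max_threads):
    """Return the thread count for the next step; 0 means terminate."""
    if load >= load_limit:
        # Too busy: drop one thread. Dropping from 1 to 0 ends the crawl.
        return current - 1
    # Server is coping: add a thread, but never exceed the Threads setting.
    return min(current + 1, max_threads)


def simulate(loads, load_limit=5, max_threads=4):
    """Replay a list of load readings and record the thread count each step."""
    threads = max_threads
    history = [threads]
    for load in loads:
        threads = next_thread_count(threads, load, load_limit, max_threads)
        history.append(threads)
        if threads == 0:
            break   # the crawler has been terminated
    return history


print(simulate([6, 5, 5, 3, 3]))   # [4, 3, 2, 1, 2, 3]
print(simulate([6, 6, 6, 6]))      # [4, 3, 2, 1, 0]
```

The first trace shows the crawler backing off to a single thread and then ramping up again once the load drops. The second shows it giving up entirely when the load never comes down.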
Watching the crawler do its thing
If you’re the kind of person who likes to sit in front of a terminal and watch a process making progress (and, really, who isn’t?) you’ll enjoy the Show crawler status button.
Navigate to WP-Admin > LiteSpeed Cache > Crawler and press the button. If the crawler isn’t actually running right now, make sure Activation is set to ENABLE and press the Manually run button. Then you can watch it go!
Your output is bound to be more fun to watch than ours above, since we only have 2 pages in this particular test blog. Our crawler was finished before we even opened up the status window! (Don’t worry, we tested this update extensively on several sites with many more pages than 2! 😉 )
Impacts on Shared Hosting
Site owners love the crawler functionality, but if you are a shared hosting provider, chances are your first thought goes to the impact all of this crawling could have on your servers. Despite the fact that we’ve provided several features designed to minimize a crawler’s impact on the server, it’s a valid concern when you are hosting thousands of installations. This is why we have ultimately put crawler control into the hands of the hosts.
The crawler is disabled by default. This will keep usage restricted to those who truly want to use it. As the host, it is your choice whether to enable crawler functionality at all. You can keep it permanently disabled on a server-wide basis if you do not want crawling on your system.
When crawling is disabled, site administrators will see the following message in WP-Admin > LiteSpeed Cache > Crawler:
Warning: The crawler feature is not enabled on the LiteSpeed server. Please consult your server admin.
How to Enable the Crawler
NOTE: it is not recommended to turn on the crawler for shared hosting setups unless the server has enough capacity to handle it!
As of LSWS v5.1.16*, there are a few different approaches you can take to crawling on your server:
- You can disable it for the entire server
- You can enable it for the entire server
- You can selectively enable it for particular clients, while leaving it disabled for everyone else
To disable crawling for the entire server, do nothing. It is disabled by default.
To enable crawling for the entire server, you need to update the appropriate configuration file, like you did when you originally set up your cache root. Add the following:
<IfModule LiteSpeed>
CacheEngine on crawler
</IfModule>
To selectively enable crawling for particular clients, do not update the server’s config file. Instead, locate (or create) the virtual host include files for those clients and add the above lines to them.
The exact location of the relevant configuration or include file varies, depending on the control panel you use (or if you use no control panel at all), and which of the above options you are looking to enact. For detailed instructions, please see our wiki page on the subject.
If you don’t have access to the appropriate files, you will need to ask your server administrator to enable the crawler for you.
*If you are on v5.1.16 and having difficulty getting this to work, please force reinstall to the latest build.
LiteSpeed Cache for WordPress’s new crawler functionality is pretty handy, don’t you think? By regularly renewing the cache, you minimize the number of times your users have to wait for dynamic pages to be processed, and you speed up your whole site!
After you’ve updated the plugin (or installed it for the first time!), come back here and let us know what you think. We’d love to get your feedback.
Have some of your own ideas for future WordPress Wednesday topics? Leave us a comment!
Don’t forget to meet us back here next week for the next installment. In the meantime, here are a few other things you can do:
- Subscribe to the WordPress Wednesday RSS feed
- Download LiteSpeed Cache for WordPress plugin
- Learn more about the plugin on our website