By Rodney Amato
One of the biggest features in ActiveKB NX is the fact that it can scale up to hold a much larger amount of data than previous versions. The inspiration for adding enormous scalability options came when a customer asked if ActiveKB could work with 40,000 questions and 1,000 categories. I had to reply that I didn't know, however it had piqued my curiosity enough to wonder if I could get it going with that many questions. If I hadn't been doing the support myself I may never have heard about this and it may have been much longer before this issue came up.
The first thing I had to do was to work out a way of testing a system with that many questions. I didn't want to have to use someones real data because that would only tell me what was currently possible so I needed a way to generate data. Lots of Data. Fast.
I decided that my best bet was to build an API for the create/update/delete functions of most of the major tables in the database. Doing this would provide me with something I could use for a number of features, so by spending the time now to implement something in a standard way, I could get benefits from it in the future using several different points of view (revision control, data validation, etc). Also, I had done something like this before, so I decided it was going to be best to spend the time to do this now rather than to do it later and have to go back and rewrite some of the new features to use the API.
Once the API was built and was tested to the point where I was happy with it, I wrote a short script which created lots and lots of questions. This let me find some of the major bottlenecks in the system quite easily. Unfortunately it also exposed a major weakness in my test script. If I inserted the exact same data into a table containing a full text index, I would effectively lose the full text index since more than 50% (read: all) of the rows contained the same data.
To combat this, I had to go back and rewrite part of the script to generate data that was much more random. Since generating the data and inserting it into the database can take a while, I only wanted to have to do this once, so I had it generate the categories randomly also and assign a random number of questions to each category.
I imported about 50,000 questions and 1,000 categories using my test script. Now it was time for the test. Would ActiveKB work as is, with that many questions? I thought I'd be lucky to see anything at all appear in my browser.
After trying to load a page and getting nowhere, I fell back to the server logs. From experience I know that the most likely bottleneck was going to be the database so I checked the database log and saw the problem right away.
ActiveKB was doing lots of unnecessary database queries. Things like getting a list of questions to show and then fetching the data for those questions one at a time. Things like this worked OK with a small number of questions but were failing with a larger number of questions, because instead of doing 30 queries, ActiveKB was doing 300.
Now that I was pretty sure I knew where the problem was, it was time to fix it. But how would I know when it was fixed? What if a page had a few different areas that needed work? How would I know if my changes made things worse or better?
I needed a way to work out how long a page took to generate, so I added a timer at the bottom of each page. The code for adding the timer is really straightforward and is based on the example code at http://www.php.net/manual/en/function.microtime.php. The only real difference is the formatting of the code. Also, my code shows memory usage details so that I can get an idea of how much memory I'm using per page.
Once I knew what I wanted to fix and how I was going to be able to tell if it was fixed or not, the process of going through the code and trying to find the areas I could speedup began.
The MySQL log is the best place to start, since its easy to monitor the log (e.g. tail -f mysql.log) and then load up a page and see which queries have been executed. How did I know which queries should run? I wanted only the queries needed to show the data on a page to run. Nothing more, nothing less.
By far the most common source of unnecessary queries occurs in a piece of code that looks like this:
function GetDescription($id)
{
$query = “SELCT description FROM table WHERE questionid = “.((int) $id);
$res = mysql_query($query);
return mysql_result($res, 0, 0);
}
... and then to see it used in a loop like this:
while ($row = mysql_fetch_array($result)) {
$description = GetDescription($row['id']);
}
The problem with this is that you'll end up executing a database query for each row in the result, so if you have 1000 rows, that's 999 more queries than are required, and in web applications, queries are expensive in the amount of time that they take to execute.