Development Diary of ActiveKB NX

 Before I even started developing ActiveKB NX, I had to decide which new features were going to make it into the new release. I got ideas for new features from lots of places including the guys in the office, our customers, other products and just about anywhere else I could get inspiration.

From the list that I came up with, I needed to work out the order in which to tackle the features. One method you can use is to work out which features are going to give the most benefit from the least amount of work. These are generally the features that are worth implementing first.

Another strategy for working out where to begin is to organize the features into a tree like structure where a feature which requires another feature to be completed is put under the parent feature. Using this method, the parent features are the most important.

The method that I used with ActiveKB NX was a combination of the two. I created a tree of all the features, and then if I could find anything that would logically be grouped together or a few features that could proceed each other, I could group them and then work out which features I thought were most important. After that, I worked out which were the easiest to do and used both importance and ease of implementation to work out the order in which I would implement new features.

As I started development, new things were added to the list, however, at some point I had to draw the line and accept the fact that I can’t put all of the new features into the next release otherwise it would never get done.

We support our products by having the developer of a product do the support for that product also. This allows us to get a better idea of what problems our customers are having with our products, or what features would make their life easier when using our products. It’s because of this that I decided adding support for web servers other than Apache was an important decision for ActiveKB NX.One of the biggest features in ActiveKB NX is the fact that it can scale up to hold a much larger amount of data than previous versions. The inspiration for adding enormous scalability options came when a customer asked if ActiveKB could work with 40,000 questions and 1,000 categories. I had to reply that I didn’t know, however it had piqued my curiosity enough to wonder if I could get it going with that many questions. If I hadn’t been doing the support myself I may never have heard about this and it may have been much longer before this issue came up.

The first thing I had to do was to work out a way of testing a system with that many questions. I didn’t want to have to use someones real data because that would only tell me what was currently possible so I needed a way to generate data. Lots of Data. Fast.

I decided that my best bet was to build an API for the create/update/delete functions of most of the major tables in the database. Doing this would provide me with something I could use for a number of features, so by spending the time now to implement something in a standard way, I could get benefits from it in the future using several different points of view (revision control, data validation, etc). Also, I had done something like this before, so I decided it was going to be best to spend the time to do this now rather than to do it later and have to go back and rewrite some of the new features to use the API.

Once the API was built and was tested to the point where I was happy with it, I wrote a short script which created lots and lots of questions. This let me find some of the major bottlenecks in the system quite easily. Unfortunately it also exposed a major weakness in my test script. If I inserted the exact same data into a table containing a full text index, I would effectively lose the full text index since more than 50% (read: all) of the rows contained the same data.

To combat this, I had to go back and rewrite part of the script to generate data that was much more random. Since generating the data and inserting it into the database can take a while, I only wanted to have to do this once, so I had it generate the categories randomly also and assign a random number of questions to each category.

I imported about 50,000 questions and 1,000 categories using my test script. Now it was time for the test. Would ActiveKB work as is, with that many questions? I thought I’d be lucky to see anything at all appear in my browser.

After trying to load a page and getting nowhere, I fell back to the server logs. From experience I know that the most likely bottleneck was going to be the database so I checked the database log and saw the problem right away.

ActiveKB was doing lots of unnecessary database queries. Things like getting a list of questions to show and then fetching the data for those questions one at a time. Things like this worked OK with a small number of questions but were failing with a larger number of questions, because instead of doing 30 queries, ActiveKB was doing 300. 

Now that I was pretty sure I knew where the problem was, it was time to fix it. But how would I know when it was fixed? What if a page had a few different areas that needed work? How would I know if my changes made things worse or better?

I needed a way to work out how long a page took to generate, so I added a timer at the bottom of each page. The code for adding the timer is really straightforward and is based on the example code at http://www.php.net/manual/en/function.microtime.php. The only real difference is the formatting of the code. Also, my code shows memory usage details so that I can get an idea of how much memory I’m using per page.

Once I knew what I wanted to fix and how I was going to be able to tell if it was fixed or not, the process of going through the code and trying to find the areas I could speedup began.

The MySQL log is the best place to start, since its easy to monitor the log (e.g. tail -f mysql.log) and then load up a page and see which queries have been executed. How did I know which queries should run? I wanted only the queries needed to show the data on a page to run. Nothing more, nothing less.

By far the most common source of unnecessary queries occurs in a piece of code that looks like this:

function GetDescription($id)
{
    $query = “SELCT description FROM table WHERE questionid = “.((int) $id);
    $res = mysql_query($query);
    return mysql_result($res, 0, 0);
}

… and then to see it used in a loop like this:

while ($row = mysql_fetch_array($result)) {
    $description = GetDescription($row['id']);
}

The problem with this is that you’ll end up executing a database query for each row in the result, so if you have 1000 rows, that’s 999 more queries than are required, and in web applications, queries are expensive in the amount of time that they take to execute.
When I’m developing, the order in which I do things can make a big difference. Taking the time to make a rough outline at the start rather than just jumping in and coding can sometimes mean the difference between a relatively easy product update, and one that takes way longer than originally planned.

This entry was posted in Programming. Bookmark the permalink.

Leave a Reply