
Stage 1: the internal database and printing the results

Next we decided to work on the internal database: how to keep data on selected logfile records such that it is easy to produce the required output (this information will be stored in a Stats object).

It appears that all possible criteria (date, path and domain) are ``hierarchical'', in that each consists of a sequence of items where the first item represents the first-level category, the second item represents a subcategory within the category of the first item, and so on.

E.g. a logrecord from Jun 3, 1999 10:30 can be represented by the sequence <1999,6,3,10>. For domains, an example would be <be,ac,vub,tinf2>, corresponding to ``tinf2.vub.ac.be''. For paths, <a,b,c> corresponds to ``/a/b/c''.

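An obvious choice to represent e.g. date categories would then be a vector (or list, but vector seems more appropriate since we do not intend to insert in the sequences). A nice side effect of this decision is that the same type can represent all levels: a short vector such as <1999,6> stands for a coarse category (here a month), while a longer one such as <1999,6,3,10> stands for a finer one (an hour on a given day).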

A simple way to keep counts is to have a mapping from such category vectors to integers (the latter being the number of requests corresponding to the category):
      typedef   map<vector<int>,int>   DateCount;   // declared inside class Stats
      Stats::DateCount                 datecount;
Now suppose we get a LogRecord from a certain Date. To update datecount, all we need to do is create a category vector<int> v for each of the categories that we are interested in and then do
        ++datecount[v];
A nice feature of the Stats::DateCount type is that it is sorted according to the comparison operator ``<'' of the key type vector<int>. Conveniently, this comparison operator works lexicographically (as in a dictionary). This means that a shorter vector that is a prefix of a longer one comes before it. E.g. the map will keep the vectors sorted as in the example below (where we assume that we are only interested in the first 2 levels: year and month).
        <1999>
        <1999,2>
        <1999,3>
        <1999,4>
        ...
        <2000>
        <2000,1>
        ...
In other words, printing datecount is trivial:
  for (DateCount::iterator i=datecount.begin(); i!=datecount.end(); ++i) {
    const vector<int>& v((*i).first);  
    // print v nicely, e.g. 1999-6-3 10:00 for <1999,6,3,10>
    ..
    cout << "\t" << (*i).second << "\n"; //actual count for this date category
    }
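The counting and printing steps above can be combined into a small self-contained sketch. The helper names count_request and format_counts are invented for this illustration and are not part of the Stats class. The sketch also assumes that every level from 1 up to max_level is of interest, and it returns the report as a string instead of writing to cout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::map<std::vector<int>, int> DateCount;

// Hypothetical helper: count one request under every prefix of `full`
// of length 1..max_level (capped at full.size()).
void count_request(DateCount& datecount, const std::vector<int>& full,
                   std::size_t max_level) {
  assert(!full.empty());  // a category needs at least one item (the year)
  for (std::size_t l = 1; l <= max_level && l <= full.size(); ++l) {
    std::vector<int> v(full.begin(), full.begin() + l);
    ++datecount[v];
  }
}

// Hypothetical helper: one line per category, in the map's own
// (lexicographic) order, formatted as "1999-6<TAB>count".
std::string format_counts(const DateCount& datecount) {
  std::ostringstream out;
  for (DateCount::const_iterator i = datecount.begin();
       i != datecount.end(); ++i) {
    const std::vector<int>& v = i->first;
    for (std::size_t k = 0; k < v.size(); ++k) {
      if (k > 0) out << '-';
      out << v[k];
    }
    out << '\t' << i->second << '\n';
  }
  return out.str();
}
```

Because the map iterates in key order, no explicit sorting step is needed: a year line always precedes the lines for its months.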
Now we come back to the problem of updating datecount. Although we have not designed this part yet, we will want to use (at least in this version) the supplied Date class, since we hope this saves us from writing a parsing function for something like
        16/Aug/1999:16:28:39
(We will later discover that Date::Date() cannot parse this, so we will have to mess with the string to be parsed, and, with hindsight, it would perhaps have been better to make our own simple Date class.)
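For comparison, a minimal sketch of what such a home-made parser could look like is given here. SimpleDate and parse_log_date are hypothetical names chosen for this illustration; they are not the supplied Date class and not necessarily what the project ended up with. The sketch assumes English three-letter month names and ignores anything that follows the seconds.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical minimal date holder; this is not the supplied Date class.
struct SimpleDate {
  int year, month, day, hours, minutes, seconds;
};

// Parse a timestamp such as "16/Aug/1999:16:28:39".
// Returns false, leaving `out` untouched, if the text does not match.
bool parse_log_date(const std::string& text, SimpleDate& out) {
  static const char* const kMonths[12] = {"Jan", "Feb", "Mar", "Apr",
                                          "May", "Jun", "Jul", "Aug",
                                          "Sep", "Oct", "Nov", "Dec"};
  char mon[4] = {0};  // three letters plus the terminating NUL
  int day, year, hours, minutes, seconds;
  if (std::sscanf(text.c_str(), "%2d/%3[A-Za-z]/%4d:%2d:%2d:%2d",
                  &day, mon, &year, &hours, &minutes, &seconds) != 6)
    return false;

  int month = 0;  // 0 means "month name not recognised"
  for (int k = 0; k < 12; ++k)
    if (std::strcmp(mon, kMonths[k]) == 0) month = k + 1;

  if (month == 0 || day < 1 || day > 31 || hours < 0 || hours > 23 ||
      minutes < 0 || minutes > 59 || seconds < 0 || seconds > 60)
    return false;

  SimpleDate d = {year, month, day, hours, minutes, seconds};
  out = d;
  return true;
}
```

The range checks are deliberately loose (for example, they accept day 31 in any month); a real Date class would need stricter validation.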

So we can assume that we have a Date ``d'' in hand. Now we need to create a category vector<int> from it for each of the category levels of interest.

But how do we know what these levels are? From the configuration file. Thus we will assume that Configuration has function members like these:

      unsigned int Configuration::max_level() const;
      set<unsigned int> Configuration::levels() const;
Now we can write code to do the job: it assumes that the Configuration is available in a variable ``conf'' and the date in a variable ``d''.
      set<unsigned int> levels(conf.levels());
      for (unsigned int l=1; l<=conf.max_level(); ++l)
        if (levels.count(l)) { // we are interested in this level
          vector<int> dv(l); // a vector with l integers
          switch (l) { // no breaks: each case falls through to the next
            case 4: dv[3] = d.hours();
            case 3: dv[2] = d.day();
            case 2: dv[1] = d.month();
            case 1: dv[0] = d.year();
            }
          ++datecount[dv];
          }
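Note the use of set<unsigned int>::count(), which is a simple way to check whether an element belongs to a set (the lookup takes time logarithmic in the size of the set). Note also that there are no ``break'' statements separating the cases. That is convenient here: starting at case l, execution falls through all lower cases, so exactly the first l components of the vector get filled in.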
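To try the fall-through idea in isolation, the loop can be wrapped in a small self-contained sketch. FakeDate is a hypothetical stand-in that offers only the four accessors used above (the real Date class is not reproduced here), and date_category and update are names invented for this illustration. The sketch assumes levels 1 to 4 only.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the supplied Date class.
struct FakeDate {
  int year_, month_, day_, hours_;
  int year() const { return year_; }
  int month() const { return month_; }
  int day() const { return day_; }
  int hours() const { return hours_; }
};

typedef std::map<std::vector<int>, int> DateCount;

// Build the category vector of the given level (1 = year .. 4 = hour).
std::vector<int> date_category(const FakeDate& d, unsigned int level) {
  assert(level >= 1 && level <= 4);
  std::vector<int> dv(level);
  switch (level) {
    case 4: dv[3] = d.hours();  // fall through
    case 3: dv[2] = d.day();    // fall through
    case 2: dv[1] = d.month();  // fall through
    case 1: dv[0] = d.year();
  }
  return dv;
}

// Count the date once for every level of interest up to max_level.
void update(DateCount& datecount, const FakeDate& d,
            const std::set<unsigned int>& levels, unsigned int max_level) {
  for (unsigned int l = 1; l <= max_level; ++l)
    if (levels.count(l)) ++datecount[date_category(d, l)];
}
```

With levels {1,3} and a date of Jun 3, 1999 10:30, update() increments the counts for <1999> and <1999,6,3> and leaves <1999,6> alone.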

That's it. The solution for domains and paths is similar, except that we use vector<string> instead of vector<int>. For domains, there is the small added complication that the vector must be ``in reverse order'', i.e. tinf2.vub.ac.be must result in a (level 4) vector <be,ac,vub,tinf2>. Since we want to test all this before continuing, we also add dummy code that returns the Configuration::CRITERIUM for the format, as shown below.

      class Configuration {
      public:
        enum CRITERIUM { DATE, PATH, DOMAIN };
        Configuration();
        ~Configuration() {}
      
        CRITERIUM                format() const { return format_; }
        unsigned int             max_level() const { return max_level_; }
        const set<unsigned int>& levels() const { return levels_; }
      
        bool parse(istream&);
      private:
        CRITERIUM          format_;
        set<unsigned int>  levels_;
        unsigned int       max_level_;
      };
In order to test, we also need code that actually retrieves a Date from a line of the log file, and similarly for path and domain. For the date we use the supplied Date class; path and domain are parsed directly into vector<string>s.
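For domains, the reversed vector mentioned earlier could be produced along the following lines. This is only a sketch: domain_category is a name invented here, and the actual code in LogRecord::parse_domain may well differ.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Split `host` at every '.' and reverse the pieces, so that the
// top-level domain comes first: "tinf2.vub.ac.be" -> <be,ac,vub,tinf2>.
std::vector<std::string> domain_category(const std::string& host) {
  std::vector<std::string> parts;
  std::string::size_type start = 0;
  for (;;) {
    std::string::size_type dot = host.find('.', start);
    if (dot == std::string::npos) {
      parts.push_back(host.substr(start));  // last (or only) component
      break;
    }
    parts.push_back(host.substr(start, dot - start));
    start = dot + 1;
  }
  std::reverse(parts.begin(), parts.end());
  return parts;
}
```

A level-n category is then simply the first n items of the result, exactly as for dates.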

Note the trick used to represent paths: a path "/a/b/c" is represented by a vector<string> with 4 components: <"","a","b","c">. The first, empty string represents the ``root'' directory. Thus, a path "/" will be represented by <"">.

Smaller issues that must be solved for paths include the removal of a trailing ``/'', so that e.g. "/a/b/c/" is understood as "/a/b/c", and the decoding of things like "%7E" as '~', for which we borrow a www_decode(string&) function from somewhere.
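The path handling described above can be sketched as follows. The borrowed www_decode function is not shown in this text, so percent_decode below is a hypothetical stand-in for it, and path_category is likewise a name invented for this illustration. The sketch decodes each component after splitting, which is a design choice made here so that an encoded ``%2F'' does not introduce an extra level.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for www_decode(): turns each "%XX" (two hex
// digits) into the corresponding character and copies everything else.
std::string percent_decode(const std::string& s) {
  std::string out;
  for (std::string::size_type i = 0; i < s.size(); ++i) {
    if (s[i] == '%' && i + 2 < s.size() &&
        std::isxdigit(static_cast<unsigned char>(s[i + 1])) &&
        std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
      out += static_cast<char>(
          std::strtol(s.substr(i + 1, 2).c_str(), 0, 16));
      i += 2;  // skip the two hex digits just consumed
    } else {
      out += s[i];
    }
  }
  return out;
}

// "/a/b/c" and "/a/b/c/" -> <"","a","b","c">;  "/" -> <"">.
std::vector<std::string> path_category(const std::string& raw) {
  std::string p(raw);
  // Drop trailing slashes; the root "/" becomes the empty string.
  while (!p.empty() && p[p.size() - 1] == '/') p.erase(p.size() - 1);

  std::vector<std::string> parts;
  std::string::size_type start = 0;
  for (;;) {
    std::string::size_type slash = p.find('/', start);
    if (slash == std::string::npos) {
      parts.push_back(percent_decode(p.substr(start)));
      break;
    }
    parts.push_back(percent_decode(p.substr(start, slash - start)));
    start = slash + 1;
  }
  return parts;
}
```

Stripping the trailing slashes first makes the root case fall out naturally: the empty string splits into the single component "", which is the <""> representation of ``/''.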

The details are in the implementation of LogRecord which now looks like this.

      class LogRecord {
      public:
        LogRecord() {}
      
        bool parse_line(const string& line);
      
        const Date& date() const { return date_; }
        const vector<string>& path() const { return path_; }
        const vector<string>& domain() const { return domain_; }
      
      private:
        bool parse_date(const string& line);
        bool parse_path(const string& line);
        bool parse_domain(const string& line);
        Date    date_;
        vector<string> path_;
        vector<string> domain_;
      };

The present state of the system can be found in the directory stage01.


httpstats-stage01 [ 7 April, 2001]