Next we decided to work on the internal database: how to keep data on selected logfile records such that it is easy to produce the required output (this information will be stored in a Stats object).
It appears that all possible criteria (date, path and domain) are ``hierarchical'', in that each consists of a sequence of items where the first item represents the first-level category, the second item represents a subcategory within the category of the first item, etc.
E.g. a logrecord from Jun 3, 1999 10:30 can be represented by the sequence <1999,6,3,10>. For domains, an example would be <be,ac,vub,tinf2>, corresponding to ``tinf2.vub.ac.be''. For paths, <a,b,c> corresponds to ``/a/b/c''.
An obvious choice to represent e.g. date categories would then be a vector (or a list, but a vector seems more appropriate since we do not intend to insert into the sequences). A nice side effect of this decision is that the same type can be used to represent all levels:
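
    typedef map<vector<int>,int> DateCount;  // a member typedef of Stats, i.e. Stats::DateCount
    Stats::DateCount datecount;

For a category represented by a vector<int> v, counting a record is then simply

    ++datecount[v];

Since vectors compare lexicographically, the map keeps its keys ordered as

    <1999> <1999,2> <1999,3> <1999,4> ... <2000> <2000,1> ...

so the output can be produced by iterating over the map:

    for (DateCount::iterator i=datecount.begin(); i!=datecount.end(); ++i) {
      const vector<int>& v((*i).first);
      // print v nicely, e.g. 1999-6-3 10:00 for <1999,6,3,10>
      // ...
      cout << "\t" << (*i).second << "\n"; // actual count for this date category
    }
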
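To see the ordering at work, here is a minimal self-contained sketch of the counting map. The helper names (category_to_string, report, sample_counts) and the "1999-6" output format are our own assumptions for illustration; they are not part of the Stats class.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::map<std::vector<int>, int> DateCount;

// Render a category such as <1999,6,3> as "1999-6-3" (illustrative format only).
std::string category_to_string(const std::vector<int>& v) {
    std::ostringstream out;
    for (std::vector<int>::size_type i = 0; i < v.size(); ++i) {
        if (i > 0) out << '-';
        out << v[i];
    }
    return out.str();
}

// Produce one "category<TAB>count" line per key, in the map's own order.
std::string report(const DateCount& datecount) {
    std::ostringstream out;
    for (DateCount::const_iterator i = datecount.begin(); i != datecount.end(); ++i)
        out << category_to_string(i->first) << '\t' << i->second << '\n';
    return out.str();
}

// Sample data: two records in June 1999 and one in January 2000,
// each counted at level 1 (year) and level 2 (year, month).
DateCount sample_counts() {
    DateCount datecount;
    const int records[3][2] = { {2000, 1}, {1999, 6}, {1999, 6} };
    for (int r = 0; r < 3; ++r) {
        std::vector<int> year(1, records[r][0]);  // <year>
        std::vector<int> month(year);             // <year,month>
        month.push_back(records[r][1]);
        ++datecount[year];
        ++datecount[month];
    }
    return datecount;
}
```

Although the record from 2000 is inserted first, report(sample_counts()) lists the 1999 categories first, and each year immediately precedes its own months, because vectors are compared lexicographically.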
It remains to fill datecount. Although we have not designed it yet, we will want to use (at least in this version) the supplied Date class, since we hope this saves writing a parsing function for something like 16/Aug/1999:16:28:39.
So we can assume that we have a Date ``d'' in hand. Now we need to create a category vector<int> out of it for each of the category levels of interest.
But how do we know what these levels are? From the configuration file. Thus we will assume that Configuration has function members like these:

    unsigned int Configuration::max_level() const;
    set<unsigned int> Configuration::levels() const;


    set<unsigned int> levels(conf.levels());
    for (unsigned int l=1; l<=conf.max_level(); ++l)
      if (levels.count(l)) { // we are interested in this level
        vector<int> dv(l); // a vector with l integers
        switch (l) {
          case 4: dv[3] = d.hours();
          case 3: dv[2] = d.day();
          case 2: dv[1] = d.month();
          case 1: dv[0] = d.year();
        }
        ++datecount[dv];
      }

Note that there are no ``break'' statements separating the cases, and that is convenient here: control falls through from case l down to case 1, so all l components of the vector get filled.
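The fall-through construction can be tried in isolation with the following sketch. SimpleDate is a hypothetical stand-in for the supplied Date class (we only assume the four accessors used above), and date_category and count_date are names we made up for this illustration.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the supplied Date class; only the accessors
// used in the text (year, month, day, hours) are modelled.
struct SimpleDate {
    int y, m, d, h;
    int year() const { return y; }
    int month() const { return m; }
    int day() const { return d; }
    int hours() const { return h; }
};

// Build the category vector of the given level (1..4) for a date.
std::vector<int> date_category(const SimpleDate& d, unsigned int level) {
    std::vector<int> dv(level);
    switch (level) {  // deliberately no break: level 3 fills dv[2], dv[1], dv[0]
        case 4: dv[3] = d.hours();  // fall through
        case 3: dv[2] = d.day();    // fall through
        case 2: dv[1] = d.month();  // fall through
        case 1: dv[0] = d.year();
    }
    return dv;
}

// Count one date at every level of interest; levels outside 1..4 are ignored.
void count_date(std::map<std::vector<int>, int>& datecount,
                const std::set<unsigned int>& levels, const SimpleDate& d) {
    for (std::set<unsigned int>::const_iterator l = levels.begin(); l != levels.end(); ++l)
        if (*l >= 1 && *l <= 4)
            ++datecount[date_category(d, *l)];
}
```

For a date of Jun 3, 1999 10:30 and levels {1,3}, count_date increments the counters for <1999> and <1999,6,3>. Iterating over the set of levels directly, instead of looping up to max_level(), is a simplification on our part.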
That's it. The solution for domains and paths is similar except that we use vector<string> instead of vector<int>. For domains, there is the added small complication that we need to have the vector ``in reverse order'', i.e. tinf2.vub.ac.be must result in a (level 4) vector <be,ac,vub,tinf2>. Since we want to test the stuff before continuing, we also add dummy code that returns the Configuration::CRITERIUM for a format, as shown below.
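The reversal for domains could be sketched as follows. The function name domain_category is our own, and the code is only one possible way to do it, not necessarily what LogRecord does.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Split a domain name on '.' and reverse the parts, so that the most
// general component comes first: "tinf2.vub.ac.be" gives <be,ac,vub,tinf2>.
std::vector<std::string> domain_category(const std::string& domain) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type dot = domain.find('.', start);
        if (dot == std::string::npos) {
            parts.push_back(domain.substr(start));  // last component
            break;
        }
        parts.push_back(domain.substr(start, dot - start));
        start = dot + 1;
    }
    std::reverse(parts.begin(), parts.end());
    return parts;
}
```

A lower-level category is then just a prefix of the result, e.g. the first two items <be,ac> for level 2.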

    class Configuration {
    public:
      enum CRITERIUM { DATE, PATH, DOMAIN };

      Configuration();
      ~Configuration() {}

      CRITERIUM format() const { return format_; }
      unsigned int max_level() const { return max_level_; }
      const set<int>& levels() const { return levels_; }

      bool parse(istream&);

    private:
      CRITERIUM format_;
      set<int> levels_;
      unsigned int max_level_;
    };

Note the trick used to represent paths: a path "/a/b/c" is represented by a vector<string> with 4 components: <"","a","b","c">. The first empty string represents the ``root'' directory. Thus, a path "/" will be represented by <"">.
Smaller issues that must be solved for paths include the removal of a trailing ``/'', so that e.g. "/a/b/c/" is understood as "/a/b/c", and the decoding of things like "%7E" as '~', for which we borrow a www_decode(string&) function from somewhere.
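The path handling described above might look like the sketch below. Note that decode_percent is a simplified stand-in that we wrote for this illustration, not the borrowed www_decode, and path_category is likewise an assumed name. Decoding before splitting is a simplification: an encoded "%2F" would then be treated as a separator.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Simplified stand-in for www_decode: replace each "%XX" (two hex digits)
// by the character with that code; anything else is copied unchanged.
std::string decode_percent(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()
            && std::isxdigit(static_cast<unsigned char>(s[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
            const std::string hex = s.substr(i + 1, 2);
            out += static_cast<char>(std::strtol(hex.c_str(), 0, 16));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += s[i];
        }
    }
    return out;
}

// Turn "/a/b/c" or "/a/b/c/" into <"","a","b","c">; "/" becomes <"">.
std::vector<std::string> path_category(const std::string& raw) {
    std::string path = decode_percent(raw);
    // Drop trailing slashes; "/" itself becomes "", which splits into <"">.
    while (!path.empty() && path[path.size() - 1] == '/')
        path.erase(path.size() - 1);
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type slash = path.find('/', start);
        if (slash == std::string::npos) {
            parts.push_back(path.substr(start));  // last component
            break;
        }
        // The text before the leading '/' is empty: the ``root'' item.
        parts.push_back(path.substr(start, slash - start));
        start = slash + 1;
    }
    return parts;
}
```

With this sketch, "/%7Euser/" yields the two components "" and "~user".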
The details are in the implementation of LogRecord which now looks like this.

    class LogRecord {
    public:
      LogRecord() {}

      bool parse_line(const string& line);

      const Date& date() const { return date_; }
      const vector<string>& path() const { return path_; }
      const vector<string>& domain() const { return domain_; }

    private:
      bool parse_date(const string& line);
      bool parse_path(const string& line);
      bool parse_domain(const string& line);

      Date date_;
      vector<string> path_;
      vector<string> domain_;
    };

The present state of the system can be found in the directory stage01.