
Parsing the configuration file and selections -- part 1.

Parsing itself is not difficult. Since this is a 2nd year project, we will parse ``by hand''. If we were wiser, we would of course use a professional parser generator like yacc/bison or lex/flex. Parsing by hand is an ``ad hoc'' affair but we can make the job easier by using auxiliary functions, essentially one for each nonterminal in the grammar (this strategy is called ``predictive'' parsing; it is less efficient than the parsers generated by bison, but the parsing method is clearly not critical here from an efficiency point of view).

The main Configuration::parse() function will call Configuration::parse_comment, Configuration::parse_format, Configuration::parse_select and Configuration::parse_exclude, depending on the first word detected. These will then further call functions like Configuration::parse_pattern which itself may call Configuration::parse_string_pattern, Configuration::parse_domain_name etc.
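The dispatch on the first word can be sketched as follows. This is only an illustration, not the project's code: MiniConfig is a hypothetical stand-in for Configuration, its parse_* helpers merely count how often they are called, and both the keyword spellings and the use of '#' to start a comment are assumptions about the configuration syntax.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for Configuration. Each parse_* helper only counts
// its calls, so that the dispatch logic can be observed in isolation.
struct MiniConfig {
  int comments = 0, formats = 0, selects = 0, excludes = 0;

  bool parse_comment(std::istringstream&) { ++comments; return true; }
  bool parse_format(std::istringstream&)  { ++formats;  return true; }
  bool parse_select(std::istringstream&)  { ++selects;  return true; }
  bool parse_exclude(std::istringstream&) { ++excludes; return true; }

  // Handle one configuration line: read the first word and hand the rest
  // of the line to the helper for the matching nonterminal.
  bool parse_line(const std::string& line) {
    std::istringstream in(line);
    std::string word;
    if (!(in >> word)) return true;                 // blank line: nothing to do
    if (word[0] == '#') return parse_comment(in);   // assumed comment marker
    if (word == "format")  return parse_format(in);
    if (word == "select")  return parse_select(in);
    if (word == "exclude") return parse_exclude(in);
    return false;                                   // unknown directive
  }
};
```

Because the first word alone decides which helper is called, no backtracking is needed, which is what makes the parser predictive.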

A more interesting problem is how we will represent the result of parsing select or exclude directives (the representation of the format info is already fixed: it is stored in Configuration::max_level_ and Configuration::levels_).

Clearly, a selection criterion must be stored in such a way that it is easy to use. An elegant solution, which we will adopt, is to consider a selection as an Expression that can be evaluated for a LogRecord.

  class Expression {
  public:
    virtual bool eval(const LogRecord& r) const = 0;
    virtual ~Expression() {}
  };
Note the virtual destructor which is needed because we will use Expression as a base class.

We will then have several classes that are derived from Expression, one for each kind of selection: e.g.

  class PathExpression: public Expression {
  public:
    PathExpression(const string& pathstring);
    bool eval(const LogRecord& r) const;
  };
Exclusions are easy to do by using a NotExpression:
  class NotExpression: public Expression {
  public:
    NotExpression(const Expression* e): exp_(e) {}
    // A missing operand never matches.
    bool eval(const LogRecord& r) const { return (exp_ ? !exp_->eval(r) : false); }
  private:
    const Expression* exp_;
  };

Thus, if we parse an exclude directive, we just construct an Expression e as for a select directive and then do something like

  new NotExpression(e);

Similarly, we define an AndExpression to keep a bunch of directives that must all be fulfilled:

  class AndExpression: public Expression {
  public:
    AndExpression() {}
    void add(Expression* e) { if (e) exps_.push_back(e); }
    bool eval(const LogRecord& r) const { 
      for (list<Expression*>::const_iterator i=exps_.begin(); i!=exps_.end(); ++i)
        if (!(*i)->eval(r))
          return false;
      return true;
      }
  private:
    list<Expression*> exps_;
  };
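To see how these classes combine, here is a self-contained sketch under several assumptions: LogRecord is reduced to a hypothetical struct holding only a path string, PrefixExpression is an invented stand-in for PathExpression, and the destructors that delete the operands reflect one possible ownership policy that the text above does not prescribe.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for the real LogRecord: only a path string.
struct LogRecord {
  std::string path;
};

class Expression {
public:
  virtual bool eval(const LogRecord& r) const = 0;
  virtual ~Expression() {}
};

// Hypothetical selection: true if the record's path starts with a prefix.
class PrefixExpression: public Expression {
public:
  explicit PrefixExpression(const std::string& prefix): prefix_(prefix) {}
  bool eval(const LogRecord& r) const override {
    return r.path.compare(0, prefix_.size(), prefix_) == 0;
  }
private:
  std::string prefix_;
};

class NotExpression: public Expression {
public:
  explicit NotExpression(const Expression* e): exp_(e) {}
  ~NotExpression() override { delete exp_; }  // assumption: owns its operand
  bool eval(const LogRecord& r) const override {
    return exp_ ? !exp_->eval(r) : false;
  }
private:
  const Expression* exp_;
};

class AndExpression: public Expression {
public:
  ~AndExpression() override {                 // assumption: owns its operands
    for (Expression* e : exps_) delete e;
  }
  void add(Expression* e) { if (e) exps_.push_back(e); }
  bool eval(const LogRecord& r) const override {
    for (const Expression* e : exps_)
      if (!e->eval(r)) return false;
    return true;
  }
private:
  std::list<Expression*> exps_;
};

// Builds the equivalent of selecting "/docs" and excluding "/docs/private".
Expression* build_example() {
  AndExpression* all = new AndExpression;
  all->add(new PrefixExpression("/docs"));
  all->add(new NotExpression(new PrefixExpression("/docs/private")));
  return all;
}
```

With this policy a single delete on the pointer returned by build_example() releases the whole tree. Copying such an object would lead to a double delete, so a more careful version would forbid copies.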

This is a bit much to all add to Configuration, so we make new files expression.h and expression.C.

Since we don't know how many directives there will be, we will use new and delete to create Expression objects. The result of all ``select'' and ``exclude'' directives will be an AndExpression which we will store in Configuration.

  Expression* Configuration::select_;

and a suitable retrieval function.

        Expression* Configuration::selection() const { return select_; } 

The main function can then be updated like so

  Expression* select(config.selection());

  while (getline(config_file,line)) {
    if (logrecord.parse_line(line)) {
      if (select) {
        if (select->eval(logrecord))
          stats.add(logrecord);
        }
      else // no selections or exclusions
        stats.add(logrecord);
      ++n_lines_parsed;
      }
    else
      ++n_error_lines;
    }

The remaining problem is the parsing and representation of the patterns inside Expressions. Looking e.g. at a path string, it seems that parsing such a string should reuse code which is now in LogRecord::parse_path(). The same holds for parsing a domain pattern. LogRecord::parse_date() is not reusable, so we will write special code for date patterns. Thus we will lift the reusable parsing code out of the LogRecord parse functions and put it somewhere where it can be reached by Expression (such a restructuring operation is called refactoring).

Where to put the code? One alternative is as free functions, e.g.

        parse_domain(const string&,vector<string>&);
However, we prefer to bite the bullet and add two extra classes: Domain and Path that contain the relevant parsing code (which can also be used for patterns) and modify existing code accordingly, as shown below for LogRecord.
        class LogRecord {
        public:
          LogRecord() {}
        
          bool parse_line(const string& line); // fill in date_, path_, domain_
        
          const Date& date() const { return date_; }
          const Path& path() const { return path_; }
          const Domain& domain() const { return domain_; }
        
        private:
          Date   date_;
          Path   path_;
          Domain domain_;
        };
This is a reasonably complex reorganization, so we start by doing that, and then verify that everything still works, before implementing Expression and Configuration parsing.

There are 4 more files: domain.h, domain.C, path.h and path.C. For Domain, we decide to switch from vector<string> to list<string> as the underlying representation since this allows us to use list<string>::push_front to insert domain components, thus immediately storing the components in the right order, e.g. "tinf2.vub.ac.be" is stored (and printed) as "be.ac.vub.tinf2".
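The effect of push_front can be illustrated with two free functions. Both split_domain and join_domain are hypothetical names used only for this sketch; the real Domain class may organise the same logic differently.

```cpp
#include <list>
#include <sstream>
#include <string>

// Split a host name on '.' and insert each component at the front, so that
// the list ends up with the most significant component first.
std::list<std::string> split_domain(const std::string& name) {
  std::list<std::string> parts;
  std::istringstream in(name);
  std::string part;
  while (std::getline(in, part, '.'))
    if (!part.empty())
      parts.push_front(part);
  return parts;
}

// Join the stored components with '.' in their stored order.
std::string join_domain(const std::list<std::string>& parts) {
  std::string out;
  for (const std::string& p : parts) {
    if (!out.empty()) out += '.';
    out += p;
  }
  return out;
}
```

Calling join_domain(split_domain("tinf2.vub.ac.be")) yields "be.ac.vub.tinf2". With a vector the same order would require either inserting at the front, which shifts all elements, or reversing afterwards.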

We also add a constructor

    Path::Path(const Path& p, int n)
which will initialize a Path by copying the first n components from p. Thus the code to store a Path in Stats::add() becomes simpler:
    const Path& path(r.path());
    unsigned int m(min(path.size(),max_level));
    for (unsigned int l=1; l<=m; ++l) 
      if (levels.count(l)) {
        Path pp(path,l); // use brand new Path constructor
        ++paths_[pp];
        }
Naturally, we will implement operator<<(ostream &,const Path &), simplifying the output in Stats::print_paths() to:
   void
   Stats::print_paths(ostream& os) const {
   for (PathCount::const_iterator i=paths_.begin(); i!=paths_.end(); ++i) {
     const Path& path((*i).first);
     os << path << "\t" << (*i).second << "\n";
     }
   }
where, of course, Stats::PathCount is now a map<Path,int> instead of a map<vector<string>,int> (that is why we need Path::operator== and Path::operator<, which simply delegate their job to the underlying vector<string>). The introduction of Domain brings similar benefits.
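A minimal sketch of such a Path class follows, under stated assumptions: the constructor from a string and the helper count_prefixes are hypothetical, the prefix constructor takes a std::size_t rather than an int to avoid signed/unsigned comparisons, and the check against the set of requested levels is left out for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

class Path {
public:
  Path() {}
  // Hypothetical: split "/a/b/c" into the components a, b and c.
  explicit Path(const std::string& s) {
    std::istringstream in(s);
    std::string part;
    while (std::getline(in, part, '/'))
      if (!part.empty()) parts_.push_back(part);
  }
  // Copy the first n components of p (fewer if p is shorter).
  Path(const Path& p, std::size_t n)
    : parts_(p.parts_.begin(),
             p.parts_.begin() +
               static_cast<std::ptrdiff_t>(std::min(n, p.parts_.size()))) {}

  std::size_t size() const { return parts_.size(); }

  // Comparisons delegate to the underlying vector<string>.
  bool operator==(const Path& o) const { return parts_ == o.parts_; }
  bool operator<(const Path& o) const { return parts_ < o.parts_; }

  friend std::ostream& operator<<(std::ostream& os, const Path& p) {
    for (const std::string& c : p.parts_) os << '/' << c;
    return os;
  }
private:
  std::vector<std::string> parts_;
};

typedef std::map<Path, int> PathCount;

// Count every prefix of path up to max_level components.
void count_prefixes(const Path& path, std::size_t max_level, PathCount& counts) {
  std::size_t m = std::min(path.size(), max_level);
  for (std::size_t l = 1; l <= m; ++l)
    ++counts[Path(path, l)];
}
```

The map only needs operator< to order its keys; operator== is provided for code that compares paths directly.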

Now the part of Stats dealing with dates looks ugly when compared to the elegant handling of paths and domains. So we may as well define a DatePattern class that can parse date patterns from the configuration file and that will replace Date in LogRecord. The supplied Date class will then only be used in LogRecord::parse_date(), and that only because we are too stubborn to write our own parsing code. So that's another two files: datepattern.h and datepattern.C. Note the brute force approach to parsing DatePatterns. It is simple and it works, although it may accept strings that would not be acceptable to a more careful parsing function, e.g. one constructed with the help of bison.

The result of the refactoring can be found in the stage02a directory.


httpstats-stage02a [ 7 April, 2001]