Parsing itself is not difficult. Since this is a 2nd year project, we will parse ``by hand''. If we were wiser, we would of course use professional tools like yacc/bison (a parser generator) or lex/flex (a scanner generator). Parsing by hand is an ``ad hoc'' affair, but we can make the job easier by using auxiliary functions, essentially one for each nonterminal in the grammar (this strategy is called ``predictive'' parsing; it is less efficient than the parsers generated by bison, but the parsing method is clearly not critical here from an efficiency point of view).
The main Configuration::parse() function will call Configuration::parse_comment, Configuration::parse_format, Configuration::parse_select and Configuration::parse_exclude, depending on the first word detected. These will then further call functions like Configuration::parse_pattern which itself may call Configuration::parse_string_pattern, Configuration::parse_domain_name etc.
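The dispatch on the first word can be sketched as follows. This is only an illustration, not the project's Configuration class: MiniConfig, its last() accessor, the keyword spellings and the use of '#' as comment marker are all assumptions, and the parse functions merely record which one was chosen.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for Configuration: it only records which
// parse function was selected, to show the dispatch on the first word.
class MiniConfig {
public:
  MiniConfig() {}

  // Returns false if the line starts with an unknown word.
  bool parse_line(const std::string& line) {
    std::istringstream in(line);
    std::string word;
    if (!(in >> word))
      return true;                 // empty line: nothing to parse
    if (word[0] == '#')            // assumed comment marker
      return parse_comment(in);
    if (word == "format")          // assumed keyword spellings
      return parse_format(in);
    if (word == "select")
      return parse_select(in);
    if (word == "exclude")
      return parse_exclude(in);
    return false;                  // unknown directive
  }

  // Name of the parse function chosen for the last recognised line.
  const std::string& last() const { return last_; }

private:
  // One function per nonterminal; a real version would consume the
  // rest of the stream and call further helpers such as parse_pattern.
  bool parse_comment(std::istream&) { last_ = "comment"; return true; }
  bool parse_format(std::istream&)  { last_ = "format";  return true; }
  bool parse_select(std::istream&)  { last_ = "select";  return true; }
  bool parse_exclude(std::istream&) { last_ = "exclude"; return true; }

  std::string last_;
};
```

Each helper receives the stream positioned just after the first word, so it can continue reading the remainder of the directive in the same predictive style.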
A more interesting problem is how we will represent the result of parsing select or exclude directives (the representation of the format info is already fixed: it is stored in Configuration::max_level_ and Configuration::levels_).
Clearly, a selection criterion must be stored in a form that is easy to use. An elegant solution, which we will adopt, is to consider a selection as an Expression that can be evaluated for a LogRecord.
class Expression {
public:
  virtual bool eval(const LogRecord& r) const = 0;
  virtual ~Expression() {}
};
We will then have several classes that are derived from Expression, one for each kind of selection: e.g.
class PathExpression: public Expression {
public:
  PathExpression(const string& pathstring);
  bool eval(const LogRecord& r) const;
};
class NotExpression: public Expression {
public:
  NotExpression(const Expression* e): exp_(e) {}
  // negate the wrapped expression; a null expression never matches
  bool eval(const LogRecord& r) const { return (exp_ ? !exp_->eval(r) : false); }
private:
  const Expression* exp_;
};
Thus, if we parse an exclude directive, we just construct an Expression e as for a select directive and then do something like
new NotExpression(e);
Similarly, we define an AndExpression to keep a bunch of directives that must all be fulfilled:
class AndExpression: public Expression {
public:
  AndExpression() {}
  void add(Expression* e) { if (e) exps_.push_back(e); }
  // true only if every stored expression holds for r
  bool eval(const LogRecord& r) const {
    for (list<Expression*>::const_iterator i=exps_.begin(); i!=exps_.end(); ++i)
      if (!(*i)->eval(r))
        return false;
    return true;
  }
private:
  list<Expression*> exps_;
};
This is rather a lot to add to Configuration, so we put it in new files expression.h and expression.C.
Since we don't know how many directives there will be, we will use new and delete to create Expression objects. The result of all ``select'' and ``exclude'' directives will be an AndExpression which we will store in Configuration.
Expression* Configuration::select_;
and a suitable retrieval function.
Expression* Configuration::selection() const { return select_; }
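To see how the pieces fit together, here is a small self-contained sketch of one select and one exclude directive combined into a single expression. It does not use the project's classes: ToyRecord, Expr, PrefixExpr, NotExpr, AndExpr and build_selection are hypothetical stand-ins, the directive syntax in the comments is assumed, and the choice to let the composite expressions delete their children is one possible ownership policy, not necessarily the one used in the project.

```cpp
#include <list>
#include <string>

// Toy stand-in for LogRecord: only a path string (assumption).
struct ToyRecord {
  std::string path;
};

class Expr {
public:
  virtual bool eval(const ToyRecord& r) const = 0;
  virtual ~Expr() {}
};

// Hypothetical selection: true if the record's path starts with prefix.
class PrefixExpr : public Expr {
public:
  explicit PrefixExpr(const std::string& prefix) : prefix_(prefix) {}
  bool eval(const ToyRecord& r) const {
    return r.path.compare(0, prefix_.size(), prefix_) == 0;
  }
private:
  std::string prefix_;
};

// Negation; owns the wrapped expression (assumed ownership policy).
class NotExpr : public Expr {
public:
  explicit NotExpr(const Expr* e) : exp_(e) {}
  ~NotExpr() { delete exp_; }
  bool eval(const ToyRecord& r) const { return exp_ ? !exp_->eval(r) : false; }
private:
  NotExpr(const NotExpr&);             // copying disabled: exp_ is owned
  NotExpr& operator=(const NotExpr&);
  const Expr* exp_;
};

// Conjunction; owns every expression added to it.
class AndExpr : public Expr {
public:
  AndExpr() {}
  ~AndExpr() {
    for (std::list<Expr*>::iterator i = exps_.begin(); i != exps_.end(); ++i)
      delete *i;
  }
  void add(Expr* e) { if (e) exps_.push_back(e); }
  bool eval(const ToyRecord& r) const {
    for (std::list<Expr*>::const_iterator i = exps_.begin(); i != exps_.end(); ++i)
      if (!(*i)->eval(r))
        return false;
    return true;
  }
private:
  AndExpr(const AndExpr&);             // copying disabled: children are owned
  AndExpr& operator=(const AndExpr&);
  std::list<Expr*> exps_;
};

// What a parser might build for two directives (syntax assumed):
//   select  /docs
//   exclude /docs/private
Expr* build_selection() {
  AndExpr* all = new AndExpr;
  all->add(new PrefixExpr("/docs"));
  all->add(new NotExpr(new PrefixExpr("/docs/private")));
  return all;
}
```

With this policy a single delete of the top-level expression releases the whole tree, which keeps the cleanup code in one place.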
The main function can then be updated like so
Expression* select(config.selection());
while (getline(config_file,line)) {
  if (logrecord.parse_line(line)) {
    if (select) {
      if (select->eval(logrecord))
        stats.add(logrecord);
    }
    else // no selections or exclusions
      stats.add(logrecord);
    ++n_lines_parsed;
  }
  else
    ++n_error_lines;
}
The remaining problem is the parsing and representation of the patterns inside Expressions. Looking e.g. at a path string, it seems that parsing such a string should reuse code that is now in LogRecord::parse_path(), and similarly for parsing a domain pattern. LogRecord::parse_date() is not reusable, so we will write special code for date patterns. Thus we will lift the reusable parsing code out of the LogRecord parse functions and put it somewhere where it can be reached by Expression (such a restructuring operation is called refactoring).
Where to put the code? One alternative is as free functions, e.g.
parse_domain(const string&,vector<string>&);
class LogRecord {
public:
  LogRecord() {}
  bool parse_line(const string& line); // fill in date_, path_, domain_
  const Date& date() const { return date_; }
  const Path& path() const { return path_; }
  const Domain& domain() const { return domain_; }
private:
  Date date_;
  Path path_;
  Domain domain_;
};
There are 4 more files: domain.h, domain.C, path.h and path.C. For Domain, we decide to switch from vector<string> to list<string> as the underlying representation, since this allows us to use list<string>::push_front to insert domain components, thus immediately storing the components in the right order, e.g. "tinf2.vub.ac.be" is stored (and printed) as "be.ac.vub.tinf2".
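The effect of push_front can be illustrated with a minimal sketch. The free functions split_domain and join_domain are hypothetical names chosen for this example; the actual Domain class may be organised differently.

```cpp
#include <list>
#include <string>

// Splits a host name on '.' and stores the components with the most
// significant one first, by inserting each component at the front.
// Empty components (e.g. from "a..b") are skipped.
std::list<std::string> split_domain(const std::string& name) {
  std::list<std::string> parts;
  std::string::size_type start = 0;
  while (start <= name.size()) {
    std::string::size_type dot = name.find('.', start);
    if (dot == std::string::npos)
      dot = name.size();               // last component runs to the end
    if (dot > start)
      parts.push_front(name.substr(start, dot - start));
    start = dot + 1;
  }
  return parts;
}

// Joins the stored components with '.' in storage order.
std::string join_domain(const std::list<std::string>& parts) {
  std::string out;
  for (std::list<std::string>::const_iterator i = parts.begin();
       i != parts.end(); ++i) {
    if (!out.empty())
      out += '.';
    out += *i;
  }
  return out;
}
```

Because each component is inserted at the front, no separate reversal step is needed: join_domain(split_domain("tinf2.vub.ac.be")) yields "be.ac.vub.tinf2".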
We also add a constructor
Path::Path(const Path& p, int n)
const Path& path(r.path());
unsigned int m(min(path.size(),max_level));
for (unsigned int l=1; l<=m; ++l)
  if (levels.count(l)) {
    Path pp(path,l); // use brand new Path constructor
    ++paths_[pp];
  }
void Stats::print_paths(ostream& os) const {
  for (PathCount::const_iterator i=paths_.begin(); i!=paths_.end(); ++i) {
    const Path& path((*i).first);
    os << path << "\t" << (*i).second << "\n";
  }
}
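A prefix constructor of this kind, and the counting it enables, could look roughly like the sketch below. Everything here is an assumption made for illustration: MiniPath stands in for Path, it is assumed to store its components in a vector<string>, the two-argument constructor is assumed to keep the first n components, and count_prefixes is a hypothetical helper that leaves out the levels filter for brevity.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for Path, storing components in a vector.
class MiniPath {
public:
  typedef std::vector<std::string>::size_type size_type;

  explicit MiniPath(const std::vector<std::string>& parts) : parts_(parts) {}

  // Prefix constructor: keep only the first n components of p
  // (all of them if p has fewer than n).
  MiniPath(const MiniPath& p, size_type n)
    : parts_(p.parts_.begin(),
             p.parts_.begin() + std::min(n, p.parts_.size())) {}

  size_type size() const { return parts_.size(); }

  // Ordering so that MiniPath can be used as a map key.
  bool operator<(const MiniPath& o) const { return parts_ < o.parts_; }

  // Printable form, e.g. "/docs/cpp".
  std::string str() const {
    std::string out;
    for (size_type i = 0; i < parts_.size(); ++i) {
      out += '/';
      out += parts_[i];
    }
    return out;
  }

private:
  std::vector<std::string> parts_;
};

typedef std::map<MiniPath, int> PathCount;

// Counts every prefix of path up to max_level components.
void count_prefixes(const MiniPath& path, MiniPath::size_type max_level,
                    PathCount& counts) {
  MiniPath::size_type m = std::min(path.size(), max_level);
  for (MiniPath::size_type l = 1; l <= m; ++l)
    ++counts[MiniPath(path, l)];
}
```

Using the path object itself as the map key is what makes the printing loop so short: the map keeps the prefixes sorted, and each key already knows how to print itself.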
Now the part of Stats dealing with dates looks ugly when compared to the elegant handling of paths and domains. So we may as well define a DatePattern class that can parse date patterns from the configuration file and that will replace Date in LogRecord. The supplied Date class will then only be used in LogRecord::parse_date(), and that only because we are too stubborn to write our own parsing code. So that's another two files: datepattern.h and datepattern.C. Note the brute force approach to parsing DatePatterns. It is simple and it works, although it may accept strings that would not be acceptable to a more careful parsing function, e.g. one constructed with the help of bison.
The result of the refactoring can be found in the stage02a directory.