More Perl regex stuff in E

Mark S. Miller
Sat, 17 Apr 1999 13:33:10 -0700

At 08:43 AM 4/16/99 , wrote:
>I'm not familiar enough with E or perl's regex syntax to really understand
>your code, but I know that Python does let you break up strings just like C
>so that you can do something like:
>r = re.compile( "\n?" # optional newline(?)
>			"[ \t]*" # zero or more tabs or spaces
>		)
>This is a common idiom in Python regex code and a simple way to explain what
>you are doing.  The compiler implicitly concatenates those strings into one
>string as the argument to re.compile.

Thanks for pointing this out.  The "(?x)" in Perl5 (and therefore
OroMatcher) tells the regex engine to ignore unescaped whitespace outside of
character classes, and also to accept an unescaped "#" as a
comment-to-end-of-line character.  I had been prohibiting this, as my
wrapper would have to distinguish whether a "(" or "@" was inside a comment
or not, which would mean it would have to know what whitespace mode we're in
at that time.  As an experiment, I enabled it and rewrote the previous
short code to heavily comment the otherwise mysterious regexes.  I then made
sure that my embedded comments contained no "("s or "@"s.  The result
(below) is so compellingly better that it's clear I need to support the (?x)
flag.  I'm even tempted to make it the default.  Hell, I'm almost tempted to
prohibit turning it off, so old Perl programmers won't be so tempted to
carry over their write-only regex coding habits.  (Take it easy everyone, I
said "almost".)
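
For readers coming from the Python side of this thread, here is a minimal
sketch of the same whitespace-and-comment mode using Python's re.VERBOSE
flag, which I'm assuming behaves like Perl5's (?x) for this purpose.  The
pattern and the names LABEL and SWALLOWED are made up for illustration; they
are not part of the E code.

```python
import re

# re.VERBOSE plays the role of (?x): unescaped whitespace outside character
# classes is ignored, and an unescaped "#" starts a comment to end of line.
LABEL = re.compile(r"""
    [ \t]*            # optional leading whitespace
    \#                # a literal "#", escaped so it does not start a comment
    [ \t]+            # at least one space or tab
    (?P<label>\w*)    # the label itself
    :                 # a literal colon
""", re.VERBOSE)

m = LABEL.match("  # value: 42")
label = m.group("label")

# An unescaped "#" silently swallows the rest of the line, so this
# pattern is really just "a".
SWALLOWED = re.compile(r"a#b", re.VERBOSE)
matches_bare_a = SWALLOWED.fullmatch("a") is not None
matches_a_hash_b = SWALLOWED.fullmatch("a#b") is not None
```

The second pattern shows the trap: any "#" that is meant to be matched
literally has to be escaped once this mode is on.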

As an added benefit, (?x) enabled me to avoid long lines.  And Chip, you 
were right -- it looks like my mailer should get the rap for wrapping.

(I'd like to take a moment to mention how impressed I am with OROMatcher.  
The entire oromatcher.jar file is only 27KB compressed, it's clearly doing 
something impossible, and for all the weird things I've done with it, every 
bug I've encountered has been one of mine.)

The following is essentially identical to the previous 21-line program 
except for these whitespace and comment issues.  With as a casual 
reference, is this readable?  How do y'all like it, code-quality-wise, 
compared to how you'd write it without a regex engine?  (And no, I don't 
imagine it'll be possible to get people to comment typical E regex code this 
heavily.)  With one exception, it worked the first time after transforming.  
The exception is that I didn't think to escape the "#"s that were part of 
the regex proper.

# Given the text of an updoc script, this returns a sequence of
# carrots, one for each execution unit.
define parseUpdoc(script) {
    define result := [] diverge
    while (script =~ rx`(?msx) # 'm' allows multi-line input
                               # 's' allows newlines to be treated as a 
                               #     normal character
                               # 'x' allows comments and whitespace, such
                               #     as this, to be embedded in the regex

                        # Everything up to the last newline before the 
                        # beginning of the first code segment is gathered
                        # into a comment.  The "?" in ".*?" is the
                        # anti-greedy modifier.  It means "eat the least
                        # that would enable the rest of the match to 
                        # succeed".  Without it, the comment would be
                        # everything before the *last* code segment.
                        # The "?" on the newline makes it optional,
                        # allowing the first code segment to come at
                        # the beginning of the text.
                        (@comment.*?)\n?

                        # A code segment begins with a line whose first
                        # non-whitespace character is a "?".  We
                        # wish to gather into the 'code' variable
                        # all text starting after the first optional
                        # space up to but not including the
                        # last newline of the code segment.  This
                        # pattern just picks up the first line of
                        # code.  The while loop below picks up the rest
                        [ \t]*\?[ ]?(@code.*?)\n

                        # Just gathers everything else for further
                        # processing. 
                        (@rest.*)`) {
        script := rest
        while (script =~ rx`(?msx)

                            # A code continuation line has a ">" as
                            # its first non-whitespace character.
                            # The extra code it contributes starts
                            # after the following optional space.
                            [ \t]*>[ ]?(@moreCode.*?)\n

                            # everything else
                            (@rest.*)`) {

            # Since we have more code, the previous newline was part
            # of the code, so add it back in.
            code += "\n" + moreCode
            script := rest
        }
        if (script =~ rx`(?msx)

                         # A code block is optionally followed by an
                         # output block, which is the output we'll
                         # be regression testing against.  Ignoring
                         # whitespace, the first line of an output
                         # block is identified by a 
                         # "# <identifier>:".  The identifier is the
                         # label saying what kind of output block we
                         # have.  We have to escape the "#"s below
                         # so rx knows they don't start a comment.
                         [ \t]*\#[ \t]+?(@label\w*):

                         # The output then consists of the
                         # following text with all leading and
                         # trailing whitespace trimmed from each
                         # line, and without the final newline.
                         [ \t]*(@output.*?)[ \t]*\n

                         # Everything else
                         (@rest.*)`) {

            script := rest
            while (script =~ rx`(?msx)

                                # Output continues if the next
                                # line's first non-whitespace
                                # character is "#".
                                [ \t]*\#[ \t]*(@moreOutput.*?)[ \t]*\n

                                # Everything else
                                (@rest.*)`) {

                output += "\n" + moreOutput
                script := rest
            }

            # Updoc proceeds, of course, by munching carrots.  This
            # carrot does have an output block to regression-test
            # against.
            result push(CarrotMaker(comment, code, label, output))

        } else {
            # This carrot has no output block, but it still needs to
            # be chewed in sequence.  First, it probably has side
            # effects that the later carrots depend on.  Second, lack of an
            # output block still implicitly asserts a successful,
            # rather than an exceptional (abrupt), outcome.
            result push(CarrotMaker(comment, code, null, null))
        }
    }
    result snapshot
}
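
For comparison, here is a rough Python sketch of the same parse, following up
on the Python idiom quoted above.  It is an approximation under stated
assumptions, not a port of the E semantics: it uses re.VERBOSE in place of
(?x), anchors each pattern at the start of the remaining text with match(),
uses "^" to find the start of a code line, and returns plain tuples where the
E code builds carrots.  The names parse_updoc, FIRST_CODE, MORE_CODE,
FIRST_OUTPUT and MORE_OUTPUT are hypothetical.

```python
import re

FLAGS = re.VERBOSE | re.DOTALL | re.MULTILINE

FIRST_CODE = re.compile(r"""
    (?P<comment>.*?)\n?              # least text before the first code line
    ^[ \t]*\?[ ]?(?P<code>[^\n]*)\n  # first line of a code segment
    (?P<rest>.*)                     # everything else
""", FLAGS)

MORE_CODE = re.compile(r"""
    [ \t]*>[ ]?(?P<more>[^\n]*)\n    # a ">" continuation line
    (?P<rest>.*)                     # everything else
""", FLAGS)

FIRST_OUTPUT = re.compile(r"""
    [ \t]*\#[ \t]+(?P<label>\w*):       # "# <identifier>:" with "#" escaped
    [ \t]*(?P<output>[^\n]*?)[ \t]*\n   # trimmed first line of output
    (?P<rest>.*)                        # everything else
""", FLAGS)

MORE_OUTPUT = re.compile(r"""
    [ \t]*\#[ \t]*(?P<more>[^\n]*?)[ \t]*\n   # a further "#" output line
    (?P<rest>.*)                              # everything else
""", FLAGS)


def parse_updoc(script):
    """Return one (comment, code, label, output) tuple per execution unit."""
    result = []
    while (m := FIRST_CODE.match(script)):
        comment, code, script = m.group("comment", "code", "rest")
        while (m := MORE_CODE.match(script)):
            # The newline before a continuation line is part of the code.
            code += "\n" + m.group("more")
            script = m.group("rest")
        label = output = None          # units without an output block
        if (m := FIRST_OUTPUT.match(script)):
            label, output, script = m.group("label", "output", "rest")
            while (m := MORE_OUTPUT.match(script)):
                output += "\n" + m.group("more")
                script = m.group("rest")
        result.append((comment, code, label, output))
    return result


sample = "Intro text\n? 1 + 1\n# value: 2\n\n? def f():\n>     return 3\n"
units = parse_updoc(sample)
```

With this sample, the first unit carries an output block and the second does
not, which corresponds to the two branches of the if/else above.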