#! /usr/bin/env python
    

    lpy.py - convert python source to formatted html

usage:
    lpy [-option(s)] filename(s)

writes:
    filename.html (beware!)

options:
    -autoformat = for processing python source with no formatting cues
    -autosplit = inserts split directives automatically
    -nosplits = disables split directives
    -noindex = disable automatic index
    -printing = uses smaller fonts

download lpy.zip

for more information,
contact:
Danbala Software

    __Id__ = "$Id: lpy.py.html,v 1.1 2005/01/26 01:13:52 u37519820 Exp $"

Introduction

lpy.py is a tool for code publishing inspired by Knuth's Literate Programming concepts. Its purpose is to make code presentable for reading.

Practically speaking, it's yet-another-Python-to-HTML tool, with additions.

Other than regular Python colorizing, we're looking for some special stuff in comments and docstrings. A comment or string starting with ~ is pulled out and passed through a macro engine. The macro engine looks for the @ sign as a macro marker, using several syntaxes:

    @symbol - can be a variable replacement or function call (no arguments)
    @symbol{string} - call a function (or % a string) with one literal string
    @symbol(args) - call a function with args
    @{any Python code} - execute any Python
    @(any Python expression) - evaluate any Python and display result

The Python functions invoked can return a result string or print (to sys.stdout).
 See pymacro for more information about the macro engine.
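
To make the first two syntaxes concrete, here is a minimal sketch of an @-expander, written for current Python 3. It is not pymacro's implementation: MACROS, MACRO_RE and expand are names made up for this illustration, the macro table is invented, and only @symbol and @symbol{string} are handled.

```python
import re

# Hypothetical macro table: illustrative entries, not the real htmlmax set.
MACROS = {
    "i": lambda text: "<i>%s</i>" % text,
    "b": lambda text: "<b>%s</b>" % text,
    "N": lambda text: "<br>",
}

# @symbol, optionally followed by {one literal string without nested braces}.
MACRO_RE = re.compile(r"@(\w+)(?:\{([^{}]*)\})?")

def expand(text):
    """Expand @symbol and @symbol{string}; unknown symbols are left untouched."""
    def substitute(match):
        name, arg = match.group(1), match.group(2)
        macro = MACROS.get(name)
        if macro is None:
            return match.group(0)
        return macro(arg if arg is not None else "")
    return MACRO_RE.sub(substitute, text)

print(expand("This is @i{important}.@N"))
```

The real engine also evaluates arbitrary Python via @{...} and @(...), which this sketch deliberately leaves out.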

The macro set has some simplified html macros, like @i{to italicize} to italicize, for example. HTML is simple and widely known, but it is verbose, and puts lots of noise in the input stream. The macros are intended to make the source tidier. See htmlmax for information about the macro set.

We're also looking for comments or strings starting with @, which maps to early instead of late execution. The ~ stuff is processed on the way out (late), and the @ on the way in (early), so the @ directives occurring later in the source can affect the behavior of the ~ directives occurring earlier in the source. The contents macros use this mechanism.

One other little thing - any comment starting with #- is replaced with <hr>.
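
The comment markers can be told apart with a few character tests. The sketch below (Python 3) is only an illustration: classify_comment and its kind labels are hypothetical names, loosely following the checks that Token.__init__ performs on COMMENT tokens.

```python
def classify_comment(comment):
    """Return (kind, payload) for a comment token such as '#~ some text'."""
    marker = comment[1:2]   # character right after '#', or '' for a bare '#'
    if marker == "-":
        return ("rule", "<hr>")          # division line
    if marker == "~" and len(comment) > 2:
        return ("late", comment[2:])     # expanded on the way out
    if marker == "@" and len(comment) > 2:
        return ("early", comment[1:])    # expanded on the way in, '@' kept
    return ("plain", comment)

for sample in ("#- section break", "#~ Some @i{text}", "#@SetFlag{x}", "# ordinary"):
    print(classify_comment(sample))
```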

Traditional LP involves writing a mixture of formatting and code, and then extracting the program and document from the source using separate programs ('tangle' and 'weave').

The term inverted literate programming describes producing the document from the actual program source code instead of from a separate source file. That's what this tool does. The concept of having to preprocess a Python script file in order to execute does seem perverse.

There are other useful approaches to LP - eventually there could be a wysiwyg / hyperlinked / outlining / structured / browser coding-and-documenting environment that really understands the program, in order to help the programmer or reader make sense out of it.

This tool is a lot less ambitious: it's simply to help me write better code. Writing readable programs is more fun (and more difficult and time-consuming).


Order of presentation is important, and Python is pretty flexible, because stuff doesn't have to be defined until executed.

Imports should come first - using from module import *, our symbols can get stomped on.

Imports

The traditionally useful Python standard library items, sys and string stuff:
    import sys      # we're using sys.argv
    import string   # and some string manipulations
pytokens is based on tokenize from the standard library, but it adds a few useful items...
    from pytokens import *
    #Parse, WHITE, NEWLINE, NL, INDENT, DEDENT, OP, COMMENT, STRING, RESERVED

pymacro is the module responsible for our 'macro language facility'. htmlmax is the module that contains the actual html macros we're using, so let's get pymacro to load it for us...
    pymacro.load('htmlmax')
htmlmax is used to separate the actual html formatting specs from this module, to make it easier to customize. Sneaky -> the stuff got loaded into pymacro by the line above. It could have been loaded separately to keep the pymacro name space clean, but it's handier in there, anyway.
    from pymacro import html_prefix, html_suffix, html_wrap, gWraps

main

Now we can define our main program. It processes command line arguments, -macros and filenames, feeding macros to pymacro and passing filenames to PyToHTML.

Invoked without arguments, print the __doc__ string as the usage message.

    def main(argv):
        "for command line invocation - gimme sys.argv[1:]"
        if not len(argv):
            print __doc__   # usage info
            return

        for arg in argv:
            if arg[0] == '-':   # command line - convert to @
                s = pymacro.pymax_process('@SetFlag{'+arg[1:]+'}')
                if s:           # report any output
                    print s
            else:
                PyToHTML(arg, arg+".html")
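
The argument handling can be looked at separately from the macro engine. The following Python 3 sketch only builds the list of actions main would take, in order; plan is a hypothetical helper, and nothing here calls pymacro or PyToHTML.

```python
def plan(argv):
    """Turn command line arguments into an ordered list of actions."""
    actions = []
    for arg in argv:
        if arg.startswith("-"):
            # an option becomes a SetFlag macro call
            actions.append(("macro", "@SetFlag{" + arg[1:] + "}"))
        else:
            # anything else is a file to convert: (input name, output name)
            actions.append(("convert", arg, arg + ".html"))
    return actions

print(plan(["-autosplit", "lpy.py"]))
```

Keeping the order matters: a flag only affects the files that come after it on the command line.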
PyToHTML is the function that actually does the 'work'...
    def PyToHTML(inputfilename, outputfilename):
        """Convert a Python file to an HTML file"""
Open Files
        input  = open(inputfilename, 'r')
        output = open(outputfilename, 'w')
Make sure Token class is ready, then Parse into Tokens
Parse calls us back and returns a list of our results
        Token_ClassInit()
        tokens = Parse(input, Token)
Write out the html prefix, with title
Write out all the Tokens as html
Write out the html suffix
Write out the cross reference
        output.write(html_prefix(inputfilename))
        if Token.autosplit:
            output.write(pymacro.pymax_process('@startsplit'))
        for t in tokens:
            output.write(t.asHTML())
        output.write(html_suffix())
        if Token.autosplit:
            output.write(pymacro.pymax_process('@endsplit'))
        if not pymacro.TestFlag('noindex'):
            output.write(Token.xref.asHTML())
What did we use there that we haven't defined yet?
    Parse comes from pytokens,
    html_prefix and html_suffix come from htmlmax (via pymacro)...
    which leaves class Token and its helper function Token_ClassInit.

Token class

Token_ClassInit

    def Token_ClassInit():
Create/re-initialize Token's class variables
        Token.previous = 0      # for tracking our linked list
        Token.indentlevel = 0   # for tracking syntax levels
        Token.parenlevel = 0
        Token.bracketlevel = 0
        Token.bracelevel = 0
        Token.afternewline = 1  # if we came right after a newline
        Token.firstonline = 1   # need to know if we are the first real item

        # options:
        Token.autoformat = pymacro.TestFlag('autoformat')
        Token.autosplit = pymacro.TestFlag('autosplit')
        Token.xref = XRef()     # symbol cross reference
Token is the class... Token is responsible for:
  • tracking each code element,
  • determining if it is 'special' and needs to be passed to the pymacro engine,
  • syntax coloring of STRING, COMMENT, and RESERVED words,
  • line numbers, and other HTML formatting.
    class Token:
        """ Token items are spewed out by the pytokens module """

Token.__init__

The constructor does most of the work.

It is abnormally large, and should be broken into sub functions... but it's quicker inline.

We track all of the info from tokenize, as well as maintaining a doubly linked list of the tokens for looking around. We're also tracking indentlevel, parenlevel, and bracketlevel which is important for understanding the structure of the Python code. We're only looking for 'special' strings if they are first on the line, which we don't know unless we know parenlevel etc.
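
The same bookkeeping can be sketched with the standard library's tokenize module under Python 3. This is an illustration under assumptions, not what pytokens does: top_level_strings, OPENERS and CLOSERS are hypothetical names, and stdlib tokenize has no WHITE token, so "first on the line" is approximated by resetting after NEWLINE, NL, INDENT and DEDENT.

```python
import io
import tokenize

OPENERS = {"(": "paren", "[": "bracket", "{": "brace"}
CLOSERS = {")": "paren", "]": "bracket", "}": "brace"}

def top_level_strings(source):
    """Yield (line, text) for strings that start a line outside any bracket."""
    depth = {"paren": 0, "bracket": 0, "brace": 0}
    first_on_line = True
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth[OPENERS[tok.string]] += 1
            elif tok.string in CLOSERS:
                depth[CLOSERS[tok.string]] -= 1
        if tok.type == tokenize.STRING and first_on_line and not any(depth.values()):
            yield tok.start[0], tok.string
        # line-structure tokens reset the flag, anything else clears it
        first_on_line = tok.type in (tokenize.NEWLINE, tokenize.NL,
                                     tokenize.INDENT, tokenize.DEDENT)

SOURCE = '''"~module doc"
x = foo(
    "~argument"
)
def f():
    "~function doc"
    return "~returned"
'''

print(list(top_level_strings(SOURCE)))
```

The "~argument" string is first on its physical line, but the open parenthesis disqualifies it, which is exactly why the levels have to be tracked.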

        def __init__(self, type, token, srow, scol, erow, ecol):
Pick up our token info
            self.type = type
            self.token= token
            self.line = srow
            # scol, erow, ecol, not in use

Stitch up our linked list
            self.next     = None
            self.previous = Token.previous
            if Token.previous:
                Token.previous.next = self
            Token.previous = self
Remember our level info
            self.indentlevel  = Token.indentlevel
            self.parenlevel   = Token.parenlevel
            self.bracketlevel = Token.bracketlevel
            self.bracelevel   = Token.bracelevel
Do all of the level tracking maintenance
            if type == INDENT:
                Token.indentlevel = Token.indentlevel + 1
            elif type == DEDENT:
                Token.indentlevel = Token.indentlevel - 1
            elif type == OP:
                if token == '(':
                    Token.parenlevel = Token.parenlevel + 1
                elif token == ')':
                    Token.parenlevel = Token.parenlevel - 1
                elif token == '[':
                    Token.bracketlevel = Token.bracketlevel + 1
                elif token == ']':
                    Token.bracketlevel = Token.bracketlevel - 1
                elif token == '{':
                    Token.bracelevel = Token.bracelevel + 1
                elif token == '}':
                    Token.bracelevel = Token.bracelevel - 1
Figure out if we're white space, for future reference
            self.isWhite = type in (WHITE, INDENT, DEDENT, NEWLINE, NL)
Worried about newline: isNewLine means we're a single literal new line character, but isAfterNewLine means previous item could have also been comment, etc.
            self.isAfterNewLine = Token.afternewline
            self.isNewLine = type == NL or type == NEWLINE
Set up afternewline for the Next token
            Token.afternewline = token and token[-1] == '\n'
Looking for first string on line as possible special comment
            self.isFirstString = 0
            if type == STRING and Token.firstonline:
                if not (self.parenlevel or self.bracketlevel or self.bracelevel):
                    self.isFirstString = 1
For automatic formatting without explicit ~, add it in, and preserve existing formatting by adding @N (<br>) directives
            if Token.autoformat:
                if Token.firstonline:
                    if self.isFirstString and len(token) > 2:
                        if token[0] == token[1]:
                            if token[3] <> '~':
                                token = token[0:3] + '~' + token[3:]
                        elif token[1] <> '~':
                            token = token[0] + '~' + token[1:]
                        token = string.replace(token, "\n", "@N\n")
                    elif type == COMMENT:
                        if token[1] <> '~' and token[1] <> '-':
                            token = '#~'+token[1:]
                        token = string.replace(token, "\n", "@N\n")
Set up firstonline for the Next token
            if Token.afternewline or type == DEDENT or type == INDENT:
                Token.firstonline = 1
            elif type <> WHITE:
                Token.firstonline = 0
Special defaults
            self.isSpecial = 0
            self.preprocessed = 0
Quick hack for division lines - beware of #- with other stuff
            if type == COMMENT:
                if len(token) > 1 and token[1] == '-':
                    self.isSpecial = "<hr>"
                    self.preprocessed = 1
                    self.token = ""
Look for specially marked boxes of your favorite serial
            # this isn't quite as brutal as it looks

            if type == COMMENT and len(token) > 2:
                if token[1] == '~':
                    self.isSpecial = token[2:]
                elif token[1] == '@':  # early v late execution
                    self.isSpecial = pymacro.pymax_process(token[1:])
                    self.preprocessed = 1
                    self.token = ""
            elif self.isFirstString and len(token) > 2:
                if token[1] == '~':
                    self.isSpecial = token[2:-1]
                elif token[0] == token[1] and token[3] == '~':
                    self.isSpecial = token[4:-3]
                elif (token ==
    '~this is a test'):    # this is a test to make sure we don't grab this
                    pass
Worrying about the extra blank lines.

If we look backward and ignore any white space (including INDENT/DEDENT/NEWLINE/NL etc.) and the first thing we find is also 'special', then tell him to forget trailing wrapping, and tell me to forget leading wrapping.

            if self.isSpecial:
                # drop leading space, just because
                if self.isSpecial[0] == ' ':
                    self.isSpecial = self.isSpecial[1:]

                self.leading = 1
                self.trailing = 1
                previous = self.previous
                while previous:
                    if previous.isWhite:
                        previous.type = WHITE
                        previous.token = '' # clear
                        previous = previous.previous
                    else:
                        if previous.isSpecial:
                            previous.trailing = 0
                            self.leading = 0
                        break
Also look forward to get rid of extra white space/new lines following special. Handled by looking back from newline instead.

            if self.isNewLine:  # look back for last special
                previous = self.previous
                while previous:
                    if previous.isWhite:
                        previous = previous.previous
                    else:
                        if previous.isSpecial:
                            self.token = ''
                            # walk back forward looking for white
                            previous = previous.next
                            while not previous is self:
                                if previous.type == WHITE:
                                    previous.token = ''
                                previous = previous.next
                        break
Maintain symbol cross reference for the summary Index
            if type == NAME:
                Token.xref(token, srow)
Trim the extra lines and white space at the end of the program
            if type == ENDMARKER:
                previous = self.previous
                while previous:
                    if previous.isWhite:
                        previous.type = WHITE
                        previous.token = ''
                        previous = previous.previous
                    else:
                        break

Token.asHTML

Render as HTML
Some oddities remain with special:
 1) Leading spaces are always present for indented strings.
 2) It isn't obvious when to apply autohtml and/or escapes4html:
    escaping is appropriate on original text but not on macro output;
    autohtml is appropriate on original text without expansions...
        def asHTML(self):
            """ render this Token asHTML """

            s = self.isSpecial # well, isn't that special?
            if s:
                if not self.preprocessed:
                    s = autohtml(pymacro.pymax_process(escapes4html(s)))
                if 0:   # doesn't work as well as hoped - dedents come too late
                    t = self.indentlevel + 1
                    if self.indentlevel:
                        s = '<ul>' * t + s + '</ul>' * t
                return html_wrap(self.leading, self.trailing, s)

            s = ""
            # put out the line numbers
            if (self.token or self.type==DEDENT) and self.isAfterNewLine:
                s = ('<a name="%d">%4d</a>    ' % (self.line, self.line))

            type = self.type
            token = self.token
            if type > RESERVED:
                type = RESERVED
            wrap = gWraps.get(type,None)

            # check for strings with embedded newlines
            if type == STRING and string.count(token, '\n') > 1:
                t = escapes4html(string.join(
                                   string.split(token, '\n'), '\n        '))
                if wrap:
                    t = wrap % t
                s = s + t + '\n'
            else:
                t = escapes4html(token)
                if wrap:
                    t = wrap % t
                s = s + t
            return s

XRef class

The XRef class does simple symbol tracking. It remembers every name/line combination submitted (tossing duplicates) in order to render an HTML-formatted Index listing. The indexes are too gassy by line number. Some close-enough value could be used for congregating lines. Section numbers might be a better plan.
    class XRef:
        """ XRef maintains a cross reference of symbols by line number """

        def __init__(self):
build the empty symbol table
            self.symbols = {}

        def __call__(self, token, line):
add a symbol reference
            symbols = self.symbols
            if symbols.has_key(token):
                if not symbols[token].count(line): # skip if already recorded for this line
                    symbols[token].append(line)
            else:
                symbols[token] = [line]

        def asHTML(self):
render the table as html
            refs = self.symbols
            # up to three columns, each containing a table of two columns
            r = '<hr><ul><table><tr><td valign=top><table>'
            nk = refs.keys()
            nk.sort()
            bp = (len(nk) + 2) / 3
            i = 0
            for n in nk:
                if i and i % bp == 0:
                    r = r + '</table></td><td valign=top><table>'
                i = i + 1
                r = '%s<tr><td valign=top class="xref">%s</td><td valign=top class="xref">' % (r, n)
                s = ''
                for l in refs[n]:
                    s = '%s <a href="#%d">%d</a> ' % (s, l, l)
                r = r + s + '</td></tr>\n'
            r = r + '</table></td></tr></table></ul>'
            return r
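
For readers on Python 3, the bookkeeping part of XRef can be sketched as below. This is an illustration, not a drop-in replacement: CrossReference, add and columns are hypothetical names, and the HTML rendering is left out. One thing worth knowing is that `(len(nk) + 2) / 3` yields a float under Python 3, so the column size here uses integer ceiling division instead.

```python
from collections import defaultdict

class CrossReference:
    """Remember each name/line pair once, and split names into columns."""

    def __init__(self):
        self.symbols = defaultdict(list)

    def add(self, name, line):
        lines = self.symbols[name]
        if line not in lines:       # toss duplicates on the same line
            lines.append(line)

    def columns(self, count=3):
        """Split the sorted names into at most `count` columns."""
        names = sorted(self.symbols)
        per_column = max(1, -(-len(names) // count))   # ceiling division
        return [names[i:i + per_column] for i in range(0, len(names), per_column)]

xref = CrossReference()
for name, line in [("main", 239), ("argv", 239), ("argv", 241), ("argv", 241),
                   ("Token", 347), ("XRef", 611)]:
    xref.add(name, line)
print(xref.symbols["argv"])
print(xref.columns())
```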

autohtml

autohtml tries to do some best guess formatting of the html. We'd like to have 'pretty good' formatting without too many directives. Two \n in a row is a paragraph; \n followed by white space is a line break with leading &nbsp; chars. Anything else useful I can think of?

    def autohtml(s):
        s = string.join(string.split(s, '\n\n'), '\n<p>\n')

        while 1:
            i = string.find(s, '\n ')
            if i >= 0:
                # beware of pseudo-blank lines and all space lines
                c = 1
                x = i + c + 1
                l = len(s)
                while x < l and s[x] == ' ':
                    c = c + 1
                    x = x + 1
                if i > 0 and s[i-1] == '>': # watch for <p><br>
                    s = s[0:i+1] + c * "&nbsp;" + s[i+c+1:]
                else:
                    s = s[0:i] + "<br>\n" +  c * "&nbsp;" + s[i+c+1:]
            else:
                break

        #s = string.join(string.split(s, '\n '), '\n<br>&nbsp;')
        return s
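
The two rules are easier to see in a compact form. Below is a Python 3 sketch that is meant to mirror the loop above using re.sub; treat it as an approximation to check against the original rather than a verified replacement.

```python
import re

def autohtml(text):
    """Blank line -> <p>; newline plus spaces -> <br> plus &nbsp; padding."""
    text = text.replace("\n\n", "\n<p>\n")

    def indent(match):
        pad = "&nbsp;" * len(match.group(1))
        # right after a tag (such as <p>) we skip the extra <br>
        after_tag = match.start() > 0 and text[match.start() - 1] == ">"
        return ("\n" if after_tag else "<br>\n") + pad

    return re.sub(r"\n( +)", indent, text)

print(autohtml("first\n\nsecond\n  indented"))
```

Here the blank line becomes a paragraph break, and the indented last line gets a line break followed by two non-breaking spaces.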

escapes4html

    def escapes4html(s):
        return string.replace(string.replace(s,'&','&amp;'),'<','&lt;')
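
On Python 3 the standard library covers this with html.escape. The sketch below is an alternative, not the code this tool uses; note one difference in behavior: html.escape also turns > into &gt;, which the function above leaves alone.

```python
import html

def escapes4html(text):
    # quote=False leaves " and ' untouched; &, < and > are escaped
    return html.escape(text, quote=False)

print(escapes4html('if a < b & c > d: print "ok"'))
```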

Kick Off

Pass command line arguments to main
    if __name__ == '__main__':
        main(sys.argv[1:])

Index