lpy.py

   1    #! /usr/bin/env python

lpy.py - convert python source to formatted html

usage:
lpy [-option(s)] filename(s)

writes:
filename.html (beware!)

options:
    -autoformat = for processing python source with no formatting cues
    -autosplit = inserts split directives automatically
    -nosplits = disables split directives
    -noindex = disable automatic index
    -printing = uses smaller fonts

download lpy.zip

for more information,
contact:
Danbala Software

  24    __Id__ = "$Id: lpy.py.html,v 1.1 2005/01/26 01:13:52 u37519820 Exp $"

Introduction
Imports
main
Token class
XRef class
autohtml
escapes4html
Kick Off
Index

Introduction

lpy.py is a tool for code publishing inspired by Knuth's Literate Programming concepts. Its purpose is to make code presentable for reading.

Practically speaking, it's yet-another-Python-to-HTML tool, with additions.

Other than regular Python colorizing, we're looking for some special stuff in comments and docstrings. A comment or string starting with ~ is pulled out and passed through a macro engine. The macro engine looks for the @ sign as a macro marker, using several syntaxes:

    @symbol - can be a variable replacement or function call (no arguments)
    @symbol{string} - call a function (or % a string) with one literal string
    @symbol(args) - call a function with args
    @{any Python code} - execute any Python
    @(any Python expression) - evaluate any Python and display result

The Python functions invoked can return a result string or print (to sys.stdout).
See pymacro for more information about the macro engine.
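
The five syntaxes above can be told apart with one regular expression. What follows is only a rough Python 3 sketch for illustration - it is not how pymacro works inside, the names MACRO_RE and classify_macros are made up here, and nested braces or parentheses are not handled.

```python
import re

# Hypothetical pattern, not taken from pymacro. The bare @symbol form is
# tried last, with an optional {string} or (args) tail.
MACRO_RE = re.compile(r"""
    @\{(?P<exec>[^}]*)\}                 # @{any Python code}
  | @\((?P<eval>[^)]*)\)                 # @(any Python expression)
  | @(?P<name>[A-Za-z_]\w*)
      (?: \{(?P<text>[^}]*)\}            # @symbol{string}
        | \((?P<args>[^)]*)\) )?         # @symbol(args)
""", re.VERBOSE)


def classify_macros(source):
    """Return a list of (kind, payload) pairs, one per macro found."""
    found = []
    for m in MACRO_RE.finditer(source):
        if m.group("exec") is not None:
            found.append(("exec", m.group("exec")))
        elif m.group("eval") is not None:
            found.append(("eval", m.group("eval")))
        elif m.group("text") is not None:
            found.append(("call-string", (m.group("name"), m.group("text"))))
        elif m.group("args") is not None:
            found.append(("call-args", (m.group("name"), m.group("args"))))
        else:
            found.append(("symbol", m.group("name")))
    return found
```

A real engine would go on to look each symbol up and call or substitute it; this sketch stops at recognizing the forms.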

The macro set has some simplified html macros, like @i{to italicize} to italicize, for example. HTML is simple and widely known, but it is verbose and puts lots of noise in the input stream. The macros are intended to make the source tidier. See htmlmax for information about the macro set.

We're also looking for comments or strings starting with @, which maps to early instead of late execution. The ~ stuff is processed on the way out (late), and the @ stuff on the way in (early), so @ directives occurring later in the source can affect the behavior of ~ directives occurring earlier in the source. The contents macros use this mechanism.

One other little thing - any comment starting with #- is replaced with <hr>.
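
The comment markers boil down to a test on the character right after the #. Here is a small Python 3 sketch of that decision, following the length checks used in the Token constructor further down; the function name classify_comment and the labels it returns are made up for illustration.

```python
def classify_comment(comment):
    """Classify a Python comment token by the character after '#'.

    Returns one of 'rule', 'late', 'early', or 'plain'.
    """
    if len(comment) > 1 and comment[1] == "-":
        return "rule"       # '#-' is replaced with <hr>
    if len(comment) > 2 and comment[1] == "~":
        return "late"       # '#~' goes through the macro engine on output
    if len(comment) > 2 and comment[1] == "@":
        return "early"      # '#@' goes through the macro engine while parsing
    return "plain"          # ordinary comment, just colorized
```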

Traditional LP involves writing a mixture of formatting and code, and then extracting the program and document from the source using separate programs ('tangle' and 'weave').

The term inverted literate programming describes producing the document from the actual program source code instead of from a separate source file. That's what this tool does. The concept of having to preprocess a Python script file in order to execute does seem perverse.

There are other useful approaches to LP - eventually there could be a wysiwyg / hyperlinked / outlining / structured / browser coding-and-documenting environment that really understands the program, in order to help the programmer or reader make sense out of it.

This tool is a lot less ambitious: it's simply meant to help me write better code. Writing readable programs is more fun (and more difficult and time-consuming).

Order of presentation is important, and Python is pretty flexible, because stuff doesn't have to be defined until executed.

Imports should come first - when using from module import *, any of our symbols defined before the import can get stomped on.

Imports

The traditionally useful Python standard library items, sys and string stuff:

 199    import sys      # we're using sys.argv
 200    import string   # and some string manipulations

pytokens is based on tokenize from the standard library, but it adds a few useful items...

 205    from pytokens import *
 206    #Parse, WHITE, NEWLINE, NL, INDENT, DEDENT, OP, COMMENT, STRING, RESERVED

pymacro is the module responsible for our 'macro language facility'

 210    import pymacro

htmlmax is the module that contains the actual html macros we're using, so let's get pymacro to load it for us...

 215    pymacro.load('htmlmax')

htmlmax is used to separate the actual html formatting specs from this module, to make it easier to customize. Sneaky -> the stuff got loaded into pymacro by the line above. It could have been loaded separately to keep the pymacro name space clean, but it's handier in there, anyway.

 224    from pymacro import html_prefix, html_suffix, html_wrap, gWraps

main

Now we can define our main program. It processes command line arguments, -macros and filenames, feeding macros to pymacro and passing filenames to PyToHTML.

Invoked without arguments, print the __doc__ string as the usage message.

 239    def main(argv):
 240        "for command line invocation - gimme sys.argv[1:]"
 241        if not len(argv): 
 242            print __doc__   # usage info
 243            return
 244        
 245        for arg in argv:
 246            if arg[0] == '-':   # command line - convert to @
 247                s = pymacro.pymax_process('@SetFlag{'+arg[1:]+'}')
 248                if s:           # report any output
 249                    print s
 250            else:
 251                PyToHTML(arg, arg+".html")
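
Because the arguments are handled strictly in order, a flag only affects the files that come after it on the command line. A rough Python 3 sketch of that ordering, with nothing actually executed; the function name plan_actions and the tuple layout are made up for illustration.

```python
def plan_actions(argv):
    """Turn command line arguments into an ordered list of actions.

    '-flag' becomes a macro string to feed to the macro engine;
    anything else is a file to convert to '<name>.html'.
    """
    actions = []
    for arg in argv:
        if arg.startswith("-"):     # also safe for an empty argument
            actions.append(("macro", "@SetFlag{" + arg[1:] + "}"))
        else:
            actions.append(("convert", arg, arg + ".html"))
    return actions
```

With the arguments a.py -printing b.py, only b.py would be converted with the printing flag set.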

PyToHTML is the function that actually does the 'work'...

 255    def PyToHTML(inputfilename, outputfilename):
 256        """Convert a Python file to an HTML file"""

Open Files

 260        input  = open(inputfilename, 'r')
 261        output = open(outputfilename, 'w')

Make sure Token class is ready, then Parse into Tokens
Parse calls us back and returns a list of our results

 266        Token_ClassInit()
 267        tokens = Parse(input, Token)

Write out the html prefix, with title
Write out all the Tokens as html
Write out the html suffix
Write out the cross reference

 274        output.write(html_prefix(inputfilename))
 275        if Token.autosplit:
 276            output.write(pymacro.pymax_process('@startsplit'))
 277        for t in tokens:
 278            output.write(t.asHTML())    
 279        output.write(html_suffix())
 280        if Token.autosplit:
 281            output.write(pymacro.pymax_process('@endsplit'))
 282        if not pymacro.TestFlag('noindex'):
 283            output.write(Token.xref.asHTML())

What did we use there that we haven't defined yet?
    Parse comes from pytokens,
    html_prefix and html_suffix come from htmlmax (via pymacro)...

    class Token and helper function Token_ClassInit

Token class

Token_ClassInit

 322    def Token_ClassInit():

Create/re-initialize Token's class variables

 325        Token.previous = 0      # for tracking our linked list
 326        Token.indentlevel = 0   # for tracking syntax levels
 327        Token.parenlevel = 0
 328        Token.bracketlevel = 0    
 329        Token.bracelevel = 0    
 330        Token.afternewline = 1  # if we came right after a newline
 331        Token.firstonline = 1   # need to know if we are the first real item
 332    
 333        # options:                            
 334        Token.autoformat = pymacro.TestFlag('autoformat')
 335        Token.autosplit = pymacro.TestFlag('autosplit')
 336        Token.xref = XRef()     # symbol cross reference

Token is the class... Token is responsible for:

Tracking each code element,
determining if it is 'special' and needs to be passed to the pymacro engine
syntax coloring of STRING, COMMENT, and RESERVED words,
Line numbers, and other HTML formatting

 347    class Token:
 348        """ Token items are spewed out by the pytokens module """

Token.init

The constructor does most of the work. It is abnormally large, and should be broken into sub functions... but it's quicker inline. We track all of the info from tokenize, as well as maintaining a doubly linked list of the tokens for looking around. We're also tracking indentlevel, parenlevel, and bracketlevel which is important for understanding the structure of the Python code. We're only looking for 'special' strings if they are first on the line, which we don't know unless we know parenlevel etc.

 364        def __init__(self, type, token, srow, scol, erow, ecol):

Pick up our token info

 368            self.type = type
 369            self.token= token
 370            self.line = srow
 371            # scol, erow, ecol, not in use

Stitch up our linked list

 375            self.next = None
 376            self.previous = Token.previous
 377            if Token.previous:
 378                Token.previous.next = self
 379            Token.previous = self

Remember our level info

 382            self.indentlevel = Token.indentlevel
 383            self.parenlevel = Token.parenlevel
 384            self.bracketlevel = Token.bracketlevel
 385            self.bracelevel = Token.bracelevel

Do all of the level tracking maintenance

 388            if type == INDENT:
 389                Token.indentlevel = Token.indentlevel + 1
 390            elif type == DEDENT:
 391                Token.indentlevel = Token.indentlevel - 1
 392            elif type == OP:
 393                if token == '(':
 394                    Token.parenlevel = Token.parenlevel + 1
 395                elif token == ')':
 396                    Token.parenlevel = Token.parenlevel - 1
 397                elif token == '[':
 398                    Token.bracketlevel = Token.bracketlevel + 1
 399                elif token == ']':
 400                    Token.bracketlevel = Token.bracketlevel - 1
 401                elif token == '{':
 402                    Token.bracelevel = Token.bracelevel + 1
 403                elif token == '}':
 404                    Token.bracelevel = Token.bracelevel - 1

Figure out if we're white space, for future reference

 408            self.isWhite = type in (WHITE, INDENT, DEDENT, NEWLINE, NL)

Worried about newline: isNewLine means we're a single literal new line character, but isAfterNewLine means previous item could have also been comment, etc.

 414            self.isAfterNewLine = Token.afternewline
 415            self.isNewLine = type == NL or type == NEWLINE

Set up afternewline for the Next token

 418            Token.afternewline = token and token[-1] == '\n'

Looking for first string on line as possible special comment

 422            self.isFirstString = 0
 423            if type == STRING and Token.firstonline:
 424                if not (self.parenlevel or self.bracketlevel or self.bracelevel):
 425                    self.isFirstString = 1

For automatic formatting without explicit ~, add it in, and preserve existing formatting by adding @N (<br>) directives

 429            if Token.autoformat:
 430                if Token.firstonline:
 431                    if self.isFirstString and len(token) > 2:
 432                        if token[0] == token[1]:
 433                            if token[3] <> '~':
 434                                token = token[0:3] + '~' + token[3:]
 435                        elif token[1] <> '~':
 436                            token = token[0] + '~' + token[1:]
 437                        token = string.replace(token, "\n", "@N\n")
 438                    elif type == COMMENT:
 439                        if token[1] <> '~' and token[1] <> '-':
 440                            token = '#~'+token[1:]
 441                        token = string.replace(token, "\n", "@N\n")

Set up firstline for the Next token

 444            if Token.afternewline or type == DEDENT or type == INDENT:
 445                Token.firstonline = 1
 446            elif type <> WHITE:
 447                Token.firstonline = 0

Special defaults

 451            self.isSpecial = 0
 452            self.preprocessed = 0

Quick hack for division lines - beware of #- with other stuff

 456            if type == COMMENT:
 457                if len(token) > 1 and token[1] == '-':
 458                    self.isSpecial = "<hr>"
 459                    self.preprocessed = 1
 460                    self.token = ""

Look for specially marked boxes of your favorite serial

 463            # this isn't quite as brutal as it looks
 464
 465            if type == COMMENT and len(token) > 2:
 466                if token[1] == '~':
 467                    self.isSpecial = token[2:]
 468                elif token[1] == '@':   # early v late execution
 469                    self.isSpecial = pymacro.pymax_process(token[1:])
 470                    self.preprocessed = 1
 471                    self.token = ""
 472            elif self.isFirstString and len(token) > 2:
 473                if token[1] == '~':
 474                    self.isSpecial = token[2:-1]
 475                elif token[0] == token[1] and token[3] == '~':
 476                    self.isSpecial = token[4:-3]
 477                elif (token ==
 478                    '~this is a test'): # this is a test to make sure we don't grab this
 479                    pass

Worrying about the extra blank lines. If we look backward and ignore any white space (including INDENT/DEDENT/NEWLINE/NL etc.) and the first thing we find is also 'special', then tell him to forget trailing wrapping, and tell me to forget leading wrapping.

 491            if self.isSpecial:
 492                # drop leading space, just because
 493                if self.isSpecial[0] == ' ':
 494                    self.isSpecial = self.isSpecial[1:]
 495
 496                self.leading = 1
 497                self.trailing = 1
 498                previous = self.previous
 499                while previous:
 500                    if previous.isWhite:
 501                        previous.type = WHITE
 502                        previous.token = ''     # clear
 503                        previous = previous.previous
 504                    else:
 505                        if previous.isSpecial:
 506                            previous.trailing = 0
 507                            self.leading = 0
 508                        break

Also look forward to get rid of extra white space/new lines following special. Handled by looking back from newline instead.

 516            if self.isNewLine:  # look back for last special
 517                previous = self.previous
 518                while previous:
 519                    if previous.isWhite:
 520                        previous = previous.previous
 521                    else:
 522                        if previous.isSpecial:
 523                            self.token = ''
 524                            # walk back forward looking for white
 525                            previous = previous.next
 526                            while not previous is self:
 527                                if previous.type == WHITE:
 528                                    previous.token = ''
 529                                previous = previous.next
 530                        break

Maintain symbol cross reference for the summary Index

 534            if type == NAME:
 535                Token.xref(token, srow)

Trim the extra lines and white space at the end of the program

 539            if type == ENDMARKER:
 540                previous = self.previous
 541                while previous:
 542                    if previous.isWhite:
 543                        previous.type = WHITE
 544                        previous.token = ''
 545                        previous = previous.previous
 546                    else:
 547                        break
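
The level tracking and the first-string-on-line test can be tried out on their own. This is a Python 3 sketch that uses the standard library's tokenize module instead of pytokens (whose interface isn't shown here), so it lumps the three bracket counters into one depth; the function name first_strings is made up for illustration.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}
LINE_STARTERS = (tokenize.NEWLINE, tokenize.NL,
                 tokenize.INDENT, tokenize.DEDENT)


def first_strings(source):
    """Return (line, text) for each string token that is the first real
    item on its line and sits outside any (), [] or {}."""
    depth = 0               # combined paren/bracket/brace nesting level
    first_on_line = True
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth += 1
            elif tok.string in CLOSERS:
                depth -= 1
        if tok.type == tokenize.STRING and first_on_line and depth == 0:
            result.append((tok.start[0], tok.string))
        # set up first_on_line for the next token
        if tok.type in LINE_STARTERS:
            first_on_line = True
        elif tok.type != tokenize.COMMENT:
            first_on_line = False
    return result
```

A string that opens a continuation line inside parentheses is first on its physical line, but the depth check keeps it from being treated as special.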

Token.asHTML

Render as HTML
     Some oddities remain with special:
1) Leading spaces are always present for indented strings
2) Whether or not to autohtml and/or escapes4html isn't obvious when/if.
    escape is appropriate on original text but not on macro output;
    autohtml is appropriate on original text without expansions...

 561        def asHTML(self):
 562            """ render this Token asHTML """
 563    
 564            s = self.isSpecial # well, isn't that special?
 565            if s:
 566                if not self.preprocessed:
 567                    s = autohtml(pymacro.pymax_process(escapes4html(s)))
 568                if 0:   # doesn't work as well as hoped - dedents come too late
 569                    t = self.indentlevel + 1
 570                    if self.indentlevel:
 571                        s = '<ul>' * t + s + '</ul>' * t
 572                return html_wrap(self.leading, self.trailing, s)
 573                
 574            s = ""
 575            # put out the line numbers
 576            if (self.token or self.type==DEDENT) and self.isAfterNewLine:
 577                s = ('<a name="%d">%4d</a>    ' % (self.line, self.line))
 578    
 579            type = self.type
 580            token = self.token
 581            if type > RESERVED:
 582                type = RESERVED
 583            wrap = gWraps.get(type,None)
 584    
 585            # check for strings with embedded newlines
 586            if type == STRING and string.count(token, '\n') > 1:
 587                t = escapes4html(string.join(
 588                                   string.split(token, '\n'), '\n        '))
 589                if wrap:
 590                    t = wrap % t
 591                s = s + t + '\n'
 592            else:
 593                t = escapes4html(token)
 594                if wrap:
 595                    t = wrap % t
 596                s = s + t
 597            return s
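
The line-number anchors written out here are what the Index entries link back to. A tiny Python 3 sketch of the pairing, using the same format strings as the listing; the function names line_anchor and index_link are made up for illustration.

```python
def line_anchor(line):
    """Named anchor placed in front of the first token on a source line."""
    return '<a name="%d">%4d</a>    ' % (line, line)


def index_link(line):
    """Index entry that jumps to the anchor for the same line number."""
    return ' <a href="#%d">%d</a> ' % (line, line)
```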

XRef class

The XRef class does simple symbol tracking. It remembers every name/line combination submitted (tossing duplicates) in order to render an HTML-formatted Index listing. The indexes are too gassy by line number. Some close-enough value could be used for congregating lines. Section numbers might be a better plan.

 611    class XRef:
 612        """ XRef maintains a cross reference of symbols by line number """
 613        
 614        def __init__(self):

build the empty symbol table

 616            self.symbols = {}
 617                    
 618        def __call__(self, token, line):

add a symbol reference

 620            symbols = self.symbols
 621            if symbols.has_key(token):
 622                if not symbols[token].count(line): # already on this line
 623                    symbols[token].append(line)
 624            else:
 625                symbols[token] = [line]
 626            
 627        def asHTML(self):

render the table as html

 629            refs = self.symbols
 630            # three columns, each containing a table of two columns
 631            r = '<hr><ul><table><tr><td valign=top><table>'
 632            nk = refs.keys()
 633            nk.sort()
 634            bp = (len(nk) + 2) / 3
 635            i = 0
 636            for n in nk:
 637                if i and i % bp == 0:
 638                    r = r + '</table></td><td valign=top><table>'
 639                i = i + 1
 640                r = '%s<tr><td valign=top class="xref">%s</td><td valign=top class="xref">' % (r, n)
 641                s = ''
 642                for l in refs[n]:
 643                    s = '%s <a href="#%d">%d</a> ' % (s, l, l)
 644                r = r + s + '</td></tr>\n'
 645            r = r + '</table></td></tr></table></ul>'
 646            return r
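
For anyone trying this on Python 3, where has_key is gone and keys() no longer returns a sortable list, here is a sketch of the same bookkeeping and column split. It leaves the HTML out; the class name XRefSketch and the columns method are made up for illustration.

```python
class XRefSketch:
    """Collect the line numbers on which each symbol appears."""

    def __init__(self):
        self.symbols = {}

    def __call__(self, name, line):
        lines = self.symbols.setdefault(name, [])
        if line not in lines:       # toss duplicates on the same line
            lines.append(line)

    def columns(self, count=3):
        """Split the sorted symbol names into at most `count` columns."""
        names = sorted(self.symbols)
        per_column = (len(names) + count - 1) // count  # ceiling division
        if per_column == 0:
            return []
        return [names[i:i + per_column]
                for i in range(0, len(names), per_column)]
```

With count=3 the column length works out the same as the (len(nk) + 2) / 3 break point in the listing.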

autohtml

autohtml tries to do some best-guess formatting of the html. We'd like to have 'pretty good' formatting without too many directives. Two \n in a row is a paragraph; \n followed by white space is a line break with leading &nbsp; chars. Anything else useful I can think of?

 658    def autohtml(s):
 659        s = string.join(string.split(s, '\n\n'), '\n<p>\n')
 660        
 661        while 1:
 662            i = string.find(s, '\n ')
 663            if i >= 0:
 664                # beware of pseudo-blank lines and all space lines
 665                c = 1
 666                x = i + c + 1
 667                l = len(s)
 668                while x < l and s[x] == ' ':
 669                    c = c + 1
 670                    x = x + 1
 671                if i > 0 and s[i-1] == '>': # watch for <p><br>
 672                    s = s[0:i+1] + c * "&nbsp;" + s[i+c+1:]
 673                else:
 674                    s = s[0:i] + "<br>\n" +  c * "&nbsp;" + s[i+c+1:]
 675            else:
 676                break
 677             
 678        #s = string.join(string.split(s, '\n '), '\n<br>&nbsp;')
 679        return s
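
To see what autohtml does to a few inputs, here is a straight Python 3 transcription (string methods instead of the string module) with some small cases worked out below it. The name autohtml_py3 is made up to keep it apart from the listing.

```python
def autohtml_py3(s):
    """Blank line -> <p>; newline plus spaces -> <br> and &nbsp; runs."""
    s = "\n<p>\n".join(s.split("\n\n"))
    while True:
        i = s.find("\n ")
        if i < 0:
            break
        # count the run of spaces that follows the newline
        c = 1
        x = i + c + 1
        while x < len(s) and s[x] == " ":
            c += 1
            x += 1
        if i > 0 and s[i - 1] == ">":   # right after a tag: no extra <br>
            s = s[:i + 1] + c * "&nbsp;" + s[i + c + 1:]
        else:
            s = s[:i] + "<br>\n" + c * "&nbsp;" + s[i + c + 1:]
    return s


print(repr(autohtml_py3("one\n\ntwo")))   # 'one\n<p>\ntwo'
print(repr(autohtml_py3("a\n  b")))       # 'a<br>\n&nbsp;&nbsp;b'
print(repr(autohtml_py3("x\n\n y")))      # 'x\n<p>\n&nbsp;y'
```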

escapes4html

 683    def escapes4html(s):
 684        return string.replace(string.replace(s,'&','&amp;'),'<','&lt;')
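
The order of the two replacements matters: & has to go first, or the & in a freshly made &lt; gets escaped a second time. A short Python 3 sketch showing the difference, next to the standard library's html.escape (which also escapes >). The function escapes_wrong_order is deliberately broken and exists only for the comparison.

```python
import html


def escapes4html(s):
    # '&' first, then '<'
    return s.replace("&", "&amp;").replace("<", "&lt;")


def escapes_wrong_order(s):
    # deliberately wrong: '<' first, so its '&' is escaped again
    return s.replace("<", "&lt;").replace("&", "&amp;")


print(escapes4html("a < b & c"))              # a &lt; b &amp; c
print(escapes_wrong_order("a < b & c"))       # a &amp;lt; b &amp; c
print(html.escape("a < b > c", quote=False))  # a &lt; b &gt; c
```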

Kick Off

Pass command line arguments to main

 733    if __name__ == '__main__':  
 734        main(sys.argv[1:])

Index

COMMENT	438 456 465
DEDENT	390 408 444 576
ENDMARKER	539
INDENT	388 408 444
NAME	534
NEWLINE	408 415
NL	408 415
None	375 583
OP	392
Parse	267
PyToHTML	251 255
RESERVED	581 582
STRING	423 586
TestFlag	282 334 335
Token	267 275 280 283 325 326 327 328 329 330 331 334 335 336 347 376 377 378 379 382 383 384 385 389 391 394 396 398 400 402 404 414 418 423 429 430 444 445 447 535
Token_ClassInit	266 322
WHITE	408 446 501 527 543
XRef	336 611
__Id__	24
__call__	618
__doc__	242
__init__	364 614
__name__	733
afternewline	330 414 418 444
append	623
arg	245 246 247 251
argv	239 241 245 734
asHTML	278 283 561 627
autoformat	334 429
autohtml	567 658
autosplit	275 280 335
bp	634 637

bracelevel	329 385 402 404 424
bracketlevel	328 384 398 400 424
c	665 666 669 672 674
count	586 622
ecol	364
erow	364
escapes4html	567 587 593 683
find	662
firstonline	331 423 430 445 447
gWraps	224 583
get	583
has_key	621
html_prefix	224 274
html_suffix	224 279
html_wrap	224 572
i	635 637 639 662 663 666 671 672 674
indentlevel	326 382 389 391 569 570
input	260 267
inputfilename	255 260 274
isAfterNewLine	414 576
isFirstString	422 425 431 472
isNewLine	415 516
isSpecial	451 458 467 469 474 476 491 493 494 505 522 564
isWhite	408 500 519 542
join	587 659
keys	632
l	642 643 667 668
leading	496 507 572
len	241 431 457 465 472 634 667
line	370 577 618 622 623 625
load	215
main	239 734

n	636 640 642
next	375 378 525 529
nk	632 633 634 636
open	260 261
output	261 274 276 278 279 281 283
outputfilename	255 261
parenlevel	327 383 394 396 424
preprocessed	452 459 470 566
previous	325 376 377 378 379 498 499 500 501 502 503 505 506 517 518 519 520 522 525 526 527 528 529 540 541 542 543 544 545
pymacro	210 215 224 247 276 281 282 334 335 469 567
pymax_process	247 276 281 469 567
pytokens	205
r	631 638 640 644 645 646
refs	629 632 642
replace	437 441 684
s	247 248 249 564 565 567 571 572 574 577 591 596 597 641 643 644 658 659 662 667 668 671 672 674 679 683 684
scol	364
sort	633
split	588 659
srow	364 370 535
string	200 437 441 586 587 588 659 662 684
symbols	616 620 621 622 623 625 629
sys	199 734
t	277 278 569 571 587 590 591 593 595 596
token	364 369 393 395 397 399 401 403 418 431 432 433 434 435 436 437 439 440 441 457 460 465 466 467 468 469 471 472 473 474 475 476 477 502 523 528 535 544 576 580 586 588 593 618 621 622 623 625
tokens	267 277
trailing	497 506 572
type	364 368 388 390 392 408 415 423 438 444 446 456 465 501 527 534 539 543 576 579 581 582 583 586
wrap	583 589 590 594 595
write	274 276 278 279 281 283
x	666 668 670
xref	283 336 535
