#! /usr/bin/env python

lpy.py - convert python source to formatted html
usage:
writes:
options:
__Id__ = "$Id: lpy.py.html,v 1.1 2005/01/26 01:13:52 u37519820 Exp $"

Introduction

lpy.py is a tool for code publishing inspired by Knuth's Literate Programming concepts. Its purpose is to make code presentable for reading. Practically speaking, it's yet another Python-to-HTML tool, with additions.

Other than regular Python colorizing, we're looking for some special stuff in comments and docstrings. A comment or string starting with ~ is pulled out and passed through a macro engine. The macro engine looks for the @ sign as a macro marker, using several syntaxes:
@symbol - can be a variable replacement or function call (no arguments)
The Python functions invoked can return a result string or print (to sys.stdout). The macro set has some simplified html macros, like @i{to italicize} to italicize, for example. HTML is simple and widely known, but it is verbose, and puts lots of noise in the input stream. The macros are intended to make the source tidier. See htmlmax for information about the macro set.

We're also looking for comments or strings starting with @, which maps to early instead of late execution. The ~ stuff is processed on the way out (late), and the @ on the way in (early), so the @ directives occurring later in the source can affect the behavior of the ~ directives occurring earlier in the source. The contents macros use this mechanism.

One other little thing - any comment starting with #- is replaced with <hr>.

Traditional LP involves writing a mixture of formatting and code, and then extracting the program and document from the source using separate programs ('tangle' and 'weave'). The term inverted literate programming describes producing the document from the actual program source code instead of from a separate source file. That's what this tool does. The concept of having to preprocess a Python script file in order to execute it does seem perverse.

There are other useful approaches to LP - eventually there could be a wysiwyg / hyperlinked / outlining / structured / browser coding-and-documenting environment that really understands the program, in order to help the programmer or reader make sense out of it. This tool is a lot less ambitious; it's simply meant to help me write better code. Writing readable programs is more fun (and more difficult and time-consuming).
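As a rough sketch of the comment markers described in the introduction (this is not lpy.py's own code; classify_comment is a made-up name, and it is written for Python 3 rather than the Python 2 of the listing), the three special comment forms could be told apart like this:

```python
def classify_comment(comment):
    """Classify a Python comment by the marker right after '#'.

    Returns a (kind, body) pair; kind is 'late', 'early', 'rule' or 'plain'.
    The names and return shape are illustrative assumptions only.
    """
    if len(comment) > 2 and comment[1] == "~":
        # '#~' text is handed to the macro engine on the way out (late)
        return "late", comment[2:]
    if len(comment) > 2 and comment[1] == "@":
        # '#@' text is expanded on the way in (early); the '@' is kept
        # because it is itself the macro marker
        return "early", comment[1:]
    if comment.startswith("#-"):
        # division line
        return "rule", "<hr>"
    return "plain", comment


print(classify_comment("#~ some @i{formatted} text"))
print(classify_comment("#@SetFlag{autoformat}"))
print(classify_comment("#--------"))
```

The real tool does more than this: it also looks at strings that are first on a line, and it runs the extracted text through pymacro.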
Order of presentation is important, and Python is pretty flexible, because stuff doesn't have to be defined until executed.
Imports should come first - using from module import *,
our symbols can get stomped on.
    import pymacro
    pymacro.load('htmlmax')
    from pymacro import html_prefix, html_suffix, html_wrap, gWraps

main

Now we can define our main program. It processes command line arguments, -macros and filenames, feeding macros to pymacro and passing filenames to PyToHTML.

Invoked without arguments, print the __doc__ string as the usage message.

    def main(argv):
        "for command line invocation - gimme sys.argv[1:]"
        if not len(argv):
            print __doc__   # usage info
            return

        for arg in argv:
            if arg[0] == '-':   # command line - convert to @
                s = pymacro.pymax_process('@SetFlag{'+arg[1:]+'}')
                if s:   # report any output
                    print s
            else:
                PyToHTML(arg, arg+".html")

Parse calls us back and returns a list of our results
Write out all the Tokens as html
Write out the html suffix
Write out the cross reference

        output.write(html_prefix(inputfilename))
        if Token.autosplit:
            output.write(pymacro.pymax_process('@startsplit'))
        for t in tokens:
            output.write(t.asHTML())
        output.write(html_suffix())
        if Token.autosplit:
            output.write(pymacro.pymax_process('@endsplit'))
        if not pymacro.TestFlag('noindex'):
            output.write(Token.xref.asHTML())

Parse comes from pytokens; html_prefix and html_suffix come from htmlmax (via pymacro)...

class Token and helper function Token_ClassInit

    def Token_ClassInit():
        Token.previous = 0          # for tracking our linked list
        Token.indentlevel = 0       # for tracking syntax levels
        Token.parenlevel = 0
        Token.bracketlevel = 0
        Token.bracelevel = 0
        Token.afternewline = 1      # if we came right after a newline
        Token.firstonline = 1       # need to know if we are the first real item

        # options:
        Token.autoformat = pymacro.TestFlag('autoformat')
        Token.autosplit = pymacro.TestFlag('autosplit')
        Token.xref = XRef()         # symbol cross reference
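Token_ClassInit keeps the parser state on the class itself, including the previous pointer used for the linked list of tokens. A minimal Python 3 sketch of that pattern follows. Tok and tok_class_init are stand-in names, not part of lpy.py, and the linking code is an assumption, since the original stitching code is not shown in this listing.

```python
class Tok:
    """Stand-in for Token: shared state lives on the class, not the instance."""

    previous = None  # most recently created token, shared by all instances

    def __init__(self, text):
        self.text = text
        self.previous = Tok.previous      # link back to the token before us
        self.next = None
        if self.previous is not None:
            self.previous.next = self     # and forward from it to us
        Tok.previous = self               # we are now the latest token


def tok_class_init():
    """Reset the shared state, in the spirit of Token_ClassInit."""
    Tok.previous = None


tok_class_init()
a = Tok("import")
b = Tok("pymacro")
print(b.previous.text, a.next.text)
```

Keeping the state on the class means every new token can find its neighbour without being handed it, at the cost of needing an explicit reset before a fresh run.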
Token.__init__

The constructor does most of the work. It is abnormally large, and should be broken into sub functions... but it's quicker inline.

We track all of the info from tokenize, as well as maintaining a doubly linked list of the tokens for looking around. We're also tracking indentlevel, parenlevel, and bracketlevel, which is important for understanding the structure of the Python code. We're only looking for 'special' strings if they are first on the line, which we don't know unless we know parenlevel etc.

        def __init__(self, type, token, srow, scol, erow, ecol):

Pick up our token info

Stitch up our linked list

Remember our level info
Do all of the level tracking maintenance

            if type == INDENT:
                Token.indentlevel = Token.indentlevel + 1
            elif type == DEDENT:
                Token.indentlevel = Token.indentlevel - 1
            elif type == OP:
                if token == '(':
                    Token.parenlevel = Token.parenlevel + 1
                elif token == ')':
                    Token.parenlevel = Token.parenlevel - 1
                elif token == '[':
                    Token.bracketlevel = Token.bracketlevel + 1
                elif token == ']':
                    Token.bracketlevel = Token.bracketlevel - 1
                elif token == '{':
                    Token.bracelevel = Token.bracelevel + 1
                elif token == '}':
                    Token.bracelevel = Token.bracelevel - 1
Figure out if we're white space, for future reference

            self.isWhite = type in (WHITE, INDENT, DEDENT, NEWLINE, NL)

Worried about newline: isNewLine means we're a single literal new line character, but isAfterNewLine means previous item could have also been comment, etc.

Set up afternewline for the next token

            Token.afternewline = token and token[-1] == '\n'

Looking for first string on line as possible special comment
For automatic formatting without explicit ~, add it in, and preserve existing formatting by adding @N (<br>) directives

            if Token.autoformat:
                if Token.firstonline:
                    if self.isFirstString and len(token) > 2:
                        if token[0] == token[1]:
                            if token[3] <> '~':
                                token = token[0:3] + '~' + token[3:]
                        elif token[1] <> '~':
                            token = token[0] + '~' + token[1:]
                        token = string.replace(token, "\n", "@N\n")
                    elif type == COMMENT:
                        if token[1] <> '~' and token[1] <> '-':
                            token = '#~'+token[1:]
                            token = string.replace(token, "\n", "@N\n")
Set up firstonline for the next token

Special defaults

Quick hack for division lines - beware of #- with other stuff

Look for specially marked boxes of your favorite serial

            # this isn't quite as brutal as it looks

            if type == COMMENT and len(token) > 2:
                if token[1] == '~':
                    self.isSpecial = token[2:]
                elif token[1] == '@':   # early v late execution
                    self.isSpecial = pymacro.pymax_process(token[1:])
                    self.preprocessed = 1
                    self.token = ""
            elif self.isFirstString and len(token) > 2:
                if token[1] == '~':
                    self.isSpecial = token[2:-1]
                elif token[0] == token[1] and token[3] == '~':
                    self.isSpecial = token[4:-3]
                elif (token ==
                      '~this is a test'):   # this is a test to make sure we don't grab this
                    pass
Worrying about the extra blank lines.

If we look backward and ignore any white space (including INDENT/DEDENT/NEWLINE/NL etc.) and the first thing we find is also 'special', then tell him to forget trailing wrapping, and tell me to forget leading wrapping.

            if self.isSpecial:
                # drop leading space, just because
                if self.isSpecial[0] == ' ':
                    self.isSpecial = self.isSpecial[1:]

                self.leading = 1
                self.trailing = 1
                previous = self.previous
                while previous:
                    if previous.isWhite:
                        previous.type = WHITE
                        previous.token = ''     # clear
                        previous = previous.previous
                    else:
                        if previous.isSpecial:
                            previous.trailing = 0
                            self.leading = 0
                        break
Also look forward to get rid of extra white space/new lines following special. Handled by looking back from newline instead.

            if self.isNewLine:  # look back for last special
                previous = self.previous
                while previous:
                    if previous.isWhite:
                        previous = previous.previous
                    else:
                        if previous.isSpecial:
                            self.token = ''
                            # walk back forward looking for white
                            previous = previous.next
                            while not previous is self:
                                if previous.type == WHITE:
                                    previous.token = ''
                                previous = previous.next
                        break
Maintain symbol cross reference for the summary Index

Trim the extra lines and white space at the end of the program
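The level tracking done in the constructor can be tried out on its own. The following is a small sketch, assuming Python 3 and the standard tokenize module rather than the pytokens wrapper used by lpy.py; track_levels and the dictionary layout are invented for the illustration.

```python
import io
import tokenize

# Map each opening/closing operator to the counter it affects.
OPENERS = {"(": "paren", "[": "bracket", "{": "brace"}
CLOSERS = {")": "paren", "]": "bracket", "}": "brace"}


def track_levels(source):
    """Return a list of (token_string, levels) pairs for the given source.

    levels is a snapshot of the nesting counters taken after the token
    has been accounted for.
    """
    levels = {"indent": 0, "paren": 0, "bracket": 0, "brace": 0}
    rows = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            levels["indent"] += 1
        elif tok.type == tokenize.DEDENT:
            levels["indent"] -= 1
        elif tok.type == tokenize.OP:
            if tok.string in OPENERS:
                levels[OPENERS[tok.string]] += 1
            elif tok.string in CLOSERS:
                levels[CLOSERS[tok.string]] -= 1
        rows.append((tok.string, dict(levels)))
    return rows


for text, levels in track_levels("def f(a):\n    return [a, {1: (2)}]\n"):
    print(repr(text), levels)
```

A string that appears while any of the paren, bracket or brace counters is non-zero is in the middle of an expression, which is why the constructor needs these counters before it can decide whether a string is first on its line.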
Token.asHTML

Render as HTML

Some oddities remain with special:
1) Leading spaces are always present for indented strings
2) It isn't obvious when, or if, to apply autohtml and/or escapes4html. escape is appropriate on original text but not on macro output; autohtml is appropriate on original text without expansions...

        def asHTML(self):
            """ render this Token asHTML """

            s = self.isSpecial  # well, isn't that special?
            if s:
                if not self.preprocessed:
                    s = autohtml(pymacro.pymax_process(escapes4html(s)))
                if 0:   # doesn't work as well as hoped - dedents come too late
                    t = self.indentlevel + 1
                    if self.indentlevel:
                        s = '<ul>' * t + s + '</ul>' * t
                return html_wrap(self.leading, self.trailing, s)

            s = ""
            # put out the line numbers
            if (self.token or self.type==DEDENT) and self.isAfterNewLine:
                s = ('<a name="%d">%4d</a> ' % (self.line, self.line))

            type = self.type
            token = self.token
            if type > RESERVED:
                type = RESERVED
            wrap = gWraps.get(type,None)

            # check for strings with embedded newlines
            if type == STRING and string.count(token, '\n') > 1:
                t = escapes4html(string.join(
                    string.split(token, '\n'), '\n '))
                if wrap:
                    t = wrap % t
                s = s + t + '\n'
            else:
                t = escapes4html(token)
                if wrap:
                    t = wrap % t
                s = s + t
            return s

XRef class

The XRef class does simple symbol tracking. It remembers every name/line combination submitted (tossing duplicates) in order to render an HTML-formatted Index listing. The indexes are too gassy by line number. Some close-enough value could be used for congregating lines.
Section numbers might be a better plan.

            refs = self.symbols
            # up to three columns, each containing a table of two columns
            r = '<hr><ul><table><tr><td valign=top><table>'
            nk = refs.keys()
            nk.sort()
            bp = (len(nk) + 2) / 3
            i = 0
            for n in nk:
                if i and i % bp == 0:
                    r = r + '</table></td><td valign=top><table>'
                i = i + 1
                r = '%s<tr><td valign=top class="xref">%s</td><td valign=top class="xref">' % (r, n)
                s = ''
                for l in refs[n]:
                    s = '%s <a href="#%d">%d</a> ' % (s, l, l)
                r = r + s + '</td></tr>\n'
            r = r + '</table></td></tr></table></ul>'
            return r

autohtml

autohtml tries to do some best guess formatting of the html. We'd like to have 'pretty good' formatting without too many directives.

Two \n in a row is a paragraph.
\n followed by white space is a line break with leading &nbsp; chars.

Anything else useful I can think of?
    def autohtml(s):
        s = string.join(string.split(s, '\n\n'), '\n<p>\n')

        while 1:
            i = string.find(s, '\n ')
            if i >= 0:
                # beware of pseudo-blank lines and all space lines
                c = 1
                x = i + c + 1
                l = len(s)
                while x < l and s[x] == ' ':
                    c = c + 1
                    x = x + 1
                if i > 0 and s[i-1] == '>':     # watch for <p><br>
                    s = s[0:i+1] + c * "&nbsp;" + s[i+c+1:]
                else:
                    s = s[0:i] + "<br>\n" + c * "&nbsp;" + s[i+c+1:]
            else:
                break

        #s = string.join(string.split(s, '\n '), '\n<br> ')
        return s

escapes4html

Kick Off

Pass command line arguments to main

Index
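The Index is what the XRef class produces. The bookkeeping it is described as doing (remember each name/line pair once, then list the lines per name) might look roughly like the Python 3 sketch below. XRefSketch, add and as_html are hypothetical names; only the symbols attribute appears in the original listing, and this version renders a single plain table instead of the multi-column layout.

```python
class XRefSketch:
    """Toy symbol cross reference: name -> list of line numbers."""

    def __init__(self):
        self.symbols = {}

    def add(self, name, line):
        """Remember that name was seen on line, tossing duplicates."""
        lines = self.symbols.setdefault(name, [])
        if line not in lines:
            lines.append(line)

    def as_html(self):
        """Render one table row per symbol, with a link per line number."""
        rows = []
        for name in sorted(self.symbols):
            links = " ".join(
                '<a href="#%d">%d</a>' % (line, line)
                for line in self.symbols[name]
            )
            rows.append("<tr><td>%s</td><td>%s</td></tr>" % (name, links))
        return "<table>\n" + "\n".join(rows) + "\n</table>"


xref = XRefSketch()
xref.add("main", 239)
xref.add("main", 239)   # duplicate, ignored
xref.add("main", 251)
xref.add("Token", 322)
print(xref.as_html())
```

The links point at the same kind of per-line anchors that asHTML writes next to each line number.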