Forth to HTML

Sunday, 2013-04-28

I keep thinking about writing a literate programming tool for Forth. It seems like a good fit. Peter Knaggs suggests that the major benefits of literate programming (in order of importance) are¹

Flexible order of elaboration
Automatic support for browsing
Typeset documentation

Modern languages often allow you to feed your programs to the compiler in whatever order you choose, and have good support for factoring out any piece you want. Similarly, modern IDEs generally provide good support for browsing -- jumping to the definition of a function, popping up the documentation for a function while you are writing code so you can see what the arguments are supposed to be, even some automatic refactoring operations. And they almost always provide syntax highlighting to display code in a nicely readable format. So in many cases the only real benefit of literate programming these days is to provide typesetting for mathematics and figures in the comments of your program.

But Forth lacks many of these advantages. It requires you to define words before using them, dictating the order of elaboration. And Forth enthusiasts mostly seem to use development environments which are abysmally lacking in features compared to those for other languages.

So here is a little program which marks up Forth source code with HTML syntax highlighting, with a stylesheet which displays it in pseudo-colorforth style. It is somewhat gforth specific, as I haved used evaluate-with and include-file-with (aliases of gforth's names execute-parsing and execute-parsing-file, which I don't like). I have copied the marked-up code into this HTML document and added cross-references by hand. Eventually I'd like to write a tool which does it automatically.

We use what is basically a Forth interpreter loop, only instead of interpreting the code, we simply mark it up and output it. We wrap up this loop in two words allowing us to easily mark up a file or a (single-line) string.

: markup-line ( "..." -- )
    begin parse-word dup while
        2dup find-markup execute
    repeat 2drop cr ;
: markup-input ( "..." -- )  begin markup-line refill 0= until ;

: markup-string ( c-addr u -- )  ['] markup-input evaluate-with ;
: markup ( "filename" -- )
    parse-word r/o open-file throw
    ['] markup-input include-file-with ;

We have redefined parse-word to echo the whitespace before the word, and avoid eating any whitespace after the word.

: parse-word ( "word" -- c-addr u )
    source >in @ /string dup >r
    2dup skip-ws 0 split type-html
    2dup scan-ws 0 split
    2swap nip r> swap - >in +! ;

type-html escapes ‘&’, ‘<’, and ‘>’. We start with a word which takes an xt which decides whether a character should be replaced with an encoded version. Then we use this to implement an HTML-encoding version and one which also encodes double quotes (for use in HTML attributes).

: type-encoded ( c-addr u xt -- )
    >r 2dup begin dup while over c@
        r@ execute if
            2>r  dup >r 2swap r> - type  2r> type
            1 /string 2dup
        else 1 /string then
    repeat 2drop type  r> drop ;

: entity? ( char -- false | c-addr u true )
    dup [char] & = if drop s" &amp;" true else
    dup [char] < = if drop s" &lt;" true else
    dup [char] > = if drop s" &gt;" true else
    dup 9 = if drop s"     " true else  \ tabs are 4 chars
    drop false then then then then ;
: type-html ( c-addr u -- )  ['] entity? type-encoded ;

: attr-encoded? ( char -- false | c-addr u true )
    dup [char] " = if drop s" %22" true else entity? then ;
: type-attr ( c-addr u -- )  ['] attr-encoded? type-encoded ;

Then we have a few words which output HTML tags.

: quoted ( c-addr u -- )  [char] " dup >r emit type-html r> emit ;
: attr ( c-addr u -- )
    [char] " dup >r emit type-attr r> emit ;
: span{ ( c-addr u -- )  ." <span class=" quoted ." >" ;
: named-span{ ( s.class s.name -- )
    ." <span id=" attr space ." class=" quoted ." >" ;
: }span ( -- )  ." </span>" ;
: .tagged ( s.text s.type -- )  span{ type-html }span ;

On top of the basic tag-generating words, we build others for each class of text—definitions, immediate words, literal values, and comments.

: .definition ( c-addr u -- )
    s" d" 2over named-span{ type-html }span ;
: .immediate ( c-addr u -- )  s" i" .tagged ;
: .literal ( c-addr u -- )  s" l" .tagged ;
: comment-span{ ( -- )  s" t" span{ ;

We want to define our markup words in a separate wordlist to avoid overwriting things. But different Forth systems handle naming differently—some put the new word in the wordlist in effect when you begin the definition, some (including gforth) put it in the wordlist which is current when you end the definition. So we want to switch the definitions wordlist for the whole definition and then restore it. These helper definitions make that easier...

variable restore-wid  false restore-wid !
: in-wordlist ( wid1 -- wid2 )
    get-current  true restore-wid !  swap set-current ;
: ?restore-current ( | wid -- )
    restore-wid dup @  false rot !  if set-current then ;
: ; ( C: colon-sys | wid colon-sys -- )
    postpone ;  ?restore-current ; immediate

Then our markup definition and lookup words are simple. Note that the lookup word returns type-html if no special behavior is found. Also, since we are not adding the markup wordlist to the search order, we define a word to bind to another markup definition.

wordlist constant markup-words

: markup: ( "name" -- wid colon-sys )
    markup-words in-wordlist  :  ;
: find-markup ( c-addr u -- xt )  ( xt: ... c-addr u -- ... )
    markup-words search-wordlist 0= if ['] type-html then ;
: [markup-for] ( "name" -- )
    parse-word find-markup compile, ; immediate

Now we can define all the markup. Control flow words are immediate.

markup: ahead .immediate ;
markup: if .immediate ;
markup: else .immediate ;
markup: then .immediate ;

markup: begin .immediate ;
markup: again .immediate ;
markup: until .immediate ;

markup: repeat .immediate ;
markup: while .immediate ;

markup: do .immediate ;
markup: loop .immediate ;
markup: +loop .immediate ;

There are a few words like postpone which parse the next word and mark it as a literal.

markup: postpone .immediate  parse-word .literal ;
markup: ['] [markup-for] postpone ;
markup: [char] [markup-for] postpone ;

Comments and string literals are similar, but with a different parsing (and of course different display classes).

: parse-line ( -- c-addr u )  source >in @ /string dup >in +! ;
markup: \
    comment-span{ type-html parse-line type-html }span ;
markup: (
    comment-span{ type-html
    [char] ) dup parse type-html emit }span ;

markup: .(
    type-html [char] ) parse .literal [char] ) emit ;

markup: s"
    type-html [char] " parse .literal [char] " emit ;
markup: ." [markup-for] s" ;

Colon definitions use a state variable.

variable markup-state  false markup-state !
markup: :
    markup-state @ if type-html else
        type-html parse-word .definition
        s" c" span{  true markup-state !
    then ;
markup: ;
    }span type-html  false markup-state ! ;

Defining words and normal (non-immediate) character and xt literals check this variable to see whether they should mark the next word as a literal.

markup: variable
    type-html  markup-state @ 0= if
        parse-word .definition
    then ;
markup: constant [markup-for] variable ;
markup: alias [markup-for] variable ;

markup: char
    type-html  markup-state @ 0= if
        parse-word .literal
    then ;
markup: ' [markup-for] char ;

And of course we want to highlight our markup defining words correctly.

markup: markup: [markup-for] : ;
markup: [markup-for] [markup-for] postpone ;

And...that's about it, except for the string-handling words we used to define our new parse-word.

\ split S at the start of S', removing N characters
: split  ( s s' n -- s'' s.first )
    over >r  over min /string 2swap  r> - ;

: scan-ws ( c-addr u -- c-addr' u' )
    begin dup while over c@ bl > while
        1 /string
    repeat then ;
: skip-ws ( c-addr u -- c-addr' u' )
    begin dup while over c@ bl <= while
        1 /string
    repeat then ;