Wednesday 21 May 2008

Scripting with Smalltalk

When I saw this post by Peter Norvig, and especially the lines-of-code comparison towards the end of the page, I thought it would be interesting to see how Smalltalk compared. Given the line-of-code metric, this was an ideal chance for me to play with GNU Smalltalk, which positions itself as a scripting-friendly Smalltalk. It proved to be quite a straightforward exercise, and the code came out surprisingly compactly (but quite wide!):

"Inspired by http://norvig.com/spell-correct.html"
"s = SpellCheck new. s initialize. s correct: 'misplet'"
Dictionary extend [
at: key [ ^ self at: key ifAbsent: [ 1 ] ] "make it a defaulting at:"
incrAt: key [ self at: key put: ((self at: key) + 1) ] "increments value" ]

Collection extend [
ifEmpty: block [ self isEmpty ifTrue: [ ^ block value]. ^ self ] "not in gst by default"
bestUsing: d [ ^ self inject: 'xxx' into: [ :sofar :this | ((d at: sofar) > (d at: this)) ifTrue: [ sofar ] ifFalse: [ this ] ] ] ]

String extend [
swapAt: i [ ^ (self copyFrom: 1 to: i), (self at: (i+2)) asString, (self at: (i+1)) asString, (self copyFrom: (i+3) to: self size) ]
removeAt: i [ ^ (self copyFrom: 1 to: i), (self copyFrom: (i+2) to: self size) ]
insert: l at: i [ ^ (self copyFrom: 1 to: i), l asString, (self copyFrom: (i+1) to: self size) ]
replace: l at: i [ ^ (self copy) at: i put: l; yourself ]
findWords [ ^ (self asString tokenize: '[^a-zA-Z]+') collect: [ :each | each asLowercase ] ] ]

Object subclass: SpellCheck [
| nwords alphabet |
initialize [ | lines |
lines := (File name: 'bigtext.txt') contents.
nwords := self train: lines findWords.
alphabet := 'abcdefhgijklmnopqrstuvwxyz' ]

train: features [ | model |
model := Dictionary new.
features do: [ :each | model incrAt: each asLowercase ].
^ model ]

edits1: word [ | s n |
s := Set new.
n := word size.
0 to: (n - 2) do: [ :i | s add: (word swapAt: i) ].
1 to: (n - 1) do: [ :i | s add: (word removeAt: i) ].
alphabet do: [ :letter |
1 to: n do: [ :i | s add: (word replace: letter at: i) ].
0 to: n do: [ :i | s add: (word insert: letter at: i) ] ].
^ s asArray ]

known_edits2: word [ | s |
s := Set new.
(self edits1: word) do: [ :e1 |
(self edits1: e1) do: [ :e2 |
(nwords keys includes: e2) ifTrue: [ s add: e2 ] ] ].
^ s asArray ]

known: words [ ^ ( words select: [ :each | nwords keys includes: each ] ) asSet asArray ]

correct: word [ | candidates |
candidates := (self known: {word}) ifEmpty: [
(self known: (self edits1: word)) ifEmpty: [
(self known_edits2: word) ifEmpty: [ {word} ] ] ].
^ candidates bestUsing: nwords ]
]

54 lines of not very idiomatic Smalltalk including blank lines and comments (such as they are). More than twice as many lines as Peter Norvig's Python solution, but 1/6th of the size of the Java solution!

It felt very strange - and frustrating - to be developing in Smalltalk without the full support of the traditional environment, especially as I was trying to pick up the subtle differences from Squeak Smalltalk, but the ability to 'extend' the core classes from within the scripts means that it will be very easy to be able to hack out scripts using this tool.

6 comments:

Paolo Bonzini said...

(I've placed this also at http://smalltalk.gnu.org/blog/ with the code formatted a little better).

Interesting! And it's actually quite idiomatic, though by using just a couple more tricks you can decrease the number of lines.

First, you can use the Bag class to count occurrences. It does internally exactly the same thing that you do with Dictionary. You do not need "train:" anymore, and just write "nwords := lines findWords asBag". Also, just write "nwords includes:" instead of "nwords keys includes:" (BTW, that's slow and it would have been better to write "nwords includesKey:" when using a dictionary). Down to 54-4-5=45 lines.

Returning Sets is perfectly fine from edits1: and known_edits2: (by the way, underscores are rarely used for identifiers as you probably know). By returning Sets, you can also omit the "asSet asArray" conversion in known:.

If you have a new version of Smalltalk (>= 3.0.2), you can rewrite these two like this:

edits1: word [
| n |
n := word size.
^Set join: {
0 to: n - 2 collect: [ :i | word swapAt: i.
1 to: n - 1 collect: [ :i | word removeAt: i ].
alphabet gather: [ :letter | 1 to: n collect: [ :i | word replace: letter at: i ] ].
alphabet gather: [ :letter | 0 to: n collect: [ :i | word insert: letter at: i ] ] }
]

known_edits2: word [
^(self edits1: word) gather: [ :e1 | self known: (self edits1: e1) ] ]

For this change to work, you should also add asArray to the definition of alphabet. This brings it down to 39 lines.

I'll also point out that you can define swapAt:, removeAt:, insertAt: using a powerful method, copyReplaceFrom:to:with:

swapAt: i [ ^self copyReplaceFrom: i+1 to: i+2 with: {self at: i+2. self at: i+1} ]
removeAt: i [ ^self copyReplaceFrom: i+1 to: i+1 with: #() ]
insert: l at: i [ ^self copyReplaceFrom: i+1 to:i with: {l} ]

I recommed against inlining these, but you can. Since these are the less readable lines in Mike's Python code, it is fair to outline them in Python too and add 5 lines to Mike's code, bringing the final comparison 26-39.

Also, to improve the readability of "bestUsing:", you can use "fold:" instead of "inject:into:". It uses the first element to start the iteration, and the block is invoked the first time with the first two elements. Actually, #ifEmpty: and the equivalent of Python's max-accepting-a-lambda could be added to the Smalltalk standard library, actually, which would make 26-35. The extra lines are basically lines like "Foo extend", class declarations (the Python code is not embedded in a class, you cannot do that in Smalltalk; passing the filename as a parameter would require additional lines in Python but not in Smalltalk), variable declarations.

The only method that really comes up longer is #correct:, where Python's idiomatic "or" construct allows very short code.

The readability of the Smalltalk code is probably debatable (it's wide!), but it's a matter of being used to one style or the other.

Randal L. Schwartz said...

I may be misunderstanding GST syntax, but wouldn't the following code:

Dictionary extend [
at: key [ ^ self at: key ifAbsent: [ 1 ] ] "make it a defaulting at:"

influence the way that every user of Dictionary #at: gets values? It seems like monkey-patching Dictionary in that way would be a bit devastating. Perhaps you should have subclassed Dictionary to make DefaultingDictionary or something.

Michael Davies said...

Thanks for the feedback Paolo - it's interesting how being too close to the solution can stop you seeing the bigger picture - I was so pleased to have worked out how to extend classes to replicate the Python functionality that I totally forgot about Bags!

join: fold: and gather: are such useful functional methods I wonder why they never made it into the standard Collections hierarchy in Squeak.

I'll put the updated code into a separate post for reference.

Paolo Bonzini said...

Actually, I also have worked a little more and got it to a decent speed.

First, sets turn out to be useless. Using Array join: {...} is much faster!

Second, my suggestion actually causes some bad GC problems because the Array version of the wordlist remains in memory as an old object pointing to many new objects (the words themselves). Again if you have a new-enough GNU Smalltalk there is a method coming to the rescue:

findWords [ | b |
b := Bag new.
self allOccurrencesOfRegex: '[a-zA-Z]+' do: [ :s | b add: s match ].
^b ]

Thanks for the nice benchmark again!

Paolo Bonzini said...

@Randal: indeed he's monkeypatching quite heavily! :-P That's also why a Bag is better in this case.

Michael Davies said...

Randal - That's a very good point - it's definitely not recommended for production use. Fortunately, the updated version no longer needs that code.