Thursday, 22 May 2008

Scripting with Smalltalk - updated

My post yesterday attracted more attention than I expected, with Paolo Bonzini and Randal Schwartz both being able to make out the code well enough to comment on it. Paolo was able to identify a number of improvements to the code, both in terms of identifying more appropriate approaches, and in identifying where the code was spending its time. As a result, here's a much faster version. It's also noticeably shorter at 31 lines excluding spaces and comments, but it's still very wide.

This version also uses a few more methods not found in core Squeak including #fold: #gather: #copyReplaceFrom:to: .

As Randal pointed out, I'm monkey-patching core classes with gay abandon, and subclassing would probably be safer, though the over-ride of #at: that (rightly) alarmed him is now gone.


"Inspired by http://norvig.com/spell-correct.html"
"s = SpellCheck new. s initialize. s correct: 'misplet'"
Collection extend [
ifEmpty: block [ self isEmpty ifTrue: [ ^ block value]. ^ self ] "not in gst by default"
maxUsing: block [ ^ self fold: [ :a :b | ((block value: a) > (block value: b)) ifTrue: [ a ] ifFalse: [ b ] ] ] ]

String extend [
swapAt: i [ ^ self copyReplaceFrom: i+1 to: i+2 with: {self at: i+2. self at: i+1} ]
removeAt: i [ ^ self copyReplaceFrom: i+1 to: i+1 with: #() ]
insert: l at: i [ ^ self copyReplaceFrom: i+1 to:i with: {l} ]
replace: l at: i [ ^ (self copy) at: i put: l; yourself ]
findWords [ ^ (self asString tokenize: '[^a-zA-Z]+') collect: [ :each | each asLowercase ] ] ]

Object subclass: SpellCheck [
| nwords alphabet |
initialize [ | lines |
lines := (File name: 'westminster.txt') contents.
nwords := lines findWords asBag.
alphabet := 'abcdefhgijklmnopqrstuvwxyz' asArray ]

edits1: word [ | n |
n := word size.
^ Array join: {
0 to: (n - 2) collect: [ :i | word swapAt: i ].
1 to: (n - 1) collect: [ :i | word removeAt: i ].
alphabet gather: [ :letter | 1 to: n collect: [ :i | word replace: letter at: i ] ].
alphabet gather: [ :letter | 0 to: n collect: [ :i | word insert: letter at: i ] ] } ]

knownEdits2: word [
^ (self edits1: word) gather: [ :e1 |
self known: (self edits1: e1) ] ]

known: words [ ^ ( words select: [ :each | nwords includes: each ] ) ]

correct: word [ | candidates |
candidates := (self known: {word}) ifEmpty: [
(self known: (self edits1: word)) ifEmpty: [
(self knownEdits2: word) ifEmpty: [ {word} ] ] ].
^ candidates maxUsing: [ :d | nwords occurrencesOf: d ] ]
]

No comments: