Tuesday, October 18, 2016

Go Forth with Arduino

Forth is an unusual programming language. To learn it, "you must unlearn what you have learned", as Master Yoda would say. There are many indications that Forth is a programming language of the Jedi: it uses postfix notation for expressions (so "a + b" becomes "a b +", or in Yoda's words, "to sum receive, a and b you must add"), it is extremely minimalistic (in most Forths, the language core is written in assembly, and the rest of the language constructs, including conditionals, branches and loops, are written in Forth itself), and it requires long study to understand (even people who use Forth to build commercial software sometimes admit that they have yet to fully understand it). Also, although Forth has an official ANSI standard, it is so flexible that most Forth masters tend to build their own light sabres (I mean Forth systems), using the base Forth only as a foundation. Things such as MANX Musical Forth (an extension of regular Forth designed specifically to work with MIDI) or J1 Forth (an FPGA implementation of a stack-based CPU) are nothing unusual in the world of Forth. Forth is also complete: it's an operating system, a runtime environment, and an interactive compiler, all in just a few kilobytes (not megabytes!) of code.

Forth had its best days in the early computer era, since it is extremely well suited to machines with very limited resources. Some 8-bit home computers, like the British Jupiter ACE or the French Hector HRX, used Forth as their operating system. More recently, it has been used successfully in the XO-1 laptop's firmware.

Having said all that, it is no wonder that there is a Forth for Arduino. No wonder that there is more than one. No wonder that most Forths for Arduino get rid of the bootloader and take full control of the hardware. I decided to try some of them and describe my very short and very subjective experience here.

First, you need an Arduino Uno or an Arduino Nano with an ATmega328 CPU, a programmer (I recommend USBasp since it's inexpensive and easy to use) and software that allows programming the Arduino board directly. On Ubuntu Linux, you can install it with the following command:
sudo apt-get install gcc-avr binutils-avr avr-libc gdb-avr avrdude
For Mac OS X use Homebrew:
brew tap osx-cross/avr
brew install avr-libc
brew install avrdude --with-usb
On Windows just download and install Atmel Studio.

Connect the USBasp to the ISP pins of the Arduino and plug it into a USB port on your computer. You can now upload software straight to the Atmel chip using the avrdude command line tool. It is important that you understand what this tool does before you start fiddling with command line switches, because you can brick the board if you misuse them (this applies primarily to hfuse). I recommend reading Martyn Currey's Arduino / ATmega 328P fuse settings if you want to use settings other than those provided in this article.

AmForth 6.3

Let's start with AmForth. Download the AmForth distribution archive, and extract appl/arduino/uno.hex and appl/arduino/uno.eep.hex files. They contain compiled binaries in the form of human readable Intel HEX format. Connect the programmer and run the following command (the whole command should be a single line):
avrdude -p m328p -c usbasp -U flash:w:uno.hex -U eeprom:w:uno.eep.hex -U efuse:w:0xfd:m -U hfuse:w:0xd9:m -U lfuse:w:0xff:m -v
After a while, AmForth will be uploaded to the Arduino, and the board will reset. Now, connect the Arduino to your computer using a mini USB cable and open a terminal program with the following parameters: baud rate 38400, 8 data bits, no parity, and 1 stop bit. On Windows, you can use PuTTY: just select the connection type "serial" and a suitable COM port. On Mac OS X and Linux you can use screen, but the connection port depends on the chip your board uses for serial communication. If you have an original Arduino, or a more expensive clone which uses an FTDI chip, the command will look similar to:
screen /dev/ttyACM0 38400
for Linux, and
screen /dev/tty.usbmodem1421 38400
for Mac OS X. In your case the device name may differ, depending on which port the Arduino is connected to, but it should follow the general pattern of /dev/ttyACM... for Linux and /dev/tty.usbmodem... for Mac.
With cheaper Arduino clones, using CH340 chip for serial communication, the command will look like:
screen /dev/ttyUSB0 38400
for Linux, and something similar to:
screen /dev/cu.wch\ ch341\ USB\=\>RS232\ 1420 38400
on Mac OS X. To disconnect from screen, use the following key combination: ctrl + a, ctrl + backslash, enter.

If everything goes well, you can start using Forth on your Arduino. For example, you can enter your first program, which calculates the greatest common divisor of two numbers:
: gcd ( a b -- gcd )
  begin
    dup
    while
      swap over mod
  repeat
  drop ;
Forth instructions are called words and are stored in a dictionary. The first line defines the word gcd (a colon begins a word definition), and contains a comment (in parentheses) which says that the word expects two values as input (a and b) and produces one value as output (gcd). The names used in the comment can be anything, since all values in Forth are put on, and taken from, an anonymous data stack.
The words begin and repeat denote a loop. Within the loop, the value on the top of the stack is duplicated. The word "while" takes a value from the top of the stack and checks whether it is false (zero) or true (any value other than zero). If it's true, the loop continues. Because "while" consumes the value it takes from the stack, we need the word "dup" before it. Otherwise, "while" would eat up all the values from the stack.
Next we swap the top two values on the stack and replicate the lower one to the top. Basically, we take "a b" series and create "b a b" from it. The word "mod" takes two values from the stack, divides them, and puts back the remainder. Now the stack looks like this: "b remainder". The loop is repeated, so the remainder is duplicated and evaluated. If it's not zero, the loop continues, otherwise it is removed from the stack (by the word "drop") and the value remaining on the stack is the final result. Semicolon ends the word definition.
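The stack juggling described above can be followed step by step in a small model. This is only a sketch in Python, not Forth: the list `stack` stands in for the data stack, the function name `forth_gcd` is made up for illustration, and it assumes non-negative inputs (Python's `%` and Forth's `mod` can disagree for negative numbers).

```python
def forth_gcd(a, b, trace=False):
    """Model of: begin dup while swap over mod repeat drop"""
    stack = [a, b]
    while True:
        stack.append(stack[-1])                      # dup
        if stack.pop() == 0:                         # while: consumes the copy, leaves loop on zero
            break
        stack[-1], stack[-2] = stack[-2], stack[-1]  # swap: ( a b -- b a )
        stack.append(stack[-2])                      # over: ( b a -- b a b )
        divisor = stack.pop()
        dividend = stack.pop()
        stack.append(dividend % divisor)             # mod:  ( b a b -- b remainder )
        if trace:
            print(stack)                             # stack contents after each pass
    stack.pop()                                      # drop the zero remainder
    return stack.pop()                               # the value left on the stack

print(forth_gcd(15, 25, trace=True))
```

With `trace=True` each printed list is the "b remainder" pair left at the end of one pass through the loop.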

To test how it works, input the following command:
15 25 gcd .
It puts 15 and 25 on the stack, executes the "gcd" word and prints (dot means "print on the screen") the top value from the stack, which is our result: 5.

I wanted to know how fast AmForth is, so I wrote a simple benchmark, which calculates the greatest common divisor for all pairs of numbers from 0 to n - 1:
: bench ( n -- )
  dup 0 do
    dup 0 do
      j i gcd
      drop
    loop
  loop
  drop ;
In Forth, "do/loop" is a counted loop which needs two values on the stack: the limit first, then the starting index. With "10 0 do ... loop" the loop body runs ten times, with the index counting up from 0 to 9 (the limit itself is excluded). To execute the loop n times (as you see in the comment, "n" is expected to be on the stack when you run "bench") you need to duplicate it with "dup", then put 0 on the stack, and then call "do", which consumes those two values. But because "n" was duplicated, a copy is still on the stack and can be used for the inner loop. Finally, we calculate the greatest common divisor of the current counters of the outer and the inner loop ("j" and "i"), but since we don't need the result and don't want it to remain on the stack and affect the loops, we need to "drop" it.
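To see how much work the benchmark does, here is a rough Python model of the nested loops. It assumes standard "do/loop" behaviour (the index starts at the second value and stops just before the limit) and a limit greater than the start; the helper names `do_loop` and `bench_pairs` are invented for this sketch.

```python
def do_loop(limit, start):
    """Index values produced by 'limit start do ... loop' (limit excluded)."""
    return range(start, limit)

def bench_pairs(n):
    """The (j, i) pairs handed to gcd by the nested do/loop version of bench."""
    return [(j, i) for j in do_loop(n, 0) for i in do_loop(n, 0)]

print(list(do_loop(10, 0)))     # index values seen by '10 0 do ... loop'
print(len(bench_pairs(200)))    # number of gcd calls made by '200 bench'
```

Under these assumptions "200 bench" calls gcd once for every pair, that is n * n times.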

The following test
200 bench
takes about 8 seconds to complete on AmForth. It is a very good result compared to the other Forths I tested.

Flash Forth 5

Installing Flash Forth also requires a programmer. You also need avr/hex/ff_uno.hex file, which you can upload to the board using USBasp with:
avrdude -p m328p -c usbasp -e -U flash:w:ff_uno.hex -U efuse:w:0xfd:m -U hfuse:w:0xda:m -U lfuse:w:0xff:m -v
Again, the whole command should be a single line.

You can communicate with Flash Forth the same way as with AmForth, but using different baud rate:
screen /dev/ttyACM0 9600
Flash Forth comes with a separate math library, so to be able to define the "gcd" word in Flash Forth, you need to download it from http://flashforth.com/math.txt and retype or upload it via the terminal. You need to be careful with uploading, though. If you just copy and paste the whole file into the terminal you will overrun the input buffer and Flash Forth will start returning errors. The same applies to uploading code to other Forths, too. It's best to copy the code definition by definition or to use special software which slows down the transmission (for example, iTerm has a special paste option, "Paste Slowly").
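The "paste slowly" idea can also be scripted. The following is a minimal sketch in Python, not a tested upload tool: `send_slowly` and its `write` parameter are hypothetical names, `write` can be any callable that accepts bytes (in practice, the write method of an open serial port object), and both the carriage-return line ending and the 0.2 second delay are assumptions that may need adjusting for a given Forth.

```python
import time

def send_slowly(source, write, line_delay=0.2):
    """Send Forth source one line at a time, pausing after each line.

    Returns the number of lines sent. Blank lines are skipped.
    """
    sent = 0
    for line in source.splitlines():
        if not line.strip():
            continue                          # nothing to send for a blank line
        write(line.encode("ascii") + b"\r")   # assumed line terminator
        time.sleep(line_delay)                # give the board time to process the line
        sent += 1
    return sent

# Dry run: collect the outgoing bytes in a list instead of a serial port.
outgoing = []
print(send_slowly(": sq dup * ;\n\n3 sq .\n", outgoing.append, line_delay=0))
print(outgoing)
```

A fixed delay is the simplest approach; waiting for the interpreter's "ok" prompt after each line would be more robust, but is not shown here.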

Flash Forth does not support "do/loop", and uses "for/next" instead. So the word "bench" has a slightly different definition:
: bench ( n -- )
  dup for
    r@
    dup for
      dup r@ gcd drop
    next
    drop
  next
  drop ;
The word "for" takes only one value from the stack, and always counts down to zero. Also, because there is no "do/loop" in Flash Forth, there is no "i" and "j" either, and you need to copy the current loop counter from the return stack (which keeps track of the program execution) to the data stack with "r@". Apart from the syntax, the loop structure looks much the same as in the AmForth version.
However, Flash Forth turns out to be much faster. Running "200 bench" takes about 4 seconds to complete, which is twice as fast as with AmForth.

328eForth 2.20

It's another Forth for Arduino, and a direct descendant of the renowned eForth. Its main advantage is simplicity: the whole source code fits in one file of Atmel assembly, and the compiled hex is only 14 kilobytes long. Unfortunately, the original project page is no longer available, but you can download the source code of version 2.20 from this GitHub repository.

The repository does not provide a compiled binary, though, so you need to make it yourself. Fortunately, it's quite easy - all you need is Atmel Assembler, which you can find on Sourceforge. It's a Windows executable, but it works pretty well with Wine, so you can compile 328eForth on Linux and Mac with no problem. Put my_forth.asm file in the avr8/Atmel directory and run the following command:
avrasm2.exe -fI -I Appnotes2/ my_forth.asm
You should now have the my_forth.hex file, which you can upload to Arduino with:
avrdude -p m328p -c usbasp -e -U flash:w:my_forth.hex -U efuse:w:0xfd:m -U hfuse:w:0xd8:m -U lfuse:w:0xff:m -v
To communicate with 328eForth, connect via terminal using baud rate 19200.

I love this Forth implementation for its simplicity, but unfortunately it is quite slow and buggy. The "gcd" word does not work properly, because the word "mod" is broken and returns wrong results. Also, the benchmark executes in 27 seconds with 328eForth, compared to 8 with AmForth and 4 with Flash Forth. To make things worse, the project page and documentation are missing and can be reached only partially, through the Wayback Machine.

Yaffa Forth 0.6.1

This Forth is different. Yaffa Forth is written in C and can be uploaded to the Arduino board like any regular sketch, with the Arduino IDE. It's a good option for people who don't have a programmer or only want to give Forth a short try. It also provides the standard Arduino interface with words such as "pinMode", "digitalRead/digitalWrite", "analogRead/analogWrite", etc. AmForth, Flash Forth and 328eForth don't use Arduino libraries, so you have to talk to the pins directly via I/O ports. Also, because Yaffa's source code is very clean and well documented, you can easily extend it with new words which can use existing Arduino libraries written in C.

Yaffa Forth has some deficiencies, though. Because it runs on top of a virtual machine written in C, which itself also needs memory, it has less space available for the stack, which can be especially painful on the Arduino Uno or Nano (they both have only 2 kB of RAM). Also, it is much slower than the previous Forths: the benchmark code, identical to AmForth's, takes 70 seconds to run, which makes it almost nine times slower than AmForth and more than seventeen times slower than Flash Forth.

There is also one more important difference between the aforementioned Forths and Yaffa Forth. AmForth, Flash Forth and 328eForth store all user-defined words in flash memory, together with the main dictionary (in 328eForth a new word is defined in RAM, but must be copied to flash with the word "flush" before it can be used). This means that if you turn off the power, your definitions remain in the Arduino's memory. With Yaffa Forth, all new words are stored in RAM and disappear once you turn the board off or press the reset button. If you want to store your definitions for future use, you must write all of them to EEPROM (with eeLoad -> code -> ctrl + z). This is because Yaffa Forth relies on the Arduino bootloader, which prevents user applications from writing directly to flash. On the other hand, EEPROM can handle almost ten times as many write cycles as flash before it wears out, so in this respect it may be more hardware friendly to use Yaffa Forth than its counterparts.

14 comments:

Anonymous said...

Flashforth on the PIC24 runs the same test in 178 milliseconds @ 27 MIPS clock speed.

Peter Silver said...

Really interesting Krzysztof, covering an area that is pretty challenging to a novice.

Ultimately I want to programme the STM8 in eForth after familiarising myself with eforth328. Checking the "Wayback machine" link in your article reveals an accessible file, 328eForth.hex. I wondered how it would differ from the my_forth.hex file created in your article? Furthermore, excuse my ignorance, but is the slow benchmark due to the mentioned faulty "MOD"?

Krzysztof Kliś said...

Nicely spotted! There's indeed a slight difference in the 328eForth.hex you mentioned and my_forth.hex compiled from the source code found on Github. I did some research and discovered that the repository also contains the original source code of 328eForth at https://github.com/DRuffer/328eforth/blob/7da3325469aaf625cae9d89989d30730044781dd/328eforth220.asm, which was later replaced by my_eforth.asm. I compiled 328eForth.asm and it resulted in 328eForth.hex identical to the file stored in Wayback archive. Next, I compared it with my_eforth.asm and the difference is only in COM port speed settings. Out of curiosity I uploaded 328eForth.hex to Arduino Uno and checked it with "15 4 mod .", which resulted in 1, while it should be 3. So it's still something wrong with the calculations. I remember I tried to track down the problem some time ago, and I found the source code of 328eForth for a different CPU architecture (I believe it was Intel). The logic seemed to be a straightforward translation to Atmel and I could not spot an obvious error. I suspect that the problem arises from an assumption that the CPU flags (zero, overflow, carry) are set and cleared the same way across all architectures, which may not be true.
As for the benchmark, you are absolutely right. If you look at the source code of eForth, division is in fact performed by successive subtractions. If the boundary checks are not working correctly (again, just a guess), it may result in many more operations than necessary (for example several thousand operations, until an overflow occurs, instead of just a few), and affect the overall speed test results.

Peter Silver said...

Thanks Krzysztof for the response. If you don't mind communicating, is this the channel you prefer? Just in case, I'll ask some questions now! I have started to read the 328eForth_readme accessible via Wayback machine.

1. At present I don't have a USBasp but I do have more than 2 328p UNOs. Can the hex file be regarded as a bootloader and the technique described in, "https://www.arduino.cc/en/Tutorial/ArduinoToBreadboard" be used please?

2. I wondered how many non-obvious eForth words are affected by the "MOD" problem. I don't actually want to get deeply involved in the 328eForth as I'm trying to "think" STM8!

3. Wrt the MOD problem, just a thought, but do you get the same results with a different compiler?

Krzysztof Kliś said...

I don't mind answering here, maybe someone else finds this thread interesting as well :) You can also contact me directly via Gmail (krzysztof.klis at ...) if you wish so.

As for your questions:

1. Yes, you can regard the hex file as a bootloader. I haven't tried using one Arduino as a programmer for another Arduino board, but it should work just as well as USBasp.

2. I tried the same operations with a different eForth version (for PC/DOS as far as I remember) and everything worked fine. So I think it's just the Arduino port which is broken. I have not tried the STM8 version though.

3. 328eForth is written in AVR assembly, so maybe I should use the more appropriate word "assembler" instead of "compiler". With high level languages different compilers produce different results, while assemblers just perform a direct translation of human readable mnemonics into CPU instructions. So no matter which assembler you use, the resulting binary code will always be the same. I tried Atmel Assembler and Avra and they both produced identical hex outputs.

Out of curiosity, have you tried to repeat my benchmark with eForth for STM8?

Peter Silver said...

Hello again Krzysztof.

I'm of the same view about letting others know my experience. I think that if you solved the 328eForth problem, and I suspect you're pretty close, it can be a real help to others.

On my side I've been able to now run eForth on a STM8S103F3 today for the first time ever. At the moment I can't configure a terminal programme to implement a carriage return when it receives presumably a line-feed from the board. Thus with every entry, text scrolls across the screen and then wraps round. However for test benchmark purpose I have a result but you may want to expand on your very useful article.

A. In your article you have a greyed row "15 25 gcd ." but you don't actually state the expected answer 5. In my STM8 implementation this works as does "15 4 MOD ." giving 3. The gcd word appears in WORDS and can be called often.


B. In creating the "bench" word I got errors till I removed j i from the " j i gcd" part. This part of eForth I have to study!

C. Running "200 bench" has the effect of resetting the STM8S103F3 and losing both the gcd and bench words. However it takes approximately 4 seconds to do this so I suspect some form of the benchmark functioned.

Wrt earlier comments.
1. I'll wait till I get a spare Atmel 328p before trying to program it now. However what has surprised me is that the majority of articles include adding parts for an external 16 Mhz clock. For basic purposes a slower much simpler implementation can use the internal 8Mhz clock as an introduction imo.

2. I enquired about the effect on other words as I was going to suggest creating a replacement for MOD (workingMOD) for it and any other words used in running your benchmark.

3. Wrt terminology I suspected you were referring to assemblers. As stated I haven't got involved with the Atmel architecture, in fact I stopped with 6502 a long time ago. However I wouldn't be surprised if it is a configuration error allowing say a floating pin to have an effect. The other area I was thinking about was inconsistent register accessing.

FYI the terminal program I'm using is the original HyperTerminal V6.3 as it's approximately 1MB. I did try CoolTerm which is 37MB but didn't like it. I'll probably install the Arduino environment with its terminal and see if that's acceptable. I might then buy that 328p!

Unless it's really necessary I'm trying to avoid Linux programmes at this time.

As you might have guessed this is early days for me and eForth but I do intend to try and make it better publicised for the STM8 if it meets my expectations. In my history I was involved with an educational robot and designed interfaces for many computers including the Jupiter Ace. I have an Ace (and the robot) somewhere and was thinking of using it for publicity purposes. If anything appeals to you let me know.

Best wishes,

Peter

Peter Silver said...

Hi Krzysztof.

Can you please tell me what results you get with:

a. 18 4 MOD .
b. 18 4 / .
c. 18 4 /MOD . .

If by any chance c gives 4 2 then could I suggest running a modified benchmark with /MOD DROP instead of the existing MOD.

I was going to give you my results for the STM8 with both forms but unfortunately I can't seem to store the words in NVM. The new words disappear when the benchmark is run no matter what I've tried to date. If you can suggest how it's done in 328eForth it would be very helpful.

When I get the time I'll install 328eForth on a Arduino pro mini 3.3v I have.

Best wishes,

Peter

Krzysztof Kliś said...

Hi Peter,
I did the tests and got the following results:

18 4 MOD .
0 ok

18 4 / .
6 ok

18 4 /MOD . .
6 0 ok

I don't know how 328eForth can store words in a microcontroller. According to the readme still available at https://web.archive.org/web/20150217224050/http://offete.com/files/328eForth_readme.pdf you need to use word "flush" to write data to the flash memory, which in theory should make them persistent. But when I reset my board or turn off the power, all user-defined words disappear.

Peter Silver said...

Hi Krzysztof.

I have found another source which has a slightly later assembler version at:
https://gitlab.com/jjonethal/eforth328

I'm now downloading Atmel Studio 7.0 and will test various permutations on a 5V Pro mini.

I'd be grateful if you could send me the expression that you used to benchmark 328eForth as it does not have the AmForth "j" word.

Apart from the MOD problem there is the matter of storing new words in Flash memory. In the case of the STM8 at the moment it only appears to store one word so I'm wondering if a variable is not being referenced correctly and is locked in ROM.

Again let me say that your article has been very useful.




Krzysztof Kliś said...

328eForth does not support do/loop construct, so I used for/next as described in Flash Forth example. Remember to use "flush" after typing :gcd and :bench definitions, otherwise eForth will crash.

Peter Silver said...

Thanks for the patience Krzysztof, this is all outside my comfort-zone at the moment!

In the case of the STM8 it has DO LOOP and I words but I can't seem to use them. However following your last update I have run hopefully representative benchmarks on the STM8S103F3P6 cheap system board and can report 200 BENCH is 3.3 seconds and 400 BENCH is 13.8. The code is below. I accessed the STM8 via RealTerm 2.0.0.70 and found the ability to input delays for each line or character incredibly useful as the "Compilation" is unreliable for no obvious reason. That is the reason for the extra lines between instructions in the code below to allow the STM8 time to process the line.

At the moment I'm investigating how to use an Arduino UNO programmed as an ISP to reprogram a 3V Arduino Pro Mini to eForth. I'm pretty sure I could install the Arduino bootloader via the Arduino IDE but can't find how to use any other code in the IDE. I assembled the same version of eForth found at https://gitlab.com/jjonethal/eforth328 but the hex file is identical to yours so would have the same MOD problem. I need to be sure I can reliably program 3V and 5V Pro Minis based on my earlier experience with the STM8. I have installed the 1GB Atmel Studio 7 but have yet to appreciate it!

My interest in eForth is to see if it is a viable alternative to C, which I don't particularly like, for use with the NRF24L01 wireless transceiver and I2C.

STM8 Benchmark:
ram / close NVM setting

hand / close external file use

reset / clear any created words

FILE / select external file

NVM / created words stored in NVM


: gcd begin dup while swap over mod repeat drop ;


: bench dup for r@ dup for dup r@ gcd drop next drop next drop ;


RAM / close NVM for RAM

HAND / accept terminal entry

cr words cr / display dictionary with hopefully stored new words


Peter Silver said...

Hi Krzysztof, some progress! I got the 328eForth MOD and other words to work when I preceded it with DECIMAL on a 5v Arduino Pro Mini. I attempted your benchmark but that proved too much and the Pro has locked up and I won't be able to re programme it till next week.
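One possible reading of this (an assumption; nothing in the thread confirms which number base the interpreter starts in) is that 328eForth was parsing and printing numbers in hexadecimal until DECIMAL was issued. The Python sketch below, with the made-up helper `forth_results`, interprets the tokens typed earlier in the thread in base 16 and in base 10 so the two readings can be compared with the reported output.

```python
def forth_results(text_a, text_b, base):
    """Parse two number tokens in `base` and return (remainder, quotient),
    formatted back in the same base, as a Forth '.' would print them."""
    a, b = int(text_a, base), int(text_b, base)
    fmt = "{:X}" if base == 16 else "{:d}"
    return fmt.format(a % b), fmt.format(a // b)

for a, b in (("18", "4"), ("15", "4")):
    print(a, b,
          "as hex:", forth_results(a, b, 16),
          "as decimal:", forth_results(a, b, 10))
```

Read as hexadecimal, "18 4" gives remainder 0 and quotient 6, and "15 4" gives remainder 1, which are the values reported for 328eForth above; read as decimal they give 2, 4 and 3.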

Just a thought but what's the viability of having a two stage Forth running like an Arduino environment? The protected bootloader part ensures that if necessary the initial dictionary can be reloaded.

TG9541 said...

Hi, that's an interesting discussion :-)

STM8EF doesn't provide access to the loop counter of the outer loop. It's easy to work around that:

```
variable j

: gcd ( a b -- gcd )
begin
dup
while
swap over mod
repeat
drop ;

: bench ( n -- )
dup 0 do
i j !
dup 0 do
j @ i gcd
drop
loop
loop
drop ;
```

Test on a plain STM8S003F3P6@16MHz:

* `25 15 gcd .` gets 5.
* `200 bench` from RAM takes 6.9s
* `200 bench` from Flash takes 6.4s
* `400 bench` from Flash takes about 28s

The problems Peter experienced: bad QA - sorry about that! That's a thing of the past, I hope :-)

TG9541 said...

On a second thought: Peter's and my observations differ by a factor of two. Something must be wrong.

Peter used the definition with `for .. next` (same as in the FlashForth example).

I did a quick test with STM8EF:

```
variable tally 0 tally !
: gcd ( a b -- gcd )
drop 1 tally +! ;

: bench ( n -- )
dup for
r@
dup for
dup r@ gcd drop
next
drop
next
drop ;
```

200 bench ok
tally @ U. 20301 ok

Here we have the factor of two (and an off-by-one error)! The same test with the `do .. loop` example results in a tally of 40000 (as expected).

A working implementation of the double `for .. next` loop might look like this:

```
: bench ( n -- )
1- dup for
r@ j !
dup for
j @ r@ gcd drop
next
next
drop ;
```

Now gcd runs 40000 times (and with the same values as the double `do .. loop` variant).
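The two tallies can be reproduced with a small Python model. It is only a sketch: it assumes the eForth-style behaviour that the tally above implies, where `n for ... next` runs the body with the counter going from n down to 0 (n + 1 passes), and the function names are invented for illustration.

```python
def for_next(count):
    """Counter values for 'count for ... next': count, count-1, ..., 0."""
    return range(count, -1, -1)

def tally_original(n):
    # dup for  r@ dup for  dup r@ gcd drop  next drop  next drop
    # The inner loop is bounded by the outer counter, not by n.
    return sum(1 for outer in for_next(n) for _ in for_next(outer))

def tally_fixed(n):
    # 1- dup for  r@ j !  dup for  j @ r@ gcd drop  next  next drop
    # Both loops are bounded by n - 1.
    return sum(1 for _ in for_next(n - 1) for _ in for_next(n - 1))

print(tally_original(200))   # 20301
print(tally_fixed(200))      # 40000
```

The first version does a triangular number of gcd calls, (n + 1)(n + 2) / 2, which is roughly half of n * n and explains the factor of two in the timings.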

Conclusion:

AmForth and FlashForth on the ATmega328 show about equal performance (`200 bench` takes about 8 seconds).

At the same CPU clock speed (16MHz) a Forth on the STM8 seems to be slightly faster (`200 bench` takes 6.4s).

However, STM8EF is optimized for size:
* the full featured v2.2.16 MINDEV binary used in the experiment above requires 4768 bytes
* the small CORE binary (without background task, `CREATE DOES>`, and `DO .. LEAVE .. LOOP`) fits in 3954 bytes
* a "RAM only" subset (like YAFFA Forth) fits in about 3600 bytes

An STM8 Forth optimized for speed would perform a bit better, I suppose.