APPLICATION NOTE 3483

Abstract: The combination of a multiply-accumulate unit (MAC) and a single-cycle core make the MAXQ2000 a versatile microcontroller (µC). The MAXQ2000 has the performance and I/O peripherals ideal for many applications: alarm clocks, handheld medical devices, digital readouts—any application that requires low power, high performance, and numerous I/Os. With the integrated MAC, the MAXQ2000 ventures into the territory of the DSP (µC).

Just how much performance can the MAXQ2000 get out of its MAC? This application note explores that question with an audio-filtering example, and gives some quantitative guidelines on the performance supported by the MAXQ2000.

The hardware required for this application note includes the MAXQ2000 evaluation kit and a small circuit to interface to a computer speaker.

The MAXQ2000 evaluation kit which is available, is a good tool for exploring the MAXQ2000's capabilities. It includes an LCD panel, LED bank, and access to all the I/O pins of the MAXQ2000 µC. It also includes a MAX1407 ADC/DAC, which can be used for audio output.

The second piece of hardware required can be easily breadboarded. The circuit that was used for this demonstration is shown in

The software required to run this demonstration was built and debugged using the IAR embedded workbench. It provides a good debugging environment, making use of the hardware debugging support of the MAXQ2000. You can set breakpoints, set and read registers and memory, and see your call stack while running on the real hardware.

A generic filter takes the form of a linear equation:

y(n) + Σb

where k represents the order of the feedback portion of the filter and j represents the order of the feed-forward portion of the filter.

An example IIR filter could be something as simple as:

y(n) = 0.5y(n-1) + x(n) - 0.8x(n-1)

Some filters are classified as FIR filters. They do not contain the feedback portion. In other words, there is no y portion in the characteristic filter equation:

y(n) = Σa

y(n) = x(n) - 0.2x(n - 1) + 0.035x(n - 3)

In either case, a filter boils down to a characteristic equation that is essentially a weighted average of some number of the past input and output values. The job of filter design is to produce those A

This solution is a tradeoff of accuracy for speed. In many cases, the error from this approach is negligible. For diagnostic purposes, the applet shows three plots of the calculated filter. The first plot shows the ideal filter behavior, using 64-bit floating-point numbers. This plot, labeled "Ideal Transform", is shown in Figure 2.

For simplicity, the applet produces the floating-point coefficients required by the MAXQ® application, so that new filters can simply be cut-and-pasted into the source for the filter application (into the file data.asm). The applet also produces two other values—the order of the filter (the number of coefficients) and the shift count, so the application can shift the final result appropriately. This data appears in the text box at the bottom of the applet, and might look like this:

Zeroes: dc16 dc16 12, 11, 0x1000, 0x26d3, 0x1e42, 0xf9a3, 0xecde, 0xff31, 0xa94, 0x2ae, 0xfd0c, 0xff42, 0xde Shift amount: 12

The MAX1407 has a 12-bit ADC. However, the input data is 16-bits wide, and our filter produces a 16-bit result. So, while those 4 least significant bits (LSBs) are wasted for this application, we can safely analyze our performance as if working on and producing 16-bit values (CD-quality audio is 16 bits).

In this example, the filter coefficients are stored in code space in a table. When a filter is selected, the application finds the appropriate filter, reads the shift amount and number of taps, and is then ready to start filtering data. The following code applies the filter coefficients:

move MCNT, #22h ; signed, mult-accum, clear regs first zeroes_filterloop: move A[0], DP[0] ; let's see if we are out of data cmp #W:rawaudiodata ; compare to the start of the audio data lcall UROM_MOVEDP1INC ; get next filter coefficient move MA, GR ; multiply filter coefficient... lcall UROM_MOVEDP0DEC ; get next filter data move MB, GR ; multiply audio sample... jump e, zeroes_outofdata ; stop if at the start of the audio data djnz LC[0], zeroes_filterloop zeroes_outofdata: move A[2], MC2 ; get MAC result HIGH move A[1], MC1 ; get MAC result MID move A[0], MC0 ; get MAC result LOWBefore this code is executed, LC[0] is set up with the number of taps for the filter, DP[0] is set up with the address of the current input byte to the filter, and DP[1] points to the start of the filter coefficients. Therefore, DP[1] works through the filter coefficients in an incrementing manner, and DP[0] works through the input data in a decrementing manner (processing the most recent input first).

Since the MAC works in a single cycle, there is not a lot of code here to handle it. MCNT is set to 22h to denote the use of signed integers. In the main loop, the consecutive writes to MA, then MB triggers the multiply-accumulate operation—the result is ready in the next clock cycle. Since our accumulator is 48 bits (and our multiply result is 32 bits), we need not worry about any overflow (unless we have 64,000 taps in our filter!).

We can split the function we use to produce an audio sample into three portions: initialization, filter computation loop, and result fixing. In our released example, the initialization takes 38 cycles, the filter computation loop takes 17 cycles per filter coefficient, and the result fixing takes 9 + (6 x S) cycles, where S is the shift amount. Normally, the shift amount is around 12, so we can estimate result fixing at 81 cycles. Therefore, it takes 119 + (17 x N) cycles to produce one filtered output value. At 20MHz, a MAXQ2000 could run a 100-tap filter at close to 11kHz, which is good enough quality for voice data.

Let us go back and reanalyze our application to see where we could tighten it up. We will concentrate in the filter loop, since this is where most of our cycles are happening on any but the most trivial of filters.

There are a few critical improvements we can make to the loop code to improve efficiency. Remember that we are using prerecorded audio samples that are stored in code space. Lookups to code space require more time than lookups in data space because of the MAXQ's Harvard architecture. The functions called UROM_MOVEDP1INC and UROM_MOVEDP0DEC take 5 cycles each (2 cycles for the LCALL, then 3 cycles inside the function). These could each be replaced by two cycles each if we stored our filter in the RAM (one cycle to select the pointer, one to read from it), and if we were serving real-time input data that was stored in the RAM. If we were willing to donate 256 words of RAM to the filter, we could implement a circular buffer using BP[Offs] to store the input data. These changes reduce the loop time to 11 cycles instead of 17. Our filter loop now looks like this (cycle counts listed first in the comments):

zeroes_filterloop: move A[0], DP[0] ; 1, let's see if we are out of data cmp #W:rawaudiodata ; 2, compare to the start of the audio data move DP[1], DP[1] ; 1, select DP[1] as our active pointer move GR, @DP[1]++ ; 1, get next filter coefficient move MA, GR ; 1, multiply filter coefficient... move BP, BP ; 1, select BP[Offs] as our active pointer move GR, @BP[Offs--] ; 1, get next filter data move MB, GR ; 1, multiply audio sample... jump e, zeroes_outofdata ; 1, stop if at the start of the audio data djnz LC[0], zeroes_filterloop ; 1Once we have our filter and input data in the RAM, we can use another trick of the MAXQ architecture. The MAXQ instruction set is highly orthogonal—there are very few restrictions on what can be used as a source in any operation. Therefore, rather than reading the filter data and the input data into GR, we can write it directly to the MAC registers. This brings the loop down to 9 cycles.

zeroes_filterloop: move A[0], DP[0] ; 1, let's see if we are out of data cmp #W:rawaudiodata ; 2, compare to the start of the audio data move DP[1], DP[1] ; 1, select DP[1] as our active pointer move MA, @DP[1]++ ; 1, multiply next filter coefficient move BP, BP ; 1, select BP[Offs] as our active pointer move MB, @BP[Offs--] ; 1, multiply next filter data jump e, zeroes_outofdata ; 1, stop if at the start of the audio data djnz LC[0], zeroes_filterloop ; 1One final improvement can make this code really fly. Every time through the loop, we compare our current data pointer to the start of the audio-input data to see if we are going out of bounds (the MOVE A[0], DP[0] statement, the CMP compare statement, and the JUMP E statement). If we set the initial audio data (which we are now reading with the circular buffer pointed to by BP[Offs]) to all zeroes, we can simply remove these checks. The cost of initializing the RAM to 0 is negligible, as compared to the 4-cycle savings over the next several thousand samples. Our new loop code is a slim 5 cycles.

zeroes_filterloop: move DP[1], DP[1] ; 1, select DP[1] as our active pointer move MA, @DP[1]++ ; 1, multiply next filter coefficient move BP, BP ; 1, select BP[Offs] as our active pointer move MB, @BP[Offs--] ; 1, multiply next filter data djnz LC[0], zeroes_filterloop ; 1Before going back to our performance equation, let us look at the result computation. Our current way of shifting the 48-bit result down seems wasteful.

move A[2], MC2 ; get MAC result HIGH move A[1], MC1 ; get MAC result MID move A[0], MC0 ; get MAC result LOW move APC, #0C2h ; clear AP, roll modulo 4, auto-dec AP shift_loop: ; ; Because we use fixed point precision, we need to shift to get a real ; sample value. This is not as efficient as it could be. If we had a ; dedicated filter, we might make use of the shift-by-2 and shift-by-4 ; instructions available on MAXQ. ; move AP, #2 ; select HIGH MAC result move c, #0 ; clear carry rrc ; shift HIGH MAC result rrc ; shift MID MAC result rrc ; shift LOW MAC result djnz LC[1], shift_loop ; shift to get result in A[0] move APC, #0 ; restore accumulator normalcy move AP, #0 ; use accumulator 0One possible solution is to use our MAC once again. Rather than shifting right by 12 (or any value between 0 and 16), we can shift left by 16 minus that amount (i.e. shift left by 4). This will put our result in the middle 16-bit word of the MAC's registers. Note that our shift left will actually be accomplished with a multiply-by-2 to some power (2

; ; don't care about high word, since we shift left and take the ; middle word. ; move A[1], MC1 ; 1, get MAC result MID move A[0], MC0 ; 1, get MAC result LOW move MCNT, #20h ; 1, clear the MAC, multiply mode only move AP, #0 ; 1, use accumulator 0 and #0F000h ; 2, only want the top 4 bits move MA, A[0] ; 1, lower word first move MB, #10h ; 1, multiply by 2^4 move A[0], MC1R ; 1, get the high word, only lowest 4 bits significant move MA, A[1] ; 1, now the upper word, we want lowest 12 bits move MB, #10h ; 1, multiply by 2^4 or MC1R ; 1, combine the previous result and this one ; ; result is in A[0] ;This will let us take our result computation to 12 cycles, rather than 9 + (6 x S) cycles.

Now let us plug back into our earlier equation. Our new equation uses a conservative estimate of 40 cycles of overhead and 5 cycles per loop iteration. Using the same example as earlier of a 100-tap filter, the MAXQ2000 could process 16-bit, mono audio data at 37kHz, as seen in

Filter Length (Taps) | Max Rate (Hz) |

50 | 68965.51724 |

100 | 37037.03704 |

150 | 25316.4557 |

200 | 19230.76923 |

250 | 15503.87597 |

300 | 12987.01299 |

350 | 11173.18436 |

There is one more performance improvement that we can implement for applications in which a higher sampling rate is required and code space can be sacrificed. We can 'in-line' the filter coefficients, which eliminates both the need to select active pointers and the need to loop (this technique is also known as loop unrolling). The price of this change is increased code space—previously, our 100-point filter took 100 words to store; now it will take 300 words to store (2 words for each coefficient move, 1 word for each data value move). In a 16 kilo-word device, this may be an insignificant price for the performance benefit. The new code might look something like this:

move BP, BP ; select BP[Offs] as our active pointer zeroes_filtertop: move MA, #FILTERCOEFF_0 ; 2, multiply next filter coefficient move MB, @BP[Offs--] ; 1, multiply next filter data move MA, #FILTERCOEFF_1 ; 2, multiply next filter coefficient move MB, @BP[Offs--] ; 1, multiply next filter data move MA, #FILTERCOEFF_2 ; 2, multiply next filter coefficient move MB, @BP[Offs--] ; 1, multiply next filter data . . . move MA, #FILTERCOEFF_N ; 2, multiply next filter coefficient move MB, @BP[Offs--] ; 1, multiply next filter data ; ; filter calculation complete ;To calculate the performance benefit of this change, we again assume 40 cycles of overhead, but now there are 3 cycles per loop iteration, though we have truly eliminated the loop. The 100-tap performance limit now sits at 58kHz (see

Filter Length (Taps) | Max Rate (Hz) |

50 | 105263.1579 |

100 | 58823.52941 |

150 | 40816.32653 |

200 | 31250 |

250 | 25316.4557 |

300 | 31250 |

350 | 27027.02703 |

- Dedicate a segment of RAM to store the latest output samples (this would be most efficiently implemented as a circular buffer, using the BP[Offs] register in a manner similar to that described previously)
- Include the characteristic filter coefficients for the feedback (the 'y' portion) of the filter
- Add another loop that continues to accumulate products that are the result of the feedback portion of the filter

- MAXQ Home Page
- MAXQ Family User's Guide
- MAXQ2000 User's Guide Supplement
- Dallas Semiconductor Microcontroller Support Forum
- Source code and support applications used in this application note