Port of RIOTOS to Curiosity PIC32 MZ EF Development Board

Status
Not open for further replies.

nsaspook

I've just started creating a new cpu and board port (based on an existing mips/pic32mz configuration) for this dev board. So far only a few I/O items are working (the debug uart, user LEDs and RGB LED), but it boots and runs the 'hello world' and timer serial port examples, and the Linux-based bare-metal MIPS-MTI compiler/linker creates an MPLABX-IPE compatible hex file for flashing.
http://www.microchip.com/Developmen...8Q1&utm_content=DevTools&utm_campaign=Article

Fork for riotos software: http://github.com/nsaspook/RIOT/tree/PIC32MZEF

http://github.com/RIOT-OS/RIOT/wiki/Family:-MIPS

working toolchain: http://codescape.mips.com/components/toolchain/2016.05-03/downloads.html
SDK Installers v1.4

Update the Linux shell path (in .bashrc) to include the correct compiler after the tools install.
export PATH=$PATH:/usr/local/go/bin:/opt/imgtec/Toolchains/mips-mti-elf/2016.05-03/bin
 
The LEDs, switch, usb power, uart and spi (found a good starting example on the net) drivers are running on the hardware so it's possible now to communicate with radio modules and other serial devices once testing is done.

Still creating a board file with all the pin and device connections but most of the simple communications signals to the Mikro Bus headers are done.

SPI MOSI data with SCK clock on the top and two UART serial ports (and one debug serial port) on the bottom running the driver test program.



Serial data from uart4 out -> uart2 in -> spi2 out as a 4 byte sequence for every uart byte received.

RIOT-OS application testing code for this board after being compiled and flashed on a Linux workstation using the baremetal MIPS compiler with MPLABX-IPE to program the hex file result.

C:
/*
 * PIC32MZ EF Curiosity Development Board, RIOT-OS port testing example
 */

#include <stdio.h>
#include <string.h>
#include <inttypes.h> /* PRIu32 used in the sprintf format below */
#include "xtimer.h"
#include "timex.h"
#include "periph/uart.h"
#include "periph/gpio.h"
#include "periph/spi.h"

/* set interval to 1 second */
#define INTERVAL (1U * US_PER_SEC)

/* serial #1 interrupt received data callback processing */
static void _rx_cb1(void* data, uint8_t c)
{
    uint8_t *recd = data, rdata[4];

    *recd = c;
    /* write received data to TX and send SPI byte */
    uart_write(1, &c, 1);
    /* SPI in interrupt context for testing, bus has mutex_lock
    * we receive one byte from the uart and transfer 4 bytes using SPI
    */
    spi_transfer_bytes(SPI_DEV(1), 0, true, recd, rdata, 4);
}

/* serial #2 interrupt received data callback processing */
static void _rx_cb2(void* data, uint8_t c)
{
    uint8_t *recd = data, rdata[4];

    *recd = c;
    /* write received data to TX and send SPI byte */
    uart_write(2, &c, 1);
    /* SPI in interrupt context for testing, bus has mutex_lock
    * we receive one byte from the uart and transfer 4 bytes using SPI
    */
    spi_transfer_bytes(SPI_DEV(2), 0, true, recd, rdata, 4);
}

int main(void)
{
    /* variable data[1..2] byte 4 has SPI id data for testing */
    uint32_t data1 = 0x0f000000, data2 = 0xf0000000;
    char buffer[128];
    int dd, times_count = 0;
    xtimer_ticks32_t last_wakeup = xtimer_now();
    /*
    * setup serial ports, uart 1,2,4 @115200 bps and spi 1,2
    * uart callback uses a 4 byte variable for data so SPI can
    * transfer 4 bytes in the callback
    */
    uart_init(1, DEBUG_UART_BAUD, _rx_cb1, &data1);
    uart_init(2, DEBUG_UART_BAUD, _rx_cb2, &data2);
    uart_init(4, DEBUG_UART_BAUD, NULL, 0);
    spi_init(SPI_DEV(1));
    spi_init(SPI_DEV(2));
    spi_acquire(SPI_DEV(1), 0, SPI_MODE_0, SPI_CLK_1MHZ);
    spi_acquire(SPI_DEV(2), 0, SPI_MODE_0, SPI_CLK_1MHZ);

    while (1) {
        /* stop unused variable warning from compiler */
        (void) last_wakeup;
        /*
        * repeat the data stream to all serial ports by sending data to uart #4
        */
        sprintf(buffer, "Times %d, Testing longer string %" PRIu32 "\n", times_count++, xtimer_usec_from_ticks(xtimer_now()));
        /* send string to serial device #4, TX pin out looped to device #1 and 2 RX pin inputs */
        uart_write(4, (uint8_t *) buffer, strlen(buffer));
        /* cpu busy loop delay */
        for (dd = 0; dd < 100000; dd++) {
            last_wakeup = xtimer_now();
        }
    }

    return 0;
}
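For anyone tracing the bytes on the scope: each callback overwrites the first byte of its 4 byte variable and leaves the id in the last byte, so on a little-endian core like the PIC32MZ every received character should go out on SPI as c, 0, 0, id. A minimal host-side sketch of that layout follows. `build_spi_frame` is a made-up helper name for illustration, not something in the port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the port): lay out the 4 bytes that
 * are clocked out on SPI for one received UART character, in wire order.
 * frame[0] is the character, frame[3] is the per-port id byte
 * (0x0f for data1, 0xf0 for data2 in the test program). */
static void build_spi_frame(uint8_t c, uint8_t id, uint8_t frame[4])
{
    assert(frame != NULL);
    frame[0] = c;   /* byte overwritten by the uart callback */
    frame[1] = 0;
    frame[2] = 0;
    frame[3] = id;  /* top byte of the 32-bit test variable */
}
```

Calling `build_spi_frame('A', 0x0f, frame)` gives the sequence expected on SPI #1 for a received 'A'.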
 
Porting issues: not many, but interrupts are a slight issue for I/O speed.
1. Interrupts. The MIPS compiler can only handle single-vector mode EIC (like legacy 8-bit PICs), so all interrupt flags must be checked in sequence, which adds latency while a long low-priority interrupt sequence is running. Using multi-vector mode with hand-written code is an accident waiting to happen, so if I see a real problem I think I can use the Microchip xc32 compiler to generate the correct register settings (computed offset or variable offset), interrupt prologue, shadow register set and epilogue for each device in a riot-os module.
https://gcc.gnu.org/onlinedocs//gcc/MIPS-Function-Attributes.html
https://ww1.microchip.com/downloads/en/DeviceDoc/60001108H.pdf
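
To make the "check every flag in sequence" point concrete, here is a small host-side sketch of a single-vector dispatcher. It is only a model under my own assumptions: the enable and flag bits are plain variables standing in for the IECx/IFSx registers, and every name in it (`irq_source_t`, `dispatch_single_vector`, the counting handlers) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry per interrupt source. On hardware these would be register
 * bits; here they are plain variables so the sketch runs on a host. */
typedef struct {
    bool enabled;            /* stands in for IECxbits.<src>IE */
    bool pending;            /* stands in for IFSxbits.<src>IF */
    void (*handler)(void);   /* device service routine, may be NULL */
} irq_source_t;

/* Example handlers that only count how often they ran. */
static int uart1_calls, uart2_calls;
static void count_uart1(void) { uart1_calls++; }
static void count_uart2(void) { uart2_calls++; }

/* Poll every source in table order, as a single-vector ISR has to.
 * Returns the number of sources that were serviced. A source late in
 * the table waits for every handler in front of it to finish. */
static size_t dispatch_single_vector(irq_source_t *table, size_t n)
{
    size_t serviced = 0;

    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].enabled && table[i].pending) {
            if (table[i].handler) {
                table[i].handler();
            }
            table[i].pending = false;   /* like IFSxCLR = mask */
            serviced++;
        }
    }
    return serviced;
}
```

The order of the table is the only "priority" there is in this scheme, which is why a slow handler near the top delays everything below it.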

In my test example (to look for interrupt based problems) I'm using a SPI write in the uart receive context.

All uarts have separate 8 byte deep tx and rx FIFO buffers in hardware.
C:
// in the interrupt vector code, EIC_IRQ is set

/* note Compiler inserts GP context save + restore code (to current stack). */
#ifdef EIC_IRQ
/*
 * This is a hack - currently the toolchain does not support correct placement
 * of EIC mode vectors (it is coming though) But we can support non-vectored EIC
 * mode and note the default PIC32 interrupt controller (which uses EIC +
 * MCU-ASE) defaults to non vectored mode anyway with all interrupts coming via
 * vector 0 which is equivalent to 'sw0' in 'VI' mode.
 *
 * Thus all EIC interrupts should be decoded here
 *
 * When toolchain support is available we could move to full vector mode but
 * this does take up significant space (MCU-ASE provides 256 vectors at 32B
 * spacing (the default) thats 8KB of vector space!), So a single entry point
 * may be better anyway.
 *
 */
void __attribute__((interrupt("vector=sw0"), keep_interrupts_masked)) _mips_isr_sw0(void)
#else

void __attribute__((interrupt("vector=hw5"))) _mips_isr_hw5(void)
#endif
{

// stuff
#ifdef _PORTS_P32MZ2048EFM100_H
    /* process uart receive interrupts here */
    if (IEC3bits.U1RXIE && IFS3bits.U1RXIF) {
        UART_1_ISR_RX();
        IFS3CLR = _IFS3_U1RXIF_MASK;
    }

    if (IEC4bits.U2RXIE && IFS4bits.U2RXIF) {
        UART_2_ISR_RX();
        IFS4CLR = _IFS4_U2RXIF_MASK;
    }

    if (IEC5bits.U4RXIE && IFS5bits.U4RXIF) {
        UART_4_ISR_RX();
        IFS5CLR = _IFS5_U4RXIF_MASK;
    }
#endif
}


/* uart interrupt in single vector sw0 */
static void rx_irq(uart_t uart)
{
#ifdef _PORTS_P32MZ2048EFM100_H
    PDEBUG1_TOGGLE;
#endif
    if (UxSTA(pic_uart[uart]) & _U1STA_OERR_MASK) {
        /* clear the FIFO */
        while ((UxMODE(pic_uart[uart]) & _U1MODE_ON_MASK) && (UxSTA(pic_uart[uart]) & _U1STA_URXDA_MASK)) {
            if (isr_ctx[uart].rx_cb)
                isr_ctx[uart].rx_cb(isr_ctx[uart].arg, UxRXREG(pic_uart[uart]));
#ifdef _PORTS_P32MZ2048EFM100_H
            PDEBUG1_TOGGLE;
#endif
        }
        UxSTACLR(pic_uart[uart]) = _U1STA_OERR_MASK;
    }

    if ((UxMODE(pic_uart[uart]) & _U1MODE_ON_MASK) && (UxSTA(pic_uart[uart]) & _U1STA_URXDA_MASK)) {
        if (isr_ctx[uart].rx_cb)
            isr_ctx[uart].rx_cb(isr_ctx[uart].arg, UxRXREG(pic_uart[uart]));
    }
}

void UART_1_ISR_RX(void)
{
    rx_irq(1);
}

void UART_2_ISR_RX(void)
{
    rx_irq(2);
}

void UART_4_ISR_RX(void)
{
    rx_irq(4);
}


uart tx write processing time on top, bottom trace driver processing time for spi1 clocks started from that byte of data.



The top trace is the rx_irq latency (time between toggles) between uart interrupts, bottom trace driver processing time for spi1 clocks started from those interrupts, top rising debug edge to spi clock edge (depends on the SPI mode).
Here I'm sending both uarts (1,2) the exact same signal from uart 4 so the time between toggles tells me how long the processor takes to handle both.



Debug signal timing.

Because the SPI write is in the interrupt context any pending device interrupt flags must wait until the transfer is done as shown above in the toggle after spi write finishes and the uart ISR returns to the vector. With multi-vector the higher priority interrupt would preempt the current interrupt, process the pending flag and return to spi processing when done. You normally wouldn't spend too much time in a typical interrupt so the effect on most real-world RTOS I/O is small outside of testing worst-case examples looking for corner-case bugs.
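
For comparison, the usual way to keep an ISR short is to have it only queue the byte and let a thread do the slow SPI work later. The following is a standalone sketch of a single-producer, single-consumer byte ring for that purpose. The names (`byte_ring_t`, `ring_put`, `ring_get`) are made up for this example, and RIOT ships its own ring buffer modules that would likely be the better choice in a real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 16u   /* must be a power of two */

typedef struct {
    volatile uint8_t buf[RB_SIZE];
    volatile unsigned head;   /* advanced only by the ISR side */
    volatile unsigned tail;   /* advanced only by the thread side */
} byte_ring_t;

/* ISR side: store one byte, return false if the ring is full. */
static bool ring_put(byte_ring_t *rb, uint8_t c)
{
    assert(rb != NULL);
    if ((rb->head - rb->tail) >= RB_SIZE) {
        return false;               /* full, caller decides what to drop */
    }
    rb->buf[rb->head & (RB_SIZE - 1u)] = c;
    rb->head++;
    return true;
}

/* Thread side: fetch one byte, return false if the ring is empty. */
static bool ring_get(byte_ring_t *rb, uint8_t *c)
{
    assert(rb != NULL && c != NULL);
    if (rb->head == rb->tail) {
        return false;               /* nothing queued */
    }
    *c = rb->buf[rb->tail & (RB_SIZE - 1u)];
    rb->tail++;
    return true;
}
```

In this arrangement the uart callback would call `ring_put()` and return, and the main loop would call `ring_get()` and start the SPI transfer outside interrupt context.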
 
Another issue to examine is processor memory speed in a 32-bit controller with cache and a 200MHz ref clock/sysclock. Using L-1 cache properly is vital to system speed.
The test program uses a cpu/memory bound busy loop for a time delay.
If the system is set for NOCACHE this is the resulting delay.

Delay timing.

Driver processing timing. 44.00us

Enable cache using the non-coherent, write-back, write allocate mode.

The delay loop now runs in L-1 cache with a huge speedup. The amount of speedup changes with program execution (pipeline flow) so you can't use busy loops for precise timing on a pic32 with cache enabled.
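
If a spin delay is still wanted, it can be tied to a free-running counter instead of to the loop count. Below is a rough host-runnable sketch of that idea. The tick source is a function pointer I invented for the example (`tick_source_t`), with a fake counter so it runs anywhere; on the board a hardware timer read would take its place, and RIOT's xtimer sleep calls such as `xtimer_usleep()` already cover the normal case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tick source: returns a free-running 32-bit counter. */
typedef uint32_t (*tick_source_t)(void);

/* Fake counter for host testing: advances by one on every read. */
static uint32_t fake_ticks;
static uint32_t fake_now(void) { return fake_ticks++; }

/* Ticks elapsed since 'start'. Unsigned subtraction stays correct
 * across one wrap of the 32-bit counter. */
static uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Spin until 'ticks' counter ticks have passed. The delay depends on
 * the counter rate, not on how fast the loop body executes, so cache
 * and pipeline effects no longer change its length. */
static void delay_ticks(tick_source_t now, uint32_t ticks)
{
    assert(now != NULL);
    uint32_t start = now();
    while (ticks_elapsed(start, now()) < ticks) {
        /* busy wait */
    }
}
```

The same subtraction trick is what makes the wait safe when the counter rolls over in the middle of a delay.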

Driver processing timing. 34.00us with most of the time spent waiting for the 4 byte spi transfer to complete in 28.00us


Using L1 Cache on PIC32MZ Devices

C:
    /* L1 cache modes, boot code defaults to WB_WA, best performance
    * Uncached
    * Cacheable, non-coherent, write-back, write allocate
    * Cacheable, non-coherent, write-through, write allocate
    * Cacheable, non-coherent, write-through, no write allocate
    */
#define UNCACHED    0x02
#define WB_WA        0x03
#define WT_WA        0x01
#define WT_NWA        0x00

/* L1 cache control
 * CP0 Register 16, Select 0
 * bit 2-0 K0<2:0>: Kseg0 bits
 * Kseg0 coherency algorithm.
 * http://ww1.microchip.com/downloads/en/AppNotes/00001600C.pdf
 */
void set_cache_policy(uint32_t cc)
{
    uint32_t cp0;

    cp0 = _mips_mfc0(16);
    cp0 &= ~0x07; /* K0 is three bits wide (bits 2-0), clear all of them */
    cp0 |= cc;
    _mips_mtc0(16, cp0);
    asm("nop"); /* re-sequence the pipeline after cp0 write */
    asm("nop");
}
RIOT-OS cache control function code for the pic32mzef.

Explains the usage of the cpu configuration registers.
https://ww1.microchip.com/downloads/en/DeviceDoc/61113E.pdf
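
The read-modify-write on the K0 field can be checked on a host without touching CP0. This is a small sketch under the assumption (taken from the register comment above) that K0 occupies bits 2-0 of Config. `config_with_k0` and `K0_FIELD_MASK` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define K0_FIELD_MASK 0x07u   /* Config.K0 occupies bits 2..0 */

/* Return 'config' with its K0 field replaced by 'policy'.
 * All other Config bits are preserved. */
static uint32_t config_with_k0(uint32_t config, uint32_t policy)
{
    assert((policy & ~K0_FIELD_MASK) == 0);   /* policy must fit the field */
    return (config & ~K0_FIELD_MASK) | policy;
}
```

For example, `config_with_k0(0x00000004u, 0x03u)` gives `0x00000003u`: a stale bit 2 is cleared, which a two-bit mask would have left set.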
 
Looks like I may need to expand the RIOT-OS spi API to include ASYNC transfer modes. The current API makes it very hard to use the chip's 16 deep FIFO because there is no provision for receiver callbacks; it only waits for received data after the transmit. I've modified the SPI driver to detect NULL receive buffers for sending data to things like a DAC using the transmit FIFO only, but it should be possible to have a general ASYNC mode driver at the core of regular send/receive transfers.


S1TS and S2TS are the spi transmit outputs without FIFO support. SPIRT shows the toggles for each spi receive transaction in response to each UART byte receive interrupt, whose duration is shown by URXD.
Everything looks nice, synchronous and slow.


Here we have the same data-stream but with the SPI FIFOs enabled (NULL as the receive buffer address to the modified spi driver).
The data rate is faster (uart retransmissions and spi sends) because we don't wait for synchronous data transfer, but without a proper receive callback system this can't be used for bidirectional transfers unless we use a receiver interrupt callback like the uart driver does.

The Linux SPI system is much too complex (I've written drivers for it) for this simple rtos but a spi_async driver (with synchronous wrappers) should be possible without too much trouble.
http://www.hep.by/gnu/kernel/device-drivers/spi.html
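
As a rough picture of what "async core with synchronous wrappers" could look like, here is a host-side sketch. It is not the driver code: the state struct is loosely modelled on the in/len/complete idea, the bus is replaced by a fake loopback so the example runs anywhere, and all names (`spi_xfer_t`, `xfer_start_async`, `xfer_on_rx_byte`, `xfer_sync`, `fake_bus_clock_one`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-bus transfer state. */
typedef struct {
    const uint8_t *out;      /* next byte to send, may be NULL */
    uint8_t *in;             /* where to store received bytes, may be NULL */
    size_t remaining;        /* bytes not yet received */
    volatile bool complete;  /* set by the receive path when done */
} spi_xfer_t;

/* Start a transfer and return immediately. */
static void xfer_start_async(spi_xfer_t *x, const void *out, void *in, size_t len)
{
    assert(x != NULL && len > 0);
    x->out = out;
    x->in = in;
    x->remaining = len;
    x->complete = false;
}

/* Receive path: called once per arriving byte (from an ISR on hardware). */
static void xfer_on_rx_byte(spi_xfer_t *x, uint8_t byte)
{
    assert(x != NULL);
    if (x->complete || x->remaining == 0) {
        return;                      /* nothing in flight */
    }
    if (x->in) {
        *x->in++ = byte;             /* keep the data */
    }                                /* else: discard, transmit-only use */
    if (--x->remaining == 0) {
        x->complete = true;
    }
}

/* Host stand-in for the bus: loops each outgoing byte straight back. */
static void fake_bus_clock_one(spi_xfer_t *x)
{
    uint8_t tx = x->out ? *x->out++ : 0;
    xfer_on_rx_byte(x, tx);
}

/* Synchronous wrapper: start the async transfer, then wait for it. */
static void xfer_sync(spi_xfer_t *x, const void *out, void *in, size_t len)
{
    xfer_start_async(x, out, in, len);
    while (!x->complete) {
        fake_bus_clock_one(x);   /* on hardware this loop would only wait */
    }
}
```

The point of the split is that the blocking call becomes a thin layer over the non-blocking one, so both paths share the same receive handling.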
 
Got the first version of the async SPI engine running for the radio module on the board.

SPI #1 (two 18 byte blocks) and #2 (one 18 byte block) transmit with the sync transfer wrapper on the test functions. 438us
spi_transfer_bytes


The same as above but with the SPI #1 second block using async transfers. Non-blocking transfers automatically run in parallel, saving processing time in the main task by using interrupts, the 16 byte deep FIFO and asynchronous completion flags in the driver. 308us
spi_transfer_bytes_async


SPI receive ISR processing time per byte interrupt from the fifo buffer, first edge to last edge, minus about 10 ns per toggle for debug pin operations.

C:
/* spi interrupt in single vector sw0 */
static void spi_rx_irq(spi_t bus)
{
    uint8_t rdata __attribute__((unused));
#ifdef _PORTS_P32MZ2048EFM100_H
    PDEBUG1_ON; // FIFO has data
#endif
    while (!((SPIxSTAT(pic_spi[bus]) & _SPI1STAT_SPIRBE_MASK))) {
#ifdef _PORTS_P32MZ2048EFM100_H
        PDEBUG1_TOGGLE; // FIFO has data
#endif
        if (pic_spi[bus].in) {
            *pic_spi[bus].in++ = SPIxBUF(pic_spi[bus]);
        } else {
            /* dump the received data with no callback */
            rdata = SPIxBUF(pic_spi[bus]);
        }
        if (!--pic_spi[bus].len)
            pic_spi[bus].complete = true;
#ifdef _PORTS_P32MZ2048EFM100_H
        PDEBUG1_TOGGLE; // FIFO has data
#endif
    }
    /* time ref toggle */
#ifdef _PORTS_P32MZ2048EFM100_H
    PDEBUG1_OFF; // FIFO has data
    PDEBUG1_ON; // FIFO has data
    PDEBUG1_OFF; // FIFO has data
#endif
}

https://raw.githubusercontent.com/nsaspook/RIOT/PIC32MZEF/cpu/mips_pic32_common/periph/spi.c
 
The cpu driven SPI tx/rx works pretty well up to about a 1MHz clock but at higher transfer rates even the pic32mzef starts to slow down. So the final part of the driver is the DMA engine. Here we've only enabled transmit DMA to make a cpu usage tx/rx comparison using a 10MHz sck with three 18 byte blocks on spi 1&2.


1. Cpu usage (SPIXT) during data transmit. 2. Cpu usage (SPIRT) during receive.

Data stream trace. There are times when almost all the cpu is needed to receive the SPI data stream when both spi channels are transmitting.

traces
1. Cpu time (260ns per block) to transmit two complete 18 byte blocks on spi channels 1&2. The DMA controller is moving the data from the output buffer in uncached memory to the SPIxBUF register in the background.
2. Cpu time to receive 2 bytes each of the transmitted blocks on spi channels 1&2.

With L1 cache enabled, DMA buffers must be accessed through uncached memory in kseg1 so the data the DMA engine sees stays coherent with what the program sees through cached kseg0. You can statically allocate variables and then assign an uncached pointer, or malloc coherent memory for DMA.
C:
    static __inline__ void * __pic32_alloc_coherent(size_t size)
    {
        void *retptr;
        retptr = malloc(size);
        if (retptr == NULL) {
            return NULL;
        }
        /* malloc returns a cached pointer, but convert it to an uncached pointer */
        return __PIC32_UNCACHED_PTR(retptr);
    }

    /*  Access a KSEG0 Virtual Address pointer as uncached (KSEG1) */
#define __PIC32_UNCACHED_PTR(v) __PIC32_KVA0_TO_KVA1_PTR(v)
#define __PIC32_KVA0_TO_KVA1_PTR(v) ((__typeof__(v)*)((unsigned long)(v) | 0x20000000u))

    /*
    * Translate a kernel virtual address in KSEG0 or KSEG1 to a real
    * physical address and back. 
    * using compiler KVA_TO_PA macro
    */
    //#define KVA_TO_PA(v)     ((_paddr_t)(v) & 0x1fffffff)
#define PA_TO_KVA0(pa)    ((void *) ((pa) | 0x80000000))
#define PA_TO_KVA1(pa)    ((void *) ((pa) | 0xa0000000))

Example mapping.
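
The mapping is plain bit arithmetic, so it can be tried out on a host with ordinary integers. A minimal sketch follows, assuming the standard MIPS32 layout where kseg0 starts at 0x80000000, kseg1 at 0xa0000000, and both alias the same low 512MB of physical space. The function names are my own lower-case stand-ins for the macros above and work on `uint32_t` values rather than pointers.

```c
#include <assert.h>
#include <stdint.h>

#define KSEG0_BASE 0x80000000u   /* cached window */
#define KSEG1_BASE 0xa0000000u   /* uncached window */
#define PHYS_MASK  0x1fffffffu   /* low 29 bits are the physical address */

/* kseg0 or kseg1 virtual address -> physical address (what DMA needs). */
static uint32_t kva_to_pa(uint32_t kva)
{
    assert(kva >= KSEG0_BASE && kva < 0xc0000000u);  /* kseg0/kseg1 only */
    return kva & PHYS_MASK;
}

/* physical address -> cached (kseg0) virtual address. */
static uint32_t pa_to_kva0(uint32_t pa)
{
    return (pa & PHYS_MASK) | KSEG0_BASE;
}

/* physical address -> uncached (kseg1) virtual address. */
static uint32_t pa_to_kva1(uint32_t pa)
{
    return (pa & PHYS_MASK) | KSEG1_BASE;
}

/* cached pointer value -> uncached alias of the same memory. */
static uint32_t kva0_to_kva1(uint32_t kva0)
{
    return kva0 | 0x20000000u;
}
```

So a buffer the cpu sees at 0x80001000 (cached) is also visible at 0xa0001000 (uncached), and the DMA registers would be given 0x00001000.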

Testing program and driver fragments.
C:
    uint8_t tdata[20] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18};
    /* allocate buffer memory in kseg1 uncached */
    uint8_t* td = __pic32_alloc_coherent(32);
    uint8_t* rd = __pic32_alloc_coherent(32);
    uint8_t* bd = __pic32_alloc_coherent(32);
...
        /* copy test-pattern data into DMA buffer */
        memcpy(td, tdata, 18);
        /* loop data for engine testing */
        spi_transfer_bytes(SPI_DEV(1), 0, true, td, rd, 18);
        spi_transfer_bytes_async(SPI_DEV(1), 0, true, rd, td, 18);
        spi_transfer_bytes(SPI_DEV(2), 0, true, rd, bd, 18);

        /* cpu busy loop delay */
        for (dd = 0; dd < 100000; dd++) {
            last_wakeup = xtimer_now();
        }

        /* check for spi #1 async transfer complete */
        while (!spi_complete(SPI_DEV(1))) {
        };

These source and destination addresses must then be converted to physical addresses for the actual DMA engine.
C:
void Init_Bus_Dma_Tx1(void)
{
    uint32_t physDestDma;
    /* DMA channel 1 - SPI1 TX. */

    physDestDma = KVA_TO_PA(&SPI1BUF);

    IEC4bits.DMA1IE = 0; /* Disable the DMA interrupt. */
    IFS4bits.DMA1IF = 0; /* Clear the DMA interrupt flag. */
    DMACONbits.ON = 1; /* Enable the DMA module. */
    DCH1SSAbits.CHSSA = physDestDma; /* Source start address placeholder, set per transfer in Trigger_Bus_DMA_Tx1(). */
    DCH1DSAbits.CHDSA = physDestDma; /* Destination start address. */
    DCH1SSIZbits.CHSSIZ = 1; /* Source bytes. */
    DCH1DSIZbits.CHDSIZ = 1; /* Destination bytes. */
    DCH1CSIZbits.CHCSIZ = 1; /* Bytes to transfer per event. */
    DCH1ECONbits.CHSIRQ = EIC_IRQ_SPI_1_TX; /* from board.h defines */
    DCH1ECONbits.SIRQEN = 1; /* Start cell transfer if an interrupt matching CHSIRQ occurs */
    DCH1INTbits.CHBCIE = 0; /* Channel block transfer complete interrupt left disabled. */
    IPC33bits.DMA1IP = 1; /* DMA interrupt priority. */
    IPC33bits.DMA1IS = 0; /* DMA subpriority. */
    IEC4bits.DMA1IE = 0; /* DMA interrupt left disabled. */
}

void Trigger_Bus_DMA_Tx1(size_t len, uint32_t physSourceDma)
{
    DCH1SSAbits.CHSSA = physSourceDma;
    DCH1SSIZbits.CHSSIZ = len;
    DCH1CONbits.CHEN = 1; /* Channel enable. */
}

static inline void _spi_transfer_bytes_async(spi_t bus, spi_cs_t cs, bool cont,
    const void *out, void *in, size_t len)
{
    const uint8_t *out_buffer = (const uint8_t*) out;
    uint8_t *in_buffer = (uint8_t*) in;
    uint32_t physSourceDma;

    assert(bus != 0 && bus <= SPI_NUMOF);

#ifdef _PORTS_P32MZ2048EFM100_H
    PDEBUG3_ON;
#endif
    (void) cs;
    (void) cont;
    /* Translate a kernel (KSEG) virtual address to a physical address. */
    physSourceDma = KVA_TO_PA(out_buffer);

    /* set input buffer params */
    pic_spi[bus].in = in_buffer;
    pic_spi[bus].len = len;
    pic_spi[bus].complete = false;

    switch (bus) {
    case 1:
        Trigger_Bus_DMA_Tx1(len, physSourceDma);
        break;
    case 2:
        Trigger_Bus_DMA_Tx2(len, physSourceDma);
        break;
    default: /* non-dma mode for testing */
        while (len--) {
            if (out_buffer) {
                SPIxBUF(pic_spi[bus]) = *out_buffer++;
                /* Wait until TX FIFO is empty */
                while ((SPIxSTAT(pic_spi[bus]) & _SPI1STAT_SPITBF_MASK)) {
                }
            }
        }
    }

#ifdef _PORTS_P32MZ2048EFM100_H
    PDEBUG3_OFF;
#endif
}

Once the receive DMA is enabled we're ready to fly.
 
Full DMA (four channels) send and receive on spi channels 1 and 2.

Home shop test rig.



Nanoseconds of cpu time to transmit and receive the three 18 byte test streams at 10MHz.


SPI ports 1 and 2 data.

The code works and is a fair example of close to bare-metal MIPS32 programming on RIOT-OS, but it is specific to this board configuration. Some effort could be expended to generalize the setup and data using address macros and structures for other processors and board configurations, but what fun is that when I could have done the exact same thing using MPLABX, xc32 and Harmony in less than an hour using just about any PIC32 processor without a rtos.

https://github.com/nsaspook/RIOT/blob/PIC32MZEF/cpu/mips_pic32_common/periph/spi.c
 
Things seem to be working well so far. Starting a port of a pic24 BLE application that talks to an Android app to the OS port, using one of the mikro BUS sockets for a RN4020 BLE2 click board.
Need to finish a rewrite of the original PIC24 UART routines to be compatible with RIOT-OS for complete transfers of board data, but it pairs and displays the basic device characteristic information.



RN4020 test interface code.
https://github.com/nsaspook/RIOT/tree/PIC32MZEF/examples/rn4020_riot/test.X
 
Still coding up the OS port with general hardware timers and interrupt hardware handlers and rewriting the BLE application to use more riot-os general services. This has the expected side-effect of deoptimizing the application to a specific processor but generalizing the hardware interface for use on a generic riot-os based machine.




MCP3208 (8 12-bit channels) and ADS1220 (4 24-bit channels, single/diff unipolar/bipolar inputs) ADC interface vector board in a mikro socket for driver development, with a RN4020 BLE module on another vector board socket.
I'm mainly using the ADS1220 temp sensor and my hot finger to simulate heartbeat changes in an Android app until a DIY body sensor is made for the 24-bit adc.


SPI signals (logic and analog traces) to the ADS1220 @2MHz clock with 47ohm spi damping resistors in series with the CPU MOSI/MISO lines on the vector board.
 