Hi Gayan,
Using a single serial data pin to load 12 cascaded shift registers should be fine and the code should indeed be much smaller.
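For what it's worth, here is a minimal sketch of the single-pin approach in C. It is only an illustration, not code from the MacMux project. The names `set_data`, `pulse_clock`, `chain`, and `shift_out_chain` are made up for the example, and the pin functions drive a small software model of the cascade so the sketch can be run on a PC. On real hardware they would write port bits, and with latched parts (74HC595-style, which is an assumption on my part) a latch strobe would follow the last bit.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_REGISTERS 12  /* cascaded 8-bit shift registers, as in the post */

/* Software model of the cascade (hypothetical, for illustration only).
   chain[0] is the register nearest the serial data pin. */
static uint8_t chain[NUM_REGISTERS];
static uint8_t data_pin;

/* Hypothetical pin hook: on a real micro this would set the data output. */
static void set_data(uint8_t level)
{
    data_pin = level & 1u;
}

/* Hypothetical pin hook: one clock pulse. In the model every register
   shifts left by one bit and the bit leaving register i enters i+1. */
static void pulse_clock(void)
{
    uint8_t carry = data_pin;
    for (size_t i = 0; i < NUM_REGISTERS; i++) {
        uint8_t out = (uint8_t)((chain[i] >> 7) & 1u);
        chain[i] = (uint8_t)((chain[i] << 1) | carry);
        carry = out;
    }
}

/* Send the last byte first, MSB first, so that buf[0] ends up in the
   register nearest the data pin once all bits have been clocked in. */
void shift_out_chain(const uint8_t *buf, size_t len)
{
    for (size_t n = len; n-- > 0; ) {
        uint8_t b = buf[n];
        for (uint8_t bit = 0; bit < 8; bit++) {
            set_data((b & 0x80u) ? 1u : 0u);
            pulse_clock();
            b = (uint8_t)(b << 1);
        }
    }
}
```

A plain loop like this costs several cycles per bit, but it is tiny compared with unrolled code, which is the trade-off being discussed.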
That MacMux 8x128 display was simply an exercise to see how fast I could load 16 shift registers with 128 bits of data. I used unrolled loops (larger code) and loaded 8 shift register pairs in parallel (pseudo 8-channel SPI) in 64 cycles. That works out to an effective bit rate of 4, 8, or 16 MHz when using an 8, 16, or 32 MHz clock. Another ~280 cycles are used by "data bender" code to build the special 16-byte MacMux array from which the shift registers are loaded. All in all, there are roughly 350 cycles of 'overhead' to refresh each 128-bit row on the display (~2.73 cycles per bit).
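The arithmetic behind those figures can be checked with a few lines of C. This is just a sketch: the function names are invented for the example, and `OSC_PER_CYCLE` assumes one cycle takes 4 oscillator clocks, which is what the quoted numbers imply but is not stated outright above.

```c
#include <stdint.h>

#define ROW_BITS      128u  /* bits per display row */
#define SHIFT_CYCLES  64u   /* cycles to load all 128 bits, 8 channels in parallel */
#define ROW_CYCLES    350u  /* approximate total cycles per row, shifting plus "data bender" */
#define OSC_PER_CYCLE 4u    /* ASSUMPTION: 4 oscillator clocks per cycle */

/* Effective bit rate in bits per second for a given oscillator frequency. */
uint32_t effective_bit_rate_hz(uint32_t fosc_hz)
{
    uint32_t cycle_rate = fosc_hz / OSC_PER_CYCLE;
    return (uint32_t)((uint64_t)cycle_rate * ROW_BITS / SHIFT_CYCLES);
}

/* Total cycles per bit, scaled by 100 to avoid floating point (273 means 2.73). */
uint32_t cycles_per_bit_x100(void)
{
    return ROW_CYCLES * 100u / ROW_BITS;
}

/* Time to refresh one row, in nanoseconds, for a given oscillator frequency. */
uint32_t row_time_ns(uint32_t fosc_hz)
{
    uint32_t cycle_rate = fosc_hz / OSC_PER_CYCLE;
    return (uint32_t)((uint64_t)ROW_CYCLES * 1000000000u / cycle_rate);
}
```

Under that assumption, a 32 MHz clock gives about 43.75 microseconds per row.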
Good luck on your project. Cheerful regards, Mike, K8LH