Take any eight 8-bit shift registers and connect the DAT pin on each one to its own port pin, for example RB0 through RB7. Write eight bytes to this 8-pin bus, pulsing CLK (common to all eight registers) after each byte. The 1st byte should contain all of the bit 7 values, one bit for each shift register. The 2nd byte should contain all of the bit 6 values, one bit for each shift register. And so on through the 8th byte, which contains all of the bit 0 values, one bit for each shift register.
I use a scheme similar to this to load 64 bits for one matrix display "scan" cycle in about 4.8 µs.
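
For anyone who wants to see the byte ordering spelled out, here is a rough sketch in C of how the eight bus bytes could be built from the eight bytes you want to end up in the registers. The names make_bit_planes and clock_in_planes are made up for illustration, and the second function is only a host-side model of the registers (assuming MSB-first shifting, with the first bit clocked in ending up in bit 7), not real port code.

```c
#include <stdint.h>

/* Hypothetical helper: data[n] is the byte destined for the shift
 * register whose DAT pin is on RBn. planes[0] collects bit 7 of all
 * eight bytes, planes[1] collects bit 6, ... planes[7] collects bit 0.
 * Bit n of each plane is the level to put on pin RBn. */
void make_bit_planes(const uint8_t data[8], uint8_t planes[8])
{
    for (int k = 0; k < 8; k++) {
        int bit = 7 - k;            /* plane 0 carries bit 7 */
        uint8_t plane = 0;
        for (int n = 0; n < 8; n++) {
            if (data[n] & (1u << bit))
                plane |= (uint8_t)(1u << n);
        }
        planes[k] = plane;
    }
}

/* Host-side model (an assumption, for checking only): eight shift
 * registers share one clock. On each clock, register n shifts left
 * by one and takes its new low bit from pin n of the bus byte. */
void clock_in_planes(const uint8_t planes[8], uint8_t regs[8])
{
    for (int k = 0; k < 8; k++) {           /* one CLK pulse per plane */
        for (int n = 0; n < 8; n++) {
            regs[n] = (uint8_t)((regs[n] << 1) | ((planes[k] >> n) & 1u));
        }
    }
}
```

On real hardware each planes[k] would be written to the port and followed by one CLK pulse; that part is device-specific and left out here. If the planes are precomputed, the time-critical loop is just eight port writes and eight clock pulses.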