come on, I don't even use assembly and I'm the first one to answer his question for real?
first off, williB, his loopb was in the right spot. if he moved it down a line like you said, it would never check TMR0, it would just continuously decrement the W register until it equaled zero, which would occur in about 16 loop cycles, instead of 16 TMR0 increments, and would be far less than 0.5 seconds.
you have the right idea. the single purpose of the loop is to wait until TMR0 reaches a value of 16. (in this case, by subtracting 16 and seeing when the result equals zero)
with the clock frequency/prescaler you stated, it will be 1/32 second per increment of TMR0 (as you also stated). however, just to be clear, the prescaler applies only to the incrementing of TMR0, not to the execution of your program, so everything else is running at normal clock speed. thus, the loop will run many more than the 16 times that TMR0 increments; it just keeps checking and checking and checking until TMR0 reaches 16.
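if it helps to see the "checking and checking" part, here's a rough simulation in python (not PIC code). the 256 prescaler and the target of 16 come from this thread; the cost of 4 cycles per pass through the loop is just a number I made up for illustration, the real figure depends on the actual instructions in the loop.

```python
PRESCALER = 256       # cycles per TMR0 increment (figure from this thread)
TARGET = 16           # TMR0 value the loop waits for
CYCLES_PER_POLL = 4   # ASSUMED cost of one pass through the loop (hypothetical)


def busy_wait(target=TARGET, prescaler=PRESCALER, cycles_per_poll=CYCLES_PER_POLL):
    """Poll a simulated TMR0 until (TMR0 - target) == 0.

    Returns (number of passes through the loop, cycles elapsed).
    """
    cycles = 0
    polls = 0
    while True:
        tmr0 = (cycles // prescaler) % 256   # TMR0 is an 8-bit counter
        polls += 1
        if tmr0 - target == 0:               # same test as "subtract 16, check for zero"
            return polls, cycles
        cycles += cycles_per_poll            # the program keeps running at full speed


polls, cycles = busy_wait()
print(f"loop ran {polls} times while TMR0 incremented {TARGET} times ({cycles} cycles)")
```

with those assumed numbers the loop body runs roughly a thousand times, even though TMR0 only ticks 16 times.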
and since it waits for TMR0 to increment 16 times, the delay is 16 * 1/32 = 0.5 seconds. so it starts the timer, loops for 0.5 seconds, and then returns to what it was doing before; i.e., a 0.5 second delay routine. so it does not wait 16 CLOCK cycles, it waits for 16 increments of TMR0 (which, with the prescaler, is actually 16*256 = 4096 clock cycles, plus a couple for code overhead).
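for anyone who wants to double-check the arithmetic, here it is as a quick python sketch. the only inputs are the numbers already quoted in this thread (1/32 second per TMR0 increment, 16 increments, prescaler of 256); the cycle rate at the end is just what those numbers imply, not something I looked up for any particular chip.

```python
from fractions import Fraction

SECONDS_PER_TICK = Fraction(1, 32)   # one TMR0 increment (figure from this thread)
TICKS = 16                           # increments the loop waits for
PRESCALER = 256                      # cycles per TMR0 increment (figure from this thread)

delay_seconds = TICKS * SECONDS_PER_TICK        # total wait time
delay_cycles = TICKS * PRESCALER                # same wait, counted in cycles
implied_cycle_rate = PRESCALER / SECONDS_PER_TICK  # cycles per second implied by the above

print(f"delay: {float(delay_seconds)} s = {delay_cycles} cycles")
print(f"implied cycle rate: {implied_cycle_rate} per second")
```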