You are right, and what you say is what gets implemented in a lot of applications, add one more bit to a lfsr than you can get all the numbers.
And you are right it would be harder to predict the sequence.
But it is fair to say that the approach you describe is a way around the problem, and not the circuit I proposed, which solves that problem.
The same concepts can simply apply to a nlprg, add one more bit than it is harder to predict the sequence.
On the topic of what is faster of slower, it depends on the implementation, for a software implementation you are right. In hardware both the circuit (lfsr and nlprg) work at the same speed, since 1 ck cycle is enough to compute the next state, so there is not difference in terms of latency or speed.
Furthermore, a practical implementation of the nlprg algorithm is more safe than a lfsr, since there are not checks on the seed and not error handling.
Even if you add more bits to a lfsr, you need to make sure that the lfsr will never be in the dead-state.
I hope I answered your questions.