For the Kinect, get the SDK from Microsoft or use the open-source OpenKinect library (libfreenect). You can get audio, a colour image frame and a depth image frame (there are also some other features, such as skeleton tracking of a detected person).
For hand shape recognition, first segment the hand from the image (for example with a region-growing method seeded on skin colour). Then describe and match the hand outline with one of the many shape-description algorithms, e.g. learnt Fourier Descriptors (FD), Inner-Distance Shape Context (IDSC), Multiscale Convexity/Concavity representation (MCC), Triangle-Area Representation (TAR), etc.
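To give a feel for the simplest of these, here is a rough sketch of a plain (not learnt) Fourier descriptor in Python with NumPy. It assumes you already have the hand outline as an ordered list of boundary points; the function name `fourier_descriptor` and the parameter `n_coeffs` are made up for this illustration and do not come from any particular library.

```python
import numpy as np

def fourier_descriptor(contour, n_coeffs=8):
    """Illustrative Fourier descriptor of a closed contour.

    contour: (N, 2) array of ordered boundary points (x, y).
    Returns 2 * n_coeffs magnitudes that do not change under
    translation, rotation, scaling or a shift of the start point.
    """
    pts = np.asarray(contour, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 2:
        raise ValueError("contour must have shape (N, 2)")
    if len(pts) < 2 * n_coeffs + 1:
        raise ValueError("contour has too few points for n_coeffs")

    # Treat each boundary point as a complex number x + iy.
    z = pts[:, 0] + 1j * pts[:, 1]
    mags = np.abs(np.fft.fft(z))

    # Skip coefficient 0 (it only encodes position) and keep the
    # lowest positive and negative frequencies. Taking magnitudes
    # discards rotation and start-point information.
    low = np.concatenate([mags[1:n_coeffs + 1], mags[-n_coeffs:]])

    # Divide by the largest kept magnitude to remove scale.
    return low / low.max()

if __name__ == "__main__":
    # Hypothetical test outline: a three-lobed blob versus a circle.
    t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    r = 1 + 0.3 * np.cos(3 * t)
    blob = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
    circle = np.stack([np.cos(t), np.sin(t)], axis=1)
    d = np.linalg.norm(fourier_descriptor(blob) - fourier_descriptor(circle))
    print("descriptor distance blob vs circle:", d)
```

To recognise a hand shape you could compute this descriptor for a few example outlines of each shape and pick the one with the smallest Euclidean distance to the descriptor of the new outline. The other methods listed above are more discriminative but also more involved.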
There are plenty of text-to-speech packages available; just choose one.