When you say "digital video to digital audio," what do you actually mean? Your title, "video to text," confuses me. Do you mean that pictures of words are converted to spoken (audible) words? Or that video clips of text are converted to a text file?
Converting pictures of text to a text file is Optical Character Recognition (OCR), and plenty of software for it, both free and paid, is available.
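Converting pictures of text to audio requires a second piece of software: text-to-speech (TTS), sometimes called a "reader." The OCR program produces the text first, and the text-to-speech program then reads it aloud.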
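
If it helps, the two-step idea (OCR first, then text-to-speech) could be sketched in Python roughly as follows. This is only an illustration, not a recommendation of particular software: it assumes the third-party packages Pillow, pytesseract (plus the Tesseract program itself), and pyttsx3 are installed, and the function names are made up for this example. Any other OCR or reader program could be swapped in.

```python
from typing import Callable


def ocr_with_tesseract(image_path: str) -> str:
    """Step 1 (OCR): turn a picture of text into a text string.

    Assumption: Pillow, pytesseract and the Tesseract binary are installed.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def speak_with_pyttsx3(text: str) -> None:
    """Step 2 (text-to-speech): read the text aloud.

    Assumption: pyttsx3 is installed and a system speech engine is available.
    """
    import pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def read_image_aloud(
    image_path: str,
    ocr: Callable[[str], str] = ocr_with_tesseract,
    speak: Callable[[str], None] = speak_with_pyttsx3,
) -> str:
    """Run OCR on the image, then speak the result if any text was found."""
    text = ocr(image_path).strip()
    if text:  # nothing to read if OCR found no text
        speak(text)
    return text
```

Calling `read_image_aloud("page.png")` (a hypothetical file name) would run both steps in order. The OCR and speech functions are passed in as parameters so that either one can be replaced with different software without touching the other.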