Firstly, big thanks to everyone who's contributed to this list of movies. It's been immensely helpful to me!
Personally, I'm currently only interested in learning to speak Cantonese - I've invested zero time in learning to read, and thus have not learnt any characters. The problem for me is that subtitles are stored as image data rather than text so you cannot simply copy/paste the characters in order to get the romanisation/translation. Therefore, before any of these movies with Cantonese subtitles can be of any real use to me, I must first be able to romanise the subtitles. Fortunately I've found a method to deal with this, at least in the case of removable subtitles (burnt-in subtitles are probably also possible to process but I've not looked into this).
There is a piece of software available called IdxSubOcr which uses OCR (optical character recognition) to process the subtitles for a movie. It outputs a .SRT format subtitle file in which all the characters are stored as text, which can then be manipulated with any text editing program. The accuracy of the conversion is not 100% but it is very high, and any mistakes made should be obvious when going over things. For those already familiar with the process of "ripping" subtitles back to text format, this software is similar to "SubRip" but has proper support for Chinese characters and is much easier to use.
IdxSubOcr uses the Microsoft Office MODI (Microsoft Office Document Imaging) framework for OCR, so a prerequisite is to have Microsoft Office 2003 or 2007 installed, along with the Traditional Chinese language pack. Office 2010 is not supported, as the MODI framework was removed in that release. Once this prerequisite has been met, use of the software is simple. First you need to have the subtitles in "VobSub format", which consists of one .IDX and one .SUB file. Then you run IdxSubOcr and select the .IDX file. You then select the language you wish to process, specify the subtitle colour and let the process begin. After processing has completed, you will be left with a .SRT subtitle file which corresponds to the original image format subtitles. From there you can use various tools to romanise/translate the text as you please.
Please note, however, that the interface of IdxSubOcr is in Chinese and there is no English translation. Fortunately it's rather trivial to operate, and not much guesswork was required on my part to get it working correctly. If anybody else tries this out and needs some help, let me know. Maybe I'll put together a proper tutorial on this some day.
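To give a rough idea of the romanisation step once you have the .SRT file, here is a minimal Python sketch. It is not part of IdxSubOcr. The function names (`romanise_line`, `romanise_srt`) and the three-entry `JYUTPING` table are made up purely for illustration, and a real table would have to be loaded from a full Cantonese dictionary. It also assumes the standard .SRT layout (index line, timing line, then text lines, with blank lines between entries) and that the file has already been read into a string.

```python
import re

# Hypothetical, tiny character-to-Jyutping table for illustration only.
# A real one would be loaded from a full Cantonese dictionary.
JYUTPING = {"你": "nei5", "好": "hou2", "我": "ngo5"}


def romanise_line(text, table=JYUTPING):
    """Romanise one line character by character.

    Characters missing from the table are kept unchanged so that
    gaps (or OCR mistakes) remain visible in the output.
    """
    return " ".join(table.get(ch, ch) for ch in text if not ch.isspace())


def romanise_srt(srt_text, table=JYUTPING):
    """Return the SRT text with a romanised line added under each text line."""
    # Normalise Windows line endings, then split into entries on blank lines.
    srt_text = srt_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", srt_text)

    out_blocks = []
    for block in blocks:
        lines = block.splitlines()
        # Assumed layout: lines[0] is the index, lines[1] the timing,
        # and everything after that is subtitle text.
        new_lines = lines[:2]
        for text in lines[2:]:
            new_lines.append(text)
            new_lines.append(romanise_line(text, table))
        out_blocks.append("\n".join(new_lines))
    return "\n\n".join(out_blocks) + "\n"


if __name__ == "__main__":
    sample = (
        "1\n"
        "00:00:01,000 --> 00:00:02,000\n"
        "你好\n"
    )
    print(romanise_srt(sample))
```

A per-character lookup like this ignores characters that have more than one reading, so treat the output as a first pass to be checked by ear against the film.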
IdxSubOcr homepage: www.comicer.com