Computer vision offers considerable new potential for making machines attentive and responsive to humans. In this paper, a computer vision approach to multimodal human-computer interaction is developed. As shown in the diagram in the previous paper, computer vision takes modalities into account. We consider more than one modality, namely audio and video. This paper gives an overview of the techniques used in various papers and of the work done by different researchers in this field over the last few years. The overview is divided into three parts: first, a review of papers on audio-visual speech recognition systems; second, a discussion of the techniques used in the visual front end, that is, the face; and third, the audio part.