Project P2:
Improved Speech Recognition through Vision Localization

Project Goal

Cars are becoming increasingly complex. Many vehicles have navigation systems as well as advanced climate and entertainment systems. Designing clean, elegant user interfaces for these systems is difficult, especially under the added constraint that the interfaces must be safe for the driver to use in heavy traffic. Speech-enabled interfaces are a natural and highly desired solution to this problem. However, extremely accurate speech recognition is currently possible only with very clean input, which is generally unachievable in automobiles due to noise from the engine, traffic, passengers, climate control, and stereo. Calibrated arrays of microphones can localize one sound among many, but these arrays must still be told where to look.

In this project you will design a system for tracking a human speaker in 3D. The output of your project will be fed into a microphone array to improve speech recognition. To make the project more challenging, see if you can design the system using only one video camera. Use a trained classifier to recognize the location of the human face in the image, then use a hint about the known size of the speaker's face to infer the speaker's distance from the camera. (Returning to the car problem, a driver's face size could be set once and remembered, much like his or her favorite seat position.)

Project Scope

The main deliverable for this project is a visualization depicting both the input video frames and the position of the person in as many degrees of freedom as you can recover (six is the theoretical maximum: three of translation and three of rotation). If you have a special interest in speech recognition (e.g., you are taking CS 224S this quarter), you may use a microphone array and perform speech recognition experiments.

Tasks

  • Acquire Video of a Speaking, Moving Person (use a lab camcorder)
  • Find Features using the Shi-Tomasi Algorithm in OpenCV (1 week; see the first sketch after this list)
  • Localize the Speaker in 2D using a Haar Face Detector (1-2 weeks; see the second sketch)
  • Localize the Speaker in 3D using Face-Size Cues (1 week; see the third sketch)
  • Build Basic Visualization Engine (1 week)
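
A minimal sketch of the feature-finding step, using OpenCV's goodFeaturesToTrack (its implementation of the Shi-Tomasi detector). The video filename and parameter values are placeholders, not part of the assignment:

    import cv2

    # Pull frames from recorded footage and detect Shi-Tomasi corners.
    cap = cv2.VideoCapture("speaker.avi")  # placeholder filename

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # maxCorners / qualityLevel / minDistance are tuning knobs you
        # will want to experiment with on your own footage.
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                          qualityLevel=0.01, minDistance=10)
        if corners is not None:
            for x, y in corners.reshape(-1, 2):
                cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)

        cv2.imshow("Shi-Tomasi features", frame)
        if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()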
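A similar sketch for the 2D localization step, assuming the pretrained frontal-face Haar cascade that ships with the opencv-python package (cv2.data.haarcascades); again, the filename and tuning parameters are assumptions about your setup:

    import cv2

    # Load OpenCV's pretrained frontal-face Haar cascade.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    cap = cv2.VideoCapture("speaker.avi")  # placeholder filename
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # scaleFactor and minNeighbors trade detection rate
        # against false alarms.
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        for (x, y, w, h) in faces:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

        cv2.imshow("Haar face detection", frame)
        if cv2.waitKey(30) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()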
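For the 3D step, the pinhole camera model supplies the depth cue: a face of known width W meters at depth Z projects to w = f * W / Z pixels, so Z = f * W / w. A sketch, where the focal length, principal point, and face width are assumed calibration values (in the car scenario, the face width would be the stored per-driver setting):

    # Pinhole back-projection from a face detection to a 3D position.
    # FOCAL_PX, (CX, CY), and FACE_WIDTH_M are assumed calibration values.
    FOCAL_PX = 700.0        # focal length in pixels
    CX, CY = 320.0, 240.0   # principal point for a 640x480 image
    FACE_WIDTH_M = 0.16     # known width of the speaker's face in meters

    def face_to_3d(x, y, w, h):
        """Map a Haar detection (x, y, w, h) to camera-frame (X, Y, Z) in meters."""
        z = FOCAL_PX * FACE_WIDTH_M / w      # Z = f * W / w
        u, v = x + w / 2.0, y + h / 2.0      # face center in pixels
        return ((u - CX) * z / FOCAL_PX,     # X
                (v - CY) * z / FOCAL_PX,     # Y
                z)

    # Example: a 160-pixel-wide face centered in the image sits 0.7 m away.
    print(face_to_3d(240, 160, 160, 160))  # (0.0, 0.0, 0.7)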
--MIDTERM REPORT--

Select One or More of these Advanced Topics, Depending on Interest:
  • Experiment with More Advanced Features (2-3 weeks)
  • Enhance Depth Tracking using Simple Structure from Motion (2-3 weeks)
  • Use Machine Learning to Estimate Face Size, Eliminating the Need for a Manual Size Cue (2-3 weeks)
  • Do Actual Speech Recognition Experiments (2-3 weeks)

Project Status

Jonathan Frank (jonathan dot frank at stanford),
Scott Cannon (canman at stanford),
Fiona Yeung (fyeung at stanford),
Justin Chan (juschan at stanford)

Point of Contact

David Stavens, Hendrik Dahlkamp

Midterm Report

not yet submitted

Final Report

not yet submitted
