Project P2:
Improved Speech Recognition through Vision Localization
Project Goal
Cars are becoming increasingly complex. Many vehicles have navigation
systems as well as advanced climate and entertainment systems. Designing
clean, elegant user interfaces for these systems is difficult, especially
with the added constraint that the interfaces must be safe for the driver
to use in heavy traffic. Speech-recognition-enabled interfaces are a
natural and highly desirable solution to this problem. However, extremely
accurate speech recognition is currently possible only with very clean
input. Such input is generally unachievable in automobiles due to noise
from the engine, traffic, passengers, climate control, and stereo.
Calibrated arrays of microphones can localize one sound among many, but
these arrays must still be told where to look. In this project you will
design a system for tracking a human speaker in 3D. The output of your
project will be fed into a microphone array to improve speech recognition.
To make the project more challenging, see if you can design the system
using only one video camera. Use a trained classifier to recognize the
location of the human face in the image. Then use a hint about the known
size of the speaker's face to infer the distance of the speaker from the
camera. (Returning to the car problem, a driver's face size could be
set and remembered, much like a favorite seat position.)
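As a minimal sketch of that depth cue, assuming a pinhole camera model:
an object of true width W meters that appears w pixels wide through a
lens of focal length f pixels sits at depth Z = f * W / w. The focal
length and face width below are placeholder values; the real f would
come from camera calibration, and the face width from the stored driver
profile.

    def depth_from_face_width(face_px, focal_px=800.0, face_m=0.15):
        """Pinhole-model depth estimate: Z = f * W / w.

        face_px  -- detected face width in pixels
        focal_px -- focal length in pixels (placeholder; obtain via
                    camera calibration)
        face_m   -- true face width in meters (placeholder; could be
                    stored per driver like a seat position)
        """
        return focal_px * face_m / face_px

    # A 120-pixel-wide face seen at f = 800 px is roughly 1 m away.
    print(depth_from_face_width(120.0))  # 1.0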
Project Scope
The main deliverable for this project is a visualization depicting both
the input video frames and the positions of the person in as many degrees
of freedom as you can recover (6 is the theoretical maximum). If you have
a special interest in speech recognition (e.g., taking CS 224S this quarter),
you may use a microphone array and perform speech recognition experiments.
Tasks
- Acquire Video of Speaking, Moving Person (use a lab camcorder)
- Find Features using the Shi-Tomasi Algorithm in OpenCV (1 week; sketched below)
- Localize the Speaker in 2D using a Haar Face Detector (1-2 weeks; sketched below)
- Localize the Speaker in 3D using Face-Size Cues (1 week; see the depth sketch under Project Goal)
- Build Basic Visualization Engine (1 week; a plotting sketch follows the task list)
--MIDTERM REPORT--
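A rough sketch of the feature-finding and 2D face-localization steps,
written against OpenCV's Python bindings; the video filename is
hypothetical, and the detector parameters are starting points to tune on
your own footage.

    import cv2

    # Haar cascade shipped with OpenCV; the cv2.data path helper is part
    # of the opencv-python package.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    cap = cv2.VideoCapture("speaker.avi")  # hypothetical camcorder clip
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Shi-Tomasi corners (goodFeaturesToTrack in OpenCV).
        corners = cv2.goodFeaturesToTrack(
            gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
        if corners is not None:
            for x, y in corners.reshape(-1, 2):
                cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)

        # 2D speaker localization via the Haar face detector.
        for (x, y, w, h) in face_cascade.detectMultiScale(
                gray, scaleFactor=1.1, minNeighbors=5):
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

        cv2.imshow("tracking", frame)
        if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()

The face rectangle's width w can then be fed to the depth estimate
sketched under Project Goal to recover the third dimension.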
Select One or More of these Advanced Topics, Depending on Interest:
- Experiment with More Advanced Features (2-3 weeks)
- Enhance Depth Tracking using Simple Structure from Motion (2-3 weeks)
- Use Machine Learning to Estimate Face Size, Eliminating the Manual Cue (2-3 weeks)
- Do Actual Speech Recognition Experiments (2-3 weeks)
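For the basic visualization engine, a minimal starting point is to plot
the per-frame (X, Y, Z) estimates; the trajectory below is placeholder
data, and a fuller engine would render the source video frame alongside
the recovered pose, as the deliverable requires.

    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

    # Placeholder per-frame speaker positions in meters, as produced by
    # the face detector plus the face-size depth cue.
    track = [(0.00, 0.00, 1.00), (0.02, 0.00, 1.05),
             (0.05, 0.01, 1.10), (0.09, 0.01, 1.12)]

    xs, ys, zs = zip(*track)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.plot(xs, ys, zs, marker="o")
    ax.set_xlabel("X (m)")
    ax.set_ylabel("Y (m)")
    ax.set_zlabel("Z, depth (m)")
    ax.set_title("Recovered speaker trajectory")
    plt.show()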
Project Status
Jonathan Frank (jonathan dot frank at stanford),
Scott Cannon (canman at stanford),
Fiona Yeung (fyeung at stanford),
Justin Chan (juschan at stanford)
Point of Contact
David Stavens,
Hendrik Dahlkamp
Midterm Report
not yet submitted
Final Report
not yet submitted