We present an algorithm that fuses monocular and stereo cues to obtain robust estimates of both motion and structure. The algorithm assumes that the motion follows a smooth trajectory and that the image sequence is dense. It begins by computing the instantaneous FOE (focus of expansion). Given the FOE, we compute a MAP estimate of the displacement at each pixel together with an associated confidence measure. From these displacement estimates we compute a relative depth map for one of the two frame sequences. By computing the disparities at selected feature points and combining them with the relative depths of those points, we recover the instantaneous component of velocity perpendicular to the image plane (the Z direction). With this velocity a depth map is computed, and this depth map is then used to derive a prior probability distribution on disparity for matching the two frames of each stereo pair. Disparity is estimated independently at each pixel; no smoothness assumption is used. Experimental results on a real image sequence are given.
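The monocular-to-stereo chain described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes a pinhole camera under pure translation, so that image motion is radial from the FOE with magnitude Tz·r/Z, where r is the pixel's distance from the FOE. Under that model, r/|flow| gives relative depth Z/Tz (a time-to-contact in frames). Stereo disparity d = f·B/Z at a few feature points then fixes Tz, which converts relative depth into a predicted disparity for every pixel. The focal length `focal`, baseline `baseline`, the median for robustness, the Gaussian prior with width `sigma`, and all function names are hypothetical choices made for illustration.

```python
import math
from statistics import median


def time_to_contact(pixel, foe, flow):
    """Relative depth Z / Tz (in frames) at one pixel.

    Assumes pure translation: image motion is radial from the FOE with
    magnitude Tz * r / Z, where r is the pixel's distance from the FOE.
    """
    r = math.dist(pixel, foe)
    speed = math.hypot(flow[0], flow[1])
    if speed == 0.0:
        raise ValueError("zero flow: relative depth is undefined")
    return r / speed


def estimate_tz(features, foe, focal, baseline):
    """Estimate the forward velocity Tz from feature points.

    features: list of (pixel, flow, disparity). Stereo gives absolute depth
    Z = focal * baseline / disparity, and Tz = Z / (Z / Tz). The median over
    features is a hypothetical robustness choice.
    """
    estimates = []
    for pixel, flow, disparity in features:
        z = focal * baseline / disparity
        estimates.append(z / time_to_contact(pixel, foe, flow))
    return median(estimates)


def predicted_disparity(pixel, foe, flow, tz, focal, baseline):
    """Disparity predicted at a pixel from the motion-derived depth Z = Tz * tau."""
    z = tz * time_to_contact(pixel, foe, flow)
    return focal * baseline / z


def map_disparity(costs, d_pred, sigma):
    """Choose the disparity minimizing matching cost plus a Gaussian prior energy.

    costs: dict mapping candidate disparity -> negative log-likelihood of the
    match. The Gaussian prior centred on d_pred is an assumed form.
    """
    return min(costs, key=lambda d: costs[d] + (d - d_pred) ** 2 / (2 * sigma ** 2))
```

As a usage example, take an FOE at the origin, with `focal=500.0` and `baseline=0.1`. Two features, at (100, 0) with flow (2, 0) and disparity 10, and at (0, 50) with flow (0, 2) and disparity 20, both give Tz ≈ 0.1 per frame. A pixel at (60, 80) with flow (1.2, 1.6) then has a predicted disparity of about 10. With matching costs `{8: 1.0, 10: 1.2, 12: 0.9}`, the matching cost alone would favor 12, while the prior with `sigma=1.0` selects 10. Each pixel is decided independently, consistent with the absence of a smoothness term.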