Imaging systems are routinely deployed for underwater search, inspection, and scientific surveys of man-made and natural structures. Optical cameras provide high resolution and target detail, but their range is limited by water visibility and they become ineffective in turbid environments. In comparison, high-frequency (MHz) 2-D imaging sonar video systems, introduced to the commercial market in recent years, image targets at distances of tens of meters in highly turbid waters. Visibility permitting, integrating the visual cues in 2-D optical and sonar data would enable better performance than deploying either imaging system alone. We address the problem of motion estimation from 2-D optical and sonar images, e.g., for vision-based navigation and target-based positioning of a mobile submersible platform. Applying the structure-from-motion paradigm in this multimodal imaging scenario also enables 3-D reconstruction of scene features. We rely on tracking features in the sonar and optical motion sequences independently, without the need to establish multimodal association between corresponding optical and sonar features. In addition to improving motion estimation accuracy, the proposed method overcomes certain inherent ambiguities of monocular vision, e.g., the scale-factor ambiguity and, although rare, the existence of up to three interpretations for certain scene structures and camera motions. Experiments with synthetic and real data are presented in support of our technical contribution.