Understanding crowd behavior through automated video analytics has become an important research problem because of the complex challenges involved in monitoring large gatherings. From an automated video surveillance perspective, estimating crowd density in particular regions of the video scene is an indispensable tool for understanding crowd behavior. Crowd density estimation provides a measure of the number of people in a given region at a specified time. While most existing computer vision methods rely on supervised training to arrive at density estimates, we propose an approach that estimates crowd density using motion cues and hierarchical clustering. The proposed method incorporates optical flow for motion estimation, contour analysis for crowd silhouette detection, and clustering to derive the crowd density. The approach has been tested on a dataset collected at the Melbourne Cricket Ground (MCG) and on two publicly available crowd datasets, Performance Evaluation of Tracking and Surveillance (PETS) 2009 and the University of California, San Diego (UCSD) Pedestrian Traffic Database, with different crowd densities (medium- to high-density crowds) and in varied environmental conditions (in the presence of partial occlusions). We show that the proposed approach yields accurate estimates of crowd density. The maximum mean error was 3.62 on the MCG and PETS datasets and 2.66 on the UCSD dataset. The proposed approach delivered superior performance in 50% of the cases on the PETS 2009 dataset when compared with existing methods.
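To make the final clustering step concrete, the following is a minimal sketch of how hierarchical clustering can turn a set of moving-pixel locations into a count of groups. It is not the implementation evaluated in this work. The function name `count_clusters`, the choice of single linkage, the Euclidean distance, and the cut-off `max_link_distance` are all assumptions made for illustration, and the input points stand in for locations that an optical flow and contour analysis stage would supply.

```python
from math import hypot


def count_clusters(points, max_link_distance):
    """Count groups of 2-D points with single-linkage agglomerative clustering.

    Cutting a single-linkage dendrogram at height `max_link_distance` yields
    exactly the connected components of the graph that links every pair of
    points no farther apart than that distance, so a union-find structure is
    enough. Both the linkage and the threshold are illustrative assumptions.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            if hypot(x1 - x2, y1 - y2) <= max_link_distance:
                # Merge the two clusters that contain points i and j.
                parent[find(i)] = find(j)

    # Each distinct root corresponds to one cluster.
    return len({find(i) for i in range(len(points))})


# Hypothetical motion-point locations (pixel coordinates), not real data.
example_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 5)]
example_count = count_clusters(example_points, max_link_distance=2.0)
```

In this sketch the number of clusters serves as a stand-in for a per-region people count. A practical system would also need to choose the threshold with respect to camera perspective, which the sketch ignores.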