Spatial and Motion Saliency Prediction Method using Eye Tracker Data for Video Summarization

 

Abstract

 

Video summarization is the process of extracting the most significant content of a video and representing it in a concise form. Existing video summarization methods do not achieve satisfactory results for videos with camera movement and significant illumination changes. To address these problems, this paper proposes a new video summarization framework based on eye tracker data, since human eyes can track moving objects accurately under these conditions. Smooth pursuit is the eye-movement state in which a viewer follows a moving object in a video. This motivates a new method to distinguish smooth pursuit from the other types of gaze data, namely fixation and saccade. Smooth pursuit provides only the locations of moving objects in a video frame; it indicates neither how attractive (i.e., salient) the located objects are to viewers nor the amount of their motion. The extent of salient regions and the amount of object motion are two important features for measuring the viewer’s attention level and determining the key frames for video summarization. To identify the most attractive objects, a new spatial saliency prediction method is proposed that constructs a saliency map around each smooth pursuit gaze point based on the human visual field, i.e., the foveal, parafoveal, and perifoveal regions. To estimate the amount of object motion, the total distance between the current and previous gaze points of viewers during smooth pursuit is measured as a motion saliency score; the motivation is that the movement of eye gaze follows the motion of the objects during smooth pursuit. Finally, the spatial and motion saliency scores are combined into an aggregated saliency score for each frame, and a set of key frames is selected according to a user-selected or system-default skimming ratio. The proposed method is evaluated on the Office video dataset, which contains videos with camera movement and illumination changes. Experimental results confirm the superior performance of the proposed Spatial and Motion Saliency Prediction (SMSP) method compared to the state-of-the-art methods.
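As a concrete illustration of the two scores described in the abstract, a plausible formalization is sketched below in LaTeX. The symbols (the smooth pursuit gaze points g_i of frame f, the index set SP_f, and the balancing weight alpha) are illustrative assumptions rather than the paper's exact notation.

% Minimal sketch, assuming the gaze points g_i recorded during smooth pursuit
% in frame f form the index set SP_f, hats denote per-video normalization,
% and \alpha \in [0,1] balances the two terms.
\begin{align}
  M_f &= \sum_{i \in SP_f} \lVert g_i - g_{i-1} \rVert_2
      && \text{motion saliency: total gaze displacement in frame } f \\
  S_f &= \alpha\, \hat{S}^{\text{spatial}}_f + (1 - \alpha)\, \hat{M}_f
      && \text{aggregated saliency score of frame } f
\end{align}

Frames with the highest S_f are then retained until the user-selected or system-default skimming ratio is met.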

 

Related Work

 

A number of different approaches have been proposed to summarize videos. Story-driven egocentric video has been summarized by discovering the most influential objects within a video; in [12], object bank and object-like windows are applied to extract objects, which are then utilized to detect objects for story-driven egocentric video summarization. In addition, a Bayesian foraging strategy is applied in [13] to detect objects and their activities for video summarization. Another approach employs a key-point matching based video segmentation method to locate the visual objects in a video, while spatio-temporal slices are applied in [15] to select the states of object motion for video summarization. Moreover, image signature has been applied for foreground object detection and then fused with motion information to summarize egocentric video, and the integral image technique has been used for salient object detection and surveillance video summarization. The Deep Event Network (DevNet) has also been introduced for high-level event detection and spatio-temporal localization of important evidence in user-generated videos.

 

Proposed System

 

The proposed method is based on a new saliency prediction scheme aligned with the human visual system. The main steps of the proposed method are (a) Smooth Pursuit Detection from Eye Tracker Gaze Data, (b) Spatial Saliency Prediction, (c) Motion Saliency Estimation, (d) Saliency Score Generation, and (e) Video Summary Generation. The block diagram of the proposed method is shown in Fig. 1. A human observer watches videos on the display system while the eye tracker monitors the observer's eye movements and records data such as saccades, fixations, and pupil size. Thus, the eye tracking data block takes human feedback and videos as inputs. The subsequent steps are then performed to produce the video summary.
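To make steps (b)-(e) concrete, a minimal Python sketch of the pipeline is given below, assuming the gaze samples of each frame have already been classified as smooth pursuit (step a). The function names, region radii, attention weights, alpha, and the normalization scheme are illustrative assumptions, not the paper's actual parameters.

# Minimal sketch of the SMSP pipeline (steps b-e), assuming gaze samples
# have already been classified as smooth pursuit (step a). Radii, weights,
# alpha, and the normalization are assumptions for illustration only.
import numpy as np

def spatial_saliency(gaze_points, frame_shape, radii=(30, 80, 150),
                     weights=(1.0, 0.6, 0.3)):
    """Saliency map with foveal/parafoveal/peripheral regions around
    each smooth-pursuit gaze point (high/mid/low attention weights)."""
    h, w = frame_shape
    sal = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for gx, gy in gaze_points:
        dist = np.hypot(xs - gx, ys - gy)
        region = np.zeros_like(sal)
        region[dist <= radii[2]] = weights[2]   # peripheral: low attention
        region[dist <= radii[1]] = weights[1]   # parafoveal: mid attention
        region[dist <= radii[0]] = weights[0]   # foveal: high attention
        sal = np.maximum(sal, region)
    return sal

def motion_saliency(gaze_points):
    """Total displacement between consecutive smooth-pursuit gaze points."""
    pts = np.asarray(gaze_points, dtype=np.float32)
    if len(pts) < 2:
        return 0.0
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

def summarize(per_frame_gaze, frame_shape, alpha=0.5, skimming_ratio=0.1):
    """Combine per-frame spatial and motion saliency and pick key frames."""
    spatial = np.array([spatial_saliency(g, frame_shape).sum()
                        for g in per_frame_gaze])
    motion = np.array([motion_saliency(g) for g in per_frame_gaze])
    norm = lambda v: v / v.max() if v.max() > 0 else v  # per-feature scaling
    scores = alpha * norm(spatial) + (1.0 - alpha) * norm(motion)
    k = max(1, int(round(len(scores) * skimming_ratio)))
    return sorted(np.argsort(scores)[-k:].tolist())

if __name__ == "__main__":
    # Two frames with a few smooth-pursuit gaze points each (toy data).
    gaze = [[(100, 120), (110, 125)], [(300, 200), (330, 210), (360, 220)]]
    print(summarize(gaze, frame_shape=(480, 640), skimming_ratio=0.5))

In this sketch the foveal weight simply overrides the parafoveal and peripheral ones; a smoother fall-off, such as a Gaussian centered at each gaze point, would serve the same purpose.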

 

CONCLUSIONS

 

In this paper, an effective and robust framework is proposed for video summarization by combining spatial and motion saliency scores computed from eye tracker data. According to the findings of psychophysical experiments, the human eye can easily perceive objects under camera movement and/or illumination changes. This motivates the use of human gaze data obtained by a Tobii eye tracker to extract the informative content of a video containing illumination changes and/or camera movement and to generate its summary. Among the types of human gaze data (fixation, saccade, and smooth pursuit), humans track moving objects during smooth pursuit. Therefore, only smooth pursuit gaze data are applied in the proposed method; we find that they identify the most informative events within a video. According to the structure of the human retina, the human visual field has three different regions around the gaze point, namely foveal, parafoveal, and peripheral [10]. Human eyes pay high, mid, and low levels of attention in the foveal, parafoveal, and peripheral regions, respectively. Gaze is closely tied to the human visual attention system and has been widely studied; existing works have demonstrated the utility of gaze in object segmentation, action recognition, and action localization. Gaze measurements contain important cues regarding the most salient objects in the scene. A system with a camera can simulate the important cues extracted by the proposed eye tracker based solution and facilitate automatic saliency mapping. Moreover, the knowledge of eye tracker saliency can be used to formulate compressed domain features together with motion and coefficient features for summarization. These two directions are our future work.

 

REFERENCES

[1] M. Haque, M. Murshed, and M. Paul, ‘A hybrid object detection technique from dynamic background using Gaussian mixture models’, IEEE 10th Work. Multimed. Signal Process., pp. 915–920, 2008.

[2] M. Paul, W. Lin, C. T. Lau, and B.-S. Lee, ‘Direct intermode selection for H.264 video coding using phase correlation’, IEEE Trans. Image Process., vol. 20, no. 2, pp. 461–473, Feb. 2011.

[3] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z. Zhou, ‘Multi-View Video Summarization’, IEEE Trans. Multimed., vol. 12, no. 7, pp. 717–729, 2010.

[4] T. V. Nguyen, M. Xu, G. Gao, M. Kankanhalli, Q. Tian, and S. Yan, ‘Static Saliency vs. Dynamic Saliency: A Comparative Study’, ACM Int. Conf. Multimed., pp. 987–996, 2013.

[5] Y. Adini, Y. Moses, and S. Ullman, ‘Face recognition: The problem of compensating for changes in illumination direction’, IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 721–732, 1997.

[6] S. Karthikeyan, T. Ngo, M. Eckstein, and B. S. Manjunath, ‘Eye tracking assisted extraction of attentionally important objects from videos’, in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3241–3250.

[7] W. T. Peng, W. T. Chu, C. H. Chang, C. N. Chou, W. J. Huang, W. Y. Chang, and Y. P. Hung, ‘Editing by viewing: Automatic home video summarization by viewing behavior analysis’, IEEE Trans. Multimed., vol. 13, no. 3, pp. 539–550, 2011.

[8] U. Vural and Y. S. Akgul, ‘Eye-gaze based real-time surveillance video synopsis’, Pattern Recognit. Lett., vol. 30, no. 12, pp. 1151–1159, 2009.

[9] H. Shih, ‘A novel attention-based key-frame determination method’, IEEE Trans. Broadcast., vol. 59, no. 3, pp. 556–562, 2013.

[10] E. R. Schotter, B. Angele, and K. Rayner, ‘Parafoveal processing in reading’, Attention, Perception, Psychophys., vol. 74, no. 1, pp. 5–35, 2012.

[11] D. Gao, V. Mahadevan, and N. Vasconcelos, ‘On the plausibility of the discriminant center-surround hypothesis for visual saliency’, J. Vis., vol. 8, no. 7, pp. 1–18, 2008.

[12] Z. Lu and K. Grauman, ‘Story-Driven Summarization for Egocentric Video’, IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2714–2721, Jun. 2013.

[13] P. Napoletano, G. Boccignone, and F. Tisato, ‘Attentive Monitoring of Multiple Video Streams Driven by a Bayesian Foraging Strategy’, IEEE Trans. Image Process., vol. 24, no. 11, pp. 3266–3281, 2015.

[14] M. Otani, Y. Nakashima, T. Sato, and N. Yokoya, ‘Textual description-based video summarization for video blogs’, in IEEE International Conference on Multimedia and Expo, 2015, pp. 1–6.

[15] Y. Zhang, R. Tao, and Y. Wang, ‘Motion-State-Adaptive Video Summarization via Spatio-Temporal Analysis’, IEEE Trans. Circuits Syst. Video Technol., vol. pp, no. 99, pp. 1–13, 2016.