Title: Efficient visual search of videos cast as text retrieval
Authors: J. Sivic and A. Zisserman
Published: IEEE TPAMI, 2009
Goal: given a query object, find its occurrences in a pre-processed video database, using an approach similar to text retrieval.
It adopts the text retrieval framework, using TF-IDF weighting and stop-word removal.
- a frame vs. a document
- visual word vs. word
- for each keyframe, detect affine covariant regions (Shape Adapted and Maximally Stable) and represent each by a 128-dimensional SIFT descriptor.
- quantize the descriptors into visual words by K-means (SA: k=6000; MS: k=10000).
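
To make the quantization step concrete, here is a minimal sketch of assigning a descriptor to its nearest cluster centre, whose index becomes the visual word id. The toy 2-D points, the `quantize` helper and the plain Euclidean distance are my own assumptions for illustration; real descriptors are 128-dimensional and the centres would come from K-means over the whole video.

```python
import math

def quantize(descriptor, centroids):
    """Return the index of the nearest centroid, used as the visual word id."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(descriptor, centroids[i]))

# Toy 2-D "descriptors" and cluster centres (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frame_descriptors = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8)]

# Each keyframe becomes a list of visual word ids, like a document of words.
frame_words = [quantize(d, centroids) for d in frame_descriptors]
print(frame_words)
```

After this step a keyframe is just a bag of integer word ids, which is what lets the text retrieval machinery apply.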
Use an inverted file as the index for fast retrieval.
The online part:
determine the visual words (vw) within the query region.
use the vw frequencies to first retrieve the top-N keyframes,
then re-rank them by considering the spatial consistency of the region of interest.
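
A rough sketch of how TF-IDF weighting and an inverted file could fit together, assuming each keyframe is already a list of visual word ids. The function names (`build_index`, `search`), the toy frames and the cosine normalisation are my assumptions, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict

def build_index(frames):
    """frames: {frame_id: [word ids]} -> (inverted file, idf, frame vector norms)."""
    postings = defaultdict(dict)  # word id -> {frame_id: term frequency}
    for fid, words in frames.items():
        for w, count in Counter(words).items():
            postings[w][fid] = count / len(words)
    idf = {w: math.log(len(frames) / len(p)) for w, p in postings.items()}
    sq_norms = defaultdict(float)
    for w, p in postings.items():
        for fid, tf in p.items():
            sq_norms[fid] += (tf * idf[w]) ** 2
    norms = {fid: math.sqrt(v) for fid, v in sq_norms.items()}
    return postings, idf, norms

def search(query_words, postings, idf, norms, top_n=10):
    """Rank frames by cosine similarity of TF-IDF vectors, touching only
    the posting lists of words that occur in the query."""
    scores = defaultdict(float)
    q_sq_norm = 0.0
    for w, count in Counter(query_words).items():
        if w not in postings:
            continue
        q_weight = (count / len(query_words)) * idf[w]
        q_sq_norm += q_weight ** 2
        for fid, tf in postings[w].items():
            scores[fid] += q_weight * tf * idf[w]
    q_norm = math.sqrt(q_sq_norm)
    ranked = [(fid, s / (q_norm * norms[fid]))
              for fid, s in scores.items() if q_norm > 0 and norms[fid] > 0]
    ranked.sort(key=lambda item: -item[1])
    return ranked[:top_n]

# Toy database: word 2 occurs in every frame, so its idf is log(1) = 0.
frames = {"f1": [0, 1, 1, 2], "f2": [2, 3, 3], "f3": [4, 4, 2]}
postings, idf, norms = build_index(frames)
results = search([1, 1, 3], postings, idf, norms)
print(results)
```

Only frames that share at least one word with the query are ever scored, which is the point of the inverted file. A word that appears in every frame gets zero weight here, which is similar in spirit to stop-word removal but not the same thing as an explicit stop list.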
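
One possible reading of the spatial consistency re-ranking, as a minimal sketch: each match between the query region and a keyframe collects votes from other matches that fall in its spatial neighbourhood in both images, and keyframes are re-ranked by total votes. The neighbourhood size `k`, the equal vote weights, the helper names and the toy coordinates are all my assumptions, so treat this as an illustration rather than the paper's exact procedure.

```python
import math

def k_nearest(points, idx, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = [i for i in range(len(points)) if i != idx]
    others.sort(key=lambda i: math.dist(points[idx], points[i]))
    return set(others[:k])

def spatial_votes(query_pts, frame_pts, matches, k=2):
    """matches: (query_index, frame_index) pairs that share a visual word.
    A match receives one vote from every other match lying within its
    k-nearest neighbourhood in both the query region and the frame."""
    total = 0
    for qi, fi in matches:
        q_near = k_nearest(query_pts, qi, k)
        f_near = k_nearest(frame_pts, fi, k)
        total += sum(1 for qj, fj in matches
                     if (qj, fj) != (qi, fi) and qj in q_near and fj in f_near)
    return total

# Toy region positions (hypothetical). "good" keeps the query layout,
# "bad" scatters the matched regions far apart among unmatched ones.
query_pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
matches = [(0, 0), (1, 1), (2, 2)]
candidates = {
    "bad": [(0, 0), (20, 0), (0, 20), (1, 0), (0, 1),
            (21, 0), (20, 1), (1, 20), (0, 21)],
    "good": [(10, 10), (11, 10), (10, 11), (15, 15)],
}
votes = {fid: spatial_votes(query_pts, pts, matches)
         for fid, pts in candidates.items()}
reranked = sorted(votes, key=lambda fid: -votes[fid])
print(votes, reranked)
```

Both toy frames contain the same three visual words, so frequency alone cannot separate them; the vote count can. The nearest-neighbour search per match is also where the extra time goes, which is why it makes sense to run it only on the top-N keyframes.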

live demo
The concept is simple and easy to implement (a good paper should be like this).
But the hard part is the data pre-processing... and it's not an interesting part.
I love the re-ranking mechanism with spatial consideration, although it takes time.
This work can also be applied to many other applications if we replace the feature used for each keyframe
(e.g., face features).
