TY - GEN
T1 - Multi-modal clustering for multimedia collections
AU - Bekkerman, Ron
AU - Jeon, Jiwoon
PY - 2007
Y1 - 2007
N2 - Most online multimedia collections, such as picture galleries or video archives, are categorized in a fully manual process, which is very expensive and may soon be infeasible given the rapid growth of multimedia repositories. In this paper, we present an effective method for automating this process within the unsupervised learning framework. We exploit the truly multi-modal nature of multimedia collections: they have multiple views, or modalities, each of which contributes its own perspective to the collection's organization. For example, in picture galleries, image captions are often provided and form a separate view of the collection. Color histograms (or any other set of global features) form another view. Additional views are blobs, interest points and other sets of local features. Our model, called Comraf* (pronounced Comraf-Star), efficiently incorporates various views in multi-modal clustering, which allows great modeling flexibility. Comraf* is a light-weight version of the recently introduced combinatorial Markov random field (Comraf). We show how to translate an arbitrary Comraf into a series of Comraf* models, and give empirical evidence that the two are comparably effective. Comraf* demonstrates excellent results on two real-world image galleries: it obtains 2.5-3 times higher accuracy than uni-modal k-means.
AB - Most online multimedia collections, such as picture galleries or video archives, are categorized in a fully manual process, which is very expensive and may soon be infeasible given the rapid growth of multimedia repositories. In this paper, we present an effective method for automating this process within the unsupervised learning framework. We exploit the truly multi-modal nature of multimedia collections: they have multiple views, or modalities, each of which contributes its own perspective to the collection's organization. For example, in picture galleries, image captions are often provided and form a separate view of the collection. Color histograms (or any other set of global features) form another view. Additional views are blobs, interest points and other sets of local features. Our model, called Comraf* (pronounced Comraf-Star), efficiently incorporates various views in multi-modal clustering, which allows great modeling flexibility. Comraf* is a light-weight version of the recently introduced combinatorial Markov random field (Comraf). We show how to translate an arbitrary Comraf into a series of Comraf* models, and give empirical evidence that the two are comparably effective. Comraf* demonstrates excellent results on two real-world image galleries: it obtains 2.5-3 times higher accuracy than uni-modal k-means.
UR - http://www.scopus.com/inward/record.url?scp=35148817020&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2007.383223
DO - 10.1109/CVPR.2007.383223
M3 - Conference contribution
AN - SCOPUS:35148817020
SN - 1424411807
SN - 9781424411801
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
BT - 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'07
T2 - 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'07
Y2 - 17 June 2007 through 22 June 2007
ER -