This paper addresses the problem of jointly modeling multimedia components in different media forms. We consider the information retrieval task across text and image documents, which includes retrieving images that closely match the description in a text query and retrieving text documents that best explain the content of an image query. A greedy dictionary construction approach is introduced for learning an isomorphic feature space, to which cross-modality data can be adapted while data smoothness is guaranteed. The proposed objective function consists of two reconstruction error terms, one for each modality, and a Maximum Mean Discrepancy (MMD) term that measures the cross-modality discrepancy. Optimizing the reconstruction terms and the MMD term yields a compact and modality-adaptive dictionary pair. We formulate the joint combinatorial optimization problem as maximizing variance reduction over a candidate signal set while constraining the dictionary size and the sparsity of the coefficients. By exploiting the submodularity and monotonicity of the proposed objective function, the optimization problem can be solved by a highly efficient greedy algorithm whose solution is guaranteed to be at least an (e − 1)/e ≈ 0.632 approximation to the optimum. The proposed method achieves state-of-the-art performance on the Wikipedia dataset.
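The greedy strategy behind the (e − 1)/e guarantee can be illustrated with a minimal sketch of greedy maximization of a monotone submodular set function under a size constraint. This is not the paper's dictionary objective: the function names `greedy_select` and `coverage`, and the toy data in `SETS`, are assumptions introduced here purely for illustration, with a simple set-coverage function standing in for the variance-reduction objective.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List

def greedy_select(
    candidates: Iterable[Hashable],
    objective: Callable[[FrozenSet], float],
    budget: int,
) -> List[Hashable]:
    """Pick up to `budget` items, each time adding the one with the largest marginal gain.

    For a monotone submodular `objective`, this greedy rule is known to reach
    at least (e - 1)/e of the optimal value under a cardinality constraint.
    """
    selected: List[Hashable] = []
    remaining = list(candidates)
    current = objective(frozenset())
    for _ in range(budget):
        best_gain, best_item = 0.0, None
        for item in remaining:
            # Marginal gain of adding `item` to the current selection.
            gain = objective(frozenset(selected + [item])) - current
            if gain > best_gain:
                best_gain, best_item = gain, item
        if best_item is None:  # no candidate improves the objective
            break
        selected.append(best_item)
        remaining.remove(best_item)
        current += best_gain
    return selected

# Hypothetical toy data: each candidate "atom" covers a set of signal indices.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}

def coverage(chosen: FrozenSet) -> float:
    """Number of distinct indices covered; a standard monotone submodular function."""
    covered = set()
    for name in chosen:
        covered |= SETS[name]
    return float(len(covered))

picked = greedy_select(SETS.keys(), coverage, budget=2)
print(picked, coverage(frozenset(picked)))
```

In this toy instance the greedy choice happens to coincide with the best pair of candidates; in general only the (e − 1)/e lower bound is guaranteed.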
|Title of host publication||Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management|
|Publisher||Association for Computing Machinery (ACM)|
|Number of pages||10|
|Publication status||Published - 3 Nov 2014|