Meta has introduced Chameleon, a new family of multimodal AI models designed to handle multiple modalities natively.

Unlike the common “late fusion” approach, in which separately trained per-modality components are combined afterward, Chameleon uses an “early-fusion token-based mixed-modal” architecture: images are transformed into discrete tokens, and a single unified vocabulary covers text, code, and image tokens.
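To make the unified-vocabulary idea concrete, here is a minimal sketch of how text tokens and discrete image codes could share one ID space. It is an illustration under stated assumptions, not Chameleon's actual tokenizer: the vocabulary sizes, the `BOI`/`EOI` sentinel tokens, and the `fuse` helper are all hypothetical names and values chosen for readability.

```python
# Illustrative sizes only; these are NOT Chameleon's real vocabulary sizes.
TEXT_VOCAB_SIZE = 1000        # IDs 0..999 are text/code tokens
IMAGE_CODEBOOK_SIZE = 512     # IDs 1000..1511 are image tokens

# Hypothetical sentinel tokens marking where an image starts and ends.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # 1512
EOI = BOI + 1                                 # 1513
UNIFIED_VOCAB_SIZE = EOI + 1                  # 1514


def image_code_to_token(code: int) -> int:
    """Shift an image codebook index into the shared ID space."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def fuse(segments):
    """Flatten ("text", ids) and ("image", codes) segments into one sequence.

    A single model can then be trained on the result with an ordinary
    next-token objective, which is the core of the early-fusion idea.
    """
    tokens = []
    for kind, ids in segments:
        if kind == "text":
            tokens.extend(ids)
        elif kind == "image":
            tokens.append(BOI)
            tokens.extend(image_code_to_token(c) for c in ids)
            tokens.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens


sequence = fuse([("text", [5, 7]), ("image", [0, 511]), ("text", [9])])
print(sequence)
```

In a real system the image codes would come from a learned image tokenizer; here they are simply supplied as integers so the ID-space bookkeeping is easy to follow.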

This allows Chameleon to achieve state-of-the-art performance in tasks like image captioning and visual question answering (VQA), while remaining competitive in text-only tasks.

Chameleon was trained on a dataset of 4.4 trillion tokens, which required significant GPU resources. The resulting model performs well on both text-only and multimodal tasks, surpassing other models on VQA and image captioning benchmarks.

The unified token space lets Chameleon generate interleaved image and text sequences effectively, making it a strong candidate for open-source multimodal AI applications and a potential source of advances in fields such as robotics.
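As a rough sketch of what interleaved generation implies on the decoding side, the snippet below splits a flat token sequence back into text and image segments. Everything here is assumed for illustration: the ID layout (text IDs below 1000, image IDs from 1000 to 1511), the `BOI`/`EOI` sentinels, and the `split_interleaved` helper are hypothetical and are not taken from Meta's implementation.

```python
# Hypothetical ID layout, for illustration only.
TEXT_VOCAB_SIZE = 1000   # text tokens occupy IDs 0..999
BOI = 1512               # assumed begin-of-image sentinel
EOI = 1513               # assumed end-of-image sentinel


def split_interleaved(tokens):
    """Split a mixed token sequence into ("text", ids) / ("image", codes) runs."""
    segments = []
    text_run = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == BOI:
            # Flush any text collected before the image starts.
            if text_run:
                segments.append(("text", text_run))
                text_run = []
            try:
                end = tokens.index(EOI, i + 1)
            except ValueError:
                raise ValueError("image span is missing its end sentinel")
            # Shift image tokens back to raw codebook indices.
            codes = [t - TEXT_VOCAB_SIZE for t in tokens[i + 1:end]]
            segments.append(("image", codes))
            i = end + 1
        else:
            text_run.append(tok)
            i += 1
    if text_run:
        segments.append(("text", text_run))
    return segments


generated = [5, 7, 1512, 1000, 1511, 1513, 9]
for kind, ids in split_interleaved(generated):
    print(kind, ids)
```

Each text run would then go to a text detokenizer and each image run to an image decoder, so a single generated sequence can yield a document that alternates between prose and pictures.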

Meta has recently introduced two major developments in the realm of multimodal models: Chameleon and CM3leon. Chameleon is designed to understand and generate both images and text in any sequence.

It performs strongly across a range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.

Chameleon excels in image captioning, outperforming models like Flamingo, IDEFICS, and LLaVA-1.5.

On text-only benchmarks covering tasks such as commonsense reasoning and reading comprehension, it is competitive with models like Mixtral 8x7B and Gemini-Pro.

Notably, Chameleon unlocks new possibilities in mixed-modal reasoning and generation, handling tasks where the prompt or output contains mixed sequences of images and text.

