<aside>
<aside>
AI systems that can process and generate text in multiple human languages. They are useful in translation, localization, multilingual chatbots, and global search engines.
</aside>
<aside>
<aside>
Shared Embedding Spaces
Words from different languages are mapped into a common vector space (e.g., multilingual BERT, XLM-R).
Language Tags in Prompts
A language tag or instruction in the prompt tells a multilingual LLM which target language to use.
Example: Translate to Spanish: [input]
Transfer Learning Across Languages
Models fine-tuned on high-resource languages can generalize to low-resource ones, because related languages share syntactic and semantic structure in the learned representations.
Mixture of Experts (MoE)
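Routes each input to a small subset of "expert" sub-networks, which can specialize by language, so model capacity grows without a matching increase in compute per input.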
Tokenizer Innovations
Unicode-based or SentencePiece tokenizers work directly on raw text, which lets a single vocabulary support multiple scripts and writing systems.
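The shared embedding space idea above can be illustrated with a toy lookup. This is a minimal sketch, not how multilingual BERT or XLM-R are used in practice: the 3-dimensional vectors, the `EMBEDDINGS` table, and the `nearest_in_language` helper are all invented for illustration. It only shows that when translations sit close together in one vector space, cross-lingual lookup reduces to a nearest-neighbor search.

```python
import math

# Hypothetical 3-dimensional "shared" embeddings; the numbers are made up.
# Real multilingual encoders produce vectors with hundreds of dimensions.
EMBEDDINGS = {
    ("en", "dog"): [0.90, 0.10, 0.00],
    ("es", "perro"): [0.88, 0.12, 0.02],
    ("en", "house"): [0.10, 0.90, 0.10],
    ("es", "casa"): [0.12, 0.85, 0.15],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_in_language(query_key, target_lang):
    """Return the word in target_lang whose vector is closest to the query."""
    query_vec = EMBEDDINGS[query_key]
    candidates = [
        (key, vec) for key, vec in EMBEDDINGS.items() if key[0] == target_lang
    ]
    best_key, _ = max(candidates, key=lambda kv: cosine(query_vec, kv[1]))
    return best_key[1]


print(nearest_in_language(("en", "dog"), "es"))    # prints "perro"
print(nearest_in_language(("en", "house"), "es"))  # prints "casa"
```

The same pattern underlies cross-lingual search: embed the query in one language, then rank documents from any language by similarity.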
</aside>
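The Mixture of Experts item in the list above can be sketched as a gate that scores every expert and runs only the top-scoring ones. This is a simplified illustration under stated assumptions: the gate weights and the three toy experts are hand-written rather than learned, and the `moe_forward` helper is hypothetical. In trained models the gate is learned, and any specialization by language emerges from training rather than being assigned by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route x to the top_k experts and mix their outputs by gate probability."""
    # One gate logit per expert: dot product of x with that expert's gate row.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the weights of the selected experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:
        expert_out = experts[i](x)  # only the selected experts are evaluated
        weight = probs[i] / norm
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


# Toy experts (assumptions for illustration only).
experts = [
    lambda v: [2.0 * a for a in v],   # expert 0: doubles the input
    lambda v: [-a for a in v],        # expert 1: negates the input
    lambda v: [a + 1.0 for a in v],   # expert 2: adds one
]
gate_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

output, chosen = moe_forward([1.0, 0.0], gate_weights, experts, top_k=1)
print(chosen, output)
```

With `top_k=1` only one expert runs per input, which is why total parameter count can grow while per-input compute stays roughly constant.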
<aside>
<aside>
AI systems that understand and process more than one data modality, for example by combining text, images, audio, video, and code.
</aside>
<aside>
<aside>
1. Cross-Attention Fusion
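Lets tokens from one modality attend to features from another, fusing information across modalities (e.g., Flamingo uses cross-attention layers to condition a language model on visual features).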
2. Vision-Language Models (VLMs)
Models trained on image-text pairs.
Examples:
3. Multimodal Transformers
Unified architecture that handles various inputs with modality-specific embeddings.
4. Contrastive Learning
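Used in CLIP and similar models to align image and text representations: embeddings of matching image-text pairs are pulled together, and embeddings of mismatched pairs are pushed apart.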
5. Multimodal Prompting
Combine text + image or other data into prompts.
Example:
"Here's a photo of a street sign. What does the sign say?" → with image and text input.
</aside>
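The contrastive learning item above can be made concrete with a small symmetric loss over a batch of image-text pairs. This is a minimal sketch in plain Python, assuming pre-computed embeddings and a fixed temperature of 0.07 chosen for illustration. The `contrastive_loss` helper and the toy vectors are hypothetical and are not taken from any specific model's code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

print(contrastive_loss(images, matched_texts))   # small: pairs are aligned
print(contrastive_loss(images, shuffled_texts))  # large: pairs are misaligned
```

Minimizing this loss during training is what produces an embedding space where an image and its caption land near each other, which in turn enables similarity search across modalities.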
</aside>
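Cross-attention fusion, the first technique listed above, can be sketched as text queries attending over image patch features. This is a bare-bones illustration under stated assumptions: a single attention head, no learned projection matrices, and toy 2-dimensional vectors. The `cross_attention` helper is hypothetical; real models apply learned query, key, and value projections and use many heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries and keys/values come
    from different modalities (here: text queries over image patches)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this text token to every image patch.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of patch values: the fused representation.
        fused = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(fused)
    return outputs


# One text token that strongly matches the first image patch (toy values).
text_queries = [[10.0, 0.0]]
patch_keys = [[1.0, 0.0], [0.0, 1.0]]
patch_values = [[1.0, 0.0], [0.0, 1.0]]

fused = cross_attention(text_queries, patch_keys, patch_values)
print(fused)  # the output is dominated by the first patch's value
```

Each output row has the shape of a value vector but carries information selected by the text token, which is how one modality conditions on another.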
<aside>
| Model | Modalities | Capabilities |
|---|---|---|
| GPT-4V | Text + Image | Visual reasoning, image captioning |
| Gemini 1.5 | Text + Image + Audio + Video | Multimodal synthesis + long context |
| CLIP | Text ↔ Image | Embedding alignment, similarity search |
| DALL·E | Text → Image | Image generation |
| Whisper | Audio → Text | Speech recognition |
| Flamingo | Image/Video + Text | Visual QA, multimodal chat |
| Kosmos-2 | Text + Image | Grounded understanding (links text to image regions) |
</aside>
<aside>
These systems:
Example Use Case:
Upload a Japanese poster → AI summarizes it in English, identifies images, reads text, suggests cultural context.
Examples:
<aside>
<aside>
</aside>