<aside>

Multilingual AI Systems

<aside>

Definition:

AI systems that can process and generate multiple human languages — useful in translation, localization, multilingual chatbots, global search engines, etc.

</aside>

<aside>

Capabilities:

<aside>

Techniques:

- Shared subword vocabularies (e.g., SentencePiece/BPE) trained across many languages
- Massively multilingual pretraining on text from 100+ languages
- Cross-lingual transfer: fine-tune on a high-resource language, apply zero-shot to others
- Machine translation trained on parallel corpora
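For example, a minimal translation sketch using the Hugging Face transformers pipeline. Helsinki-NLP/opus-mt-fr-en is one of many public per-language-pair checkpoints; swap it to change the pair:

```python
# Minimal translation sketch with Hugging Face transformers.
# The checkpoint below is a public French-to-English model; any other
# opus-mt (or NLLB) checkpoint can be substituted for other pairs.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

result = translator("Bonjour, comment allez-vous ?")
print(result[0]["translation_text"])  # e.g. "Hello, how are you?"
```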
</aside>

</aside>

<aside>

Examples of Multilingual Models:

- mBERT: multilingual BERT, pretrained on ~100 languages
- XLM-R: RoBERTa-style encoder covering 100 languages
- mT5: multilingual T5, trained on text in 101 languages
- NLLB-200: Meta's machine-translation model covering 200 languages
- BLOOM: open multilingual LLM trained on 46 natural languages

</aside>

</aside>

<aside>

Multimodal AI Systems

<aside>

Definition:

AI systems that understand and process more than one data modality — for example, combining text + image + audio + video + code.

</aside>

<aside>

Capabilities:

<aside>

Techniques:

1. Cross-Attention Fusion

Aligns and fuses information across modalities (e.g., Flamingo injects visual features into a frozen language model via gated cross-attention layers; a minimal sketch follows this list).

2. Vision-Language Models (VLMs)

Models trained on image-text pairs.

Examples: LLaVA, BLIP-2, PaLI.

3. Multimodal Transformers

Unified architecture that handles various inputs with modality-specific embeddings.

4. Contrastive Learning

Used in CLIP, ALIGN, and similar models to align image and text representations (see the loss sketch after this list).

5. Multimodal Prompting

Combine text + image or other data into prompts.

Example:

"Here’s a photo of a traffic light. What does the sign say?" → with image and text input.

</aside>

</aside>

<aside>

Examples of Multimodal Models:

| Model | Modalities | Capabilities |
| --- | --- | --- |
| GPT-4V | Text + Image | Visual reasoning, image captioning |
| Gemini 1.5 | Text + Image + Audio + Video | Multimodal synthesis, long context |
| CLIP | Text ↔ Image | Embedding alignment, similarity search |
| DALL·E | Text → Image | Image generation |
| Whisper | Audio → Text | Speech recognition |
| Flamingo | Video + Text | Video QA, multimodal chat |
| Kosmos-2 | Text + Image + OCR | Grounded understanding |
</aside>

</aside>

<aside>

Multilingual + Multimodal Systems

These systems combine both capabilities: they can reason across languages and across modalities in a single workflow.

Example Use Case:

Upload a Japanese poster → the AI summarizes it in English, describes the imagery, reads the embedded text, and suggests cultural context.
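A hypothetical sketch of that workflow, assuming the OpenAI Python SDK's vision-capable chat interface; the model choice, prompt wording, and poster URL are all placeholders:

```python
# Hypothetical poster workflow: send an image plus an English instruction
# to a vision-capable chat model. Model name and URL are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this Japanese poster in English, read any "
                     "embedded text, and note relevant cultural context."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/poster.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)
```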

Examples: GPT-4o and Gemini 1.5 handle many languages across text, image, and audio; SeamlessM4T translates speech and text across roughly 100 languages.

</aside>

<aside>

Challenges

<aside>

- Low-resource languages: far less training data, so quality lags behind English
- Cross-modal alignment: reliably grounding text in images, audio, and video
- Cultural and contextual nuance in translation and image interpretation
- Tokenization overhead: some scripts cost many more tokens per sentence
- Compute and data cost of covering many languages and modalities at once
- Evaluation: high-quality multilingual and multimodal benchmarks are scarce

</aside>

</aside>