[Old] Multi-modal retrieval with CLIP </examples/multi_modal/multi_modal_retrieval.ipynb>
```
### Retrieval-Augmented Image Captioning
Oftentimes understanding an image requires looking up information from a knowledge base. A flow here is retrieval-augmented image captioning - first caption the image with a multi-modal model, then refine the caption by retrieving from a text corpus.