Speaking of CLIP, I'm always worried that the next CLIP might not get released, as both OpenAI and Google are shifting into competition mode. Sad to think there might already be a more advanced version of CLIP sitting in a secret vault somewhere.
Edit: I'm not referring to a CLIP-2 but any advance on the same level of importance as CLIP.
It really depends on what you're trying to achieve. If you want to build a semantic image search, a small/base model would be fine. I think bigger models usually leak too much information, which makes the embedding space too difficult to interpret for a simple algorithm like cosine similarity. If you want to condition a generative model, a bigger model should provide more information about the prompt or the image.
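For anyone unsure what "a simple algorithm like cosine similarity" looks like for search, here is a rough sketch. The vectors are made-up 4-d toy values standing in for real CLIP embeddings, and `cosine_rank` is just a hypothetical helper name, not part of any CLIP library.

```python
import numpy as np

def cosine_rank(query, gallery):
    """Return gallery indices sorted by cosine similarity to the query, best first."""
    # Normalize to unit length so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)  # descending by score
    return order, scores[order]

# Toy vectors (assumed values), one row per "image" embedding.
gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])
# Toy "text query" embedding.
query = np.array([0.9, 0.1, 0.0, 0.0])

order, scores = cosine_rank(query, gallery)
print(order, scores)
```

In a real setup you would swap the toy arrays for the image and text embeddings produced by whichever CLIP variant you picked; the ranking step stays the same regardless of model size.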