Playing with the moondream2 model, which is available on Hugging Face.
We loop through images created from a video snippet and ask the model questions about each one to find out what it sees and get a text description of the content.
Install Libraries
!pip install transformers timm einops pillow
Loop through images
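import os

from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer


def answer_with_moondream(filepath: str):
    # Pin a revision so the remote model code does not change between runs
    model_id = "vikhyatk/moondream2"
    revision = "2024-03-13"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    # Visit files in name order; encode each image once and reuse it for both questions
    for filename in sorted(os.listdir(filepath)):
        # os.path.join works whether or not filepath ends with a separator
        image = Image.open(os.path.join(filepath, filename))
        enc_image = model.encode_image(image)
        print(model.answer_question(enc_image, "Describe this image.", tokenizer))
        print(model.answer_question(enc_image, "What is the girl wearing?", tokenizer))
        print('************************************')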
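The loop above opens every entry in the folder, so a stray non-image file (a hidden system file, a subfolder) would make Image.open fail. A minimal sketch of one way to guard against that is to filter by extension first. The helper name list_image_files and the extension set are assumptions for illustration and are not part of moondream2 or the original notebook.

```python
import os

# Assumed set of extensions; adjust to whatever the frame-extraction step wrote.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_image_files(directory: str) -> list[str]:
    """Return sorted full paths of regular files in `directory` that look like images."""
    paths = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        extension = os.path.splitext(name)[1].lower()
        # Skip subdirectories and anything without a known image extension
        if os.path.isfile(full_path) and extension in IMAGE_EXTENSIONS:
            paths.append(full_path)
    return paths
```

The returned paths could then be passed straight to Image.open in place of the names coming from os.listdir.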
Questions: “Describe this image.” and “What is the girl wearing?”
************************************
The image features a woman walking down a runway in a black dress, with the runway being illuminated by spotlights. She appears to be the center of attention as she confidently struts her stuff. The runway is lined with several chairs, possibly for the audience to sit and watch the models showcase the clothing.
In the background, there are multiple people, some of whom are seated on chairs, while others are standing. The audience members are scattered throughout the scene, observing the models and the runway. The overall atmosphere is one of elegance and sophistication.
The girl is wearing a black dress and gloves.
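Printing works for a quick look, but to reuse the answers later it may help to keep them in a structure keyed by image and question. The following is a rough sketch under that assumption. collect_answers and its ask parameter are hypothetical names; ask stands in for whatever function actually queries the model, so the sketch runs without loading moondream2.

```python
from typing import Callable


def collect_answers(
    image_names: list[str],
    questions: list[str],
    ask: Callable[[str, str], str],
) -> dict[str, dict[str, str]]:
    """Map each image name to a {question: answer} dict.

    `ask(image_name, question)` is a placeholder for the real model call.
    """
    results: dict[str, dict[str, str]] = {}
    for name in image_names:
        results[name] = {question: ask(name, question) for question in questions}
    return results
```

With moondream2, ask would wrap model.answer_question for the given image. Since this sketch calls ask once per question, a real wrapper would probably want to cache the result of model.encode_image per image rather than re-encoding it each time.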