infer_florence_2_caption

About

Version: 1.0.0
License: MIT

Image captioning with Florence-2

Task: OTHER
Tags: Florence, Microsoft, Captioning, Unified, Pytorch

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. This algorithm uses Florence-2 for image captioning.

[Image: Man & dog cozy living room]

Output: 'The image shows a young man sitting at a wooden table in a room with a large window in the background. He is wearing a white long-sleeved shirt and has a beard and dreadlocks. On the table, there is a laptop, a cup of coffee, and a small plant. A dog is lying on the floor next to the table. The room is decorated with potted plants and there is an air conditioning unit on the wall. The overall atmosphere of the room is cozy and relaxed.'

🚀 Use with Ikomia API

1. Install Ikomia API

We strongly recommend using a virtual environment. If you're not sure where to start, we offer a tutorial here.

pip install ikomia

2. Create your workflow

from ikomia.dataprocess.workflow import Workflow

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_florence_2_caption", auto_connect=True)

# Run on your image  
wf.run_on(url="https://images.pexels.com/photos/5749076/pexels-photo-5749076.jpeg?cs=srgb&dl=pexels-zen-chung-5749076.jpg&fm=jpg&w=640&h=960")

# Save output .json
caption_output = algo.get_output(0)
caption_output.save('caption_output.json')

☀️ Use with Ikomia Studio

Ikomia Studio offers a friendly UI with the same features as the API.

  • If you haven't started using Ikomia Studio yet, download and install it from this page.
  • For additional guidance on getting started with Ikomia Studio, check out this blog post.

📝 Set algorithm parameters

  • model_name (str) - default 'microsoft/Florence-2-base': Name of the Florence-2 pre-trained model. Other models available:
    • microsoft/Florence-2-large
    • microsoft/Florence-2-base-ft
    • microsoft/Florence-2-large-ft
  • task_prompt (str) - default 'MORE_DETAILED_CAPTION': Level of detail of the captioning. Other levels available:
    • CAPTION
    • DETAILED_CAPTION
  • max_new_tokens (int): Maximum number of new tokens to generate. The example below sets it to '1024'.
  • num_beams (int) - default '3': Number of beams. A value higher than 1 switches generation from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis with the highest overall probability for the entire sequence. It can therefore identify high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.
  • do_sample (bool) - default 'False': If set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary, with various strategy-specific adjustments.
  • early_stopping (bool) - default 'False': Controls the stopping condition for beam-based methods such as beam search. It accepts the following values: True, where generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and generation stops when it is very unlikely that better candidates will be found; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
  • cuda (bool): If True, run CUDA-based inference (GPU). If False, run on CPU.
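
To make the difference between greedy search and beam search concrete, here is a minimal sketch on a toy next-token table. The table, its probabilities, and the beam_search helper are all hypothetical and written for illustration only. They are not part of Florence-2 or the Ikomia API, and the stopping rule is far simpler than the early_stopping options described above.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the
# previous token. These numbers are made up and do not come from Florence-2.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.40, "man": 0.35, "room": 0.25},
    "the": {"dog": 0.9, "man": 0.1},
    "dog": {"</s>": 1.0},
    "man": {"</s>": 1.0},
    "room": {"</s>": 1.0},
}


def beam_search(num_beams, max_new_tokens=3):
    """Return (tokens, probability) of the best sequence found.

    num_beams=1 is greedy search: only the single best hypothesis
    survives each step.
    """
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_new_tokens):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for logp, tokens in beams:
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [tok]))
        # Keep only the num_beams most probable hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            if cand[1][-1] == "</s>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    best_logp, best_tokens = max(finished or beams, key=lambda c: c[0])
    return best_tokens, math.exp(best_logp)


if __name__ == "__main__":
    for n in (1, 3):
        tokens, prob = beam_search(num_beams=n)
        print(f"num_beams={n}: {' '.join(tokens)} (p={prob:.2f})")
```

With a single beam, the search commits to "a" because it is the most probable first token, and ends on a sequence with probability 0.24. With three beams, "the" is kept alive as a second hypothesis and leads to "the dog" with probability 0.36, the kind of sequence greedy search would have ignored.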

Parameter values must be passed as strings when added to the dictionary.

from ikomia.dataprocess.workflow import Workflow

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_florence_2_caption", auto_connect=True)

algo.set_parameters({
    "model_name":"microsoft/Florence-2-large",
    "task_prompt":"MORE_DETAILED_CAPTION",
    "max_new_tokens":"1024",
    "num_beams":"3",
    "do_sample":"False",
    "early_stopping":"False",
    "cuda":"True"
})

# Run on your image  
wf.run_on(url="https://images.pexels.com/photos/5749076/pexels-photo-5749076.jpeg?cs=srgb&dl=pexels-zen-chung-5749076.jpg&fm=jpg&w=640&h=960")

# Save output .json
caption_output = algo.get_output(0)
caption_output.save('caption_output.json')
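
Since every value in the dictionary is a string, it can be convenient to keep typed values in your own code and convert them just before calling set_parameters. The following is a minimal sketch of that idea. The to_string_params helper is a hypothetical name introduced here and is not part of the Ikomia API.

```python
def to_string_params(params):
    """Convert every parameter value to its string form.

    Hypothetical helper for illustration, not an Ikomia function.
    """
    return {name: str(value) for name, value in params.items()}


# Typed values, as you would naturally write them in Python.
typed_params = {
    "model_name": "microsoft/Florence-2-base",
    "task_prompt": "DETAILED_CAPTION",
    "num_beams": 3,
    "do_sample": False,
    "early_stopping": False,
    "cuda": True,
}

string_params = to_string_params(typed_params)
print(string_params)

# The resulting dictionary could then be passed to the task:
# algo.set_parameters(string_params)
```

Python renders booleans as 'True' and 'False', which matches the string values used in the example above.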

🔍 Explore algorithm outputs

Every algorithm produces specific outputs, yet they can all be explored the same way using the Ikomia API. For a more in-depth understanding of managing algorithm outputs, please refer to the documentation.

import ikomia
from ikomia.dataprocess.workflow import Workflow

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_florence_2_caption", auto_connect=True)

# Run on your image  
wf.run_on(url="https://images.pexels.com/photos/5749076/pexels-photo-5749076.jpeg?cs=srgb&dl=pexels-zen-chung-5749076.jpg&fm=jpg&w=640&h=960")

# Iterate over outputs
for output in algo.get_outputs():
    # Print information
    print(output)
    # Export it to JSON
    output.to_json()
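
Once an output has been exported, the standard json module is enough to look at its structure. The sketch below assumes that to_json() returns a JSON-formatted string (the same approach works on the text of a saved .json file). The summarize_json helper and the sample document are hypothetical: the actual schema of the caption output is not described on this page.

```python
import json


def summarize_json(json_text):
    """Return the top-level structure of a JSON document as {key: type name}.

    Hypothetical helper for illustration, not an Ikomia function.
    """
    data = json.loads(json_text)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object documents (lists, strings, numbers) have no keys to list.
    return {"<root>": type(data).__name__}


# Placeholder document with made-up keys, standing in for a real export.
sample = '{"example_text": "some caption", "example_list": [1, 2]}'
print(summarize_json(sample))

# With a real output, assuming to_json() returns a string:
# print(summarize_json(output.to_json()))
```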

Developer

  • Ikomia

License

A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Permissions

  • Commercial use
  • Modification
  • Distribution
  • Private use

Conditions

  • License and copyright notice

Limitations

  • Liability
  • Warranty

This is not legal advice: this description is for informational purposes only and does not constitute the license itself. Provided by choosealicense.com.