How to Use Ferret, Apple’s Open-Source Multimodal LLM, for Your Next Project

VIVEK KUMAR UPADHYAY
6 min read · Jan 29, 2024


“The most powerful tool we have as developers is automation.” — Scott Hanselman

If you are a developer who is interested in building applications that can leverage both natural language and vision, you might have heard of Ferret, Apple’s open-source multimodal large language model (LLM). Ferret is a new AI model that can refer and ground anything anywhere at any granularity in an image, using any shape of region. It can also answer queries that require semantics, knowledge, and reasoning, using both text and image inputs.

Ferret is based on Vicuna, an open-source instruction-tuned chatbot developed by the LMSYS team on top of LLaMA. Ferret adds a hybrid region representation and a spatial-aware visual sampler to Vicuna, enabling fine-grained and open-vocabulary referring and grounding in multimodal language modeling. Ferret is trained on GRIT, a large-scale, hierarchical, robust ground-and-refer instruction-tuning dataset that Apple and Columbia University created.

In this blog post, we will show you how to use Ferret for your next project on any system with a CUDA-capable GPU. We will cover the following steps:

  • Installing Ferret and its dependencies
  • Downloading the Ferret checkpoints and the Vicuna base model
  • Running Ferret on some example queries
  • Fine-tuning Ferret on your own data

Installing Ferret and its dependencies

To use Ferret, you need to have Python 3.10 or higher installed on your system. You also need to have a GPU that supports CUDA 11.0 or higher. You can check your Python and CUDA versions by running the following commands in your terminal:

python --version
nvcc --version

If you don’t have Python 3.10 or higher, you can download it from python.org. If you don’t have CUDA 11.0 or higher, you can get it from NVIDIA’s CUDA Toolkit downloads page.

Once you have Python and CUDA ready, you need to clone the Ferret repository from GitHub and navigate to the Ferret folder:

git clone https://github.com/apple/ml-ferret.git
cd ml-ferret

Then you need to install the required packages using pip:

pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0

If you want to train Ferret on your own data, you also need to install some additional packages:

pip install ninja
pip install flash-attn --no-build-isolation
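
Before moving on, it can help to confirm that a CUDA-enabled PyTorch build ended up in your environment, since Ferret (like the LLaVA codebase it derives from) runs on PyTorch. A quick sanity check:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints False, reinstall PyTorch with a build that matches your CUDA version before continuing.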

Downloading the Ferret checkpoints and the Vicuna base model

Ferret is based on Vicuna, so you need to download the Vicuna weights first. You can follow the instructions here to download the Vicuna v1.3 weights. You also need to download the first-stage pre-trained projector weight from LLaVA, the open-source multimodal LLM whose codebase Ferret builds on. You can download the projector weight for Ferret-7B here and for Ferret-13B here.
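
One convenient way to fetch the base weights is from the lmsys organization on Hugging Face, where the Vicuna v1.3 checkpoints are hosted. The sketch below assumes the huggingface_hub CLI and the local directory name vicuna-7b used by the commands later in this post:

pip install huggingface_hub
huggingface-cli download lmsys/vicuna-7b-v1.3 --local-dir vicuna-7b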

After downloading the Vicuna and LLaVA weights, you need to download the Ferret checkpoints, which are the delta between the pre-trained model and Vicuna. You can download the checkpoints for Ferret-7B [here] and for Ferret-13B [here]. You can use wget or curl to download the checkpoints, and unzip the downloaded files.
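
Because the checkpoints are deltas, they must be merged with the Vicuna base weights before the model can run. The sketch below is illustrative: the download URL is a placeholder, and the apply_delta module path follows the LLaVA-style convention that the Ferret codebase inherits, so check the repository README for the exact invocation:

# Download and unpack the 7B delta (placeholder URL; use the link from the repo)
wget -O ferret-7b-delta.zip "<ferret-7b-delta-url>"
unzip ferret-7b-delta.zip -d ferret-7b-delta

# Merge the delta into the Vicuna base weights to produce the full Ferret model
python -m ferret.model.apply_delta \
  --base vicuna-7b \
  --target ferret-7b \
  --delta ferret-7b-delta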

Running Ferret on some example queries

Now that you have installed Ferret and downloaded the checkpoints, you can run Ferret on some example queries to see how it works. You can use the scripts provided in the Ferret repository to run the evaluation. You can find the scripts for Ferret-7B [here] and for Ferret-13B [here].

To run Ferret on an example query, you need to provide the query text and the image path as inputs. For example, to ask Ferret to identify the animal in a photo of a ferret (images/ferret.jpg here), you can use the following command:

python scripts/eval.py \
--model_name_or_path ferret-7b \
--checkpoint_path ferret-7b/checkpoint \
--vicuna_path vicuna-7b \
--projector_path projector-7b \
--query "What is the animal in the image?" \
--image_path images/ferret.jpg

Ferret will output the following response:

The animal in the image is a ferret. A ferret is a small, furry mammal that belongs to the weasel family. Ferrets are domesticated animals that are often kept as pets. They are playful, curious, and intelligent.

As you can see, Ferret can not only identify the animal in the image, but also provide some additional information about it. Ferret can also ground its answer by drawing a bounding box around the detected element.

You can also ask Ferret more complex queries that require referring to and grounding multiple elements in an image, or that use different shapes of regions. For example, you can ask Ferret to compare the sizes of two ferrets in a photo (images/two-ferrets.jpg here), using a circular region for the left one and a rectangular region for the right one:

python scripts/eval.py \
--model_name_or_path ferret-7b \
--checkpoint_path ferret-7b/checkpoint \
--vicuna_path vicuna-7b \
--projector_path projector-7b \
--query "Which ferret is bigger, the one in the circle or the one in the rectangle?" \
--image_path images/two-ferrets.jpg \
--region_type circle,rectangle \
--region_coord 0.25,0.5,0.2,0.6,0.4,0.8,0.6

Here the first three values of --region_coord (0.25,0.5,0.2) describe the circle as cx,cy,r, and the remaining four (0.6,0.4,0.8,0.6) describe the rectangle as x1,y1,x2,y2, matching the order given in --region_type. Ferret will output the following response:

The ferret in the rectangle is bigger than the ferret in the circle. The ferret in the rectangle has a longer body and a thicker tail than the ferret in the circle. The ferret in the circle is smaller and more slender than the ferret in the rectangle.

As you can see, Ferret can compare the sizes of the two ferrets using different shapes of regions, and provide detailed explanations for its answer.

Fine-tuning Ferret on your own data

If you want to fine-tune Ferret on your own data, you need to prepare your data in the GRIT format, which is a JSON file that contains a list of examples (a sample entry is shown after the list). Each example has the following fields:

  • image_id: a unique identifier for the image
  • image_path: the path to the image file
  • query: the text query
  • answer: the text answer
  • region_type: the type of the region, either rectangle or circle
  • region_coord: the coordinates of the region, either x1,y1,x2,y2 for rectangle or cx,cy,r for circle
  • bbox: the bounding box of the referred element, in the format of x1,y1,x2,y2
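
As a concrete illustration, a single entry might look like the following. The values are invented to match the two-ferret query above, and the field layout simply mirrors the list in this post, so consult the official GRIT release for the authoritative schema:

[
  {
    "image_id": "0001",
    "image_path": "images/two-ferrets.jpg",
    "query": "Which ferret is bigger, the one in the circle or the one in the rectangle?",
    "answer": "The ferret in the rectangle is bigger than the ferret in the circle.",
    "region_type": "circle,rectangle",
    "region_coord": "0.25,0.5,0.2,0.6,0.4,0.8,0.6",
    "bbox": "0.6,0.4,0.8,0.6"
  }
]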

Once you have your data ready, you can use the same scripts as before to fine-tune Ferret on your data. You just need to specify the --train_file and --validation_file arguments, and adjust the hyperparameters as needed. For example, to fine-tune Ferret-7B on your data, you can use the following command:

python scripts/train.py \
--model_name_or_path ferret-7b \
--checkpoint_path ferret-7b/checkpoint \
--vicuna_path vicuna-7b \
--projector_path projector-7b \
--train_file your_train_data.json \
--validation_file your_validation_data.json \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 2 \
--num_train_epochs 3 \
--max_length 2048 \
--learning_rate 2e-5 \
--weight_decay 0 \
--output_dir your_output_dir

You can monitor the training progress and the validation results using TensorBoard. You can also evaluate the fine-tuned model on your test data using the scripts/eval.py script.
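
To launch TensorBoard, assuming the trainer writes its event logs to your output directory (typical of Hugging Face-style training scripts):

tensorboard --logdir your_output_dir

Then open the local URL it prints (usually http://localhost:6006) in your browser.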

Conclusion

In this blog post, we have shown you how to use Ferret, Apple’s open-source multimodal LLM, for your next project. We have covered the installation, the checkpoints, the evaluation, and the fine-tuning of Ferret. We have also demonstrated some example queries that showcase Ferret’s ability to refer and ground anything anywhere at any granularity in an image, using any shape of region.

We hope you find Ferret useful and interesting, and we encourage you to try it out and share your feedback. For more research-based reports, follow physicsalert.com.

Happy coding! 😊
