From Selfies to AI Masterpieces: Fine-Tuning Flux Model with Personal Images

Kushal Shah
6 min read · Sep 30, 2024


Generative AI tools like DALL·E and Midjourney are widely used by people all over the world to generate a wide variety of fascinating images. While these publicly available tools are truly amazing for generating generic images, they need to be fine-tuned to generate images of a specific person. For example, if I want to generate an AI image that places my own face in an arbitrary setting, I can’t do that using these publicly available tools. Earlier, fine-tuning these models on custom images was hard and expensive, and the results were also not very good. However, the recently released Flux model is a game changer in this domain! Many of the key individuals from the original Stable Diffusion team are behind Black Forest Labs and their flagship model, FLUX.1.

As we will show in this blog, it is relatively easy to fine-tune this model using just about 40 personal images, and the fine-tuned model works like magic! And the cost of fine-tuning is less than USD 2.0 (~ INR 170) using Nvidia H100 GPU hardware. Imagine how much people can save on their pre-wedding shoots!

We fine-tuned the FLUX model using images of Swami Vivekananda, and here are the before and after results for the prompt “A young Indian monk with sunglasses in a cricket ground, holding a bat in his hand.”:

Stable Diffusion Vs. Flux

The FLUX Model Family

Preparing the Image Fine-Tuning Data

To start fine-tuning, you’ll need a collection of at least 40 images of size 512x512 in JPEG or PNG that represent the concept you want to teach the model. Here, by concept we mean the specific person whose face and figure you wish the model to learn. These images should be diverse enough to cover different aspects of the concept (i.e. different angles of the person). For example, if you’re fine-tuning on a specific character, say Swami Vivekananda, include images in various settings, poses, and lighting.

The first step is to get all of our images uniformly cropped so that the training notebook (which only processes square images) doesn’t squash rectangular images into a 1:1 aspect ratio.

For this step, head over to BIRME (Bulk Image Resizing Made Easy) and drag and drop the files from your dataset. Once all your images have uploaded (this might take a minute, depending on the number of images), you’ll see that all but a square portion of each image is greyed out. The link we have provided should have “auto-detect focal point” enabled, which will save you a ton of time over manually choosing what you want included in the square, but you can also make your selections by hand if you wish. When you are satisfied with all the selections, click “Save as Zip.”

Resizing images can also be done easily in Python, but the sliding functionality in BIRME is crucial for proper creation of the dataset. For instance, if we have an image where we want to spotlight the main subject, we need to use the sliding functionality effectively. Take an image that contains many subjects: with BIRME, you can easily slide the crop window to focus on the main subject.

We are choosing to save images as 512x512 squares instead of 256x256 squares, even though our model outputs will be 256x256. This helps improve the output quality.
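If you would rather script this step than use BIRME, here is a minimal center-crop-and-resize sketch using Pillow (the folder names are placeholders, and a plain center crop cannot replicate BIRME’s focal-point selection):

# pip install pillow
from pathlib import Path
from PIL import Image

SRC = Path("raw_images")    # hypothetical folder with your original photos
DST = Path("dataset_512")   # hypothetical output folder for 512x512 crops
DST.mkdir(exist_ok=True)

for img_path in SRC.iterdir():
    if img_path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    img = Image.open(img_path).convert("RGB")
    # Center-crop to a square, then resize to 512x512.
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((512, 512), Image.LANCZOS)
    img.save(DST / f"{img_path.stem}.png")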

After resizing the images, you may need to improve the image quality, depending on how good your source images are. In our case, we were fine-tuning the model using publicly available images of Swami Vivekananda, which were not of very good quality. We tried some of the standard image enhancement techniques, but they did not work very well. So we then used an AI enhancement tool by Picsart, which worked very well for us.

Writing Image Prompts and Structuring Your Dataset

You then need to write a clear prompt for each image, and maintain continuity in the phrase you use to refer to the main protagonist in the image. For example, for our fine-tuning on images of Swami Vivekananda, we used the phrase “A young Indian monk” in all the image prompts, followed by a description of the actual scene. You can also use generative AI tools to generate prompts for each of the images and then manually edit them to perfection.

Trigger Word Functionality

Each prompt can include a special string [trigger]. If a trigger_word is specified, it will replace [trigger] in the captions.

trigger_word:

  • A string to be used in the prompt.
  • If None, [trigger] remains unchanged.
  • If no captions are provided, the trigger_word will be used instead of the prompt.
  • If captions are provided, the trigger_word replaces [trigger].

For example:

  • trigger_word: “A young Indian monk”
  • captions: “This black and white [trigger] image shows a portrait of a man with a contemplative expression.”

Final Output:
“This black and white A young Indian monk image shows a portrait of a man with a contemplative expression.”

This allows for dynamic customization of captions based on the specified trigger word.
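Under the hood this is just a string substitution; here is a minimal sketch of the behaviour described above (our own illustration, not fal.ai’s actual implementation):

def apply_trigger(caption, trigger_word):
    # No caption provided: fall back to the trigger word itself.
    if caption is None:
        return trigger_word
    # Caption provided: substitute the trigger word for [trigger], if one is given.
    if trigger_word is not None:
        return caption.replace("[trigger]", trigger_word)
    # No trigger word: [trigger] remains unchanged.
    return caption

print(apply_trigger(
    "This black and white [trigger] image shows a portrait of a man with a contemplative expression.",
    "A young Indian monk",
))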

Fine-Tuning the Flux Model on a Custom Dataset

Black Forest Labs released the FLUX.1 suite of text-to-image models, which defines a new state of the art in image detail, prompt adherence, style diversity, and scene complexity for text-to-image synthesis. For fine-tuning, we chose fal.ai, which has a pretty simple process.

To get the best performance from this model, make sure each image is provided as a base64 encoded data URL:

  • You can generate these base64 encoded data URLs using this link.
  • You can also generate the base64 data URLs in bulk by providing the path to the directory containing the images and corresponding text files, using this script link.

One thing you need to be very careful about: each image and its corresponding text file should have the same name, for example Swami_Vivekananda(1).png and Swami_Vivekananda(1).txt. These images and prompts then need to be arranged in a JSON format, where each entry in the JSON contains the prompt (text description) and the image_url (base64 encoded image), as sketched below.

  • Make sure the image_url is exactly 512x512 in size.
  • Make sure sync_mode is set to True. This will make sure you also get a base64 encoded data URL back from the API during inference.

You can also use 768x768 or 1024x1024 as your image dimensions, but that will increase the training and inference cost and time.
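To make the expected structure concrete, here is a rough sketch that pairs each image with its same-named .txt caption and writes JSON entries with base64 encoded data URLs (folder and file names are placeholders; check the fal.ai documentation for the exact schema your training endpoint expects):

import base64
import json
from pathlib import Path

DATA_DIR = Path("dataset_512")   # hypothetical folder with Name(1).png and Name(1).txt pairs
entries = []

for img_path in sorted(DATA_DIR.glob("*.png")):
    txt_path = img_path.with_suffix(".txt")      # caption file with the same name
    prompt = txt_path.read_text(encoding="utf-8").strip()
    encoded = base64.b64encode(img_path.read_bytes()).decode("utf-8")
    entries.append({
        "prompt": prompt,
        "image_url": f"data:image/png;base64,{encoded}",   # base64 encoded data URL
    })

Path("training_data.json").write_text(json.dumps(entries, indent=2), encoding="utf-8")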

You can fine-tune the model using this Python code. You just need to replace the URL of the image and the prompt.
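For reference, the general shape of a training submission with the fal_client SDK looks roughly like the sketch below. The endpoint and argument names here are illustrative placeholders and may differ from the actual fal.ai training API, so copy the exact ones from your fal.ai training page:

import fal_client

# NOTE: the endpoint and argument names below are illustrative placeholders.
handler = fal_client.submit(
    "fal-ai/flux-lora-fast-training",   # hypothetical training endpoint
    arguments={
        "images_data_url": "https://example.com/training_data.json",  # the dataset you prepared above
        "trigger_word": "A young Indian monk",
        "steps": 1000,
    },
)

result = handler.get()
print(result)   # the response references your fine-tuned LoRA weights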

You can also do the fine-tuning without writing Python code, by using the fal.ai playground training page. You just have to select the relevant parameters as per your requirements.

Once training is complete, you can publish your model on fal.ai GPUs or Hugging Face.

Inference using the fine-tuned model

There are two options:

  • You can download the weights or publish the model on Hugging Face, where inference will be free of cost (but a bit slow).
  • If you want to host it privately, you can use fal.ai [code link], which will charge you around $0.03 per image for inference.
# pip install fal-client
# Set your API key in the shell before running this script:
#   export FAL_KEY="PASTE_YOUR_FAL_KEY_HERE"
import fal_client

handler = fal_client.submit(
    "fal-ai/lora",
    arguments={
        # Replace "fal-ai/fast-lcm-diffusion" with the name of your fine-tuned model.
        # To find the model name:
        # 1. Go to fal.ai's official website: https://fal.ai/
        # 2. Log in with your credentials.
        # 3. Click on the "Home" icon.
        # 4. You will see your fine-tuned model listed.
        # 5. Click on "Playground" to access the model.
        # 6. Copy the model name and update it in this script.
        "model_name": "fal-ai/fast-lcm-diffusion",
        "num_inference_steps": 50,
        "prompt": "A young Indian monk playing basketball",
    },
)

# Wait for the request to finish and print the response.
result = handler.get()
print(result)

Dataset used for fine-tuning
Public model on HuggingFace

This work was done by my Research Fellow, Nitin Kushwaha.


Written by Kushal Shah

Now faculty at Sitare University. Studied at IIT Madras, and earlier faculty at IIT Delhi. Join my online LLM course : https://www.bekushal.com/llmcourse