


Using OpenCLIP for Image Search and Automatic Captioning

I have been using and writing about OpenAI’s CLIP system since it came out in 2021 [1]. It consists of image and text encoding models that can be used for various forms of cross-modal comparison, like using a text query to find the best matching image in a library quickly.

In December 2022, an independent group of researchers known as LAION released a paper called “Reproducible scaling laws for contrastive language-image learning” [2] that describes how they first reimplemented and trained a model similar to CLIP and then experimented with improving the system by training with a larger dataset and using new ML techniques. They call their new model OpenCLIP.

In this article, I will provide some background info on the original CLIP, describe how LAION improved the model, and show some results from my experiments with the two systems using images from the Library of Congress’s Flickr photostream. I also implemented a cool technique called “embedding arithmetic” from Meta AI to search for photos using both images and text prompts [3].

I’ll end the article with a demo of using a variant of LAION’s OpenCLIP model that automatically generates picture captions. For example, I created the title picture for this article using Midjourney with the text prompt, “high-tech computer system for finding pictures in a large library.” When I sent the image into OpenCLIP, it generated the caption, “a girl is standing in a library looking at a television.” Not bad!
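As a minimal illustration of that kind of cross-modal comparison, here is a sketch using the open_clip package: encode a text query and a set of candidate images into the same space and rank the images by cosine similarity. The model tag, pretrained weights, and file names are illustrative choices, not necessarily the exact configuration used in the experiments described later.

# Minimal sketch of text-to-image matching with OpenCLIP (illustrative names).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]  # placeholder image files
images = torch.stack([preprocess(Image.open(p)) for p in paths])
text = tokenizer(["a girl standing in a library looking at a television"])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(text)
    # Normalize so the dot product is a cosine similarity.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ txt_feats.T).squeeze(1)

print("Best match for the query:", paths[int(scores.argmax())])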





Installing Stable Diffusion

The popular AUTOMATIC1111 stable-diffusion-webui repository provides step-by-step instructions for installing on Linux, Windows, and macOS. We won’t repeat those here, but we will share a few tips if you decide to install on a Mac with an M1 Pro chip. If you are not using an M1 Pro, you can safely skip this section.

Installation on Mac M1 Pro

If Xcode is not fully installed, run this to complete the installation:

xcodebuild -runFirstLaunch 

While the web UI runs fine, there are still certain issues when running this fork on Apple Silicon. All samplers seem to be working now except for “DPM fast” (which returns random noise), and DDIM and PLMS (both of which fail immediately with the error: “AssertionError: Torch not compiled with CUDA enabled”).

Install Homebrew

First, install Homebrew itself (along with the Xcode command-line tools), which we will use to install the required dependencies.

xcode-select --install
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/benny.cheung/.bash_profile
eval "$(/opt/homebrew/bin/brew shellenv)"
brew -v

Install Rosetta 2

Rosetta 2 is the magic that automatically translates x86_64 (Intel) code to run on arm64.

sudo softwareupdate --install-rosetta --agree-to-license 

Install Stable Diffusion Requirements

Install the required build dependencies with Homebrew:

brew install cmake protobuf rust python git wget 

The script can be downloaded from here, or follow the instructions below.

  1. Open Terminal.app
  2. Run the following commands:
$ cd ~/Documents/
$ curl https://raw.githubusercontent.com/dylancl/stable-diffusion-webui-mps/master/setup_mac.sh -o setup_mac.sh
$ chmod +x setup_mac.sh
$ ./setup_mac.sh

After installation, you’ll find run_webui_mac.sh in the stable-diffusion-webui directory. Run this script to start the web UI with ./run_webui_mac.sh . It automatically activates the conda environment, pulls the latest changes from the repository, and starts the web UI. On exit, the conda environment is deactivated.

Post-installation notes for Homebrew:

After the installation, if you don’t want PostgreSQL and Redis to start automatically every time the Mac reboots, stop their services:

brew services stop postgresql@14
brew services stop redis

How Stable Diffusion Works

Stable Diffusion is a text-to-image generation model. You enter a text prompt like,

half (realbenny-t1 | yoda) person, star war, art by artgerm and greg rutkowski and alphonse mucha 

Steps: 30, Sampler: LMS, CFG scale: 7, Seed: 3711471449, Size: 512×512, Model hash: 7460a6fa, Denoising strength: 0.75, Mask blur: 4

This prompt combines:

  • realbenny-t1 (a self-trained embedding) and yoda (a concept the default model already knows), rendered as a person
  • a background and theme from Star Wars
  • the art styles of artgerm, greg rutkowski, and alphonse mucha

It outputs a 512×512 image like the one below, which looks like a hybrid creature of myself and Yoda!

Personal Embedding Mixture Example
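For readers who prefer a scriptable route over the web UI, roughly the same prompt and settings can be sketched with Hugging Face’s diffusers library. This is only an illustration, not the AUTOMATIC1111 workflow: the checkpoint name is an assumption, diffusers does not interpret the web UI’s (a | b) alternation syntax, and the realbenny-t1 embedding would have to be trained and loaded separately (see the embedding example later in this post).

# Rough diffusers equivalent of the prompt and settings above (sketch only).
import torch
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("mps")  # "cuda" on NVIDIA GPUs, "cpu" otherwise

# "Sampler: LMS" roughly corresponds to the LMS discrete scheduler.
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

generator = torch.Generator().manual_seed(3711471449)  # Seed: 3711471449
image = pipe(
    "half (realbenny-t1 | yoda) person, star war, "
    "art by artgerm and greg rutkowski and alphonse mucha",
    num_inference_steps=30,  # Steps: 30
    guidance_scale=7,        # CFG scale: 7
    width=512, height=512,   # Size: 512x512
    generator=generator,
).images[0]
image.save("benny_yoda.png")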

High-level view

There are three parts to the system:

  • A language model that transforms the text prompt you enter into a representation the diffusion model can use via cross-attention. They use an off-the-shelf BERT tokenizer with a transformer for this part.
  • The diffusion model, a time-conditional U-Net. It takes your text prompt representation and a source of Gaussian noise as input and removes noise from the latent, moving it closer to your text prompt. This is repeated multiple times; these iterations are known as time steps.
  • A decoder that takes the output of the diffusion model and upsamples it to a full image. The diffusion model operates on a 64×64 latent, and the decoder upscales this to a 512×512 image, aiming to preserve the characteristics of the 64×64 input. (A minimal code sketch of these three parts follows the figure below.)

Figure: How Stable Diffusion works. The language model creates an embedding of the text prompt, which is fed into the diffusion model together with random noise. The diffusion model denoises it toward the embedding over several repetitions, and the decoder then scales the image up to a larger size.
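To make these three parts concrete, here is a minimal sketch that loads each of them individually with the Hugging Face diffusers and transformers libraries. The checkpoint name is an assumption; note that the released Stable Diffusion v1.x checkpoints ship a CLIP text encoder as the language model.

# The three Stable Diffusion components, loaded individually (sketch).
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint

# 1. Language model: turns the prompt into an embedding for cross-attention.
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# 2. Diffusion model: a time-conditional U-Net operating on 64x64 latents.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# 3. Decoder: the VAE that turns the final 64x64 latent into a 512x512 image.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")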

See Marc Päpper’s post for a great explanation of the technical training concepts, which are not covered in this article.

From Noise to Realistic Image

All the parts of the Stable Diffusion architecture have been trained on a massive amount of images and text, creating embeddings that cover much of our human semantic space. When concepts are combined in a new text prompt, they are merged into a new representation that covers the input concepts. The latent diffusion model is trained to uncover an image out of noise, but guided by the embeddings from the autoencoder and the text model, the image ends up being a combination of the concepts given in the prompt. The decoder then scales the result up into a high-resolution image.

The images are created from “noise”, data that contains no useful information. The model uses a process called “diffusion” to generate the output images: it starts from a randomly generated image of pure noise that contains no visual elements at all and gradually removes that noise. When the model is first presented with a text prompt, it has no prior picture of what the target image should look like, so this featureless noise is its starting point.

That means the initial image is just “noise”. The model does not look up matching images; instead, it gradually shapes the noise toward the concepts named in the prompt. It starts with broad patterns learned during training. For example, it learned that certain colors are more commonly associated with certain types of images, and it uses these broad patterns to lay out an image with the appropriate colors. Over successive steps it picks up on more specific patterns and continues to refine the image.
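Putting that process into code, the denoising loop looks roughly like this. It is a self-contained sketch, so it repeats the component loading from the earlier example; classifier-free guidance and device placement are omitted for brevity, which is why a real run through the full pipeline would give much better results.

# Sketch of the iterative denoising loop: start from pure Gaussian noise and
# let the U-Net remove a little predicted noise at each time step, guided by
# the prompt embedding. Checkpoint name is an assumption.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Encode the prompt into the representation used for cross-attention.
tokens = tokenizer("half person, half yoda, star war", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]

# Pure noise in the 64x64 latent space; it contains no information yet.
latents = torch.randn(1, unet.config.in_channels, 64, 64)
scheduler.set_timesteps(30)
latents = latents * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    # Each step removes a bit of the predicted noise, moving the latent
    # closer to an image that matches the prompt.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the final 64x64 latent into a 512x512 image tensor.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample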

The magic is that the Stable Diffusion architecture learned overlapping visual concepts during its training. It can combine concepts like “half man / half Yoda” into a full image because it learned each individual piece of that concept during training. The architecture is also able

  • to use specific concepts, like Star Wars or Yoda, because it trained on a large number of images of those concepts
  • to use an additional trained embedding, realbenny-t1, on top as a new concept

As the latent passes through each denoising step, the model picks up on more and more patterns, resulting in a more detailed image. This continues until the end of the diffusion process, at which point the model can produce a highly realistic image that closely matches the user’s input.

At the end of the process, the model can produce highly realistic images that match the text prompt. By combining all of these learned concepts, it can create new images using the knowledge gained during training. (Note that “textual inversion”, discussed below, refers to training a new embedding from example images, not to this text-to-image generation process.)

Before we get into the training process for a personal embedding model, let’s discuss the difference between an embedding and a hypernetwork.

Difference between Embedding and Hypernetwork

The embedding layer in Stable Diffusion is responsible for encoding the inputs (for example, the text prompt and class labels) into low-dimensional vectors. These vectors help guide the diffusion model to produce images that match the user’s input. A hypernetwork, in contrast, is a small additional network that modifies the model’s behavior, allowing Stable Diffusion to be steered toward images it has been additionally trained on.

Embedding

  • Embedding: the result of textual inversion. Textual inversion tries to find a specific prompt for the model that creates images similar to your training data. The model stays unchanged, and you can only get things that the model is already capable of. So an embedding is basically just a “keyword” which will internally be expanded into a very precise prompt.

More generally, an embedding maps the features of objects into a vector space. For instance, in a machine learning task, a training set might consist of feature vectors representing the objects the system is trying to learn about. The embedding technique converts each object’s features into a vector of numbers, which can then be used as input to another machine learning algorithm, such as a neural network.
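As an illustration, loading a learned embedding and triggering it with its keyword might look like this with the diffusers API. The checkpoint, file name, and trigger word (realbenny-t1) are placeholders, not the exact files used for this post.

# Hypothetical example: load a textual-inversion embedding and trigger it
# with its keyword in a prompt (names are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("mps")

# "realbenny-t1.pt" stands in for an embedding trained on your own photos.
pipe.load_textual_inversion("./realbenny-t1.pt", token="realbenny-t1")

# The keyword is internally expanded into the learned embedding vector(s).
image = pipe("portrait of realbenny-t1 as a jedi, art by artgerm").images[0]
image.save("self_portrait.png")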

Hypernetwork

  • Hypernetwork: an additional small network that is applied on top of the model during generation. The hypernetwork skews all results from the model toward your training data, effectively “changing” the model, with a small file size of roughly 80 MB per hypernetwork.

In Stable Diffusion, a hypernetwork retains what it learned from the images it was trained on. This means that when the user provides a new prompt, the system can use that knowledge to create images closer to the training data. Hypernetworks are relatively quick to train and allow the system to be adapted over time.

Advantages and Disadvantages

The advantage and the disadvantage are basically the same: every image containing something that resembles your training data will look like your training data. If you trained on a specific cat, you will have a very hard time getting any other cat while the hypernetwork is active. It does, however, seem to rely on keywords already known to the model.

Our use case is generating self-portraits from text prompts. We found that training a personal embedding gave faster and better results than a hypernetwork: we noticed fewer errors from the AI when generating realistic images, and our test subjects found the resulting images more desirable.
