CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

LongCat-Image is a text-to-image generation model built on diffusion transformers, deployed as a Hugging Face Space with a Gradio interface. The model is based on the Flux architecture and supports both text-to-image generation and image editing.

Running the Application

```shell
# Install dependencies
pip install -r requirements.txt

# Run the Gradio app locally
python app.py
```

The app launches with the MCP server enabled on the default Gradio port (7860).
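The launch step can be sketched as below. This is a minimal illustration, not the repo's actual `app.py`: the helper name is invented, and it assumes a Gradio version whose `launch()` accepts `mcp_server=True`.

```python
# Sketch of the launch configuration (assumes gradio >= 5, where
# launch() accepts mcp_server=True; helper name is illustrative).
def launch_kwargs(port: int = 7860, mcp: bool = True) -> dict:
    """Keyword arguments that would be passed to demo.launch()."""
    return {"server_name": "0.0.0.0", "server_port": port, "mcp_server": mcp}

# In app.py (not executed here):
#   demo.launch(**launch_kwargs())
```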

Architecture

Core Components

Transformer Model (longcat_image/models/longcat_image_dit.py):

  • LongCatImageTransformer2DModel: DiT-based transformer using Flux architecture
  • Uses FluxTransformerBlock (19 layers) and FluxSingleTransformerBlock (38 layers)
  • Supports gradient checkpointing for memory efficiency
  • Position embeddings via FluxPosEmbed with RoPE
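The block layout above can be summarized as a config sketch. The key names mirror diffusers' Flux-style conventions but are assumptions here, not LongCat-Image's actual config schema.

```python
# Hypothetical config mirroring the Flux-style layout described above
# (19 joint blocks + 38 single blocks); key names are illustrative.
FLUX_STYLE_CONFIG = {
    "num_layers": 19,         # FluxTransformerBlock count
    "num_single_layers": 38,  # FluxSingleTransformerBlock count
    "in_channels": 64,        # 16 latent channels x 2x2 patch packing
}

def total_blocks(cfg: dict) -> int:
    """Total transformer blocks in the stack."""
    return cfg["num_layers"] + cfg["num_single_layers"]
```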

Pipelines (longcat_image/pipelines/):

  • LongCatImagePipeline: Text-to-image generation with optional prompt rewriting
  • LongCatImageEditPipeline: Image editing with vision-language conditioning
  • Both pipelines inherit from DiffusionPipeline and support LoRA, CFG renorm, and VAE tiling/slicing
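A typical way to wire up those shared features is sketched below. `enable_vae_tiling`, `enable_vae_slicing`, and `load_lora_weights` are standard diffusers pipeline methods; whether LongCat's pipelines expose all of them exactly this way is an assumption.

```python
# Hedged sketch of the memory/LoRA toggles mentioned above (assumes
# standard diffusers pipeline methods are available on these pipelines).
def configure_pipeline(pipe, lora_path=None):
    pipe.enable_vae_tiling()   # decode large latents tile by tile
    pipe.enable_vae_slicing()  # decode batch elements one at a time
    if lora_path is not None:
        pipe.load_lora_weights(lora_path)
    return pipe
```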

Text Encoding:

  • Uses Qwen-based text encoder with chat template formatting
  • Prompt template wraps user input between <|im_start|> and <|im_end|> tokens
  • Maximum token length: 512
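The template wrapping can be illustrated as follows. The role layout is a guess at a generic Qwen-style chat template; the exact template text used by LongCat-Image may differ.

```python
# Illustrative Qwen-style chat-template wrapping; the exact roles and
# template text used by LongCat-Image are assumptions.
MAX_TOKENS = 512

def wrap_prompt(user_prompt: str) -> str:
    """Wrap a raw prompt between <|im_start|> / <|im_end|> tokens."""
    return (
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```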

Key Configuration

  • VAE scale factor: 8 (with 2x2 patch packing, effective 16x)
  • Default sample size: 128 (1024px at 8x scale)
  • Latent channels: 16
  • Image dimensions must be divisible by 32
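The numbers above fit together like this: 8x VAE downsampling plus 2x2 patch packing gives an effective 16x reduction, and the extra factor of 2 in each axis is why pixel dimensions must be divisible by 32. A worked sketch (helper names are illustrative):

```python
# Worked example of the configuration above: 1024px -> 128 latent grid
# (8x VAE) -> 64x64 packed tokens with 16 * 2 * 2 = 64 channels.
VAE_SCALE = 8
PATCH = 2

def packed_latent_shape(height: int, width: int, channels: int = 16):
    """Return (channels, h, w) of the packed latent for a pixel-space size."""
    if height % 32 or width % 32:
        raise ValueError("height and width must be divisible by 32")
    h, w = height // VAE_SCALE, width // VAE_SCALE
    return channels * PATCH * PATCH, h // PATCH, w // PATCH
```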

Prompt Rewriting

The pipeline includes built-in prompt engineering via rewire_prompt() that uses the text encoder to expand simple prompts into detailed descriptions. This can be disabled with enable_prompt_rewrite=False.

External prompt polishing is also available via utils/prompt_utils.py using Hugging Face Inference API (requires HF_TOKEN).
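The external path could look roughly like the sketch below. `utils/prompt_utils.py` is the real implementation; the instruction text and message layout here are placeholders, and only the `huggingface_hub.InferenceClient.chat_completion` call is a known API.

```python
import os

# Hedged sketch of prompt polishing via the Hugging Face Inference API;
# the instruction text and message layout are illustrative placeholders.
POLISH_INSTRUCTION = (
    "Expand the following image prompt into a detailed description. "
    "Reply with the expanded prompt only."
)

def build_polish_messages(prompt: str) -> list[dict]:
    """Chat messages for the polishing request."""
    return [
        {"role": "system", "content": POLISH_INSTRUCTION},
        {"role": "user", "content": prompt},
    ]

def polish_prompt(prompt: str) -> str:
    from huggingface_hub import InferenceClient  # requires HF_TOKEN
    client = InferenceClient(token=os.environ["HF_TOKEN"])
    out = client.chat_completion(messages=build_polish_messages(prompt))
    return out.choices[0].message.content
```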

Model Loading

```python
import torch

from longcat_image.models import LongCatImageTransformer2DModel
from longcat_image.pipelines import LongCatImagePipeline

MODEL_REPO = "meituan-longcat/LongCat-Image"

transformer = LongCatImageTransformer2DModel.from_pretrained(
    MODEL_REPO, subfolder='transformer', torch_dtype=torch.bfloat16
)
pipe = LongCatImagePipeline.from_pretrained(MODEL_REPO, transformer=transformer)
```

Environment Variables

  • HF_TOKEN: Required for prompt polishing via external API