X Draw inherits best practices from DALL-E 2 and latent diffusion while introducing some new ideas. It uses CLIP as both the text and image encoder and employs a diffusion image prior that maps between the latent spaces of the two CLIP modalities.
This approach improves the model's visual quality and opens up new possibilities for blending images and manipulating them through text. For the diffusion prior over the latent space, we use a transformer with 20 layers, 32 heads, and a hidden size of 2048.
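As a rough sanity check, the prior's stated hyperparameters are consistent with a transformer of about one billion parameters. The sketch below is a back-of-the-envelope estimate, not the actual implementation: it assumes a standard transformer block with a 4x feed-forward expansion and ignores embeddings, biases, and layer norms (all of these assumptions are ours, not details from the model).

```python
# Back-of-the-envelope parameter count for a transformer with
# 20 layers and hidden size 2048, as described above.
# Assumes standard blocks: 4 attention projections (Q, K, V, out)
# plus a feed-forward network with a 4x expansion factor.

def transformer_params(n_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    attn = 4 * d_model * d_model             # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model * d_model   # up- and down-projection matrices
    return n_layers * (attn + ffn)

total = transformer_params(n_layers=20, d_model=2048)
print(f"{total / 1e9:.2f}B parameters")  # on the order of 1B
```

The head count does not affect this estimate, since the attention projections have the same total size regardless of how they are split across heads.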
X AI is a text-to-image diffusion model that lets users create photorealistic images quickly and easily from a text prompt alone. This technology gives anyone the ability to quickly create beautiful works of art, offering broad creative freedom.
Text Encoder: ViT-L/14 - 480M
Diffusion Image Prior: 1B
Image Encoder: CLIP (ViT-L/14) - 480M
Latent Diffusion U-Net: 1.22B
MoVQ
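For reference, summing the component sizes listed above gives a rough overall parameter count. The snippet below is just that arithmetic; the MoVQ component is excluded because its size is not stated.

```python
# Component sizes as listed above, in parameters.
# MoVQ is omitted since no parameter count is given for it.
components = {
    "text_encoder_vit_l14": 480e6,
    "diffusion_image_prior": 1e9,
    "clip_image_encoder_vit_l14": 480e6,
    "latent_diffusion_unet": 1.22e9,
}

total = sum(components.values())
print(f"total (excluding MoVQ): {total / 1e9:.2f}B")  # 3.18B
```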
This model was designed to create portraits that look like actual paintings rather than CG or heavily filtered photos. It can also produce stunning backgrounds and anime-style characters. One tip: use LoRA networks to make anime-style images.
All version 4 images were generated with CLIP skip set to 2 and an ENSD of 31337. Additionally, all images were upscaled to a higher resolution using highres.fix or img2img.
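CLIP skip, ENSD, and highres.fix are settings from the AUTOMATIC1111 Stable Diffusion WebUI. The sketch below shows how the settings above might be expressed as a txt2img API payload for that WebUI; the field names follow its API conventions but may vary across versions, and the prompt and scale values are illustrative placeholders, not values from this post.

```python
# Illustrative txt2img payload for an AUTOMATIC1111-style WebUI API,
# reproducing the settings described above. Field names are assumptions
# based on that WebUI's API and may differ between versions.
import json

payload = {
    "prompt": "portrait, oil painting style",   # placeholder example prompt
    "enable_hr": True,                          # highres.fix upscaling pass
    "hr_scale": 2.0,                            # placeholder upscale factor
    "override_settings": {
        "CLIP_stop_at_last_layers": 2,          # "CLIP skip 2"
        "eta_noise_seed_delta": 31337,          # ENSD setting
    },
}

print(json.dumps(payload, indent=2))
```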
The beta version is live right now, and it is free to use: https://xai.gl/draw