Educational MegaGrid



Just want the grid? It's at the bottom; click here to jump.


About The Mega Grid

This page is a Mega-Grid of Stable Diffusion settings, intended to help educate about what each of these options is, and what the results of changing the options are.
If you're brand new to SD, you're on the wrong page - it will be too overwhelming. Find a beginner's guide to SD and install the AUTOMATIC1111 WebUI first. Consider joining the r/StableDiffusion Discord as well.
If you've tried the basics of SD and are looking to understand features like Samplers, Step counts, CFG Scale, etc., this is the perfect page for you.

For example, to see how Samplers and Step Counts compare, scroll to the grid at the bottom and select Steps for the X-Axis and Samplers for the Y-Axis, and you'll get a 2D grid showing how they compare.

You can also open the Samplers selector to see descriptions of each Sampler, along with a description of the concept of a Sampler on the left side. Note that these descriptions are scrollable.

Every axis is selectable - for example, you can have X=Steps, Y=Samplers, and then click between different Prompts or different Models.

This grid page was generated using Infinite-Axis Grid Generation For SD; you can use it to make your own.

About Stable Diffusion

Stable Diffusion is a Latent Diffusion Text-To-Image Artificial Intelligence model.

• The meaning of the "text to image AI model" part is simple enough: it's a magic black box that you put text into, and get images out.

• The "diffusion" word indicates that it works not by having the AI just pump out an image from nothing, but instead it generates random noise and has the AI "denoise" the image: it tries to guess what the image was before it was compressed to random noise. This is a magic trick to make the AI produce much higher quality outputs than it otherwise could. Here's a video from Dr. Mike Pound at Computerphile on how this works.

• The word "latent" means that it's not working on the raw pixels for denoising, instead it's working on seemingly-meaningless data inside the AI's core that represents sections of pixels. The important thing about this is that it's a magic trick to speed up the AI by letting it get the same results while working with much less data, not to mention letting it generate optimized forms of the image data.


Appreciate the work I put into this and my other projects? You can support my work through GitHub Sponsors.




Model

The model, sometimes referred to as a 'checkpoint' (due to the historical tendency of releasing models as 'ckpt' Python pickle checkpoint files), is the big primary file used by Stable Diffusion. It contains all the data the AI needs to run, other than the processing code.
StabilityAI spent millions of dollars training a brand new model from scratch with SD 1.x. Other releases, especially those from other organizations, started from Stability's model and added more training data into it.
Training takes the form of native training (creating new models or adding into them), finetuning (similar to native training, but with an emphasis on just improving the details or adding some new concepts in), and DreamBooth (adding a single concept or a small number of concepts in a way that runs very quickly, but can damage parts of the model beyond the concept being added).
Model files contain several gigabytes of data (about 2 GiB for fp16 files, 4 GiB for fp32 files, and 7 GiB for the original releases that contained extra data).
FP32 and FP16 are largely equivalent, with only a small loss of precision for FP16 and the benefit of needing only half as much file space. In most cases, you want the FP16 version of a model. The extra precision of FP32 is mostly useful for training.
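As a rough sanity check on those file sizes, here is a small sketch using approximate public parameter counts for SD 1.x (these counts are not stated on this page, so treat them as ballpark figures):

```python
# Approximate SD 1.x parameter counts (ballpark public figures, not from this page).
unet_params = 860e6          # the UNet, the main diffusion component
text_encoder_params = 123e6  # the CLIP text encoder
vae_params = 84e6            # the VAE
total_params = unet_params + text_encoder_params + vae_params

for name, bytes_per_param in (("fp32", 4), ("fp16", 2)):
    gib = total_params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB")
# fp32: ~4.0 GiB
# fp16: ~2.0 GiB  (matching the sizes quoted above)
```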
The primary component of a model is the UNet - the part of the AI that handles the diffusion steps.
It also contains data for the text encoder, and a VAE (refer to the VAE selector for details on VAE).

Models available for download online include native-trained options (SD1.x, SD2.x, NAI, WD, ...), finetuned models, custom DreamBooth models (a variety exist that teach the AI various styles or specific concepts - popular examples include "EldenRingDiffusion", which teaches the AI the styles of the game 'Elden Ring', and "HassanBlend", which teaches the AI NSFW concepts), and merged models (a merged model combines data from multiple other models - for example, 50% WD and 50% SD for a mix of anime and realistic stylings).

Checkpoint files are "pickles", meaning they contain python executable code. This can theoretically be used for malicious purposes. Be careful downloading ".ckpt" or ".pt" files from unknown sources.
Modern model files should preferably be distributed as ".safetensors" files. These files contain only the model data, with no room for code injection.
If you want to use a model from a questionable source, but only a "ckpt" is available, ask the author to post a "safetensors" version.
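For illustration, here is a minimal sketch of the difference when loading in PyTorch, assuming the safetensors library is installed (the file names are hypothetical):

```python
import torch
from safetensors.torch import load_file

# Loading a ".ckpt" goes through Python's pickle machinery, which can execute
# arbitrary code embedded in the file - hence the warning above.
risky_state = torch.load("model.ckpt", map_location="cpu")

# Loading a ".safetensors" file only reads raw tensor data; there is no
# code path for a hidden executable payload.
safe_state = load_file("model.safetensors", device="cpu")
```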

Note that official SD models come in both "EMA" and "nonEMA" forms. "EMA" means "Exponential Moving Average", and is a way to compensate for an issue with AI training where the AI overrepresents the last few images it sees. In a choice between EMA and nonEMA, you mostly want the EMA version. The nonEMA version is only useful (but still not even required) if you're going to continue training the model further. This has caused confusion because some older models were released with names like "full-ema", where "full" meant "also including non-EMA", while the other file didn't mention "ema" at all but was in fact the EMA-only version, i.e. the preferred one.
SD1.5, the primary original Stable Diffusion model, was released October 20th, 2022, as further training on top of SD1.4 (released August 25th, 2022).
StabilityAI (an AI startup founded by Emad Mostaque), RunwayML (a company that creates AI-assisted content-creation software), and CompVis (the Computer Vision and Learning research group at Ludwig Maximilian University of Munich) worked together to create the original versions of Stable Diffusion - first privately as SD 1.1, 1.2, and 1.3. After 1.3 was leaked, they went public with SD 1.4. Soon after, they developed SD 1.5. Each version of the 1.x series was based on the same system, just trained further each time. StabilityAI was unwilling to release SD 1.5 to the public due to pressure from government entities concerned with the potential danger of AI image generation. RunwayML took it into their own hands to release the model anyway, as they had equal rights to do so.
As an interesting bit of history trivia: because SD1.3 was leaked, SD1.4 was released by CompVis, and SD1.5 was released by RunwayML, StabilityAI never actually released any Stable Diffusion model until the later release of SD2.0.
This model is trained to work with OpenAI's CLIP to encode text. While the software for this is open source, CLIP's model was created using private/secret training data and methods, leading some to consider relying on it counterproductive to open-source AI. It's known to heavily weight images from modern online content creators more than anything else, which fueled controversy about potential artist copyright abuse.
This model is trained for 512x512 images primarily, and struggles with any resolution more than a small range away from this.
SD2.1 was released by StabilityAI on December 6th, 2022, as further training on top of SD2.0 (released November 23rd, 2022).
The model used for this grid page is specifically 768-v-ema.
This model is trained based on a large image set named LAION-5B, with NSFW content filtered out.
This model is trained for 768x768 images primarily, but is able to work with a much wider range of other resolutions than SD1.x could.
Waifu Diffusion 1.3, or just "WD", is trained on anime images from sites like danbooru, as a project led by respected community member "haru", intended to be explicitly free and open, for the benefit of the community rather than the author.
The training data used danbooru tags as the text prompt, and so the best usages of WD will use danbooru tags separated by commas.
The model is continued from SD 1.x, and so has the same limitations and features of SD 1.x.
NovelAI is a for-profit company that developed their own SD model and features for a web interface they charge for access to.
When their "NovelAI Anime-Final" SD model was leaked, they were upset and tried to put a stop to its spread. Soon after, however, they claimed that they intended to release their work to the public anyway, and they were only upset that somebody stole their ability to have an awesome public launch, not that the content itself was available freely to users.
There are a lot of back and forth claims and arguments from NovelAI and in response to them. It is not the job of this grid website to tell you what to think about that or who's wrong or right. However, the NovelAI model is easily available to the public and popular in many communities, therefore I decided it is worthwhile to include in this grid. (I am only including generated output examples for educational usage, and am not distributing any of their private data).
This model is varyingly referred to as "NAI", "NovelAI", "The leaked anime model", ...
This model is continued from SD 1.x, with some resolution-handling improvements, and is optimal for generations ranging from 512x512 to 768x768, but can also go a little outside of that range.
The training data used danbooru tags as the text prompt, and so the best usages of NAI will use danbooru tags separated by commas.

NovelAI based models work best with the NovelAI VAE loaded.

VAE

The "VAE", short for "Variational AutoEncoder", is the part of the SD model that converts between real images and latent-space. For SD 1.x, the VAE scales by a factor of 8 - meaning a 512x512 image gets encoded down to only a 64x64 grid of latent space values.
This means that for each single latent data point, the VAE must produce 8x8 (64 total) pixels.
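To make that scaling concrete, here is a minimal sketch using the diffusers library (not the WebUI used to make this grid) and the standalone SD 1.5 VAE described below; the random tensor just stands in for a real image:

```python
import torch
from diffusers import AutoencoderKL

# The further-trained SD 1.5 VAE described below, loaded on its own.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)   # stand-in for a 512x512 RGB image tensor
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    print(latents.shape)              # torch.Size([1, 4, 64, 64]) - 8x smaller per side
    decoded = vae.decode(latents).sample
    print(decoded.shape)              # torch.Size([1, 3, 512, 512]) - back to pixels
```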
In the early days of SD, the importance of the VAE was underestimated - it was later discovered that the VAE could be trained further on its own, producing dramatically better output quality than the VAE baked into the base model.
As such, StabilityAI and other organizations began releasing separated VAE models, which can now be loaded in and swapped around freely in modern SD UIs.
A VAE can be thought of as a very specialized type of AI image upscaler that only works with the main AI model's latent data - and, thanks to this data, works much better than a normal image upscaler.
When 'auto' is selected, whatever VAE came with the original model file is used. This is often a lower quality VAE.
This was the first big VAE release - SD 1.5's VAE, further trained independently of the base model. Its release brought immediate improvements to output quality, including for frequent trouble spots like faces and hands. The specific file is named "sd-1-5-vae-ft-mse-840000-ema-pruned".
This VAE was released by the creators of Waifu Diffusion, based on WD1.3's VAE, improved in a similar way as SD1.5's release, but focused on anime.
This is NovelAI's official VAE. Not using it can lead to grayed out images for NovelAI based models.

Hypernetwork

Hypernetworks are the answer to the question "what if we take the AI image generator, and shove another AI onto it".
Hypernetworks sit in the middle point of AI custom training, capable of more than Textual Inversion is, but not as powerful as DreamBooth.
A hypernetwork is a (relatively) small file that can have a large impact on the final output of SD, by influencing how the AI generates its results.
The original authorship of the Hypernetwork concept is disputed, with NovelAI claiming to have invented it, while others claim the concept predates NovelAI's work.
No hypernetwork loaded.
An example hypernetwork to showcase the effects a hypernetwork can have: this is the leaked "anime3" model from NovelAI.

Sampler

The sampler is the algorithm used to process each step of the AI diffusion model.
The details are a deeply technical topic, and the descriptions provided here are just loose summaries provided by a non-expert.
There's a long list of samplers, each with different details. For the purposes of this generated grid, only a few representative samplers from each category of related samplers are provided.
A key term to know here is "convergence": a sampler "converges" when it reaches the number of steps where increasing the step count wouldn't change the output noticeably.
The categories can be roughly separated into:
- The main converging set (DDIM, PLMS, Euler, DPM2, ...) - these all get the same results as each other using different methods.
- The Karras converging set (DPM++ 2M Karras, LMS Karras, ...) - these also all get the same results as each other, but slightly different from the main set. 'Karras' refers to an alternative 'noise scheduler' (the thing that builds the random start to each image based on the seed) that enables convergence in fewer steps.
- Ancestral samplers - these produce wildly variant outputs due to the addition of extra random noise between steps. They don't necessarily converge at all - adding more steps adds more random noise, and so produces different outcomes.
Some samplers measure step counts differently - for example, DPM2 actually does 2 steps for every 'step', and 'DPM Adaptive' ignores your step count input and automatically determines how many steps are needed for convergence.
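To make those loose summaries a bit more concrete, here is a heavily simplified sketch in the style of the k-diffusion samplers, showing how a plain Euler step differs from an Euler-ancestral step (the `denoise` function stands in for the real model, and the sigma schedule is an assumed placeholder):

```python
import numpy as np

def euler_sample(denoise, x, sigmas, ancestral=False, rng=None):
    """Minimal Euler / Euler-ancestral loop (loosely modeled on k-diffusion).

    denoise(x, sigma) stands in for the real UNet: it should return the model's
    guess of the fully denoised image at noise level sigma.
    """
    rng = rng or np.random.default_rng()
    for i in range(len(sigmas) - 1):
        denoised = denoise(x, sigmas[i])
        d = (x - denoised) / sigmas[i]            # direction toward the denoised guess
        if ancestral and sigmas[i + 1] > 0:
            # Ancestral: only step part of the way down, then add fresh random
            # noise back in - which is why ancestral samplers never settle down.
            sigma_up = min(sigmas[i + 1],
                           (sigmas[i + 1] ** 2 * (sigmas[i] ** 2 - sigmas[i + 1] ** 2)
                            / sigmas[i] ** 2) ** 0.5)
            sigma_down = (sigmas[i + 1] ** 2 - sigma_up ** 2) ** 0.5
            x = x + d * (sigma_down - sigmas[i])
            x = x + rng.standard_normal(x.shape) * sigma_up
        else:
            x = x + d * (sigmas[i + 1] - sigmas[i])   # plain Euler step
    return x

# Toy usage: a fake "denoiser" that always guesses a flat gray image.
sigmas = np.linspace(14.0, 0.0, 21)                   # 20 steps, hypothetical schedule
start = np.random.default_rng(1).standard_normal((64, 64)) * sigmas[0]
result = euler_sample(lambda x, s: np.full_like(x, 0.5), start, sigmas)
```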
DDIM, "Denoising Diffusion Implicit Models", is an old sampler that predates Stable Diffusion (published by Stanford researchers in 2020, modified from DDPM, published in 2020 by a Berkeley research group), and was included in the original CompVis version of SD. DDIM converges to a good image within 50 fast iterations.
PLMS predates Stable Diffusion (developed in 2022 based on a paper named "Pseudo Numerical Methods for Diffusion Models on Manifolds" from September 2021). It runs at the same speed as DDIM. It starts off worse than DDIM at low step counts, but then converges faster.
Euler, or "k_euler", was developed by Katherine Crowson as an implementation of one of several algorithms described by the paper "Elucidating the Design Space of Diffusion-Based Generative Models" by Karras et al. in 2022. It converges much faster than DDIM while running at the same speed. It is hypothetically one of the simplest algorithms, but in practice turned out to work very well.
Euler a, for "Euler Ancestral", aka "k_euler_ancestral", is similar to the "Euler" sampler, but uses "ancestral" sampling - meaning essentially it adds extra random noise as it goes, which in practices leads to a much larger variety of output images, with the downside that it does not converge with other samplers. Many samplers have ancestral variants available.
DPM2 or "k_dpm_2". DPM stands for "Diffusion probabilistic models", and DPM2 comes from a paper titled "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps". As the paper title says, it does - it converges in about 10 steps. The downside of this sampler is that every 'step' is actually two steps - ie it takes as much time to run 10 DPM2 steps as it does 20 Euler steps, making the improved convergence rate fairly redundant.
DPM++ 2M is related to DPM2, and comes from a paper named "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models" from the same authors as the DPM2 paper. In practice, it converges almost as fast as DPM2, but without needing double-steps - so it actually achieves the speed boost originally promised by DPM2.
DPM++ 2M Karras is a variant of DPM++ 2M, merged with the Karras noise scheduler. Samplers with the Karras noise scheduler converge to a slightly different final output than non-Karras variants. This particular sampler does so extremely quickly - it's basically done within 10 steps. Many samplers have Karras variants available.

CFG Scale

CFG Scale, short for "Classifier Free Guidance" scale, is a multiplier on how much your input prompt text affects the image each step.
To understand why this system is used, look no further than the output of the '1' option - without multiplying, the image barely resembles the text at all.
For whatever reason, the AI just doesn't prioritize the text enough - rather than finding a native solution to this, the developers of Stable Diffusion chose to simply multiply the effect of the text.
At an internal level, this works by running every diffusion step twice - once with text, once without. The 'without' result is subtracted from the 'with' result to get a value that represents just the effect of the text input; this value is then multiplied by the CFG Scale and added back on top of the 'without' result.
Or, in (simplified) mathematical form: finalGen = emptyGen + ((textGen - emptyGen) * cfgScale), where each variable ranges from 0 to 1.
While low scaling values ignore text, high values overbake text - view the '20' scale example to see how that goes. An overbaked image tends to look very highly saturated, with very sharp lines between black and white.
An image that's slightly overscaled is essentially made to approach the Text Encoder's quintessential image for that text, with no room left for the image-generator's creativity on details.
An image that's extremely overbaked stops having meaning of its own, and is instead a simple mathematical error.
Consider for example a single pixel, where it has an emptyGen of 0.4 and textGen of 0.45:
- for cfgScale of 1, the equation is "f = 0.4 + ((0.45 - 0.4) * 1) = 0.4 + (0.05 * 1) = 0.45".
- For cfgScale 7, "f = 0.4 + (0.05 * 7) = 0.75".
- For cfgScale of 20, "f = 0.4 + (0.05 * 20) = 0.4 + (1.0) = 1.4" ... this is a problem because "1.4" is higher than the maximum "1", therefore it gets clipped off. When a difference of "0.05" or "0.03" or "0.07" all end up with a pixel value of "1" due to clipping at the top, it's inevitable that the image simply becomes a flat empty image. All the detail is in the value over "1", and so all the detail is removed.
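To tie the equation and the worked example together, here is a minimal sketch of the same (simplified) arithmetic - remember that real implementations apply this guidance to the model's predictions in latent space, not directly to pixel values:

```python
import numpy as np

def apply_cfg(empty_gen, text_gen, cfg_scale):
    """The simplified CFG equation from above, with clipping to the valid range."""
    guided = empty_gen + (text_gen - empty_gen) * cfg_scale
    return np.clip(guided, 0.0, 1.0)   # anything past 1 gets clipped, losing detail

for scale in (1, 7, 20):
    print(scale, apply_cfg(0.4, 0.45, scale))
# 1  -> 0.45
# 7  -> 0.75
# 20 -> 1.0  (1.4 before clipping)
```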
Note also how changing the step count affects the results of the CFG Scale.
1 is here to demonstrate what happens when you don't apply CFG Scale.
3 is a pretty low scale value that encourages AI creativity rather than prompt following.
If your prompt is too strongly affecting your image and you want more creativity, dropping it down a little might help.
7 is a good default value for CFG Scale. It solidly encourages your text as the image guidance, but leaves room for AI creativity.
If your prompt is getting ignored, bumping it up to 9 might help.
If your prompt is getting ignored, bumping it up to 11 might help. This is on the edge of potential overbaking range.
This example of 20 is just here to demonstrate overbaking when CFG Scale is too high. In certain cases it might still work, but those cases are rare.

Steps

The step count is, in short, how many times to run the Diffusion model on the image before producing an output.
More steps means it runs more times, and also asks for less denoising between each step.
Broadly speaking, more steps means better quality output, up to a point.
More steps also naturally means longer to run.
Many users like to run at lower step counts (for speed) until they get what they want, then re-run with a very high step count (for quality) to create their final output image.
Step counts strongly relate to samplers - different samplers treat step counts differently. 'DPM Adaptive' for example ignores the step count value entirely, to calculate it on its own. Other samplers might secretly run two or more steps per 'step' for various forms of internal benefit. Refer to the sampler selection for specifics.
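For reference, the step count is also the setting most libraries expose directly. Here is a minimal sketch using the diffusers library (not the WebUI that generated this grid) - the step count is the `num_inference_steps` argument, alongside the sampler and CFG Scale discussed elsewhere on this page:

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# SD 1.5 in fp16, with the Euler sampler swapped in.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a photo of a cat",
    num_inference_steps=20,   # the step count described here
    guidance_scale=7,         # the CFG Scale described above
).images[0]
image.save("cat.png")
```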
50 steps is the original default for the early versions of Stable Diffusion, intended to work with the original sampler, DDIM. For other modern samplers like Euler, 50 is enough to create a very high quality output.
20 steps is the default in WebUIs, because it is a good balance between speed and quality. On an RTX 30xx card, 20 steps with the 'Euler' sampler can run in only 2-3 seconds, and still gets quality very close to 50 steps (which would take more than twice as long).
15 steps is getting a bit low for most samplers. It should still clearly show what the image is going to turn out to be, more or less.
10 steps is very low for most samplers. It will show the broad strokes of the final output, but will often look blurry.
5 steps is extremely low for most samplers. Depending on sampler, it might end up only creating amorphous blobs. For some samplers, it will suffice to get a lowres blurry image.
2 steps is included on this grid mostly just to show what the AI is limited to when it can't run many steps, to demonstrate why the repeated-step system is used. This is too low for any current sampler to create anything usable from.

Prompt

The prompt is the most important setting in Stable Diffusion: it's your description of what you want the AI to generate.
Anything from a cat to a landscape to, well, whatever you want - the prompt is the wide-open free choice through which you express your creativity.
Prompt text gets fed into a Text Encoder model (in SD 1.x, this is OpenAI CLIP, in SD 2.x, this is a custom LAION-trained Text Encoder), which produces a latent representation of the text in an image-ready format.
The latent text representation then gets fed into the generation parameters of the image, and the difference this creates gets multiplied by the CFG Scale.
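For the curious, here is a minimal sketch of just the text-encoding step for SD 1.x, using the transformers library directly (the grid page itself does not use this code):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD 1.x uses OpenAI's CLIP ViT-L/14 text encoder, as noted above.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a cat", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)   # torch.Size([1, 77, 768]) - 77 token slots, 768 values each
```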
Generally, prompt-writing is an artform all its own. There isn't really a proper science to it; you just gotta experiment and see what works and what you like. As with any art, there are at least some guidelines to get you started doing well.
View the example prompts to see different prompt styles and to learn how these different styles affect things.
A very simple prompt, just:
a photo of a cat
Short and simple prompts can tend towards more variety between seeds.
This prompt will tend to produce relatively realistic pictures of cats in base SD.
a beautiful green grassy landscape, a castle in the distance atop a mountain, blue sky with light clouds, peaceful river, masterpiece, canvas painting, watercolor, highly detailed, trending on artstation, cinematic, sharp focus.
This is a longer prompt, trying to get specific details.
This type of longer prompt will keep more consistency between seeds.
This prompt was crafted to create a very particular artistic style.
a picture of a beautiful and majestic woman posing for a professional photoshoot, 4k, medium shot
This prompt attempts to create a beautiful woman, in the standard SD prompting format: roughly a sentence, just as might appear in the caption of a real image in the wild, with some tag-words on the end. Most images you find online have a title and some keyword tags, and so standard SD is trained to work best with this style.
Prompting for these models works best by simply trying to write a description of the image you're expecting, and then tacking tags onto the end to tweak as needed. Look up public examples of other AI users' prompts to try to learn phrasings and keywords that work best. Experiment!
1girl, beautiful, long hair, smile, sweater
Unlike base SD, several other trained models (such as Waifu Diffusion and NovelAI) are trained on anime booru tags. These are the tags found on sites such as danbooru. These are simply lists of relevant tag names, separated by commas.
Prompting for these models works best by looking through images posted on these booru sites to learn tag names and how they look in the original training images.
Because these models still use SD as the base, sentence-style phrasings and words that aren't officially tags on these sites still work - they're just less optimal.
superman flying through the sky with his hand outstretched
This is an example of a rather unfortunate prompt.
It asks for a specific person, with a specific but unusual pose, and a visible hand. This is a really bad case for Stable Diffusion, as each of these three parts can be difficult for the base model to do well.
This prompt is useful to let you play with options such as the negative prompt or VAE, to see which options help 'cure' the badness of this prompt.
light dust, magnificent, theme park, medium shot, details, sharp focus, elegant, highly detailed, illustration, by jordan grimmer and greg rutkowski and ocellus and alphonse mucha and wlop, intricate, beautiful, triadic contrast colors, trending artstation, pixiv, digital art
This prompt was suggested on Reddit as an approximation of Midjourney's styling. It's not quite there, but it creates some beautiful output. In this case it is being used with no further prompting than the suggested style.

Negative Prompt

Negative prompts are just like regular prompts, but backwards - they tell the AI to *not* generate the specified content.
This can be useful in a variety of contexts - for example, if you're seeing images with 'Shutterstock' style watermarks or similar, often just adding a negative of 'watermark' is sufficient to put a stop to that.
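Mechanically, in common implementations (the AUTOMATIC1111 WebUI among them), the negative prompt works by taking the place of the empty 'without text' pass described in the CFG Scale section, so the guidance pushes the image toward the prompt and away from the negative prompt. A minimal sketch reusing that section's simplified equation, with hypothetical values:

```python
negative_gen, text_gen, cfg_scale = 0.4, 0.45, 7.0   # hypothetical single-value example

# Same equation as the CFG Scale section, with the negative prompt's prediction
# standing in for the empty "without text" prediction.
final_gen = negative_gen + (text_gen - negative_gen) * cfg_scale
print(final_gen)   # ~0.75 - pulled toward the prompt, pushed away from the negative
```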
This is just "nsfw, explicit" to be safe, and nothing else.
nsfw, explicit, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry
This is reported to be the official standard negative prompt from NovelAI (With nsfw, explicit added to be safe).
The theory of prompts like this is that by specifying a standard list of unusual attributes that almost nobody wants, output images are more likely to look nice. Nobody wants fake watermarks or signatures to be generated, so just toss 'em in. Some users dispute the benefit of particularly unusual negatives like 'extra digit', arguing that such a prompt won't really mean much to the AI.

Seed

A 'seed' is the primary input value to the random-noise generator used for SD.
A seed of '-1' indicates a random seed will be used. Any other value is a manual seed.
If all other parameters are the same, changing the seed can still change a lot - it will change more on simpler prompts than it will on complex ones.
For many simple prompts, you might notice more similarities between seeds than you might expect, particularly at the level of basic structure and image composition - for example, the line separating sky from ground in a landscape image might be the exact same line that separates a person's neck from their shirt.
When playing with prompts, it is important to try many different seeds - at the easiest, just use a large batch count (4 or more) or a random seed. This helps you work out whether your prompt is missing something, or whether a seed just had the 'bad luck' of not seeding the details you need.
The name 'seed' is a metaphor - the same way a large and complicated plant grows from a small physical seed, a large complex digital-random-noise image is grown from a simple integer number seed.
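A minimal sketch of the idea, assuming PyTorch and the latent shape of a 512x512 SD 1.x generation (4 channels at 64x64) - the same seed always reproduces the exact same starting noise:

```python
import torch

def initial_latents(seed, shape=(1, 4, 64, 64)):
    """Build the starting noise for a generation from an integer seed."""
    generator = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(shape, generator=generator)

a = initial_latents(1)
b = initial_latents(1)
c = initial_latents(2)
print(torch.equal(a, b))   # True  - same seed, identical noise, identical image (all else equal)
print(torch.equal(a, c))   # False - different seed, different noise, different image
```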
Seed manually specified as '1'.
Seed manually specified as '2'.
Seed manually specified as '3'.
