- GPU: Nvidia GeForce RTX 3060 with 12GB of VRAM

# Txt2Img

Stable Diffusion can generate images from two primary inputs: a standalone text description of what you want the image to be, or an image accompanied by text that provides more information about the desired output. These two types of generation are referred to, respectively, as "txt2img" and "img2img" (pretty straightforward so far).
In most interfaces you'll encounter, the primary panel will be txt2img. When Automatic1111's Web UI is loaded, this is the first tab at the top left. So let's start by looking at the parameters you can tweak when converting text to an image.
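Throughout this guide I'll be working in the Web UI, but every parameter we'll cover also maps to an argument in Hugging Face's diffusers library if you prefer scripting. Here's a minimal sketch, assuming you have diffusers and torch installed, a CUDA GPU, and access to a v1.5 checkpoint (the model id and output filename below are just illustrative):

```python
# Minimal txt2img sketch using Hugging Face's diffusers library.
# Assumes a CUDA-capable GPU and that the checkpoint id below is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

# Generate a single image from a text prompt with mostly default settings.
image = pipe("A painting of a Kyoto cityscape by Satoshi Kon").images[0]
image.save("kyoto.png")  # illustrative output filename
```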
# Prompt

When you initialize Stable Diffusion (on your computer or through an online provider), the first feature you will likely notice is a text box titled "Prompt". As obvious as it sounds, this box is where you describe the picture you want to see so the AI knows what to make. This is perhaps the single most important parameter in Stable Diffusion. The difference between a good image and a heap of meaningless pixels often comes down to the text prompt you write.

While prompt writing is a whole topic unto itself, I'll give you the basics of how to use the Prompt box. To create an image from text, you generally need to tell the AI three things:
1. What **content** should be in the image (what subject matter, background, etc. do you want to see?)
2. What **medium** the image should look like (do you want a painting? Or a photograph?)
3. What **style** the image should emulate (do you want it to look like a specific artist?)

You can be as general or as specific in your text description as you want. A big part of playing with AI art is learning which prompt structures work for you and which generate passable results. Let's look at an example prompt; for this example, we'll leave all of the other parameters at their defaults for now.
### Our practice prompt is:

*A painting of a Kyoto cityscape by Satoshi Kon*

When we hit generate, the AI will use what it knows about this text description's elements to create an image. Let's break down our prompt and look at each phrase's impact on the end result.
**A painting of a Kyoto cityscape by Satoshi Kon**

**"A painting" is our medium modifier;** we are telling Stable Diffusion to draw on what it knows about paintings when making the piece of art. This is a very general modifier, as paintings come in many different styles (impressionist, digital, abstract, etc.) and can be done in a variety of mediums (watercolor, acrylics, etc.).

**"Kyoto cityscape" is our subject qualifier;** Stable Diffusion will ruminate on all the images it has seen of Kyoto, cityscapes, and cityscapes in Kyoto to determine what such an image would look like. This term is moderately specific, as we've noted which city the AI should take inspiration from rather than just asking for any old cityscape.

**"by Satoshi Kon" is our style modifier;** Stable Diffusion will use this text to take whatever image we're asking for and try to make it match that specific artist's style. A good tip here is to know what an artist's style looks like and what kinds of mediums and subjects are most common for that artist. Doing this will help the AI make a more believable image.

If, for example, we asked Stable Diffusion to make a pencil sketch in the style of Wes Anderson (a film director), it may have trouble anticipating what such an image would look like because it doesn't have pencil sketches by Wes Anderson in its references. It might then spit out something garbled or unrelated to your prompt.

*(Actually, this example worked quite well! Never underestimate the power of AI learning!)*

# Negative Prompt

Directly underneath the main Prompt box is another text box titled "Negative Prompt". This is the area where we can specify things we do not want to see in our output image. In our last generation, you'll notice that the image came out in black and white. I like the style of the image, but I would prefer something in color. We could change the regular prompt to specify a "colored pencil sketch", but first let's test out the negative prompt.
I will add the terms "monochrome" and "black and white" to the negative prompt, separated by a comma to mark them as two separate terms. While these terms have very similar meanings, I want to exclude both of them because I don't know which phrase the AI is more likely to associate with this image style. Let's look at the results when I generate the image with this negative prompt:

Not too bad… but what if I want to avoid having people appear in the image? Let's add the term "people" to the negative prompt as well and see what happens:

As you can see, even one word added to the positive or negative prompt can affect the results that Stable Diffusion gives us. Let's move on to the other parameters before we get too bogged down in this topic. Prompt writing really is an entire lesson unto itself.

*The controls for image parameters in Automatic1111's Web UI*

# Sampling Steps

The first slider you'll see in this web UI is labeled "Sampling Steps". Steps refer to how many passes the AI makes when taking visual noise and refining it into your desired image. **Think of it like layers of paint put down by a painter.** A painter will start with a wash that has no details at all; next, he will add large blocks and shapes of paint to get the overall design of the painting onto the canvas; once that dries, he can take a smaller brush and paint in the details of the piece.

Stable Diffusion operates in a similar fashion: when you give the AI a prompt to generate, it starts with nothing but a canvas of latent space. With each step, it puts another layer of "paint" down on that latent space, first with blurry blocks of color to define what goes where. With every extra step, the AI adds more and more detail to the image until it reaches the number of steps that you specified.
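For the script-minded, both the negative prompt and the step count are just extra arguments on the same pipeline call from the earlier sketch (still assuming diffusers; `pipe` is the pipeline created above, and the values are the ones used in this section):

```python
# Negative prompt and step count are additional arguments to the same call.
image = pipe(
    "A painting of a Kyoto cityscape by Satoshi Kon",
    negative_prompt="monochrome, black and white, people",  # things we do NOT want
    num_inference_steps=20,                                  # the "Sampling Steps" slider
).images[0]
image.save("kyoto_color.png")  # illustrative filename
```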
## Steps vs Quality

Now, this explanation may lead you to believe that more steps mean better pictures. But that's not always the case. At a certain point, adding more steps doesn't improve the quality of an image; **it can actually cause the image to start looking overworked.**

If we go back to the painter analogy, the painting he puts on canvas only needs so many layers before it forms a coherent image. If he keeps piling on layers of paint, the image will start to look messy. Let's say this hypothetical painter is making a landscape with pine trees. If you ask him to keep adding more and more detail after the landscape is already there, all he can do is add increasingly minute details to the canvas. If he spends long enough doing that, he might brush individual pine needles onto every tree in the whole forest… and lose his mind in the process!
Stable Diffusion works in a similar way: at a certain point, the noise becomes coherent enough to form a clear image. If you keep pushing for more steps beyond that point, the AI has no choice but to keep looking for tiny details to throw in here and there. Push too hard and it will start making up weird details just for the sake of adding more to the image.
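One easy way to see this for yourself is to render the same prompt at several step counts and compare the results. A sketch, again assuming the diffusers pipeline from earlier (seeds are covered later in this article, but fixing one here keeps the comparison apples-to-apples):

```python
import torch

prompt = "A painting of a Kyoto cityscape by Satoshi Kon"
for steps in (2, 20, 100):
    # Re-seed the generator each time so the only variable is the step count.
    generator = torch.Generator(device="cuda").manual_seed(2691516055)
    image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
    image.save(f"kyoto_{steps}_steps.png")  # illustrative filenames
```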
## Goldilocks Zone of Denoising

Suffice it to say, there is a Goldilocks zone of denoising. The number of steps you should aim for will depend on your preferred subject matter and art style. We will discuss that topic further in another article. For beginners, I would recommend using 20 or 30 steps.

All three images below share the same settings — prompt: *A painting of a Kyoto cityscape by Satoshi Kon*, sampling method: Euler, size: 512 × 512 pixels, CFG Scale: 7, seed: 2691516055 — only the step count changes:

| Steps | Time to Make |
|---|---|
| 2 | 0.81 seconds |
| 20 | 3.84 seconds |
| 100 | 25.37 seconds |

**As you can see from this comparison, there are diminishing returns when increasing steps.** The difference between 2 and 20 steps is much greater than the difference between 20 and 100. The generation time also goes up with every additional step you add. Keep that in mind if you have limited CPU or GPU resources.

# Sampling Method

Just below Sampling Steps, you will find several selection bubbles labeled "Sampling Method". I was not a mathematics major, so I can't explain exactly how these operate. But, in layman's terms, this image generation process works by giving a computer images that have been reduced to noise and teaching it how to recreate the original image. There are fancy and confusing mathematical equations involved in accomplishing this.
**In short, each Sampling Method is just a different approach to solving the equations that make image generation possible.** As a result, each method will give you a final image based on your prompt, but with small nuances of difference, because each one arrives at the end result in a slightly different way.

*Prompt: analog style photo of a chocolate cake sitting on a countertop in a country kitchen, detailed, realistic, by Wes Anderson (initial size was 768×512)*

I liken this to baking a chocolate cake. You and your friend start with the same ingredients, but you each follow your own set of directions that vary slightly. Both of you will end up with a chocolate cake, but the way in which you mixed and baked yours will make it somewhat distinct from the one your friend finished. Sampling Methods are like different techniques for baking the same chocolate cake.
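In a scripted workflow, choosing a sampling method corresponds to swapping the pipeline's scheduler. A sketch under the same diffusers assumption (the two Euler schedulers below are the library's rough counterparts to the Web UI's "Euler" and "Euler a" options):

```python
from diffusers import EulerDiscreteScheduler, EulerAncestralDiscreteScheduler

# Roughly "Euler" in the Web UI:
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

# Roughly "Euler a" (ancestral) in the Web UI:
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# Same prompt and seed with a different scheduler -> a subtly different image.
```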
In Stable Diffusion, there are a variety of sampling methods available. We'll discuss their differences at length in another article. For now, I would recommend starting with **Euler** or **Euler a** as you begin your image generation journey.

# Image Size

Next, we see the Width and Height parameters. These are quite self-explanatory: with these sliders, you control the size of your output image. It sounds simple enough, but keep these two points in mind when changing the output size:
1. **Divisible Numbers.** Please note that, when you drag the sliders around, the numbers on the right change in increments of 64. This is quite important, as all images generated by Stable Diffusion must have dimensions divisible by 64. If you enter a custom non-divisible size (for example, a video-sized thumbnail at 1920×1080 pixels) and hit generate, you will get an error instead of an image. So if you're after a specific image size, generate an image with the closest aspect ratio and crop it to your desired size in a photo editor later.
2. **Size Limits.** Just as importantly, you need to understand that the image size the AI can generate depends on the hardware capabilities of your GPU. The larger the image, the more pixels involved and, as a result, the more time and processing power it takes to make the image. If you choose a height and/or width that is too big for your GPU to process, you will get an error instead of an image. Thankfully, there are many AI upscaling solutions you can use to increase the resolution of your creations later. We'll talk about that in another guide.

The **aspect ratio** of your image will not only impact the processing time: it can also change the way Stable Diffusion composes the image. I've seen it happen many times now where a widescreen aspect ratio, when paired with a human portrait prompt, generates odd results or duplicated faces because the AI is trying to fill in the extra space. As a general rule of thumb, stick to square or vertical sizes for better portrait results.

# Batch Count & Batch Size

Batches refer to how many images you want Stable Diffusion to generate in one go with your current prompt and settings. It's like putting multiple pans of cookies in the oven. **Batch count is how many cookie sheets you put in the oven, while batch size is how many cookies (i.e. images) are on each sheet.**

*Can you tell I was hungry when I wrote this article? Prompt: analog style photograph of a diner, a plate of cookies and a cup of coffee on a table, detailed, realistic, evening light, by Wes Anderson (initial size was 768×512)*

Just keep in mind that the more images you ask Stable Diffusion to generate in one sitting, the longer your wait time will be. I would recommend starting with small batch sizes, between one and four images, as you get a feel for which prompts give you what results.
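Both the size and batch controls also map to pipeline arguments in a scripted setup. A sketch, still assuming the diffusers pipeline from earlier (the `round_to_64` helper is a hypothetical convenience, not part of any library):

```python
def round_to_64(n: int) -> int:
    """Hypothetical helper: snap a dimension to the nearest multiple of 64."""
    return max(64, round(n / 64) * 64)

width, height = round_to_64(768), round_to_64(512)

# `num_images_per_prompt` plays the role of batch size;
# running the call in a loop plays the role of batch count.
batch_count = 2
for i in range(batch_count):
    images = pipe(
        "A painting of a Kyoto cityscape by Satoshi Kon",
        width=width,
        height=height,
        num_images_per_prompt=4,  # batch size
    ).images
    for j, image in enumerate(images):
        image.save(f"batch{i}_{j}.png")  # illustrative filenames
```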
# CFG Scale

**CFG Scale** stands for "Classifier Free Guidance Scale". That sounds overly technical, but this slider simply controls how closely the AI follows your prompt when generating an image. The higher you set the scale, the more strictly Stable Diffusion interprets your prompt; the lower you set it, the more creative it gets in making the output image. Let's look at two extreme examples while building on our initial practice prompt.

Here are the results of *A painting of a Kyoto cityscape by Satoshi Kon* with a low CFG Scale (at 1.5, the AI gets a lot of freedom with this prompt) and then a high one (at 28, the AI tries to capture exactly what the prompt says):

| Prompt | CFG Scale |
|---|---|
| A painting of a Kyoto cityscape by Satoshi Kon | 1.5 |
| A painting of a Kyoto cityscape by Satoshi Kon | 28 |

As you can see, an extreme in either direction can make the results look a bit… odd. A good middle ground is to leave the CFG Scale between about 7 and 12. For txt2img, I rarely tweak the CFG unless the AI is throwing a lot of extra stuff into the image that doesn't belong. Here is another generation with the CFG Scale set to 7.5, giving a more comprehensible but still creative result:

# Seed

A **seed** is a string of numbers that identifies each individual image the AI generates. Every image that comes out of Stable Diffusion has a seed number. But seeds don't end there. **If you like the visual style of a specific image, you can re-use that same seed number with a different prompt to generate an image with similar visual characteristics to the original image from which that seed was pulled.**

Let's say I really liked the look of our very first image and I want to use different prompts with a similar aesthetic. I will first click on the green recycle symbol next to the seed box. This will show the seed of whatever image we last generated. Next, I will copy and paste the seed from our original generation into this box in place of the one it's currently showing. (Stable Diffusion saves output images with the seed number in the file name by default, so to retrieve that seed I go to my output folder and copy it from the image's file name.)
For this example, my seed is: 2323820377
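In script form (assuming the same diffusers pipeline), pinning a seed just means passing a seeded random generator alongside the other settings; `guidance_scale` below is the CFG Scale slider we just covered:

```python
import torch

# Reproduce (or riff on) the original image by pinning its seed.
generator = torch.Generator(device="cuda").manual_seed(2323820377)
image = pipe(
    "A painting of a Kyoto cityscape by Satoshi Kon",
    num_inference_steps=20,
    guidance_scale=7.5,      # the CFG Scale slider
    generator=generator,     # the Seed box
).images[0]
image.save("kyoto_seeded.png")  # illustrative filename
```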
With that seed now locked in, let's change our prompt a bit. I will change the subject matter from "a Kyoto cityscape" to "Los Angeles". Here are the results side by side for comparison:

| | Original | Same seed, new subject |
|---|---|---|
| Prompt | A painting of a Kyoto cityscape by Satoshi Kon | A painting of Los Angeles by Satoshi Kon |
| Steps | 20 | 20 |
| Sampler | Euler a | Euler a |
| Seed | 2323820377 | 2323820377 |
| CFG Scale | 7.5 | 7.5 |

Now that we've covered the basic features of Stable Diffusion's txt2img technology, let's look at some of the features specific to this Web interface (again, I'm using Automatic1111's Web UI). These include:
- Restore Faces
- Tiling
- Highres Fix

## Restore Faces

When checked, Restore Faces instructs Stable Diffusion to apply an additional algorithm to the generation process that is designed to improve the appearance of human faces. In the Web UI settings, you can specify which algorithm you want to use: Codeformer or GFPGAN. Codeformer is a model that was quite literally designed to restore damaged faces in old photos; GFPGAN is an AI model (a GAN model, to be exact) that does the same thing in a slightly different way.
Generally, both models are good but differ in the way they handle eyes. Codeformer adds a bit more highlight or sparkle to the eyes for a photorealistic look, while GFPGAN can produce more cartoon- or anime-style eyes. In my personal experience so far, I usually prefer the results from Codeformer.
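If you're curious what the Web UI does under the hood when that box is checked, face restoration is essentially a post-processing pass over the finished image. A rough sketch using the standalone gfpgan package (the weights path and filenames are assumptions — you'd download the model separately — and exact constructor arguments can vary between GFPGAN releases):

```python
import cv2
from gfpgan import GFPGANer

# Load the GFPGAN restorer; the weights file below is an assumed local download.
restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=1)

# Run restoration as a post-processing pass on a finished generation.
img = cv2.imread("portrait.png", cv2.IMREAD_COLOR)
_, _, restored = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("portrait_restored.png", restored)
```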
Below is a side-by-side-by-side comparison of three portraits with the same seed: first with no face restoration, then with GFPGAN, and then with Codeformer. All three use the prompt *A photo portrait of a beautiful young Dutch woman by Marta Bevacqua, detailed, realistic, 50mm lens*, 20 steps, the Euler sampler, seed 1973059559, and a CFG Scale of 7.5:

| Portrait | Face Restoration |
|---|---|
| 1 | None |
| 2 | GFPGAN at 0.15 weight |
| 3 | Codeformer at 0.15 weight |

Here is another portrait example, same side-by-side comparison. These use the prompt *A close-up photo portrait of a handsome young American man in a park by Marta Bevacqua, detailed, realistic, 50mm lens*, 20 steps, the Euler sampler, seed 1364391388, and a CFG Scale of 7.5:

| Portrait | Face Restoration |
|---|---|
| 1 | None |
| 2 | GFPGAN at 0.15 weight |