When you should use a robots.txt file. And when you shouldn’t.

Robots.txt files can be very helpful in controlling which content Google gets to see and which content it doesn’t. A misconfigured robots.txt file can also block search engines like Google from ever crawling your website. This video tells the story of a client who blocked their entire site from Google.

Transcript

0:00 Last week we had a referral from a colleague. They sent over a website and made an introduction. And what we typically do with that is go to the website, do a little preliminary check to see what’s going on with the website.


0:14 Is it worth the conversation for both parties? And we start the conversation from there. So when I put the URL, the domain name, in, I couldn’t find anything in Google, and that was weird.


0:29 So what I did was I went to the robots.txt file, which is what we’re looking at here. Not the client’s, but mine, on opticsin.com.


0:39 You can, and you should, have your own file here as well. Your own robots.txt. You can find it at mysite.com slash robots.txt.
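A minimal sketch of where that file lives, assuming only Python's standard library. The helper name `robots_url` is made up for illustration; the point is that robots.txt always sits at the root of the host, whichever page you start from.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://mysite.com/blog/some-post/?page=2"))
# prints: https://mysite.com/robots.txt
```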


0:48 And what we found on their site was something I’m going to walk you through. So we see user agent here, star, which means all the user agents, which is basically any bot that’s going to come to the website.


1:04 Ideally they are going to abide by the commands in here. No guarantee, but ideally they do. And so what we found was this line, user agent star, and then this one, disallow slash.


1:22 So again, the star is all user agents and the slash means stay away from this website. Don’t crawl it. Don’t index it.
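To make those two lines concrete, here is a small sketch using `urllib.robotparser` from Python's standard library. The `is_allowed` helper and the example.com URLs are illustrative assumptions, and Python's parser is only a stand-in for how a well-behaved crawler might read the file, not Google's own implementation.

```python
from urllib.robotparser import RobotFileParser

# The two lines described in the video, written out as a robots.txt file.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask a standards-style parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/", "/about/", "/blog/some-post/"):
    allowed = is_allowed(BLOCK_EVERYTHING, "Googlebot", "https://example.com" + path)
    print(path, allowed)  # every path prints False
```

Because every URL path starts with a slash, `Disallow: /` matches all of them, which is why no page on the site is open to crawling.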


1:31 Which is why Google had zero, none, not even one page out of about 700 pages on this website.


1:41 None of them were available for SEO traffic. So how did we get there? Well, there’s a couple reasons why that may happen.


1:49 In this case, the site went live in September. So that’s months of SEO traffic, months of history, lost because the site was pushed from a staging environment to production and they still had that file.


2:06 It’s typical to have the robots.txt file blocking search engines in staging. But we don’t want that when it gets pushed to production.


2:17 Sometimes the server is set up to not bring that file over. Sometimes it is. It sort of depends on who you’re hosting with and your process.


2:24 But in this case, the site went live in September, and the site still had a robots.txt disallow command telling Google and all other search engines to stay away.
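One way to catch this before launch is a small check in the deployment process. The sketch below is an assumption about how such a guard could look, not the process used in the video: `assert_crawlable`, the crawler names passed to it, and both sample files are hypothetical.

```python
from urllib.robotparser import RobotFileParser

STAGING_ROBOTS = "User-agent: *\nDisallow: /\n"            # fine on staging
PRODUCTION_ROBOTS = "User-agent: *\nDisallow: /wp-admin/\n"  # typical live file


def assert_crawlable(robots_txt: str, agents=("Googlebot", "Bingbot")) -> None:
    """Raise ValueError if any listed crawler is barred from the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [agent for agent in agents if not parser.can_fetch(agent, "/")]
    if blocked:
        raise ValueError("robots.txt blocks the homepage for: " + ", ".join(blocked))


assert_crawlable(PRODUCTION_ROBOTS)  # passes silently
try:
    assert_crawlable(STAGING_ROBOTS)
except ValueError as error:
    print("Do not ship this file:", error)
```

Running a check like this against the file that is about to go live would flag a staging robots.txt that was carried over by mistake.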


2:37 So we were able to validate that. Again, I’m just going to use Optics In as the example, but go to Google and type in site colon, and then the domain that appears in the address bar for your website. Dump it in here: site colon, no space.


2:57 Some people make that mistake. And then scroll down, and you can see that command picks up everything that Google has indexed for OpticsIn.com: San Diego SEO training for beginners, coaching, conferences, keynotes, SEO client results. Got some pretty good ones in there; you may want to check it out.
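The same check can be written down as a search URL. This is a sketch under the assumption that the query is simply typed into Google's normal search box; the helper name `site_query_url` is made up. It builds the `site:` query with no space after the colon, which is the mistake mentioned above.

```python
from urllib.parse import quote_plus, urlsplit


def site_query_url(address: str) -> str:
    """Build a Google search URL for 'site:<domain>' from a URL or bare domain."""
    # Accept either "https://opticsin.com/blog/" or just "opticsin.com".
    host = urlsplit(address if "//" in address else "//" + address).netloc
    query = "site:" + host  # no space between the colon and the domain
    return "https://www.google.com/search?q=" + quote_plus(query)


print(site_query_url("opticsin.com"))
# prints: https://www.google.com/search?q=site%3Aopticsin.com
```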


3:24 And more blog posts, blog posts, blog posts, right. So for this site, there was nothing. It was going to be the equivalent of, let’s see if this works.


3:34 I just made up this domain. It was sort of like that. Nothing came back, right? So that’s a problem.


3:41 So during the call, I told the prospect, I said, you know, I really hope we start working together, but before we get into the conversation, I’m going to let you know what’s going on with your website.


3:53 And I told her about robots.txt, and then she told me about September being the go-live of this website. And then sure enough, probably a couple hours later, I checked the website and it was fixed.


4:04 So now Google has access to it. So cool. It’s going to get crawled, indexed, and ideally start ranking well. But what you want to do is make sure that your website is not blocking anything from Google that you do not want it to block.


4:17 So again, let me walk you through it: user agent star. And if this is the next line underneath it, you are blocking all search engines.


4:25 You probably don’t want that, not if you like SEO. Typically, we’re in the business of getting people more of that traffic, not less of that traffic.


4:36 Most websites don’t want less of that traffic. So let’s understand what’s going on with this file in particular, just so you get a sense of robots.txt.


4:43 And this is a small file. Some files can be very, very long: the more complex a website is, the longer it is. But let’s start to break down exactly what robots.txt is, why you should use it, where you can find it, and how you can build one.


4:57 Okay. So again, mysite.com slash robots.txt. This is a comment I put in here to tell me, next time I come in here,


5:07 when I last updated this file. User agent is all bots coming to this website. Here’s a sitemap, an XML sitemap, which is a secondary findability tool that gives Google and all the other search engines access.
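As a rough illustration of those first lines, the sketch below parses a file with a comment, a user-agent line, and a sitemap line. The comment text, the example.com sitemap URL, and the `/search` rule are placeholders, not the contents of the file shown in the video, and `site_maps()` needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_ROBOTS = """\
# Last updated: (a note to yourself goes here)
User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS.splitlines())

# Lines starting with "#" are comments and are ignored by the parser.
print(parser.site_maps())                           # ['https://example.com/sitemap.xml']
print(parser.can_fetch("Googlebot", "/search?q=x"))  # False
print(parser.can_fetch("Googlebot", "/blog/"))       # True
```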


5:22 Here’s where it gets interesting, right? So typically folks use robots.txt to have Google not access certain web pages. But say we’re going to disallow a certain folder,


5:37 but you want them to get access to a file that’s in that folder. You need to do something like this, right?


5:43 So this basically says, Dear Google, stay away from everything in wp-admin, with the exception of this line here. And then we say, please don’t go to slash search.


5:56 We don’t want search results in search results. Disallow feed. I don’t want any of the URLs from the feed in there.


6:02 I just never found a good purpose for it. Disallow cdn-cgi. There’s nothing in there that’s going to help the SEO of OpticsIn.com in any way.
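The folder-with-an-exception pattern can be sketched as follows. Everything specific here is an assumption: the video does not name the allowed file, so `/wp-admin/admin-ajax.php` (a common WordPress exception) stands in for it, and the exact paths for search, feed, and cdn-cgi are guesses. The matching rule, where the longest matching path wins and Allow wins a tie, follows the Robots Exclusion Protocol (RFC 9309); this simplified version ignores wildcards.

```python
# (kind, path prefix) pairs standing in for the rules walked through above.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),  # assumed exception, not named in the video
    ("disallow", "/search"),
    ("disallow", "/feed"),
    ("disallow", "/cdn-cgi/"),
]


def is_allowed(path: str, rules=RULES) -> bool:
    """Longest matching prefix wins; on equal length, Allow beats Disallow."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


print(is_allowed("/wp-admin/options.php"))     # False: inside the blocked folder
print(is_allowed("/wp-admin/admin-ajax.php"))  # True: the more specific Allow wins
print(is_allowed("/blog/some-post/"))          # True: no rule matches
```

The sketch implements the matching by hand because Python's built-in `urllib.robotparser` applies the first matching rule in file order, so with Disallow listed before Allow it would report the exception as blocked.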


6:16 And here’s a good one, right? So this is a user agent, a very specific bot, ia_archiver, which is from Archive.org.


6:24 And I said, please go away. Now, I put that in this robots.txt file years later; I just wasn’t thinking about it.
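A rule aimed at one bot sits in its own group. The sketch below assumes the token is spelled `ia_archiver` and uses a placeholder `/search` rule for everyone else; whether a given crawler honors its group is up to that crawler, as noted earlier in the video.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /search

User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named group applies only to the archiver; other bots use the "*" group.
print(parser.can_fetch("ia_archiver", "/"))  # False
print(parser.can_fetch("Googlebot", "/"))    # True
```

Contrast this with the client's file: there the `Disallow: /` sat under `User-agent: *`, so it applied to every bot instead of just one.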


6:34 I decided I didn’t want this website to be found in Archive.org. And what you can see is the site went live in September of 2019. You can see that it starts to get picked up in 2020. You know, there’s a lot of details.


6:49 There’s activity in 2021, 2022, 2023, 2024, and it was in 2025 that I put that command in there.


6:58 And you can see it drops off, but it’s not going to get rid of that, right? And this is, this is not Google.


7:04 This is Archive.org, very different. But imagine if this was Google, and imagine if your site was brand new, say in 2029, and you had that disallow on it. It wouldn’t be able to be found in Archive.org.


7:21 And that’s essentially what this client did with their brand new website, except they didn’t block out Archive.org. They blocked out Google and all the other search engines.


7:32 Again, we don’t want that. All they had to do was remove that line, and Google will show up and start to crawl and index that website.


7:40 I hope this has been helpful.
