SEO best practices
Best practices for search engine optimization of web sites
SEO is a set of best practices whose purpose is to increase the visibility and relevance of a web site by having it achieve the highest possible ranking in search engine results.
- indexing : storage of data collected during the crawling of a site so links to it can be displayed in search engine results.
- de-indexing : removal of the reference to a given site / page from the indexed data.
- ranking : ordering of search engine results by relevancy to the original user query.
- organic traffic : site traffic generated through clicks on indexed, non-sponsored search engine results.
- impressions : number of times a link to the site was displayed in search engine results.
- clicks : number of times someone clicked on a link to the site from search engine results.
- clickthrough rate (CTR) : clicks / impressions ratio.
- backlinks : external links to the site from other sites (an indication of quality and trustworthiness).
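
The clickthrough rate defined above is a simple ratio. A minimal sketch of the computation, assuming the counts come from some analytics export (the function name and the figures are hypothetical):

```python
def clickthrough_rate(clicks: int, impressions: int) -> float:
    """Return the CTR as a fraction; 0.0 when there are no impressions."""
    if impressions <= 0:
        return 0.0
    return clicks / impressions


# Hypothetical figures: 50 clicks for 2,000 impressions.
ctr = clickthrough_rate(50, 2000)
print(f"CTR: {ctr:.1%}")  # CTR: 2.5%
```
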
- Search engines must collect information about a web page in order to index it.
- To do so, search engines use web crawlers (automated bots that emulate user agents).
- The process for the indexing of a URL by a web crawler is as follows :
  - Retrieve the URL of a web page (through URL discovery, links or sitemaps).
  - Add the URL to a crawl queue (an algorithmic process determines which sites/pages to crawl and how often).
  - Issue an HTTP request to the URL (a URL may be indexed only if it returns HTTP 200 and contains valid HTML).
  - Add the returned HTML to the render queue (optional, used mainly for web pages that use client-side rendering).
  - Parse the returned HTML or the rendered page, extract information from it and index the URL.
Notes :
- HTTP redirects (3xx) do not prevent a URL from being indexed (web crawlers follow redirects).
- Other HTTP responses important for SEO performance are 404 (not found), 410 (gone), 500 (internal server error) and 503 (service unavailable).
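
The crawl steps and the handling of HTTP statuses described above can be sketched as a small queue-processing loop. This is only an illustration under stated assumptions: the `crawl` function, the `Fetcher` signature and the in-memory `FAKE_SITE` are hypothetical stand-ins for a real crawler and a real network, and the optional rendering step is left out.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher signature: returns (status code, HTML body or redirect target).
Fetcher = Callable[[str], Tuple[int, str]]


def crawl(seed_urls: List[str], fetch: Fetcher, max_redirects: int = 5) -> Dict[str, str]:
    """Process a crawl queue and return {final_url: html} for indexable pages."""
    queue = deque(seed_urls)
    index: Dict[str, str] = {}
    while queue:
        url = queue.popleft()
        status, payload = fetch(url)
        hops = 0
        # Follow 3xx redirects; in this sketch the payload carries the target URL.
        while 300 <= status < 400 and hops < max_redirects:
            url = payload
            status, payload = fetch(url)
            hops += 1
        # Only a 200 response with an HTML body is indexed.
        if status == 200 and "<html" in payload.lower():
            index[url] = payload
    return index


# In-memory stand-in for the network (assumption, for illustration only).
FAKE_SITE = {
    "https://www.example.com/": (200, "<html><body>Home</body></html>"),
    "https://www.example.com/old": (301, "https://www.example.com/new"),
    "https://www.example.com/new": (200, "<html><body>New</body></html>"),
    "https://www.example.com/gone": (410, ""),
}


def fake_fetch(url: str) -> Tuple[int, str]:
    return FAKE_SITE.get(url, (404, ""))


indexed = crawl(
    [
        "https://www.example.com/",
        "https://www.example.com/old",
        "https://www.example.com/gone",
        "https://www.example.com/missing",
    ],
    fake_fetch,
)
print(sorted(indexed))
```

The redirected page is indexed under its final URL, while the 410 and 404 responses never reach the index.
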
The internet is constantly being crawled by all sorts of bots for all kinds of purposes. At this stage it is important to distinguish the good bots from the bad bots :
- Good bots are run by companies and integrate the site into the wider internet experience :
  - Search engine bots (web page indexing)
  - Copyright bots (copyrighted content verification)
  - Status detector bots (site outage detection)
  - Feed bots (search the site for content to add to a news feed)
- Conversely, a bad bot implements a malicious use case of crawling the web, for example :
  - Credential stuffing bots (run a stolen credentials dataset against the service to gain authentication)
  - Content scraping bots (automatically download all the content served)
  - Spam bots (push spam messages to the service)
- A `robots.txt` file defines a set of rules that good bots will always follow (on the other hand, bad bots will always ignore this file).
- It has to be served statically at the root of the site : `https://www.example.com/robots.txt`.
- The simplest and safest option for managing bots is to maintain an allowlist of authorized bots in `robots.txt` :
  - Bots can be identified with their `user-agent` string, their IP address or both (the rules in `robots.txt` itself only match on the user agent).
  - `robots.txt` must list all the site routes that authorized bots can crawl (and index in the case of SEO).
  - It also must block all other site routes from being crawled.
- Authorizations for specific routes preempt authorizations for more generic ones, as follows :
# allow google and yahoo crawlers on /homepage so it is indexed in both search engines
# (a crawler only obeys the most specific group matching it, so the block rule is repeated here)
User-agent: Googlebot
User-agent: Yahoo! Slurp
Allow: /homepage
Disallow: /
# block all site routes for all other crawlers ...
User-agent: *
Disallow: /
# include the path to sitemap.xml ...
Sitemap: https://www.example.com/sitemap.xml
- As a result, authorization for the `/` route is the de facto default authorization.
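
One way to check how such rules are interpreted is Python's standard `urllib.robotparser` module. The sketch below parses a rule set in the spirit of the example above (the `ROBOTS_TXT` string, the `allowed` helper and the `SomeOtherBot` agent are assumptions made for illustration; `site_maps()` requires Python 3.8+). Note that Python's parser applies the first matching rule in file order, whereas some crawlers apply the most specific match, so the `Allow` line is listed before the `Disallow` line to get the same outcome with both strategies.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule set: Googlebot may crawl /homepage only,
# every other crawler is blocked everywhere.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /homepage
Disallow: /

User-agent: *
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may crawl `path` on the example site."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(allowed("Googlebot", "/homepage"))     # True
print(allowed("Googlebot", "/admin"))        # False
print(allowed("SomeOtherBot", "/homepage"))  # False
print(parser.site_maps())
```
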
Notes :
- `robots.txt` follows the robots exclusion protocol and the sitemap protocol.
- More often than not, the wildcard `*` is used to specify the user agent even for the `Allow` directives.
- Subdomains need their own `robots.txt` file.
- Here is an exhaustive list of web crawlers user agent strings.
- When `robots.txt` isn't enough, bots exclusion at the page level can be achieved with Meta robot tags.
- Site maps are files following an XML-based inclusion protocol that complements the robots exclusion protocol.
- They provide more information about specific URLs of the site (change frequency, priority, last modification date, etc).
- A site map can list up to 50,000 URLs and its uncompressed size must not exceed 50MiB. Site maps also support gzip compression.
- A site can have multiple site maps. In this case, they are referenced by sitemap index files.
- Site maps are especially useful in specific scenarios :
  - The site hosts a very large number of pages (online stores, documentation websites, etc).
  - Most pages of the site do not have backlinks and are not linked to one another (and thus may escape the web crawlers' discovery phase).
  - Site pages are dynamic and change often; in this case site maps will keep the web crawlers up to date.
  - Site pages contain a lot of rich media content (video, images, etc).
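
As an illustration of the format described above, here is a minimal sketch that builds a site map with Python's standard `xml.etree.ElementTree` module. The `build_sitemap` helper, the URLs and the dates are hypothetical; only the `loc` and `lastmod` elements are emitted. The serializer escapes `&`, `<` and `>` in element text on its own, so values do not need to be escaped by hand.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit mentioned above


def build_sitemap(entries: List[Tuple[str, str]]) -> str:
    """Build a site map from (loc, lastmod) pairs and return it as an XML string."""
    if len(entries) > MAX_URLS:
        raise ValueError("too many URLs for one site map, use a sitemap index")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages; the second URL contains a character that must be escaped.
xml_text = build_sitemap(
    [
        ("https://www.example.com/homepage", "2024-01-15"),
        ("https://www.example.com/search?q=seo&page=2", "2024-01-10"),
    ]
)
print(xml_text)
```
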
Notes :
- XML formatting requires any value to be escaped with entity codes for the characters `&`, `'`, `"`, `<` and `>`.
- Here's a useful site map validator.
- When a site serves duplicate content (different urls serving the same page), canonical tags can be used to de-duplicate pages and keep the indexed results relevant.
- Search engine web crawlers consider the same page served through HTTP and HTTPS as two different results - duplicated content is always an issue to be considered for SEO performance.
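
To illustrate the de-duplication idea in the notes above, here is a minimal sketch that maps URL variants of one page to a single canonical URL and renders the matching tag. The normalization policy (force `https`, lowercase the host, drop the query string and fragment) is an assumption chosen for the example, not a universal rule, and both helper names are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Force https, lowercase the host, drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc.lower(), parts.path or "/", "", ""))


def canonical_tag(url: str) -> str:
    """Render the <link rel="canonical"> tag for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}">'


# Two variants of the same page resolve to one canonical URL.
print(canonical_tag("http://www.example.com/homepage?utm_source=newsletter"))
print(canonical_tag("https://WWW.example.com/homepage"))
```
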
- Once a web page has been crawled, search engines must be able to understand its content in order to extract relevant information from it.
- The page will then be indexed in such a way that it will appear as a relevant result for user search queries.
- To achieve that, sites and pages have to follow a number of best practices.
- Different options are available when rendering a page, from the most to the least SEO friendly :
page rendering method | pros | cons |
---|---|---|
Static Site Generation (SSG) | pre-rendering at build time | contains static content only |
Server-Side Rendering (SSR) | supports dynamic content | pre-rendering on request |
Incremental Static Regeneration (ISR) | per-page pre-rendering | performance is impacted |
Client Side Rendering (CSR) | supports dynamic content | no content at page load |
- How much the rendering method matters depends on how critical effective SEO is to the site's performance.
- Rendering methods should be chosen on a per-page basis if possible (for instance Next.js allows it).
- The organization of a site's routes should follow these principles :
  - semantics :
    - Use words instead of ids or sequence numbers.
    - Use the same character to separate words in route segments (for example `website-seo-basics`).
  - consistent pattern :
    - Starting at `/`, route segments should be explicit and go from general to specific.
    - Each route segment has to be relevant to the previous route segment.
  - keywords focused :
    - Keywords in the route segments should match the `<meta name="keywords">` in the page's `<head>` tag.
  - avoid parameters and query strings :
    - They are rarely semantic.
    - Other options exist (dedicated cookies, async HTTP to non-exposed endpoints, etc).
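
The route principles above (words rather than ids, one consistent separator, general-to-specific segments) can be sketched with a small slug helper. The `slugify` and `build_route` functions and the sample titles are hypothetical, and stripping accents to ASCII is a simplifying assumption.

```python
import re
import unicodedata


def slugify(text: str, separator: str = "-") -> str:
    """Turn a human-readable title into a lowercase, separator-joined route segment."""
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return separator.join(words)


def build_route(*segments: str) -> str:
    """Join segments from general to specific, starting at the root."""
    return "/" + "/".join(slugify(segment) for segment in segments)


print(build_route("Guides", "Website SEO Basics"))  # /guides/website-seo-basics
```
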
- Adding `<meta>` tags to a site page is critical for SEO, especially `title`, `description` and `keywords`.
- They greatly improve the understandability (and thus the visibility) of the page for web crawlers and for search engine users.
meta tag name | meta tag content |
---|---|
title | should contain a clear value proposition |
description | should expand on the title as it will show in the search engine impressions |
keywords | words relevant to the page's content separated by commas, should match the route segments (5 max) |
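
A minimal sketch of how the three values in the table above could be rendered into a page's `<head>`. The `head_tags` helper and the sample values are hypothetical. In HTML the title is expressed with the `<title>` element rather than a `<meta>` tag, and the five-keyword limit is the guideline from the table.

```python
from html import escape
from typing import List

MAX_KEYWORDS = 5  # guideline from the table above


def head_tags(title: str, description: str, keywords: List[str]) -> str:
    """Render the title, description and keywords tags with escaped values."""
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError(f"use at most {MAX_KEYWORDS} keywords")
    joined_keywords = ", ".join(keywords)
    return "\n".join(
        [
            f"<title>{escape(title)}</title>",
            f'<meta name="description" content="{escape(description)}">',
            f'<meta name="keywords" content="{escape(joined_keywords)}">',
        ]
    )


tags = head_tags(
    "SEO basics & tips",
    "A short guide to crawling, indexing and ranking.",
    ["website", "seo", "basics"],
)
print(tags)
```
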
- `<meta>` tags also support the Open Graph protocol, which relies on the non-standard `property` attribute.
- While Open Graph is not relevant to SEO, it is very useful in that it allows the customization of links shared on social media.
- Open Graph is supported on many platforms including Facebook, Twitter, LinkedIn and Discord.
Notes :
- Opengraph and twitter cards 101
- Here's a website that offers a free opengraph `<meta>` tags generator and shared link previews for any site.
- Twitter cards homepage and reference
- While AMP seems not that relevant in the context of modern SEO and rendering methods, it still exists (apparently).
- Details on the Google page ranking algorithm.