SEO best practices

Best practices for search engine optimization of web sites


SEO fundamentals

Concept

SEO is a set of best practices whose purpose is to increase the visibility and relevance of a website by having it achieve the highest possible ranking in search engine results.

Terminology

  • indexing : storage of data collected during the crawling of a site so links to it can be displayed in search engine results.
  • de-indexing : removal of the reference to a given site / page from the indexed data.
  • ranking : ordering of search engine results by relevancy to the original user query.
  • organic traffic : site traffic generated through clicks on indexed, non-sponsored search engine results.
  • impressions : number of times a link to the site was displayed in search engine results.
  • clicks : number of times someone clicked on a link to the site from search engine results.
  • clickthrough rate (CTR) : clicks / impressions ratio (for example, 40 clicks for 2,000 impressions is a CTR of 2%).
  • backlinks : external links to the site from other sites (an indication of quality and trustworthiness).

Web pages indexing

Search engines web crawlers

  • Search engines must collect information about a web page in order to be able to index it.
  • To do so, search engines use web crawlers (automated bots that emulate regular user agents such as web browsers).

[figure : web crawler process diagram]

  • The process for indexing a URL with a web crawler is as follows :
  1. Retrieve the URL of a web page (through URL discovery, links or sitemaps).
  2. Add the URL to a crawl queue (an algorithmic process determines which sites/pages to crawl and how often).
  3. Issue an HTTP request to the URL (a URL may be indexed only if it returns HTTP 200 and contains valid HTML).
  4. Add the returned HTML to the render queue (optional, used mainly for web pages that use client-side rendering).
  5. Parse the returned HTML/the rendered page, extract information from it and index the URL.

Notes :

  • HTTP redirects (3xx) do not prevent URL indexing (web crawlers follow redirects and index the target URL).
  • Other HTTP responses important for SEO performance are 404, 410, 500 and 503.
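
As an illustrative sketch (URLs and header values are hypothetical), these status codes are typically interpreted by crawlers as follows :

# 3xx : the crawler follows the redirect and indexes the target URL
HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/new-page

# 404 : not found, the URL is eventually dropped from the index if it keeps failing
HTTP/1.1 404 Not Found

# 410 : gone, a stronger signal that the page was removed on purpose
HTTP/1.1 410 Gone

# 503 : temporary outage, the crawler backs off and retries later
HTTP/1.1 503 Service Unavailable
Retry-After: 3600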

robots.txt

The internet is constantly crawled by all sorts of bots for many different purposes, so it is important to distinguish good bots from bad bots :

  • Good bots are run by legitimate companies and integrate the site into the wider internet experience :

    • Search engine bots (web pages indexing)
    • Copyright bots (copyrighted contents verification)
    • Status detector bots (site outages detection)
    • Feed bots (search site for content to add to a news feed)
  • Conversely, bad bots crawl the web for malicious purposes, for example :

    • Credential stuffing bots (run stolen credential datasets against the service to gain authentication)
    • Content scraping bots (automatically download all the content served)
    • Spam bots (push spam messages to the service)
  • A robots.txt file defines a set of rules that good bots will respect (bad bots, on the other hand, will simply ignore this file).

  • It has to be served statically at the root of the site : https://www.example.com/robots.txt.

  • The simplest and safest option for managing bots is to maintain an allowlist of authorized bots in robots.txt :

    • Bots can be identified by their user-agent string, their IP address or both.
    • robots.txt must list all the site routes that authorized bots can crawl (and index, in the case of SEO).
    • It must also block all other site routes from being crawled.
  • Within a user-agent group, rules for specific routes take precedence over rules for more generic ones, as in the following example :

# allow the Google and Yahoo crawlers to crawl /homepage only,
# so it is indexed in both search engines
User-agent: Googlebot
User-agent: Yahoo! Slurp
Allow: /homepage
Disallow: /

# block all site routes for every other crawler
User-agent: *
Disallow: /

# include the path to sitemap.xml
Sitemap: https://www.example.com/sitemap.xml
  • As a result, the Disallow: / rule acts as the de-facto default : any route that is not explicitly allowed will not be crawled.

sitemap.xml

  • Site maps are files following an XML-based inclusion protocol that complements the robots exclusion protocol.

  • They provide additional information about specific URLs of the site (change frequency, priority, last modification date, etc).

  • A site map can list up to 50,000 URLs and its size must be below 50MiB. Site maps also support gzip compression.

  • A site can have multiple site maps. In this case, they are referenced by sitemap index files.

  • Site maps are especially useful in specific scenarios :

    • The site hosts a very large number of pages (online stores, documentation websites, etc).
    • Most pages of the site do not have backlinks and are not linked to one another (and thus may escape the web crawler's discovery phase).
    • Site pages are dynamic and change often; in this case, site maps keep web crawlers up to date.
    • Site pages contain a lot of rich media content (video, images, etc).

Notes :

  • XML formatting requires any value to be escaped with entity codes for the characters &, ', ", < and >.
  • Here's a useful site map validator.
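
As a minimal sketch (URLs, dates and values are hypothetical), a sitemap.xml and a sitemap index referencing multiple site maps could look like this (two separate files shown back to back) :

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml : lists individual URLs with optional metadata -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- & is escaped as &amp; as required by XML -->
    <loc>https://www.example.com/store?item=shoes&amp;color=blue</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap index : references the individual site maps of a large site -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
</sitemapindex>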

Canonical tags

  • When a site serves duplicate content (different URLs serving the same page), canonical tags can be used to de-duplicate pages and keep the indexed results relevant.
  • Search engine web crawlers consider the same page served through http and https as 2 different results; duplicated content is always an issue to consider for SEO performance.
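
As a minimal sketch (the URL is hypothetical), every duplicate variant of the page declares the preferred URL in its <head> :

<head>
  <!-- added to https://www.example.com/homepage, http://www.example.com/homepage,
       https://www.example.com/homepage?ref=campaign, etc. -->
  <!-- tells search engines which single URL should be indexed for this content -->
  <link rel="canonical" href="https://www.example.com/homepage" />
</head>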

Search results ranking

  • Once a web page has been crawled, search engines must understand its content in order to extract relevant information from it.

  • The page will then be indexed in such a way that it will appear as a relevant result for user search queries.

  • To achieve that, sites and pages have to follow a number of best practices.

SEO friendly page rendering

  • Different options are available when rendering a page, from the most to the least SEO friendly :
page rendering method                  | pros                        | cons
Static Site Generation (SSG)           | pre-rendering at build time | contains static content only
Server-Side Rendering (SSR)            | supports dynamic content    | pre-rendering on request
Incremental Static Regeneration (ISR)  | per-page pre-rendering      | performance is impacted
Client Side Rendering (CSR)            | supports dynamic content    | no content at page load
  • The choice of rendering method matters in proportion to how critical effective SEO is to the site's performance.

  • Rendering methods should be considered on a per-page basis when possible (next.js, for instance, allows it).

SEO friendly routes design

  • The organization of a site's routes should follow these principles (illustrated after the list) :
    • semantics :
      • Use words instead of ids or sequence numbers.
      • Use the same character to separate words in route segments (for example website-seo-basics).
    • consistent pattern :
      • Starting at /, route segments should be explicit and go from general to specific.
      • Each route segment has to be relevant to the previous route segment.
    • keywords focused :
      • Keywords in the route segments should match the <meta name="keywords"> in the page's <head>.
    • avoid parameters and query strings :
      • They are rarely semantic.
      • Other options exist (dedicated cookies, async HTTP to non-exposed endpoints, etc).
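
For example (both URLs are hypothetical), a semantic, keyword-focused route compared to a less SEO friendly one :

# semantic segments, general to specific, keyword focused
https://www.example.com/blog/seo/website-seo-basics

# ids and query strings carry little meaning for crawlers or users
https://www.example.com/pages?id=4821&cat=7
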
  • Adding <meta> tags to a site page is critical for SEO, especially title, description and keywords.

  • They greatly improve the understandability (and thus the visibility) of the page, both for web crawlers and for search engine users.

meta tag name | meta tag content
title         | should contain a clear value proposition
description   | should expand on the title, as it will show in search engine impressions
keywords      | words relevant to the page's content, separated by commas; should match the route segments (5 max)
  • <meta> tags also support the opengraph protocol, which relies on the non-standard property attribute.

  • While opengraph is not relevant to SEO, it is very useful in that it allows the customization of links shared on social media.

  • Opengraph is supported on many platforms including Facebook, Twitter, Linkedin and Discord.
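
As a minimal sketch (titles, descriptions and URLs are hypothetical), a page's <head> combining standard meta tags and opengraph tags could look like this :

<head>
  <!-- standard tags used by search engines -->
  <title>Website SEO basics - improve your search ranking</title>
  <meta name="description" content="A practical introduction to SEO : indexing, ranking and the best practices that make a site visible in search results." />
  <meta name="keywords" content="website, seo, basics, ranking, indexing" />

  <!-- opengraph tags (non-standard property attribute) used when the link is shared on social media -->
  <meta property="og:title" content="Website SEO basics" />
  <meta property="og:description" content="A practical introduction to SEO." />
  <meta property="og:image" content="https://www.example.com/images/seo-cover.png" />
  <meta property="og:url" content="https://www.example.com/blog/seo/website-seo-basics" />
</head>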
