Have you ever wanted to know how Google really works? Well, this article attempts to not only explain how Google works but to make our own search engine!

I am in no way an expert at search engines, I am an undergraduate Computer Science student at the University of Liverpool so this article is heavily researched and referenced and was created to increase my understanding of search engines.

# Introduction

When Google first started out in the 90s the one thing that set them apart from their competitors is that its search results always seem to show the most relevant and helpful results. Google’s page ranking algorithm was at least 10 times better than any other search engine at the time.

Search engines typically have 3 main objectives:

1. Crawl the web and find all the web pages with public access
2. Index the data from step 1 so the pages can be searched efficiently
3. Rate the importance of each page in the database.

# Web Crawlers

A webcrawler, also called a spider crawls the web looking for new websites. Once a website has been found, the spider regularly checks to see if the information has been updated and it crawls all links on that website to find other websites.

Wikipedia gives a good oversight into the details of a web crawler:

A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites it copies and saves the information as it goes. The archives are usually stored in such a way they can be viewed, read and navigated as they were on the live web, but are preserved as ‘snapshots’.[4]

The archive is known as the repository and is designed to store and manage the collection of web pages. The repository only stores HTML pages and these pages are stored as distinct files. A repository is similar to any other system that stores data, like a modern day database. The only difference is that a repository does not need all the functionality offered by a database system. The repository stores the most recent version of the web page retrieved by the crawler.[5]

The large volume implies the crawler can only download a limited number of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change can imply the pages might have already been updated or even deleted.

The number of possible URLs crawled being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards et al. noted, “Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.”[6] A crawler must carefully choose at each step which pages to visit next.

Note that a web crawler must start with a list of URLs. You may be wondering “how did web crawlers even start out if search engines did not exist? How would one know what URLs to start with?”. Well, Tim Berners-Lee kept a list in his backpocket of all the URLs of all the websites on the web at the time according to this. This was probably very common, since back then only 2 maybe 3 websites existed at any given time and the adoption of the world wide web was very slow in the first year or two.

Tim Berners-Lee actually has a FAQ on this website which answers some very interesting questions about the world wide web.

One question that you may still have is “what is the difference between the internet and the web?”. The Computer History Musuem has an easy to understand answer

The Internet, linking your computer to other computers around the world, is a way of transporting content. The Web is software that lets you use that content…or contribute your own. The Web, running on the mostly invisible Internet, is what you see and click on in your computer’s browser.

The Internet’s roots are in the U.S. during the late 1960s. The Web was invented 20 years later by an Englishman working in Switzerland—though it had many predecessors.

To keep things “interesting,” many people use the term Internet to refer to both.

## Designing a crawler

We need to design what the crawler will do, because making software right off the bat is just plain stupid. So we want a box called website and items in the website with links in them, where each link is a mini-box with other links in it until it stops being a box. This is kind of confusing so…

Each website will contain links in it that send it to other websites. Each of those websites also have links in them which send them off to other websites. So we start off with one URL (probably Hacker News) and each link will also be a website with lniks and so on.

+-----------------------------------------------------------------+
|                    HACKERNEWS                                   |
|                                                                 |
|                                                                 |
|    +----------------------+   +---------------------------+     |
|    |      BBC             |   |          NETFLIX          |     |
|    |                      |   |                           |     |
|    |    +---------------+ |   |   +------+                |     |
|    |    |    GUARDIAN   | |   |   |  BBC |       +------+ |     |
|    |    |               | |   |   |      |       |  Yahoo |     |
|    |    |               | |   |   +------+       |      | |     |
|    |    |               | |   |                  +------+ |     |
|    |    |               | |   |                           |     |
|    |    +---------------+ |   |                           |     |
|    |                      |   |                           |     |
|    +----------------------+   +---------------------------+     |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
+-----------------------------------------------------------------+


Transcription: A large box labelled “HACKERNEWS” which contains two smaller boxes respectively named “BBC” and “NETFLIX”. Each of the two boxes also have boxes inside of them. The smaller box inside the larger HACKERNEWSBOX labelled BBC contains 1 smaller box labelled GUARDIAN. The Netlfix box contains 2 boxes, YAHOO and BBC. In total the YAHOO, BBC and GUARDIAN boxes are boxes inside boxes inside a box. The author would like to note that he is not an expert at transcriptions so I apologise to any transcription experts cringing at how bad I am, I just wanted to make this article more accessible.

So it’s links all the way down. Something cool to note here is that links inside links inside links can point to the original parent link.

So what we want is for our crawler to be able to tell the parents and children links apart. This is really important in future pageranking algorithms, but at the moment we just need to know this.

One way we could represent this is with a directed graph like so:

                                         +--------------+
+----->              |
|     +--------------+
+---------+
+---------+------------->+  NETFLIX+-------^---------------+
| BBC     |              +---------+       |   GOOGLE      |
+---------+---+                    |       |               |
|                    |       +---------------+
|                    |
|                    |         +----------------+
|                    +--------->   INSTAGRAM    |
+-v----------+                   +----------------+
|  UNIVERSITY|
|            |
+------------+



Transcription: A graph with nodes (boxes) in it with lines connecting boxes to eachother. The lines have a direction (arrow) on them. The first box, BBC points to 2 boxes which are UNIVERSITY and NETFLIX. The UNIVERSITY box does not point to anything however the NETFLIX box points to 3 boxes labelled YOUTUBE, GOOGLE, INSTAGRAM. These 3 boxes do not point to anything.

Something important to note is that it’s entirely possible for a child to be a parent of it’s parent, like so:

+--------------+----> +-------------+
|    NETFLIX   |      |    BBC      |
|              |      |             |
|              |      |             |
+--------------+ <----+-------------+


Transcription: A box labelled NETFLIX is pointing to another box labelled BBC. The BBC box is pointing to the NETFLIX box. They are both pointing at eachother.

Something to note is that some websites don’t want to be crawled. A website will often have a file called robots.txt which gives advice to web robots about how best to crawl their site and whether their site should be crawled at all.

If the robots.txt file contains

User-agent: *
Disallow: /


This means it wants to disallow all robots from visiting any pages on the site. You can still visit the page if you want, but it’s extremely rude and the website won’t like it. We’ll use my personal website in this tutorial (feel free to do the same if you don’t want to be invasive!).

Here’s a small diagram of what my website roughly looks like:

                                    +---------+
|  github |
+-------v---------+
|
|        +--------+
| brandonskerritt.github.io +--------v--------+
+---------------------------+
|
|        +-----------+    +----------+    +--------+
|        |blog posts +---->blogpost 1+----> link 2 |
+--------v-----------+    +--------+-+    +--------+
|
|
|     +-------+
+-------+



TK Transcript

So my website has links to my LinkedIn, GitHub and it also has a link to my blog posts. Each blog post (still under my domain) has outgoing links to other external services (most of the time not on my domain).

## Programming a crawler

Firstly, let’s use the Python package requests to download my page. You can use

pip install Requests


To get the requests module.

We’ll then start writing some code:

import requests

r = requests.get("http://brandonskerritt.github.io/")
if r.status_code != 200:
print("Error. Something is wrong here")


Okay, so my website should hopefully return a status code of 200. We use the requests module to download my webpage (or attempt to) and store it into a variable called “r”.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://brandonskerritt.github.io/")
if r.status_code != 200:
print("Error. Something is wrong here")

soup = BeautifulSoup(r.content)


Beautiful Soup is a library which let’s us parse HTML. Here we are simply importign Beautiful Soup and creating a beautiful soup object in order to use the libraries methods. You can find out how to install Beautiful Soup here.

import requests
from bs4 import BeautifulSoup
import re

r = requests.get("http://brandonskerritt.github.io/")
if r.status_code != 200:
print("Error. Something is wrong here")

soup = BeautifulSoup(r.content, "lxml")

for link in soup.findAll('a', attrs={'href': re.compile("^http")}):


Here we are using regular expressions to only allow websites that with with http because sometimes in HTML you can link to parts of your own website using ID names.

We get this output:

https://medium.com/@brandonskerritt
https://www.hackernoon.com
http://vip.politicsmeanspolitics.com/
https://www.liverpool.ac.uk/
https://www.liverpool.ac.uk/sports/sports-clubs/student-sports/karate/
https://github.com/brandonskerritt/Computer-Science/tree/master/Course-rep
https://www.startupschool.org/
https://medium.com/@brandonskerritt51
http://github.com/brandonskerritt/
https://github.com/chrisjim316/Liljimbo-Chatbot
https://devpost.com/software/bet-a-way
https://devpost.com/software/bank-of-alexa
https://devpost.com/software/ka-bomb-com


So what we now have is our original website with all these wonderful links on that website:

                                +-------------------------------------------------------------------------+
| https://medium.com/@brandonskerritt                                     |
| https://www.hackernoon.com                                              |
| http://vip.politicsmeanspolitics.com/                                   |
| https://www.li^erpool.ac.uk/                                            |
| https://www.liverpool.ac.uk/sports/sports-clubs/student-sports/karate/  |
+--------------------------+    | https://github.com/brandonskerritt/Computer-Science/tree/master/Course- |
| BRANDONSKERRITT.GITHUB.IO+----> https://www.liverpoolguild.org/main-menu/representation/halls/halls-stu |
+--------------------------+    | https://www.startupschool.org/                                          |
| https://medium.com/@brandonskerritt51                                   |
| http://github.com/brandonskerritt/                                      |
| https://github.com/chrisjim316/Liljimbo-Chatbot                         |
| https://devpost.com/software/bet-a-way                                  |
| https://devpost.com/software/bank-of-alexa                              |
| https://devpost.com/software/ka-bomb-com                                |
|                                                                         |
+-------------------------------------------------------------------------+


TK Transcription

Okay, so now we have code that can get all the links on a webpage. Let’s get all the links from those links too.

    import requests
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(r.content, "lxml")

if r.status_code != 200:
print("Error. Something is wrong here")
else:
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):

for i in links:



So this code gets every link on my website, and then it calls a function which gets every link from the links from my website. Once it’s done that, it sends those links back to the same function and carries on doing this…. forever, really. TK

This approach ignores some major points of web crawling:

• Robots.txt isn’t followed
• What if we have 2 websites that link to eachother? Yes.
• It’ll explore the same websties over and over again.
• How do we know when a link is a parent of links or a child of links?

The Pareto Principle states that for many events roughly 80% of the effects comes from 20% of the sources. This principle can be extended and used in many other places. 80% of the websites we crawl will be the same 20% of the websites again and again.

Now we have a rough understanding of how web crawling works, we can begin to use Python libraries to make life easier for us.

# Hello, Scrapy.

Scrapy is a Python library for scraping the web.

From Wikipedia

Scrapy is a free and open source web crawling framework, written in Python.

TK install scrapy

A normal crawl is of one specific website, usually an online product business like eBay or Amazon.

# Page ranking

Here is an image of a very small web

Taken from here.

We will have an importance score that will represent how important a webpage is. One of the core ideas of assigning a score is that the page’s score is derived from the links made to the page from other web pages. The links to a given page are called the backlinks for that page.

In such a scoring system a democracy is created where web pages “vote” for the importance of web pages by lniking to them.

The web seen in the imaeg is an example of a directed graph, a graph that holds direction. A typical graph will consist of a set of vertices (in this context, they are called the web pages) and a set of edges. Each edge joins a pair of vertices (webpages). The graph is considered directed because each edge has a direction.

An approach one could make is to take $x_k$ as the number of backlinks for any page, k. In the image above we then have:

As can be seen here the most “important” web page is $x_3$ since it is linked to by 3 pages.

This approach ignores a large problem, mainly that a link from an already important page should have more weight to it than a link from a non-important page. In the image above pages 1 and 4 both have two backlinks: each links to the other, but page 1’s backlink is from an important web page, page 3. While page 4’s backlink is from an unimportant page 1.

In our new and improved version we don’t want pages to be able to “vote” for themselves. We don’t want a page to gain popularity by linking to lots of other pages. If page j contains $x_j$ links, one of them linking to page k, then we will boost page k’s score by $\frac{x_j}{n_j}$.

In this version each webpage gets “score” of, say 100, and that score is evenly divided up by all the links on the webpage. If a webpage has 4 links on it, then each link will get an added score of:

This method works really well because it scales. All websites start out with a score of 100. But a website can increase its score by being linked to from other websites. Take Netflix for example. Netflix has lots of websites linking to it, increasing its score to say 64,000. If Netflix links to 1 site, that one site will get a score of 64,000.

This follows the golden rule of “an endorsment from a famous person is better than 10 endorsements from non-famous people”. Because Netflix is more popular (popular here meaning it is linked to by many websites) than Netflix linking to a website is worth more than a 200 small websites linking to a website.

Of course the numbers used are much, much larger. So large in fact that it’s simply not worth it to create 5000 webpages that all link to one page, because it won’t make much of a difference.

When you throw in large websites linking to other large websites, you have this monopoly of virtual internet points (did someone say Reddit?).

This makes you think that linking to no websites on your own website is a good idea, but this can be just as bad. This is just one rule that was originally used by Google, Google now has 200 rules. The website linked there may not even be correct. Google doesn’t really share their metrics because people could growth hack their way to the top of the search page.

Here’s a good explanation from Wikipedia. I know that my lecturers tell me every 5 minutes that Wikipedia isn’t a relibable source, but their explanations are good.

Mathematical PageRanks for a simple network, expressed as percentages. (Google uses a logarithmic scale.) Page C has a higher PageRank than Page E, even though there are fewer links to C; the one link to C comes from an important page and hence is of high value. If web surfers who start on a random page have an 85% likelihood of choosing a random link from the page they are currently visiting, and a 15% likelihood of jumping to a page chosen at random from the entire web, they will reach Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. In the presence of damping, Page A effectively links to all pages in the web, even though it has no outgoing links of its own.

Previous articles

Link to github repo

CHECK TKS

title ideas:

learn search engines by building one

learn how google works by building google

https://blog.siliconstraits.vn/building-web-crawler-scrapy/

https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/

# References

[0] - https://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html

[1] - https://moz.com/learn/seo/robotstxt