What is Google?
Google is an American Internet and software corporation that specializes in Internet search, cloud computing and advertising technologies. It develops and hosts a variety of Internet-based services and products, and it generates most of its profit from advertising through a program called AdWords. The company was created by two students of Stanford University.
Google has grown enormously since its inception, building a chain of products, acquisitions and partnerships beyond the company’s main web search engine. The company also offers online software, including an email service, an office suite and social networking. Other products are available for the desktop, such as a web browser, photo organizing and editing software, and instant messaging. Google also leads development of the Android mobile operating system, along with a browser-only operating system.
Google runs on over a million servers in data centers around the globe. It processes over a billion search requests and over twenty-four petabytes of user-generated data every day. It was listed as the most visited U.S. website in 2009, and a large number of international Google sites have been listed in the top 100.
What are the basics of Google?
It seems fairly simple, but how does Google function? A user runs a search on Google and is immediately given a list of links from all over the Internet. Imagine that the Internet is a large book with a very detailed index that tells you exactly where everything is located. When a user runs a search, programs check the index to work out the most appropriate results to give to the user. Three processes are needed to deliver search results: crawling, indexing and serving.
What is crawling?
Crawling is the process in which Googlebot finds new and updated pages that need to be added to the Google index. A large number of computers fetch, or “crawl”, billions of web pages located on the Internet. The program that does the crawling is called Googlebot. Googlebot uses an algorithm to decide which websites to crawl, how often to crawl them and how many pages to fetch from each website.
The process of crawling starts with a list of web page URLs created from previous crawls. The list is then enlarged with Sitemap data provided by webmasters. As Googlebot visits each website, it detects the links on each page and adds them to a list of pages to be crawled in the future. New websites, changes to existing websites and dead links are all noted and used to update the Google index.
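The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Googlebot's actual implementation: the `FAKE_WEB` dictionary stands in for the real Internet, and the names `crawl`, `fetch_links` and `max_pages` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://c.example/missing"],
    "http://sitemap.example/new": [],
}


def crawl(seed_urls, sitemap_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, returning (crawled pages, dead links)."""
    frontier = deque(seed_urls)      # URLs known from previous crawls
    frontier.extend(sitemap_urls)    # enlarged with webmaster-supplied Sitemap data
    seen = set(frontier)
    crawled, dead = [], []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        links = fetch_links(url)
        if links is None:            # page could not be fetched: note the dead link
            dead.append(url)
            continue
        crawled.append(url)
        for link in links:
            if link not in seen:     # queue newly discovered pages for a later visit
                seen.add(link)
                frontier.append(link)
    return crawled, dead


crawled, dead = crawl(["http://a.example/"], ["http://sitemap.example/new"], FAKE_WEB.get)
```

Here `FAKE_WEB.get` returns `None` for an unknown URL, which the sketch treats as a dead link. A real crawler would make HTTP requests and parse HTML instead.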
What is indexing?
Indexing is the process of scanning crawled pages and creating the index that Google uses to return results when a search is run. Googlebot processes each page it crawls to compile a large index of all of the words it sees and their location on each page. It also processes information in important content tags and attributes, such as Title tags and ALT attributes. Googlebot is able to process a majority of content types.
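An index of "all of the words and their location on each page" is commonly built as an inverted index. The following is a minimal sketch of that idea, assuming pages are already reduced to plain text. The function name `build_index` and the sample pages are made up for illustration, and nothing here is taken from Google's own code.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the pages, and positions within them, where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][url].append(position)
    return index


# Hypothetical page contents; title and ALT text are simply folded into the body here.
pages = {
    "http://a.example/": "Fresh coffee beans",
    "http://b.example/": "Coffee grinders and coffee mugs",
}
index = build_index(pages)
```

Looking up `index["coffee"]` then gives every page containing the word, along with the word positions on each page.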
What is serving?
When a user enters a keyword, phrase or question to search, machines search the index for matching pages and return the results that are most relevant to the user. Over 200 factors determine how relevant a page is. One of these factors is PageRank, a measure of a page’s importance based on the incoming links from other pages. Basically, every link to a page from another website improves that page’s PageRank.
Google constantly works to improve the user experience by identifying spam links and other issues that might negatively affect search results. The most important links are the ones a website earns through the quality of its subject matter.
It is important to every website owner to rank high on search results pages. For this to happen, Google must be able to crawl and index the website correctly. Google provides webmaster guidelines that help owners avoid common pitfalls and improve their website’s ranking.
Google has features designed to help users save time by showing related keywords or phrases, corrected misspellings and other popular topics. The keywords and phrases used by these features are generated automatically by its web crawlers and search programs. They are displayed when Google thinks they might save the user some time in their search. If a website ranks high for a keyword or phrase, it is because Google has determined that the website is among the most relevant results for that keyword or phrase.
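As a rough illustration of serving, the sketch below finds pages that contain every query word and orders them by a single importance score. This is a deliberate simplification: the real system weighs over 200 factors, and the `search` function, the tiny index and the scores are all hypothetical.

```python
def search(query, index, importance):
    """Return pages containing every query word, most important first."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    # Sort by descending importance; ties are broken alphabetically by URL.
    return sorted(matches, key=lambda url: (-importance.get(url, 0.0), url))


# Hypothetical index (word -> pages containing it) and importance scores.
index = {
    "coffee": {"a.example", "b.example", "c.example"},
    "beans": {"a.example", "c.example"},
}
importance = {"a.example": 0.2, "b.example": 0.5, "c.example": 0.3}
results = search("coffee beans", index, importance)
```

Although `b.example` has the highest score, it is left out because it does not contain the word "beans". Matching comes first and ranking second.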
What is PageRank?
PageRank is a link analysis algorithm that is trademarked and used by Google in its search engine. It assigns a numerical weight to each document in a hyperlinked set of documents. Its main function is to measure the relative importance of each document within the set. Once measured, the document is given a rank, and the rank value shows the importance of that particular page. A hyperlink to a page counts as a vote of support for it. A webpage that is linked to by other webpages that have a high PageRank will also rank high. If there are no links to a webpage, then there is nothing supporting that page.
What is Googlebot?
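The "votes of support" idea can be made concrete with the textbook power-iteration form of PageRank. The sketch below assumes a damping factor of 0.85 and a fixed number of iterations. Both are conventional illustrative choices, and the four-page link graph is invented. It is not a description of how Google computes ranks in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link passes on an equal share of this page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "hub"; nobody links to "b" or "orphan".
links = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "orphan": ["hub"],
}
ranks = pagerank(links)
```

In this example "hub" ends up with the highest rank because it receives the most votes. "orphan" and "b", which no page links to, keep only the small baseline share that every page receives.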
Googlebot is a web crawling robot that finds and retrieves webpages and then gives the pages to the indexer. It works very much like a web browser. It sends a request to a web server for a specific web page. It then downloads the entire page and after it has been crawled, it gives the information to the indexer.
Googlebot is made up of many computers that request and fetch webpages far faster than a user could with a web browser. It can request thousands of different webpages at the same time. To avoid overwhelming web servers or crowding out requests from people, it deliberately sends requests to each web server more slowly than it is capable of doing.
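One common way to keep a crawler from overwhelming any single server is to enforce a minimum delay between requests to the same host. The sketch below shows that idea only. The class name `PolitenessScheduler` and the two-second delay are assumptions for illustration, and times are passed in as plain numbers so the example runs without waiting.

```python
from urllib.parse import urlparse


class PolitenessScheduler:
    """Track when each host may next be contacted."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds   # assumed value, for illustration only
        self.next_allowed = {}               # host -> earliest time of next request

    def schedule(self, url, now):
        """Return the time at which `url` may be fetched, reserving that slot."""
        host = urlparse(url).netloc
        fetch_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = fetch_at + self.min_delay
        return fetch_at


scheduler = PolitenessScheduler(min_delay_seconds=2.0)
times = [
    scheduler.schedule("http://a.example/1", now=0.0),
    scheduler.schedule("http://a.example/2", now=0.0),
    scheduler.schedule("http://b.example/1", now=0.0),
    scheduler.schedule("http://a.example/3", now=0.5),
]
```

Requests to `a.example` are spaced two seconds apart, while the request to `b.example` can go out immediately because it targets a different server.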
People who send out spam have unfortunately figured out how to build automated bots that flood the add URL form with millions of URLs pointing to commercial content. Google rejects URLs submitted by spammers using a program that detects attempts to trick it with deceptive tactics. These include hidden text or links on a page, irrelevant words on a page, redirects, creating doorways or domains, and cloaking. Google also tests users of the form to make sure they are not submitting spam.
When Googlebot fetches a page, it extracts all of the links on the webpage and adds them to a queue for crawling. It doesn’t encounter spam very often, because the majority of webmasters only link to what they feel are quality content pages. By collecting links from every webpage it comes across, it can quickly build a list of links that covers a large variety of websites across the Internet. This is called deep crawling. Deep crawling allows Googlebot to search deep within individual websites, and it can take up to a month to finish one website.
Googlebot must handle several different challenges. Since it sends out requests for thousands of pages at the same time, the queue of pages to visit must be constantly compared with what is already in the index. Duplicates in the queue must be removed to keep Googlebot from crawling the same webpage again. Googlebot must also decide how often to visit a webpage. It is a waste of time to re-index a webpage that has not changed, but it is important to make sure that webpages that have been updated are re-indexed.
To keep the index up to date, Googlebot must continually recrawl popular webpages and ones that change frequently. These crawls are known as fresh crawls. Examples include newspapers and stock quotes, which are updated daily. The difference between a fresh crawl and a deep crawl is that fresh crawls return fewer webpages than a deep crawl. The combination of these two types of crawls allows Google to use its resources efficiently and keeps the index as up to date as possible.
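Removing duplicates from the crawl queue usually involves reducing each URL to a canonical form and checking it against what is already known. The sketch below is one simple way to do that with Python's standard library. The normalization rules chosen here (lower-casing the host, dropping the fragment, defaulting an empty path to `/`) are assumptions for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case the scheme and host, and drop the #fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(candidates, already_indexed):
    """Return candidate URLs not yet indexed, without repeats, in first-seen order."""
    seen = {normalize(u) for u in already_indexed}
    fresh = []
    for url in candidates:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


queue = dedupe(
    ["http://A.example/page#top", "http://a.example/page", "http://a.example", "http://b.example/"],
    already_indexed=["http://b.example/"],
)
```

The first two candidates collapse into one entry, and `http://b.example/` is dropped because it is already in the index.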
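One plausible way to split pages between frequent fresh crawls and infrequent deep crawls is to adapt each page's revisit interval to how often its content changes. The sketch below halves the interval when a page has changed and doubles it when it has not. This heuristic, the interval bounds and the function name `next_interval` are all assumptions for illustration. The text does not say how Google actually schedules its crawls.

```python
import hashlib

MIN_INTERVAL_HOURS = 1      # assumed bounds, for illustration only
MAX_INTERVAL_HOURS = 720    # roughly one month


def next_interval(old_hash, new_content, interval_hours):
    """Return (new_hash, new_interval): shrink the interval if the page changed."""
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if new_hash != old_hash:
        interval_hours = max(MIN_INTERVAL_HOURS, interval_hours // 2)
    else:
        interval_hours = min(MAX_INTERVAL_HOURS, interval_hours * 2)
    return new_hash, interval_hours


# A page that keeps changing drifts toward frequent "fresh" crawls...
h, interval = None, 24
for content in ["quote 101", "quote 102", "quote 103"]:
    h, interval = next_interval(h, content, interval)
changing_interval = interval

# ...while a static page drifts toward infrequent "deep" crawls.
h, interval = None, 24
for content in ["about us", "about us", "about us"]:
    h, interval = next_interval(h, content, interval)
static_interval = interval
```

Note that the first visit always counts as a change, because there is no earlier hash to compare against.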