How to Clone Websites in Under 5 Minutes

Daniel Pericich
7 min read · May 11, 2023

Photo by freestocks on Unsplash

We have all been there before. Our client has given us a tight deadline to update their website but has stopped responding to our messages. We need the source files to make fixes and updates, but we do not have access to the hosting account or a GitHub repository with the code. We could wait, or we could get the code another way: if the website is simple and not built with a framework or a platform like Shopify or WordPress, we can clone its pages.

What is a Website Made of?

Modern web development can seem overwhelming and confusing with all the frameworks available. You can use templating languages to create hydrated, dynamic HTML files. You can streamline CSS with styling frameworks like Bootstrap or Tailwind, and there is a new flavor of JavaScript framework every day that promises to 10x some aspect of your app.

These are all flashy extensions for web development, but web development still boils down to three components: HTML to structure your page, CSS to style your page, and JavaScript to make your page functional. That is it.

Figure 1. The pillars of modern web development.

If you have these files, you have everything you need to make a functional webpage. For cloning, we need these assets, and we will see that there are a few ways to collect them.

Using cURL to get the Source Code

We want the code. Entering the target URL in the browser's address bar renders the webpage, but that is not the source code. So how do we get it? There are two ways to get the source code, and the first tool we will look at is cURL.

The open-source tool cURL is both a library and a command line tool that lets us send a request to a target URL and read the response. In other words, we decide which page of a website we are interested in and use cURL to return the code for that page.

This exercise will look into getting a local version of the Apple homepage working. Our first step is to run the cURL command in our terminal to retrieve the target page source code:

Figure 2. cURL command to get source code of the index page for Apple.

Running this command returns the HTML passed back from a GET request to Apple’s servers. Printing it to the terminal is not very helpful, so we will modify our command:

Figure 3. Writing cURL output to an HTML file.

Instead of printing the Apple homepage HTML to the window, we will write it to an HTML file, apple.html.
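
For readers who would rather script this step, here is a minimal Python sketch of the same idea: make a GET request, then write the response body to apple.html. It is an alternative to the cURL commands in the figures, not a reproduction of them. The function names fetch_html and save_page are our own, and some servers answer non-browser clients differently, so the result may not match what cURL returns.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Make a GET request and return the response body decoded as text."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_page(url: str, output_path: str) -> Path:
    """Fetch a page and write its HTML to a local file."""
    path = Path(output_path)
    path.write_text(fetch_html(url), encoding="utf-8")
    return path


# Example (needs network access):
# save_page("https://www.apple.com", "apple.html")
```

As with cURL, this only retrieves the HTML document itself; none of the stylesheets, scripts, or images it references are downloaded.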

To preview this file as a webpage in the browser, we can start a live server. If you have the Live Server extension installed in VS Code and apple.html open, press Cmd+L followed by Cmd+O (Alt+L followed by Alt+O on Windows and Linux) to start it. With that command, we get the following webpage:
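
If you are not working in VS Code, a static file server from Python's standard library gives a similar preview. This is a sketch under our own assumptions: make_preview_server is a hypothetical helper name, port 8000 is arbitrary, and the page will not reload automatically when files change.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_preview_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Build a local static file server rooted at `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to localhost only so the preview is not exposed to the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Example: serve the current folder, then open http://127.0.0.1:8000/apple.html
# server = make_preview_server(".")
# server.serve_forever()  # stop with Ctrl+C
```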

Figure 4. Local version of apple.com missing assets.

Wow, this looks terrible. What we wanted was something more like this:

Figure 5. Expected version of apple.com webpage.

Why does our page look so terrible compared to the actual website? It has to do with how a webpage is pulled together and leads us to the second way to get the source code for a website.

Relative Paths and CDNs

Before we get into the second method for getting our functional source code, we need to discuss how assets are loaded. In the past, servers stored and served assets from one location. This meant all your HTML files, images, fonts, and script files would be located and delivered from one place.

As the internet grew and tools like script libraries and image/icon packages became more common, developers started leveraging assets stored on other servers. This development led to a blend of assets from the site's own server and assets from remote servers to make beautiful, functional web pages.

One of the most common ways to access remote assets is through a Content Delivery Network (CDN). CDNs are geographically distributed groups of servers that cache content close to an end user. CDNs offer benefits like decreased webpage load times and resiliency to server downtime. However, we only care about how webpages call CDNs to receive assets.

Figure 6. Diagram from Cloudflare showing how CDNs work.

Why is this important? Because when a page gets assets from a CDN, our server does not need to contain them. This helps when cloning a webpage: we do not need to store and serve those assets ourselves, because the tags in the head section of our HTML that reference them by full URL will keep working.

The assets we do need to worry about are the relative-path assets, those loaded from the site's own server. Our cloned webpage is not served from Apple’s servers, so we need local copies of every asset their servers would have provided. The cURL command only grabs the HTML for us, so we need a new approach for retrieving the assets that make this page functional.
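
To see why a relative path breaks once the page leaves its original server, it helps to look at how a browser resolves one. The sketch below uses urljoin from Python's standard library, which follows the same resolution rules. The asset paths are made up for illustration and are not taken from Apple's page.

```python
from urllib.parse import urljoin


def to_absolute(page_url: str, asset_path: str) -> str:
    """Resolve an asset reference the way a browser does for a given page."""
    return urljoin(page_url, asset_path)


# Hypothetical paths, for illustration only.
original = to_absolute("https://www.apple.com/", "/css/main.css")
# -> "https://www.apple.com/css/main.css": served by the original site.

local = to_absolute("http://127.0.0.1:5500/apple.html", "/css/main.css")
# -> "http://127.0.0.1:5500/css/main.css": our local server has no such file.

cdn = to_absolute("http://127.0.0.1:5500/apple.html", "https://cdn.example.com/lib.js")
# -> "https://cdn.example.com/lib.js": full URLs are unaffected by the move.
```

The first result is also the address to download the asset from when we need a local copy.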

Using the Network Tab to get the Source Code

We have discussed cURL, CDNs, and relative paths, and now we are ready to get the assets to make our page functional. There are a few ways to check what we are missing. The first is to go through the HTML.

Start in the head section at the top of the HTML file. We want to look for <link> tags that pull stylesheets, icons, and other assets into our webpage; scripts and images are referenced in a similar way through the src attribute of <script> and <img> tags. A quick search on our webpage shows 148 instances of the <link> HTML tag. Here are some of our first hits:

Figure 7. Some results from searching for link tags in our HTML.
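
Searching by hand works, but the same inventory can be sketched with Python's built-in HTML parser. The class and function names below (AssetFinder, find_assets) are our own, and the sketch only looks at <link href>, <script src>, and <img src>; assets referenced from inside CSS files or loaded by JavaScript will not show up.

```python
from html.parser import HTMLParser


class AssetFinder(HTMLParser):
    """Collect the URLs referenced by <link href>, <script src>, and <img src>."""

    URL_ATTRIBUTE = {"link": "href", "script": "src", "img": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.assets = []  # list of (tag, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTE.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.assets.append((tag, value))


def find_assets(html: str) -> list:
    """Return every (tag, url) pair found in the HTML text."""
    finder = AssetFinder()
    finder.feed(html)
    finder.close()
    return finder.assets


# Example, assuming apple.html was saved earlier:
# from pathlib import Path
# for tag, url in find_assets(Path("apple.html").read_text(encoding="utf-8")):
#     print(tag, url)
```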

Not every <link> tag requires action, but we will not know this until we look at the href attribute value. If the href value is a complete URL like “https://my-cdn-assets-puppy-image.com” we do not have to worry about the link. This is a full link to a CDN or other asset server that will serve what we need.

We must address the <link> tags if they look like the ones in Figure 7. Here we have href attributes with values that are relative paths. We know this because the value starts with a forward slash “/” showing a path to where Apple’s server stores the file. This is good information, but we have more work to do.
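
The rule of thumb from the last two paragraphs can be sketched as a small check. The helper name needs_local_copy is our own. Note that a value without a leading slash, such as styles/main.css, is also a relative path, while values starting with // or using a scheme such as data: already say where their content comes from.

```python
from urllib.parse import urlparse


def needs_local_copy(href: str) -> bool:
    """True when the reference has no scheme and no host of its own."""
    parsed = urlparse(href)
    return parsed.scheme == "" and parsed.netloc == ""


# Hypothetical values, for illustration only.
examples = [
    "/v/home/styles.css",                     # relative path: needs a local copy
    "styles/main.css",                        # relative path: needs a local copy
    "https://my-cdn-assets-puppy-image.com",  # full URL: leave as-is
    "//cdn.example.com/lib.js",               # protocol-relative URL: leave as-is
]
to_download = [href for href in examples if needs_local_copy(href)]
```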

How would we access these files if they were located in directories on an Apple server? We do not have access to Apple servers unless we try to hack them, but that would create a lot of sunk time. Our client needs fixes soon, so let us see what we can get with minimal effort. Let us look at our browser’s Network tab.

You can access the Network tab by right-clicking anywhere on your current webpage and clicking “Inspect” (labeled “Inspect Element” in some browsers). That should pull up the developer tools, where you navigate from the “Elements” tab to the “Network” tab:

Figure 8. Network tab with activity for requesting and building apple.com

Whenever you enter a website’s URL in your address bar, the browser makes an initial request to the target server to get the HTML. This HTML has more instructions on what assets it needs and how to build the page. The Network tab keeps track of any calls made for assets and displays the asset name, status of the request, and request duration.

We can use this tab to see all the outgoing calls for assets and then get copies of these assets for our page. To do this, click the clear button (the circle with a slash through it) next to the red record dot towards the top of the Network tab. This clears the table of all the assets loaded so far.

Once this is complete, refresh the page and you will see all the assets load in the table. This includes the HTML for the page, which will be listed either under the name of the website, “www.apple.com”, or as “index.html”. If you click on that entry and look at the “Response” tab, you should have the HTML you need for the page structure.

Past this point is where it can get tedious. Our Network tab now has ALL the assets, whether from a CDN or the site’s server, but we only need to grab the ones from the site’s server. There is no efficient way to retrieve all of these files from here, and to get the page to work locally we will also need to modify the href or src value that references each asset.
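
One way to shorten the list is to let a script do the filtering. This sketch assumes you have exported the Network tab's contents as a HAR file (Chrome and Firefox both offer a "Save all as HAR" option), which is a JSON document with one entry per request. The helper names and the file name network.har are our own.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def load_har(path: str) -> dict:
    """Read a HAR export from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def same_host_urls(har: dict, site_host: str) -> list:
    """List the unique request URLs that were served by `site_host`."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if urlparse(url).hostname == site_host and url not in urls:
            urls.append(url)
    return urls


# Example, assuming a HAR export saved as network.har:
# for url in same_host_urls(load_har("network.har"), "www.apple.com"):
#     print(url)
```

The result is the set of files we still have to download by hand or by script; requests that went to other hosts can keep their original URLs.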

We do not want to use the existing file structures for references, and can instead create a few directories for our assets within VS Code. Our directory structure should look like this:

Figure 9. Directory structure for our clone webpage.

Create an assets directory with three subdirectories to contain our image files, JavaScript scripts, and CSS stylesheets. After this, we need to download our images, styles, and scripts from the network tab to our local project and update the paths for referencing these assets.
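
The bookkeeping in this last step can be sketched in a few functions. The subdirectory names (css, js, images) are our guess at a reasonable layout, since the article only specifies one folder each for images, scripts, and stylesheets, and all function names are hypothetical. The sketch files assets by extension and uses plain string replacement, so two assets with the same file name would collide.

```python
from pathlib import Path, PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed layout: assets/css, assets/js, assets/images.
SUBDIR_BY_EXTENSION = {
    ".css": "css",
    ".js": "js",
    ".png": "images", ".jpg": "images", ".jpeg": "images",
    ".gif": "images", ".svg": "images", ".webp": "images",
}


def create_asset_dirs(project_root: str) -> None:
    """Create the assets directory and its three subdirectories."""
    for subdir in sorted(set(SUBDIR_BY_EXTENSION.values())):
        Path(project_root, "assets", subdir).mkdir(parents=True, exist_ok=True)


def local_path_for(asset_url: str) -> Optional[str]:
    """Return the local path for an asset, or None for unknown file types."""
    name = PurePosixPath(urlparse(asset_url).path).name
    subdir = SUBDIR_BY_EXTENSION.get(PurePosixPath(name).suffix.lower())
    if not name or subdir is None:
        return None
    return f"assets/{subdir}/{name}"


def rewrite_references(html: str, original_paths: list) -> str:
    """Replace each double-quoted original path with its local path."""
    for original in original_paths:
        local = local_path_for(original)
        if local is not None:
            html = html.replace(f'"{original}"', f'"{local}"')
    return html
```

After saving each downloaded file under the path local_path_for returns, running rewrite_references over apple.html points the page at the local copies.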

Our Final (Functional) Website

If you have followed all these instructions, you should have a local clone of your target website. If you know what you are doing and how to troubleshoot missing images, broken styles, and broken functionality (target server assets vs. CDNs), this process is quick and easy. Next time a demanding but flaky client throws you a time-sensitive project without sharing the source code, you are ready to succeed anyway.
