Getting the raw HTML of a website can be helpful when you want to search for specific text or are building a web scraping application. It can also be handy if you need to set up some kind of automation that interacts with a website, such as downloading a file. The .NET Framework's WebClient class includes methods for getting the source of a web page.
The code below defines a url variable and then downloads the HTML into a second variable:
# Create a WebClient object
$webClient = New-Object System.Net.WebClient
# The URL we want to get the HTML for
$url = "http://www.blogs.rememberwhens.com"
# Download the raw HTML
$rawHTML = $webClient.DownloadString($url)
One thing to note: be sure to include the "http://" in the URL, or the method will not treat the string as a web address and will throw an error instead of downloading anything.
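To see this behavior for yourself, you can wrap the call in a try/catch. A minimal sketch, using a hypothetical address with the scheme left off:

```powershell
$webClient = New-Object System.Net.WebClient
try {
    # No "http://" prefix, so the string is not treated as a web
    # address and the call throws instead of making a request
    $webClient.DownloadString("www.example.com") | Out-Null
    $downloadFailed = $false
}
catch {
    $downloadFailed = $true
    Write-Host "Download failed: $($_.Exception.Message)"
}
```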
Now that you have the HTML, you can use the Select-String cmdlet to pull out the pieces of content you're after. More on this cmdlet, and on using the .NET Regex class in PowerShell, is coming in a future article.
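As a quick taste of Select-String, the sketch below pulls link targets out of a page. The HTML here is a hard-coded sample standing in for a real download, and the pattern only handles single-quoted href attributes:

```powershell
# Sample HTML in place of a real $webClient.DownloadString() result
$rawHTML = "<html><head><title>My Blog</title></head><body><a href='post1.html'>Post 1</a></body></html>"

# Select-String returns MatchInfo objects; -AllMatches captures every hit,
# and Groups[1] holds the text captured inside the parentheses
$links = ($rawHTML | Select-String -Pattern 'href=''([^'']+)''' -AllMatches).Matches |
    ForEach-Object { $_.Groups[1].Value }

$links
```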
For now, consider the following, which uses a regular expression to extract the title from the HTML:
# Define the regular expression pattern; the (?i:...) groups make the
# tag names case-insensitive, and titleContent is a named capture group
$reg = "<(?i:title)>(?<titleContent>[\s\S]*?)</(?i:title)>"
# Match the content between the title tags
$groups = [System.Text.RegularExpressions.Regex]::Match($rawHTML, $reg).Groups
# Get the match in the titleContent group, which will be the title
$title = $groups["titleContent"].Value
Note: You can pipe the raw HTML to the Out-File cmdlet if you want to save it to a local file:
$webClient.DownloadString($url) | Out-File -FilePath C:\WebBackup\SiteHTML.txt
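For the file-download automation mentioned at the start of this article, WebClient also exposes a DownloadFile method that writes the response straight to disk instead of into a string. A minimal sketch; the URL and target path here are placeholders to substitute with your own:

```powershell
$webClient = New-Object System.Net.WebClient
# Hypothetical remote file and local destination
$fileUrl  = "http://www.example.com/report.pdf"
$savePath = "C:\WebBackup\report.pdf"
# DownloadFile writes the response body directly to the given file
$webClient.DownloadFile($fileUrl, $savePath)
```

This avoids holding the whole download in memory as a string, which matters for binaries and large files.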