Getting the raw HTML of a website can be helpful when you want to search for specific text or are building a web scraping application. It can also be handy if you need to set up some kind of automation that interacts with a website, such as downloading a file. The .NET Framework's WebClient class includes methods for getting the source of a web page.
The code below defines a url variable and then downloads the HTML into a second variable:
# Create a WebClient object
$webClient = New-Object System.Net.WebClient
# The URL we want to get the HTML for
$url = "http://www.blogs.rememberwhens.com"
# Download the raw HTML
$rawHTML = $webClient.DownloadString($url)
One thing to note: be sure to include the "http://" in the URL, or the method will not treat the string as a web address and will throw an error instead of downloading anything.
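To see this behavior for yourself, you can wrap the call in a try/catch. A minimal sketch, using a hypothetical address with the scheme left off:

```powershell
$webClient = New-Object System.Net.WebClient
try {
    # No "http://" prefix, so the string is not treated as a web
    # address and the call throws instead of making a request
    $webClient.DownloadString("www.example.com") | Out-Null
    $downloadFailed = $false
}
catch {
    $downloadFailed = $true
    Write-Host "Download failed: $($_.Exception.Message)"
}
```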
Now that you have the HTML, you can use the Select-String cmdlet to pull out the pieces of content you're after. More on this cmdlet, and on using the .NET Regex class in PowerShell, is coming in a future article.
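As a quick taste of Select-String, the sketch below pulls link targets out of a page. The HTML here is a hard-coded sample standing in for a real download, and the pattern only handles single-quoted href attributes:

```powershell
# Sample HTML in place of a real $webClient.DownloadString() result
$rawHTML = "<html><head><title>My Blog</title></head><body><a href='post1.html'>Post 1</a></body></html>"

# Select-String returns MatchInfo objects; -AllMatches captures every hit,
# and Groups[1] holds the text captured inside the parentheses
$links = ($rawHTML | Select-String -Pattern 'href=''([^'']+)''' -AllMatches).Matches |
    ForEach-Object { $_.Groups[1].Value }

$links
```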
For now, consider the following, which uses a regular expression to extract the title from the HTML:
# Define the regular expression pattern; the (?i:...) groups make the
# tag names case-insensitive, and titleContent is a named capture group
$reg = "<(?i:title)>(?<titleContent>[\s\S]*?)</(?i:title)>"
# Match the content between the title tags
$groups = [System.Text.RegularExpressions.Regex]::Match($rawHTML, $reg).Groups
# Get the match in the titleContent group, which will be the title
$title = $groups["titleContent"].Value
Note: You can pipe the raw HTML to the Out-File cmdlet if you want to save it to a local file:
$webClient.DownloadString($url) | Out-File -FilePath C:\WebBackup\SiteHTML.txt
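For the file-download automation mentioned at the start of this article, WebClient also exposes a DownloadFile method that writes the response straight to disk instead of into a string. A minimal sketch; the URL and target path here are placeholders to substitute with your own:

```powershell
$webClient = New-Object System.Net.WebClient
# Hypothetical remote file and local destination
$fileUrl  = "http://www.example.com/report.pdf"
$savePath = "C:\WebBackup\report.pdf"
# DownloadFile writes the response body directly to the given file
$webClient.DownloadFile($fileUrl, $savePath)
```

This avoids holding the whole download in memory as a string, which matters for binaries and large files.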