I essentially want to spider my local site and create a list of all the titles and URLs as in:
http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages
I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.
Note: I need to actually spider the files since the title may be set in code rather than markup.
-
Ok, I'm not familiar with Windows, but to point you in the right direction: use an XSLT transformation with
<xsl:value-of select="/html/head/title" /> in it to get the title back, or, if you can, just use the XPath '/html/head/title' directly.
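For example, here's a minimal sketch of that XPath approach in Python, assuming lxml is installed; 'Default.aspx' here is just an illustrative name for a saved copy of a page:

from lxml import html

# parse the page and evaluate the XPath against it
doc = html.parse('Default.aspx')
titles = doc.xpath('/html/head/title/text()')
print(titles[0].strip() if titles else '')

As the question notes, though, any static approach like this only finds titles that are present in the markup, not ones set in code.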
-
A quick and dirty Cygwin Bash script which does the job:
#!/bin/bash
for file in $(find $WWWROOT -iname \*.aspx); do
    echo -en "$file" '\t'
    # join the file into one line so <title> and </title> can't be
    # split across lines, then keep only the text between the tags
    tr '\n' ' ' < "$file" | sed 's/.*<title>\([^<]*\)<\/title>.*/\1/'
    echo
done

Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces all newlines with spaces so that there are no newlines between the <title> and </title>, and then grabs out the text between those tags.

Larsenal : This doesn't seem to quite work. What am I doing wrong?
I think a script similar to what Adam Rosenfield suggested is what you want, but if you want the actual URLs, try using wget. With the appropriate options, it will print out a list of all the pages on your site (plus download them, which you may be able to suppress with --spider). The wget program is available through the normal Cygwin installer.

Dustin : Yeah, that is what I was trying to get working to post here! Here's a snippet:

site=mysite.com
wget --recursive --accept \*.html http://$site
for file in $( find $site -name '*.html' ); do
    # adam's for-body
done
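To sketch what that for-body could do once the pages are on disk, here is a Python version that walks the mirror directory wget created and prints each file with its title (the 'mysite.com' directory name comes from the snippet above; the stdlib HTMLParser is assumed to be adequate for these pages):

import os
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text between <title> and </title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

# walk the mirror that wget produced and print file <tab> title
for root, dirs, files in os.walk('mysite.com'):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        parser = TitleParser()
        with open(path, encoding='utf-8', errors='replace') as f:
            parser.feed(f.read())
        print(path + '\t' + parser.title.strip())

-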
I would use wget as detailed above. Be sure you don't have any spider traps on your site (pages whose links generate an endless chain of further pages, which will keep a recursive crawler going forever).
-
You should consider using the Scrapy shell.
Check out
http://doc.scrapy.org/intro/tutorial.html
In the console, put something like this:
hxs.x('/html/head/title/text()').extract()
If you want all the titles, you should write a spider... it's really easy.
Also consider moving to Linux :P
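If you do write a spider, a minimal sketch might look like the following. Note this uses the newer scrapy.Spider API rather than the older shell selector shown above, and the start URL is only illustrative:

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titles'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/mySite/Default.aspx']

    def parse(self, response):
        # emit this page's URL and title
        title = response.xpath('/html/head/title/text()').get(default='')
        yield {'url': response.url, 'title': title.strip()}
        # queue up every link on the page; Scrapy skips duplicates
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)

You could then run it with something like scrapy runspider title_spider.py -o titles.csv.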