Watching the web-logs on all of my servers in real time

I have a computer sitting on my desk that is always on (it’s my file server), and it has a monitor attached that is almost never in use (I ssh to that server when I want to do things, so it’s hardly ever logged in).

I thought it would be cool if that monitor showed the web-logs from all of the systems I manage, so I could keep an eye on things and maybe learn a thing or two about my web-sites and how people are using them.

So the first thing I did was write a script to grab any given web log:

root@orac:~# cat /root/get-web-log.sh
#!/bin/bash
# Usage: get-web-log.sh <host> <remote-log-file> <local-name>
echo Starting download of $3...
while : ; do
  # run ssh as the jj5 user (rather than root), stream the remote log,
  # keep an unfiltered copy in /var/log/web.log, and filter the common
  # crawlers out of the per-site log
  su -c "ssh $1 tail -f /var/log/apache2/$2 < /dev/null" jj5 \
    | tee -a /var/log/web.log \
    | grep --line-buffered -v "Mozilla.5.0 .compatible. Googlebot.2.1. .http...www.google.com.bot.html." \
    | grep --line-buffered -v "Baiduspider...http...www.baidu.com.search.spider.htm." \
    | grep --line-buffered -v "Mozilla.5.0 .compatible. Baiduspider.2.0. .http...www.baidu.com.search.spider.html." \
    | grep --line-buffered -v "Mozilla.5.0 .compatible. Exabot.3.0. .http...www.exabot.com.go.robot." \
    | grep --line-buffered -v "Mozilla.5.0 .compatible. YandexBot.3.0. .http...yandex.com.bots." \
    > /var/log/web/$3
  # if the connection drops, wait a minute and reconnect
  sleep 60
  echo; echo; echo Restarting download of $3...; echo; echo;
done

Then I wrote a series of scripts which call the get-web-log.sh script for specific web-sites on specific servers, e.g.:

root@orac:~# cat /root/web-log/get-jsphp.co
#!/bin/bash
/root/get-web-log.sh honesty www.jsphp.co-access.log jsphp.co
exit

Then I wrote a main script, rather unoriginally called info.sh, that kicks off the web-log downloads and then watches them as they come through:

root@orac:~# cat /root/info.sh
#!/bin/bash

# disable the screensaver
setterm -blank 0 -powersave off -powerdown 0

# start downloading the web-logs
cd /root/web-log
./get-jsphp.co &
sleep 1
#...all the other downloaders, one for each site

# watch the web-logs
cd /var/log/web
tail -f *

# stop downloading the web-logs
kill %1
#...all the other kills, one for each downloader

exit

Then I edited /etc/init/tty1.conf so that instead of presenting a login console on tty1, the system automatically runs my info.sh script:

root@orac:~# cat /etc/init/tty1.conf
# tty1 - getty
#
# This service maintains a getty on tty1 from the point the system is
# started until it is shut down again.

start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]

respawn
#exec /sbin/getty -8 38400 tty1
exec /root/info.sh < /dev/tty1 > /dev/tty1 2>&1

And that was it. The only trick was that I needed to disable the screen saver (as shown in the info.sh script) so that the screen didn’t constantly blank.

And now I can watch the web activity on all of my sites in real time.

Web page HTML/CSS/JavaScript file size

I found this article (Some Guidelines for Determining Web Page and File Size) today which talks about the average size of HTML and other files on the web. According to the article (and I’m not clear how they got their data) the average HTML file is 25k, JPEG 11.9k, GIF 2.9k, PNG 14.5k, SWF 32k, external scripts 11.2k and external CSS 17k, with the average total size of a web page being 130k. Interesting stuff. Particularly the claim that external scripts are typically 11.2k, given that jQuery alone is 90k.

I’m really struggling with a design decision at the moment: I’m not sure whether it’s better to embed CSS/JavaScript content in the page or to link it. If you link it then the client has to send extra HTTP requests (at least two) to get the content, which is overhead and takes time. On the other hand, if your users are returning visitors then they might already have the linked files in their cache, meaning they don’t need to send extra HTTP requests, or if they do, those requests might not need to return any content (a conditional request can come back as a 304 Not Modified with no body). But then maybe a browser will cache a file when it shouldn’t (this can be avoided with good design), or maybe the user’s connection will fail while loading the linked files and they’ll see an unstyled page in their browser.

So many pros and cons, and it’s all hypothetical… what I really need is data. Anyway, I don’t have data, nor do I really have the tools to get it. So given that I have to fly blind, here’s my plan:

When I’m processing a request for a user who doesn’t have a browser cookie set, I will embed the CSS and JavaScript in the HTML. This is because if their browser cookie isn’t set then this is their first request to my web-site, maybe ever, or maybe just in a while. Either way, it’s probably safe to assume they’re a first-time visitor, so they won’t have any content in their cache and would need to send additional requests for the linked files. So I can save those additional requests and hopefully make my web pages load faster for users who are probably one-off visitors.

But for regular users, having to download the same content over and over in every request gets old fast. The linked files can be about half the size of the page, so embedding roughly doubles the size of each transfer (using the article’s averages as a rough guide: 25k of HTML plus 11.2k of scripts and 17k of CSS comes to about 53k embedded, versus a 25k HTML document when those files are linked). So when I’m processing a request and the user’s browser cookie is already set, I’ll assume they’re a regular visitor and link my JavaScript files rather than embedding them. I’ll still embed the CSS content though, because my CSS is relatively small and I want to avoid errors where the page loads but the styles don’t.

Then I’ll make the system configurable so users can change the link/embed settings for CSS and JavaScript if they’re not happy with the defaults. Power users can use this feature to turn on linking for all content so pages load as fast as possible for them.
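
To make that plan concrete, here’s a rough sketch in TypeScript of the kind of decision logic I have in mind. The names in it (renderHead, hasCookie, the preference flags, /css/site.css and /js/site.js) are placeholders for illustration only, not my actual code:

// Sketch only: choose whether to embed or link the CSS and JavaScript
// for a single request. Defaults: no cookie means embed everything;
// cookie set means link the JavaScript but keep the CSS embedded.
// A per-user preference can override either default.

type AssetPrefs = {
  linkCss?: boolean;  // user override: link the CSS instead of embedding it
  linkJs?: boolean;   // user override: link the JavaScript instead of embedding it
};

function renderHead(
  hasCookie: boolean,   // did the request carry our browser cookie?
  cssContent: string,   // the page CSS, available for embedding
  jsContent: string,    // the page JavaScript, available for embedding
  prefs: AssetPrefs = {}
): string {
  const linkJs = prefs.linkJs ?? hasCookie;  // default: link JS for return visitors
  const linkCss = prefs.linkCss ?? false;    // default: always embed CSS

  const css = linkCss
    ? '<link rel="stylesheet" href="/css/site.css">'
    : '<style>' + cssContent + '</style>';

  const js = linkJs
    ? '<script src="/js/site.js"></script>'
    : '<script>' + jsContent + '</script>';

  return '<head>' + css + js + '</head>';
}

With those defaults, a first-time visitor (no cookie) gets one larger HTML response with everything inlined, while a returning visitor gets a smaller page plus a single cacheable request for the JavaScript file.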