Cweb

From ProgClub
This is a draft document, very much a work in progress. See the [[Talk:Cweb|talk page]] for notes and caveats. See [[Projects]] for other projects.
== Project status ==
In the planning phase.
== Contributors ==
Members who have contributed to this project. Newest on top.
* [[User:John|John]]
All contributors have agreed to the terms of the [[ProgClub:Copyrights#ProgClub_projects|Contributor License Agreement]]. This excludes any upstream contributors, who may have their own administrative frameworks.
== Copyright ==
Copyright 2011, [[Cweb#Contributors|Contributors]]. Licensed under the [[GPL]].
== Source code ==
Subversion project isn't configured yet.
== Links ==
* [https://github.com/deoxxa/mitsukeru mitsukeru] on [http://www.reddit.com/r/programming/comments/krzys/show_proggit_im_working_on_a_search_engine_does/ reddit]
== TODO ==
Things to do, in rough order of priority:
* Create project in svn
== Done ==
Stuff that's done. Latest on top.
* [[User:John|JE]] 2011-08-08: started documentation
== System design ==
Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is short for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT on my home router to my LAMP box.
Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged-in to their cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.
Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.
There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on [http://www.progsoc.org/wiki/Morpheus Morpheus] for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:
* Nominating and distributing the blacklist
* Nominating and distributing Cweb Providers (name and 32-bit ID)
* Nominating and distributing Domain IDs (name and 32-bit ID)
* Nominating and distributing URL IDs (URL and 64-bit ID)
* Nominating and distributing Query IDs (string and 64-bit ID)
* Coordinating the Word database
Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.
Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.
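The forwarding rule can be sketched directly. This is an illustrative sketch in Python (the project itself targets PHP), and the `handle` and `forward` callables are placeholders, not part of the design:

```python
import random

def handle_or_forward(query, handle, forward):
    # Pick a number between 1 and 10, as the design describes.
    roll = random.randint(1, 10)
    if roll < 5:
        # 1, 2, 3 or 4: handle the query locally (probability 4/10).
        return handle(query)
    # 5 through 10: forward to another cweb site (probability 6/10).
    return forward(query)
```

Since each site in the chain independently forwards with probability 0.6, a query passes through 1/0.4 = 2.5 sites on average before being handled, which is why a site receiving a request cannot tell whether its sender originated it.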
Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, apply the CSS, and run the JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).
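The white-space rule is simple enough to pin down precisely; a minimal sketch, in Python for illustration (the real pipeline would run in PHP, after CSS/JS evaluation has determined which text is visible):

```python
import re

def normalise_whitespace(text):
    # Collapse every run of white space (spaces, tabs, newlines)
    # to a single space character, as the index data requires.
    return re.sub(r'\s+', ' ', text).strip()
```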
Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trustworthy meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link [http://www.progclub.org/wiki/Cweb Cweb], then there's probably a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of meta data into the index.
We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.
We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:
* Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
* Is the word punctuation?
* Is the word a number, and if so, what is its value?
* Is the word a plural, and if so, what is the root word?
* Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
* Does the word have synonyms, and if so, what are they?
* Is the word a proper name?
* Is the word an email address, or a URL?
* What languages is the UTF-8 string a word in?
* What senses does the word have?
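The sub-word rule in the first bullet looks amenable to the algorithmic treatment the list suggests: split the token into runs of alphanumerics and runs of punctuation, then take every contiguous join of those runs except the whole token. A sketch, in Python for illustration (the project itself targets PHP, and these function names are mine, not from the design):

```python
import re

def subwords(word):
    # Split into runs of alphanumerics and runs of punctuation,
    # e.g. "1,234" -> ["1", ",", "234"].
    runs = re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', word)
    out = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs) + 1):
            piece = ''.join(runs[i:j])
            if piece != word:  # the full token is not its own sub-word
                out.append(piece)
    return out

def camel_subwords(word):
    # Case-boundary split: "JohnElliot" -> ["John", "Elliot"].
    return re.findall(r'[A-Z][a-z]+', word)
```

This reproduces the examples above exactly: "1,234" yields "1", ",", "234", "1," and ",234", and "sub-word" yields "sub", "-", "word", "sub-" and "-word".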
Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a 64-bit ID is eight bytes, so there are no space savings unless a word is longer than eight UTF-8 bytes, and most words are shorter than that (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.
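The space argument is easy to check concretely: a 64-bit ID occupies eight bytes, so an ID only beats storing the string itself once the UTF-8 encoding exceeds eight bytes. A one-line illustration (function name is mine):

```python
def id_saves_space(word):
    # A 64-bit Word ID costs 8 bytes; storing the string costs its
    # UTF-8 byte length, so the ID only wins past 8 bytes.
    return len(word.encode("utf-8")) > 8
```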
The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).
It might be a good idea to accumulate a database concerning the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", at the URL suffix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for the URL suffix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain with a particular prefix, if the HTML ID 'content' (and perhaps other synonyms, such as 'main') is discovered, then it will be assumed that content not below that HTML element is navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.
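The default heuristic described above (text below an element whose id is 'content', or a synonym such as 'main', is content; everything else is navigation text) can be sketched with a stream parser. This is an illustrative Python sketch using the standard library; the real implementation would be PHP, and void elements like a bare <br> would need extra depth handling:

```python
from html.parser import HTMLParser

class ContentSplitter(HTMLParser):
    """Split visible text into 'content' and 'navigation' bundles based
    on whether it appears below an element with a recognised content id."""

    CONTENT_IDS = {"content", "main"}  # 'main' as a synonym is an assumption

    def __init__(self):
        super().__init__()
        self.depth = 0           # current element nesting depth
        self.content_depth = None  # depth at which the content element opened
        self.content = []
        self.navigation = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        if self.content_depth is None and dict(attrs).get("id") in self.CONTENT_IDS:
            self.content_depth = self.depth

    def handle_endtag(self, tag):
        if self.content_depth is not None and self.depth == self.content_depth:
            self.content_depth = None  # leaving the content element
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        bundle = self.content if self.content_depth is not None else self.navigation
        bundle.append(text)
```

When no 'content' id is registered or discovered for a page, everything would simply land in the navigation bundle, which is where the per-domain override database comes in.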
Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.
There will be a table with the schema,
result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );
In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.
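Under this scheme, combining votes is just an average over the per-site weights for a given (query_id, url_id) pair. A minimal sketch; treating "no votes yet" as the impartial value 50 is my assumption, not something the design states:

```python
def result_weight(votes):
    """Average the per-site weights for one (query_id, url_id) pair.

    'votes' maps cweb_id -> weight in [0, 100], where 50 is impartial,
    values below 50 vote the URL down and values above 50 vote it up.
    """
    if not votes:
        return 50.0  # no votes recorded: assume impartial
    return sum(votes.values()) / len(votes)
```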
Other tables in the system will be,
cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );
There will be the following set of tables too,
cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source ( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );
The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.
There will also be the following tables,
http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text,
  navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );
The master database will have the following additional tables,
cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );
URLs will be partitioned into blocks of 1000 URLs. The first block will be URL IDs 1 to 999 (there is no URL ID 0), the second block will be URLs 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.
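The block arithmetic above can be made concrete; note that the first block is irregular because URL IDs start at 1. An illustrative Python sketch:

```python
def block_id(url_id):
    """Map a URL ID to the block that contains it.

    Blocks hold 1000 URLs each and are named by their first ID, except
    the first block, which is named 1 and covers IDs 1-999.
    """
    if url_id < 1000:
        return 1
    return (url_id // 1000) * 1000
```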
Cweb sites that are not the master will have the following tables,

query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the cweb site handling it will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up to date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing URL IDs up to 5999. In this case each cweb site would need an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs to contact to search the entire index, so it would need a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks it already had whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It's then missing cweb_query records for blocks 3000, 4000 and 5000, so it will contact the master server with the list of blocks it's missing results for. The master server will look at the request, find suitable servers to contact for each block, and might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 3000, the query will go to cweb 1, up to 100 times; for block 4000, the query will go to cweb 2, up to 100 times; and for block 5000, the query will go to cweb 3, up to 50 times. The third cell in each response is the quota: a cweb will record it, and decrement that cweb_query.quota by one each time it contacts a cweb to handle a request.

When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish, for each block, which cweb to send index search requests to and how many times it can do so before it needs to contact the master server again for a cweb_query allocation. The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.
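Putting the worked example into code clarifies the bookkeeping. The sketch below is illustrative (Python rather than PHP, and `request_allocations` is a stand-in for the round-trip to the master server, not a name from the design):

```python
def plan_query(cweb_query, last_block_id, request_allocations):
    """Work out which cweb to contact for each block, using local
    cweb_query records first and asking the master for the rest.

    cweb_query: dict block_id -> [cweb_id, quota]; a record is deleted
    when its quota reaches zero, forcing a fresh allocation next time.
    request_allocations(missing) returns (block, cweb_id, quota) tuples.
    """
    blocks = [1] + list(range(1000, last_block_id + 1, 1000))
    missing = [b for b in blocks if b not in cweb_query]
    # Ask the master only for blocks we have no live allocation for.
    for block, cweb_id, quota in request_allocations(missing):
        cweb_query[block] = [cweb_id, quota]
    plan = {}
    for block in blocks:
        cweb_id, quota = cweb_query[block]
        plan[block] = cweb_id
        quota -= 1  # one request consumes one unit of the quota
        if quota == 0:
            del cweb_query[block]
        else:
            cweb_query[block][1] = quota
    return plan
```

Feeding in the example state ({ ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) } locally, with the master supplying allocations for 3000, 4000 and 5000) yields a contact plan covering all six blocks, with each local quota decremented by one.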

Latest revision as of 15:08, 5 July 2012

This is a draft document, very much a work in progress. See the talk page for notes and caveats. See Projects for other projects.

Project status

In the planning phase.

Contributors

Members who have contributed to this project. Newest on top.

All contributors have agreed to the terms of the Contributor License Agreement. This excludes any upstream contributors who tend to have different administrative frameworks.

Copyright

Copyright 2011, Contributors. Licensed under the GPL.

Source code

Subversion project isn't configured yet.

Links

TODO

Things to do, in rough order of priority:

  • Create project in svn

Done

Stuff that's done. Latest on top.

  • JE 2011-08-08: started documentation

System design

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to setup my 64-bit LAMP server to host /blackbrick-cweb, and maybe setup an appropriate NAT on my home router to my LAMP box.

Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged-in to their cweb site in order to submit usefull/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.

There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on Morpheus for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

  • Nominating and distributing the blacklist
  • Nominating and distributing Cweb Providers (name and 32-bit ID)
  • Nominating and distributing Domain IDs (name and 32-bit ID)
  • Nominating and distributing URL IDs (URL and 64-bit ID)
  • Nominating and distributing Query IDs (string and 64-bit ID)
  • Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.

Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.

Cweb will have a HTTP client, a HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, and apply the CSS and run JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web-standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).

Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trusted meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link Cweb, then there's probably a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "CWeb". Provisions should be made to incorporate this type of meta data into the index.

We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.

We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:

  • Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
  • Is the word punctuation?
  • Is the word a number, and if so what is its value?
  • Is the word a plural, and if so, what is the root word?
  • Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
  • Does the word have synonyms, and what are they?
  • Is the word a proper name?
  • Is the word an email address, or a URL?
  • What languages is the UTF-8 string a word in?
  • What senses does the word have?
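The sub-word decomposition in the first bullet point can be computed algorithmically. A sketch in Python, based on my reading of the examples above (tokenise on punctuation and camel-case boundaries, then emit each token plus each adjacent pair, excluding the whole word itself):

```python
import re

def sub_words(word):
    """Return the sub-words of a space-delimited string.

    Tokenises on runs of digits, camel-case words, lower-case runs and
    single punctuation characters; sub-words are the tokens plus each
    adjacent pair of tokens, excluding the original word itself.
    """
    tokens = re.findall(r"[0-9]+|[A-Z][a-z]*|[a-z]+|[^0-9A-Za-z]", word)
    if len(tokens) < 2:
        return []  # a single token has no sub-words
    subs = list(tokens)
    for i in range(len(tokens) - 1):
        pair = tokens[i] + tokens[i + 1]
        if pair != word:
            subs.append(pair)
    return subs
```

This reproduces the three examples given above; whether longer runs (e.g. triples of tokens) should also count as sub-words is an open question.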

Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a 64-bit ID is eight bytes, so there is no space saving unless a word is longer than eight UTF-8 bytes, and most words are shorter than that (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.

The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).
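The bundle assignment at indexing time could look something like the following Python sketch (names and the exact classification rules are assumptions; only the four bundles and their relative importance come from the text above):

```python
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def bundle_for(ancestor_tags, in_navigation=False, is_alt_text=False):
    """Pick the index bundle for a run of text.

    ancestor_tags: the HTML tags enclosing the text, outermost first.
    in_navigation: whether the text falls outside the page's content
    element (see the 'navigation text' caveat below).
    """
    if is_alt_text:
        return "alt text"
    if HEADING_TAGS & set(ancestor_tags):
        return "heading text"
    if in_navigation:
        return "navigation text"
    return "content"
```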

It might be a good idea to accumulate a database recording the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", at the URL suffix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for the URL suffix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain and prefix, but the HTML ID 'content' (or a synonym, such as 'main') is discovered in the page, then it will be assumed that content not below that HTML element is navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.
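The lookup described above can be sketched as follows (the override table, default IDs and function name are all illustrative assumptions):

```python
# Hypothetical registered overrides: (domain, URL prefix) -> content element ID.
CONTENT_ID_OVERRIDES = {
    ("www.progclub.org", "/wiki"): "content",
    ("www.progclub.org", "/blog"): "content",
}

# Well-known fallback IDs to try when no override is registered.
DEFAULT_CONTENT_IDS = ("content", "main")

def content_element_id(domain, path, ids_in_page):
    """Return the HTML element ID that holds the page's content, or None.

    Prefers a registered override for (domain, URL prefix); otherwise
    falls back to well-known IDs if one is present in the page.
    """
    for (dom, prefix), element_id in CONTENT_ID_OVERRIDES.items():
        if dom == domain and path.startswith(prefix):
            return element_id
    for candidate in DEFAULT_CONTENT_IDS:
        if candidate in ids_in_page:
            return candidate
    return None
```

Text not below the returned element would then be classified as "navigation text".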

Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.

There will be a table with the schema,

result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );

In this table 'weight' will be a number between 0 and 100. The value 50 indicates impartiality. A value below 50 indicates that cweb_id has voted the URL as less useful for the query, to some degree; the closer the value is to zero, the less useful the link is considered. A value above 50 indicates that cweb_id has voted the URL as useful, to some degree; the closer the value is to one hundred, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.
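A trivial sketch of the averaging step (the treatment of the no-votes case as impartial is my assumption, not stated above):

```python
def effective_weight(votes):
    """Average the 0-100 usefulness votes for a (query, URL) pair.

    50 is impartial; below 50 down-votes the URL, above 50 up-votes it.
    With no votes recorded, treat the URL as impartial (50).
    """
    if not votes:
        return 50.0
    return sum(votes) / len(votes)
```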

Other tables in the system will be,

cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );

There will be the following set of tables too,

cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source ( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );

The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.

There will also be the following tables,

http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text, 
  navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );

The master database will have the following additional tables,

cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );

URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999, the second block will be URL ID 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.
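The block for a given URL ID is just the ID rounded down to the nearest 1000, with IDs below 1000 falling into block 1. A sketch:

```python
BLOCK_SIZE = 1000

def block_id(url_id):
    """Return the block a URL ID belongs to.

    Blocks are named by their first URL ID: block 1 covers IDs 1-999,
    block 1000 covers 1000-1999, block 2000 covers 2000-2999, and so on.
    """
    if url_id < BLOCK_SIZE:
        return 1
    return (url_id // BLOCK_SIZE) * BLOCK_SIZE
```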

Cweb sites that are not the master will have the following tables,

query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the requested cweb site will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up-to-date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing URL IDs up to 5999. In this case each cweb site would need an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs it needed to contact to search the entire index. It would need a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It is then missing cweb_query records for blocks 3000, 4000 and 5000, so it will contact the master server with the list of blocks it's missing results for. The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 3000, the query will go to cweb 1, up to 100 times; for block 4000, the query will go to cweb 2, up to 100 times; for block 5000, the query will go to cweb 3, up to 50 times; and so on. The third cell in each response is the quota. A cweb will record the quota, and decrement cweb_query.quota by one each time it contacts a cweb to handle a request.

When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that, for example, for block 3000 it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation. The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for that user's cweb site.
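The cweb_query bookkeeping described above can be sketched as follows (class and method names are illustrative, not the real schema; the data in the usage example is taken from the worked example above):

```python
class CwebQueryTable:
    """Sketch of a non-master site's cweb_query quota bookkeeping.

    Maps block_id -> (cweb_id, quota): which cweb to contact for each
    URL block, and how many times it may be contacted before a fresh
    allocation must be requested from the master.
    """

    def __init__(self, allocations):
        # allocations: iterable of (block_id, cweb_id, quota) tuples.
        self.rows = {b: (c, q) for b, c, q in allocations}

    def missing_blocks(self, last_block_id):
        # Blocks are 1, 1000, 2000, ... up to last_block_id inclusive.
        all_blocks = [1] + list(range(1000, last_block_id + 1, 1000))
        return [b for b in all_blocks if b not in self.rows]

    def merge(self, master_response):
        # master_response: (block_id, cweb_id, quota) tuples from the master.
        for b, c, q in master_response:
            self.rows[b] = (c, q)

    def use(self, block):
        """Consume one unit of quota; return the cweb to contact.

        When the quota reaches zero the record is deleted, forcing a new
        allocation request to the master next time the block is needed.
        """
        cweb_id, quota = self.rows[block]
        quota -= 1
        if quota == 0:
            del self.rows[block]
        else:
            self.rows[block] = (cweb_id, quota)
        return cweb_id
```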