New servers and a new IT team...


mike_s:
The biggest problem with platforms like that (or the ones I mentioned) is the support costs... yikes!

They didn't have the normal CGI and direct access to Perl, but rather a funky interface that required a copy of "Mastering Regular Expressions" to write the simplest script.
 
DandyDon:
I guess it's kinda rough when we overload the ferry, and it has to be rebuilt during crossings. Couldn't be in better hands, tho...
I like the analogy Don.
Well said.
 
RonFrank:
The whole thing sounds rather off base. Why would anyone need/want/have 50~100G of WebLogs? Why would anyone attempt to store such in a DB? Why would one want to split those logs into multiple flat files? Who is going to look at 50G of TEXT data a day?

50G a day is no strain on Oracle, or a solid UNIX server, and there should be no need for two FULL-time DBAs to handle that workload. This assumes adequate design, data scrubbing and archiving, and capacity planning.

The only time we look at WebLogs is when we are having issues. A historical copy of them would not be mandated by any regulatory agency, and why would anyone want to parse them into different files, and keep them around?

If these things were getting generated by the application, the simple solution is to stop doing that! If they are useful for debugging, then put in a verbose option, and leave it off unless you need it.

In any event, I think I see the problem here. It has little to do with technology and more to do with design and the appropriate use of technology. We use Oracle. We use a LOT of things (including Linux and MySQL). Ironically, people have all sorts of beliefs about what is best, when in fact these are just tools, and one chooses what is best for the needs of the application and its users.

:lol: :lol: :lol:

Sorry, I find this a little hilarious. We generate about 3 terabytes or so of logfiles a day across probably 4,000 webservers (the designation has grown a little debatable) and another ~25,000 servers that do various middleware functions. The log storage system has about 100T of storage.

There is a massive amount of post-processing that occurs on the backend to sort and munge the logs, and a large number of different products are produced from the weblogs. The software is large enough that it constantly throws errors at a certain rate, and all of that is trended. It is also useful as a first cut at hit rates per page, velocity of sales, and all kinds of other garbage (the databases could be used for some of this, but for data crunching it's better not to compete with the transactional load in the daytime).

Plus, when you find an issue you typically want to look historically in the logs to see when the issue started to occur (or to try to correlate the start of an error in the logs with the start of an effect on the website).
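The "first cut at hit rates per page" part is not rocket science; here is a minimal sketch in Python, assuming a combined-style access log on stdin (the field layout and the top-20 report are illustrative assumptions, not the actual pipeline):

    # Count hits per page from combined-style access logs on stdin.
    # Assumes the request is the quoted field: "GET /path HTTP/1.1".
    import sys
    from collections import Counter

    hits = Counter()
    for line in sys.stdin:
        try:
            request = line.split('"')[1]      # the quoted request field
            path = request.split()[1]         # just the path
        except IndexError:
            continue                          # malformed line, skip it
        hits[path.split('?')[0]] += 1         # fold query strings together

    for path, count in hits.most_common(20):
        print(f"{count:10d}  {path}")

Something like "gzip -dc access_log.gz | python hit_rates.py" gets you a rough top-pages report; the real system obviously does far more than this.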

Web Monkey has obviously worked on a site which actually has some traffic, complexity and analysis to deal with...
 
lamont:
:lol: :lol: :lol:

It is also useful as a first cut at hit rates per page, velocity of sales, and all kinds of other garbage (the databases could be used for some of this, but for data crunching it's better not to compete with the transactional load in the daytime).

With all the terabytes of storage, don't they mirror the logs so you can run ad hoc queries against the mirror and not affect production transactions?
 
Gilless:
With all the terabytes of storage, don't they mirror the logs so you can run ad hoc queries against the mirror and not affect production transactions?

It's the production databases (typically Oracle, but more and more MySQL) whose transactional load you don't want to interfere with in the middle of the day, since brownouts there are visible to external customers. The logfiles get hammered on all the time, and brownouts there just annoy internal customers... and the logfile buckets are striped over multiple servers (although not replicated, since losing logfiles is generally just inconvenient), so dealing with brownouts is usually a matter of better 'scheduling' different teams' buckets to different servers to keep them from stepping on each other...
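The 'scheduling' itself can be as dumb as a greedy bin-packing pass; here is a minimal sketch in Python (the team names, daily sizes, and server names are made up for illustration):

    # Greedy assignment: put each team's log bucket on the currently
    # least-loaded server so heavy teams don't pile onto the same box.
    import heapq

    def schedule(buckets, servers):
        """buckets: {team: daily_gb}, servers: list of hostnames."""
        heap = [(0, s) for s in servers]       # (assigned GB, server)
        heapq.heapify(heap)
        placement = {}
        for team, gb in sorted(buckets.items(), key=lambda kv: -kv[1]):
            load, server = heapq.heappop(heap)
            placement[team] = server
            heapq.heappush(heap, (load + gb, server))
        return placement

    print(schedule({"web": 900, "middleware": 2100, "billing": 300},
                   ["logs01", "logs02"]))

Heavier buckets get placed first onto whichever log server currently has the least assigned, which is usually enough to keep two big teams off the same box.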
 
Wow!!!
Thank God I went down the Cisco path...Internetworking forever.
 
RonFrank:
The whole thing sounds rather off base. Why would anyone need/want/have 50~100G of WebLogs?
In order to bill the customers.

RonFrank:
Why would anyone attempt to store such in a DB?
That's the way it was when I got there.

RonFrank:
Why would one want to split those logs into multiple flat files?
To send to the customers, so when they get a whopping great bill they can't say "That wasn't my site!"
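The split itself is trivial; here is a minimal sketch in Python, assuming a log format that records the virtual host as the first field (that layout, and identifying the customer by vhost, are assumptions for illustration, not the actual billing system):

    # Split a combined cache-server log into one flat file per customer,
    # keyed on the virtual host, assumed to be logged as the first field.
    import sys

    files = {}                                 # customer vhost -> open file handle
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue                           # blank/garbage line, skip it
        host = fields[0].split(':')[0].lower() # assumed vhost field, strip any :port
        out = files.get(host)
        if out is None:
            out = files[host] = open(host + ".log", "a")
        out.write(line)

    for out in files.values():
        out.close()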

RonFrank:
Who is going to look at 50G of TEXT data a day?
Nobody except me would want to look at them in the form that comes off the cache servers. The individual customers want to see them when they get their bill.

They generally only want to see it once, then never ask again, although we sent them along as backup for the billing anyway.

RonFrank:
50G a day is no strain on Oracle, or a solid UNIX server, and there should be no need for two FULL-time DBAs to handle that workload.
Speaking from experience, I can tell you that it was a problem, and that Oracle just couldn't keep up. It wasn't just 50GB incoming, it was import, index, query, export, drop partitions, create new partitions, allocate more disk when the volume picked up, repeat.
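The drop/create end of that daily cycle looks roughly like this; a sketch in Python that emits the Oracle range-partition DDL (the table name, partition naming, and 45-day retention window are all hypothetical, not the real billing schema):

    # Emit one day's partition rotation for a range-partitioned log table:
    # add tomorrow's partition, drop the one that aged out of retention.
    from datetime import date, timedelta

    TABLE, RETENTION_DAYS = "WEB_LOG", 45       # hypothetical name and retention

    def rotation_ddl(today=None):
        today = today or date.today()
        tomorrow = today + timedelta(days=1)
        upper_bound = today + timedelta(days=2)  # partition holds data < this date
        expired = today - timedelta(days=RETENTION_DAYS)
        yield (f"ALTER TABLE {TABLE} ADD PARTITION p{tomorrow:%Y%m%d} "
               f"VALUES LESS THAN (TO_DATE('{upper_bound:%Y-%m-%d}','YYYY-MM-DD'))")
        yield f"ALTER TABLE {TABLE} DROP PARTITION p{expired:%Y%m%d}"

    for stmt in rotation_ddl():
        print(stmt + ";")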

The 2 boxes were busy almost 24 hours a day, unless something bad happened, like when the Storage Ops manager stole the cable to the SAN for a Really Big customer and took my whole billing system down, in which case it could take a week to make up for the lost day.

The 2 DBAs were because for some reason, you can't have one individual on call 24x7x365. They tend to get a little testy after a while.

Oracle is very good for a lot of stuff, but it does have limits. It's a Swiss Army knife, not a laser.

RonFrank:
This assumes adequate design, data scrubbing and archiving, and capacity planning.
Scrubbing with what? When? It's a bunch of huge text files. You take them like they come in and deal with it. In reality there isn't anything to scrub. Every line is an HTTP (or FTP or MMS) request.

RonFrank:
The only time we look at WebLogs is when we are having issues. A historical copy of them would not be mandated by any regulatory agency, and why would anyone want to parse them into different files, and keep them around?
To feed WebTrends and give to the customers.

RonFrank:
If these things were getting generated by the application, the simple solution is to stop doing that! If they are useful for debugging, then put in a verbose option, and leave it off unless you need it.
See above.

RonFrank:
In any event, I think I see the problem here. It has little to do with technology and more to do with design and the appropriate use of technology. We use Oracle. We use a LOT of things (including Linux and MySQL). Ironically, people have all sorts of beliefs about what is best, when in fact these are just tools, and one chooses what is best for the needs of the application and its users.

That's exactly what I did. The best tool for this application wasn't a database server, it was a special-purpose app.

Terry
 
Interesting. I work for a little company called Verizon. My job for the past twelve years has been to pull switch data, first for MCI (Local, Long Distance), then for Worldcom (Dial, IP, IP VPN), and turn it into something billable. Now, at Verizon, our customers are Microsoft, Time Warner, AOL, and the like, and my life is making sure that they can in turn bill their customers, who then bill their customers, and so on down the line until it reaches sites like ScubaBoard for IP usage billing. I'm no stranger to huge-volume data on a global scale, or to ways of handling it.

I've seen just about every solution out there for aggregating, storing, and making huge volumes of data available. My only comment is that most companies (ourselves included) don't do a very good job at capacity planning, hardware allocation, and application and database design.

Oracle is a perfectly good solution for large-scale volume data processing, but it must be done well. Most places implement a poor solution on inadequate hardware that is improperly partitioned, with poor database design. Scrubbing, data backup, and archiving are done as an afterthought, and even the very good canned tools that Oracle has out of the box cannot overcome a poor implementation that was done in a rush with little thought. "Implement now and fix it in production" seems to be a sad standard.

When things crash, management blames everyone but the people who are truly responsible (look in the mirror). I've seen DB applications trashed because it's MUCH easier to blame a product than the true culprit, which is generally the people responsible for the disaster of a junk system they created, often against the advice of the development staff.

I agree that C, Perl, and C++ using hashing strategies can outperform Oracle loaders; however, it's a maintenance nightmare. This type of implementation eats up so much manpower in maintenance, support, and redesign that if the bean counters could truly SEE the beans, a quick retreat back to more maintainable SQL-driven solutions would be implemented! :lol:
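For concreteness, the kind of "hashing strategy" being traded off here is just an in-memory hash keyed on the customer instead of a load into the database; a minimal Python sketch, with the vhost-first log layout and the bytes-column position assumed for illustration:

    # Aggregate billable bytes per customer in an in-memory hash instead
    # of loading every raw log line into the database.
    import sys
    from collections import defaultdict

    bytes_by_customer = defaultdict(int)
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 11:
            continue
        customer = fields[0].split(':')[0]           # assumed vhost field
        try:
            bytes_by_customer[customer] += int(fields[10])  # assumed bytes field
        except ValueError:
            pass                                     # "-" for zero-byte responses

    for customer, total in sorted(bytes_by_customer.items()):
        print(f"{customer}\t{total}")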

In any event, this is off topic, but I think what we are doing here is agreeing. :D
 
RonFrank:
In any event, this is off topic, but I think what we are doing here is agreeing. :D

Probably.

My only counterpoint is that Oracle-done-right is generally expensive, while MySQL+Linux/FreeBSD costs only the hardware and the expertise... You don't take a $20B company and put its most important financial data onto MySQL, but it should work fine as an architecture for ScubaBoard...

Oh, another thing: one cheap way to beat Oracle is to use statically chewed data and serve it out of Apache.... you can hit 10,000+ transactions per second on $3k of hardware... Of course you can get the best of both worlds by using this as a read-only cache sitting in front of Oracle... The best solution is generally not one or the other but a mixture of both... So when you say this:

RonFrank:
I agree that C, Perl, and C++ using hashing strategies can outperform Oracle loaders; however, it's a maintenance nightmare. This type of implementation eats up so much manpower in maintenance, support, and redesign that if the bean counters could truly SEE the beans, a quick retreat back to more maintainable SQL-driven solutions would be implemented!

I disagree in that having alternatives to SQL in front of SQL is generally going to be preferable.
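To make the "statically chewed data" point concrete, here is a minimal sketch in Python of a periodic job that dumps a read-mostly query into flat JSON files under Apache's docroot (the docroot path, table, and columns are hypothetical; conn is any DB-API connection, e.g. from cx_Oracle):

    # Periodically render a read-mostly query to flat JSON files under the
    # web server's docroot; Apache serves those, and the database only sees
    # one query per refresh instead of one per page view.
    import json, os

    DOCROOT = "/var/www/cache"                  # hypothetical docroot path

    def refresh(conn):
        cur = conn.cursor()
        cur.execute("SELECT sku, name, price FROM products")   # hypothetical table
        for sku, name, price in cur:
            tmp = os.path.join(DOCROOT, f".{sku}.tmp")
            with open(tmp, "w") as f:
                json.dump({"sku": sku, "name": name, "price": float(price)}, f)
            os.rename(tmp, os.path.join(DOCROOT, f"{sku}.json"))  # atomic swap

Apache then serves those flat files at wire speed, and Oracle only gets touched when the cache is refreshed.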
 
RonFrank:
I agree that C, Perl, and C++ using hashing strategies can outperform Oracle loaders; however, it's a maintenance nightmare. This type of implementation eats up so much manpower in maintenance, support, and redesign that if the bean counters could truly SEE the beans, a quick retreat back to more maintainable SQL-driven solutions would be implemented!

Embrace the dark side. Return to the days when software was small, fast, reliable and efficient.

Everything doesn't need an Enterprise Application Platform (or whatever the current marketing term is) and some things are better off without one.

Terry
 
