Contents
Introduction
The Services in a Nutshell
Common Failures and Downtime
Basic Design Considerations
Scaling the Infrastructure
Ideas on Where to Go Next
Related Books
Important Links
Introduction
In any business environment, you'll often find yourself tasked with designing, implementing or maintaining a basic infrastructure to provide some core services to your users or customers. These services are typically Email, Web (with accompanying FTP services) and DNS. If you're an ISP, these basic services are mandatory.
You're probably convinced, and with good reason, that Sun hardware running the Solaris operating system is a solid platform for enterprise solutions. Should you rely on the built-in services that Solaris provides, or replace them with third-party software? How should such an infrastructure be designed with maximum uptime and availability in mind, suitable for a business-class environment?
This article covers these basic services and some software that you might want to run in place of the stock Solaris solutions to form an integrated, easily administered and scalable solution that can handle most common faults, reducing if not eliminating costly downtime. While each of the specific software solutions is covered in its own article, this one strives to put the implementation as a whole in perspective, providing a basic infrastructure that you can scale depending on your needs and budget.
The Services in a Nutshell
There are a few key services that are required in almost any business situation. They are primarily Email and DNS, but can also include Web serving with or without associated FTP access to Web site document trees. They could be for internal company use or for external client or customer use, as would be the case with an ISP. The kind of usage demands you expect determines how you design your infrastructure, within the limits of the budget allocated. You need to make various decisions along the way as to what is important to you: cost, availability, security, scalability, ease of management, etc.
The first service, which is almost universal, is Email. Anyone using the Internet knows how important Email is, and how annoyed users can get when it doesn't work right. Ask any system administrator about Email though, and they'll often mutter something incomprehensible about sendmail. It need not be like pulling teeth to provide fully customizable Email capabilities if you consider "Replacing Sendmail with Postfix." To complete the Email picture, you'll probably need a way to allow your users to retrieve their Email from your server(s) via the POP and/or IMAP protocols. As mentioned in the article covering Postfix, Qpopper is a nice solution that integrates and scales well with everything discussed here. Email is a service that really needs to be available as much as possible, and fortunately its queue-based nature makes simple redundancy and failover possible.
The second service is DNS, which ranges from simply doing name-to-IP lookups for your user base to hosting domains for your users. In either case, providing this service is relatively simple and straightforward using BIND. For more information on BIND and using "named" to provide this service, see "DNS for Dummies." Your particular needs will determine how available your servers have to be, and range from being up "most of the time" for day-to-day office needs to "all of the time" if you're acting as a nameserver for one or more domains.
Another common service is Web serving, whether for your company's local Intranet, a public Web site, multiple user Web sites, a development server or any combination of these. Most likely you will use Apache as your Web server of choice, and those new to Apache should read "Apache: The Basics" for more information. Web sites are another one of those services whose availability can be described as either "most of the time" for a company Intranet or "all of the time" if you have an E-commerce site that is generating revenue for your company, or if you provide hosting for many users whose needs cover both. Web site redundancy is fun, simple and highly scalable. The sky and your wallet are really the limit, but you need to know your traffic to make the call safely.
Lastly, if you provide Web services, your users may or may not require FTP access to their Web site's document root so that they can modify their content and upload new files. Your needs may be basic, or much more extensive if you have many users or users of differing classes. Consider "Upgrading to ProFTPD" for a first-class solution that is configured much like the Apache Web server, for easier maintenance. If you provide this kind of service, availability can be very important if real-time information is to be updated, but "most of the time" is often enough if only occasional or infrequent changes to content are performed, as would be the case with a personal site or company Intranet.
No matter what service(s) you provide, the more users you're supporting, the more expensive your infrastructure invariably becomes, and you will be required to have longer uptimes with higher availability as well. With your budget in mind, and the big picture in hand - you need to figure out how to package this all and make it work.
Common Failures and Downtime
Okay, so the question of which software to use seems fairly straightforward if you follow either the stock Solaris solutions or the above recommendations. What about hardware, and what kind of failures could (and should) you expect? There are several ways of looking at potential failures and how to protect yourself from them as you consider your final design.
Let's take a hardware approach to failures and what can go wrong. Assume we have a simple server hooked up to the Internet in our office, providing all of our services mentioned above. The simplest things to look at would be:
- Is your connection to the Internet redundant?
  - Between your router and the Internet?
  - What about from your server to the switch or hub?
  - Multiple network cards and connections?
- Does your infrastructure have an uninterruptible power supply (UPS)?
  - Can it handle your server? And RAID? Monitor, too?
  - What about the other accessories such as routers and switches?
  - How long is your battery life under this load?
  - When the battery dies, does your server just die with it?
- Does your server have more than one power source?
  - What about connections to different power circuits?
  - Are you paying attention to electrical requirements and limits?
  - Do you have redundant power supplies, the biggest hardware failure point?
- Does your data reside on a RAID or a single drive?
  - What happens if that single drive fails? Do you have a spare?
  - What about a backup and recovery plan? Is it current?
  - Do you have an off-site copy of your data in case of fire?
  - Is your RAID fault-tolerant or redundant? Or a simple chassis?
This is just a top-down, basic look at some of the things that can go wrong. There are more and more as you go into the details of your design. The best approach is to first determine how much uptime and availability you need. Is it an all-or-nothing, up-at-all-costs, money-is-no-object kind of situation, or can you relax certain concerns and solutions due to budget or simple lack of high-volume or importance of availability? Once you know this, you can design around your needs.
After considering what your needs will be, develop an outline like the one above, and don't forget to think on a deeper, system level - including software. What can go wrong? Don't forget Murphy's law, which is usually pretty much on the money... "Whatever can go wrong, will go wrong." If you have a wide user base, especially paying customers - chances are good that you need to focus on all sources of failure, especially those known as single points of failure which are the most troublesome of all.
A single point of failure is one key piece of the design, be it hardware or software, that can be a showstopper for one or all of your services. A simple, yet extreme example would be your very connection to the Internet. Unless you have a separate connection from a different provider with its own infrastructure and, more importantly, its own connection to the Internet backbone, a problem with your connection means that everything you're considering here, regardless of expense or care in design, will be offline and useless. This is a single point of failure that is often overlooked, or simply beyond a reasonable budget, as it is expensive to remedy, especially if you have higher bandwidth or co-location needs instead of a simple, slow connection from your home or office to the Internet. This is where you should start, and drill down to the last detail of your design.
Another approach to looking at how much availability you need and can afford is by looking at how much you stand to lose if these services aren't available. If you have an E-commerce site, this can be a very bad thing indeed, especially if it's a busy, profitable venture. You could be facing hundreds, thousands or even millions of dollars in lost revenue if your site is down and inaccessible to your paying customers. If you're an ISP, unreliable service will not only tarnish your image and cost you potential customers, since word-of-mouth is a popular method of referral, but also drive away active customers who get tired of shoddy service. Besides monetary hits to the company's bottom line, you risk losing your job at worst, or getting reamed by the boss who couldn't send his "absolutely, positively has to be there immediately" Email because the server was hosed and you don't have an answer to either what's wrong or how long it will take to fix.
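To make the effect of a single point of failure concrete, here is a minimal Python sketch of how availability combines across components. It assumes failures are independent, which real outages often are not, and the availability figures are hypothetical placeholders rather than measurements. Components that must all be up (in series) multiply their availabilities, while redundant components (in parallel) only fail when every one of them fails.

```python
def series(*availabilities):
    """Availability when every component must be up (values multiply)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability when any one redundant component is enough."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that this one is down too
    return 1.0 - all_down


# Hypothetical figures, for illustration only.
UPLINK = 0.99    # one Internet connection
SERVER = 0.999   # the server behind it

single_uplink = series(UPLINK, SERVER)
dual_uplink = series(parallel(UPLINK, UPLINK), SERVER)

print(f"one uplink:  {single_uplink:.4%}")
print(f"two uplinks: {dual_uplink:.4%}")
```

With one uplink, the whole chain can never be more available than that uplink, no matter how much is spent on the server. The second uplink only helps to the extent that the two really do fail independently, which is why a different provider with its own path to the backbone matters.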
Basic Design Considerations
Now that you know to look for things that can take your services offline, let's take a tour of various levels and costs for high-availability solutions to those threats. Is a simple solution adequate, or do you need to take it a step or two further? This is where you weigh your options against reason and budget, and against your demands both now and in the future.
Before you can even think about designing an infrastructure to provide these services, you need to establish a budget or total amount of money you're willing to spend implementing it. You'll have to consider the cost of hardware, software, connectivity, power and even salaries of those needed to manage it all. In this article, we're mainly interested in the computer side of the equation, covering all but people.
All of the services mentioned above have fairly similar, basic considerations that you need to address:
- Storage of information or user data
- Perceived speed of said services
- Availability of these services
- Disaster recovery and safety
Storage requirements can affect your infrastructure drastically. You could get by with a single IDE hard drive today, but tomorrow you might need a clustered NAS/SAN solution.
The speed that is perceived by your users is also an important design consideration. We all know that people hate to wait, especially on computers, which are "supposed to be fast." What works today might not work tomorrow, and the more efficiently, unobtrusively and cheaply you can add more power, the better.
Availability is not to be taken lightly, either. If you only serve a few files for your workgroup, it might not be a big deal if your server is offline for a while. If you're running a revenue-generating E-commerce site though, every minute, even every second that your server is down could cost you hundreds or thousands of dollars.
Disaster recovery and prevention is perhaps one of the most overlooked aspects of infrastructure design today. Backups are only part of the picture, yet this most basic of necessities is many times an afterthought! You must also consider recovery of your data in the event of a disaster and how long it will take to restore service to previous levels. Can you afford the time it takes to mount a tape, look at the index and forward the tape, finally restoring the file or files? Do you have smaller requirements where a single CD or DVD would do? Suppose you work in a development environment where files often change, and you aren't (for some reason) using a code management system. Can you handle the "Fred, can you restore this file?" question that will inevitably come up, and how long will it take? Using a Network Appliance Filer as a NAS storage solution with its "SnapRestore" capability, for example, you could restore poor old Fred's file as fast as you can type a couple of commands.
Finally, and in a paragraph all its own, concerning disaster recovery... You do have an off-site copy of all your critical data, right? What happens if your office burns to the ground? Where is your data stored? Hopefully you're not answering this with "upstairs, in my desk drawer..."
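As a rough way to put numbers on this, the following Python sketch converts an availability target into the hours of downtime it allows per year, and then into lost revenue. The revenue figure is a made-up assumption for illustration; substitute your own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


def downtime_cost(hours_down, revenue_per_hour):
    """Revenue lost while the service is unavailable."""
    return hours_down * revenue_per_hour


REVENUE_PER_HOUR = 5000.0  # hypothetical figure for a busy E-commerce site

for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    cost = downtime_cost(hours, REVENUE_PER_HOUR)
    print(f"{pct}% available: {hours:.2f} hours down per year, "
          f"about ${cost:,.0f} in lost revenue")
```

Comparing the lost revenue at each level against the price of the hardware needed to reach the next level is one simple way to decide how far to go.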
Scaling the Infrastructure
Okay, so the basic infrastructure has been designed or is in place and things are progressing. You may decide, either with plenty of notice or quite suddenly, that you need to scale to handle increased or unforeseen traffic. The sooner you know and plan for increased demands, the better off you will be when the time comes to deal with them. If you design carefully, this experience can be as simple as snapping another hard drive into the RAID or adding another generic Web server to the farm. If you don't, you may very well end up having to rip out your entire infrastructure and start over, often at quite an expense. Be smart about your design choices when weighing them against things such as budget allowances, and be prepared for the future.
In this age of the information explosion and digital data, scaling isn't just a buzzword - it's a fact of life, and whether you want to deal with it or not, it will be required at some point. If ignored, you'll be faced with resource shortages which can manifest themselves any number of ways:
- Your bandwidth is insufficient and slow
- Running out of storage space for data or logs
- Administration is too complex or time consuming
- The server chokes to a crawl during peak hours
These are all examples of scaling problems, yet they don't have to stop you dead in your tracks if you've designed your infrastructure well to handle scaling, and have left enough room to grow into, rather than cutting corners and making compromises that leave you with little or no options. There are times where you indeed get what you pay for, and while the budget for the infrastructure can be an overriding factor - don't compromise at the cost of backing you into a corner later on if you can at all help it.
Bandwidth is an expensive commodity and one which never quite seems to be enough. We all want everything immediately - right away! Fortunately, bandwidth is usually a very easy thing to scale as your needs grow, and usually involves nothing more than a call to your upstream provider to migrate from an ISDN line to a T1, or from a leased line to a T1 or beyond, to a T3. With each step comes an increased cost, and possibly slight networking infrastructure changes, as would be the case in moving from ISDN to a leased line or T1. The hardware associated with this change is usually provided and handled by your upstream provider. You'll pretty much always be handed an RJ45 cable that you plug into your switch or router, and the only thing that changes for you might be your IP addresses. Since we're talking about scaling though, you might reach a point where hosting your own infrastructure becomes too difficult or expensive if you want to provide serious redundancy and safety. This is when you consider "co-locating" your infrastructure in your service provider's data center.
RAID and NAS/SAN solutions are great for your storage requirements and their costs can range from fairly inexpensive to mind-numbingly pricey. For example, would a sub-$1000 external RAID enclosure work for your needs? How about a more mid-range NAS solution such as a Network Appliance Filer for around $20,000-40,000? If you really have a lot of data, you might be looking at an EMC, Fujitsu or other high-end SAN solution costing upwards of $250,000. Each solution has its own benefits, features and, unfortunately, cost. Where they differ most is in the way each scales to meet increased storage demands and how available that storage is to your applications and users.
Storage is usually an external (or internal) chassis in which you have several hard drives, which are then configured to provide varying degrees of redundancy and performance known as "RAID levels." Depending on the RAID level implemented, a drive failure can either completely shut you down or simply slow down disk I/O a bit. That same RAID level, and how the RAID is actually implemented (for example, whether it has "hot plug capability"), determines how easily your storage can scale. You can purchase a new drive and drop it into your enclosure for more storage space as your demands grow, for example. Planning on either having massive storage to begin with, or being able to easily add more storage on a whim later on, is an important aspect of your infrastructure's design. Try to determine how fast your storage demands will grow and plan accordingly.
Administration is something that can often get overlooked in your design. While you might only be tweaking one or two things here and there as situations arise, what happens if you need to suddenly make major changes? Can you handle it, and in a timely fashion? This topic is discussed in further detail below, in the "Ideas on Where to Go Next" section.
Server capacity is another thing that can make or break your infrastructure. You might start off the entire project by designing a new infrastructure to replace or migrate an existing one, or by designing one from scratch for a new project, company or venture. It's easy to look at your current demands and build something to handle them. As experience tells us though, remaining "current" in the computer industry is an oxymoron. That shiny new, super-expensive server you bought yesterday has already been outclassed by the next whiz-bang supercomputer to come along. Whether you can inexpensively scale your infrastructure as your needs grow, or are forced to replace it entirely or make massive changes to it, depends on how well you plan now.
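The trade-off between capacity and redundancy at a few common RAID levels can be sketched in Python as follows. This is a simplified illustration: the function name and the drive size are assumptions, only levels 0, 1, 5 and 10 are covered, and real arrays lose some additional space to formatting, hot spares and vendor overhead.

```python
def raid_summary(level, drives, drive_gb):
    """Return (usable_gb, drive_failures_always_survived) for a RAID set.

    All drives are assumed to be the same size.
    """
    if level == 0:
        # Striping only: all the space, no redundancy at all.
        if drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return drives * drive_gb, 0
    if level == 1:
        # Mirroring: every drive holds the same data.
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_gb, drives - 1
    if level == 5:
        # Striping with parity: one drive's worth of space goes to parity.
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb, 1
    if level == 10:
        # Striped mirror pairs: half the space; one failure is always
        # survivable, more only if they land in different pairs.
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return (drives // 2) * drive_gb, 1
    raise ValueError(f"RAID level {level} is not covered by this sketch")


DRIVE_GB = 36  # hypothetical drive size

for level in (0, 1, 5, 10):
    usable, survives = raid_summary(level, 4, DRIVE_GB)
    print(f"RAID {level}: {usable} GB usable, survives {survives} drive failure(s)")
```

Running the numbers for the enclosure you are considering, at the number of drive bays it offers, shows how much room to grow each level leaves you.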
Consider, for example, a Sun Enterprise 420R server. Because you are on a budget now, and have relatively light demands, you might start with a configuration such as the following:
- Enterprise 420R server
- Single 450MHz CPU
- Single power supply
- 1GB RAM
This type of server, depending on your exact application, could fit the bill perfectly today. As your number of users increases, or your services are used more heavily, you'll need to consider how to make this same server better. Why make this server better? For one thing, it's a lot cheaper than replacing it entirely with a new solution, and it's also easier in that you don't have to worry about migrating data. For the next bump in performance, consider upgrading that same server to something like the following:
- Enterprise 420R server
- Dual 450MHz CPUs
- Dual power supplies
- 2GB RAM
That same server has now basically doubled in capacity (depending on your exact application, of course) simply by adding an additional CPU and more RAM, not to mention a second power supply thrown in for a little extra redundancy. You could beef up this server even further with more CPUs and RAM, all the while never touching your data or operating system. All of these upgrades can also be performed with an absolute minimum of downtime. Of course, you could apply this same strategy and logic to an Enterprise 220R server if your budget is lower, for example. This is something to strongly consider when it comes to scaling. It needn't be a nightmare project.
Ideas on Where to Go Next
What else is there besides throwing hardware at the problem and having software scale along? How about integrating and fine-tuning your infrastructure? Another key consideration after the dust settles is automation. Remember, if you have to do it more than once - write a script! Think not about adding more complexity to the mix, but rather about how to make what you have run better for you and your users. Something like adding a user and setting up his or her Web site framework shouldn't take you or your techs an hour to complete, and if it does, maybe you need a better approach...
What is nice about the suggested replacement software in the other articles is that they all share a similar configuration schema, operate in similar ways and they can all be compartmentalized or tightly integrated with your server's operating system. Nevertheless if you have to add a new user or remove an old one, you could be facing the task of handling several configuration files and having to run multiple commands to put the changes into effect. Now think about adding 20, 50 or 100 users en masse. Sounds grim, doesn't it?
One way you can handle all these configuration files and the tasks related to their modification is through the use of a central database and several shell or Perl scripts. Again, think automation at this point, since your infrastructure is now pretty much in place - and if designed correctly, scaling it won't disrupt operations much or at all. Just how you automate various tasks depends on your needs, of course. If you have a small business environment, a simple set of shell or Perl scripts to make things easier when you have the one-off request may be all you'll ever need. If you're in a larger corporate environment, or especially an ISP, which has a very dynamic user base, you'll need to come up with something more encompassing and scalable in short order.
A great way to centralize information such as the /etc/passwd and /etc/group files, among others, used to be NIS or NIS+. These solutions had their problems, however, and newer, larger-scale solutions have arrived on the scene over the last couple of years. Two examples would be a centralized database using a typical SQL-oriented approach, or something more specific to the problem at hand like LDAP. LDAP, or "Lightweight Directory Access Protocol," is an ideal solution to this particular problem of managing users and their related information, but can be rather intimidating at first to design and set up. The benefits are worth this steep learning curve, as there are several features that make LDAP great. Built-in data replication, the ease with which redundancy can be applied, a design made for handling user information and a lightweight approach that keeps it fast all make scaling almost effortless down the road.
Databases in the form of a SQL server such as the ever-popular MySQL also make a great solution to your user management needs, and in some cases can be much more powerful and useful to you. Using Perl modules, C APIs or CGI programming, you can invent, develop and implement a practically limitless interface to your user data. You could have command-line, GUI or browser-based interfaces, or tie in automation through other scripts quite readily. Not that you cannot do this with LDAP, but using a database gives you access to a larger array of data types, and the types of middleware available are more common. Initial design and setup of a database-driven back-end is also easier for the novice than with LDAP.
Another thing to look at for the future is how to make your current infrastructure more robust. For example, if you had to hold back on going all out beefing up that new server due to budget restrictions, how about adding that second power supply or network interface for redundancy? If you only have one dedicated Web server, how about adding a second one, not only for redundancy and failover but for load-balancing as well? How about your storage? Can you enhance not only the capacity but also its availability? A single RAID might have redundancy as far as the drives go, but what about controllers? Power supplies? Network connections if it's a NAS/SAN solution? With a Network Appliance Filer, for example, you can cluster two or more together so that in the event one entire unit fails, you still have a second one that can take over instantly.
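As a minimal sketch of the central-database idea, the following example keeps users in one table and generates the per-service configuration lines from it. Python and its bundled sqlite3 module stand in here for the shell or Perl scripts and the database server you would actually use; the table layout, the home directory path, the shell and the sample users are all assumptions made for illustration.

```python
import sqlite3


def open_user_db():
    """Create an in-memory user database (a real one would live on disk)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE users (
               login     TEXT PRIMARY KEY,
               uid       INTEGER UNIQUE NOT NULL,
               gid       INTEGER NOT NULL,
               full_name TEXT NOT NULL,
               domain    TEXT NOT NULL
           )"""
    )
    return conn


def add_users(conn, rows):
    """Insert many users in a single transaction."""
    with conn:
        conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", rows)


def passwd_lines(conn):
    """Render passwd-style lines (hypothetical home directory and shell)."""
    query = "SELECT login, uid, gid, full_name FROM users ORDER BY uid"
    return [
        f"{login}:x:{uid}:{gid}:{full_name}:/export/home/{login}:/bin/sh"
        for login, uid, gid, full_name in conn.execute(query)
    ]


def mail_map_lines(conn):
    """Render 'address login' pairs, in the style of a mail virtual map."""
    query = "SELECT login, domain FROM users ORDER BY login"
    return [f"{login}@{domain} {login}" for login, domain in conn.execute(query)]


conn = open_user_db()
add_users(conn, [
    ("fred", 1001, 100, "Fred Example", "example.com"),
    ("wilma", 1002, 100, "Wilma Example", "example.com"),
])
for line in passwd_lines(conn) + mail_map_lines(conn):
    print(line)
```

Adding 20, 50 or 100 users then becomes one call to `add_users` followed by regenerating the files, rather than hand-editing each configuration file in turn.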
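To give a feel for what user data looks like in LDAP, here is a small Python sketch that renders one user as an LDIF entry using the standard `account` and `posixAccount` object classes. The directory layout (`ou=People` under a base DN), the home directory path and the sample values are assumptions, and the sketch does not escape special characters in names, which a production tool would need to do.

```python
def user_ldif(login, uid_number, gid_number, full_name, base_dn):
    """Render a single user as an LDIF entry (no special-character escaping)."""
    lines = [
        f"dn: uid={login},ou=People,{base_dn}",
        "objectClass: account",
        "objectClass: posixAccount",
        f"uid: {login}",
        f"cn: {full_name}",
        f"uidNumber: {uid_number}",
        f"gidNumber: {gid_number}",
        f"homeDirectory: /export/home/{login}",  # hypothetical path
        "loginShell: /bin/sh",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical user and base DN.
entry = user_ldif("fred", 1001, 100, "Fred Example", "dc=example,dc=com")
print(entry)
```

The attributes mirror the fields of an /etc/passwd line, which is what makes a directory a natural replacement for distributing that file.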
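The combination of load-balancing and failover across two or more Web servers can be illustrated with a short Python sketch. It is a toy model of the selection logic only: the class and server names are hypothetical, and in practice this job is done by a dedicated load balancer or DNS-based scheme with real health checks.

```python
class RoundRobinPool:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._position = 0

    def mark_down(self, server):
        """Record that a health check failed for this server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a known server to rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._position]
            self._position = (self._position + 1) % len(self.servers)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


pool = RoundRobinPool(["web1", "web2"])  # hypothetical host names
print([pool.next_server() for _ in range(3)])  # requests alternate

pool.mark_down("web1")                          # simulate a failure
print([pool.next_server() for _ in range(2)])  # web2 takes every request
```

While both servers are healthy the load is shared, and when one fails the other simply absorbs all of the traffic, so the second server needs enough capacity to carry the full load on its own.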
Now would be a good time to estimate your infrastructure growth plan and break it up into stages by cost, complexity or gain - however you see fit. Preparing for your growth shouldn't be a last minute, "we forgot to plan" kind of affair. Realizing that you don't have enough power for the new project you're suddenly faced with is the wrong time to make that discovery. Be prepared and plan ahead. This will make your life, user experience and management much better and happier down the line.
Related Books
Important Links