Inspection Sample Size

Analysis, Statistics, Control Charts, Statistical Methods, Audit, Auditing

Question

  1. The customer expects certain levels of inspection: pull 157 bottles for visual testing, but then they also want 20 pulled for dimensional testing. Can’t the 20 additional bottles be a subset of the original testing sample?
  2. When calculating the lot, do you pull the samples before or after your calculations? Do the samples get included in the produced quantity or not? For example: if the customer orders 10,000 bottles and the level 2 inspection pulls 200 bottles, that drops the total shipped to the customer to 9,800 pieces. If 10,200 bottles are produced, then the inspection level increases so that 315 bottles need to be pulled for testing. What is the correct sample size and production number?

Answer

Here are the responses to your questions:

  1. Yes. Since the first inspection is visual, you can use a subset of that sample for the additional dimensional testing.
  2. The lot size is 10,000. You should put the samples back into the lot if they are not destroyed by the testing. You send what is contracted for. You are sampling with replacement.

Jim Bossert

SVP Process Design Manager, Process Optimization
Bank of America
ASQ Fellow, CQE, CQA, CMQ/OE, CSSBB, CSSMBB
Fort Worth, TX

For more on this topic, please visit ASQ’s website.

Statistical Methods and Control Charts

Analysis, Statistics, Control Charts, Statistical Methods

Question:
My question is regarding a threading process. There is 100% inspection for a go/no-go check and about 5% rejection/rework. The batch size is 5,000 pieces and is completed in 3 days of production. Two such batches are produced in a month.

What type of control chart should be used to monitor the process? How should the process capability be calculated in this case?

Answer:
The type of control chart depends first on what type of data you are measuring. If you are doing go/no-go inspection, then you are limited to a “P” chart or a “C” chart. A “P” chart looks at the percent good (or bad). A “C” chart looks at the number of defects found.

If you are measuring thickness or strength (something that can be measured), then you can use an X-bar/R chart or an X-bar/S chart, depending on how many samples are taken.

That is the simple answer; part of this depends on how you are taking samples and how often.  If samples are taken at the start and the finish, then I would probably recommend the “P” chart.

If you can measure throughout the manufacturing process, and you look at the type of defects, then I recommend a “C” chart.

Ideally, if you can get measurement data, you are better off with the X-bar/R or the X-bar/S charts. These tend to be better predictors, and capability is easier to calculate.
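
As a rough illustration of the X-bar/R arithmetic, here is a short sketch using made-up thickness readings; the constants A2, D3, and D4 for subgroups of size 5 are the standard published control-chart constants.

```python
# Sketch: X-bar/R control limits for subgroups of size 5.
# The subgroup data below are invented thickness readings for illustration.

A2, D3, D4 = 0.577, 0.0, 2.114  # published constants for subgroup size n = 5

subgroups = [
    [10.1, 10.3, 9.9, 10.0, 10.2],
    [10.0, 9.8, 10.1, 10.2, 10.0],
    [9.9, 10.0, 10.2, 10.1, 9.8],
]

xbars = [sum(s) / len(s) for s in subgroups]   # subgroup means
ranges = [max(s) - min(s) for s in subgroups]  # subgroup ranges

xbarbar = sum(xbars) / len(xbars)  # grand mean (center line of X-bar chart)
rbar = sum(ranges) / len(ranges)   # average range (center line of R chart)

# Control limits for the X-bar chart and the R chart
ucl_x = xbarbar + A2 * rbar
lcl_x = xbarbar - A2 * rbar
ucl_r = D4 * rbar
lcl_r = D3 * rbar
```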

For capability with the go/no-go data, you can take the fraction defective (or fraction good) and multiply it by 1,000,000 to get your capability estimate in defects per million.
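
As a sketch of that arithmetic, assuming the roughly 5% rejection rate mentioned in the question and a hypothetical subgroup size of 200, the “P” chart limits and the defects-per-million figure could be computed like this:

```python
import math

# Sketch: "P" chart limits and a defects-per-million estimate for go/no-go
# data. The 5% rejection rate comes from the question; the subgroup size
# of 200 is an assumed value for illustration.

p_bar = 0.05  # average fraction rejected
n = 200       # subgroup size (assumption)

# Standard p-chart limits: p_bar +/- 3 * sqrt(p_bar * (1 - p_bar) / n)
sigma_p = math.sqrt(p_bar * (1 - p_bar) / n)
ucl = p_bar + 3 * sigma_p
lcl = max(0.0, p_bar - 3 * sigma_p)

# Fraction defective scaled to defects per million
dpm = p_bar * 1_000_000
```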

Jim Bossert
SVP Process Design Manager, Process Optimization
Bank of America
ASQ Fellow, CQE, CQA, CMQ/OE, CSSBB, CSSMBB
Fort Worth, TX

For more information on this topic, please visit ASQ’s website.

Control Chart to Analyze Customer Satisfaction Data

Control chart, data, analysis

Q: Let’s assume we have a process that is under control and we want to monitor a number of key quality characteristics expressed through small subjective scales, such as: excellent, very good, good, acceptable, poor and awful. This kind of data is typically available from customer satisfaction surveys, peer reviews, or similar sources.

In my situation, I have full historical data available and the process volume average is approximately 200 deliveries per month, giving me enough data and plenty of freedom to design the control chart I want.

What control chart would you recommend?

I don’t want to reduce my small-scale data to pass/fail, since I would lose insight into the underlying data. Ideally, I’d like a chart that both provides control limits for process monitoring and gives insight into the distribution of scale ratings (i.e., “poor,” “good,” “excellent”).

A: You can handle this analysis a couple of ways.  The most obvious choice and probably the one that would give you the most information is a Q-chart. This chart is sometimes called a quality score chart.

The Q-chart assigns a weight to each category. Using the criteria presented, values would be:

  • excellent = 6
  • very good = 5
  • good = 4
  • acceptable = 3
  • poor = 2
  • awful = 1

You calculate the subgroup score by multiplying the weight of each category by its count, then adding the totals together.

If 100 surveys were returned with results of 20 excellent, 25 very good, 25 good, 15 acceptable, 12 poor, and 3 awful, the calculation is:

6(20) + 5(25) + 4(25) + 3(15) + 2(12) + 1(3) = 417

This is your score for this subgroup.   If you have more subgroups, you can calculate a grand mean by adding all the subgroup scores and dividing it by the number of subgroups.

If you had 10 subgroup scores of 417, 520, 395, 470, 250, 389, 530, 440, 420, and 405, the grand mean is simply:

((417+ 520+ 395+ 470+ 250+ 389+ 530+ 440+ 420+ 405)/10) = 4236/10 =423.6

The control limits would be the grand mean +/- 3√(grand mean). In this example, 423.6 +/- 3√423.6 = 423.6 +/- 3(20.58). The lower limit is 361.86 and the upper limit is 485.34. This gives you a chance to see whether things are stable or not. If there is an out-of-control situation, you need to investigate further to find the cause.
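
The arithmetic above can be checked with a short script; the weights, counts, and subgroup scores are the ones from this example:

```python
import math

# Sketch of the Q-chart arithmetic: weighted subgroup score, grand mean,
# and grand mean +/- 3*sqrt(grand mean) control limits.

weights = {"excellent": 6, "very good": 5, "good": 4,
           "acceptable": 3, "poor": 2, "awful": 1}
counts = {"excellent": 20, "very good": 25, "good": 25,
          "acceptable": 15, "poor": 12, "awful": 3}

# Subgroup score: sum of weight * count over the categories
score = sum(weights[k] * counts[k] for k in weights)

# Grand mean over the ten subgroup scores from the example
subgroup_scores = [417, 520, 395, 470, 250, 389, 530, 440, 420, 405]
grand_mean = sum(subgroup_scores) / len(subgroup_scores)

# Control limits
ucl = grand_mean + 3 * math.sqrt(grand_mean)
lcl = grand_mean - 3 * math.sqrt(grand_mean)
```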

The other choice is similar, but the weights have to total to 1. Using the criteria presented, the values would be:

  • excellent = .3
  • very good = .28
  • good = .25
  • acceptable = .1
  • poor = .05
  • awful = .02

You would calculate the numbers the same way for each subgroup:

.3(20) + .28(25) + .25(25) + .1(15) + .05(12) + .02(3) = 6 + 7 + 6.25 + 1.5 + .6 + .06 = 21.41

If you had 10 subgroup scores of 21.41, 19.3, 20.22, 25.7, 21.3, 17.2, 23.3, 22, 19.23, and 22.45, the grand mean is simply ((21.41 + 19.3 + 20.22 + 25.7 + 21.3 + 17.2 + 23.3 + 22 + 19.23 + 22.45)/10) = 212.11/10 = 21.211.

The control limits would be the grand mean +/- 3√(grand mean). Therefore, the limits would be 21.211 +/- 3√21.211 = 21.211 +/- 3(4.606). The lower limit is 7.39 and the upper limit is 35.03.

The method is up to you.  The weights I used were simply arbitrary for this example. You would have to create your own weights for this analysis to be meaningful in your situation.  In the first example, I have it somewhat equally weighted. In the second example, it is biased to the high side.

I hope this helps.

Jim Bossert
SVP Process Design Manager, Process Optimization
Bank of America
ASQ Fellow, CQE, CQA, CMQ/OE, CSSBB, CSSMBB
Fort Worth, TX

For more on this topic, please visit ASQ’s website.

AQL for Electricity Meter Testing

Chart, graph, sampling, plan, calculation, z1.4

Q: We have implemented a program to test electricity meters that are already in use. This would target approximately 28,000 electricity meters that have been in operation for more than 15 years. Under this program, we plan to test a sample of meters and come to a conclusion about the whole batch  —  whether replacement is required or not. As per ANSI/ISO/ASQ 2859-1:1999: Sampling procedures for inspection by attributes — Part 1: Sampling schemes indexed by acceptance quality limit (AQL) for lot-by-lot inspection, we have selected a sample of 315 to be in line with the total number of electricity meters in the batch.

Please advise us on how to select an appropriate acceptance quality limit (AQL) value to accurately reflect the requirements of our survey and come to a decision on whether the whole batch should be rejected and replaced. Thank you.

A: One of the least liked phrases uttered by statisticians is “it depends.” Unfortunately, in response to your question, the selection of the AQL depends on a number of factors and considerations.

If one didn’t have to sample from a population to make a decision, meaning we could perform 100% inspection accurately and economically, we wouldn’t need to set an AQL. Likewise, if we were not able to test any units from the population at all, we wouldn’t need the AQL. It’s the sampling, and the uncertainty associated with it, that requires some thought in setting an AQL value.

As you may notice, the lower the AQL, the more samples are required. Think of it as reflecting the size of a needle. A very large needle (say, the size of a telephone pole) is very easy to find in a haystack. An ordinary needle is proverbially impossible to find. If you desire to determine whether all the units are faulty (100% would fail the testing if the hypothesis is true), that would be a large needle, and only one sample would be necessary. If, on the other hand, you wanted to find whether only one unit of the entire population is faulty, that would be a relatively small needle, and 100% sampling may be required, as the testing has the possibility of finding all good units except for the very last unit tested in the population.

AQL is not the needle or, in your case, the proportion of faulty fielded units. It is the acceptance quality limit, which is related to the proportion of bad units. The AQL is tied to the probability that a random sample, drawn from a population whose unknown actual failure rate equals the AQL (say, 0.5%), yields a sample failure rate of 0.5% or less. We set the probability of acceptance relatively high, often 95%. This means that if the population is actually as good as or better than our AQL, we have a 95% chance of pulling a sample that will result in accepting the batch as good.

The probability of acceptance is built into the sampling plan. Drafting an operating characteristic curve of your sampling plan is helpful in understanding the relationship between AQL, probability of acceptance, and other sampling related values.
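
As a sketch, the OC curve for a single sampling plan with n = 315 (the sample size from the question) can be tabulated from the binomial distribution. The acceptance number c = 7 below is only an assumed value; the actual Ac number comes from the Z1.4/ISO 2859-1 tables for your chosen AQL.

```python
from math import comb

# Sketch of an operating characteristic (OC) curve for a single sampling
# plan: n = 315 (from the question), with an ASSUMED acceptance number
# c = 7 for illustration.

def prob_accept(p, n=315, c=7):
    """Binomial probability of finding c or fewer defectives in n samples."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Tabulate a few points of the OC curve: as the lot fraction defective
# rises, the probability of accepting the lot falls.
for p in (0.005, 0.01, 0.02, 0.04):
    print(f"lot fraction defective {p:.3f} -> P(accept) = {prob_accept(p):.3f}")
```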

Now back to the comment of “it depends.” The AQL is the statement that basically says the population is good enough – an acceptably low failure rate. For an electrical meter, the number of out-of-specification units may be defined by contract or agreement with the utility or regulatory body. As an end customer, I would enjoy a meter that under-reports my electricity use, as I would pay for less than I received. The utility company would not enjoy this situation, as it would be providing its service at a discount. And you can imagine the reverse situation and consequences. Some calculations and assumptions would permit you to determine the cost to the consumers or to the utility for various proportions of units out of specification, either over- or under-reporting. Balance the cost of testing against the cost of meter errors and you can find a reasonable sampling plan.

Besides the regulatory or contract requirements for acceptable percent defective, or the balance between costs, you should also consider the legal and publicity ramifications. If you accept 0.5% as the AQL, and there are one million end customers, that is 5,000 customers with possibly faulty meters. What is the cost of bad publicity or legal action? While not likely if the total number of faulty units is small, there does exist the possibility of a very expensive consequence.

Another consideration is the measurement error of the testing of the sampled units. If the measurement is not perfect, which is a reasonable assumption in most cases, then the results of the testing may have some finite possibilities to not represent the actual performance of the units. If the testing itself has repeatability and reproducibility issues, then setting a lower AQL may help to provide a margin to guard from this uncertainty. A good test (accurate, repeatable, reproducible, etc.) should have less of an effect on the AQL setting.

In summary, if the decision based on the sample results is important (major expensive recall, safety or loss of account, for example), then use a relatively lower AQL. If the test result is for an information gathering purpose which is not used for any major decisions, then setting a relatively higher AQL is fine.

If my meter is in the population under consideration, I am not sure I want my meter evaluated. There are three outcomes:

  • The meter is fine and in specification, which is to be expected and nothing changes.
  • The meter is overcharging me and is replaced with a new meter and my utility bill is reduced going forward. I may then pursue the return of past overcharging if the amount is worth the effort.
  • The meter is undercharging me, in which case I wouldn’t want the meter changed nor the back charging bill from the utility (which I doubt they would do unless they found evidence of tampering).

As an engineer and good customer, I would want to be sure my meter is accurate, of course.

Fred Schenkelberg
Voting member of U.S. TAG to ISO/TC 56
Voting member of U.S. TAG to ISO/TC 69
Reliability Engineering and Management Consultant
FMS Reliability
http://www.fmsreliability.com

For more on this topic, please visit ASQ’s website.

Variation in Continuous and Discrete Measurements


Q: I would appreciate some advice on how I can fairly assess process variation for metrics derived from “discrete” variables over time.

For example, I am looking at “unit iron/unit air” rates for a foundry cupola melt furnace in which the “unit air” rate is derived from the “continuous” air blast, while the unit iron rate is derived from input weights made at “discrete” points in time every 3 to 5 minutes.

The coefficient of variation (CV) for the air rate is exceedingly small (good) due to its “continuous” nature, but the CV for the iron rate is quite large because of its “discrete” nature, even when I use moving averages for extended periods of time. Hence, that seemingly large variation for the iron rate then carries over when computing the unit iron/unit air rate.

I think the discrete nature of some process variables results in unfairly high assessments of process variation, so I would appreciate some advice on any statistical methods that would more fairly assess process variation for metrics derived from discrete variables.

A: I’m not sure I fully understand the problem, but I do have a few assumptions and possibly a reasonable answer for you. As you know, when making a measurement using a discrete scale (red, blue, green; on/off; or similar), the item being measured is placed into one of the “discrete” buckets. For continuous measurements, we use some theoretically infinite scale to place the unit’s location on that scale. For this latter type of measurement, we are often limited by the accuracy of the equipment as to the level of precision with which the measurement can be made.

In the question, you mention measurements of air from the “continuous” air blast. The air may be moving without interruption (continuously), yet the measurement is probably recorded periodically unless you are using a continuous chart recorder. Even so, matching up the reading with the unit iron readings every 3 to 5 minutes does create individual readings for the air value. The unit iron reading is a weight-based reading (I’m not sure what is meant by derived, but let’s assume the measurement comes from a weight scale of some sort). Weight, like mass or length, is an infinite-scale measurement, limited by the ability of the specific measurement system to differentiate between sufficiently small units.

I think you see where I’m heading with this line of thought. The variability with the unit iron reading may simply reflect the ability of the measurement process. I do not think either air rate or unit iron (weight based) is a discrete measurement, per se. Improve the ability to measure the unit iron and that may reduce some measurement error and subsequent variation. Or, it may confirm that the unit iron is variable to an unacceptable amount.

Another assumption I could make is that the unit iron is measured for the batch that then has unit air rates regularly measured. The issue here may just be the time scales involved. Not being familiar with the particular process involved, I’ll assume some manner of metal forming, where a batch of metal is created then formed over time where the unit air is important. And, furthermore, assume the batch of metal takes an hour for the processing. That means we would have about a dozen or so readings of unit air for the one reading of unit iron.

If you recall, the standard error of the mean is the standard deviation divided by the square root of n (the number of samples). In this case, there is about a 10-to-1 difference in n (10 readings for unit air to one for unit iron). Over many batches of metal, the ratio of readings remains at or about 10 to 1, thus impacting the relative stability of the two coefficients of variation. Get more readings for unit iron, or reduce the unit air readings, and it may just even out. Or, again, you may discover the unit iron readings and underlying process are just more variable.
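
A small simulation illustrates the point. The numbers below are invented, not from the question; they simply show that a metric built from roughly ten times as many readings per batch shows a CV smaller by about a factor of √10, all else being equal.

```python
import random
import statistics

# Sketch: the standard error of a mean scales as 1/sqrt(n), so a quantity
# averaged from ~10x more readings per batch shows a smaller coefficient
# of variation. All data here are simulated for illustration.

random.seed(42)

def cv_of_batch_means(readings_per_batch, batches=500):
    """CV of per-batch means when each batch averages n readings."""
    means = []
    for _ in range(batches):
        sample = [random.gauss(100, 10) for _ in range(readings_per_batch)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means) / statistics.fmean(means)

cv_1 = cv_of_batch_means(1)    # one reading per batch (like unit iron)
cv_10 = cv_of_batch_means(10)  # ten readings per batch (like unit air)
# cv_10 should come out roughly cv_1 / sqrt(10)
```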

From the information provided, I think this provides two areas to conduct further exploration. Good luck.

Fred Schenkelberg
Voting member of U.S. TAG to ISO/TC 56
Voting member of U.S. TAG to ISO/TC 69
Reliability Engineering and Management Consultant
FMS Reliability
http://www.fmsreliability.com

For more on this topic, please visit ASQ’s website.

Could Null Hypothesis State Difference?


Q: Does a null hypothesis always state that there is no difference?  Could there be a null hypothesis that claims there is?

In the U.S. legal system, the null hypothesis is that the accused is assumed innocent until proven guilty.  In another legal system, there might exist the possibility that the accused is assumed guilty until proven innocent.  In our system, a type 1 error would be to find an innocent man guilty.  What would be considered a type 1 error if the null hypothesis was assumed guilt?

A: Sir Ronald Fisher developed this basic principle more than 90 years ago.  As you have correctly stated above, the process is assumed innocent until proven guilty. You must have evidence beyond reasonable doubt. An alpha error (type 1) is calling an innocent person guilty. Failure to prove guilt when a person really did commit a crime is a Beta error (type 2).

What can the null hypothesis tell us? Does the confidence interval include zero (or innocence, in the court example)? Instead of asking, “can you assume guilt and prove innocence?” turn the question around and ask, “does the confidence interval include some value that is guilty?”

For example, let’s say a process has an unknown mean and standard deviation, but it has customer specifications of 8 to 12 millimeters. Your sample measures 14 millimeters. Clearly, your sample is guilty by customer specifications. We need to prove beyond reasonable doubt that the confidence interval of the process, at some risk level (alpha), does not include guilty material. This is done by first establishing that the process is in control. If it is in control and not meeting customer specifications, either move the distribution, reduce the variation (through design of experiments or other methods), or do some combination of both.

If the new confidence interval does not include guilt, the argument would be that you have proven, beyond reasonable doubt, that the confidence interval does not include the out-of-spec material. Under this circumstance, a type 1 error (alpha error) would be a process mean less than the upper specification while the confidence interval still included the specification.
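
As a sketch of that check, using made-up measurements and a normal-approximation confidence interval (a t interval would be more exact for a sample this small):

```python
import math
import statistics

# Sketch of the "does the confidence interval include guilt?" check:
# compare a confidence interval for the process mean against the
# 8-12 mm customer specification. The measurements are invented for
# illustration; a real analysis would first confirm the process is
# in control.

spec_low, spec_high = 8.0, 12.0
measurements = [9.8, 10.2, 10.0, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0, 10.2]

n = len(measurements)
mean = statistics.fmean(measurements)
se = statistics.stdev(measurements) / math.sqrt(n)  # standard error of the mean

z = 1.96  # ~95% two-sided; choose z (or t) to match your alpha risk
ci_low, ci_high = mean - z * se, mean + z * se

# "Innocent beyond reasonable doubt" if the interval sits inside the spec
inside_spec = spec_low < ci_low and ci_high < spec_high
```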

Bill Hooper
ASQ Certified Six Sigma Master Black Belt
President, William Hooper Consulting Inc.
Williamhooperconsulting.com
Naperville, IL

For more on this topic, please visit ASQ’s website.