Panel data in accounting and finance: theory and application

The use of models that involve longitudinal data in accounting and finance is common. However, there is often a lack of proper care regarding the criteria for adopting one model over another, as well as an insufficiently detailed discussion of the possible estimators to be studied in each situation. This article presents, in conceptual and applied form, the main panel data estimators that can be used in these areas of knowledge and discusses the definition of the most consistent model to be adopted as a function of the data characteristics. The models covered for short panels are POLS with clustered robust standard errors, the between estimator, fixed effects, fixed effects with clustered robust standard errors, random effects, and random effects with clustered robust standard errors. In turn, for long panels, the models discussed are fixed effects, random effects, fixed effects with AR(1) error terms, random effects with AR(1) error terms, POLS with AR(1) errors, and pooled FGLS with AR(1) errors. The models are also applied to a real case, based on data from Compustat Global. At the end, the main routines for applying each of the models in Stata are presented.


INTRODUCTION
The use of models that involve data from several cross-sections over time (panel data) in accounting and finance is increasingly common and important. As much of the data from companies, cities or countries is released periodically, the researcher is naturally invited to apply longitudinal models to the study of phenomena that are influenced both by differences between individuals and by their own temporal evolution.
According to Marques (2000), the main advantage of panel data models is the control of individual heterogeneity, that is, the possibility of measuring separately the effects generated by the differences between the observations in each cross-section, as well as the possibility of evaluating the evolution over time of the variables under study for a specific individual.
Still according to Marques (2000), panel data also provide a larger quantity of information, greater data variability, less collinearity between the variables, more degrees of freedom and greater efficiency in the estimation. Including the cross-section dimension in a temporal study confers greater variability on the data, since aggregate data result in smoother series than the individual series on which they are based. This increase in data variability contributes to reducing any collinearity between the variables.
However, in this area there is still a lack of care as to the criteria for adopting one model to the detriment of another, as well as the absence of a more detailed discussion of the possible estimators to be studied in each situation. In other words, panel data are at times used in accounting and finance without deeper concern for the choice of the best model, that is, little has been discussed about the adequacy of the technique and the definition of the best estimators. In this sense, the recent works of Pimentel (2009) and Jones, Kalmi and Mäkinen (2010) deserve to be highlighted.
The purpose of this article is to present, in a conceptual and structured manner, the main panel data estimators that can be used in accounting and finance, as well as to help in the definition of the most consistent model to be adopted, as a function of the data characteristics.
This article also aims to apply these models to a real case, based on data from Compustat Global. Finally, we present the main routines for applying each of the models in Stata, since we believe that such procedures may provide a better connection between theory and practice and facilitate the implementation of the models in future research.
Thus, the present study does not intend to suggest the application of panel data in a given situation, as this depends fundamentally on the research question and the data available to the researcher. The purpose, if this technique is to be used, is to assist in its correct application, with a view to determining the models that are most appropriate to reality and focused on decision-making.
Section 1 provides a conceptual review of the main panel data estimators and distinguishes between short panel models (with more individuals than periods) and long panel models (with more periods than individuals). Section 2 presents an application of the main models and discusses the procedure for defining the best model through a review of the results. Finally, Section 3 presents the final considerations.

PANEL DATA MODELS
There are many different models that can be used for panel data. The basic distinction between them, according to Greene (2007), is the existence of fixed or random effects. The term "fixed effects" gives a wrong idea of the modeling because, in both cases, the effects at the individual level are random. According to Cameron & Trivedi (2009), fixed effects models have the added complication that the regressors are correlated with the individual-level effects and, therefore, consistent estimation of the model parameters requires the elimination or control of the fixed effects. In this manner, a model that takes into account the individual-specific effects for a dependent variable $y_{it}$ specifies that:
$$y_{it} = \beta_{0i} + x'_{it}\beta_1 + \varepsilon_{it} \quad (1)$$
In which $x_{it}$ are the regressors, $\beta_{0i}$ are the individual-specific random effects and $\varepsilon_{it}$ represents the idiosyncratic error.
With the error term being $\mu_{it} = \beta_{0i} + \varepsilon_{it}$ and $x_{it}$ correlated with the time-invariant error component ($\beta_{0i}$), it is assumed that $x_{it}$ is not correlated with the idiosyncratic error $\varepsilon_{it}$. The fixed effects model implies that $E(y_{it} \mid \beta_{0i}, x_{it}) = \beta_{0i} + x'_{it}\beta_1$, assuming that $E(\varepsilon_{it} \mid \beta_{0i}, x_{it}) = 0$, so that $\beta_j = \partial E(y_{it} \mid \beta_{0i}, x_{it}) / \partial x_{j,it}$. The advantage of the fixed effects model is that a consistent estimator of the marginal effect of the $j$th regressor on $E(y_{it} \mid \beta_{0i}, x_{it})$ can be obtained, provided that $x_{j,it}$ varies over time.
In the random effects model, on the other hand, it is assumed that $\beta_{0i}$ is purely random, that is, it is not correlated with the regressors. The estimation is therefore carried out with an FGLS (feasible generalized least squares) estimator. The advantage of the random effects model is that it estimates all coefficients, even those of time-invariant regressors, and, therefore, the marginal effects. In addition, $E(y_{it} \mid x_{it})$ can be estimated. The major drawback is that these estimators are inconsistent if the fixed effects model is more appropriate.
As previously discussed, the dependent variable and the regressors can vary both over time and between individuals. The variation over time for a given individual is known as within variation, while the variation among individuals is called between variation. According to Wooldridge (2010), in the fixed effects model the coefficient of a regressor with low within variation will be imprecisely estimated, and it will not be identified if there is no within variation. Therefore, it is of fundamental importance to distinguish between these variations in order to define the best model for panel data.
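The total variation of the observations of a regressor $x$ around the overall mean $\bar{x} = \frac{1}{\sum_i T_i}\sum_i \sum_t x_{it}$ in the data set can be decomposed into the sum of the within variation over time for each individual around $\bar{x}_i = \frac{1}{T_i}\sum_t x_{it}$ and the variation between individuals (of $\bar{x}_i$ around $\bar{x}$). According to Cameron & Trivedi (2009):
$$s^2_W = \frac{1}{\sum_i T_i - 1}\sum_i \sum_t (x_{it} - \bar{x}_i)^2 \quad \text{(within variance)}$$
$$s^2_B = \frac{1}{N - 1}\sum_i (\bar{x}_i - \bar{x})^2 \quad \text{(between variance)}$$
$$s^2_O = \frac{1}{\sum_i T_i - 1}\sum_i \sum_t (x_{it} - \bar{x})^2 \quad \text{(overall variance)}$$
The notations $N$ and $\sum_i T_i$ correspond, respectively, to the number of individuals and the total number of observations over time. When the application of panel data is presented in this article, the variances of each of the regressors will be reported and discussed.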
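The decomposition described above can be illustrated with a minimal Python sketch. It assumes the usual denominators (total observations minus one for the within and overall variances, number of individuals minus one for the between variance); the function name `variance_decomposition` and the toy data are hypothetical and are not taken from the article or from its Stata routines.

```python
from statistics import mean


def variance_decomposition(panel):
    """Within, between and overall variance of one regressor.

    panel: dict mapping an individual id to its list of observations over
    time (lists may have different lengths, i.e. an unbalanced panel).
    """
    all_obs = [x for series in panel.values() for x in series]
    n_obs = len(all_obs)   # total number of observations, sum of T_i
    n_ind = len(panel)     # number of individuals, N
    grand_mean = mean(all_obs)
    ind_means = {i: mean(series) for i, series in panel.items()}

    overall = sum((x - grand_mean) ** 2 for x in all_obs) / (n_obs - 1)
    between = sum((m - grand_mean) ** 2 for m in ind_means.values()) / (n_ind - 1)
    within = sum(
        (x - ind_means[i]) ** 2 for i, series in panel.items() for x in series
    ) / (n_obs - 1)
    return {"overall": overall, "between": between, "within": within}


# A time-invariant regressor (hypothetical data): all variation is between.
time_invariant = {"A": [2.0, 2.0, 2.0], "B": [5.0, 5.0, 5.0]}
result = variance_decomposition(time_invariant)
```

For a regressor that does not change over time for any individual, the within component returned by this sketch is zero, which is the situation in which a fixed effects coefficient cannot be identified.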
Also according to Cameron & Trivedi (2009), the estimators of the $\beta_1$ parameters of a fixed effects model for equation (1) eliminate the fixed effects $\beta_{0i}$, that is, they apply a within transformation by mean-differencing. In this manner, a within estimation models the data as deviations from the individual means, and the coefficient of a regressor with no variation over time cannot be estimated. Thus, the fixed effects $\beta_{0i}$ in equation (1) can be eliminated by subtracting the corresponding model for the individual means, $\bar{y}_i = \beta_{0i} + \bar{x}'_i\beta_1 + \bar{\varepsilon}_i$, resulting in the within model, or mean-difference model:
$$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)'\beta_1 + (\varepsilon_{it} - \bar{\varepsilon}_i) \quad (2)$$
In which $\bar{x}_i = T_i^{-1}\sum_{t=1}^{T_i} x_{it}$, and the within estimator is the OLS (ordinary least squares) estimator of this model. According to Cameron & Trivedi (2009), since $\beta_{0i}$ was eliminated, the OLS estimator offers consistent estimates of $\beta_1$ even if $\beta_{0i}$ is correlated with $x_{it}$, as is the case in the fixed effects model.
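As an illustration of the within transformation, the following Python sketch demeans a single regressor and the dependent variable by individual and then applies OLS to the demeaned data. It is a simplified one-regressor case under stated assumptions; the function name `within_estimator` and the data are hypothetical.

```python
from statistics import mean


def within_estimator(y, x):
    """Within (fixed effects) slope for a single regressor.

    y, x: dicts mapping an individual id to a list of observations over time.
    The individual means are subtracted from both variables, which removes any
    individual-specific intercept, and OLS is applied to the deviations.
    """
    num = 0.0
    den = 0.0
    for i in y:
        y_bar = mean(y[i])
        x_bar = mean(x[i])
        for y_it, x_it in zip(y[i], x[i]):
            num += (x_it - x_bar) * (y_it - y_bar)
            den += (x_it - x_bar) ** 2
    if den == 0:
        # No within variation: the coefficient is not identified.
        raise ValueError("regressor has no within variation")
    return num / den


# Hypothetical data: y_it = a_i + 2 * x_it, with a_A = 0 and a_B = 100,
# so the individual effect is correlated with the level of x.
x = {"A": [1.0, 2.0, 3.0], "B": [10.0, 11.0, 12.0]}
y = {"A": [2.0, 4.0, 6.0], "B": [120.0, 122.0, 124.0]}
slope = within_estimator(y, x)
```

In this toy example the individual effects are strongly correlated with the regressor, yet the demeaned regression recovers the slope used to generate the data, because the effects drop out of the deviations from the individual means.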
The between estimator uses only the variation between individuals (cross-sections) and is the OLS estimator of a regression of $\bar{y}_i$ on $\bar{x}_i$, presented in equation (3):
$$\bar{y}_i = \beta_0 + \bar{x}'_i\beta_1 + (\beta_{0i} - \beta_0 + \bar{\varepsilon}_i) \quad (3)$$
By taking into account only the cross-section variation in the data, the coefficient of any regressor that is invariant between individuals cannot be identified. The consistency of this estimator requires that the error term $(\beta_{0i} - \beta_0 + \bar{\varepsilon}_i)$ not be correlated with $x_{it}$, which holds when $\beta_{0i}$ is a random effect, but not when it is a fixed effect.
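The between estimator can be sketched in Python for the single-regressor case as an OLS regression of the individual means of the dependent variable on the individual means of the regressor. The function name `between_estimator` and the data are hypothetical and serve only as an illustration.

```python
from statistics import mean


def between_estimator(y, x):
    """Between estimator for one regressor: OLS on individual means.

    y, x: dicts mapping an individual id to a list of observations over time.
    Returns (intercept, slope) of the regression of y-bar_i on x-bar_i.
    """
    ids = list(y)
    y_means = [mean(y[i]) for i in ids]
    x_means = [mean(x[i]) for i in ids]
    y_bar = mean(y_means)
    x_bar = mean(x_means)
    den = sum((xm - x_bar) ** 2 for xm in x_means)
    if den == 0:
        # No between variation: the coefficient is not identified.
        raise ValueError("regressor has no between variation")
    slope = sum(
        (xm - x_bar) * (ym - y_bar) for xm, ym in zip(x_means, y_means)
    ) / den
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data whose individual means satisfy y-bar_i = 1 + 0.5 * x-bar_i.
x = {"A": [1.0, 3.0], "B": [4.0, 6.0], "C": [7.0, 9.0]}
y = {"A": [1.0, 3.0], "B": [3.0, 4.0], "C": [5.0, 5.0]}
intercept, slope = between_estimator(y, x)
```

Only the individual means enter the calculation, so any information contained in the movement of the variables over time for a given individual is discarded.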
According to Hsiao (2003), this estimator is rarely used because the random effects estimators end up being more efficient.
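The random effects estimator, on the other hand, is the FGLS estimator of equation (1).
So the random effects model is the individual effects model:
$$y_{it} = x'_{it}\beta_1 + (\beta_{0i} + \varepsilon_{it}) \quad (4)$$
With $\beta_{0i} \sim (\beta_0, \sigma^2_\beta)$ and $\varepsilon_{it} \sim (0, \sigma^2_\varepsilon)$. In this manner, the error term $\mu_{it} = \beta_{0i} + \varepsilon_{it}$ is correlated over time $t$ for a given observation $i$, with correlation:
$$\mathrm{corr}(\mu_{it}, \mu_{is}) = \frac{\sigma^2_\beta}{\sigma^2_\beta + \sigma^2_\varepsilon}, \quad \text{for all } s \neq t \quad (5)$$
The random effects estimator is the FGLS estimator of $\beta_1$ in equation (4), given the error correlations in equation (5).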
According to Cameron & Trivedi (2009), in models with heteroscedastic and autocorrelated errors, the GLS (generalized least squares) estimator can be calculated as an OLS estimator in a model with uncorrelated homoscedastic errors, obtained from (4) by an appropriate linear transformation. In the case of the random effects model of equation (4), this transformed model is given by equation (6):
$$(y_{it} - \hat{\theta}_i\bar{y}_i) = (1 - \hat{\theta}_i)\beta_0 + (x_{it} - \hat{\theta}_i\bar{x}_i)'\beta_1 + \{(1 - \hat{\theta}_i)(\beta_{0i} - \beta_0) + (\varepsilon_{it} - \hat{\theta}_i\bar{\varepsilon}_i)\} \quad (6)$$
An FGLS estimator is obtained by replacing $\theta_i$ with a consistent estimate $\hat{\theta}_i$, where:
$$\theta_i = 1 - \sqrt{\frac{\sigma^2_\varepsilon}{T_i\sigma^2_\beta + \sigma^2_\varepsilon}} \quad (7)$$
The random effects estimator will be consistent and fully efficient if the random effects model is appropriate, but will be inconsistent if the fixed effects model is appropriate, since the correlation between $x_{it}$ and $\beta_{0i}$ results in a correlation between the regressors and the error term in equation (6). Likewise, also according to Cameron & Trivedi (2009), if there are no fixed effects but the errors are correlated within the panel, then the random effects estimator is consistent but inefficient, and therefore an estimation with clustered robust standard errors should be obtained. The FGLS estimate of a regression coefficient of model (1), assuming random effects, is equal to the same coefficient estimated in a fixed effects model (within estimation) if $\hat{\theta}_i = 1$.
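The role of the weight $\theta_i$ in the random effects transformation can be made concrete with a short Python sketch. It assumes the standard expression $\theta_i = 1 - \sqrt{\sigma^2_\varepsilon / (T_i\sigma^2_\beta + \sigma^2_\varepsilon)}$ and takes the two variance components as given; in practice they would have to be estimated. The function names are hypothetical.

```python
from math import sqrt


def theta(sigma2_eps, sigma2_beta, t_i):
    """Quasi-demeaning weight of the random effects transformation.

    sigma2_eps:  variance of the idiosyncratic error
    sigma2_beta: variance of the individual effect
    t_i:         number of periods observed for individual i
    """
    return 1.0 - sqrt(sigma2_eps / (t_i * sigma2_beta + sigma2_eps))


def quasi_demean(series, theta_i):
    """Subtract theta_i times the individual mean from each observation."""
    bar = sum(series) / len(series)
    return [value - theta_i * bar for value in series]


# With no individual-effect variance the weight is 0 and the data are left
# untouched (pooled OLS); as the weight approaches 1 the transformation
# approaches the within (mean-difference) transformation.
no_effect = theta(1.0, 0.0, 5)
strong_effect = theta(1.0, 3.0, 5)
fully_demeaned = quasi_demean([1.0, 2.0, 3.0], 1.0)
```

The two limiting cases are what connects the estimators discussed in this section: a weight of zero reproduces pooled OLS, while a weight of one reproduces the within estimator.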

Short Panel
If there are no fixed effects but the errors are correlated within the panel, then the random effects estimator will be consistent but inefficient, and therefore an estimation with clustered robust standard errors should be obtained. In this manner, for a short panel, where T < N, an estimation with clustered robust standard errors can be obtained under the assumptions that the errors are independent among individuals and that $N \to \infty$, that is, that $E(\mu_{it}\mu_{js}) = 0$ for $i \neq j$, that $E(\mu_{it}\mu_{is})$ is unrestricted and that $\mu_{it}$ may be heteroscedastic.
According to Cameron and Trivedi (2009), the initial step in implementing a panel data model is the application of a POLS (pooled ordinary least squares) model, which assumes that the regressors are exogenous and that the error term is $\mu_{it}$, instead of the decomposition $\beta_{0i} + \varepsilon_{it}$. Therefore:
$$y_{it} = \beta_0 + x'_{it}\beta_1 + \mu_{it} \quad (8)$$
The parameters of this model are estimated by OLS, but inference requires controlling for the within correlation of the error $\mu_{it}$ for a given individual, which is done using robust standard errors clustered at the individual level.
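To illustrate what clustering at the individual level does, the following Python sketch computes a pooled OLS slope for a single regressor together with a cluster-robust standard error. It is a simplified sandwich formula without any small-sample correction, so its values would generally differ slightly from those reported by statistical packages; the function name and the data are hypothetical.

```python
from math import sqrt


def pooled_ols_cluster_se(y, x):
    """Pooled OLS of y on one regressor (plus intercept) with a
    cluster-robust standard error for the slope.

    y, x: dicts mapping an individual id to a list of observations over time.
    Each individual is one cluster. No finite-sample correction is applied.
    """
    ys = [v for i in y for v in y[i]]
    xs = [v for i in y for v in x[i]]
    y_bar = sum(ys) / len(ys)
    x_bar = sum(xs) / len(xs)
    sxx = sum((v - x_bar) ** 2 for v in xs)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    meat = 0.0
    for i in y:
        # Sum (demeaned regressor * residual) within the cluster first, so
        # that any correlation of the errors over time for the same
        # individual is carried into the variance estimate.
        score = sum(
            (a - x_bar) * (b - intercept - slope * a)
            for a, b in zip(x[i], y[i])
        )
        meat += score ** 2
    return slope, sqrt(meat) / sxx


# Hypothetical data with one observation per individual: in this special
# case the formula reduces to a heteroscedasticity-robust standard error.
x = {"a": [0.0], "b": [1.0], "c": [2.0]}
y = {"a": [0.0], "b": [2.0], "c": [1.0]}
slope, se = pooled_ols_cluster_se(y, x)
```

The design choice that matters is the order of operations: the products of regressor and residual are added up within each individual before being squared, rather than squared observation by observation.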

Long Panel
For long panel data, that is, with many periods for a relatively smaller number of individuals, the individual effects $\beta_{0i}$ can be incorporated into $x_{it}$ as dummy variables for each individual, according to the following model:
$$y_{it} = \beta_{0i} + \gamma_t + x'_{it}\beta_1 + \varepsilon_{it} \quad (9)$$
so that there are many time effects $\gamma_t$ (monthly, quarterly or yearly effects, for example).
A pooled model, for T > N, in which the regressors $x_{it}$ include the intercept, the time effects and, possibly, a vector of individual-specific variables, can be written as:
$$y_{it} = x'_{it}\beta + \mu_{it} \quad (10)$$
Since T is greater than N, it becomes necessary to specify a model that considers the presence of serial correlation in the error. Beck and Katz (1995) allow the use of an AR(1) model for $\mu_{it}$ over time, where $\mu_{it}$ is heteroscedastic (Hoechle, 2007). So:
$$\mu_{it} = \rho_i\,\mu_{i,t-1} + \varepsilon_{it} \quad (11)$$
In which the terms $\varepsilon_{it}$ are serially uncorrelated, but with correlation between individuals different from zero ($\mathrm{corr}(\varepsilon_{it}, \varepsilon_{is}) = \sigma_{ts}$).
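The AR(1) error structure can be illustrated with a small Python sketch that estimates a serial correlation coefficient from residuals and then quasi-differences a series to remove it. As a simplification, the sketch pools all individuals to estimate one common coefficient instead of one per individual, and it drops the first observation of each series; both choices are assumptions made here for brevity and are not necessarily those of the Stata routines used in the article. The function names are hypothetical.

```python
def estimate_rho(residuals):
    """Common AR(1) coefficient estimated from panel residuals.

    residuals: dict mapping an individual id to a list of residuals over time.
    Regresses each residual on its own lag, pooling all individuals.
    """
    num = 0.0
    den = 0.0
    for series in residuals.values():
        for prev, curr in zip(series[:-1], series[1:]):
            num += curr * prev
            den += prev ** 2
    return num / den


def quasi_difference(series, rho):
    """Return z_t - rho * z_{t-1} for t = 2..T (first observation dropped)."""
    return [curr - rho * prev for prev, curr in zip(series[:-1], series[1:])]


# Hypothetical residuals that halve each period, i.e. follow an exact AR(1)
# process with coefficient 0.5 and no innovation.
residuals = {"A": [8.0, 4.0, 2.0, 1.0]}
rho_hat = estimate_rho(residuals)
transformed = quasi_difference(residuals["A"], rho_hat)
```

After quasi-differencing with the estimated coefficient, what remains of the series is the serially uncorrelated part of the error, which is why OLS or FGLS on the transformed variables can be more efficient than on the original ones.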
As an alternative to including a vector of dummy variables for each period, one can finally estimate an individual effects model with AR(1) error terms, which is a better model than one that considers i.i.d. error terms. So:
$$y_{it} = \beta_{0i} + x'_{it}\beta_1 + \mu_{it}, \qquad \mu_{it} = \rho\,\mu_{i,t-1} + \varepsilon_{it} \quad (12)$$
Thus, according to Cameron & Trivedi (2009), this model will potentially generate more efficient parameter estimates. In this case, given the estimate $\hat{\rho}$ in equation (11), the effect of the AR(1) error is eliminated first and the individual effect is then eliminated by mean-differencing. The modeling can therefore treat $\beta_{0i}$ as a fixed or a random effect.
Following the presentation of the panel data models, this work applies ten different types of modeling in order to provide a better understanding of the different estimators and their conditions of use, as well as to present models for studying, from a longitudinal perspective, the behavior of the stock returns of companies listed on stock exchanges in Latin American countries. Table 1 shows these ten types of models. The appendix contains routines for applying each of these models in Stata.
POLS with Clustered Robust Standard Errors: OLS estimation with control for the within correlation of the error $\mu_{it}$ over time.
Model with Between Estimator: The between estimator uses only the cross-section variation and is the OLS estimator of the regression of $\bar{y}_i$ on $\bar{x}_i$. The consistency of this estimator requires that the error term $(\beta_{0i} - \beta_0 + \bar{\varepsilon}_i)$ not be correlated with $x_{it}$.
Fixed Effects: The $\beta_{0i}$ parameters can be correlated with the $x_{it}$ regressors, which allows a limited form of endogeneity.
Fixed Effects with Clustered Robust Standard Errors: The $\beta_{0i}$ terms can be correlated with the $x_{it}$ regressors, which allows a limited form of endogeneity. It is assumed that the errors are independent between individuals and that $\mu_{it}$ is heteroscedastic.
Random Effects: The $\beta_{0i}$ parameters and the idiosyncratic error terms $\varepsilon_{it}$ are independent and identically distributed (i.i.d.). The random effects estimator is the FGLS estimator of $\beta_1$, given $\mathrm{corr}(\mu_{it}, \mu_{is})$.
Random Effects with Clustered Robust Standard Errors: If there are no fixed effects, but the errors present within correlation, the random effects estimator is consistent but inefficient. Therefore, clustered robust standard errors must be obtained.
Fixed Effects with AR(1) Error Terms
Random Effects with AR(1) Error Terms
Pooled with OLS Estimation Method and AR(1) Error Terms: With $\mu_{it} = \rho_i\,\mu_{i,t-1} + \varepsilon_{it}$, wherein the $\varepsilon_{it}$ are serially uncorrelated, but with correlation between individuals equal to $\mathrm{corr}(\varepsilon_{it}, \varepsilon_{is}) = \sigma_{ts} \neq 0$.
Pooled with FGLS Estimation Method and AR(1) Error Terms: Similar to the pooled model with OLS estimation method, but with the FGLS estimator.

AN APPLICATION
After a discussion of the main panel data estimators, we present an application in financial accounting.
Since much accounting and financial data is released monthly, quarterly or annually, studies in these areas commonly use short panel models, since the number of individuals (companies, for example) exceeds the number of periods for which data are disclosed. On the other hand, nothing prevents the researcher from basing his or her study on a sample of companies from a single sector, or from using data with greater disclosure frequency (daily, for example), which could lead to a long panel model. Either way, it is essential that this characteristic of the database be identified before the models are estimated.
In this application, the Compustat Global database is used to verify whether the price-cash flow ratio is more significant than the price-earnings ratio in influencing the monthly stock returns of companies in Latin America over time. According to Kennon (2010), some investors prefer to use cash flow rather than earnings per share to evaluate stock prices, arguing that while the former is not easily manipulated, the same cannot be said of the latter. This application therefore investigates the subject from a longitudinal perspective and with the use of several estimators.
As discussed, 10 different panel data models will be developed, with different considerations on the estimators and the error terms. The general model is given by:
$$\text{return}_{it} = \beta_{0i} + \beta_1\,\text{pcf}_{it} + \beta_2\,\text{pe}_{it} + \varepsilon_{it}$$
In which $\beta_1$ and $\beta_2$ represent the changes in the stock return when the price-cash flow ratio (pcf) or the price-earnings ratio (pe), respectively, changes by one unit, ceteris paribus.
Below, we discuss the modeling results for both a short panel and a long panel.

Data Models for Short Panel
As the sample in this case provides data from 473 companies over 118 months, the panel can be considered short (T < N).
Table 2 presents the variance decomposition for each of the regressors. According to Table 2, note that the variable stock is time invariant and, therefore, presents a within variation equal to zero. On the other hand, the variable referring to time (month) is not invariant among companies, since this is an unbalanced panel, and hence its between variation, even though lower than its within variation, is not equal to zero. Of the remaining variables, only pcf presents a greater variation between individuals (between) than over time (within), but it is still not possible to state that the within estimation will result in a loss of efficiency, since the proportion between the within and between variances of each variable is different and the statistical significances of each of these models are not yet known. Table 2, however, provides a greater basis for the adoption of panel data models and the application of different estimators. The columns "Minimum" and "Maximum" show, respectively, the minimum and maximum values of $x_{it}$ for the "general" line, the "within" line.
BBR, Braz. Bus. Rev. (Engl. ed., Online), Vitória, v. 10, n. 1, Art. 6, p. 127-149, jan.-mar. 2013 www.bbronline.com.br
Note also that the pe variable is not statistically significant (sig. > 0.05) in the models presented when the pcf variable is present. The latter, with the exception of the model with the between estimator, is significant in explaining the behavior of stock returns (sig. < 0.05), which supports the argument of some analysts in favor of using this variable.
As to the pcf variable, the standard errors in the fixed effects and random effects models with clustered robust standard errors are larger than in the respective models without this correction. The coefficients estimated by the POLS and between models have even larger standard errors, although the pcf variable remains statistically significant (sig. < 0.05) in the POLS model.
The Breusch-Pagan LM test, applied after the random effects estimation, rejects the null hypothesis that the POLS model is adequate relative to the random effects model, since $\chi^2 = 70.7$ (sig. $\chi^2$ = 0.000). Next, the Chow F test rejects the null hypothesis that the intercepts and slopes are equal for all companies (POLS), since F = 2.34 (sig. F = 0.000); the POLS parameters therefore differ from those obtained with the fixed effects model. Finally, the Hausman test rejects the null hypothesis that the random effects model provides consistent parameter estimates, since, in this case, $\chi^2 = 17.07$ (sig. $\chi^2$ = 0.000).
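The logic of the Hausman test can be sketched in Python for a model with two slope coefficients, as in the general model with pcf and pe. The sketch assumes the usual form of the statistic, the quadratic form of the difference between the fixed and random effects coefficients weighted by the inverse of the difference of their covariance matrices, and assumes two degrees of freedom, which the text does not state explicitly. All function names and input numbers are hypothetical.

```python
from math import exp


def chi2_sf_2df(h):
    """Upper-tail probability of a chi-squared variable with 2 degrees of
    freedom, which has the closed form exp(-h / 2)."""
    return exp(-h / 2.0)


def hausman_two_coefficients(b_fe, b_re, v_fe, v_re):
    """Hausman statistic for two slope coefficients.

    b_fe, b_re: pairs (b1, b2) of fixed and random effects estimates.
    v_fe, v_re: their 2x2 covariance matrices, as ((v11, v12), (v12, v22)).
    Returns (statistic, p-value).
    """
    d1 = b_fe[0] - b_re[0]
    d2 = b_fe[1] - b_re[1]
    # Elements of the symmetric matrix V_FE - V_RE.
    a = v_fe[0][0] - v_re[0][0]
    b = v_fe[0][1] - v_re[0][1]
    c = v_fe[1][1] - v_re[1][1]
    det = a * c - b * b
    if a <= 0 or det <= 0:
        raise ValueError("V_FE - V_RE is not positive definite")
    # Quadratic form d' inv(V_FE - V_RE) d using the 2x2 inverse.
    h = (c * d1 * d1 - 2.0 * b * d1 * d2 + a * d2 * d2) / det
    return h, chi2_sf_2df(h)


# Hypothetical estimates: only the first coefficient differs between models.
h, p = hausman_two_coefficients(
    (1.0, 0.5), (0.8, 0.5),
    ((0.02, 0.0), (0.0, 0.02)), ((0.01, 0.0), (0.0, 0.01)),
)
```

Under the assumption of two degrees of freedom, `chi2_sf_2df(17.07)` is below 0.001, which would be in line with the significance of 0.000 reported above. A large statistic indicates that the two sets of estimates differ systematically, which is read as evidence against the random effects specification.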
According to Islam (1995), the main use of panel data modeling is its ability to allow for differences between countries, which means that the results are significantly different
Here 9 companies were purposely chosen so that the variable relating to time (month) would be invariant among companies, meaning that the panel would be balanced and that its between variation would be equal to zero. All other variables showed less variation between individuals (between) than over time (within), but it is also not possible to say that the between estimation will result in a loss of efficiency.
In the same way as was done for the short panel, Table 6 shows the results of the models, again considering six different estimators.
But even allowing the error terms to be correlated between companies, there was not, in this case, a reduction in the standard errors of the pooled models with OLS and FGLS estimators compared with those obtained previously by means of the fixed and random effects models with AR(1) error terms.
With respect to the suitability of the models themselves, the set of variables is statistically significant in the cases in which fixed or random effects were considered, with or without AR(1) error terms. As noted for the short panel models, although the $R^2$ statistics are of some importance for prediction purposes, their values are not particularly high in the models under review.
The fixed and random effects models with AR(1) error terms offer an alternative for long panel data and are better models than those that treat the error terms as i.i.d., since they can generate more efficient parameter estimates. In fact, the fixed and random effects models with AR(1) error terms show standard errors on the order of 30% to 50% lower than those obtained by the respective models without AR(1) error terms.
The Hausman test applied to the fixed and random effects models with AR(1) error terms rejects the null hypothesis that the random effects model provides consistent parameter estimates, since, in this case, $\chi^2 = 10.50$ (sig. $\chi^2$ = 0.005).
Finally, it is worth mentioning that, in this case, the results appear to be contrary to those obtained for the short panel, that is, the pcf variable is not statistically significant (sig. > 0.05) in the presence of the pe variable, which has a negative sign. However, as the companies considered in this case come only from Argentina, Brazil and Mexico, a more detailed investigation of the economic reasons behind this phenomenon needs to be performed. As the negative signs of the regressors' parameters are consistent with those presented in Table 3 for these countries, this further emphasizes the importance of correctly applying panel models to study the differences between individuals and over time for a given phenomenon.

Graph 2: Deviations of Monthly Returns in Relation to the Average of Each Company Over Time (Within Variation)
Graph 3: Deviation of Monthly Returns in Relation to the General Average for Every Moment of Time (Between Variation)

Table 1: Panel Data Models to Be Estimated
It is assumed that $x_{it}$ is not correlated with the idiosyncratic error $\varepsilon_{it}$.