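Remember that we said that in doing linear regression (OLS), we pick a best-fitting line to describe our data? Our criterion for 'best-fitting line' was that line which minimized the sum of squared errors. Note that this minimum is generally not zero: in a model with a constant, the errors themselves sum to zero, but the sum of their squares is positive unless every point lies exactly on the line.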
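
To make the OLS criterion concrete, here is a minimal sketch in Python (not Stata) using made-up numbers. The data and the names x, y, a_hat, b_hat and sse are assumptions for illustration only. It computes the line that minimizes the sum of squared errors, then checks that the errors sum to zero while the sum of squared errors stays positive.

```python
# Made-up illustrative data (not from any real study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for the line y = a + b*x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar


def sse(a, b):
    """Sum of squared errors for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Errors (residuals) at the OLS estimates.
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

print("a_hat:", round(a_hat, 4), "b_hat:", round(b_hat, 4))
print("sum of errors:", round(sum(residuals), 10))
print("sum of squared errors:", round(sse(a_hat, b_hat), 4))
# Any other line does worse on the criterion, e.g. a slightly steeper one.
print("SSE with slope + 0.1:", round(sse(a_hat, b_hat + 0.1), 4))
```

Moving the slope or the constant away from the OLS values can only raise the sum of squared errors, which is what 'best-fitting' means here.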
In more complex models, we need a more complex definition of
'best-fitting line'. Stata knows what the definition should be for probit and
logit models, but there is no formula that delivers the answer in one step.
So Stata iterates through a maximization procedure, improving its estimates
at each step until it gets as close as possible to this definition.
In words, you could think of the technique like this: if our assumption about
the distribution of the data is correct, then Stata will pick those values of
bhat which make observing the sample we actually have MOST
likely. This is why the technique is called maximum likelihood.
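
The idea can be sketched in a few lines of Python (not Stata) for a logit model with one regressor. Everything here is an assumption for illustration: the data are made up, the names prob, log_likelihood, gradient and fit_logit are hypothetical, and Newton-Raphson is used as one common way to do the maximization, not as a description of what Stata does internally. Each pass adjusts the estimates and reports the log likelihood, which is the log of the probability of observing the sample given those estimates.

```python
import math

# Made-up illustrative data: one regressor x and a 0/1 outcome y.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def prob(a, b, xi):
    """Logit model: P(y = 1 | x) = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * xi)))


def log_likelihood(a, b):
    """Log of the probability of observing the sample, given (a, b)."""
    return sum(
        yi * math.log(prob(a, b, xi)) + (1 - yi) * math.log(1.0 - prob(a, b, xi))
        for xi, yi in zip(x, y)
    )


def gradient(a, b):
    """First derivatives of the log likelihood with respect to a and b."""
    g_a = sum(yi - prob(a, b, xi) for xi, yi in zip(x, y))
    g_b = sum((yi - prob(a, b, xi)) * xi for xi, yi in zip(x, y))
    return g_a, g_b


def fit_logit(max_iter=25, tol=1e-10):
    """Newton-Raphson iterations starting from a = b = 0."""
    a, b = 0.0, 0.0
    history = [log_likelihood(a, b)]
    for _ in range(max_iter):
        g_a, g_b = gradient(a, b)
        # Entries of the negative Hessian (a symmetric 2x2 matrix).
        w = [prob(a, b, xi) * (1.0 - prob(a, b, xi)) for xi in x]
        h_aa = sum(w)
        h_ab = sum(wi * xi for wi, xi in zip(w, x))
        h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system by hand.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        history.append(log_likelihood(a, b))
        # Stop once the log likelihood no longer improves noticeably.
        if abs(history[-1] - history[-2]) < tol:
            break
    return a, b, history


a_hat, b_hat, history = fit_logit()
for i, ll in enumerate(history):
    print(f"Iteration {i}: log likelihood = {ll:.6f}")
print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}")
```

The printed log likelihood rises toward its maximum and then stops changing. The values of a_hat and b_hat at that point are the maximum likelihood estimates for this toy sample.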